About the Company
We are an AI and data consulting startup transforming how businesses leverage technology through four core service lines:
- Consulting Services: AI strategy, automation, and digital transformation for enterprises.
- SaaS Platform Development: Building an AI-native, user-friendly business application suite similar to Odoo and Zoho.
- Data Lakehouse Solutions: Unified data pipelines for aggregation, cleaning, governance, and advanced analytics.
- Government Contracting: Developing secure, compliant AI solutions for the public sector.
Our tech stack includes Python, TypeScript, React, Next.js, Go, Rust, Azure, Kubernetes, Spark, MLflow, Postgres, graph databases, and vector stores.
We're a small, fast-moving team delivering enterprise-grade solutions with startup agility.
Role Overview
We are seeking a Senior DevOps / Site Reliability Engineer to design, scale, and optimize our cloud infrastructure. You'll directly influence our system reliability, deployment velocity, and security posture.
Roles & Responsibilities
Infrastructure & Cloud Management
- Design and manage Azure Kubernetes Service (AKS) clusters for production workloads.
- Configure Azure networking components: VNets, Application Gateway, NSGs, and load balancing.
- Build and maintain Dockerized microservices and Helm chart deployments.
- Implement Infrastructure as Code (IaC) using Terraform for modular, reusable infrastructure.
CI/CD Pipeline & Deployment Automation
- Build GitHub Actions workflows for automated testing, building, and deployment.
- Implement blue-green and canary deployments for zero-downtime releases.
- Create automated rollback mechanisms and optimize build and deployment pipelines.
Monitoring, Observability & Reliability
- Implement Prometheus, Grafana, and ELK / Datadog for system monitoring.
- Define alerting thresholds and dashboards for uptime and performance metrics.
- Lead incident response, root cause analysis, and post-mortem documentation.
Performance, Scalability & Cost Optimization
- Architect solutions for real-time WebSocket scalability across thousands of users.
- Implement auto-scaling policies (Kubernetes HPA, cluster autoscaler).
- Optimize infrastructure to reduce cloud costs by 20–30%.
Security & Compliance
- Implement secrets management with Azure Key Vault or HashiCorp Vault.
- Enforce container security, network segmentation, and access control.
- Support SOC 2, HIPAA, and CMMC L2 compliance initiatives.
Collaboration & Mentorship
- Work closely with the Solutions Architect and developers to ensure release reliability.
- Mentor junior team members and document DevOps best practices.
- Participate in on-call rotations and improve operational excellence.
Skills & Qualifications
Must-Have Skills
- Azure Kubernetes Service (AKS) cluster management
- Docker, Helm, Terraform (Infrastructure as Code)
- CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD)
- Prometheus / Grafana / ELK Stack
- Azure Networking (NSG, VNet, Load Balancer, App Gateway)
- Secrets Management (Azure Key Vault / HashiCorp Vault)
- SQL / WebSocket scalability knowledge
Good-to-Have Skills
- ArgoCD / Flux (GitOps)
- KEDA (Event-Driven Autoscaling)
- OpenTelemetry (Distributed Tracing)
- Istio / Linkerd (Service Mesh)
- Python or Go scripting
- FinOps and Azure Cost Management
Education
- UG: B.Tech / B.E. in Computer Science / Information Technology
- PG: M.Tech / M.E. / any postgraduate degree (preferred)
Why Join Us
- Impact: You'll have significant influence on the overall architecture and scalability of our products and the solutions we provide to a diverse set of clients.
- Growth: Opportunities to lead teams as our organization expands.
- Learning: Work with the latest in Azure, Kubernetes, and AI infrastructure.
- Culture: Flat hierarchy, collaborative, and outcome-driven team.