Note: Please apply directly through LinkedIn for this position. We kindly request that candidates refrain from contacting company officials via email or messages regarding this role.
About the Company
We are an AI and data consulting startup transforming how businesses use technology through four core service lines:
Consulting Services: AI strategy, automation, and digital transformation for enterprises.
SaaS Platform Development: Building an AI-native, user-friendly business application suite similar to Odoo and Zoho.
Data Lakehouse Solutions: Unified data pipelines for aggregation, cleaning, governance, and advanced analytics.
Government Contracting: Developing secure, compliant AI solutions for the public sector.
Our tech stack includes Python, TypeScript, React, Next.js, Go, Rust, Azure, Kubernetes, Spark, MLflow, Postgres, graph databases, and vector stores.
We're a small, fast-moving team delivering enterprise-grade solutions with startup agility.
Role Overview
We are seeking a Senior DevOps / Site Reliability Engineer to design, scale, and optimize our cloud infrastructure. You'll directly influence our system reliability, deployment velocity, and security posture.
Roles & Responsibilities
Infrastructure & Cloud Management
Design and manage Azure Kubernetes Service (AKS) clusters for production workloads.
Configure Azure networking components: VNets, Application Gateway, network security groups (NSGs), and load balancers.
Build and maintain Dockerized microservices and Helm chart deployments.
Implement Infrastructure as Code (IaC) using Terraform for modular, reusable infrastructure.
CI/CD Pipeline & Deployment Automation
Build GitHub Actions workflows for automated testing, building, and deployment.
Implement blue-green and canary deployments for zero-downtime releases.
Create automated rollback mechanisms and optimize build & deployment pipelines.
Monitoring, Observability & Reliability
Implement system monitoring with Prometheus, Grafana, and ELK or Datadog.
Define alerting thresholds and dashboards for uptime and performance metrics.
Lead incident response, root cause analysis, and post-mortem documentation.
Performance, Scalability & Cost Optimization
Architect real-time WebSocket infrastructure that scales to thousands of users.
Implement auto-scaling policies (Kubernetes HPA, cluster autoscaler).
Optimize infrastructure to reduce cloud costs by 20–30%.
Security & Compliance
Implement secrets management with Azure Key Vault or HashiCorp Vault.
Enforce container security, network segmentation, and access control.
Support SOC 2, HIPAA, and CMMC Level 2 compliance initiatives.
Collaboration & Mentorship
Work closely with the Solutions Architect and developers to ensure release reliability.
Mentor junior team members and document DevOps best practices.
Participate in on-call rotations and drive operational excellence.
Skills & Qualifications
Must-Have Skills
Azure Kubernetes Service (AKS) cluster management
Docker, Helm, Terraform (Infrastructure as Code)
CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD)
Prometheus / Grafana / ELK Stack
Azure Networking (NSG, VNet, Load Balancer, App Gateway)
Secrets Management (Azure Key Vault / HashiCorp Vault)
SQL databases and WebSocket scalability
Good-to-Have Skills
Argo CD / Flux (GitOps)
KEDA (Event-Driven Autoscaling)
OpenTelemetry (Distributed Tracing)
Istio / Linkerd (Service Mesh)
Python or Go scripting
FinOps and Azure Cost Management
Education
UG: B.Tech/B.E. in Computer Science or Information Technology
PG: M.Tech/M.E. or any postgraduate degree (preferred)
Why Join Us
Impact: You'll have significant influence on the overall architecture and scalability of our products and of the solutions we deliver to a diverse set of clients.
Growth: Opportunities to lead teams as the organization expands.
Learning: Work with the latest in Azure, Kubernetes, and AI infrastructure.
Culture: Flat hierarchy, collaborative, and outcome-driven team.
Senior Engineer • Dombivali, Maharashtra, India