Key Responsibilities:
- Lead incident management, monitoring, and alerting processes to ensure timely detection and resolution of production issues.
- Ensure reliability, availability, and performance of systems by defining and maintaining SLIs, SLOs, and SLAs.
- Design and implement fault-tolerant, scalable architectures to minimize downtime and improve resiliency.
- Develop automation and tooling for monitoring, incident remediation, and infrastructure management.
- Participate in a 24x7 on-call rotation to manage production incidents and maintain system uptime.
- Create and maintain SOPs and technical documentation for processes, tools, and incident management protocols.
- Implement and manage Infrastructure as Code (IaC) using tools such as Terraform and Ansible to automate provisioning and deployments.
- Work with cloud platforms, primarily AWS (EC2, S3, VPC, RDS, EKS, ECS, CloudWatch, CloudFormation), to support scalable system operations.
- Integrate and manage CI/CD pipelines using tools like Jenkins to enable seamless deployments.
- Utilize monitoring and alerting tools (Datadog, Site24x7, Grafana, CloudWatch) to proactively identify issues.
- Conduct performance tuning and optimization, addressing bottlenecks and improving efficiency.
- Drive cost optimization strategies while maintaining performance and reliability standards.
- Adhere to security best practices and ensure infrastructure compliance with organizational standards.
- Collaborate with development, product, and security teams to enhance system reliability and service delivery.
- Mentor junior engineers and promote a culture of reliability engineering across the organization.
Qualifications:
- 5–8 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
- Strong hands-on expertise with AWS (experience with GCP or Azure is a plus).
- Proficiency in Infrastructure as Code (IaC) tools such as Terraform and Ansible.
- Experience with monitoring and alerting tools including Datadog, Site24x7, Grafana, and CloudWatch.
- Solid understanding of CI/CD tools such as Jenkins.
- Proven ability in incident management, root cause analysis, and implementing long-term reliability improvements.
- Familiarity with automation scripting (Python, Bash, or Shell scripting preferred).
- Knowledge of security best practices, networking, and cloud cost management.
- Excellent problem-solving, analytical, and collaboration skills.
- AWS certification or equivalent cloud certification is an advantage.
Skills Required:
AWS, RDS, ECS, VPC, Cloud, CI