You'll Make a Difference By :
- SRE L2 Support Role : Focus on maintaining and improving the reliability, availability, and performance of AWS-based infrastructure and applications.
- Incident Management : Handle and resolve L2 incidents related to AWS services (EC2, RDS, S3, Lambda, EKS, etc.), perform root cause analysis, and communicate with customers during outages or SLA breaches.
- Monitoring & Optimization : Proactively monitor infrastructure and application health in AWS, set up and fine-tune AWS monitoring and observability tools (e.g., CloudWatch, CloudTrail), create alarms, dashboards, and reports.
- Troubleshooting AWS Services : Resolve issues related to EC2 instances, Auto Scaling groups, Load Balancers (ELB / ALB / NLB), Amazon ECS, EKS, and container workloads.
- Log Management : Manage and analyze logs using AWS CloudWatch Logs, CloudTrail, and third-party solutions such as ELK Stack, Datadog, and Splunk.
- Disaster Recovery & Backups : Monitor AWS Backup jobs, ensure regular backups for critical infrastructure, validate DR plans, and participate in recovery testing exercises.
- Automation & Scripting : Contribute to automation of repetitive tasks using scripts and support incident recovery processes.
- Documentation & Knowledge Sharing : Create and maintain operational runbooks, SOPs, and knowledge base articles for common AWS issues.
- Collaboration : Work effectively across teams, shift ownership as required, and communicate with stakeholders during incidents.
You'd Describe Yourself As :
- An experienced professional with 6 to 9 years of relevant experience in SRE, DevOps, or Cloud Infrastructure Support, with strong hands-on expertise in AWS services.
- Proficient in monitoring tools such as Prometheus and Datadog, and familiar with cloud platforms (AWS, Azure, GCP).
- Knowledgeable in Linux / Unix operating systems, with basic scripting skills (e.g., Python, GitLab actions).
- Familiar with container orchestration (Kubernetes, Docker, Helm charts), CI / CD pipelines, and GitOps workflows (e.g., ArgoCD for automated deployments).
- Strong in analytical skills for resolving production incidents, with a basic understanding of networking concepts (DNS, Load Balancers, Firewalls).
- Experienced with alerting systems (e.g., PagerDuty) and incident tracking tools (e.g., JIRA, ServiceNow), and able to handle high-pressure environments.
- A proactive problem-solver with a strong sense of urgency and excellent organizational skills to prioritize tasks effectively.
- Able to work as a teammate, collaborating across teams and owning tasks as needed.
Preferred Certifications :
- AWS Certified SysOps Administrator Associate
- AWS Certified Solutions Architect Associate
- AWS Certified DevOps Engineer Professional
Skills Required
SRE, Monitoring Tools, Cloud Infrastructure, DevOps, AWS, Automation