Note : This job role is part of MetLife's Hack4Job India (a hiring hackathon).
Only shortlisted candidates will be invited.
Department : Global Overview
MetLife is seeking an experienced Site Reliability Engineer (SRE) to ensure the availability, scalability, and performance of critical systems and services.
The role involves monitoring, automation, incident management, and collaboration with engineering teams to optimize system reliability.
Responsibilities :
- System Reliability & Performance : Ensure system uptime, troubleshoot issues, and optimize performance.
- Service Design & Automation : Develop automation scripts and tools to streamline operations.
- Monitoring & Alerting : Implement observability solutions using ELK, Grafana, Splunk, and Azure Monitor.
- Incident Response & Management : Lead root cause analysis, post-mortems, and corrective actions.
- Collaboration : Work with engineering teams to align system performance with business goals.
- Documentation & Knowledge Sharing : Maintain accurate system documentation and promote best practices.
Skills :
- Experience : 3+ years as an SRE, supporting hybrid cloud platforms (On-Prem and Azure).
- Programming : Java, Python, Bash, PowerShell.
- Cloud & Containers : Azure services, Docker, Kubernetes, Terraform.
- Monitoring & Logging : ELK stack, Grafana, Splunk, Azure Application Insights.
- Database : Strong hands-on experience with SQL.
- Tools : Azure DevOps, Pipelines, Repos, ServiceNow.
- Soft Skills : Strong analytical, problem-solving, and communication skills.
- Language : Business proficiency in English; Japanese language is a plus.
(ref : hirist.tech)