The Site Reliability Engineer (SRE) will ensure the reliability, availability, and performance of critical applications and infrastructure by providing hands-on support, monitoring, and automation. The role involves incident management, problem resolution, root cause analysis (RCA), and continuous improvement of processes and systems.
Key Responsibilities:
- Provide 24x7 SRE support including incident management, problem management, RCA, and infrastructure maintenance.
- Act as a subject matter expert for portfolio applications, documenting architecture, core functionalities, and standard operating procedures.
- Develop and implement automation solutions for monitoring, alerting, self-healing, and reliability testing.
- Perform production support duties including off-hours and rotational on-call support.
- Monitor technology currency, including server patching, certificate renewal, compliance, and system optimization.
- Collaborate with engineering teams to implement best practices, improve tooling, and enhance operational efficiency.
- Analyze and optimize application and infrastructure performance, focusing on uptime, reliability, and cost-effectiveness.
- Stay current with emerging SRE tools and technologies, and recommend which to adopt.
Skills Required:
SRE, Microsoft Azure, RCA