Responsibilities
Team Leadership and Development :
- Lead, mentor, and develop a team of SRE engineers, fostering a culture of collaboration, continuous learning, and operational excellence.
- Define team goals, key metrics, and performance objectives aligned with organizational priorities.
Operational Reliability :
Ensure the availability, reliability, and performance of Azure-hosted services through proactive monitoring and alerting.Develop and enforce incident response best practices, root cause analysis, and postmortem processes.Establish and manage SLAs, SLOs, and error budgets in partnership with product and engineering teams.Automation and Tooling :
Promote adoption of automation to reduce manual tasks and improve system reliability.Implement Infrastructure as Code (IaC) using tools such as Terraform, ARM templates, or Bicep.Performance and Scalability :
Optimize system performance, conduct capacity planning, and ensure scalability to support business growth.Use Azure Monitor, Application Insights, and Log Analytics for insights into system health and performance.Collaboration and Stakeholder Management :
Partner with development, product, and infrastructure teams to align on technical strategies and priorities.Communicate operational status, risks, and improvement opportunities to executive stakeholders.Risk and Security Management :
Ensure adherence to security best practices, standards, and policies within Azure environments.Identify, assess, and mitigate risks related to cloud infrastructure and applications.Skills Required
Azure Cloud, Site Reliability Engineering, Terraform, ARM Templates, Automation