Key Responsibilities :
- SRE Strategy and Leadership : Develop and implement a comprehensive SRE strategy aligned with the company's goals and objectives. Lead a team of SRE professionals to drive the reliability, performance, and scalability of GRC technology solutions.
- Observability and Monitoring : Establish observability practices to ensure real-time insights into system performance, availability, and customer experience. Implement monitoring tools, metrics, and dashboards to proactively identify and address potential issues.
- Production Support Optimization : Lead all aspects of the end-to-end production support process, including incident management, problem resolution, and service-level agreement (SLA) compliance. Drive continuous improvement initiatives to enhance operational effectiveness and reduce mean time to resolution (MTTR).
- GRC Customer Journeys : Collaborate with cross-functional teams to enhance customer journeys through seamless and reliable technology experiences.
- Reliability Engineering Best Practices : Promote and implement standard methodologies, including error budgeting, chaos engineering, and disaster recovery planning. Cultivate a culture of resilience and reliability within technology.
- Automation and Efficiency : Champion automation initiatives to streamline operational workflows, deployment processes, and incident response tasks. Leverage automation tools and orchestration to improve reliability and reduce manual intervention.
Eligibility Criteria
- 4-8 years of experience.
- Hands-on coding of highly available distributed systems.
- Java, Python, or JavaScript.
- Knowledge of monitoring tools such as Splunk, Dynatrace, or Prometheus.
- Knowledge of cloud-based SRE practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud.
- Familiarity with containerization technologies (e.g., Kubernetes, Docker) and microservices architecture.
- Knowledge of ServiceNow or any other ticketing tool; ITIL experience.
- Demonstrated expertise in driving culture change, DevOps practices, and continuous improvement in SRE and production support functions.
- Deep understanding of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms.
Skills Required
Java, Azure, Python, Google Cloud, AWS