Position Overview
We are seeking a Site Reliability Engineer (SRE) with a strong focus on observability, monitoring, and incident management within the Microsoft Azure ecosystem. The ideal candidate will have hands-on experience with Azure Monitor, Application Insights, and Azure API Management (APIM), and will play a key role in ensuring the reliability, performance, and visibility of our cloud-based systems. Experience with New Relic is a plus.
Key Responsibilities:
1. Observability & Monitoring
- Design and implement observability solutions using Azure Monitor, Log Analytics, and Application Insights.
- Develop and maintain dashboards and visualizations for real-time system health and performance metrics.
- Define and fine-tune alerting rules to proactively detect and respond to system anomalies.
- Continuously improve monitoring practices to reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
- Develop and maintain Azure AD B2C SSO custom policies for the orchestration process.
- Okta knowledge is good to have.
2. Azure API Management (APIM)
- Support the deployment, configuration, and monitoring of APIs using Azure APIM.
- Implement logging, tracing, and analytics for APIs to ensure visibility and performance tracking.
- Collaborate with development teams to ensure APIs are observable, secure, and scalable.
3. Incident Management
- Respond to incidents in a timely and effective manner.
- Conduct root cause analysis and contribute to post-incident reviews and documentation.
- Work with cross-functional teams to implement long-term solutions and prevent recurrence.
4. Collaboration & Documentation
- Collaborate with development, QA, and operations teams to embed observability into the software development lifecycle.
- Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
- Collaborate with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation and uplifts.
- Balance feature development speed and reliability with well-defined service-level objectives.
- Maintain clear and comprehensive documentation for monitoring setups, incident response procedures, and system configurations.
- Share knowledge through internal training, documentation, and peer reviews.
Required Skills & Qualifications:
- 5+ years of experience in a Site Reliability, DevOps, or Cloud Engineer role.
- 3+ years of experience with Azure Monitor, Application Insights, and Log Analytics.
- 2+ years of experience with Azure API Management (APIM) and API lifecycle management.
- Strong knowledge of Kusto Query Language (KQL).
- Strong knowledge of SQL.
- Basic scripting skills (PowerShell, Azure CLI).
- Understanding of distributed systems, microservices, and cloud-native architectures.
- Strong analytical, troubleshooting, and communication skills.
- Experience with at least one of New Relic, Datadog, or Splunk for application performance monitoring and observability.
Commitment to Diversity, Equity, Inclusion, and Belonging
At Zelis, we champion diversity, equity, inclusion, and belonging in all aspects of our operations. We embrace the power of diversity and create an environment where people can bring their authentic and best selves to work. We know that a sense of belonging is key not only to your success at Zelis, but also to your ability to bring your best each day.