Description :
Location : Pan India Except Mumbai
About the Role :
We are looking for a highly experienced Reliability Architect with strong expertise in proactive monitoring, observability, automation, AIOps / MLOps, and large-scale infrastructure management.
The ideal candidate will drive system reliability, performance optimization, and cross-functional collaboration while leading incident response and mentoring support teams.
Key Responsibilities :
Monitoring & Automation :
- Proactively monitor software systems to prevent incidents and reduce manual intervention.
- Automate routine operational tasks to maximize operational efficiency.
Effective Monitoring & Alerting :
- Design intelligent monitoring systems that trigger symptom-based alerts for early issue detection.
- Configure alert thresholds, anomaly detection rules, and escalation workflows.
Application Performance Monitoring (APM) :
- Implement and manage APM tools such as New Relic, Dynatrace, AppDynamics, etc.
- Track application performance, identify bottlenecks, and optimize resource utilization.
Log Analysis & Troubleshooting :
- Leverage Splunk (or similar tools) for log analysis, anomaly detection, and incident debugging.
- Improve system reliability through continuous log insights and root cause analysis.
Dashboards & Reporting :
- Build intuitive dashboards visualizing system health, KPIs, and operational metrics.
- Automate scheduled reports for performance trends, reliability metrics, and risk indicators.
Reliability Metrics & Observability :
- Define and track SLOs, SLIs, error budgets, and other reliability benchmarks.
- Apply full-stack observability practices including logs, metrics, distributed tracing, and event correlation.
AI-Driven Monitoring (AIOps / MLOps) :
- Use AIOps to detect anomalies, automate incident response, and build self-healing workflows.
- Integrate ML models with observability tools for predictive insights and performance optimization.
Cross-Team Collaboration :
- Collaborate with development, DevOps, and support teams to enhance service reliability.
- Strengthen release processes through rigorous testing, reviews, and monitoring integration.
Capacity Planning & Performance :
- Participate in architecture and design reviews.
- Ensure systems are scalable, resilient, and optimized for peak performance.
Debugging, Incident Response & Rollbacks :
- Lead major incident response efforts with structured troubleshooting and RCA.
- Manage controlled rollbacks of faulty deployments and ensure minimal service impact.
Mentoring & Knowledge Sharing :
- Mentor L1 / L2 support teams, establishing best practices for monitoring and observability.
- Promote a culture of reliability engineering and continuous improvement.
Infrastructure & Tooling :
- Manage infrastructure using tools like Chef, Ansible, Terraform, Kubernetes, GitLab CI / CD, etc.
- Support automation, configuration management, and infrastructure-as-code workflows.
Documentation :
- Maintain detailed documentation of processes, architectures, SOPs, and troubleshooting guides.
Proactive Mindset :
- Drive reliability initiatives with ownership, enthusiasm, and a forward-thinking approach.
Desired Skills & Tools :
- AIOps / MLOps platforms
- Splunk, Grafana, Kibana, Prometheus
- New Relic, Dynatrace, AppDynamics
- Terraform, Ansible, Chef
- GitLab CI / CD, Jenkins
- Kubernetes, Docker
- Strong debugging and RCA skills
- Excellent communication and cross-functional collaboration
(ref : hirist.tech)