Talent.com
Site Reliability Engineer

Reyika • Bangalore (division)
2 days ago
Job description

Role : Senior Site Reliability Engineer / Reliability Architect

Locations : Pune, Bangalore, Chennai, Noida

Job Description :

Reliability Architect with over 9 years of experience in proactive monitoring, automation, and observability. Skilled in AIOps / MLOps, infrastructure management, and performance optimization using modern tools and practices. Adept at leading incident response, mentoring support teams, and driving cross-functional collaboration to ensure system reliability and scalability.

Key Responsibilities :

  • Monitoring and Automation: Proactively monitor software systems to prevent incidents and automate routine operational tasks.
  • Effective Monitoring: Design monitoring systems that trigger alerts based on symptoms rather than outages, ensuring early detection and resolution.
  • Application Performance Monitoring (APM): Implement and manage APM tools like New Relic or Dynatrace to track application performance, identify bottlenecks, and optimize resource usage.
  • Log Analysis with Splunk: Use Splunk to analyze logs for troubleshooting, anomaly detection, and improving system reliability.
  • Dashboards Preparation: Build intuitive dashboards to visualize system health, performance metrics, and operational KPIs.
  • Alerts Setup: Configure intelligent alerts based on thresholds and anomalies to ensure timely incident response.
  • Reports Scheduling: Automate regular reporting to provide insights into system performance, reliability, and trends.
  • Reliability Metrics: Define and track metrics such as SLOs, SLIs, and error budgets to measure and maintain system reliability.
  • Observability Skills: Apply observability practices including distributed tracing, logging, and metrics collection to gain deep insights into system behavior.
  • AI-Driven Monitoring & Automation: Utilize AIOps techniques to proactively detect anomalies, automate incident response, and enable self-healing systems through intelligent alerting and predictive analytics.
  • Observability & ML Integration: Integrate machine learning models with observability tools to enhance system insights, optimize performance, and ensure reliability of AI-powered services in production.
  • Cross-Team Collaboration: Work closely with development and support teams to enhance service reliability through rigorous testing and release procedures.
  • Capacity Planning: Participate in system design reviews and capacity planning to ensure scalability and performance.
  • Debugging and Incident Response: Lead incident response efforts, analyze debugging information, and manage rollbacks of faulty software deployments.
  • Mentoring Support Teams: Guide and mentor L1 / L2 support teams to establish best practices in monitoring and observability.
  • Infrastructure Management: Manage infrastructure using tools like Chef, Ansible, Terraform, GitLab CI/CD, and Kubernetes.
  • Documentation: Maintain comprehensive documentation of processes and procedures to ensure operational consistency and reduce redundancy.
  • Proactive Mindset: Approach challenges with enthusiasm, ownership, and a continuous improvement mindset.
