Site Reliability Engineer

Resource Algorithm • India
1 day ago
Job description

Senior SRE (Engineering & Reliability)

Job Summary :

We are seeking an experienced and dynamic Site Reliability Engineering (SRE) Lead to oversee the reliability, scalability, and performance of our critical systems.

As a Senior SRE, you will play a pivotal role in establishing and implementing SRE practices, leading a team of engineers, and driving automation, monitoring, and incident response strategies. This position combines software engineering and systems engineering expertise to build and maintain high-performing, reliable systems.

Experience : 7+ years

Key Responsibilities :

Reliability & Performance :

  • Lead efforts to maintain high availability and reliability of critical services.
  • Define and monitor SLIs, SLOs, and SLAs to ensure business requirements are met.
  • Proactively identify and resolve performance bottlenecks and system inefficiencies.

Incident Management & Response :

  • Establish and improve incident management processes and on-call rotations.
  • Lead incident response and root cause analysis for high-priority outages.
  • Drive post-incident reviews and ensure actionable insights are implemented.

Automation & Tooling :

  • Develop and implement automated solutions to reduce manual operational tasks.
  • Enhance system observability through metrics, logging, and distributed tracing tools (e.g., Prometheus, Grafana, Elastic APM).
  • Optimize CI / CD pipelines for seamless deployments.

Collaboration :

  • Partner with software engineering teams to improve the reliability of applications and infrastructure.
  • Work closely with product / engineering teams to design scalable and robust systems.
  • Ensure seamless integration of monitoring and alerting systems across teams.

Leadership & Team Building :

  • Manage, mentor, and grow a team of SREs.
  • Promote SRE best practices and foster a culture of reliability and performance across the organization.
  • Drive performance reviews, skills development, and career progression for team members.

Capacity Planning & Cost Optimization :

  • Perform capacity planning and implement autoscaling solutions to handle traffic spikes.
  • Optimize infrastructure and cloud costs while maintaining reliability and performance.

Skills & Qualifications :

    Required Skills :

  • Technical Expertise :
    o Experience with cloud platforms (AWS / Azure / GCP) and Kubernetes.
    o Hands-on knowledge of infrastructure-as-code tools like Terraform / Helm / Ansible.
    o Proficiency in Java.
    o Expertise in distributed systems, databases, and load balancing.

  • Monitoring & Observability :
    o Proficient with tools like Prometheus, Grafana, Elastic APM, or New Relic.
    o Understanding of metrics-driven approaches for system monitoring and alerting.

  • Automation & CI / CD :
    o Hands-on experience with CI / CD pipelines (e.g., Jenkins, Azure Pipelines).
    o Skilled in automation frameworks and tools for infrastructure and application deployments.

  • Incident Management :
    o Proven track record in handling incidents, post-mortems, and implementing solutions to prevent recurrence.

    Leadership & Communication Skills :

  • Strong people management and leadership skills with the ability to inspire and motivate teams.
  • Excellent problem-solving and decision-making skills.
  • Clear and concise communication, with the ability to translate technical concepts for non-technical stakeholders.

    Preferred Qualifications :

  • Experience with database optimization, Kafka, or other messaging systems.
  • Knowledge of autoscaling techniques.
  • Previous experience in an SRE, DevOps, or infrastructure engineering leadership role.
  • Understanding of compliance and security best practices in distributed systems.