Lead Site Reliability Engineer - Java

Landmark Group • Bengaluru, Karnataka, India
26 days ago
Job description

Company : Landmark Group

Job Title : SRE Lead (Engineering & Reliability)

Experience : 8-12 years

Job Summary :

We are seeking an experienced and dynamic Site Reliability Engineering (SRE) Lead to
oversee the reliability, scalability, and performance of our critical systems. As an SRE Lead,
you will play a pivotal role in establishing and implementing SRE practices, leading a team
of engineers, and driving automation, monitoring, and incident response strategies. This
position combines software engineering and systems engineering expertise to build and
maintain high-performing, reliable systems.

Key Responsibilities :

Reliability & Performance :

  • Lead efforts to maintain high availability and reliability of critical services.
  • Define and monitor SLIs, SLOs, and SLAs to ensure business requirements are met.
  • Proactively identify and resolve performance bottlenecks and system inefficiencies.

Incident Management & Response :

  • Establish and improve incident management processes and on-call rotations.
  • Lead incident response and root cause analysis for high-priority outages.
  • Drive post-incident reviews and ensure actionable insights are implemented.

Automation & Tooling :

  • Develop and implement automated solutions to reduce manual operational tasks.
  • Enhance system observability through metrics, logging, and distributed tracing tools (e.g., Prometheus, Grafana, Elastic APM).
  • Optimize CI / CD pipelines for seamless deployments.

Collaboration :

  • Partner with software engineering teams to improve the reliability of applications and infrastructure.
  • Work closely with product / engineering teams to design scalable and robust systems.
  • Ensure seamless integration of monitoring and alerting systems across teams.

Leadership & Team Building :

  • Manage, mentor, and grow a team of SREs.
  • Promote SRE best practices and foster a culture of reliability and performance across the organization.
  • Drive performance reviews, skills development, and career progression for team members.

Capacity Planning & Cost Optimization :

  • Perform capacity planning and implement autoscaling solutions to handle traffic spikes.
  • Optimize infrastructure and cloud costs while maintaining reliability and performance.

Skills & Qualifications :

  • Technical Expertise :
    o Experience with cloud platforms (AWS / Azure / GCP) and Kubernetes.
    o Hands-on knowledge of infrastructure-as-code tools like Terraform / Helm / Ansible.
    o Proficiency in Java.
    o Expertise in distributed systems, databases, and load balancing.
  • Monitoring & Observability :
    o Proficient with tools like Prometheus, Grafana, Elastic APM, or New Relic.
    o Understanding of metrics-driven approaches for system monitoring and alerting.
  • Automation & CI / CD :
    o Hands-on experience with CI / CD pipelines (e.g., Jenkins, Azure Pipelines).
    o Skilled in automation frameworks and tools for infrastructure and application deployments.
  • Incident Management :
    o Proven track record in handling incidents, post-mortems, and implementing solutions to prevent recurrence.

Leadership & Communication Skills :

  • Strong people management and leadership skills with the ability to inspire and motivate teams.
  • Excellent problem-solving and decision-making skills.
  • Clear and concise communication, with the ability to translate technical concepts for non-technical stakeholders.

Preferred Qualifications :

  • Experience with database optimization, Kafka, or other messaging systems.
  • Knowledge of autoscaling techniques.
  • Previous experience in an SRE, DevOps, or infrastructure engineering leadership role.
  • Understanding of compliance and security best practices in distributed systems.

Why Join Us?

  • Be a key driver in building and scaling reliable systems in a fast-paced environment.
  • Work with cutting-edge technologies and influence the evolution of the infrastructure.
  • Lead a high-impact team and foster a culture of reliability and innovation.