Talent.com
Lead Site Reliability Engineer

Confidential • Hyderabad / Secunderabad, Telangana
30+ days ago
Job description
  • Collaborate with development, operations, and product teams to define, review, and implement reliability standards and best practices.
  • Design, implement, and maintain highly available and scalable architectures for our applications and infrastructure.
  • Develop and enhance automated tools and frameworks to optimize system monitoring, deployment, and recovery.
  • Troubleshoot and resolve complex issues throughout the entire software stack, including networking, databases, and distributed systems.
  • Conduct performance analysis and capacity planning to ensure system scalability and resource optimization.
  • Take a proactive approach to continuously improving reliability.
  • Participate in incident response, root cause analysis, and postmortem activities to identify and rectify system failures.
  • Collaborate with cross-functional teams to implement and improve CI/CD pipelines, ensuring reliable and efficient software releases.
  • Stay up to date with emerging technologies and industry trends, actively contributing to ongoing system improvements.
  • Participate in on-call rotation.
Requirements

    • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
    • Proven experience deploying and managing large-scale distributed systems successfully.
    • Understanding of SRE concepts (error budgets, SLIs/SLOs, blameless postmortems).
    • Proficiency in programming languages such as Python, C++, or Go.
    • Familiarity with monitoring and observability tools.
    • Excellent problem-solving skills and ability to troubleshoot complex issues efficiently.
    • Strong organizational and communication skills, with the ability to collaborate effectively in a cross-functional team environment.
Desirable Qualifications

    • Familiarity with security best practices and experience implementing security measures in a production environment.
    • Experience with modern infrastructure technologies and tools, including cloud platforms (AWS, Azure, GCP), containers and orchestration (Docker, Kubernetes), and configuration management (Ansible, Chef, Puppet).
    • Solid understanding of networking protocols and technologies (TCP/IP, DNS, load balancing).
    • Demonstrated experience with infrastructure as code (IaC) and automation tools (e.g., Terraform, GitHub Actions).
Skills Required

System Administration, Load Balancing, Coding, Distributed Systems, DNS, Networking Protocols, Python, Monitoring
