Principal Site Reliability Engineer

Confidential
Hyderabad / Secunderabad, Telangana
30+ days ago
Job description

Roles & Responsibilities :

  • Talent Management & Team Leadership :   Lead, mentor, empower, and manage a hard-working engineering team of 5-10 people to deliver exceptional results.
  • System Reliability, Performance Optimization & Cost Reduction :   Ensure the reliability, scalability, and performance of Amgen's infrastructure, platforms, and applications. Proactively identify and resolve performance bottlenecks, and implement long-term fixes. Continuously evaluate system design and usage to find opportunities for cost optimization, ensuring infrastructure efficiency without compromising reliability.
  • Automation & Infrastructure as Code (IaC) :   Drive the adoption of automation and Infrastructure as Code (IaC) across the organization to streamline operations, minimize manual interventions, and enhance scalability. Implement tools and frameworks (such as Terraform, Ansible, or Kubernetes) that increase efficiency and reduce infrastructure costs through optimized resource utilization.
  • Standardization of Processes & Tools :   Establish standardized operational processes, tools, and frameworks across Amgen's technology stack to ensure consistency, maintainability, and best-in-class reliability practices. Champion the use of industry standards to optimize performance and increase operational efficiency.
  • Monitoring, Incident Management & Continuous Improvement :   Implement and maintain comprehensive monitoring, alerting, and logging systems to detect issues early and ensure rapid incident response. Lead the incident management process to minimize downtime, conduct root cause analysis, and implement preventive measures to avoid recurrence. Foster a culture of continuous improvement by using data from incidents and performance monitoring.
  • Collaboration & Cross-functional Leadership :   Partner with software engineering, DevOps, and IT teams to integrate reliability, performance optimization, and cost-saving strategies throughout the development lifecycle. Act as a domain expert in SRE principles and advocate for standard methodologies across all teams.
  • Capacity Planning & Disaster Recovery :   Develop and implement capacity planning processes to support future growth, performance, and cost management. Maintain disaster recovery strategies to ensure system reliability and minimize downtime in the event of failures.

What we expect of you

We are all different, yet we all use our unique contributions to serve patients.

Basic Qualifications :

  • Master's degree and 8 to 10 years of experience in Computer Science, Engineering, or a related field, OR
  • Bachelor's degree and 10 to 14 years of experience in Computer Science, Engineering, or a related field, OR
  • Diploma and 14 to 18 years of experience in Computer Science, Engineering, or a related field

Preferred Qualifications :

  • Performance Tuning & Cost Optimization :   Expertise in identifying performance bottlenecks in large-scale distributed systems and implementing optimization strategies. Experience with cost management in cloud environments (AWS, Azure) to drive cost-effective infrastructure decisions.
  • Automation Tools & Infrastructure as Code :   Deep expertise with automation tools such as Terraform, Ansible, or Puppet, and hands-on experience with Infrastructure as Code (IaC) to automate infrastructure provisioning and maintenance, enhancing both performance and cost efficiency.
  • Monitoring & Incident Management :   Proficient in deploying and managing production monitoring solutions such as Dynatrace, Datadog, or New Relic to maintain high system performance and ensure rapid incident response. Proven experience with incident management.
  • Standardization & Best Practices :   Strong background in creating and enforcing standardized processes, coding practices, and frameworks to ensure consistency, scalability, and improved system performance, and in evangelizing them through collaboration across teams.

Good-to-Have Skills :

  • Experience with containerization (Docker) and orchestration tools (Kubernetes) to optimize resource usage and improve scalability.
  • Knowledge of cloud-native technologies and strategies for cost optimization in multi-cloud environments.
  • Familiarity with distributed systems, databases, and large-scale system architectures.

Certifications :

  • AWS Certified DevOps Engineer - Professional :   Recognizes sophisticated knowledge of AWS and DevOps standard methodologies to automate and optimize infrastructure and applications in AWS.
  • Certified Kubernetes Administrator (CKA) :   Validates skills required to design, build, and maintain production-grade Kubernetes clusters.

Skills Required :

  AWS, Automation Tools, Performance Tuning, Cost Optimisation
