Talent.com
No longer accepting applications
(Apply in 3 Minutes) Site Reliability Engineer

Talentiser • Hyderabad, Telangana, India
9 days ago
Job description

YOUR IMPACT: Reliability, Automation, and Observability

As a hybrid Site Reliability Engineer / DevOps Engineer, you'll be a key driver in ensuring the stability, performance, and scalability of our mission-critical SaaS platform. You'll apply engineering principles to operational challenges, constantly striving to eliminate toil through automation.

Operational Excellence & Reliability

  • Provide day-to-day management of system alerts, check system health, and escalate issues as necessary to maintain high availability.
  • Actively participate in a 24x7 on-call rotation for critical SaaS platform incidents, and be available in case of emergencies.
  • Lead the incident response process, ensuring fast and effective mitigation and resolution of production issues.
  • Perform thorough Root Cause Analysis (RCA) and lead blameless post-mortems to identify systemic weaknesses and create a corrective action plan to prevent recurrence.
  • Collaborate with engineering teams to set and enforce error budgets (derived from SLOs, or Service Level Objectives), ensuring a healthy balance between development speed and system stability.
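
For readers new to the term, an error budget is the amount of unreliability an SLO permits over a given window. A minimal sketch of that arithmetic follows, assuming a hypothetical 99.9% availability SLO and a 30-day window; neither figure nor the function names come from this posting.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    `slo` is a fraction (e.g. 0.999 for 99.9%); the value is illustrative only.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget
```

Under these assumptions, a 99.9% SLO over 30 days allows roughly 43 minutes of downtime. Teams commonly slow feature releases as the remaining fraction approaches zero, which is the balance between development speed and stability described above.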

Platform Automation & Infrastructure Development

  • Automate routine operational tasks to reduce manual effort and "toil" and increase overall team efficiency.
  • Design, deploy, and maintain cloud infrastructure using Infrastructure as Code (IaC), specifically leveraging Terraform and Helm for deployment to EKS/K8s clusters.
  • Improve existing infrastructure health by developing and implementing checks and scripts to proactively correct known issues and self-heal the platform.
  • Maintain, develop, and evolve our Continuous Integration / Continuous Delivery (CI/CD) deployment code and pipelines.
  • Learn and maintain existing infrastructure running under Docker and Docker Swarm while driving migration strategies toward EKS/K8s.
  • Implement and integrate new technologies and services into our Cloud Infrastructure to enhance platform capabilities and resilience.

Monitoring & Observability

  • Design and implement comprehensive Observability strategies across all three pillars: Metrics, Logs, and Traces.
  • Proactively create and refine robust monitoring and alerting configurations within the EKS/K8s ecosystem.
  • Utilize and maintain our Observability platform, Datadog, to gather performance data, create complex synthetic tests, and visualize system health via dashboards.
  • Leverage existing monitoring solutions such as Grafana and Prometheus while planning and executing the migration or integration of data into a unified platform.
  • Document all issues, remediation steps, system architecture, and runbooks to facilitate knowledge transfer and rapid incident response.
  • Collaborate closely with Support, Customer Success, Migration, and Professional Services teams to provide the highest level of SaaS service and minimize customer impact during changes.
  • Apply a real customer focus when planning deployments and updates, always considering the impact on the end user before making changes.

YOUR EXPERIENCE: Essential Skills and Qualifications

  • Hands-on AWS Cloud Engineer experience, with expert working knowledge of the AWS Cloud ecosystem, including a good understanding of AWS IAM roles and policies.
  • Proficiency with container orchestration technologies: EKS / Kubernetes (K8s).
  • Demonstrable experience with Infrastructure as Code (IaC) tools, specifically Terraform and Helm.
  • Working experience with Docker and maintaining systems using Docker Swarm.
  • Expertise in setting up and managing logging and monitoring solutions. Direct experience with Datadog is highly preferred, with experience in setting up APM, infrastructure monitoring, and custom dashboards.
  • Experience with existing monitoring solutions such as Grafana and Prometheus is required.
  • Proficiency in a Linux environment, with strong Bash and/or Python scripting skills for automation and troubleshooting.
  • A strong understanding of web technologies, including REST APIs, Systems Architecture, Design, and Databases.
  • Experience in Product/Application Support for high-availability SaaS-based products.
  • Experience in designing, implementing, and operating in a DevSecOps environment.
  • Excellent oral and written communication skills, with the ability to clearly explain complex technical issues and RCAs to both technical and customer-facing audiences.
    Assume a critical role in defining the future of a globally recognized firm and have a direct and significant effect in a realm tailored for top achievers in site reliability.As a Lead Site Reliabi...Show moreLast updated: 30+ days ago