Engineer, Site Reliability [T500-20504]

ANSR, Hyderabad, India

5 days ago

Job description

ANSR is hiring for one of its clients.

About T-Mobile :

T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

About TMUS Global Solutions :

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited is a subsidiary of T-Mobile US, Inc. and operates as TMUS Global Solutions.

About the Role :

As a Site Reliability Engineer (SRE), you will be a key member of the CFL Platform Engineering and Operations team, responsible for building and maintaining large-scale distributed systems that are observable, scalable, and resilient. This role sits at the intersection of software engineering and infrastructure operations, ensuring high availability and performance of production systems through automation, monitoring, and proactive engineering. You'll work closely with development, DevOps, and cloud platform teams to improve deployment strategies, incident response, and system health insights. This is a hands-on role for engineers who are passionate about operational excellence, reducing toil, and improving system reliability through code.

What You Will Do :

Ensure high availability and performance of production platforms through monitoring, alerting, and incident management

Design and implement resiliency patterns such as circuit breakers, failovers, retries, and health checks

Develop automation to reduce manual operational work and improve system efficiency

Support CI / CD workflows and infrastructure automation using tools like Terraform and Helm

Collaborate with developers to enhance service deployment and rollback mechanisms

Build and maintain observability tooling including dashboards, logs, and metrics

Analyze performance data and use it to guide optimizations and issue detection

Participate in on-call rotations, incident triage, and post-incident analysis

Write and maintain operational documentation, including runbooks and playbooks

Support development teams in achieving service-level objectives (SLOs) and operational readiness

What You Will Bring :

Bachelor’s degree in Computer Science, Engineering, or a related technical field

2-5 years of experience in SRE, infrastructure, DevOps, or related engineering roles

Proficiency in scripting or programming (Python, Go, or Bash preferred)

Strong experience with Linux systems and cloud environments (Azure preferred; AWS / GCP also relevant)

Hands-on experience with Kubernetes and containerized services

Familiarity with observability tools such as Prometheus, Grafana, Splunk, or OpenTelemetry

Exposure to incident response frameworks, postmortems, and error budgets

Understanding of core SRE concepts : SLOs, SLIs, and service reliability metrics

Experience with CI / CD tools (e.g., GitLab CI / CD, Jenkins, Spinnaker)

Working knowledge of infrastructure tools such as HAProxy, RabbitMQ, or similar

Strong analytical and troubleshooting skills for distributed systems

Clear communication skills and ability to work cross-functionally

A continuous improvement mindset focused on reducing operational toil and enhancing developer experience

Must Have Skills :

Application & Microservice : Java, Spring Boot, API & Service Design

Any CI / CD Tools : GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI

App Platform : Docker & Containers (Kubernetes)

Any Databases : SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)

Any Messaging : Kafka, RabbitMQ

Any Observability / Monitoring : Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus

Incident / Change / Problem Management

Nice To Have : Define SLIs / SLOs
