Site Reliability Engineer

WhiteLotus Talent Partners • Hyderabad, India

No longer accepting applications

1 day ago

Job description

We are looking for L0 and L1 Site Reliability Engineer (SRE) Support engineers to join our Krutrim Cloud Site Reliability operations team and ensure the smooth functioning of our cloud infrastructure, which is powered by OpenStack and Kubernetes. In this role, you will focus on monitoring, basic troubleshooting, and incident response, helping to maintain high system availability, reliability, and performance. You will identify and address simple issues and escalate more complex problems to senior SREs when needed.

The ideal candidate has a basic understanding of cloud infrastructure (especially OpenStack and Kubernetes), containerized environments, and system monitoring. This position offers an excellent opportunity for someone looking to grow into a more advanced SRE or DevOps role.

Key Responsibilities :

For L0 Support (Level 0) :

Incident Monitoring & Triage :

Respond to system alerts and monitor infrastructure health for both OpenStack and Kubernetes using tools like Prometheus, Grafana, and Observability.

Identify low-level issues and follow runbooks or predefined scripts to perform first-level triage.

Document unresolved incidents and escalate them to L1 or L2 based on established escalation protocols.

System Health Checks :

Perform daily health checks for Kubernetes pods, nodes, and OpenStack instances.

Verify basic functionality of VMs, containers, and network services within the environment.

Basic Troubleshooting :

Resolve simple issues such as VM reboots, pod failures, and network connectivity problems within OpenStack or Kubernetes environments.

Follow predefined steps for basic troubleshooting tasks such as restarting services or clearing logs.

Ticket Management :

Log incidents and issues in a ticketing system (e.g., JIRA, ServiceNow) for tracking and escalation.

Update incident tickets and provide relevant information for ongoing resolution efforts.

=========================================================================================================

For L1 Support (Level 1) :

Incident Resolution :

Investigate and resolve issues more complex than those handled at L0, such as Kubernetes pod crashes, network misconfigurations in OpenStack, and minor service disruptions.

Use kubectl to troubleshoot Kubernetes pods and nodes, and the OpenStack CLI to diagnose problems with VMs, storage, and networks.

Automation & Scripting :

Automate routine tasks, such as VM provisioning, pod deployments, or status checks, using basic scripting languages (Python, Bash).

Improve automation workflows based on feedback and frequently encountered issues.
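
As an illustration of the kind of status check this role might automate, here is a minimal Python sketch. It assumes pod data in the JSON shape returned by `kubectl get pods -o json` (an `items` list with `metadata` and `status` fields). The function name `find_unhealthy_pods`, the set of healthy phases, and the restart threshold are hypothetical choices made for this example, not part of any team runbook.

```python
# Phases treated as healthy in this sketch (an assumption for illustration).
HEALTHY_PHASES = {"Running", "Succeeded"}
# Arbitrary restart threshold chosen for the example.
MAX_RESTARTS = 5


def find_unhealthy_pods(pod_list, max_restarts=MAX_RESTARTS):
    """Return (namespace, name, reason) tuples for pods that need attention."""
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        namespace = meta.get("namespace", "default")
        name = meta.get("name", "<unknown>")
        phase = status.get("phase", "Unknown")
        # Flag pods that are not in a healthy phase (e.g. Pending, Failed).
        if phase not in HEALTHY_PHASES:
            problems.append((namespace, name, f"phase={phase}"))
            continue
        # Sum restarts across all containers in the pod.
        restarts = sum(
            c.get("restartCount", 0) for c in status.get("containerStatuses", [])
        )
        if restarts > max_restarts:
            problems.append((namespace, name, f"restarts={restarts}"))
    return problems


# Hand-written sample data mimicking the kubectl JSON shape.
sample = {
    "items": [
        {"metadata": {"namespace": "default", "name": "web-1"},
         "status": {"phase": "Running", "containerStatuses": [{"restartCount": 0}]}},
        {"metadata": {"namespace": "default", "name": "web-2"},
         "status": {"phase": "Pending"}},
        {"metadata": {"namespace": "kube-system", "name": "dns-1"},
         "status": {"phase": "Running", "containerStatuses": [{"restartCount": 12}]}},
    ]
}

for ns, name, reason in find_unhealthy_pods(sample):
    print(f"{ns}/{name}: {reason}")
```

Against a live cluster, the same function could be fed the parsed output of `kubectl get pods --all-namespaces -o json` instead of the sample dictionary.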

Log Aggregation & Monitoring :

Review logs and metrics collected from the ELK Stack, Prometheus, Grafana, or other logging tools to detect trends and potential issues.

Analyze logs and metrics from OpenStack and Kubernetes clusters to pinpoint underlying problems (e.g., high CPU usage, memory leaks).

Basic Network & Storage Management :

Investigate networking issues related to Neutron (for OpenStack) and CNI configurations (for Kubernetes).

Manage storage resources within OpenStack and Kubernetes (e.g., creating persistent volumes, debugging storage access issues).

Collaboration & Escalation :

Work closely with L2 and L3 engineers for complex troubleshooting or advanced system issues that require in-depth knowledge.

Share knowledge with the team and assist in creating new documentation or updating existing troubleshooting guides.

User and Permissions Management :

Perform basic user management tasks within OpenStack (e.g., creating and managing tenants and security groups).

Review and modify Kubernetes RBAC (Role-Based Access Control) settings based on user access needs.

Skills & Qualifications :

Required Skills :

Basic Cloud & Kubernetes Knowledge :

Familiarity with OpenStack architecture (e.g., Nova, Neutron, Cinder).

Basic understanding of Kubernetes components, including pods, services, deployments, and namespaces.

Systems & Networking :

Knowledge of Linux/Unix-based operating systems (e.g., Ubuntu, CentOS, Red Hat).

Understanding of networking concepts like DNS, IP routing, and VLANs in cloud environments.

Monitoring & Alerting Tools :

Familiarity with monitoring tools like Prometheus, Grafana, Zabbix, or CloudWatch for alert management and system health monitoring.

Troubleshooting & Incident Response :

Experience using log aggregation tools (ELK Stack, Splunk) and interpreting logs for incident detection.

Ability to perform basic troubleshooting steps (e.g., restarting services, running basic shell commands) to resolve issues.

Communication Skills :

Strong communication skills to collaborate effectively with senior SREs, developers, and other teams.

Ability to document incidents, solutions, and troubleshooting steps clearly.

Preferred Skills :

Basic Scripting & Automation :

Exposure to scripting languages such as Bash, Python, or Go to automate basic administrative tasks.

Cloud Platform Experience :

Familiarity with other cloud technologies such as AWS, Azure, or Google Cloud Platform.

Certifications :

Certifications such as CompTIA Linux+, AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), or Certified OpenStack Administrator (COA) are a plus.
