L1 / L2 Site Reliability Operations Engineer

WhiteLotus Talent Partners • Bengaluru, Republic Of India, IN
2 days ago
Job description

We are looking for an L0 and L1 Site Reliability Engineer (SRE) Support engineer to join our Krutrim Cloud Site Reliability operations team and ensure the smooth functioning of our cloud infrastructure, which is powered by OpenStack and Kubernetes. In this role, you will focus on monitoring, basic troubleshooting, and incident response, helping to maintain high system availability, reliability, and performance. You will be responsible for identifying and addressing simple issues and for escalating more complex problems to senior SREs when needed.

The ideal candidate should have a basic understanding of cloud infrastructure (especially OpenStack and Kubernetes), containerized environments, and system monitoring. This position offers an excellent opportunity for someone looking to grow into a more advanced SRE or DevOps role.

Key Responsibilities:

For L0 Support (Level 0):

  • Incident Monitoring & Triage:
    • Respond to system alerts and monitor infrastructure health using tools like Prometheus, Grafana, and Observability for both OpenStack and Kubernetes.
    • Identify low-level issues and follow runbooks or predefined scripts to perform first-level triage.
    • Document and escalate unresolved incidents to L1 or L2 based on established escalation protocols.
  • System Health Checks:
    • Perform daily health checks for Kubernetes pods, nodes, and OpenStack instances.
    • Verify basic functionality of VMs, containers, and network services within the environment.
  • Basic Troubleshooting:
    • Resolve simple issues such as VM reboots, pod failures, and network connectivity issues within OpenStack or Kubernetes environments.
    • Follow the predefined steps for basic troubleshooting tasks like restarting services or clearing logs.
  • Ticket Management:
    • Log incidents and issues into a ticketing system (e.g., JIRA, ServiceNow) for tracking and escalation.
    • Update incident tickets and provide relevant information for ongoing resolution efforts.

=========================================================================================================

For L1 Support (Level 1):

  • Incident Resolution:
    • Investigate and resolve issues more complex than those handled at L0, such as Kubernetes pod crashes, network misconfigurations in OpenStack, and minor service disruptions.
    • Work with tools like kubectl to troubleshoot Kubernetes pods and nodes, and the OpenStack CLI to diagnose problems with VMs, storage, and networks.
  • Automation & Scripting:
    • Automate routine tasks, such as VM provisioning, pod deployments, or status checks, using basic scripting languages (Python, Bash).
    • Improve automation workflows based on feedback and frequently encountered issues.
  • Log Aggregation & Monitoring:
    • Review logs and metrics collected from ELK Stack, Prometheus, Grafana, or other logging tools to detect trends and potential issues.
    • Analyze logs and metrics from OpenStack and Kubernetes clusters to pinpoint underlying problems (e.g., high CPU usage, memory leaks).
  • Basic Network & Storage Management:
    • Investigate networking issues related to Neutron (for OpenStack) and CNI configurations (for Kubernetes).
    • Manage storage resources within OpenStack and Kubernetes (e.g., creating persistent volumes, debugging storage access issues).
  • Collaboration & Escalation:
    • Work closely with L2 and L3 engineers on complex troubleshooting or advanced system issues that require in-depth knowledge.
    • Share knowledge with the team and assist in creating new documentation or updating existing troubleshooting guides.
  • User and Permissions Management:
    • Perform basic user management tasks within OpenStack (e.g., creating and managing tenants, security groups).
    • Review and modify Kubernetes RBAC (Role-Based Access Control) settings based on user access needs.
Skills & Qualifications:

Required Skills:

  • Basic Cloud & Kubernetes Knowledge:
    • Familiarity with OpenStack architecture (e.g., Nova, Neutron, Cinder).
    • Basic understanding of Kubernetes components, including pods, services, deployments, and namespaces.
  • Systems & Networking:
    • Knowledge of Linux/Unix-based operating systems (e.g., Ubuntu, CentOS, Red Hat).
    • Understanding of networking concepts like DNS, IP routing, and VLANs in cloud environments.
  • Monitoring & Alerting Tools:
    • Familiarity with monitoring tools like Prometheus, Grafana, Zabbix, or CloudWatch for alert management and system health monitoring.
  • Troubleshooting & Incident Response:
    • Experience in using log aggregation tools (ELK stack, Splunk) and interpreting logs for incident detection.
    • Ability to perform basic troubleshooting steps (e.g., restarting services, running basic shell commands) to resolve issues.
  • Communication Skills:
    • Strong communication skills to collaborate effectively with senior SREs, developers, and other teams.
    • Ability to document incidents, solutions, and troubleshooting steps clearly.

Preferred Skills:

  • Basic Scripting & Automation:
    • Exposure to scripting languages such as Bash, Python, or Go to automate basic administrative tasks.
  • Cloud Platform Experience:
    • Familiarity with other cloud technologies such as AWS, Azure, or Google Cloud Platform.
  • Certifications:
    • Basic certifications such as CompTIA Linux+, AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), or Certified OpenStack Administrator (COA) are a plus.