SITA - Lead Site Reliability Engineer / Expert

SITA INFORMATION NETWORKING COMPUTING INDIA, Delhi, IN

2 days ago

Job description

PURPOSE :

Responsible for proactive support of products to maintain high product performance and drive its continuous improvement.

Responsible for identifying and resolving the root causes of operational incidents, implementing solutions to improve stability and prevent recurrence.

Manages the creation and maintenance of the event catalog to trigger events and develops both manual remediation approaches and automated workflows to resolve alerts.

Oversees the deployment of IT services and solutions, ensuring successful integration with minimal disruption.

Focuses on operational automation and integration to enhance efficiency and collaboration between development and operations within service operations.

KEY RESPONSIBILITIES :

Define, build, and maintain support systems to ensure high availability and performance.
Handle complex cases for the Operations team.
Build events to add to the event catalog for the relevant product or application.
Implement automation for system provisioning, self-healing, auto recovery, deployment, and monitoring.
Perform incident response and root cause analysis for critical system failures.
Monitor system performance and establish service-level indicators (SLIs) and objectives (SLOs).
Collaborate with development and operations to integrate reliability best practices, including moving to zero downtime architecture.
Proactively identify and remediate performance issues.
Work closely with Product, Software & Infra Engineering, and Service Support architects on the productization of new products.
Ensure Operations readiness to support new products.
Coordinate with internal and external stakeholders to gather feedback for continual service improvement of in-scope products, and drive improvement plans through to successful closure.
Be accountable for the high availability and performance of in-scope products.

Problem Management :

Conduct thorough problem investigations and root cause analyses (RCA) to diagnose recurring incidents and service disruptions.

Coordinate with incident management teams and operations experts, and collaborate with different Service Operations and Engineering teams to develop and implement permanent solutions.

Monitor the effectiveness of problem resolution activities, provide regular reports on problem management activities, and ensure continuous improvement.

Event Management :

Define and maintain an event catalog, specifying active events, thresholds, and relevant remediation, and optimize it for efficiency.

Develop event response protocols, provide training to teams, and ensure quick and efficient handling of incidents.

Collaborate with stakeholders to define events, ensure coverage across the Service Operations, and drive improvements based on post-event reviews and feedback.

Deployment Management :

Own the quality of new release deployment for the Service Operations, ensuring a clear process and responsibilities are assigned for smooth implementation.

Develop and maintain deployment schedules, conduct operational readiness assessments, and manage deployment risk assessments to ensure service stability.

Oversee the execution of deployment plans, coordinate resources & process with delivery and lifecycle engineering, communicate with stakeholders, and continuously work with different stakeholders to improve deployment processes based on feedback.

DevOps Management :

Manage continuous integration and deployment (CI/CD) pipelines, ensuring smooth integration between development and operational teams.

Automate operational processes, monitor system performance, and resolve issues related to automation scripts to increase efficiency.

Implement and manage infrastructure as code, provide ongoing support for automation tools, and continuously improve DevOps practices.

EXPERIENCE :

8+ years of experience in IT operations service management, infrastructure management, or application management, including roles such as Site Reliability Engineering lead or DevOps engineer / lead.

Proven experience in managing high-availability systems and ensuring operational reliability.

Extensive experience in root cause analysis (RCA), incident management, and developing permanent solutions for recurring service disruptions.

Extensive expertise in monitoring and observability implementation.

Hands-on experience with CI/CD pipelines, automation, system performance monitoring, and the implementation of infrastructure as code.

Strong background in collaborating with cross-functional teams (development, operations, engineering, etc.) to improve operational processes and service delivery.

Experience in managing deployments and risk assessments, and in optimizing event and problem management processes.

Familiarity with cloud technologies, containerization, and scalable architecture, including experience with zero-downtime deployment strategies.

KNOWLEDGE & SKILLS :

Skills :

Collaboration.

Communication.

Problem Solving.

Incident Management.

Change Management.

Technical Skills :

Cloud Infrastructure (AWS, Azure).

Linux Administration.

Windows Administration.

Monitoring & Observability.

DevOps (CI/CD).

Programming & Scripting Languages.

Application Support.

PROFESSION COMPETENCIES :

Business Acumen.

Consultancy.

Financial Acumen.

Info Organisational Awareness.

Quality Orientation.

CORE COMPETENCIES :

Collaboration.

Communication.

Problem Solving.

Incident Management.

Change Management.

Innovation.

EDUCATION & QUALIFICATIONS :

Background :

Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.

Advanced degree (Master's or equivalent) is often preferred for senior positions.

Qualifications :

Relevant certifications such as Linux Administration, Certified Kubernetes Administrator (CKA).

Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies (e.g., Certified DevOps Professional).

Certification in Windows Administration, Linux Administration.

(ref : hirist.tech)
