Site Reliability Engineer (SRE)

Confidential
Thiruvananthapuram / Trivandrum
30+ days ago
Job description

As a Site Reliability Engineer (SRE), you will be responsible for improving the overall reliability of applications by ensuring their availability, performance, and scalability. You should be able to gather technical requirements from the DevOps team and operational requirements from the Application Support team. Because the SRE role is at the heart of solving production problems, you should take a holistic approach to troubleshooting while delving deeply into technical details. You must also acquire the domain knowledge needed to troubleshoot and recover from outages effectively, monitor applications in production, and build alerts as required.

Working Hours: 05:30 AM to 1:30 PM IST (GMT+5:30)

Responsibilities include:

  • Work closely with the application support team.
  • Monitor critical applications and services to minimize downtime and ensure their availability.
  • Collaborate with DevOps teams to maintain and monitor CI/CD pipelines.
  • Deploy new versions to production environments.
  • Work with project teams to ensure the reliability and maintainability of new and modified releases.
  • Provide input to risk management practices that will anticipate reliability-related incidents that could adversely impact operations.
  • Document processes and monitor application performance metrics.
  • Continuously improve proactive monitoring, alert configuration, and incident response processes to increase reliability and reduce Mean Time to Recovery (MTTR).
  • Optimize performance and cost efficiency through continuous monitoring, trend analysis, and fine-tuning.
  • Monitor for abnormal usage that could affect cost or performance, and take corrective action.
  • Proactively implement preventive measures to improve system reliability.
  • Maintain runbooks, Standard Operating Procedures (SOPs), diagrams, and documentation for swift incident response.
  • Conduct post-incident reviews to improve reliability and contribute to the development of resilience strategies.
  • Meet the Service Level Indicator (SLI) targets set to achieve reliability objectives.

Certifications:

  • Azure Solutions Architect Expert (Microsoft)
  • AWS Certified Solutions Architect (AWS)
  • Open Group Certified Enterprise Architect (TOGAF)
  • PMP or PRINCE2 in Project Management

Primary Skills:

Monitoring and Analysis

  • Continuously monitor CDC dashboards to track service performance and analyze reports.
  • Oversee production and DevOps infrastructure dashboards, ensuring system stability and identifying potential issues.
  • Observe alerts from New Relic and escalate them to the respective teams as needed.
  • Identify duplicated New Relic alerts and optimize alert configurations to reduce noise and improve efficiency.
  • Track daily alerts in production to enhance alert optimization strategies.
  • Maintain and update a list of dashboards monitored, including details such as widgets, metrics, and threshold values.
  • Create and manage dashboards for validating and monitoring CPU optimizations for Rapid and CDC services.
  • Perform sanity checks on Container Memory Utilization, Missing Pods, Container Restarts, Container CPU Utilization, Active Pods, Node Resource Consumption, and Pod Network Status to ensure system health.

Release and Deployment Management

  • Coordinate and execute weekly production releases, ensuring services are deployed with optimized CPU values.
  • Update central repositories with the latest service configurations and CPU requests.
  • Perform post-deployment sanity checks to validate service stability after production releases.
  • Redeploy CDC services with optimized CPU values, ensuring system performance improvements.
  • Monitor new CPU optimizations for Rapid and CDC services, tracking performance improvements and resource utilization.

Incident Management and RCA Documentation

  • Conduct incident analysis, identifying root causes and documenting findings for continuous improvement.
  • Maintain detailed Root Cause Analysis (RCA) documentation to track incidents and resolutions.
  • Provide reports on incident trends, helping improve response times and preventive measures.

Collaboration and Communication

  • Participate in daily sync-ups and internal meetings to discuss ongoing tasks, challenges, and improvements.
  • Sync up with the Network Operations Center (NOC) team to align on monitoring strategies and escalations.
  • Collaborate with the Database (DB) team for performance tuning and issue resolution.
  • Conduct knowledge transfer (KT) sessions on Rapid Resource Optimization and related best practices.

Optimization and Continuous Improvement

  • Track CPU optimization efforts, ensuring proper resource allocation and utilization for Rapid and CDC services.
  • Analyze performance data to refine resource allocation strategies and improve system efficiency.
  • Identify and implement best practices for reducing alert noise and optimizing monitoring configurations.

Secondary Skills: Technical Knowledge

  • Fluency in key AWS services (EBS, S3, compute, storage, RDS, etc.).
  • Expertise in Kubernetes or any Container Orchestration System.
  • Knowledge of Infrastructure as Code (IaC).
  • Linux system administration knowledge.
  • Knowledge of RDBMS and document databases.
  • Knowledge of monitoring tools, including AWS CloudWatch and New Relic.
  • Additional certification in Microsoft, Linux, Cisco, AWS or similar technologies is a plus.

Behavioral competencies

  • Communication
  • Customer Centricity
  • Business & Market Acumen
  • Psychological Safety
  • Empathy
  • Growth Mindset & Learning Agility
  • Ethical and Vigilant
  • Digital Mindset
  • Operational Excellence
  • Teamwork
  • Analytical thinking

Skills Required

PMP, Project Management, Incident Management, Cisco, Application Support, Risk Management
