Site Reliability Engineer (SRE)

Confidential, Thiruvananthapuram / Trivandrum
5 days ago
Job description

As a Site Reliability Engineer (SRE), you will be responsible for improving the overall reliability of applications by ensuring their availability, performance, and scalability. You should be able to gather technical requirements from the DevOps team and operational requirements from the Application Support team. Because the SRE role is at the heart of solving production problems, you should take a holistic approach to troubleshooting while delving deeply into technical details. You must also acquire the domain knowledge needed to troubleshoot and recover from outages effectively, monitor applications in production, and build alerts as required.

Working Hours: 5:30 AM to 1:30 PM IST (GMT+5:30)

Responsibilities include:

  • Work closely with the application support team.
  • Monitor critical applications and services to minimize downtime and ensure their availability.
  • Collaborate with DevOps teams to maintain and monitor CI/CD pipelines.
  • Deploy new versions to production environments.
  • Work with project teams to ensure the reliability and maintainability of new and modified releases.
  • Provide input to risk management practices that will anticipate reliability-related incidents that could adversely impact operations.
  • Document processes and monitor application performance metrics.
  • Continuously improve proactive monitoring, alert configuration, and incident response processes to increase reliability and reduce Mean Time to Recovery (MTTR).
  • Optimize performance and cost efficiency through continuous monitoring, trend analysis, and fine-tuning.
  • Monitor for abnormal usage that can affect cost or performance, and take corrective action.
  • Proactively implement preventive measures to improve system reliability.
  • Maintain runbooks, Standard Operating Procedures (SOPs), diagrams, and documentation for swift incident response.
  • Conduct post-incident reviews to improve reliability and contribute to the development of resilience strategies.
  • Achieve Service Level Indicators (SLIs) that are set to meet reliability objectives.

Certifications:

  • Azure Solutions Architect Expert (Microsoft)
  • AWS Certified Solutions Architect (AWS)
  • Open Group Certified Enterprise Architect (TOGAF)
  • PMP or PRINCE2 in Project Management

Primary Skills:

Monitoring and Analysis:

  • Continuously monitor CDC dashboards to track service performance and analyze reports.
  • Oversee production and DevOps infrastructure dashboards, ensuring system stability and identifying potential issues.
  • Observe alerts from New Relic and escalate them to the respective teams as needed.
  • Identify duplicated New Relic alerts and optimize alert configurations to reduce noise and improve efficiency.
  • Track daily alerts in production to enhance alert optimization strategies.
  • Maintain and update a list of dashboards monitored, including details such as widgets, metrics, and threshold values.
  • Create and manage dashboards for validating and monitoring CPU optimizations for Rapid and CDC services.
  • Perform sanity checks on Container Memory Utilization, Missing Pods, Container Restarts, Container CPU Utilization, Active Pods, Node Resource Consumption, and Pod Network Status to ensure system health.

Release and Deployment Management:

  • Coordinate and execute weekly production releases, ensuring services are deployed with optimized CPU values.
  • Update central repositories with the latest service configurations and CPU requests.
  • Perform post-deployment sanity checks to validate service stability after production releases.
  • Redeploy CDC services with optimized CPU values, ensuring system performance improvements.
  • Monitor new CPU optimizations for Rapid and CDC services, tracking performance improvements and resource utilization.

Incident Management and RCA Documentation:

  • Conduct incident analysis, identifying root causes and documenting findings for continuous improvement.
  • Maintain detailed Root Cause Analysis (RCA) documentation to track incidents and resolutions.
  • Provide reports on incident trends, helping improve response times and preventive measures.

Collaboration and Communication:

  • Participate in daily sync-ups and internal meetings to discuss ongoing tasks, challenges, and improvements.
  • Sync up with the Network Operations Center (NOC) team to align on monitoring strategies and escalations.
  • Collaborate with the Database (DB) team for performance tuning and issue resolution.
  • Conduct knowledge transfer (KT) sessions on Rapid Resource Optimization and related best practices.

Optimization and Continuous Improvement:

  • Track CPU optimization efforts, ensuring proper resource allocation and utilization for Rapid and CDC services.
  • Analyze performance data to refine resource allocation strategies and improve system efficiency.
  • Identify and implement best practices for reducing alert noise and optimizing monitoring configurations.

Secondary Skills: Technical Knowledge

  • Fluent in key AWS services (EBS, S3, AWS Compute, Storage, RDS, etc.).
  • Expertise in Kubernetes or any Container Orchestration System.
  • Knowledge of Infrastructure as Code.
  • Linux system administration knowledge.
  • Knowledge of RDBMS and Document databases.
  • Knowledge of monitoring tools, including AWS CloudWatch and New Relic.
  • Additional certification in Microsoft, Linux, Cisco, AWS, or similar technologies is a plus.

Behavioral Competencies:

  • Communication
  • Customer Centricity
  • Business & Market Acumen
  • Psychological Safety
  • Empathy
  • Growth Mindset & Learning Agility
  • Ethical and Vigilant
  • Digital Mindset
  • Operational Excellence
  • Teamwork
  • Analytical thinking

Skills Required:

PMP, Project Management, Incident Management, Cisco, Application Support, Risk Management
