Role: Site Reliability Engineer (SRE)
Location: Hyderabad
Experience: 10–15 Years
Job Summary
The Site Reliability Engineer (SRE) will play a critical role in ensuring the reliability, scalability, and performance of Citizens Bank's enterprise systems and cloud environments. The ideal candidate brings deep technical expertise in multi-cloud platforms, automation, observability, and incident management, and will drive reliability engineering practices and operational excellence in a complex financial services environment.
Key Responsibilities
- Manage and support cloud-based solutions across AWS, Azure, GCP, and other IaaS / PaaS / SaaS / CDN environments.
- Design, implement, and maintain reliable, scalable, and secure infrastructure, ensuring high availability and performance.
- Collaborate with DevOps and security teams to implement DevSecOps workflows using Git, Jenkins, Docker, and Kubernetes (EKS / AKS).
- Automate infrastructure and configuration management using Terraform, Ansible, and scripting languages such as Python, Bash, or PowerShell.
- Analyze traffic flows, system logs, and application events to troubleshoot issues and identify interdependencies across systems.
- Use monitoring and observability tools such as Datadog, Splunk, and CloudWatch for proactive system health management.
- Implement on-call support processes, develop and maintain runbook documentation, and work toward full automation of repetitive tasks.
- Collaborate with other SREs to build resilient systems and promote Site Reliability Engineering best practices across the enterprise.
- Handle critical application outages, perform root cause analysis, and drive incident resolution and preventive measures.
- Work within an Agile environment, partnering with cross-functional teams to continuously improve performance and reliability.
Technical Skills Required
- Cloud Platforms: AWS, Azure, GCP
- DevOps / DevSecOps Tools: Jenkins, Git, Docker, Kubernetes (EKS, AKS)
- Infrastructure as Code (IaC): Terraform, Ansible
- Monitoring & Logging: Datadog, Splunk, CloudWatch
- Scripting: Python, Bash, PowerShell
- Networking: TCP/IP, DNS, HTTP, Load Balancing, Routing
- OS Environments: Linux, Windows Server
- Familiarity with AMI builds, patching, and rehydration processes
Core Competencies
- Strong analytical and troubleshooting skills
- Proven ability to drive incident response and post-incident reviews
- Excellent communication and stakeholder management skills
- Ability to collaborate in global, distributed teams
- Focus on automation, resilience, and continuous improvement
Skills Required
TCP/IP, Windows Server, DNS, Terraform, Docker, Python, AWS, PowerShell, Routing, Bash, HTTP, Datadog, Jenkins, Git, CloudWatch, GCP, Load Balancing, Linux, Ansible, Splunk, Azure, Kubernetes