We are seeking a seasoned Site Reliability Engineer (SRE) with a solid background in payment systems and high-availability architectures. The ideal candidate will have hands-on experience managing large-scale, distributed systems in production, with a deep understanding of reliability, scalability, and performance tuning in the financial services or payments industry.
Key Responsibilities
- Design, build, and maintain scalable, resilient, and secure infrastructure for high-volume payment platforms.
- Ensure system uptime, reliability, and performance through effective monitoring, alerting, and incident response strategies.
- Collaborate with software engineering and DevOps teams to implement CI/CD pipelines and improve deployment efficiency.
- Automate infrastructure management tasks using Infrastructure-as-Code (IaC) tools (Terraform, Ansible, etc.).
- Proactively identify and mitigate system bottlenecks and potential points of failure.
- Manage disaster recovery strategies, failover planning, and performance testing for critical payment services.
- Work with development teams to ensure services are designed for reliability, scalability, and observability from the ground up.
- Participate in root cause analysis and post-incident reviews to prevent future outages.
Required Skills & Experience
- 8+ years of overall experience in infrastructure engineering or SRE roles, including at least 3 years in the payments/fintech domain.
- Strong understanding of payment protocols (UPI, IMPS, RTGS, NEFT, SWIFT, etc.) and transaction processing systems.
- Proven expertise in Linux systems administration, cloud platforms (AWS, GCP, or Azure), and container orchestration (Kubernetes).
- Solid experience with monitoring and logging tools such as Prometheus, Grafana, ELK Stack, and Splunk.
- Proficiency in one or more scripting languages (Python, Shell, Go, etc.) for automation.
- Experience with incident management, SLAs, and system troubleshooting in high-pressure environments.
- Familiarity with security and compliance practices in the financial sector (e.g., PCI-DSS, ISO 27001).
Preferred Qualifications
- Previous experience supporting mission-critical applications in banking or financial services.
- Exposure to Kafka, Redis, or other real-time streaming and caching technologies.
- Experience with Site Reliability Engineering principles and implementing SLOs/SLIs.
- Understanding of the error budget concept and how it ties into availability and release decisions.
- Experience with a performance testing tool such as k6, JMeter, or LoadRunner.
- Familiarity with mocking tools such as Mockito, WireMock, or Microcks.
Skills Required
Terraform, Ansible, Incident Management