We are seeking a highly skilled Site Reliability Engineer (SRE) / DevOps Engineer with a strong background in cloud infrastructure, automation, and large-scale system operations. In this role, you will partner across engineering teams to enhance platform reliability, accelerate delivery, and ensure a world-class customer experience.
Key Responsibilities
- Drive initiatives that enhance operational efficiency, scalability, and overall platform reliability.
- Lead standardization efforts across services and disciplines in collaboration with embedded SRE teams.
- Identify and implement automation opportunities for deployment, infrastructure management, and observability.
- Apply modern security practices to ensure secure cloud-based infrastructure and software systems.
- Perform full-stack diagnostics to identify and resolve the root causes of issues.
- Analyze system performance and drive improvements to key operational metrics and KPIs.
- Proactively assess infrastructure and applications for enhancements rather than waiting for direction.
- Safeguard application data from unauthorized access, modification, or disclosure.
- Build and maintain high-availability, redundant systems and disaster recovery procedures.
- Develop integrated workflows to support internal support teams and cross-functional partners.
- Own the customer experience: ensure seamless digital interactions and promote user satisfaction.
- Respond to incidents promptly and support troubleshooting efforts across the stack.
Required Skills & Competencies
Cloud & Infrastructure
- 1+ years working with Infrastructure as Code (IaC) and DSC tools: Terraform, CDK, Chef.
- 1+ years deploying and managing containerized workloads with Docker and Kubernetes.
- 1+ years managing AWS infrastructure at scale: EC2, S3, ELB, Lambda, Route 53, ECS, SQS, CloudWatch.
- Prior experience in a DevOps or SRE environment.
Automation & Scripting
- Strong automation background with scripting languages including PowerShell, Ruby, Go, Python, Bash.
Monitoring & Troubleshooting
- Experience with large-scale monitoring and APM tools: ELK Stack, Dynatrace, New Relic, Nagios.
- Skilled in IIS management, troubleshooting, and performance monitoring.
- Experience supporting web farms in high-traffic SaaS environments.
- Strong analytical, diagnostic, and problem-solving skills with a focus on proactive improvement.
Application & CI/CD
- Extensive experience with .NET application architecture (caching, CDNs, load balancing, HA).
- Clear understanding of SDLC processes and hands-on experience with CI/CD tools such as TeamCity, Octopus Deploy, GitHub, Jenkins, Codefresh.
Additional Technologies (Preferred)
- Active Directory, SSL, FTP, Big-IP F5
- T-SQL, MongoDB, MySQL, SQL Server
- Git, Chef, Salt
- Kafka
- Linux / Windows Server Administration
- Apache, Bash