- Act as the primary point of contact for system security, health, performance, and capacity
- Develop tools to automate deployment, monitor performance, and support system scalability
- Define and maintain SLAs aligned with our service model
- Lead incident response efforts for live issues, identifying and implementing tooling and process enhancements
- Collaborate with software engineers to ensure applications are designed for operability and scalability
- Build and manage Kubernetes-based infrastructure using Infrastructure as Code (IaC) tools : Terraform, Helm, Kustomize, AWS CDK (with TypeScript)
- Maintain scalable cloud environments, primarily in AWS
- Implement and manage monitoring and alerting systems to ensure system reliability
Required Skills & Experience :
- 6–8+ years in a Cloud Engineering / DevOps / Site Reliability Engineering role
- Proven enterprise-level technical operations experience
- Deep knowledge of UNIX / Linux systems administration for critical deployments
- Strong troubleshooting skills across systems, networking, and application stacks
- Experience supporting and managing cloud infrastructure on AWS
- Solid programming / scripting experience in one or more of the following : Python, Node.js, Java, C, Shell
- Hands-on experience with CI / CD, monitoring, and infrastructure automation tools
- Familiarity with IaC tools : Terraform, Helm, Kustomize, AWS CDK
- Ability to define, own, and maintain monitoring and alerting for cloud-based systems
Nice to Have :
- Experience with security tools like Amazon Inspector, Amazon Detective, Lacework, or similar
- Familiarity with financial sector compliance and security best practices
- Experience working in fast-paced production environments with mission-critical systems
Skills Required :
Python, Java, Terraform, Helm, Node.js, C, Shell