About the Role
The Site Reliability Engineer ensures digital systems are reliable, resilient, and scalable. This role automates operational processes, reduces manual intervention, and strengthens incident response across complex environments. With expertise in infrastructure, scripting, cloud services, and observability, the Site Reliability Engineer plays a key role in maintaining system uptime and driving continuous improvements in performance and deployment workflows.
What You'll Do
- Automate processes to enhance system reliability and scalability
- Implement proactive monitoring and maintenance to prevent incidents
- Streamline CI/CD and development-to-deployment workflows
- Develop tools and scripts that reduce manual operational efforts
- Respond to incidents, manage root cause analysis, and minimize service disruption
- Continuously research and adopt new technologies for performance gains
- Partner with cross-functional teams to improve end-to-end system performance
- Support other duties and technical projects as required by leadership
What You'll Bring
- Bachelor's degree in Computer Science, Software Engineering, or a related technical field
- 2-5 years of experience in SRE, DevOps, or cloud-native infrastructure roles
- Proven ability to build and manage CI/CD pipelines
- Experience with cloud-native platforms and technologies (e.g., AWS, Azure, GCP)
- Strong scripting skills (e.g., Python, Bash) and systems troubleshooting
- Knowledge of Agile principles and automation best practices
- Excellent problem-solving and communication skills
- Certifications (preferred): CKA, AWS DevOps Engineer, SRE Foundation
Must Have Skills
- Programming Languages: Proficiency in at least one of Python, Java, or JavaScript
- Cloud Platforms: Experience with any major cloud provider (AWS, Azure, or GCP)
- Infrastructure as Code (IaC): Hands-on experience with tools like Terraform, CloudFormation, or Pulumi
- CI/CD Pipelines: Familiarity with tools such as GitHub Actions, GitLab CI, Jenkins, or Argo CD
- Containerization: Experience with Docker and orchestration tools like Kubernetes
- Observability & Monitoring: Knowledge of tools such as Prometheus, Grafana, Splunk, CloudWatch, or Datadog
Nice To Have
- Experience with chaos engineering or resilience testing
- Familiarity with service mesh (Istio, Linkerd), edge proxies, or policy engines
- Exposure to SRE metrics (SLOs, SLIs, Error Budgets) and golden signals monitoring