Description
We are seeking a talented and motivated Site Reliability Engineer (SRE) to join our organization.
The SRE will play a crucial role in ensuring the reliability, scalability, and performance of our infrastructure and applications, and in planning their capacity. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies.
Technologies
- CI/CD, Jenkins, Docker, Kubernetes, Terraform, Ansible, Python, Prometheus, Grafana, ELK stack, Splunk, Dynatrace, Datadog or similar, SLI, SLO, SLA and Error Budget concepts
Responsibilities
- Design, implement, and manage scalable, reliable, and secure cloud infrastructure using tools such as Terraform, Kubernetes, and Docker
- Develop and maintain monitoring and alerting systems to ensure the health and performance of applications and infrastructure. Utilize tools such as Prometheus, Grafana, and ELK stack
- Lead the response to critical incidents, perform root cause analysis, and implement long-term fixes to prevent recurrence
- Develop, maintain, and optimize continuous integration and continuous deployment (CI/CD) pipelines using tools such as Jenkins, GitLab CI, or CircleCI
- Automate routine tasks and improve efficiency through scripting and tools, utilizing languages such as Python, Bash, or Go
- Implement and manage security best practices for infrastructure and applications, including vulnerability assessments, penetration testing, and compliance with security standards
- Work closely with development, QA, and operations teams to ensure seamless integration and deployment of new features and updates
- Perform capacity planning and scaling of infrastructure to meet current and future demands
- Create and maintain comprehensive documentation for infrastructure, processes, and procedures
Requirements
- 5+ years of experience in a DevOps / SRE role
- Strong experience with cloud platforms (AWS, GCP, Azure)
- Proficiency in infrastructure as code (IaC) tools (Terraform, CloudFormation, etc.)
- Extensive experience with containerization and orchestration (Docker, Kubernetes)
- Strong knowledge of CI/CD tools (Jenkins, GitLab CI, CircleCI, etc.)
- Proficiency in scripting languages (Python, Bash, etc.)
- Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack, etc.)
- Ability to participate in capacity planning and scalability assessments to support business growth and requirements
- Working knowledge of SLI, SLO, SLA, and error budget concepts and their implementation
- Willingness to provide on-call support and participate in incident management and response activities as needed
- Solid understanding of networking and security principles
- Excellent problem-solving skills and the ability to work under pressure
- Strong communication and collaboration skills
We offer
- Opportunity to work on technical challenges that may have an impact across geographies
- Vast opportunities for self-development: online university, global knowledge-sharing opportunities, and learning opportunities through external certifications
- Opportunity to share your ideas on international platforms
- Sponsored Tech Talks & Hackathons
- Unlimited access to LinkedIn learning solutions
- Possibility to relocate to any EPAM office for short and long-term projects
- Focused individual development
- Benefit package: health benefits, retirement benefits, paid time off, flexible benefits
- Forums to explore passions beyond work (CSR, photography, painting, sports, etc.)