Role: Site Reliability Engineer
6+ Years
Permanent / Bangalore - Hybrid
Job Description
We are looking for an engineer to focus on Developer Experience who can help us design, build, and maintain high-performance, scalable, and reliable services. Because Company provides a Contact Center service, we play a critical role in our customers' business operations and therefore need to provide a highly available, fault-tolerant service.
We believe in a DevOps philosophy in which every engineering team is responsible for the software it builds and deploys. SREs play a critical role in ensuring that teams have the tools, practices, and expertise to make that happen in a blame-free culture.
Our mission is to improve developers' experience by giving them the tools to manage the entire software lifecycle and to be self-sufficient.
To help with this, we are building our own internal PaaS using the latest technologies, such as Kubernetes, Prometheus, Kotlin, and others. This platform is an important pillar of our engineering effort and helps us deliver better, faster, and more reliable solutions for our customers.
Responsibilities :
- Design, build, harden, and maintain key parts of our internal platform (from CI / CD to developer tools that aim to increase R&D productivity)
- Help migrate to industry-leading CI / CD tools like GitHub Actions
- Help automate safe deployment practices by using industry-leading tools like GitHub Actions, ArgoCD, Argo Rollouts, Helm Charts, etc.
- Help automate infrastructure provisioning and other engineering processes by working on automations built on top of an engineering platform written in GitHub Actions
- Coach and up-skill other engineering team members
- Solve challenging technical problems and put your skills to the test every day; see the immediate impact of your work and the value you create for other engineers
- Automate every aspect of our infrastructure to remove as much human intervention as possible
- Develop effective tooling, alerts, and responses to identify and address reliability risks
- Drive and promote protocols on production readiness and operational excellence
- Partner with product engineering teams to debug production outages and carry out action items to improve reliability of those systems
- Advocate for automated testing, continuous integration and delivery, feature toggles and progressive rollouts
- Plan for the growth of Company infrastructure
Skills and Qualifications :
- 6+ years of experience
- Understand large-scale complex systems from a reliability perspective
- Design, implement, and maintain CI / CD processes and tools
- Passion for producing clean, standards-compliant, secure code
- Bringing a developer mindset and applying it to infrastructure
- Know your way around Linux / Unix systems
- Experience with Kubernetes
- Experience with Infrastructure as Code tools like Terraform and Ansible
- Experience building software with a programming language such as Java, Kotlin, Scala, or any other JVM-based language
- Experience writing scripts to automate tasks with a language like Ruby, Python, Bash, or any other scripting language
- Experience with at least one relational and one non-relational database (e.g. PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch)
- Ability to identify time-consuming and error-prone manual tasks and then build / leverage tooling to automate them
- Ability to identify root causes of instability in a large-scale distributed system across stacks
Nice to haves / Pluses :
- Experience with cloud-based solutions such as Amazon AWS, Google Cloud, or Microsoft Azure
- Experience with CI / CD platforms (e.g. Jenkins, GitLab), containers (Docker, Kubernetes), and artifact management tools (e.g. Nexus, ECR)
- Experience with the Go programming language