
Principal Site Reliability Engineer

Rakuten India, Bengaluru, India

6 days ago

Job description

Responsibilities:

Design and develop SLAs, SLOs, and SLIs for services within the Business Unit.

Participate in the whole process of development and production system operation, including system maintenance, monitoring, automation, backend operation, ensuring high availability, regular application releases, troubleshooting, and middleware performance tuning, and collaborate with functional and technical team members to provide high-quality services.

Automate routine manual production and non-production operations using technologies such as Ansible and Chef. Be the key person who proposes and implements automation to increase productivity and quality.

Continuously improve system performance and reliability.

Have a service-ownership mindset and react proactively to production issues.

Propose new technologies and tools to improve the whole process of development, testing, and production operations. Strong self-learning ability and motivation to work with new technologies are expected.

Work closely with developers, product managers, project managers, team leads, security, and QA team members in different locations (Singapore, Japan, India, etc.).

Experience: 8 to 14 years

Qualifications (must-have):

Over 8 years of experience in SRE, including independently handling high-traffic production systems, troubleshooting (middleware, infrastructure), automation, and regular operations.

Implement Site Reliability Engineering principles for performance, reliability, monitoring, and alerting in the production environment.

Experience managing large-scale services.

Experience in the design and construction of public cloud environments (e.g., GCP, Azure), preferably GCP.

Good knowledge of CI/CD/CT pipelines using tools such as Jenkins or Bamboo, and of version control systems such as Git or SVN.

Strong knowledge of Linux-based system operation and extensive skill with Linux commands.

Hands-on experience with Unix/Linux, shell, and Python scripting.

Experience with automation and configuration management tools, e.g., Terraform, Puppet, Chef, Ansible.

Experience in developing and operating one or more of the following systems: Kubernetes, Nginx, ELK stack, Hadoop, etc.

Identify process gaps and recommend best practices based on industry standards.

Provide technical expertise on complex automation and functional issues.

Flexible availability for emergency support based on business requirements. Must adapt to business needs in terms of working hours.

Experience with big data technologies such as Hadoop and NoSQL databases (Couchbase, Cassandra).
