Design and develop SLAs, SLOs, and SLIs for services within the Business Unit.
Be involved in the whole process of development and production system operation, including system maintenance, monitoring, automation, backend operation, ensuring high availability, regular application releases, troubleshooting, and middleware performance tuning, and collaborate with functional and technical team members to provide high-quality services.
Be involved in automating routine manual production and non-production operations using technologies such as Ansible, Chef, etc. Will be the key person to propose and implement automation to increase productivity and quality.
Continuously improve system performance and reliability.
Should have a service-ownership mindset and be able to react proactively to production issues.
Propose new technologies and tools to improve the whole process of development, testing, and production operations. Strong self-learning ability and motivation to work on new technologies.
Work closely with developers, product managers, project managers, team leads, security, and QA team members in different locations (Singapore, Japan, India, etc.).
Experience: 8 to 14 years
Qualifications:
Must-have
Over 8 years of experience in SRE, independently handling high-traffic production systems, troubleshooting (middleware, infrastructure), automation, regular operations, etc.
Experience implementing Site Reliability Engineering principles for performance, reliability, monitoring, and alerting in production environments.
Experience in managing large-scale services.
Experience in the design and construction of public cloud environments (e.g., GCP, Azure), preferably GCP.
Good knowledge of CI/CD/CT pipelines using tools such as Jenkins or Bamboo and version control systems such as Git or SVN.
Strong knowledge of Linux-based system operation and extensive skills in Linux commands.
Hands-on experience in Unix/Linux shell and Python scripting.