Job description
What will you do:
- Applies software engineering principles to the operations domain.
- Contributes to a service's codebase, writes automation that aids in the management of a service, and performs operational engineering work to support a service's Service Level Objectives (SLO).
- Ensures service reliability meets users' needs, including for internally critical and externally visible services.
- Uses software and systems engineering to design, build, and run large-scale, distributed, fault-tolerant systems.
- Focuses on iterative improvement through toil reduction and error-budget enforcement.
- Interfaces with both cloud IaaS and SaaS providers and internal stakeholders, including Support, IT, and Product Engineering, to achieve desired outcomes.
- Participates in an on-call rotation within a geographically distributed team to provide 24x7x365 production support, including responsibility for responding to urgent customer issues.
- Practices sustainable incident response and blameless postmortems.
- Works within a small agile team to develop and improve SRE methodologies, support peers, plan, and self-improve.
- Provides feedback on bugs and feature improvements to the various Red Hat Product Engineering teams.
What will you bring:
- Bachelor's degree in computer science or a related technical field involving software or systems engineering, or practical experience demonstrating interest in SRE
- 2+ years of experience using cloud providers and technologies (Google, Azure, Amazon, OpenStack, etc.)
- 1+ years of experience administering a Kubernetes-based production environment
- 2+ years of experience programming with at least one object-oriented language; Golang or Python is a big plus
- Ability to collaboratively troubleshoot and solve problems in a team setting
- Basic understanding of UNIX or Linux operating systems
The following will be considered a plus:
- Demonstrated comfort with collaboration, open communication, and reaching across functional boundaries
- Passion for understanding users' needs and delivering outstanding user experiences
Additional skills:
- Demonstrated ability to quickly and accurately troubleshoot system issues
- Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
- 2+ years of experience managing Linux servers running Red Hat Enterprise Linux (RHEL), CentOS, or Fedora hosted at a cloud provider such as Amazon Web Services (AWS), Google Compute Engine (GCE), or Microsoft Azure
- 1+ years of experience with enterprise systems monitoring
- 2+ years of experience with enterprise configuration management software like Red Hat Ansible Automation Platform (AAP)
- Experience with static code analysis tools
- Some experience with code deployment across cloud-based environments
- Some experience with continuous integration and continuous deployment approaches
- Some experience working with complex distributed systems
- Demonstrated ability to debug and optimize code and to automate routine tasks
- Problem-solving skills and the ability to work with minimal supervision and as part of a global team
- Experience working with agile development methodologies
Skills Required
AWS, SaaS, SRE