Site Reliability Engineer - NOC
Make Your Mark
As the Observability Engineer - NOC, you will build and maintain tools, applications, and services supporting the BlackLine SaaS application and internal teams. A successful candidate must possess solid critical thinking skills, have experience supporting large server farms and 24x7 high-availability, mission-critical, traffic-intensive web infrastructures, and be familiar with container technologies. Large SaaS experience is very desirable.
You’ll Get To
- Ensure 99.99%+ availability of the services and infrastructure that span multiple global datacentres in private and public clouds.
- Troubleshoot BL container platforms and supporting automation in a highly available, high traffic environment.
- Monitor and maintain health, performance, and security of all infrastructure components.
- Build systems and perform necessary tasks to deliver against committed project timelines, with a desire to automate everything.
- Solve real-life problems in a bleeding-edge, high-performance, and high-traffic environment. Maintain documentation and operational knowledge base.
- Triage first-level events and incidents.
- Adhere to the change management and other established processes and procedures.
- Respond to and troubleshoot incidents (Incident Management). Conduct root cause analyses.
- Evaluate and analyse systems, performance, issues and metrics in order to provide recommendations for continuous improvements.
- Adhere to SLA compliance as defined.
- Participate in a scheduled 24x7 on-call rotation for second-tier support escalations.
What You’ll Bring
- 3 - 6 years of industry experience.
- 3+ years supporting Unix and / or Linux (Ubuntu, CentOS, Red Hat) and / or Windows.
- 3+ years supporting a SaaS / hosting-type, critical, revenue-generating environment.
- 2+ years working with development and continuous integration tooling (Jenkins, Bitbucket, GitHub).
- 2+ years working with tools like New Relic, Jira, Foglight.
- 1+ years of experience using container platforms and tooling (Kubernetes, Docker, Rancher, Helm, Anthos, Istio, GKE, AKS, etc.).
- Experience in hybrid cloud and / or multi-cloud environments (GCP, Azure, AWS, VMware).
- Understanding of software development processes and methodologies.
- Experience with scripting and / or systems programming languages (Bash, PowerShell, Python, Golang, C#).
- Hands-on problem-solving skills, technical leadership and mentoring qualities.
- Strong written and oral communication skills.
- Ability to participate in an on-call rotation.
- A minimum of two years of experience in a 24x7 operations organization, deploying and operating complex cloud infrastructure at scale.
We’re Even More Excited If You Have
- Significant experience with open source platforms and technologies.
- Track record of architecting, developing, and implementing robust, distributed online solutions.
- Familiarity with Configuration Management tools (Puppet, Chef, Salt, Ansible).
- Experience in mixed Windows / Unix / Linux environments.