Cloud Operations Engineers are responsible for building internal tools and process automation. Day-to-day duties include creating and monitoring system alert dashboards, reviewing critical event and system logs, accessing the customer instances that underpin production databases, and performing server administration, including performance troubleshooting. Applicants must be critical thinkers who are quick to detect, resolve, or escalate issues that are sometimes broad in scope and difficult to trace.
We are looking to speak to candidates who are based in Bengaluru for our hybrid working model.
Responsibilities
- Help scale the Cloud Operations Engineering team with the strategic implementation and refinement of processes and tools
- Provide career development feedback and advice to direct reports
- Identify and measure team health indicators and performance metrics
- Ensure proper team focus on priorities, objectives, and related deliverables
- Collaborate with technical and non-technical teams across the company
- Balance your time between leading your team, working on customer incidents and being involved in projects
- Be a source of guidance and advice to your own team members and other teams within MongoDB
- Build a relationship with your team around trust
- Successfully coordinate with a global team of Cloud Operations Engineers who are tasked with ensuring our uptime guarantees to the MongoDB Atlas customer base
- Participate in designing and building internal tools
- Assist in scoping, designing and deploying systems that reduce Mean Time to Resolve for customer incidents
- Monitor and detect emerging customer-facing incidents on the Atlas platform; assist in their proactive resolution
- Automate internal processes, routine monitoring and troubleshooting tasks
- Diagnose live incidents, differentiate between platform issues and usage issues, and take the next steps toward resolution
- Cooperate with our Product Management and Cloud Engineering organizations by identifying areas for improvement in the management applications powering the Atlas infrastructure
- Coordinate and participate in a weekly on-call rotation, where you will handle short-term customer incidents (from direct surveillance or through alerts via our Technical Services Engineers)
Requirements
- Management skills, with hands-on experience running small to mid-sized Engineering Teams in a rapid-growth environment
- Strong diagnostic / troubleshooting process, with significant experience troubleshooting end-to-end technical issues in production environments
- Experience supervising, leading and monitoring progress of Software Development projects
- Patience, empathy, and a genuine desire to help others
- Excellent communication skills, both written and verbal
- Ability to think on your feet, remain calm under pressure, and find solutions to challenges in real-time
- Experience with being an on-call DevOps, SRE, or Cloud Operations engineer
- Expertise with Linux system administration and networking technologies
- Knowledge of database and distributed system operations and concepts
- Knowledgeable about a wide range of web and internet technologies
- Familiarity with Amazon Web Services and other Cloud infrastructure platforms (e.g. GCP, Azure)
- Experience in monitoring, system performance data collection and analysis, and reporting
- Capability to write programs / scripts to solve both short-term systems problems and long-term strategic objectives for the Atlas product
- A CS / CE degree or equivalent experience
- At least 2 of the following programming languages: Java, Go, Python, TypeScript
- A keen interest in learning new skills and competencies
Skills Required
Linux Administration, DevOps, Troubleshooting, Automation