Job Requirements
- Incident Response and Troubleshooting : Respond to complex live production incidents and cross-platform issues involving OS, networking, and databases in cloud-based SaaS / IaaS environments, and perform root cause analysis (RCA) on them. Apply SRE best practices for effective resolution.
- Analysis and Infrastructure Maintenance : Continuously monitor, analyze, and measure system health, availability, and latency using tools like Prometheus, Stackdriver, Elasticsearch, Grafana, and SolarWinds. Develop strategies to enhance system and application performance, availability, and reliability. In addition, maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure.
- Document system knowledge as you acquire it, create runbooks, and ensure critical system information is readily accessible.
- Security Management : Stay updated with security protocols and proactively identify, diagnose, and resolve complex security issues.
- Automation and Efficiency : Identify tasks and areas where automation can be applied to achieve time efficiencies and risk reduction. Develop software for deployment automation, packaging, and monitoring visibility.
- Issue Tracking and Resolution : Use Atlassian Jira, Google Buganizer, and Google IRM to track and resolve issues based on their priority.
- Team Collaboration and Influence : Work in tandem with other Cloud Infrastructure Engineers and developers to ensure maximum performance, reliability, and automation of our deployments and infrastructure. Additionally, consult and influence developers on new feature development and software architecture to ensure scalability.
- Debugging, Troubleshooting, and Advanced Support : Debug and troubleshoot service bottlenecks throughout the entire software stack. Additionally, provide advanced Tier 2 and Tier 3 support for NetApp's Cloud Data Services solutions.
- Directly influence the decisions and outcomes related to solution implementation : measure and monitor availability, latency, and overall system health.
- Proficiency in Linux / Unix and CoreOS.
- Demonstrated experience in scripting and infrastructure automation using tools and languages such as Ansible, Python, Go, or Ruby.
- Deep working knowledge of Containers, Kubernetes, and Serverless computing implementation.
- Familiarity with DevOps development methodologies.
- Experience with distributed systems design patterns using tools such as Kubernetes.
- Experience with cloud platforms such as AWS, Azure, or Google Cloud.
Education
A minimum of 8-12 years of experience is required. A Bachelor of Science degree in Computer Science, a master's degree, or equivalent experience is required.
Skills Required
AWS, Azure, Google Cloud, Ansible, Python, Go