Description
GSPANN is hiring an experienced Observability Engineer (AI Ops) with 12-15 years of expertise in monitoring, automation, and AI-driven operations. The role involves enhancing system reliability and performance through APM tools, cloud observability, scripting, and Site Reliability Engineering (SRE) practices.
Role and Responsibilities
- Use Application Performance Management (APM) tools such as Dynatrace and LogicMonitor to monitor and enhance system performance.
- Write and maintain automation scripts using Python and Bash to streamline monitoring and alerting processes.
- Deploy and manage Splunk for log analysis, real-time monitoring, and root cause troubleshooting.
- Operate and oversee Kubernetes clusters through Amazon Elastic Kubernetes Service (EKS) for high availability and scalability.
- Implement observability solutions on Amazon Web Services (AWS) and Microsoft Azure to ensure cloud-based systems are monitored and well-managed.
- Apply Site Reliability Engineering (SRE) principles to improve system resilience, scalability, and performance.
- Incorporate AI and machine learning in observability workflows to enable predictive monitoring and boost operational efficiency.
- Respond promptly to incidents and drive resolution efforts to minimize business disruptions.
- Continuously analyze and tune system performance, using proactive monitoring and feedback loops.
- Partner with development and operations teams to integrate observability tools and practices seamlessly across environments.
Skills and Experience
- Hold a Bachelor’s degree in Computer Science, Information Technology, or a related discipline.
- Bring 12-15 years of experience in observability engineering or related technical roles.
- Demonstrate advanced proficiency in APM tools (e.g., Dynatrace, LogicMonitor), scripting languages (Python, Bash), and Splunk.
- Have hands-on experience working with EKS, AWS, and Azure platforms.
- Show deep understanding of SRE concepts and how to apply them in production environments.
- Exhibit strong problem-solving and communication skills.
- Thrive in a fast-paced and dynamic environment.
- Hold certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or equivalent.
- Apply knowledge of AI and machine learning techniques in operational contexts.
- Understand and utilize performance optimization frameworks and related best practices.