We're looking for someone who can bridge DevOps and AI to keep our systems intelligent, automated, and running at peak performance. If you enjoy solving complex operational challenges with automation, analytics, and AI, this is the role for you.
Responsibilities :
- Implement and manage AIOps platforms for intelligent monitoring, alerting, and incident response.
- Integrate AI-driven insights into our DevOps workflows to reduce downtime and improve system reliability.
- Automate operational tasks using scripts, pipelines, and orchestration tools.
- Build and maintain observability dashboards, leveraging metrics, logs, and traces.
- Collaborate with data, infrastructure, and application teams to ensure seamless AIOps adoption.
- Continuously optimize monitoring rules and anomaly detection models for evolving system needs.
Requirements :
- Proven experience in AIOps and DevOps.
- Strong knowledge of cloud platforms (AWS, Azure, GCP) and of containers and container orchestration (Docker, Kubernetes).
- Familiarity with observability tools (Prometheus, Grafana, ELK, Splunk, Datadog).
- Scripting/programming skills in Python, Shell, or similar.
- Understanding of machine learning for operations (anomaly detection, predictive maintenance, etc.).
Tech Stack and Tools :
- AIOps and DevOps Expertise: Experience implementing and managing AIOps platforms for intelligent monitoring, alerting, and automation.
- Cloud Proficiency: Hands-on with AWS, Azure, or GCP for infrastructure and deployment.
- Containerization and Orchestration: Strong in Docker and Kubernetes for scalable deployments.
- Observability Tools: Proficiency with Prometheus, Grafana, ELK/EFK Stack, Splunk, or Datadog.
- Automation and Scripting: Skilled in Python, Shell, or similar scripting for automating operational workflows.
- Monitoring and Incident Response: Setting up intelligent alerts, root-cause analysis, and predictive monitoring.
Nice-to-Have :
- GenAI Integration: Applying generative AI for operational insights or incident response.
- MLOps Knowledge: Managing ML model lifecycles (CI/CD for ML, model monitoring).
- Big Data and Streaming Tools: Familiarity with Kafka and Spark for operational data handling.
- Incident Management Platforms: PagerDuty, Opsgenie, ServiceNow integrations.
- Security Automation: Integrating security monitoring into AIOps workflows.
(ref : hirist.tech)