Job Description : AI-Driven Observability Engineer
Experience : 10+ Years
About the Role :
We are seeking a highly skilled AI-Driven Observability Engineer to design, implement, and maintain end-to-end observability solutions for infrastructure and applications. You will play a key role in ensuring the reliability, performance, and scalability of our distributed systems by developing monitoring, logging, and tracing capabilities. The ideal candidate will have expertise in ETL, Data Science, and Machine Learning, along with hands-on experience using OpenTelemetry, Splunk, and Kafka for comprehensive observability.
Key Responsibilities :
- Design & Develop Observability Solutions : Build and enhance telemetry pipelines for logs, metrics, and traces using industry-standard tools (Kafka, OpenTelemetry, Splunk).
- Instrument Applications : Implement observability best practices in infrastructure, applications and platforms.
- Design and implement machine learning models to analyze logs, metrics, and traces for anomaly detection, predictive failure analysis, and root cause analysis.
- Monitor & Analyze System Performance : Build real-time data visualization dashboards and alerts to track system health, detect anomalies, and support real-time troubleshooting.
- Work with Event-Driven Architectures : Integrate observability with messaging systems like Kafka, RabbitMQ, or Pulsar for real-time monitoring.
- Collaborate Across Teams : Work closely with SREs, DevOps, and development teams to improve system reliability and incident response.
- Security & Compliance : Ensure observability data is securely stored and compliant with relevant regulations (GDPR, HIPAA, etc.).
- Optimize Performance : Conduct root cause analysis and improve system observability to reduce downtime and improve response times.
Required Skills & Experience :
- Data Science & Machine Learning experience : Hands-on proficiency in Python, TensorFlow, PyTorch, Scikit-learn, Pandas, NumPy.
- Extensive knowledge of ETL techniques : Data extraction, transformation, and loading using Apache Airflow, Apache NiFi, Spark, or similar tools.
- Observability Stack : Hands-on experience with Prometheus, Grafana, ELK Stack, Loki, OpenTelemetry, Jaeger, or Zipkin.
- Experience with Time-Series Analysis, Predictive Analytics, and AI-driven Observability.
- Cloud & Infrastructure : Experience with AWS, Azure, or GCP observability services (e.g., CloudWatch, Azure Monitor).
- Distributed Systems & Microservices : Understanding of Kubernetes, Docker, and Service Mesh technologies (Istio, Linkerd).
- Event-Driven Architectures : Experience with Kafka, RabbitMQ, or other message brokers.
- Database & Storage : Familiarity with time-series databases (InfluxDB, VictoriaMetrics) and NoSQL / SQL databases.
Preferred Qualifications :
- Experience in AIOps and intelligent observability or anomaly detection.
- Knowledge of Chaos Engineering for resilience testing.
- Certifications in AWS, Azure, Kubernetes, or Observability tools.
- Knowledge of data engineering and big data technologies like Hadoop, Spark, and Flink.
- Experience with machine learning models for predictive observability.
Why Join Us ?
- Work on cutting-edge observability solutions in a high-scale production environment.
- Opportunity to automate infrastructure monitoring and enhance system resilience.
- Collaborate with cross-functional teams to improve reliability engineering.
- Competitive salary, benefits, and growth opportunities in a fast-paced environment.
(ref : hirist.tech)