Description
GSPANN is hiring a Senior AI / ML Operations Engineer. The role focuses on building AIOps / MLOps systems and automating ML pipelines.
Role and Responsibilities
- Architect and drive the implementation of scalable Artificial Intelligence for IT Operations (AIOps) and Machine Learning Operations (MLOps) frameworks.
- Mentor junior engineers and data scientists by sharing best practices in model deployment and operational excellence.
- Align technical strategies with business objectives through close collaboration with product managers, Site Reliability Engineers (SREs), and other key stakeholders.
- Establish and uphold engineering standards, including Service-Level Agreements (SLAs), Service-Level Indicators (SLIs), and Service-Level Objectives (SLOs) for machine learning and AIOps services.
- Design and manage Machine Learning (ML) CI / CD (Continuous Integration / Continuous Deployment) pipelines for model training, testing, deployment, and monitoring using tools such as Kubeflow, MLflow, and Apache Airflow.
- Implement robust monitoring systems to track model performance metrics like drift, latency, and accuracy, and automate retraining workflows where necessary.
- Lead model governance efforts by ensuring reproducibility, traceability, and compliance with frameworks such as FAIR (Findable, Accessible, Interoperable, Reusable), and by maintaining audit logs.
- Build AI / ML-powered solutions for proactive infrastructure monitoring, predictive alerting, and intelligent incident resolution.
- Enhance anomaly detection and root cause analysis by integrating and optimizing observability tools such as Prometheus, Grafana, ELK (Elasticsearch, Logstash, Kibana), Dynatrace, Splunk, and Datadog.
- Automate response workflows using predefined playbooks, runbooks, and self-healing systems.
- Apply statistical techniques and machine learning models to analyze logs, metrics, and distributed traces at scale.
Skills and Experience
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, Artificial Intelligence, Machine Learning, or a related field.
- Certifications in AWS / GCP DevOps, Kubernetes, or MLOps are desirable.
- 6+ years of hands-on experience in DevOps, MLOps, or AIOps, including at least 2 years in a leadership or senior engineering capacity.
- Demonstrate expert-level coding skills in Python and Bash, with working knowledge of Go or Java.
- Use Docker for containerization and Kubernetes for orchestration across major cloud providers like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.
- Work with CI / CD tools and infrastructure-as-code technologies like Terraform, Ansible, and Helm.
- Possess in-depth knowledge of ML lifecycle management, performance monitoring, and pipeline orchestration.
- Maintain large-scale observability and telemetry platforms effectively.
- Work with streaming data technologies including Apache Kafka, Apache Spark, and Apache Flink.
- Manage service mesh architectures such as Istio or Linkerd to ensure secure and efficient service communication.
- Understand data privacy and regulatory standards including the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA).