Description
GSPANN is hiring an AI Operations Engineer. The role focuses on deploying ML models, automating CI/CD pipelines, and implementing AIOps solutions.
Role and Responsibilities
- Build, automate, and manage continuous integration and continuous deployment (CI/CD) pipelines for machine learning (ML) models.
- Partner with data scientists to transition ML models from experimentation to production environments.
- Use tools such as Docker, Kubernetes, MLflow, or Kubeflow to deploy, monitor, and maintain scalable ML systems.
- Implement systems for model versioning, model drift detection, and performance tracking.
- Maintain reproducibility and traceability of ML experiments and outputs.
- Design and implement AIOps frameworks to enable predictive monitoring, anomaly detection, and intelligent alerting.
- Leverage observability tools, including Prometheus, Grafana, ELK (Elasticsearch, Logstash, Kibana), Dynatrace, and Datadog, to collect and analyze infrastructure and application metrics.
- Utilize machine learning and statistical methods to identify patterns, automate root cause analysis, and forecast system behavior.
- Write and maintain automation scripts that support remediation and incident response.
Skills and Experience
- Bachelor's or Master's degree in Computer Science, Data Science, Information Technology, or a related discipline.
- Certification in DevOps or MLOps from AWS or GCP is preferred.
- Understand Site Reliability Engineering (SRE) practices and metrics such as Service-Level Agreements (SLAs), Service-Level Indicators (SLIs), and Service-Level Objectives (SLOs).
- Demonstrate strong programming proficiency in Python.
- Work with ML lifecycle platforms such as MLflow, Kubeflow, TensorFlow Extended (TFX), and Data Version Control (DVC).
- Use Docker and Kubernetes for containerization and orchestration.
- Employ CI/CD tools including GitHub Actions, Jenkins, and GitLab CI/CD.
- Operate monitoring and logging systems like Prometheus, Grafana, ELK, Datadog, and Splunk.
- Possess hands-on experience with cloud platforms such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP).
- Apply DevOps principles effectively and use infrastructure-as-code tools like Terraform and Ansible.
- Handle projects involving Natural Language Processing (NLP), time-series forecasting, or anomaly detection models.
- Build and manage large-scale distributed computing environments.