Senior AI Data Platform Reliability & Validation Engineer 3

Oracle • Bangalore (division)
16 days ago
Job description

Key Responsibilities:

  • Design, develop, and execute end-to-end (E2E) scenario validations that simulate real-world usage of complex AI data platform workflows (data ingestion, transformation, ML pipeline orchestration, etc.).
  • Collaborate closely with product, engineering, and field teams to identify gaps in coverage and propose test automation strategies.
  • Develop and maintain automated test frameworks supporting E2E, integration, performance, and regression testing for distributed data / AI services.
  • Monitor system health across the stack (infrastructure, data pipelines, AI / ML workloads) and proactively detect failures or SLA breaches.
  • Champion SRE best practices including observability, incident management, blameless postmortems, and runbook automation.
  • Analyze logs, traces, and metrics to identify reliability, latency, and scalability issues; drive root cause analysis and corrective actions.
  • Partner with engineering to drive improvements in high availability, fault tolerance, and continuous integration and delivery (CI / CD).
  • Participate in on-call rotation to support critical services, ensuring rapid resolution and minimizing customer impact.

Desired Qualifications:

  • Bachelor’s or master’s degree in Computer Science, Engineering, or a related field (or demonstrated equivalent experience).
  • 5+ years of experience in software QA / validation, SRE, or DevOps roles, ideally in data platforms, cloud, or AI / ML environments.
  • Proficient with DevOps automation and tools for continuous integration, deployment, and monitoring (e.g., Terraform, Jenkins, GitLab CI / CD, Prometheus).
  • Working knowledge of distributed systems, data engineering pipelines, and cloud-native architectures (OCI, AWS, Azure, GCP, etc.).
  • Strong proficiency in Java, Python, and related technologies.
  • Hands-on experience with test automation frameworks (e.g., Selenium, pytest, JUnit) and scripting (Python, Bash, etc.).
  • Familiarity with SRE practices: service-level objectives and agreements (SLOs / SLAs), incident response, and observability (Prometheus, Grafana, ELK, etc.).
  • Strong troubleshooting and analytical skills with a passion for reliability engineering and process automation.
  • Excellent communication and cross-team collaboration abilities.