Job Description
Job Summary
As a Data SRE Lead, you will architect and oversee end-to-end reliability strategies for data infrastructure. You’ll work with cloud, data engineering, and analytics teams to ensure that critical data services are secure, automated, observable, and resilient.
Responsibilities
- Lead design and automation of resilient data platforms across multi-cloud environments.
- Drive data reliability frameworks (SLO / SLI metrics, incident prevention, SLA tracking).
- Oversee monitoring, logging, and cost optimization for big-data environments.
- Manage data platform scalability, versioning, and fault tolerance.
- Build recovery and rollback mechanisms for critical data pipelines.
- Guide teams in observability, data validation automation, and data governance integration.
- Mentor junior engineers and define SRE standards for data reliability.
Requirements
Skills & Tools
- 5–8 years of experience in Data Ops / Data Engineering / SRE.
- Expertise in Databricks, Kafka, Airflow, Snowflake, or Spark.
- Strong understanding of the AWS / Azure data ecosystem (S3, Glue, Redshift, ADF, Synapse).
- Automation using Terraform, Ansible, and Python scripting.
- Familiarity with monitoring and observability tools (CloudWatch, Grafana, Datadog).
- Experience implementing SLA / SLO frameworks for data reliability.
Eligibility
- Bachelor’s or Master’s in Computer Science, Data Science, or Information Systems.
- Proven record of leading data reliability or platform stability projects.
- Strong communication and leadership skills across technical and business teams.