We are a team building the future of healthcare analytics. Our goal is to design, build and scale robust data pipelines that power nationwide analytics and support our machine learning systems.
This role is fully remote for candidates based in Europe or India, with periodic team gatherings in Mountain View, California.
What You'll Do:
- Design, build and maintain scalable ETL pipelines using Python (Pandas, PySpark) and SQL, orchestrated with Airflow (MWAA).
- Develop and maintain the SAIVA Data Lake/Lakehouse on AWS, ensuring quality, governance, scalability and accessibility.
- Run and optimize distributed data processing jobs with Spark on AWS EMR and/or EKS.
- Implement batch and streaming ingestion frameworks (APIs, databases, files, event streams).
- Enforce data validation and quality checks to keep pipelines reliable for analytics and ML.
- Monitor and troubleshoot pipelines with CloudWatch, integrating observability tools like Grafana, Prometheus or Datadog.
- Automate infrastructure provisioning with Terraform, following AWS best practices.
- Manage SQL Server, PostgreSQL and Snowflake integrations into the Lakehouse.
- Participate in an on-call rotation to support pipeline health and resolve incidents quickly.
What We're Looking For:
- 5+ years in data engineering, ETL pipeline development or data platform roles (flexible for exceptional candidates).
- Experience designing and operating data lakes or lakehouse architectures on AWS (S3, Glue, Lake Formation, Delta Lake, Iceberg).
- Strong SQL skills with PostgreSQL, SQL Server and at least one cloud data warehouse (Snowflake or Redshift).
- Proficiency in Python (Pandas, PySpark); Scala or Java a plus.
- Hands-on experience with Spark on AWS EMR and/or EKS for distributed processing.
- Strong background in Airflow (MWAA) for workflow orchestration.
- Expertise with AWS services: S3, Glue, Lambda, Athena, Step Functions, ECS, CloudWatch.
- Proficiency with Terraform for IaC; familiarity with Docker, ECS and CI/CD pipelines.
- Experience building monitoring, validation and alerting into pipelines with CloudWatch, Grafana, Prometheus or Datadog.
- Strong communication skills and the ability to collaborate with data scientists, analysts and product teams.