Senior Data Engineer
We're building the future of healthcare analytics. In this role, you will design, build, and scale robust data pipelines that power nationwide analytics and support our machine learning systems. Our goal is pipelines that are reliable, observable, and continuously improving in production.
This is a fully remote role open to candidates based in Europe or India, with periodic team gatherings in Mountain View, California.
Responsibilities
- Design, build, and maintain scalable ETL pipelines using Python, Pandas, PySpark, and SQL, orchestrated with Airflow on Amazon MWAA.
- Develop and maintain the SAIVA Data Lake / Lakehouse on AWS, ensuring quality, governance, scalability, and accessibility.
- Run and optimize distributed data processing jobs with Spark on AWS EMR and/or EKS.
- Implement batch and streaming ingestion frameworks for sources such as APIs, databases, files, and event streams.
- Enforce validation and quality checks to ensure reliable analytics and ML readiness.
- Monitor and troubleshoot pipelines with CloudWatch, integrating observability tools such as Grafana, Prometheus, or Datadog.
- Automate infrastructure provisioning with Terraform following AWS best practices.
- Manage SQL Server, PostgreSQL, and Snowflake integrations into the Lakehouse.
- Participate in an on-call rotation to support pipeline health and resolve incidents quickly.
Required Skills and Qualifications
- 5+ years in data engineering, ETL pipeline development, or data platform roles; flexible for exceptional candidates.
- Experience designing and operating data lakes or Lakehouse architectures on AWS (S3, Glue, Lake Formation, Delta Lake, Iceberg).
- Strong SQL skills with PostgreSQL, SQL Server, and at least one cloud warehouse such as Snowflake or Redshift.
- Proficiency in Python with Pandas and PySpark; Scala or Java is a plus.
- Hands-on experience with Spark on AWS EMR and/or EKS for distributed processing.
- Strong background in Airflow (MWAA) for workflow orchestration.
- Expertise with AWS services such as S3, Glue, Lambda, Athena, Step Functions, ECS, and CloudWatch.
- Proficiency with Terraform for IaC; familiarity with Docker, ECS, and CI/CD pipelines.
- Experience building monitoring, validation, and alerting into pipelines with CloudWatch, Grafana, Prometheus, or Datadog.
- Strong communication skills and the ability to collaborate with data scientists, analysts, and product teams.
- A track record of delivering production-ready, scalable AWS pipelines, not just prototypes.