Data Engineering Expert for Healthcare Analytics
We are revolutionizing healthcare analytics by designing robust data pipelines that power nationwide insights and support our machine learning systems.
This is a fully remote role, open to candidates in Europe or India, with periodic team gatherings in Mountain View, California.
Responsibilities
- Design, build, and maintain scalable ETL pipelines using Python (Pandas, PySpark) and SQL, orchestrated with Airflow (MWAA).
- Develop and maintain the SAIVA Data Lake/Lakehouse on AWS, ensuring data quality, governance, scalability, and accessibility.
- Run and optimize distributed data processing jobs with Spark on AWS EMR and/or EKS.
- Implement batch and streaming ingestion frameworks (APIs, databases, files, event streams).
- Enforce data validation and quality checks to keep data reliable and ready for analytics and ML.
- Monitor and troubleshoot pipelines with CloudWatch, integrating observability tools like Grafana, Prometheus, or Datadog.
- Automate infrastructure provisioning with Terraform, following AWS best practices.
- Manage SQL Server, PostgreSQL, and Snowflake integrations into the Lakehouse.
- Participate in an on-call rotation to support pipeline health and resolve incidents quickly.
- Write production-grade code, contribute to design and code reviews, and help establish engineering best practices.
Requirements
- 5+ years of experience in data engineering, ETL pipeline development, or data platform roles.
- Experience designing and operating data lake or lakehouse architectures on AWS.
- Strong SQL skills with PostgreSQL, SQL Server, and at least one cloud data warehouse such as Snowflake.
- Proficiency in Python (Pandas, PySpark); Scala or Java is a plus.
- Hands-on experience with Spark on AWS EMR and/or EKS for distributed processing.
- Strong background in Airflow (MWAA) for workflow orchestration.
- Expertise with AWS services: S3, Glue, Lambda, Athena, Step Functions, ECS, CloudWatch.
- Proficiency with Terraform for IaC; familiarity with Docker, ECS, and CI/CD pipelines.
- Experience building monitoring, validation, and alerting into pipelines with CloudWatch, Grafana, Prometheus, or Datadog.
- Strong communication skills and ability to collaborate with data scientists, analysts, and product teams.
- A track record of delivering production-ready, scalable AWS pipelines, not just prototypes.