Description:
We are building a next-generation Customer Data Platform (CDP) powered by the Databricks Lakehouse architecture and Lakehouse Engine framework.
We're looking for a skilled Data Engineer with 4-9 years of experience to help us build metadata-driven pipelines, enable real-time data processing, and support marketing campaign orchestration capabilities at scale.
The core responsibilities for the job include the following:
Lakehouse Engine Implementation:
- Configure and extend the Lakehouse Engine framework for batch and streaming pipelines.
- Implement the medallion architecture (Bronze -> Silver -> Gold) using Delta Lake.
- Develop metadata-driven ingestion patterns from various customer data sources (see the sketch after this list).
- Build reusable transformers for PII handling, data standardization, and data quality enforcement.
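For illustration, here is a minimal PySpark sketch of a metadata-driven Bronze-to-Silver step of the kind described above. The config dict, table names, paths, and column names are hypothetical placeholders; the open-source Lakehouse Engine framework drives equivalent steps from its own configuration format rather than this API.

```python
# Hypothetical sketch of a metadata-driven Bronze -> Silver step.
# The `config` dict, paths, and table/column names are illustrative,
# not the Lakehouse Engine's actual API.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

config = {
    "source_path": "s3://cdp-landing/crm/customers/",  # placeholder bucket
    "bronze_table": "cdp.bronze_customers",
    "silver_table": "cdp.silver_customers",
    "pii_columns": ["email", "phone"],                 # columns to mask
}

# Bronze: ingest raw files as-is, stamped with load metadata.
raw = (spark.read.format("json").load(config["source_path"])
       .withColumn("_ingested_at", F.current_timestamp()))
raw.write.format("delta").mode("append").saveAsTable(config["bronze_table"])

# Silver: standardize and hash PII, driven entirely by the config.
silver = spark.read.table(config["bronze_table"])
for col in config["pii_columns"]:
    silver = silver.withColumn(col, F.sha2(F.col(col).cast("string"), 256))
(silver.dropDuplicates(["customer_id"])  # "customer_id" is a placeholder key
       .write.format("delta").mode("overwrite")
       .saveAsTable(config["silver_table"]))
```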
Real-Time CDP Enablement:
- Build Spark Structured Streaming pipelines for customer behavior and event tracking.
- Set up Debezium + Kafka for Change Data Capture (CDC) from CRM systems (see the sketch after this list).
- Design and develop identity resolution logic across both streaming and batch datasets.
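As a rough sketch of the streaming side, the following consumes a Debezium change-data topic from Kafka with Structured Streaming and appends the events to a Delta table. The broker address, topic name, envelope schema, and checkpoint path are all assumptions for illustration, not a confirmed setup.

```python
# Hypothetical sketch: read a Debezium CDC topic with Structured Streaming
# and land the change events in a Delta Bronze table.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Debezium wraps every change event in an envelope with before/after images.
after_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("email", StringType()),
    StructField("updated_at", LongType()),
])
envelope = StructType([
    StructField("op", StringType()),   # c = create, u = update, d = delete
    StructField("after", after_schema),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
          .option("subscribe", "crm.public.customers")        # placeholder topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), envelope)
                   .alias("evt")))

changes = (events.select("evt.op", "evt.after.*")
           .filter(F.col("op") != "d"))   # this sketch ignores deletes

query = (changes.writeStream.format("delta")
         .option("checkpointLocation", "/chk/crm_customers")  # placeholder path
         .outputMode("append")
         .toTable("cdp.bronze_crm_customers"))
```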
DataOps and Governance:
- Use Unity Catalog for managing RBAC, data lineage, and auditability.
- Integrate Great Expectations or similar tools for continuous data quality monitoring (see the sketch after this list).
- Set up CI/CD pipelines for deploying Databricks notebooks, jobs, and DLT pipelines.
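For a flavor of declarative quality enforcement, here is a short sketch using Delta Live Tables expectations rather than Great Expectations (which would express similar rules as an expectation suite). Table names and rules are illustrative only, and the snippet assumes it runs inside a DLT pipeline.

```python
# Illustrative Delta Live Tables snippet: declarative quality rules on a
# Silver table. Table names and rule conditions are placeholders.
import dlt
from pyspark.sql import functions as F

@dlt.table(name="silver_customers", comment="Standardized customer records")
@dlt.expect_or_drop("valid_id", "customer_id IS NOT NULL")  # drops bad rows
@dlt.expect("has_email", "email IS NOT NULL")               # logged, not enforced
def silver_customers():
    return (dlt.read_stream("bronze_customers")
            .withColumn("email", F.lower(F.col("email"))))
```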
Requirements:
- 4-9 years of hands-on experience in data engineering.
- Expertise in the Databricks Lakehouse platform, Delta Lake, and Unity Catalog.
- Advanced PySpark skills, including Structured Streaming.
- Experience implementing Kafka + Debezium CDC pipelines.
- Strong SQL transformation, data modeling, and analytical querying skills.
- Familiarity with metadata-driven architecture and parameterized pipelines.
- Understanding of data governance: PII masking, access controls, and lineage tracking.
- Proficiency in working with AWS, MongoDB, and PostgreSQL.
Nice to Have:
- Experience working on Customer 360 or Martech CDP platforms.
- Familiarity with Martech tools like Segment, Braze, or other CDPs.
- Exposure to ML pipelines for segmentation, scoring, or personalization.
- Knowledge of CI/CD for data workflows using GitHub Actions, Terraform, or the Databricks CLI.