JOB DESCRIPTION : Data Engineer
We are seeking a highly skilled Data Engineer with deep expertise in Apache Kafka integration with Databricks, structured streaming, and large-scale data pipeline design using the Medallion Architecture. The ideal candidate will demonstrate strong hands-on experience in building and optimizing real-time and batch pipelines, and will be expected to solve real coding problems during the interview.
Responsibilities :
- Design, develop, and maintain real-time and batch data pipelines in Databricks.
- Integrate Apache Kafka with Databricks using Structured Streaming.
- Implement robust data ingestion frameworks using Databricks Autoloader.
- Build and maintain Medallion Architecture pipelines across Bronze, Silver, and Gold layers.
- Implement checkpointing, output modes, and appropriate processing modes in structured streaming jobs.
- Design and implement Change Data Capture (CDC) workflows and Slowly Changing Dimensions (SCD) Type 1 and Type 2 logic.
- Develop reusable components for merge / upsert operations and window function based transformations.
- Handle large volumes of data efficiently through proper partitioning, caching, and cluster tuning techniques.
- Collaborate with cross-functional teams to ensure data availability, reliability, and consistency.
Must Have :
- Apache Kafka : Integration, topic management, schema registry (Avro / JSON).
- Databricks & Spark Structured Streaming :
  1. Output Modes : Append, Update, Complete
  2. Output Sinks : Memory, Console, File, Kafka, Delta
  3. Checkpointing and fault tolerance
- Databricks Autoloader : Schema inference, schema evolution, incremental loads.
- Medallion Architecture implementation expertise.
- Performance Optimization :
  i. Data partitioning strategies
  ii. Caching and persistence
  iii. Adaptive query execution and cluster configuration tuning
- SQL & Spark SQL : Proficiency in writing efficient queries and transformations.
- Data Governance : Schema enforcement, data quality checks, and monitoring.
Good to Have :
- Strong coding skills in Python and PySpark.
- Experience working in CI / CD environments for data pipelines.
- Exposure to cloud platforms (AWS / Azure / GCP).
- Understanding of Delta Lake, time travel, and data versioning.
- Familiarity with orchestration tools like Airflow or Azure Data Factory.
(ref : hirist.tech)