Spark Scala Engineer • Delhi, India

Role Summary:
Design, build, and optimize large-scale ETL and data-processing pipelines handling GB–TB volumes. Operate within the Databricks ecosystem and drive migration of selected workloads to high-performance engines such as Polars and DuckDB. Maintain strong engineering rigor across CI/CD, testing, and code-quality enforcement. Apply analytical thinking to solve data reliability, performance, and scalability problems. AI familiarity is advantageous.
Core Responsibilities:
Develop and maintain distributed data pipelines using Scala, Spark, Delta, and Databricks.
Engineer robust ETL workflows tuned for high-volume ingestion, transformation, and publishing.
Profile pipelines, remove bottlenecks, and optimize compute, storage, and job orchestration.
Lead migration of suitable workloads to Polars, DuckDB, or equivalent high-performance engines.
Implement CI/CD workflows with automated builds, tests, deployments, and environment gating.
Enforce coding standards through code-coverage targets, unit/integration tests, and SonarQube rules.
Ensure pipeline observability: logging, data-quality checks, lineage, and failure diagnostics.
Apply analytical reasoning to triage complex data issues and deliver root-cause clarity.
Contribute to AI-aligned initiatives when required: retrieval-augmented generation (RAG) design, fine-tuning workflows, and agentic patterns.
Collaborate with product, analytics, and platform teams to operationalize data solutions.
Required Skills and Experience:
3+ years in data engineering with strong command of Scala and Spark.
Proven background in ETL design, distributed processing, and high-volume data systems.
Hands-on experience with Databricks (jobs, clusters, notebooks, Delta Lake).
Proficiency in workflow optimization, performance tuning, and memory management.
Experience with Polars, DuckDB, or similar columnar/accelerated engines.
CI/CD discipline using Git-based pipelines; strong testing and code-quality practices.
Familiarity with SonarQube, coverage metrics, and static analysis.
Strong analytical and debugging capability across data, pipelines, and infra.
Exposure to AI concepts: embeddings, vector stores, retrieval-augmented generation, fine-tuning, and agentic architectures.
Preferred:
Experience with Azure cloud environments.
Experience in metadata-driven or config-driven pipeline frameworks.