Roles & Responsibilities:
- Develop distributed data pipelines using PySpark on Databricks for ingesting, transforming, and publishing master data (see sketch 1 after this list)
- Write optimized SQL for large-scale data processing, including complex joins, window functions, and CTEs for MDM logic (sketch 2)
- Implement match/merge algorithms and survivorship rules using Informatica MDM or Reltio APIs (survivorship illustrated in sketch 2)
- Build and maintain Delta Lake tables with schema evolution and versioning for master data domains (sketch 3)
- Use AWS services such as S3, Glue, Lambda, and Step Functions for orchestrating MDM workflows (sketch 4)
- Automate data quality checks using IDQ or custom PySpark validators with rule-based profiling (sketch 5)
- Integrate external enrichment sources (e.g., D&B, LexisNexis) via REST APIs and batch pipelines (sketch 6)
- Design and deploy CI/CD pipelines using GitHub Actions or Jenkins for Databricks notebooks and jobs
- Monitor pipeline health using the Databricks Jobs API, CloudWatch, and custom logging frameworks (sketch 7)
- Implement fine-grained access control using Unity Catalog and attribute-based policies for MDM datasets
- Use MLflow for tracking model-based entity resolution experiments if ML-based matching is applied (sketch 8)
- Collaborate with data stewards to expose curated MDM views via REST endpoints or Delta Sharing
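The sketches below are illustrative, hedged examples for several of the responsibilities above, not prescribed implementations. Sketch 1 shows a minimal PySpark ingest-standardize-publish pipeline; the S3 path, column names, natural key, and "customer" domain are all assumptions.

```python
# Sketch 1: PySpark ingest/standardize/publish (illustrative names throughout).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mdm_customer_ingest").getOrCreate()

# Hypothetical raw landing zone for the customer master domain.
raw = spark.read.json("s3://example-bucket/mdm/raw/customer/")

standardized = (
    raw
    .withColumn("customer_name", F.trim(F.upper(F.col("customer_name"))))
    .withColumn("country_code", F.upper(F.col("country_code")))
    .withColumn("ingested_at", F.current_timestamp())
    .dropDuplicates(["source_system", "source_record_id"])  # assumed natural key
)

# Publish to a Delta table for downstream match/merge.
standardized.write.format("delta").mode("append").saveAsTable("mdm.silver_customer")
```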
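Sketch 2 expresses a simple survivorship rule as Spark SQL with a CTE and a window function. It assumes an upstream match step has already stamped each record with a match_group_id, and that source_trust_score and updated_at columns exist; all table and column names are illustrative.

```python
# Sketch 2: survivorship via CTE + window function (reuses `spark` from sketch 1).
golden = spark.sql("""
    WITH ranked AS (
        SELECT s.*,
               ROW_NUMBER() OVER (
                   PARTITION BY match_group_id
                   ORDER BY source_trust_score DESC, updated_at DESC
               ) AS rn
        FROM mdm.silver_customer s
    )
    SELECT * FROM ranked WHERE rn = 1  -- surviving (golden) record per group
""").drop("rn")

golden.write.format("delta").mode("overwrite").saveAsTable("mdm.gold_customer")
```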
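Sketch 3 shows Delta Lake schema evolution via the mergeSchema option and version inspection with table history and time travel; the table name and the new column are assumptions.

```python
# Sketch 3: Delta schema evolution and time travel (illustrative table name).
new_batch = spark.createDataFrame(
    [("CUST-1", "ACME CORP", "US", "GOLD")],  # note the new loyalty_tier column
    ["customer_id", "customer_name", "country_code", "loyalty_tier"],
)

(new_batch.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # accept additive schema changes
    .saveAsTable("mdm.silver_customer"))

# Inspect table versions, then read the table as of an earlier version.
spark.sql("DESCRIBE HISTORY mdm.silver_customer").show(truncate=False)
v0 = spark.read.option("versionAsOf", 0).table("mdm.silver_customer")
```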
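Sketch 4 triggers an MDM workflow as an AWS Step Functions execution with boto3. The state machine ARN, region, and input payload are placeholders; the state machine itself (its Glue/Lambda steps) would be defined separately.

```python
# Sketch 4: start an MDM workflow as a Step Functions execution (placeholder ARN).
import json
import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:mdm-customer-load",
    input=json.dumps({"domain": "customer", "run_date": "2024-01-01"}),
)
print(response["executionArn"])
```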
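Sketch 5 is a small rule-based data quality pass in custom PySpark, one hedged alternative to IDQ; the rules and column names are assumptions.

```python
# Sketch 5: rule-based DQ profiling in PySpark; each rule is a named predicate.
from pyspark.sql import functions as F

rules = {
    "customer_id_not_null": F.col("customer_id").isNotNull(),
    "country_code_is_iso2": F.col("country_code").rlike("^[A-Z]{2}$"),
    "name_not_blank": F.trim(F.col("customer_name")) != "",
}

df = spark.table("mdm.silver_customer")
# Count rows failing each rule (NULL predicate results count as failures).
failure_counts = df.select([
    F.sum(F.when(predicate, 0).otherwise(1)).alias(name)
    for name, predicate in rules.items()
])
failure_counts.show()
```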
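Sketch 6 calls an external enrichment endpoint over REST; the URL, auth scheme, and response shape are hypothetical stand-ins for a provider such as D&B or LexisNexis.

```python
# Sketch 6: REST enrichment call; endpoint and response shape are hypothetical.
import requests

def enrich_company(duns_number: str, api_token: str) -> dict:
    resp = requests.get(
        f"https://api.example.com/v1/companies/{duns_number}",  # stand-in URL
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```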
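Sketch 7 polls recent runs through the Databricks Jobs 2.1 REST API; the workspace host, token handling, and job ID are illustrative.

```python
# Sketch 7: poll recent job runs via the Databricks Jobs 2.1 API.
import requests

host = "https://example.cloud.databricks.com"  # placeholder workspace URL
token = "..."  # in practice, read from a secret scope; never hard-code

resp = requests.get(
    f"{host}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": 123, "limit": 5},
    timeout=30,
)
resp.raise_for_status()
for run in resp.json().get("runs", []):
    state = run.get("state", {})
    print(run["run_id"], state.get("life_cycle_state"), state.get("result_state"))
```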
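Sketch 8 records an ML-based entity-resolution experiment with MLflow tracking; the experiment path, parameters, and metrics are invented for illustration.

```python
# Sketch 8: track an entity-resolution experiment with MLflow (invented values).
import mlflow

mlflow.set_experiment("/Shared/mdm-entity-resolution")  # assumed experiment path
with mlflow.start_run(run_name="matcher-v1"):
    mlflow.log_param("match_threshold", 0.85)
    mlflow.log_param("blocking_key", "postal_code")
    mlflow.log_metric("pairwise_precision", 0.97)
    mlflow.log_metric("pairwise_recall", 0.91)
```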
Basic Qualifications and Experience:
8 to 13 years of experience in Business, Engineering, IT, or a related field
Functional Skills:
Must-Have Skills:
- Advanced proficiency in PySpark for distributed data processing and transformation
- Strong SQL skills for complex data modeling, cleansing, and aggregation logic
- Hands-on experience with Databricks, including Delta Lake, notebooks, and job orchestration
- Deep understanding of MDM concepts, including match/merge, survivorship, and golden record creation
- Experience with MDM platforms such as Informatica MDM or Reltio, including REST API integration
- Proficiency in AWS services such as S3, Glue, Lambda, Step Functions, and IAM
- Familiarity with data quality frameworks and tools such as Informatica IDQ or custom rule engines
- Experience building CI/CD pipelines for data workflows using GitHub Actions, Jenkins, or similar
- Knowledge of schema evolution, versioning, and metadata management in data lakes
- Ability to implement lineage and observability using Unity Catalog or third-party tools
- Comfort with Unix shell scripting or Python for orchestration and automation
- Hands-on experience with RESTful APIs for ingesting external data sources and enrichment feeds
Good-to-Have Skills:
- Experience with Tableau or Power BI for reporting MDM insights
- Exposure to Agile practices and tools (JIRA, Confluence)
- Prior experience in Pharma/Life Sciences
- Understanding of compliance and regulatory considerations in master data
Professional Certifications:
- Any MDM certification (e.g., Informatica, Reltio)
- Any data analysis certification (SQL, Python, PySpark, Databricks)
- Any cloud certification (AWS or Azure)
Soft Skills:
- Strong analytical abilities to assess and improve master data processes and solutions
- Excellent verbal and written communication skills, with the ability to convey complex data concepts clearly to technical and non-technical stakeholders
- Effective problem-solving skills to address data-related issues and implement scalable solutions
- Ability to work effectively with global, virtual teams
Skills Required:
SQL, Python, PySpark, Databricks, Informatica