Overview:
We are looking for a highly skilled Python Data Engineer to join our team in an on-premise data engineering environment. The ideal candidate will have experience with ETL tools, data processing technologies, data orchestration, and relational databases. You should also be proficient in Python scripting for data engineering tasks and have experience with Spark, PySpark, and related data technologies. Cloud tools are nice to have, but this position focuses primarily on on-premise data infrastructure.
This is an excellent opportunity to work on projects that involve developing scalable data pipelines, building real-time data streaming solutions, and optimizing data processing tasks in Python.
Key Responsibilities:
ETL Development & Optimization: Design, develop, and optimize ETL pipelines using open-source or cloud ETL tools (e.g., Apache NiFi, Talend, Pentaho, Airflow, AWS Glue).
Python Scripting for Data Engineering: Write Python scripts to automate data extraction, transformation, and loading (ETL) processes, and ensure the code is optimized for performance and scalability.
Big Data Processing: Work with Apache Spark and PySpark to process large datasets in a distributed computing environment, and optimize Spark jobs for performance and resource efficiency (a minimal PySpark ETL sketch follows this list).
Job Orchestration: Use Apache Airflow or other orchestration tools to schedule, monitor, and automate data pipeline workflows (see the Airflow DAG sketch after this list).
Data Streaming: Design and implement real-time data streaming solutions using technologies like Apache Kafka or AWS Kinesis for high-throughput, low-latency data processing (see the Kafka sketch after this list).
File Formats & Table Formats: Work with open-source file and table formats such as Apache Parquet, Apache Avro, and Delta Lake, along with other structured and unstructured data formats, for efficient data storage and access.
Database Management: Work with relational databases (e.g., PostgreSQL, MySQL, SQL Server) for data storage, management, and optimization. Understand database concepts such as normalization, indexing, and query optimization.
SQL Expertise: Write and optimize complex SQL queries for data extraction, transformation, and aggregation across large datasets, ensuring queries are efficient and scalable (an illustrative SQL sketch follows this list).
BI & Data Warehouse Knowledge: Exposure to BI tools and data warehousing concepts is a plus; it helps ensure data is structured to support analytics and reporting.
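To give a concrete sense of the day-to-day work described above, here is a minimal PySpark batch ETL sketch. The paths, column names, and partitioning are illustrative assumptions, not a prescribed implementation:

```python
# Minimal PySpark ETL sketch: read raw CSV, clean it, and write Parquet.
# Paths, column names, and partitioning below are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("orders_etl")  # hypothetical job name
    .getOrCreate()
)

# Extract: read raw data (path is assumed for illustration)
raw = spark.read.option("header", True).csv("/data/raw/orders/*.csv")

# Transform: type casting, filtering, and a simple aggregation
orders = (
    raw
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("amount") > 0)
)

daily_totals = (
    orders.groupBy("order_date", "region")
    .agg(F.sum("amount").alias("total_amount"),
         F.count("*").alias("order_count"))
)

# Load: write partitioned Parquet for efficient downstream reads
daily_totals.write.mode("overwrite").partitionBy("order_date").parquet(
    "/data/curated/daily_order_totals"
)

spark.stop()
```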
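For the orchestration responsibility, a minimal Airflow DAG (Airflow 2.x style) might look like the sketch below; the DAG name, schedule, and task bodies are placeholders, and parameter names vary slightly between Airflow versions:

```python
# Minimal Airflow DAG sketch (Airflow 2.x style); task names and schedule
# are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder for the real extraction logic (e.g., pull from a source DB)
    print("extracting...")


def transform_and_load():
    # Placeholder for the real transform/load step (e.g., submit a PySpark job)
    print("transforming and loading...")


with DAG(
    dag_id="daily_orders_pipeline",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="transform_and_load",
                               python_callable=transform_and_load)

    extract_task >> load_task        # run extract before transform/load
```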
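For the streaming responsibility, a rough sketch using the kafka-python client is shown below. The broker address, topic name, consumer group, and message shape are all assumptions made for illustration:

```python
# Minimal real-time streaming sketch using the kafka-python client.
# Broker address, topic name, and message shape are illustrative assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"  # assumed local broker
TOPIC = "order_events"     # hypothetical topic

# Producer side: publish JSON-encoded events
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send(TOPIC, {"order_id": 123, "amount": 49.5})
producer.flush()

# Consumer side: read events and apply a lightweight transformation
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    group_id="order-enricher",       # hypothetical consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Example processing step: flag large orders for downstream alerting
    event["is_large_order"] = event["amount"] > 1000
    print(event)
```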
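Finally, for the SQL responsibility, the self-contained sketch below uses SQLite so it runs anywhere; the same patterns (an index on the grouping columns plus a grouped aggregation) carry over to PostgreSQL, MySQL, or SQL Server. The table, columns, and sample rows are hypothetical:

```python
# Self-contained SQL sketch: indexing plus grouped aggregation.
# Table name, columns, and data are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        region     TEXT NOT NULL,
        order_date TEXT NOT NULL,
        amount     REAL NOT NULL
    );
    -- An index on the grouping/filtering columns keeps the aggregation cheap
    CREATE INDEX idx_orders_region_date ON orders (region, order_date);

    INSERT INTO orders (region, order_date, amount) VALUES
        ('north', '2024-01-01', 120.0),
        ('north', '2024-01-01', 80.0),
        ('south', '2024-01-02', 200.0);
    """
)

# Grouped aggregation: total and average order value per region and day
query = """
    SELECT region,
           order_date,
           SUM(amount) AS total_amount,
           AVG(amount) AS avg_amount,
           COUNT(*)    AS order_count
    FROM orders
    WHERE amount > 0
    GROUP BY region, order_date
    ORDER BY region, order_date
"""
for row in conn.execute(query):
    print(row)

conn.close()
```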
Required Skills & Experience:
ETL Tools: Experience with open-source ETL tools such as Apache NiFi, Talend, or Pentaho. Cloud-based tools such as AWS Glue or Azure Data Factory are nice to have.
Python Scripting: Proficiency in Python for automating data processing tasks, writing data pipelines, and working with libraries such as Pandas, Dask, and PySpark.
Big Data Technologies: Experience with Apache Spark and PySpark for distributed data processing, along with optimization techniques.
Data Orchestration: Experience using Apache Airflow or similar tools for scheduling and automating data pipelines.
Data Streaming: Experience with Apache Kafka or AWS Kinesis for building and managing real-time data pipelines.
Open-Source File Formats: Knowledge of Apache Parquet, Apache Avro, Delta Lake, or similar open-source file and table formats for efficient data storage and retrieval.
Relational Databases: Strong experience with at least one relational database (e.g., PostgreSQL, MySQL, SQL Server) and a solid understanding of database concepts such as indexing, normalization, and query optimization.
SQL Expertise: Strong skills in writing and optimizing complex SQL queries for data extraction, transformation, and aggregation.
Nice to Have:
BI/Analytics Tools: Familiarity with BI tools such as Power BI, Tableau, Looker, or similar reporting and data visualization platforms.
Data Warehousing: Knowledge of data warehousing principles, schema design (e.g., star/snowflake), and optimization techniques for large datasets.
Cloud Technologies: Experience with cloud data platforms such as Databricks, Snowflake, or Azure Synapse is beneficial, though the role focuses on on-premise environments.
Containerization: Familiarity with container tools such as Docker and orchestration platforms such as Kubernetes for deploying data engineering workloads.
Educational Qualifications:
Bachelor’s or Master’s degree in Computer Science, Engineering, Information Systems, or a related field (or equivalent work experience).
Additional Qualities:
Excellent problem-solving and troubleshooting skills.
Ability to work both independently and in a collaborative environment.
Strong communication skills, both written and verbal.
Detail-oriented with a focus on data quality and performance optimization.
Proactive attitude and the ability to take ownership of projects.
Data Engineer • Pune, Maharashtra, India