Job Summary :
We are seeking a highly skilled Senior Data Engineer to join our data engineering team, with 4 to 8 years of experience building robust data pipelines and working extensively with PySpark.
Key Responsibilities :
Data Pipeline Development :
- Design, build, and maintain scalable data pipelines using PySpark to process large datasets and support data-driven applications and analytics.
ETL Process Automation :
- Develop and automate ETL (Extract, Transform, Load) processes using PySpark, ensuring efficient data processing, transformation, and loading from diverse sources into data lakes, warehouses, or databases.
Distributed Computing with PySpark :
- Leverage Apache Spark and PySpark to process large-scale data in a distributed computing environment, optimizing for performance and scalability.
Cloud Data Solutions :
- Develop and deploy data pipelines and processing frameworks on cloud platforms (AWS, Azure, GCP) using native tools such as AWS Glue, Azure Databricks, or Google Dataproc.
Data Integration & Transformation :
- Integrate data from various internal and external sources, ensuring data consistency, quality, and reliability throughout the pipeline.
Performance Optimization :
- Optimize PySpark jobs and pipelines for faster data processing, handling large volumes of data efficiently with minimal latency.
Required Qualifications :
- 4-8 years of experience in data engineering, with a strong focus on PySpark and large-scale data processing.
- Proven experience as a Data Engineer or in a similar role, with a strong background in database development, ETL processes, and software development.
- Proficiency in SQL and scripting languages such as Python, with experience working with relational databases.
- Proficiency in Dataproc (PySpark), Pandas, or other data processing libraries.
- Experience with data modeling, schema design, and optimization techniques for scalability.
- Strong analytical and problem-solving skills, with the ability to troubleshoot complex data issues and optimize data processing pipelines at scale.
Technical Skills :
- Expertise in PySpark for distributed data processing, data transformation, and job optimization.
- Strong proficiency in Python and SQL for data manipulation and pipeline creation.
- Hands-on experience with Apache Spark and its ecosystem, including Spark SQL, Spark Streaming, and PySpark MLlib.
- Solid experience working with ETL tools and frameworks, such as Apache Airflow or similar orchestration tools.