Key Responsibilities:
PySpark Development:
- Design, implement, and optimize PySpark solutions for large-scale data processing and analysis.
- Develop data pipelines using Spark to handle data transformations, aggregations, and other complex operations efficiently.
- Write and optimize Spark SQL queries for big data analytics and reporting.
- Handle data extraction, transformation, and loading (ETL) processes from various sources into a unified data warehouse or data lake (a minimal end-to-end sketch follows this list).
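A minimal end-to-end ETL sketch with PySpark and Spark SQL: extract raw CSV, transform with a Spark SQL aggregation, and load the result as Parquet. The paths, view name, and column names are hypothetical, chosen only for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    # Extract: read raw CSV with a header row and inferred schema.
    orders = spark.read.csv("/data/raw/orders.csv", header=True, inferSchema=True)
    orders.createOrReplaceTempView("orders")

    # Transform: aggregate with Spark SQL.
    daily_totals = spark.sql("""
        SELECT order_date,
               SUM(amount) AS total_amount,
               COUNT(*)    AS order_count
        FROM orders
        GROUP BY order_date
    """)

    # Load: write the result into the warehouse/lake zone as Parquet.
    daily_totals.write.mode("overwrite").parquet("/data/warehouse/daily_totals")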
Data Pipeline Design & Optimization:
- Build and maintain ETL pipelines using PySpark, ensuring high scalability and performance.
- Implement batch and streaming processing to handle both real-time and historical data.
- Optimize the performance of PySpark applications by applying best practices and techniques such as partitioning, caching, and broadcast joins (a brief sketch of these techniques follows this list).
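A brief sketch of the three techniques named above: repartitioning on a join key, caching a reused DataFrame, and a broadcast join. The DataFrame names, paths, join key, and partition count are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("perf-sketch").getOrCreate()

    events = spark.read.parquet("/data/lake/events")        # large fact table (hypothetical path)
    countries = spark.read.parquet("/data/lake/countries")  # small dimension table

    # Partitioning: repartition the large table on the join key to balance work
    # across executors (200 is a placeholder, tuned per cluster in practice).
    events = events.repartition(200, "country_code")

    # Caching: persist a DataFrame that several downstream actions will reuse.
    events.cache()

    # Broadcast join: ship the small table to every executor, avoiding a shuffle
    # of the large one.
    enriched = events.join(broadcast(countries), on="country_code", how="left")
    enriched.write.mode("overwrite").parquet("/data/lake/events_enriched")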
Data Storage & Management:
- Work with large datasets and integrate them into storage solutions such as HDFS, S3, Azure Blob Storage, or Google Cloud Storage.
- Ensure efficient data storage, access, and retrieval through Spark and columnar file formats such as Parquet and ORC (a short sketch follows this list).
- Maintain data quality, consistency, and integrity throughout the pipeline lifecycle.
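A short sketch of columnar storage with Spark, assuming a hypothetical sales dataset and partition column. Partitioning output by a low-cardinality column lets readers prune whole directories, and both Parquet and ORC carry column statistics that enable predicate pushdown.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-sketch").getOrCreate()
    sales = spark.read.parquet("/data/lake/sales")  # hypothetical source

    # Write as Parquet, partitioned by region so filters on region skip directories.
    sales.write.mode("overwrite").partitionBy("region").parquet("/data/lake/sales_parquet")

    # The same data as ORC; the format choice often follows the consuming tools.
    sales.write.mode("overwrite").partitionBy("region").orc("/data/lake/sales_orc")

    # A reader filtering on the partition column scans only matching partitions.
    west = spark.read.parquet("/data/lake/sales_parquet").where("region = 'WEST'")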
Cloud Platforms & Big Data Frameworks:
- Deploy Spark-based applications on cloud platforms such as AWS (Amazon EMR), Azure HDInsight, or Google Dataproc.
- Work with cloud-native services such as AWS Lambda, S3, Google Cloud Storage, and Azure Data Lake to handle and process big data (see the sketch after this list).
- Leverage cloud data processing tools and frameworks to scale and optimize PySpark jobs.
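A sketch of pointing a PySpark job at S3 through the s3a connector. The bucket and paths are hypothetical; on managed platforms such as Amazon EMR or Google Dataproc the connector and credentials are typically preconfigured, so the explicit credential settings below are an assumption for self-managed clusters with hadoop-aws on the classpath.

    import os
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cloud-sketch")
        # Assumption: credentials supplied via environment variables on a
        # self-managed cluster; managed services usually inject these.
        .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
        .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
        .getOrCreate()
    )

    # Object-store paths are used exactly like HDFS paths.
    logs = spark.read.json("s3a://example-bucket/raw/logs/")
    logs.write.mode("overwrite").parquet("s3a://example-bucket/curated/logs/")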
Collaboration & Integration:
- Collaborate with cross-functional teams (data scientists, analysts, product managers) to understand business requirements and develop appropriate data solutions.
- Integrate data from multiple sources and platforms (e.g., databases, external APIs, flat files) into a unified system.
- Support downstream applications and data consumers by ensuring timely and accurate delivery of data.
Performance Tuning & Troubleshooting:
- Identify bottlenecks and optimize Spark jobs to improve performance.
- Conduct performance tuning of both the cluster and individual Spark jobs, leveraging Spark's built-in monitoring tools (a starting-point sketch follows this list).
- Troubleshoot and resolve issues related to data processing, application failures, and cluster resource utilization.
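A starting-point sketch for tuning and troubleshooting a Spark 3.x job: enable adaptive query execution, size shuffle partitions, and inspect the physical plan before rerunning. The configuration values are illustrative rather than recommendations, and the path and columns are hypothetical.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tuning-sketch")
        # Adaptive query execution coalesces shuffle partitions and mitigates
        # skewed joins at runtime (enabled by default in recent Spark versions).
        .config("spark.sql.adaptive.enabled", "true")
        # Placeholder value; tune to data volume and executor count.
        .config("spark.sql.shuffle.partitions", "200")
        .getOrCreate()
    )

    df = spark.read.parquet("/data/lake/events")
    result = df.groupBy("user_id").count()

    # Check the physical plan for full scans, wide shuffles, or missed filter
    # pushdown; the Spark UI exposes the same stages with per-task metrics.
    result.explain(mode="formatted")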
Documentation & Reporting:
- Maintain clear and comprehensive documentation of data pipelines, architectures, and processes.
- Create technical documentation to guide future enhancements and troubleshooting.
- Provide regular updates on the status of ongoing projects and data processing tasks.
Continuous Improvement:
- Stay up to date with the latest trends, technologies, and best practices in big data processing and PySpark.
- Contribute to improving development processes, testing strategies, and code quality.
- Share knowledge and mentor junior team members on PySpark best practices.
Required Qualifications:
- 2-4 years of professional experience working with PySpark and big data technologies.
- Strong expertise in Python programming with a focus on data processing and manipulation.
- Hands-on experience with Apache Spark, particularly PySpark for distributed computing.
- Proficiency in Spark SQL for data querying and transformation.
- Familiarity with cloud platforms such as AWS, Azure, or Google Cloud, and experience with cloud-native big data tools.
- Knowledge of ETL processes and tools.
- Experience with data storage technologies such as HDFS, S3, or Google Cloud Storage.
- Knowledge of data formats such as Parquet, ORC, Avro, or JSON.
- Experience with distributed computing and cluster management.
- Familiarity with Linux/Unix and command-line operations.
- Strong problem-solving skills and the ability to troubleshoot data processing issues.
Skills Required:
Spark SQL, PySpark, Python, AWS, Azure