PySpark

Confidential · Hyderabad / Secunderabad, Telangana
Job description

Key Responsibilities:

PySpark Development:

  • Design, implement, and optimize PySpark solutions for large-scale data processing and analysis.
  • Develop data pipelines using Spark to handle data transformations, aggregations, and other complex operations efficiently.
  • Write and optimize Spark SQL queries for big data analytics and reporting.
  • Handle data extraction, transformation, and loading (ETL) processes from various sources into a unified data warehouse or data lake; a minimal sketch follows this list.
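
By way of illustration, a minimal sketch of such an ETL flow is shown below; the S3 paths, the sales view, and the region, sale_date, and amount columns are hypothetical placeholders, not details from this posting:

```python
# Hypothetical sketch: extract, transform, and load with PySpark and Spark SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales_etl").getOrCreate()

# Extract: read raw CSV data from a source location (placeholder path).
raw = spark.read.option("header", True).csv("s3://bucket/raw/sales/")

# Transform: cast types and aggregate with the DataFrame API.
daily = (
    raw.withColumn("amount", F.col("amount").cast("double"))
    .groupBy("region", "sale_date")
    .agg(F.sum("amount").alias("total_amount"))
)

# The same aggregation expressed as a Spark SQL query.
raw.createOrReplaceTempView("sales")
daily_sql = spark.sql("""
    SELECT region, sale_date, SUM(CAST(amount AS DOUBLE)) AS total_amount
    FROM sales
    GROUP BY region, sale_date
""")

# Load: write the result to a data lake location as Parquet.
daily.write.mode("overwrite").parquet("s3://bucket/curated/daily_sales/")
```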

Data Pipeline Design & Optimization:

  • Build and maintain ETL pipelines using PySpark, ensuring high scalability and performance.
  • Implement batch and streaming processing to handle both real-time and historical data.
  • Optimize the performance of PySpark applications by applying best practices and techniques such as partitioning, caching, and broadcast joins (see the sketch after this list).
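
As a rough illustration of those techniques, the sketch below combines repartitioning, caching, and a broadcast join; the paths, table roles, and the region_id key are assumptions, not details from this posting:

```python
# Hypothetical sketch: partitioning, caching, and a broadcast join in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning_demo").getOrCreate()

events = spark.read.parquet("s3://bucket/events/")        # large fact table (placeholder)
regions = spark.read.parquet("s3://bucket/dim/regions/")  # small dimension table (placeholder)

# Partitioning: repartition by the join key to balance the shuffle.
events = events.repartition(200, "region_id")

# Caching: persist a DataFrame that several downstream steps reuse.
events.cache()

# Broadcast join: ship the small table to every executor and skip the shuffle.
joined = events.join(broadcast(regions), on="region_id", how="left")

joined.write.mode("overwrite").parquet("s3://bucket/enriched/events/")
```

Broadcasting is only appropriate when the smaller table comfortably fits in executor memory; otherwise a regular shuffle join is the safer default.
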
Data Storage & Management:

  • Work with large datasets and integrate them into storage solutions such as HDFS, S3, Azure Blob Storage, or Google Cloud Storage.
  • Ensure efficient data storage, access, and retrieval through Spark and columnar file formats (e.g., Parquet, ORC); a short sketch follows this list.
  • Maintain data quality, consistency, and integrity throughout the pipeline lifecycle.
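
For illustration, here is a short sketch of columnar storage with partitioned Parquet and ORC output; the dataset, columns, and the event_date partition column are invented for the example:

```python
# Hypothetical sketch: writing columnar, partitioned data for efficient retrieval.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage_demo").getOrCreate()

df = spark.read.json("s3://bucket/landing/events/")  # placeholder source

# Parquet, partitioned by date: readers filtering on event_date only touch
# the matching partition directories (partition pruning).
df.write.mode("overwrite").partitionBy("event_date") \
    .parquet("s3://bucket/warehouse/events_parquet/")

# ORC is a drop-in columnar alternative.
df.write.mode("overwrite").orc("s3://bucket/warehouse/events_orc/")

# Downstream reads benefit from pruning and column projection.
recent = (
    spark.read.parquet("s3://bucket/warehouse/events_parquet/")
    .where("event_date >= '2024-01-01'")
    .select("event_id", "event_date")
)
```
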
Cloud Platforms & Big Data Frameworks:

  • Deploy Spark-based applications on cloud platforms such as AWS (Amazon EMR), Azure HDInsight, or Google Dataproc.
  • Work with cloud-native services such as AWS Lambda, S3, Google Cloud Storage, and Azure Data Lake to handle and process big data.
  • Leverage cloud data processing tools and frameworks to scale and optimize PySpark jobs (see the sketch after this list).
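
A sketch of what cloud-storage access looks like from PySpark; the bucket and account names are placeholders, and the respective connectors (s3://, gs://, abfss://) are assumed to be preconfigured, as they typically are on EMR, Dataproc, and HDInsight clusters:

```python
# Hypothetical sketch: reading from and writing to cloud object stores.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cloud_io").getOrCreate()

aws_df = spark.read.parquet("s3://my-bucket/data/")    # Amazon EMR / S3
gcp_df = spark.read.parquet("gs://my-bucket/data/")    # Google Dataproc / GCS
azure_df = spark.read.parquet(
    "abfss://container@account.dfs.core.windows.net/data/"  # Azure Data Lake Storage
)

# Results can be written back to any of the same stores.
aws_df.write.mode("append").parquet("s3://my-bucket/output/")
```
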
Collaboration & Integration:

  • Collaborate with cross-functional teams (data scientists, analysts, product managers) to understand business requirements and develop appropriate data solutions.
  • Integrate data from multiple sources and platforms (e.g., databases, external APIs, flat files) into a unified system, as sketched after this list.
  • Provide support for downstream applications and data consumers by ensuring timely and accurate delivery of data.
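
As an illustration of that kind of integration, the sketch below joins a JDBC database table with a flat file; the connection details, credentials, and customer_id key are invented:

```python
# Hypothetical sketch: unifying a relational source and a flat file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("integration_demo").getOrCreate()

# Source 1: a relational database over JDBC
# (the matching JDBC driver jar must be on the Spark classpath).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "...")  # placeholder; use a secrets manager in practice
    .load()
)

# Source 2: a flat file delivered to object storage.
customers = spark.read.option("header", True).csv("s3://bucket/inbound/customers.csv")

# Unify both sources for downstream consumers.
unified = orders.join(customers, on="customer_id", how="inner")
unified.write.mode("overwrite").parquet("s3://bucket/unified/orders_enriched/")
```
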
Performance Tuning & Troubleshooting:

  • Identify bottlenecks and optimize Spark jobs to improve performance.
  • Conduct performance tuning of both the cluster and individual Spark jobs, leveraging Spark's built-in monitoring tools (see the sketch after this list).
  • Troubleshoot and resolve issues related to data processing, application failures, and cluster resource utilization.
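
A sketch of the built-in diagnostics this typically involves; the query and paths are placeholders, while explain() and the event-log-backed Spark UI are standard Spark tooling:

```python
# Hypothetical sketch: inspecting a job's plan and enabling UI replay.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("tuning_diagnostics")
    # Persist event logs so the Spark UI / history server can replay the job.
    .config("spark.eventLog.enabled", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3://bucket/events/")  # placeholder input
agg = df.groupBy("region_id").agg(F.count("*").alias("n"))

# Inspect the physical plan for full scans, wide shuffles, or missed
# broadcast opportunities before launching the job (Spark 3.0+ syntax).
agg.explain(mode="formatted")
```
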
Documentation & Reporting:

  • Maintain clear and comprehensive documentation of data pipelines, architectures, and processes.
  • Create technical documentation to guide future enhancements and troubleshooting.
  • Provide regular updates on the status of ongoing projects and data processing tasks.
Continuous Improvement:

  • Stay up to date with the latest trends, technologies, and best practices in big data processing and PySpark.
  • Contribute to improving development processes, testing strategies, and code quality.
  • Share knowledge and provide mentoring to junior team members on PySpark best practices.
Required Qualifications:

  • 2-4 years of professional experience working with PySpark and big data technologies.
  • Strong expertise in Python programming with a focus on data processing and manipulation.
  • Hands-on experience with Apache Spark, particularly with PySpark for distributed computing.
  • Proficiency in Spark SQL for data querying and transformation.
  • Familiarity with cloud platforms like AWS, Azure, or Google Cloud, and experience with cloud-native big data tools.
  • Knowledge of ETL processes and tools.
  • Experience with data storage technologies like HDFS, S3, or Google Cloud Storage.
  • Knowledge of data formats such as Parquet, ORC, Avro, or JSON.
  • Experience with distributed computing and cluster management.
  • Familiarity with Linux/Unix and command-line operations.
  • Strong problem-solving skills and ability to troubleshoot data processing issues.
Skills Required:

    Spark SQL, PySpark, Python, AWS, Azure
