Key Responsibilities:
- Design, develop, and optimize big data pipelines and ETL workflows using PySpark and the Hadoop ecosystem (HDFS, MapReduce, Hive, HBase); see the illustrative sketch after this list.
- Develop and maintain data ingestion, transformation, and integration processes on Google Cloud Platform services such as BigQuery, Dataflow, Dataproc, and Cloud Storage.
- Ensure data quality, security, and governance across all pipelines.
- Monitor and troubleshoot performance issues in data pipelines and storage systems.
- Collaborate with data scientists and analysts to understand data needs and deliver clean, processed datasets.
- Implement batch and real-time data processing solutions.
- Write efficient, reusable, and maintainable code in Python and PySpark.
- Automate deployment and orchestration using tools like Airflow, Cloud Composer, or similar; a sample DAG sketch also follows this list.
- Stay current with emerging big data technologies and recommend improvements.
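To make the PySpark-on-GCP responsibilities concrete, below is a minimal, illustrative ETL sketch that reads raw files from Cloud Storage, applies basic cleansing, and writes the result to BigQuery. The bucket, dataset, table, column names, and the availability of the spark-bigquery connector are assumptions for illustration, not details from this posting.

```python
# Minimal PySpark ETL sketch; all names (bucket, dataset, columns) are
# hypothetical placeholders used only for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("orders-etl")  # hypothetical job name
    .getOrCreate()
)

# Extract: read raw CSV files landed in Cloud Storage (path is illustrative).
raw = spark.read.option("header", True).csv("gs://example-bucket/raw/orders/")

# Transform: deduplicate, type the timestamp column, and drop invalid amounts.
cleaned = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_ts", F.to_timestamp("order_ts"))
       .filter(F.col("amount").cast("double") > 0)
)

# Load: write to BigQuery via the spark-bigquery connector (assumed to be
# installed on the Dataproc cluster); table and temp bucket are placeholders.
(
    cleaned.write.format("bigquery")
    .option("table", "example_dataset.orders_clean")
    .option("temporaryGcsBucket", "example-temp-bucket")
    .mode("overwrite")
    .save()
)
```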
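For the orchestration responsibility, here is a hedged sketch of a small Airflow / Cloud Composer DAG that submits the PySpark job above to Dataproc on a daily schedule. The project ID, cluster name, region, and script URI are placeholder assumptions.

```python
# Illustrative Airflow DAG for Cloud Composer; project, cluster, region, and
# GCS paths are hypothetical and would come from the actual environment.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

PYSPARK_JOB = {
    "reference": {"project_id": "example-project"},
    "placement": {"cluster_name": "example-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://example-bucket/jobs/orders_etl.py"},
}

with DAG(
    dag_id="orders_etl_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit the batch ETL job to an existing Dataproc cluster once per day.
    submit_etl = DataprocSubmitJobOperator(
        task_id="submit_orders_etl",
        job=PYSPARK_JOB,
        region="us-central1",
        project_id="example-project",
    )
```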
Qualifications and Requirements:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 3+ years of experience in big data engineering or related roles.
- Strong hands-on experience with Google Cloud Platform (GCP) services for big data processing.
- Proficiency in Hadoop ecosystem tools: HDFS, MapReduce, Hive, HBase, etc.
- Expert-level knowledge of PySpark for data processing and analytics.
- Experience with data warehousing concepts and tools such as BigQuery.
- Good understanding of ETL processes, data modeling, and pipeline orchestration.
- Programming proficiency in Python and scripting.
- Familiarity with containerization (Docker) and CI/CD pipelines.
- Strong analytical and problem-solving skills.
Desirable Skills:
- Experience with streaming data platforms like Kafka or Pub/Sub.
- Knowledge of data governance and compliance standards (GDPR, HIPAA).
- Familiarity with ML workflows and integration with big data platforms.
- Experience with Terraform or other infrastructure-as-code tools.
- Certification as a GCP Data Engineer or equivalent.
Skills Required
GDPR, HIPAA, PySpark, Python, Hadoop