We are seeking a highly skilled and motivated Data Engineer to join our team. The ideal candidate will be responsible for designing, developing, and optimizing large-scale data pipelines and data warehouse solutions, utilizing a modern, cloud-native data stack. You'll play a crucial role in transforming raw data into actionable insights, ensuring data quality, and maintaining the infrastructure required for seamless data flow.
Key Responsibilities
- Develop, construct, test, and maintain robust and scalable large-scale ETL pipelines using PySpark for processing and Apache Airflow for workflow orchestration.
- Design and implement both Batch ETL and Streaming ETL processes to handle various data ingestion requirements.
- Build and optimize data structures and schemas in cloud data warehouses such as AWS Redshift.
- Work extensively with AWS data services, including AWS EMR for big data processing, AWS Glue for serverless ETL, and Amazon S3 for data storage.
- Implement and manage real-time data ingestion pipelines using technologies like Kafka and Debezium for Change Data Capture (CDC).
- Interact with and integrate data from various relational and NoSQL databases such as MySQL, PgSQL (PostgreSQL), and MongoDB.
- Monitor, troubleshoot, and optimize data pipeline performance and reliability.
- Collaborate with data scientists, analysts, and other engineering teams to understand data needs and deliver high-quality, reliable data solutions.
- Ensure data governance, security, and quality across all data platforms.
Required Skills & Qualifications
Technical Skills
- Expert proficiency in developing ETL/ELT solutions using PySpark.
- Strong experience with workflow management and scheduling tools, specifically Apache Airflow.
- In-depth knowledge of AWS data services, including:
  - AWS EMR (Elastic MapReduce)
  - AWS Glue
  - AWS Redshift
  - Amazon S3
- Proven experience implementing and managing data streams using Kafka.
- Familiarity with Change Data Capture (CDC) tools like Debezium.
- Hands-on experience with diverse database technologies: MySQL, PgSQL, and MongoDB.
- Solid understanding of data warehousing concepts, dimensional modeling, and best practices for both batch and real-time data processing.
- Proficiency in a scripting language, preferably Python.
General Qualifications
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Excellent problem-solving, analytical, and communication skills.
- Ability to work independently and collaboratively in a fast-paced, dynamic environment.
Nice to Have (Preferred Skills)
- Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
- Knowledge of containerization technologies (Docker, Kubernetes).
- Familiarity with CI/CD pipelines.