Roles and Responsibilities
- Design, build, and optimize scalable ETL pipelines using Apache Airflow or similar frameworks to process and transform large datasets efficiently.
- Utilize Spark (PySpark), Kafka, Flink, or similar tools to enable distributed data processing and real-time streaming solutions.
- Deploy, manage, and optimize data infrastructure on cloud platforms such as AWS, GCP, or Azure, ensuring security, scalability, and cost-effectiveness.
- Design and implement robust data models, ensuring data consistency, integrity, and performance across warehouses and lakes.
- Enhance query performance through indexing, partitioning, and tuning techniques for large-scale datasets.
- Manage cloud-based storage solutions (Amazon S3, Google Cloud Storage, Azure Blob Storage) and ensure data governance, security, and compliance.
- Work closely with data scientists, analysts, and software engineers to support data-driven decision-making, while maintaining thorough documentation of data processes.
Ideal Candidate
- Strong proficiency in Python and SQL, with additional experience in languages such as Java or Scala.
- Hands-on experience with frameworks like Spark (PySpark), Kafka, Apache Flink, Apache Hudi, Apache Iceberg, or similar tools for distributed data processing and real-time streaming.
- Familiarity with cloud platforms like AWS, Google Cloud Platform (GCP), or Microsoft Azure for building and managing data infrastructure.
- Strong understanding of data warehousing concepts and data modeling principles.
- Experience with ETL orchestration tools such as Apache Airflow or comparable data transformation frameworks.
- Proficiency in working with data lakes and cloud-based storage solutions like Amazon S3, Google Cloud Storage, or Azure Blob Storage.
- Expertise in Git for version control and collaborative coding.
- Expertise in performance tuning for large-scale data processing, including partitioning, indexing, and query optimization.
Skills Required
Python, SQL, PySpark, Kafka, AWS, Google Cloud Platform, Apache Airflow