The core responsibilities for the job include the following:
Data Pipeline Development:
- Design, develop, and maintain data pipelines to ingest, process, and transform data from various sources into usable formats.
- Implement data integration solutions that connect disparate data systems, including databases, APIs, and third-party data sources.
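For illustration only, the sketch below shows what a minimal ingestion step for this kind of pipeline might look like in Python (the language named as essential in the skills section below). The endpoint, table, and column names are hypothetical placeholders, and a real pipeline would typically land data in a warehouse or data lake rather than SQLite.

```python
"""Illustrative sketch only: pull records from a hypothetical REST endpoint
and land them in a local SQLite staging table. The URL, table, and column
names are placeholders, not details from this posting."""
import sqlite3

import requests

API_URL = "https://api.example.com/v1/orders"   # hypothetical source API
DB_PATH = "staging.db"                          # stand-in for a real warehouse


def ingest_orders() -> int:
    """Fetch raw records from the API and land them in a staging table."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    records = response.json()                   # expected: a list of dicts

    # Project only the fields the staging table expects.
    rows = [(r["order_id"], r["customer_id"], r["amount"]) for r in records]

    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS stg_orders ("
            "order_id TEXT PRIMARY KEY, customer_id TEXT, amount REAL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO stg_orders VALUES (?, ?, ?)", rows
        )
    return len(rows)


if __name__ == "__main__":
    print(f"Ingested {ingest_orders()} records")
```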
Data Storage and Warehousing:
- Create and manage data storage solutions, such as data lakes, data warehouses, and NoSQL databases.
- Optimize data storage for performance, scalability, and cost-efficiency.
Data Quality and Governance:
- Establish data quality standards and implement data validation and cleansing processes.
- Collaborate with data analysts and data scientists to ensure data consistency and accuracy.
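As a rough illustration of this kind of validation and cleansing work, the sketch below applies a few basic quality rules with pandas. pandas, the rules, and the column names are assumptions made for the example, not requirements stated in this posting.

```python
"""Illustrative sketch only: simple validation and cleansing rules applied
with pandas. Column names and thresholds are hypothetical."""
import pandas as pd


def validate_and_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply basic quality rules and return the cleansed frame."""
    # Rule 1: required fields must exist and be non-null.
    required = ["order_id", "customer_id", "amount"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    df = df.dropna(subset=required)

    # Rule 2: deduplicate on the business key.
    df = df.drop_duplicates(subset=["order_id"])

    # Rule 3: amounts must be numeric and non-negative.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df = df[df["amount"] >= 0]

    return df.reset_index(drop=True)
```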
ETL (Extract, Transform, Load):
- Develop ETL processes to transform raw data into a structured and usable format.
- Monitor and troubleshoot ETL jobs to ensure data flows smoothly.
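Apache Airflow, which appears in the skills list below, is one common way to orchestrate such ETL jobs. The following is a minimal example DAG, not this employer's actual pipeline; the task logic, names, and schedule are placeholders.

```python
"""Illustrative sketch only: a minimal daily ETL DAG using Apache Airflow's
TaskFlow API. Task logic, names, and the schedule are placeholders."""
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_daily_etl():
    @task
    def extract() -> list[dict]:
        # In practice this would read from a source database, API, or file drop.
        return [{"order_id": "A1", "amount": "19.99"}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Cast types and normalize fields into the target schema.
        return [{"order_id": r["order_id"], "amount": float(r["amount"])} for r in rows]

    @task
    def load(rows: list[dict]) -> None:
        # In practice this would write to a warehouse table.
        print(f"Loading {len(rows)} rows")

    load(transform(extract()))


example_daily_etl()
```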
Data Security and Compliance:
- Implement data security measures and access controls to protect sensitive data.
- Ensure compliance with data privacy regulations and industry standards (e.g., GDPR, HIPAA).
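As one small example of the masking techniques referenced here and in the skills list, the sketch below pseudonymizes a sensitive value (such as an email address) with a salted hash. The salt handling and function names are hypothetical.

```python
"""Illustrative sketch only: deterministic masking of a sensitive value via
a salted SHA-256 hash. Salt handling and names are placeholders."""
import hashlib
import os


def mask_value(value: str, salt: bytes) -> str:
    """Return a deterministic pseudonym for a sensitive value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    salt = os.environ.get("MASKING_SALT", "dev-only-salt").encode("utf-8")
    print(mask_value("jane.doe@example.com", salt))
```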
Performance Tuning:
- Optimize data pipelines and queries for improved performance and efficiency.
- Identify and resolve bottlenecks in data processing.
Data Documentation:
- Maintain comprehensive documentation for data pipelines, schemas, and data dictionaries.
- Create and update data lineage and metadata documentation.
Scalability and Reliability:
- Design data infrastructure to scale with growing data volumes and business requirements.
- Implement data recovery and backup strategies to ensure data availability and resilience.
Collaboration:
- Collaborate with cross-functional teams, including data scientists, analysts, and business stakeholders, to understand data requirements and deliver data solutions.
Required Technical Skills:
- Programming Languages: Proficiency in Python is essential. Experience with SQL for database querying and manipulation is also required. Knowledge of Java or Scala is a plus.
- Big Data Technologies: Hands-on experience with big data frameworks such as Apache Spark, Hadoop, or Flink.
- Cloud Platforms: Strong experience with at least one major cloud provider's data services (AWS, GCP, or Azure). This includes services such as AWS S3, Redshift, and Glue; GCP BigQuery, Dataflow, and Cloud Storage; or Azure Synapse, Data Factory, and Blob Storage.
- Databases and Warehousing: Deep expertise in relational databases (PostgreSQL, MySQL) and NoSQL databases (MongoDB, Cassandra). Experience with modern data warehouses (Snowflake, Redshift, BigQuery) is highly preferred.
- ETL/ELT Tools: Experience with ETL/ELT tools such as Apache Airflow for workflow orchestration, or similar tools.
- Data Governance and Security: Understanding of data governance principles and experience implementing security measures such as encryption, access controls, and data masking.
- Version Control: Proficiency with Git for version control.
- Containerization: Experience with Docker and Kubernetes is a plus.

(ref: hirist.tech)