Data Engineer - Multi-source ETL & GenAI Pipelines (3+ Years)
Roles and Responsibilities :
- Build and maintain scalable, fault-tolerant data pipelines to support GenAI and analytics workloads across OCR, documents, and case data.
- Manage ingestion and transformation of semi-structured legal documents (PDF, Word, Excel) into structured formats.
- Enable RAG workflows by processing data into chunked, vectorized formats with metadata (a minimal chunking-and-embedding sketch follows this list).
- Handle large-scale ingestion from multiple sources into cloud-native data lakes (S3, GCS), data warehouses (BigQuery, Snowflake), and PostgreSQL.
- Automate pipelines using orchestration tools such as Airflow or Prefect, including retry logic, alerting, and metadata tracking (see the DAG sketch after this list).
- Collaborate with ML Engineers to ensure data availability, traceability, and performance for inference and training pipelines.
- Implement data validation and testing frameworks using Great Expectations or dbt.
- Integrate OCR pipelines and post-processing outputs for embedding and document search.
- Design infrastructure for streaming vs batch data needs and optimize for cost, latency, and reliability.
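To illustrate the RAG ingestion work referenced above, the sketch below shows one way to split extracted document text into overlapping chunks, attach source metadata, and attach an embedding per chunk. The chunk size, overlap, Chunk structure, and embed() function are illustrative assumptions, not a prescribed stack; in practice embed() would call whichever embedding model or API the team uses.

```python
# Minimal sketch: chunk extracted document text, attach metadata, and
# embed each chunk so it can be loaded into a vector store.
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    doc_id: str            # source document identifier (metadata)
    text: str               # chunk text
    start: int              # character offset within the source document
    embedding: List[float]  # vector representation of the chunk


def embed(text: str) -> List[float]:
    # Placeholder so the sketch runs end to end; swap in a real embedding
    # model or API call here.
    return [float(ord(c)) for c in text[:8]]


def chunk_document(doc_id: str, text: str,
                   size: int = 1000, overlap: int = 200) -> List[Chunk]:
    """Split text into overlapping windows and embed each non-empty piece."""
    chunks: List[Chunk] = []
    step = size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece.strip():
            chunks.append(Chunk(doc_id=doc_id, text=piece,
                                start=start, embedding=embed(piece)))
    return chunks
```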
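For the orchestration bullet, a minimal Airflow DAG with retries and failure alerting might look like the following. This is a sketch assuming Airflow 2.4+; the DAG id, schedule, alert email address, and task callables are placeholders for illustration only.

```python
# Minimal sketch of an Airflow DAG with retry logic and failure alerting.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "owner": "data-engineering",
    "retries": 3,                          # automatic retries on task failure
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,              # alerting via Airflow's email backend
    "email": ["data-alerts@example.com"],  # placeholder address
}


def ingest_documents(**context):
    """Placeholder: pull raw documents from the landing bucket."""


def transform_to_parquet(**context):
    """Placeholder: normalize documents and write Parquet to the data lake."""


with DAG(
    dag_id="document_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = PythonOperator(task_id="ingest_documents",
                            python_callable=ingest_documents)
    transform = PythonOperator(task_id="transform_to_parquet",
                               python_callable=transform_to_parquet)
    ingest >> transform  # run ingestion before transformation
```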
Qualifications :
- Bachelor's or Master's degree in Computer Science, Data Engineering, or an equivalent field.
- 3+ years of experience building distributed data pipelines and managing multi-source ingestion.
- Proficiency with Python, SQL, and data tools such as Pandas and PySpark.
- Experience with data orchestration tools (Airflow, Prefect) and file formats such as Parquet, Avro, and JSON.
- Hands-on experience with cloud storage / data warehouse systems (S3, GCS, BigQuery, Redshift).
- Understanding of GenAI and vector database ingestion pipelines is a strong plus.
- Bonus : Experience with OCR tools (Tesseract, Google Document AI), PDF parsing libraries (PyMuPDF), and API-based document processors.