Role Overview :
We are looking for an experienced Python Data Engineer with strong expertise in PySpark, distributed data engineering, and LLM integration.
The role involves building scalable data pipelines and AI-powered workflows, and enabling data-driven automation using platforms such as Databricks, AWS EMR, and LangChain.
As an SE3 engineer, you will primarily be responsible for hands-on development and delivery, while also contributing to solution design and collaborating with cross-functional teams.
Key Responsibilities :
- Develop and optimize ETL pipelines using PySpark, SparkSQL, and distributed frameworks (a minimal ETL sketch follows this list).
- Work with LangChain to integrate LLM-based solutions into data workflows (Agents, Toolkits, Vector Stores); a RAG sketch follows this list.
- Implement data transformations, lineage, and governance controls in data platforms.
- Support ML teams in deploying embeddings, retrieval-augmented generation (RAG), and NLP pipelines.
- Build workflows on Databricks and AWS EMR, ensuring cost and performance efficiency (illustrative tuning settings are sketched after this list).
- Apply best practices in coding, testing, and CI/CD.
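For illustration, a minimal PySpark ETL sketch of the kind this role involves. The bucket paths, column names, and schema are hypothetical, not taken from any real pipeline :

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# Extract : read raw CSV landed in object storage (hypothetical path).
orders = spark.read.csv("s3://raw-bucket/orders/", header=True, inferSchema=True)

# Transform : drop malformed rows, derive a date column, aggregate daily revenue.
daily_revenue = (
    orders
    .where(F.col("order_id").isNotNull())
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("order_count"))
)

# Load : write partitioned Parquet for downstream consumers.
daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet("s3://curated-bucket/daily_revenue/")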
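Likewise, a hedged sketch of a LangChain retrieval-augmented generation (RAG) flow. It assumes the langchain-openai and langchain-community packages, an OPENAI_API_KEY in the environment, and toy documents; the model name is an assumption :

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

# Embed a few toy documents into an in-memory FAISS vector store.
docs = ["Spark executors run tasks on worker nodes.", "Parquet is a columnar file format."]
vector_store = FAISS.from_texts(docs, OpenAIEmbeddings())

# Retrieve the most relevant context, then ground the LLM answer in it.
question = "Which file format is columnar?"
hits = vector_store.similarity_search(question, k=1)
context = "\n".join(doc.page_content for doc in hits)

llm = ChatOpenAI(model="gpt-4o-mini")  # model choice is illustrative
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)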
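Cost and performance tuning usually starts with Spark configuration. The settings below are placeholders to tune per workload, not recommendations :

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned_etl")
    .config("spark.sql.adaptive.enabled", "true")         # AQE coalesces small shuffle partitions
    .config("spark.sql.shuffle.partitions", "200")        # right-size shuffles for the data volume
    .config("spark.sql.files.maxPartitionBytes", "256MB") # fewer, larger input splits
    .getOrCreate()
)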
Required Skills :
- Strong proficiency in Python for data engineering.
- Hands-on experience in PySpark, SparkSQL, and ETL design.
- Working knowledge of LangChain, OpenAI APIs, or Hugging Face.
- Experience with Databricks, AWS EMR, or similar cloud data platforms.
- Good understanding of SQL, data modeling, and distributed data processing.
Preferred Skills :
- Familiarity with Google ADK and prompt engineering.
- Experience with vector databases such as FAISS, Pinecone, or Chroma (a FAISS sketch follows this list).
- Exposure to MLflow, Unity Catalog, or SageMaker (a minimal MLflow sketch also follows).
- Interest in LLM-powered applications and generative AI workflows.
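For illustration, a tiny FAISS sketch; the dimension and vectors are invented for the example :

import numpy as np
import faiss

dim = 4
vectors = np.random.rand(100, dim).astype("float32")  # toy embedding matrix

index = faiss.IndexFlatL2(dim)  # exact L2 nearest-neighbour index
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 3)  # top-3 neighbours
print(ids[0], distances[0])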
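And a minimal MLflow tracking sketch; the run, parameter, and metric names are made up :

import mlflow

with mlflow.start_run(run_name="rag_eval"):
    mlflow.log_param("top_k", 3)            # hypothetical retrieval parameter
    mlflow.log_metric("recall_at_3", 0.91)  # hypothetical evaluation metric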
(ref : hirist.tech)