Evaluation Frameworks – Develop reusable, automated evaluation pipelines using frameworks such as Ragas; integrate LLM-as-a-judge methods for scalable assessments (an illustrative sketch follows this list).
Golden Datasets – Build and maintain high-quality benchmark datasets in collaboration with subject matter experts.
AI Output Validation – Evaluate results across text, documents, audio, and video, using both automated metrics and human-in-the-loop judgment.
Metric Evaluation – Implement and track metrics such as precision, recall, F1 score, relevance scoring, and hallucination penalties (see the metrics sketch after this list).
RAG & Embeddings – Design and evaluate retrieval-augmented generation (RAG) pipelines, vector embedding similarity, and semantic search quality.
Error & Bias Analysis – Investigate recurring errors, biases, and inconsistencies in model outputs; propose solutions.
Framework & Tooling Development – Build tools that enable large-scale model evaluation across hundreds of AI agents.
Cross-Functional Collaboration – Partner with ML engineers, product managers, and QA peers to integrate evaluation frameworks into product pipelines.
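Purely to illustrate the LLM-as-a-judge approach named above, here is a minimal Python sketch. The `call_llm` stub, the rubric wording, and the 1-5 scale are assumptions made for the example, not any specific framework's API.

```python
# Minimal LLM-as-a-judge sketch. `call_llm` is a hypothetical stand-in
# for whatever model client a team uses; the rubric and score scale
# are illustrative assumptions.
import re

JUDGE_PROMPT = """Rate the answer for faithfulness to the context on a 1-5 scale.
Context: {context}
Answer: {answer}
Reply with a single integer."""

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real model call in practice.
    return "4"

def judge_faithfulness(context: str, answer: str) -> int:
    """Ask the judge model for a 1-5 faithfulness score and parse it."""
    reply = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 0  # 0 = unparseable reply

score = judge_faithfulness("Paris is the capital of France.",
                           "The capital of France is Paris.")
print(score)
```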
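Likewise, a minimal sketch of the kind of metric computation referenced in the Metric Evaluation item, using scikit-learn. The sample labels and the hallucination-penalty weighting are made-up illustrative values.

```python
# Minimal sketch: precision / recall / F1 over binary relevance
# judgments, plus a simple hallucination penalty folded into an
# aggregate score. Labels and penalty weight are toy values.
from sklearn.metrics import precision_recall_fscore_support

# 1 = relevant / correct output, 0 = irrelevant / incorrect output
y_true = [1, 0, 1, 1, 0, 1]   # golden-set (human) judgments
y_pred = [1, 0, 1, 0, 0, 1]   # automated judgments

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

hallucination_rate = 0.10     # assumed example value
overall = f1 * (1.0 - 0.5 * hallucination_rate)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} overall={overall:.2f}")
```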
Required Qualifications
2–4 years of experience as a Software Development Engineer working on AI/ML systems.
Strong coding skills in Python (evaluation pipelines, data processing, metrics computation).
Hands-on experience with evaluation frameworks (Ragas or equivalent).
Knowledge of vector embeddings, similarity search, and RAG evaluation (see the similarity sketch after this list).
Familiarity with evaluation metrics (precision, recall, F1, relevance, hallucination detection).
Understanding of LLM-as-a-judge evaluation approaches.
Strong analytical and problem-solving skills; ability to combine human judgment with automated evaluations.
Bachelor’s or Master’s degree in Computer Science, Data Science, or related field.
Strong English written and verbal communication skills.
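To illustrate the embedding-similarity knowledge listed above, a short sketch of cosine similarity between vector embeddings, the core operation behind semantic-search and RAG relevance scoring. The embedding values are toy numbers for the example only.

```python
# Minimal sketch: cosine similarity between two embedding vectors.
# The embeddings below are illustrative toy values.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = np.array([0.12, -0.45, 0.33, 0.80])
doc_embedding = np.array([0.10, -0.40, 0.30, 0.75])

score = cosine_similarity(query_embedding, doc_embedding)
print(f"similarity={score:.3f}")  # closer to 1.0 => more semantically similar
```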
Good to Have
Experience in data quality, annotation workflows, dataset curation, or golden set preparation.