What You'll Build
Core Responsibilities
Data Architecture & Infrastructure (40%)
- Design and implement a multi-database architecture (MongoDB, Redis, Milvus, Neo4j, BigQuery)
- Build scalable data pipelines for real-time conversation processing and personalization
- Architect ETL/ELT workflows for data migration from legacy systems
- Implement data partitioning, sharding, and optimization strategies for high-throughput systems
- Create data governance frameworks ensuring quality, security, and compliance

Vector & Graph Database Systems (25%)
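The partitioning and sharding bullet in the Data Architecture list above is easiest to picture with a concrete routing rule. A minimal sketch of hash-based shard routing in pure Python (the shard count and key format are hypothetical, not the production scheme):

```python
import hashlib

NUM_SHARDS = 8  # hypothetical fixed shard count


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a record to a shard by hashing its key.

    MD5 is used only for its uniform bit distribution here,
    not for any security property.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Every record with the same customer key lands on the same shard,
# which keeps per-customer queries on a single node.
assert shard_for("customer:42") == shard_for("customer:42")
assert 0 <= shard_for("customer:42") < NUM_SHARDS
```

A modulo scheme like this reshuffles most keys when the shard count changes; consistent hashing is the usual refinement when shards are added and removed at runtime.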
- Design and optimize Milvus vector collections for semantic search (1024-dim embeddings)
- Build graph schemas in Neo4j for customer journey mapping and persona relationships
- Implement HNSW indexing strategies and similarity search optimization
- Create hybrid search systems combining vector, full-text, and graph queries
- Monitor and tune database performance (query latency, throughput, resource utilization)
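Underneath the HNSW indexing and similarity-search bullets above sits one primitive: ranking stored vectors by cosine similarity to a query. A brute-force sketch in pure Python (toy 3-dim vectors; an HNSW index exists precisely to avoid this O(n) scan over millions of 1024-dim embeddings):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def top_k(query: list[float], collection: dict, k: int = 2) -> list[str]:
    """Exact nearest neighbours by cosine similarity.

    Costs O(n * d) per query; an HNSW index trades a little recall
    for sub-linear search time at collection scale.
    """
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in collection.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dim "embeddings" (production vectors are 1024-dim).
docs = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 1.0, 0.0]}
print(top_k([1.0, 0.0, 0.0], docs))  # → ['a', 'b']
```

Hybrid search, also mentioned above, layers full-text and graph filters on top of this ranking rather than replacing it.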
ML Data Infrastructure (20%)
- Build data collection pipelines for LLM fine-tuning (conversation logs, tool executions)
- Create feature stores for GNN training (customer interactions, engagement signals)
- Implement data versioning and lineage tracking for ML experiments
- Design A/B testing data infrastructure with CUPED variance reduction
- Build real-time feature computation pipelines for contextual bandits

Analytics & Monitoring (15%)
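The A/B testing bullet in the ML Data Infrastructure list above names CUPED; the core adjustment is small enough to sketch in pure Python (toy data, not the production pipeline):

```python
from statistics import fmean, variance


def cuped_adjust(y: list[float], x: list[float]) -> list[float]:
    """CUPED adjustment: subtract the component of metric y that a
    pre-experiment covariate x already explains, shrinking variance
    while leaving the mean (and the treatment effect) unchanged."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    theta = cov_xy / var_x
    return [yi - theta * (xi - x_bar) for yi, xi in zip(y, x)]


# Toy data: the experiment metric y correlates with pre-period metric x.
x = [10.0, 12.0, 9.0, 15.0, 11.0, 14.0]
y = [11.0, 13.5, 9.5, 16.0, 12.0, 15.0]
adjusted = cuped_adjust(y, x)
assert variance(adjusted) < variance(y)  # variance strictly drops
```

The data-infrastructure work implied by the bullet is joining each experiment record to its pre-period covariate so this adjustment can run at analysis time.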
- Design BigQuery schemas for marketing analytics and performance tracking
- Create materialized views and aggregation pipelines for real-time dashboards
- Implement data quality monitoring and anomaly detection
- Build observability infrastructure (Prometheus metrics, Grafana dashboards)
- Develop cost optimization strategies for cloud data warehousing

Technical Stack You'll Work With
Databases & Storage
- MongoDB (conversation state, active sessions)
- Redis (caching, rate limiting, real-time data)
- Milvus (vector embeddings, semantic search)
- Neo4j (customer journey graphs, persona networks)
- BigQuery (analytics warehouse, historical data)

Data Processing & Orchestration
- Apache Airflow or Prefect (workflow orchestration)
- Pandas, Polars (data transformation)
- Apache Spark (optional, for large-scale processing)
- dbt (data transformation and modeling)

ML/AI Data Pipeline
- vLLM (LLM inference serving)
- MLflow (model registry, experiment tracking)
- Sentence Transformers (embedding generation)
- PyTorch, TensorFlow (ML model training)

Cloud & Infrastructure
- Google Cloud Platform (BigQuery, Cloud Storage, Compute)
- Docker & Kubernetes (containerization, orchestration)
- Terraform (infrastructure as code)
- GitHub Actions or GitLab CI (CI/CD pipelines)

Programming & Tools
- Python 3.10+ (primary language)
- SQL (complex queries, query optimization)
- Shell scripting (Bash/Zsh)
- Git (version control)

Requirements
Must-Have Skills
- 5+ years of data engineering experience with production systems
- Expert-level SQL and database design skills
- Strong Python programming (async/await, type hints, testing)
- Experience with at least 3 different database technologies (SQL, NoSQL, vector, graph)
- Proven track record building high-scale data pipelines (>1M records/day)
- Deep understanding of data modeling (dimensional, normalized, denormalized)
- Experience with cloud data warehouses (BigQuery, Redshift, or Snowflake)
- Strong knowledge of data quality, validation, and governance
- Excellent debugging and optimization skills

Highly Desirable
- Experience with vector databases (Milvus, Pinecone, Weaviate, Qdrant)
- Experience with graph databases (Neo4j, ArangoDB, Neptune)
- Knowledge of embedding models and semantic search
- Experience with ML data pipelines (feature stores, model training data)
- Understanding of A/B testing and experimental design
- Experience with real-time streaming (Kafka, Pub/Sub, Kinesis)
- Knowledge of LLMs and conversational AI systems
- Experience with data migration projects (especially large-scale)
- Background in marketing technology or customer data platforms

Nice-to-Have
- Experience with PyTorch Geometric or graph neural networks
- Knowledge of marketing analytics (attribution, segmentation, personalization)
- Familiarity with LangChain, LangGraph, or agent frameworks
- Experience with cost optimization in cloud environments
- Contributions to open-source data engineering projects
- Experience with data compliance (GDPR, CCPA)

Key Projects You'll Own
Phase 1: Foundation
- Migrate 10M+ conversation vectors from Pinecone to Milvus
- Design and implement MongoDB schemas for real-time agent state
- Set up Neo4j graph database with customer journey models
- Create BigQuery data warehouse with partitioned tables

Phase 2: Optimization
- Build automated data quality monitoring system
- Implement caching strategies (Redis) for 10x latency reduction
- Optimize vector search queries (target:
- Create real-time analytics dashboards (Grafana)

Phase 3: ML Infrastructure
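The Phase 2 caching bullet above follows the classic cache-aside pattern. A minimal sketch with an in-process dict standing in for Redis (class and function names are hypothetical; a real implementation would use redis-py and a server-side TTL via SETEX):

```python
import time


class TTLCache:
    """Dict-backed stand-in for Redis illustrating cache-aside:
    check the cache first, fall back to the source of truth,
    then populate the cache with a time-to-live."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:  # lazy expiry, like a Redis TTL
            del self._store[key]
            return None
        return value

    def set(self, key: str, value) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)


def get_profile(customer_id: str, cache: TTLCache, db: dict) -> dict:
    """Cache-aside read: one slow primary-store hit, then cached reads."""
    cached = cache.get(customer_id)
    if cached is not None:
        return cached
    profile = db[customer_id]  # stands in for the "slow" primary store
    cache.set(customer_id, profile)
    return profile
```

The latency win comes from the second and later reads never touching the primary store until the TTL lapses; the trade-off is serving data up to one TTL stale.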
- Build LLM fine-tuning data pipeline
- Implement feature store for GNN training
- Create A/B testing data infrastructure
- Design multi-armed bandit state management

Work Environment
- Collaborative team: Work with ML engineers, backend developers, and data scientists
- Modern stack: Latest technologies and tools
- Impact: Your work directly affects millions of marketing interactions
- Autonomy: Own your projects end-to-end
- Growth: Clear path to Senior/Lead/Principal roles