About the Project
We’re building a next-generation intelligent web crawling system that can explore deep and dynamic web data sources, including sites behind authentication, infinite-scroll pages, and JavaScript-heavy applications.
The crawler will be integrated with an AI-driven pipeline for automated data understanding, classification, and transformation.
We’re looking for a highly experienced engineer who has previously built large-scale, distributed crawling frameworks and integrated AI-, NLP-, or LLM-based components for contextual data extraction.
Key Responsibilities
- Design, develop, and deploy scalable deep web crawlers capable of bypassing common anti-bot mechanisms.
- Implement AI-integrated pipelines for data processing, entity extraction, and semantic categorization.
- Develop dynamic scraping systems for sites that rely on JavaScript, infinite scrolling, or APIs.
- Integrate with vector databases, LLM-based data labeling, or automated content enrichment modules.
- Optimize crawling logic for speed, reliability, and stealth across distributed environments.
- Collaborate on data pipeline orchestration using tools like Airflow, Prefect, or custom async architectures.
Required Expertise
- Proven experience building deep or dark web crawlers (Playwright, Scrapy, Puppeteer, or custom async frameworks).
- Strong understanding of browser automation, session management, and anti-detection mechanisms.
- Experience integrating AI/ML/NLP pipelines, e.g., text classification, entity recognition, or embedding-based similarity.
- Skilled in asynchronous Python (asyncio, aiohttp, Playwright async API).
- Familiar with database and pipeline systems: PostgreSQL, MongoDB, Elasticsearch, or similar.
- Ability to design robust data flows that connect crawling → AI inference → storage/visualization.
Nice to Have
- Knowledge of LLMs (OpenAI, Hugging Face, LangChain, or custom fine-tuned models).
- Experience with data cleaning, deduplication, and normalization pipelines.
- Familiarity with distributed crawling frameworks (Ray, Celery, Kafka).
- Prior experience integrating real-time analytics dashboards or monitoring tools.
What We Offer
- Competitive freelance pay based on expertise and delivery.
- Flexible, async-first remote collaboration.
- Opportunity to shape an AI-first data platform from the ground up.
- Potential for long-term partnership if the collaboration is successful.