The mission
Build a desktop JavaFX application that lets non-technical users drop in invoices (PDF / image) and export clean CSV data. Invoices come in all shapes, and your job is to make the app read them like a human: robust parsing, confidence scoring, and easy validation.
What you’ll build
- JavaFX desktop app (Java 17+) with a clean, responsive UI
- Invoice ingestion: PDF, PNG / JPG, multi-page, batches, drag-and-drop
- AI / OCR pipeline (choose best fit; hybrid is fine):
  - Classical OCR (e.g., Tesseract) + layout analysis, or
  - Cloud OCR (e.g., AWS Textract, Google Vision), or
  - LLM-assisted parsing (prompting / JSON schema) with guardrails
- Field extraction (line items + headers): vendor, invoice #, dates, currency, taxes, subtotals / totals, PO, line descriptions, qty, unit price, amounts
- Validation & review UI: highlight zones, flag low-confidence fields, quick fixes, autocomplete
- CSV export: stable schema, locale / number / date normalization
- Rules & heuristics: vendor templates, regex fallbacks, learned patterns
- Quality metrics: confidence scores, per-field accuracy, reject reasons, simple analytics
- Operate offline where possible with optional cloud connectors
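To give a flavor of the "locale / number / date normalization" part of the CSV export, here is a minimal Java sketch. It is an illustration under stated assumptions, not our implementation: the class `InvoiceNormalizer`, its method names, the two-decimal rounding, and the column order in the example row are all hypothetical, and the real target schema may differ.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper: names and rounding policy are assumptions for illustration.
final class InvoiceNormalizer {

    // Parses a locale-formatted amount (e.g. "1.234,56" for Locale.GERMANY)
    // and returns a plain decimal string with a dot and two fraction digits.
    static String normalizeAmount(String raw, Locale locale) {
        NumberFormat nf = NumberFormat.getNumberInstance(locale);
        if (!(nf instanceof DecimalFormat)) {
            throw new IllegalStateException("No DecimalFormat for locale " + locale);
        }
        DecimalFormat fmt = (DecimalFormat) nf;
        fmt.setParseBigDecimal(true); // avoid double rounding on money values
        String text = raw.trim();
        ParsePosition pos = new ParsePosition(0);
        Number parsed = fmt.parse(text, pos);
        // Reject partial parses such as "12,50 EUR" instead of silently truncating.
        if (!(parsed instanceof BigDecimal) || pos.getIndex() != text.length()) {
            throw new IllegalArgumentException("Unparseable amount: " + raw);
        }
        return ((BigDecimal) parsed).setScale(2, RoundingMode.HALF_UP).toPlainString();
    }

    // Parses a date with the given pattern and returns ISO-8601 (yyyy-MM-dd).
    static String normalizeDate(String raw, String pattern) {
        return LocalDate.parse(raw.trim(), DateTimeFormatter.ofPattern(pattern)).toString();
    }

    // Quotes a field when it contains a comma, quote, or line break (RFC 4180 style).
    static String escapeCsv(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Joins already-normalized fields into one CSV row.
    static String toCsvRow(String... fields) {
        return Arrays.stream(fields)
                .map(InvoiceNormalizer::escapeCsv)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Illustrative column order (vendor, invoice date, total); not the real schema.
        String row = toCsvRow(
                "ACME, Inc.",
                normalizeDate("03.02.2024", "dd.MM.uuuu"),
                normalizeAmount("1.234,56", Locale.GERMANY));
        System.out.println(row); // prints: "ACME, Inc.",2024-02-03,1234.56
    }
}
```

The strict `ParsePosition` check is a deliberate choice in this sketch: an amount that only partly parses is a good candidate for a low-confidence flag in the review UI rather than a silent best guess.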
You’re a great fit if you have
- 4+ years Java; 2+ years JavaFX building production desktop apps
- Real-world OCR / NLP or document understanding experience (invoices, receipts, forms)
- Hands-on with one or more: Tesseract, Textract, Google Vision, Azure Form Recognizer, OpenCV, spaCy, LLM JSON extraction
- Comfortable designing parsing pipelines: pre-processing, layout detection, table extraction, post-processing, and human-in-the-loop review
- Strong data wrangling: CSV schemas, date / currency parsing, edge cases
- Solid testing: golden files, fixture PDFs, deterministic pipelines
Nice to have
- Prompt engineering for structured outputs with LLMs
- Vendor-specific templating and auto-learning
- Experience with Maven / Gradle, native packaging, code signing
- Knowledge of ONNX / TensorFlow Lite models for document layout
- Basic DevOps for OCR services and model hosting
Tech we expect to use (flexible)
Java 17+, JavaFX, Gradle, Tesseract / OpenCV or Textract / Vision, optional Python microservices for ML bits, SQLite for local cache, JUnit + test fixtures, GitHub Actions CI.
Success looks like
- ≥95% header-field accuracy on a mixed test set
- ≥90% line-item recall on clear tabular invoices
- Review UI fixes a typical invoice in
- One-click CSV export that matches our schema and loads cleanly
What we provide
- Labeled sample invoices (PDFs / images) across vendors
- Target CSV schema + acceptance tests
- Design mocks for the core screens
- Fast feedback loop with a technical product owner