The UK holds 12 million+ planning applications in diverse formats (PDF, TIF, DOCX, RTF, TXT), plus noisy files (.msg, .html, .mp4, .jpeg, .gif). I needed to extract all text, generate sentence- and paragraph-level embeddings, and power a land chatbot (LandGPT) with precise citations linked back to source docs.
- S3 bucket ingestion with MIME-type filtering to weed out noise
- Text extraction via pure-Python libs (pymupdf, docx, pypandoc) to keep LLM extraction costs to a minimum
- OCR of scanned PDFs/TIFs using Google Gemini Flash 2.0 (1M token context, multimodal, fast, incredibly cheap)
- Embedding with the open-source nomic-embed-text model (768-dimensional vectors) via ollama for cost-effective search
- Storage in the client's Postgres DB, leveraging the pgvector, pgvectorscale & pgai extensions
- Scheduler-worker pattern with Docker + AWS SQS, horizontally scalable on Kubernetes spot instances
- Environment management via Pydantic BaseSettings, modular clean code with idempotent upsert logic
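The filtering and routing steps above can be sketched as a small lookup. This is a minimal illustration, not the project's actual code: the `Route` enum, the `ALLOWED` table, `route_object`, and `pdf_needs_ocr` (including its 20-character threshold) are all hypothetical names and values chosen for the example.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Optional


class Route(Enum):
    NATIVE_TEXT = "native_text"  # parse locally with a Python library
    OCR = "ocr"                  # send to a multimodal model for OCR
    SKIP = "skip"                # noise: never extracted or embedded


# Hypothetical allowlist: extension -> (expected MIME type, route).
# Anything not listed here (.msg, .html, .mp4, .jpeg, .gif, ...) is noise.
ALLOWED = {
    ".pdf": ("application/pdf", Route.NATIVE_TEXT),
    ".tif": ("image/tiff", Route.OCR),
    ".tiff": ("image/tiff", Route.OCR),
    ".docx": (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        Route.NATIVE_TEXT,
    ),
    ".rtf": ("application/rtf", Route.NATIVE_TEXT),
    ".txt": ("text/plain", Route.NATIVE_TEXT),
}

# Generic types that object stores often report when the uploader set nothing.
GENERIC_TYPES = {"application/octet-stream", "binary/octet-stream"}


def route_object(key: str, content_type: Optional[str] = None) -> Route:
    """Decide how to handle one object from its key and reported MIME type."""
    entry = ALLOWED.get(PurePosixPath(key).suffix.lower())
    if entry is None:
        return Route.SKIP
    expected_mime, route = entry
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = content_type.split(";")[0].strip().lower()
        if mime != expected_mime and mime not in GENERIC_TYPES:
            return Route.SKIP  # extension and MIME type disagree
    return route


def pdf_needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Assumed heuristic: a PDF with almost no text layer is treated as scanned."""
    return len(extracted_text.strip()) < min_chars
```

Under these assumptions, `route_object("apps/site_plan.tif", "image/tiff")` yields `Route.OCR`, while a `.msg` or `.mp4` key is skipped before any extraction cost is incurred. PDFs are tried locally first and only fall back to OCR when `pdf_needs_ocr` reports an empty text layer.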
Python · AWS S3 · Google Gemini Flash 2.0 · ollama · nomic-embed-text · Postgres · pgvector · pgvectorscale · pgai · Docker · AWS SQS · Kubernetes · Pydantic · OCR · Embeddings · Clean Architecture · Cost Optimisation
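The scheduler-worker pattern from the approach above can be sketched in-process. This is only an illustration of the shape: `queue.Queue` stands in for AWS SQS and threads stand in for worker containers, and the names `scheduler`, `worker`, `run`, and `batch_size` are hypothetical.

```python
import queue
import threading


def scheduler(keys, jobs, batch_size=2):
    """Split the work into batches and enqueue each batch as one message."""
    batch = []
    for key in keys:
        batch.append(key)
        if len(batch) == batch_size:
            jobs.put(batch)
            batch = []
    if batch:
        jobs.put(batch)


def worker(jobs, handle, done, lock):
    """Pull batches until a None sentinel arrives, recording each result."""
    while True:
        batch = jobs.get()
        if batch is None:  # sentinel: no more work for this worker
            return
        for key in batch:
            result = handle(key)
            with lock:
                done[key] = result


def run(keys, handle, n_workers=3):
    """Start workers, schedule all keys, then shut the workers down."""
    jobs = queue.Queue()
    done = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, handle, done, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    scheduler(keys, jobs)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return done
```

Adding workers increases throughput without changing the scheduler, which is the property that makes the pattern scale horizontally. With a real SQS standard queue, delivery is at-least-once, so the handler should be idempotent; a worker lost to a spot-instance interruption then only causes a retry, not a duplicate row.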
• High noise ratio in S3 → rigorous MIME-type & extension filtering
• Scanned PDF/TIF files → Gemini Flash 2.0 OCR for reliable multimodal extraction
• Massive scale & cost constraints → open-source embeddings + spot-instance K8s scaling
• Avoiding redundant work → SQL-backed dedupe checks before extract/embed
• Client DB consistency → integrated with existing Postgres stack (deliberately didn't use a vector DB such as Qdrant to keep tech stack simple and maintainable)
• Processed 12 million docs (avg. 10 pages) in under 48 hours (≈ 250k docs/hr, or ≈ 2.5M pages/hr)
• Reduced extraction & embedding cost by 65% vs. GPT-4 baseline
• Enables daily ingestion of 1k+ new planning apps, keeping LandGPT data fresh
• LandGPT platform engagement up 22%
Note: this was one project of many from a long-term contract with the above client
