📑 Planning Document Text Extraction & Embedding Pipeline


🎯 Problem Statement

The UK holds 12 million+ planning applications in diverse formats (PDF, TIF, DOCX, RTF, TXT), plus noisy files (.msg, .html, .mp4, .jpeg, .gif). I needed to extract all text, generate sentence- and paragraph-level embeddings, and power a land chatbot (LandGPT) with precise citations linked back to source documents.

βš™οΈ Technical Approach

  • S3 bucket ingestion with MIME-type filtering to weed out noise
  • Text extraction via pure-Python libs (pymupdf, python-docx, pypandoc) to keep LLM extraction costs to a minimum
  • OCR of scanned PDFs/TIFs using Google Gemini 2.0 Flash (1M-token context, multimodal, fast, incredibly cheap)
  • Embedding with the open-source nomic-embed-text model (768-dimensional vectors) via Ollama for cost-effective search
  • Storage in the client's Postgres DB, leveraging the pgvector, pgvectorscale & pgai extensions
  • Scheduler-worker pattern with Docker + AWS SQS, horizontally scalable on Kubernetes spot instances
  • Environment management via Pydantic BaseSettings, modular clean code with idempotent upsert logic

Minimal, illustrative sketches of each step follow.
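
The filtering step can be as simple as an extension allow-list over the S3 listing. A minimal sketch with boto3; the bucket name and the exact allow-list are placeholders, not the production config:

```python
# Keep only extensions we can extract text from; everything else
# (.msg, .html, .mp4, .jpeg, .gif, ...) is treated as noise and skipped.
import boto3

EXTRACTABLE = {".pdf", ".tif", ".tiff", ".docx", ".rtf", ".txt"}  # illustrative

s3 = boto3.client("s3")

def extractable_keys(bucket: str, prefix: str = ""):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if any(obj["Key"].lower().endswith(ext) for ext in EXTRACTABLE):
                yield obj["Key"]
```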
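
Extraction routes each file to a pure-Python library by type, so only scanned documents ever reach the paid OCR path. A hedged sketch; the real pipeline's routing logic may differ:

```python
from pathlib import Path

import fitz                # pymupdf
import pypandoc
from docx import Document  # python-docx

def extract_text(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        with fitz.open(path) as doc:  # digital PDFs; scanned ones go to OCR
            return "".join(page.get_text() for page in doc)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix == ".rtf":
        return pypandoc.convert_file(path, "plain")
    if suffix == ".txt":
        return Path(path).read_text(errors="ignore")
    raise ValueError(f"no extractor for {suffix}")
```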
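
For scanned pages, the OCR step is a single multimodal prompt. A sketch using the google-generativeai client; the model id and prompt wording are assumptions, not the exact production values:

```python
import google.generativeai as genai

genai.configure(api_key="...")  # loaded from settings in practice
model = genai.GenerativeModel("gemini-2.0-flash")

def ocr_page(image_bytes: bytes, mime_type: str = "image/tiff") -> str:
    # One cheap multimodal call per page image; the 1M-token context
    # also leaves room to batch many pages per request.
    response = model.generate_content([
        {"mime_type": mime_type, "data": image_bytes},
        "Extract all text from this scanned page, preserving reading order.",
    ])
    return response.text
```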
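
Embedding is a local call against an Ollama server running nomic-embed-text. Minimal sketch, assuming the Ollama daemon is already running with the model pulled:

```python
import ollama

def embed(text: str) -> list[float]:
    # nomic-embed-text returns a 768-dimensional vector
    result = ollama.embeddings(model="nomic-embed-text", prompt=text)
    return result["embedding"]
```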
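
Vectors land in the client's existing Postgres via pgvector, with an upsert so reprocessing a document is idempotent. Sketch with psycopg2; the table and column names are illustrative, not the client's schema:

```python
import psycopg2

UPSERT = """
INSERT INTO doc_embeddings (doc_id, chunk_idx, chunk_text, embedding)
VALUES (%s, %s, %s, %s::vector)
ON CONFLICT (doc_id, chunk_idx) DO UPDATE
  SET chunk_text = EXCLUDED.chunk_text,
      embedding  = EXCLUDED.embedding;
"""

def upsert_chunk(conn, doc_id: str, idx: int, text: str, vec: list[float]) -> None:
    with conn.cursor() as cur:
        # pgvector accepts the '[x, y, ...]' literal that str(list) produces
        cur.execute(UPSERT, (doc_id, idx, text, str(vec)))
    conn.commit()
```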
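
The scheduler enqueues object keys on SQS and workers long-poll for them, so scaling out is just adding pods. A hedged sketch of the worker side plus the Pydantic settings wiring; the env-var names are illustrative:

```python
import boto3
from pydantic_settings import BaseSettings  # pydantic v2 (pydantic-settings package)

class Settings(BaseSettings):
    queue_url: str                 # read from QUEUE_URL
    aws_region: str = "eu-west-2"  # read from AWS_REGION

settings = Settings()
sqs = boto3.client("sqs", region_name=settings.aws_region)

def worker_loop(handle) -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=settings.queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling: cheap idling on spot nodes
        )
        for msg in resp.get("Messages", []):
            handle(msg["Body"])  # extract -> OCR if needed -> embed -> upsert
            sqs.delete_message(
                QueueUrl=settings.queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```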

🛠 Skills

Python · AWS S3 · Google Gemini 2.0 Flash · Ollama · nomic-embed-text · Postgres · pgvector · pgvectorscale · pgai · Docker · AWS SQS · Kubernetes · Pydantic · OCR · Embeddings · Clean Architecture · Cost Optimisation

🔧 Challenges & Solutions

• High noise ratio in S3 → rigorous MIME-type & extension filtering
• Scanned PDF/TIF files → Gemini 2.0 Flash OCR for reliable multimodal extraction
• Massive scale & cost constraints → open-source embeddings + spot-instance K8s scaling
• Avoiding redundant work → SQL-backed dedupe checks before extract/embed (sketched below)
• Client DB consistency → integrated with the existing Postgres stack (deliberately didn't use a vector DB such as Qdrant, to keep the tech stack simple and maintainable)
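
The dedupe gate is a cheap SELECT against Postgres keyed on a content hash, run before any extraction or embedding. Illustrative sketch; the table and column names are hypothetical:

```python
import hashlib

def already_processed(conn, content: bytes) -> bool:
    # Hash the raw bytes so renamed or re-uploaded copies are still caught
    digest = hashlib.sha256(content).hexdigest()
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM processed_docs WHERE content_sha256 = %s",
            (digest,),
        )
        return cur.fetchone() is not None
```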

📊 Quantifiable Business Impact

• Processed 12 million docs (avg. 10 pages) in under 48 hours (≈ 250k docs/hr)
• Reduced extraction & embedding cost by 65% vs. a GPT-4 baseline
• Enables daily ingestion of 1k+ new planning applications, keeping LandGPT data fresh
• LandGPT platform engagement up 22%

⭐ Client Review

(Client review screenshot, 9 May 2025.)

Note: this was one of many projects from a long-term contract with this client.

About

📑 Scaled document processing pipeline that chomped through 12 million planning documents in 48 hours using open-source embeddings and Kubernetes spot instances, powering LandGPT's precise citations.
