Machine-readable project context for AI assistants. Human-readable docs: README.md
These rules are mandatory for every commit/push:
- CLAUDE.md ↔ README.md sync: When you change project structure, commands, architecture, or workflows, update BOTH `CLAUDE.md` (machine-readable) and `README.md` (human-readable). Never let them drift apart.
- CHANGELOG.md: Every user-facing or structural change gets an entry in CHANGELOG.md. Follow the existing format (Keep a Changelog). Add entries under `[Unreleased]`. Categories: Added, Changed, Fixed, Removed.
- Sub-READMEs: If changes affect `scraper/`, `pipeline/`, or `mcp-server/`, update their respective `README.md` files too.
- Verify before push: Before pushing, confirm that file paths, command names, and structure descriptions in docs match the actual repository state.
Semantic search over public documents from the municipality of Nordstemmen (Gemeinde Nordstemmen).
Three components: OParl Scraper, Document Pipeline, MCP Server.
Live at: https://nordstemmen-mcp.levinkeller.de/mcp
nordstemmen-ai/
├── scraper/ # OParl Scraper (TypeScript, Effect)
│ ├── src/
│ │ ├── index.ts # CLI entry point
│ │ ├── scraper.ts # Main scraper logic
│ │ ├── client.ts # HTTP client
│ │ ├── schema.ts # OParl type definitions
│ │ └── __tests__/ # Tests (vitest + nock fixtures)
│ ├── package.json
│ ├── tsconfig.json
│ ├── vitest.config.ts
│ └── README.md # Detailed OParl data model docs
├── pipeline/ # Document Pipeline (TypeScript, async/await)
│ ├── src/
│ │ ├── index.ts # CLI entry point (parseArgs)
│ │ ├── pipeline.ts # Orchestrator: per-document processing
│ │ ├── types.ts # TypeScript interfaces
│ │ ├── config.ts # Constants (API URLs, models, limits)
│ │ ├── discovery.ts # Document discovery + metadata parsing
│ │ ├── hash.ts # SHA256 hashing, LFS pointer detection
│ │ ├── cache.ts # .fulltext.json / .embeddings.json / .completed I/O
│ │ ├── ocr.ts # Gemini API: PDF → page-level text
│ │ ├── jina.ts # Jina API: text → 1024D vectors (with semaphore)
│ │ ├── sparse.ts # BM25-TF sparse vectors (FNV-1a hash, German stopwords)
│ │ ├── qdrant.ts # Qdrant upload (named vectors: dense + sparse)
│ │ ├── retry.ts # Retry + concurrency helpers
│ │ ├── rebuild-qdrant.ts # Standalone: rebuild Qdrant from cached embeddings
│ │ ├── migrate-sparse.ts # One-time migration script (delete after use)
│ │ └── __tests__/ # Tests (vitest)
│ │ └── jina-load.test.ts # Jina API load test (run manually)
│ ├── package.json
│ ├── tsconfig.json
│ └── vitest.config.ts
├── mcp-server/ # MCP Server (Cloudflare Pages)
│ ├── functions/
│ │ └── mcp.js # Core MCP implementation (4 tools)
│ ├── src/
│ │ ├── index.html # Landing page
│ │ └── style.css # Tailwind CSS
│ ├── mcp-server.test.js # MCP protocol tests
│ ├── package.json
│ ├── vite.config.js # Build config
│ ├── vitest.config.js # Test config
│ ├── tailwind.config.js
│ ├── postcss.config.js
│ ├── wrangler.test.jsonc # Cloudflare test config
│ └── README.md # API docs, deployment guide
├── embeddings/ # Legacy (Python, deprecated — use pipeline/)
├── documents/ # Downloaded PDFs + metadata (Git LFS)
│ ├── metadata.json # Master index (all files)
│ ├── papers/ # ~1578 paper directories
│ │ └── DS_<num>-<year>/
│ │ ├── metadata.json # OParl paper metadata
│ │ ├── *.pdf # Main + auxiliary files
│ │ ├── *.fulltext.json # Gemini OCR output (cached)
│ │ ├── *.embeddings.json # Cached embeddings (LFS)
│ │ └── *.completed # Completion flag (all steps succeeded)
│ └── meetings/ # ~1087 meeting directories
│ └── <date>_<name>/
│ ├── metadata.json # OParl meeting metadata
│ ├── *.pdf # Invitation, protocol, attachments
│ ├── *.fulltext.json # Gemini OCR output (cached)
│ ├── *.embeddings.json # Cached embeddings (LFS)
│ └── *.completed # Completion flag (all steps succeeded)
├── scripts/
│ ├── lfs-repair.sh # Detect/fix LFS pointer files
│ └── update-hashes-to-sha256.py
├── .github/workflows/
│ ├── data-sync.yml # Hourly CI: scraper + pipeline
│ └── claude.yml # Claude Code Action (@claude in issues/PRs)
├── .devcontainer/
│ └── devcontainer.json # Dev container: Node 22, Python, Git LFS
├── .env.example # All env vars (Qdrant, Jina, Gemini)
├── .gitattributes # LFS tracking: *.pdf, *.embeddings.json
├── .lfsconfig # Custom LFS server: git-lfs.nordstemmen-ai.levinkeller.de
├── biome.json # Linter/formatter config
├── CHANGELOG.md # Project changelog (Keep a Changelog format)
└── package.json # Root workspace (scraper, pipeline, mcp-server)
# Root (workspaces: scraper, pipeline, mcp-server)
npm test # Run all workspace tests
npm run lint # Biome check
npm run lint:fix # Biome auto-fix
npm run format # Biome format
npm run lfs-pull # Download all LFS files
npm run lfs:repair # Detect/repair LFS pointer files
# Scraper
cd scraper && npm run scrape # Run OParl scraper
cd scraper && npm test # Run scraper tests
# Pipeline
npm run migrate:sparse -w pipeline # One-time: rebuild Qdrant with sparse vectors (no API calls)
npm run pipeline # Process all unprocessed documents
npm run pipeline -- --limit 500 # Limit to 500 documents
npm run pipeline -- --force # Re-process everything (ignore .completed)
npm run pipeline -- --dry-run # List files without processing
npm run pipeline -- --skip-qdrant # Skip Qdrant upload
npm run pipeline -- --only DS_1-2007 # Only matching documents
npm run pipeline -- --concurrency 10 # 10 parallel (default 5)
npm run pipeline -- --max-pdf-size 100 # Max PDF size in MB (default 50)
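Since the pipeline's entry point uses `parseArgs` (see `pipeline/src/index.ts` above), the flags could be declared roughly like this. This is an illustrative sketch, not the actual implementation; only the flag names and defaults are taken from the commands above:

```typescript
import { parseArgs } from "node:util";

// Illustrative flag declaration — the real pipeline/src/index.ts may differ.
const { values } = parseArgs({
  args: ["--limit", "500", "--force"], // example argv; real runs use process.argv
  options: {
    limit: { type: "string" },                          // --limit 500
    force: { type: "boolean", default: false },         // ignore .completed
    "dry-run": { type: "boolean", default: false },     // list only
    "skip-qdrant": { type: "boolean", default: false }, // no upload
    only: { type: "string" },                           // e.g. DS_1-2007
    concurrency: { type: "string", default: "5" },      // parallel documents
    "max-pdf-size": { type: "string", default: "50" },  // MB
  },
});
```

`parseArgs` fills in the declared defaults for flags that are absent, so `values.concurrency` is `"5"` here even though only `--limit` and `--force` were passed.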
# MCP Server
cd mcp-server && npm run dev # Local dev server (Vite + Wrangler)
cd mcp-server && npm test # Run tests
cd mcp-server && npm run build # Production build

- Scraper: TypeScript + Effect library. Crawls the OParl API (`/paper` + `/meeting` collections), downloads PDFs, saves structured metadata per entity
- Pipeline: TypeScript, plain async/await. Document-oriented processing: PDF → Gemini OCR → Jina Embeddings + local Sparse Vectors → Qdrant. No build step (`node --experimental-strip-types`). No partial cache reuse — each run always does fresh OCR + embeddings for unprocessed files. A `.completed` flag per PDF is only written after ALL steps succeed
- MCP Server: Cloudflare Pages Functions. Four MCP tools: `search_documents` (hybrid search: dense + sparse with RRF fusion), `get_paper_by_reference` (direct DS lookup), `search_papers` (filtered metadata search), `get_document_text` (fulltext by hash). Fulltext is served from Cloudflare static assets (bundled `.txt` files), not from an external storage service
- Vector DB: Qdrant (self-hosted at qdrant.levinkeller.de). Named vectors: `dense` (Jina 1024D, Cosine) + `sparse` (BM25-TF)
- Hybrid Search: The MCP Server uses the Qdrant Query API with `prefetch` (dense + sparse) and RRF (Reciprocal Rank Fusion). Combines semantic similarity with keyword matching
- Sparse Vectors: Locally computed BM25-TF weights with FNV-1a token hashing. Same tokenizer in pipeline (`sparse.ts`) and MCP server (`mcp.js`). German stopwords, no external API needed
- Embeddings API: Jina AI v3 — `retrieval.passage` for indexing (pipeline), `retrieval.query` for search (MCP server)
- OCR: Gemini 2.5 Flash — sends the entire PDF as inline data; page-level text extraction via `--- Page N ---` markers
- Git LFS: Custom server at git-lfs.nordstemmen-ai.levinkeller.de; tracks PDFs + embedding caches. `.lfsconfig` has `fetchexclude = *` (opt-in download)
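The locally computed sparse vectors described above can be sketched as follows. This is a simplified illustration, not the actual `sparser.ts`-style implementation: the real BM25-TF weighting, tokenizer, and German stopword list may differ.

```typescript
// 32-bit FNV-1a: maps each token to a sparse dimension index.
const FNV_OFFSET = 0x811c9dc5;
const FNV_PRIME = 0x01000193;

function fnv1a(token: string): number {
  let hash = FNV_OFFSET;
  for (let i = 0; i < token.length; i++) {
    hash ^= token.charCodeAt(i);
    hash = Math.imul(hash, FNV_PRIME) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Builds a sparse vector from text: stopwords removed, term frequencies
// passed through a BM25-style saturation (k1 = 1.2; illustrative, no IDF).
function sparseVector(text: string, stopwords: Set<string>) {
  const counts = new Map<number, number>();
  for (const token of text.toLowerCase().split(/\W+/)) {
    if (!token || stopwords.has(token)) continue;
    const dim = fnv1a(token);
    counts.set(dim, (counts.get(dim) ?? 0) + 1);
  }
  const indices: number[] = [];
  const values: number[] = [];
  for (const [dim, tf] of counts) {
    indices.push(dim);
    values.push((tf * 2.2) / (tf + 1.2)); // tf * (k1 + 1) / (tf + k1)
  }
  return { indices, values };
}
```

Because the hash function is deterministic and shared, the pipeline and the MCP server map identical tokens to identical dimensions without any vocabulary file or external API.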
- OCR — Gemini 2.5 Flash: Chosen after comparing Gemini, GPT-4o, and GPT-4o-mini. Gemini is cheapest (~$0.0001/page vs ~$0.0014 for GPT-4o), produces no hallucinations, and handles rotated PDFs correctly. GPT-4o paraphrases instead of transcribing (rewrites content in its own words, distorting meaning). GPT-4o-mini refuses rotated PDFs and misses content blocks
- Embeddings — Jina v3 (1024D): Multilingual model with German support, task-specific LoRA adapters (`retrieval.passage` for indexing, `retrieval.query` for search). Used via the API for both pipeline indexing and MCP server query-time embeddings
- Document-oriented pipeline: Each document goes through the complete chain (OCR → Embeddings → Qdrant) before moving to the next. Simpler error handling, resumable
- No partial cache reuse: When processing, always do fresh OCR + embeddings. `.fulltext.json` and `.embeddings.json` are saved as artifacts but never read back by the pipeline. This ensures consistency
- `.completed` tracking: A `.completed` file per PDF is written ONLY after all steps (OCR → Embeddings → Qdrant) succeed. On re-run, only files without `.completed` (or with a mismatched hash) are processed
- Gemini OCR: Sends the entire PDF as inline data to Gemini 2.5 Flash — no pdf2image, no poppler dependency. Page-level text via `--- Page N ---` markers in the prompt
- Page-level embeddings: 1 dense embedding + 1 sparse vector per page (Jina v3, 1024D). `chunk_index` is always 0. No sub-page chunking
- Hybrid search via RRF: Dense vectors (semantic) + sparse vectors (keyword/BM25) combined via Reciprocal Rank Fusion. Improves results for exact names, numbers, street names
- Hash-based change detection: SHA256 per PDF, checked via the `.completed` file
- Fulltext as static assets: The MCP server serves fulltext from Cloudflare static assets (`.txt` files bundled at deploy time), not from external object storage
- Custom LFS server: Separate from GitHub LFS for cost/control
- OParl metadata preserved: Full paper/meeting context (DS-numbers, consultations, agenda items) stored in Qdrant payload for rich search results
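The hybrid search decision above can be sketched as a Qdrant Query API request body: two `prefetch` branches (dense + sparse) fused with RRF. Field names follow Qdrant's Query API; the actual request built in `mcp.js`, and the prefetch limits, may differ:

```typescript
interface SparseVector {
  indices: number[];
  values: number[];
}

// Builds the body for POST /collections/<name>/points/query:
// each prefetch branch retrieves candidates from one named vector,
// and the top-level query fuses the ranked lists with RRF.
function buildHybridQuery(dense: number[], sparse: SparseVector, limit = 10) {
  return {
    prefetch: [
      { query: dense, using: "dense", limit: limit * 4 },
      { query: sparse, using: "sparse", limit: limit * 4 },
    ],
    query: { fusion: "rrf" }, // Reciprocal Rank Fusion of both branches
    limit,
    with_payload: true, // return the OParl metadata stored in the payload
  };
}
```

RRF scores each point by the reciprocal of its rank in each branch, so a document that ranks well both semantically and by keyword rises to the top even if neither branch puts it first.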
See .env.example:
- `QDRANT_URL` — Qdrant server URL
- `QDRANT_API_KEY` — Qdrant API key
- `QDRANT_PORT` — Qdrant port (443)
- `QDRANT_COLLECTION` — Collection name (nordstemmen)
- `GOOGLE_API_KEY` — Gemini API key (for pipeline OCR)
- `JINA_API_KEY` — Jina AI API key (for pipeline embeddings + MCP query)
MCP Server additionally needs (set in Cloudflare dashboard):
- `JINA_API_KEY` — Jina AI API key for query embeddings
- `QDRANT_URL`, `QDRANT_API_KEY`, `QDRANT_PORT`, `QDRANT_COLLECTION`
1. `cd scraper && npm run scrape` — Download new/changed PDFs + metadata from the OParl API
2. `npm run pipeline` — Process all new documents (OCR → Embeddings → Qdrant)
3. `git add documents/ && git commit && git push` — Commit new data (PDFs + caches via LFS)
This runs automatically every hour via GitHub Actions (data-sync.yml).
MCP Server deployment is automatic via Cloudflare Pages (on push to main).