🔬 Image-to-Structured Data Pipeline

"Turning visual form data into structured, queryable records — automatically, at scale, and at a fraction of the cost of real-time inference."

Important

This pipeline is built around the Gemini Batch API — not real-time inference. Processing 300+ form images costs under $1.50 and requires zero infrastructure. See why this matters ↓

The Problem This Solves

Structured data often lives trapped inside web-based forms — clinical records, surveys, inspection sheets, registration portals. Exporting it programmatically is rarely an option. The data exists visually, and someone has to read it.

This pipeline automates that entire workflow end-to-end:

Captures form pages as screenshots via a headless browser
Pre-processes each image for maximum LLM readability
Extracts structured data using Gemini's vision capabilities
Outputs clean Excel files ready for analysis
Generates consolidated line listings across all records

Everything runs inside Google Colab with Drive as persistent storage — no servers, no infrastructure, no maintenance.

🏗️ Pipeline Architecture

flowchart TD
    A["§1 · SCREENSHOT COLLECTION\nPlaywright headless Chromium\nCLAHE preprocessing · timestamp watermark"]
    B["§2 · FILE DISCOVERY\nWalks Drive output tree\nBuilds pending_tasks.json job queue"]
    C["§3 · FILES API UPLOAD\nUploads PNGs to Gemini Files API\nRecords file_uri · skips if already done"]
    D["§4 · BATCH SUBMISSION\nBuilds JSONL input\nSubmits to Gemini Batch API"]
    E["⏳ ASYNC PROCESSING\nGemini processes all images\nin the background · ~50% cost vs real-time"]
    F["§5 · EXCEL GENERATION\nDownloads JSONL results\nWrites per-form .xlsx"]
    G["§6 · COST ANALYSIS\nToken audit · $/image report\nConfigurable pricing"]
    H["§7 · History Listing"]
    I["§8 · Medication Listing"]
    J["§9 · Events Listing"]
    K[("📊 Master_Line_Listing.xlsx\nmulti-sheet · auto-filter")]

    A --> B --> C --> D --> E --> F --> G
    F --> H & I & J --> K

    style E fill:#f0a500,color:#000,stroke:#c47d00
    style K fill:#1a7f37,color:#fff,stroke:#145d28
    style D fill:#1a56db,color:#fff,stroke:#1239a0

⚙️ How Each Stage Works

#	Stage	Key Mechanism	Idempotent?
1	Screenshot Collection	Playwright headless Chromium + CLAHE contrast enhancement + timestamp watermark	✅ Skips if PNG exists
2	File Discovery & Snapshot	Walks Drive output tree, builds `pending_tasks.json` job queue	✅ Skips if snapshot exists
3	Files API Upload	Uploads PNGs to Gemini Files API, records `file_uri` per task	✅ Skips if `file_uri` recorded
4	Batch Preparation & Submission	Builds JSONL, checks for active job before creating new one	✅ Reuses existing job
5	Output Download & Excel Generation	Downloads JSONL results, writes per-form `.xlsx` next to each PNG	✅ Skips existing `.xlsx`
6	Token Usage & Cost Analysis	Parses `usageMetadata`, computes `$/image` with configurable pricing	—
7–9	Line Listings	Field-mapping dicts aggregate all forms into a master multi-sheet Excel	✅ `if_sheet_exists='replace'`

🧠 Why Batch API, Not Real-Time Inference?

This is the most consequential architectural decision in the pipeline, and it was made deliberately.

When you have hundreds or thousands of form images to process, calling the Gemini API image-by-image in real time creates three compounding problems: cost, rate limits, and fragility. A single API timeout mid-run means partial results and manual recovery. Rate limits cap your throughput. And real-time pricing adds up fast.

The Gemini Batch API sidesteps all three. You submit a JSONL file with every request, walk away, and download structured results when processing is done. The pipeline handles job state tracking, resumption, and idempotent retries automatically.

	Real-Time API	Batch API
Latency	Seconds per image	Async — hours for full batch
Cost	Standard pricing	~50% discount
Rate limits	Hit quickly at scale	None — async queue
Failure handling	Manual per-request	Job-level — retry the whole batch
Infrastructure needed	Keep process running	None — fire and forget
Best for	Interactive, small-scale	Bulk processing ✅

For any dataset beyond ~20 images, the batch approach is the only economically and operationally viable path. Processing 312 form images in this pipeline costs under $1.50 total.

🎯 Use Cases

This pipeline is domain-agnostic. Any workflow where structured data is locked inside web-based form images is a candidate:

Clinical & Research

eCRF (electronic Case Report Form) digitization
Clinical trial data extraction across multiple visit forms
Patient history and adverse event line listing generation

Enterprise & Operations

Web portal inspection / audit form capture
Survey digitization at scale
Legacy system screenshot-to-spreadsheet conversion

Data Engineering

Building training datasets from labeled web forms
Ground-truth generation for OCR/document AI benchmarking
Automated QA on form submission accuracy

💰 Cost Analysis

The pipeline includes a built-in token usage and cost reporting stage (Section 6). After each batch run, you get a breakdown like:

════════════════════════════════════════════════
         TOKEN USAGE & COST REPORT
════════════════════════════════════════════════
  Images processed       : 312
────────────────────────────────────────────────
  Input tokens           : 4,821,504
  Output tokens (incl.)  : 187,392
  Total tokens           : 5,008,896
────────────────────────────────────────────────
  Input cost             : $1.205376
  Output cost            : $0.281088
  TOTAL COST             : $1.486464
────────────────────────────────────────────────
  Avg tokens / image     : 16,054.2
  Avg cost   / image     : $0.004764
════════════════════════════════════════════════

Pricing is configurable via PRICE_INPUT_1M and PRICE_OUTPUT_1M constants in Section 6. Values above use Gemini 3 Flash Batch pricing as of 2025.

🚀 Quick Start

Prerequisites

Google account with Drive access
Gemini API key (free tier works for small batches)
Google Colab (free or Pro)

1. Clone & Open

git clone https://github.com/YOUR_USERNAME/image-to-structured-data-pipeline.git

Open vision_extraction_pipeline.ipynb in Google Colab.

2. Set Secrets

In Colab, go to Tools → Secrets and add:

Secret Key	Description
`GEMINI_API_KEY`	Your Gemini API key
`APP_USERNAME`	Login credential for the target web app
`APP_PASSWORD`	Login credential for the target web app
`BASE_URL`	Base URL of the web application
`TARGET_IDS`	Comma-separated list of entity IDs to process

⚠️ Never hardcode credentials. All secrets are read via userdata.get() — they never appear in the notebook.

3. Configure Path

In each section, update the Drive path to match your project folder:

DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/YourProject/output"

4. Configure Field Mapping (Sections 7–9)

The line listing sections use a FIELD_MAP dict to map your form's exact field labels to column names. Update these to match your form:

FIELD_MAP = {
    "diagnosis":  "History Term",   # your form's label → output column
    "onset date": "Start Date",
    "ongoing":    "Ongoing",
    "end date":   "End Date",
}

5. Run Sections in Order

Each section is self-contained and idempotent — you can re-run safely at any point. The recommended sequence:

§1 Screenshot → §2 Discovery → §3 Upload → §4 Submit
                                              ↓
                                         (wait for batch)
                                              ↓
§6 Cost Report ← §5 Download & Excel ← §4.3 Status Check
                                              ↓
                               §7 History + §8 Medication + §9 Events

📁 Project Structure

image-to-structured-data-pipeline/
├── vision_extraction_pipeline.ipynb   # Main pipeline notebook (9 sections)
└── README.md

Runtime artifacts (auto-created by the pipeline, stored in Drive):

Google Drive / YourProject/
├── output/
│   └── {entity_id}/
│       └── {visit}/
│           └── {form_name}/
│               ├── {form_name}.png       # Captured screenshot
│               └── {form_name}.xlsx      # Extracted structured data
└── Line_Listings/
    └── Master_Line_Listing.xlsx          # Consolidated output (multi-sheet)
        ├── History_Listing
        ├── Medication_Listing
        └── Events_Listing

🛡️ Design Principles

Idempotency throughout — Every stage checks whether its output already exists before doing work. You can interrupt and resume the pipeline at any point without duplicating effort or cost.

Secrets-only credentials — No credentials, IDs, or sensitive strings appear in the notebook. Everything is read from Colab Secrets at runtime.

Separation of concerns — Each numbered section has exactly one responsibility. The job queue (pending_tasks.json) is the single source of truth shared between stages.

Cost-aware by design — The batch API, the idempotency guards, and the built-in cost reporting all exist to make large-scale runs predictable and affordable.

👤 Author

Kerem Dağlı · GitHub · LinkedIn

📄 License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
README.md		README.md
vision_extraction_pipeline.ipynb		vision_extraction_pipeline.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔬 Image-to-Structured Data Pipeline

The Problem This Solves

🏗️ Pipeline Architecture

⚙️ How Each Stage Works

🧠 Why Batch API, Not Real-Time Inference?

🎯 Use Cases

💰 Cost Analysis

🚀 Quick Start

Prerequisites

1. Clone & Open

2. Set Secrets

3. Configure Path

4. Configure Field Mapping (Sections 7–9)

5. Run Sections in Order

📁 Project Structure

🛡️ Design Principles

👤 Author

📄 License

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🔬 Image-to-Structured Data Pipeline

The Problem This Solves

🏗️ Pipeline Architecture

⚙️ How Each Stage Works

🧠 Why Batch API, Not Real-Time Inference?

🎯 Use Cases

💰 Cost Analysis

🚀 Quick Start

Prerequisites

1. Clone & Open

2. Set Secrets

3. Configure Path

4. Configure Field Mapping (Sections 7–9)

5. Run Sections in Order

📁 Project Structure

🛡️ Design Principles

👤 Author

📄 License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages