VLM YOLO Detector

VLM-powered image extraction and semantic search for equipment manuals. This repository provides the image data pipeline for the agentic-rag system.

Overview

This tool processes PDF manuals to:

  1. Extract images from PDF documents (embedded images + rendered pages with diagrams)
  2. Generate VLM descriptions for each image using LLaVA via Ollama
  3. Create semantic embeddings for intelligent image search
  4. Integrate with agentic-rag for visual content retrieval

Data Pipeline Status

Step                  Output
--------------------  -------------------------------------------
PDF Extraction        2,827 images from 30 PDFs
VLM Descriptions      Contextual descriptions in image_index.json
Semantic Embeddings   384-dim embeddings in image_embeddings.npy

Getting Started

Prerequisites

  • Python 3.10+ with pip
  • Ollama installed and running (download from ollama.com/download)
  • LLaVA model for VLM descriptions (pulled automatically by install script)

Quick Setup (Windows)

For a fully automated installation:

git clone https://github.com/morkev/vlm-yolo-detector.git
cd vlm-yolo-detector
install.bat

This will:

  1. Install uv package manager and Python dependencies
  2. Install PyMuPDF and sentence-transformers
  3. Pull the LLaVA 13B model for VLM descriptions
  4. Optionally process any PDFs in data/manuals/

Manual Setup Steps

1. Clone and Navigate

git clone https://github.com/morkev/vlm-yolo-detector.git
cd vlm-yolo-detector

2. Install Dependencies

pip install uv
uv sync

# Or using pip directly
pip install -r requirements.txt
pip install pymupdf sentence-transformers

3. Pull Ollama VLM Model

ollama pull llava:13b

4. Add PDF Manuals

Place your PDF files in the data/manuals/ directory:

cp /path/to/your/manuals/*.pdf data/manuals/

5. Run the Processing Pipeline

# Step 1: Extract images from PDFs
python scripts/extract_all_images.py --pdf-dir data/manuals --output-dir data/processed --min-size 150

# Step 2: Generate VLM descriptions (requires Ollama with llava:13b)
python scripts/describe_images_vlm.py --index data/processed/image_index.json

# Step 3: Build semantic embeddings
python scripts/build_embeddings.py

Directory Structure

vlm-yolo-detector/
├── data/
│   ├── manuals/                    # Source PDF files
│   └── processed/
│       ├── images/                 # Extracted images (train/val split)
│       ├── labels/                 # YOLO format labels
│       ├── image_index.json        # Image metadata + VLM descriptions
│       ├── image_embeddings.npy    # Semantic embeddings (384-dim)
│       └── embedding_mapping.json  # Filename to index mapping
├── scripts/
│   ├── extract_all_images.py       # PDF to images
│   ├── describe_images_vlm.py      # Generate VLM descriptions
│   ├── build_embeddings.py         # Create semantic embeddings
│   ├── auto_label_images.py        # YOLO auto-labeling
│   └── train_classifier.py         # Train YOLO classifier
├── runs/                           # Trained model weights
├── configs/                        # YOLO configuration files
├── yologen/                        # Python package
├── install.bat                     # Automated installation script
├── requirements.txt                # Python dependencies
└── pyproject.toml                  # Project configuration

Integration with Agentic RAG

The agentic-rag repository uses this data for semantic image search:

# agentic-rag/app/backend/api/tools/image_search.py
from pathlib import Path

YOLOGEN_DIR = Path("..") / "vlm-yolo-detector"
IMAGE_INDEX_PATH = YOLOGEN_DIR / "data" / "processed" / "image_index.json"
EMBEDDING_NPY_PATH = YOLOGEN_DIR / "data" / "processed" / "image_embeddings.npy"

Setup for integration:

  1. Clone this repository alongside agentic-rag in the same parent directory
  2. Run install.bat to process PDFs and generate embeddings
  3. The agentic-rag system will automatically find the embeddings
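A consuming repository can load both artifacts in a few lines. The sketch below mirrors the relative-path convention from the image_search.py snippet above; the function name is illustrative, not part of either codebase.

```python
import json
from pathlib import Path

import numpy as np

def load_artifacts(base_dir: Path):
    """Load the image index and embedding matrix produced by this repo."""
    processed = base_dir / "data" / "processed"
    with open(processed / "image_index.json", encoding="utf-8") as f:
        index = json.load(f)
    embeddings = np.load(processed / "image_embeddings.npy")
    return index, embeddings
```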

Example directory structure:

Repositories/
├── agentic-rag/
└── vlm-yolo-detector/

Processing Scripts

extract_all_images.py

Extracts all visual content from PDFs including embedded images, vector graphics rendered as images, and full pages with diagrams.

python scripts/extract_all_images.py --pdf-dir data/manuals --output-dir data/processed --min-size 150

Options:

  • --pdf-dir: Directory containing PDF files
  • --output-dir: Output directory for extracted images
  • --min-size: Minimum image dimension in pixels (default: 150)
  • --render-all: Render all pages as images

describe_images_vlm.py

Uses LLaVA via Ollama to generate contextual descriptions for each image.

python scripts/describe_images_vlm.py --index data/processed/image_index.json

Options:

  • --index: Path to image_index.json
  • --batch-size: Number of images to process in parallel (default: 5)
  • --model: Ollama VLM model to use (default: llava:13b)
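A single description call goes through Ollama's HTTP API, which accepts base64-encoded images on its generate endpoint. The sketch below assumes Ollama on its default port; the prompt text and function names are illustrative, not the script's actual internals.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(image_bytes: bytes, model: str = "llava:13b") -> dict:
    # Ollama's /api/generate expects images as base64 strings.
    return {
        "model": model,
        "prompt": "Describe this image from an equipment manual.",
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def describe_image(image_bytes: bytes, model: str = "llava:13b") -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(image_bytes, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```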

build_embeddings.py

Creates semantic embeddings from VLM descriptions using sentence-transformers.

python scripts/build_embeddings.py

Options:

  • --model: Embedding model (default: sentence-transformers/all-MiniLM-L6-v2)
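The step amounts to gathering one text per image and encoding it with sentence-transformers. A minimal sketch, assuming the default model above; the helper names are illustrative, and the fallback-to-page-text behavior is an assumption rather than a documented detail of the script.

```python
def collect_descriptions(index_entries: list[dict]) -> list[str]:
    # Prefer the VLM description; fall back to surrounding page text.
    return [
        e.get("vlm_description") or e.get("page_text") or e["filename"]
        for e in index_entries
    ]

def build_embeddings(texts: list[str],
                     model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # all-MiniLM-L6-v2 produces 384-dim vectors, matching image_embeddings.npy
    return model.encode(texts, normalize_embeddings=True)
```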

Output Files

image_index.json

Contains metadata for each extracted image:

{
  "filename": "APSX-PIM_page_5_img_1.png",
  "pdf_source": "APSX-PIM-Manual.pdf",
  "pdf_name": "APSX-PIM-Manual",
  "page": 5,
  "width": 800,
  "height": 600,
  "extraction_type": "embedded",
  "page_text": "...",
  "vlm_description": "This image shows the control panel of the APSX-PIM injection molding machine..."
}

image_embeddings.npy

NumPy array of 384-dimensional embeddings, one row per image; row order matches the filename-to-index mapping in embedding_mapping.json.

embedding_mapping.json

Maps image filenames to their index in the embeddings array:

{
  "APSX-PIM_page_5_img_1.png": 0,
  "APSX-PIM_page_6_img_2.png": 1,
  ...
}
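Together, the two files support semantic search: embed a query with the same sentence-transformers model, score it against every row, and map the best indices back to filenames. A plain-NumPy sketch with a precomputed query vector (the function name is illustrative):

```python
import numpy as np

def top_k_images(query_vec: np.ndarray,
                 embeddings: np.ndarray,
                 mapping: dict[str, int],
                 k: int = 3) -> list[str]:
    """Rank images by cosine similarity to the query vector."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = emb @ q
    # Invert the filename -> index mapping to recover names from row indices.
    index_to_name = {i: name for name, i in mapping.items()}
    order = np.argsort(scores)[::-1][:k]
    return [index_to_name[i] for i in order]
```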

Troubleshooting

Ollama Connection Error

Problem: Failed to connect to Ollama

Solution:

  1. Verify Ollama is running: ollama list
  2. Start Ollama: ollama serve
  3. Pull the LLaVA model: ollama pull llava:13b

No Images Extracted

Problem: extract_all_images.py finds no images

Solutions:

  1. Verify PDFs are in data/manuals/
  2. Try with --render-all flag to render pages as images
  3. Lower the --min-size threshold

Embedding Generation Fails

Problem: build_embeddings.py fails

Solutions:

  1. Verify sentence-transformers is installed: pip install sentence-transformers
  2. Check that image_index.json exists and has VLM descriptions
  3. Ensure sufficient disk space for embeddings

Missing VLM Descriptions

Problem: image_index.json has empty vlm_description fields

Solutions:

  1. Verify Ollama is running and LLaVA model is pulled
  2. Re-run describe_images_vlm.py
  3. Check Ollama logs for errors

Requirements

Core Dependencies

  • Python 3.10+
  • PyMuPDF (for PDF processing)
  • sentence-transformers (for embeddings)
  • Pillow (for image processing)
  • NumPy

Optional Dependencies (for YOLO training)

  • ultralytics
  • torch
  • torchvision

VLM Requirements

  • Ollama with LLaVA model

Related Repositories

  • agentic-rag: the RAG system that consumes the image index and embeddings produced by this pipeline

License

MIT License