VLM-powered image extraction and semantic search for equipment manuals. This repository provides the image data pipeline for the agentic-rag system.
This tool processes PDF manuals to:
- Extract images from PDF documents (embedded images + rendered pages with diagrams)
- Generate VLM descriptions for each image using LLaVA via Ollama
- Create semantic embeddings for intelligent image search
- Integrate with agentic-rag for visual content retrieval
| Step | Output |
|---|---|
| PDF Extraction | 2,827 images from 30 PDFs |
| VLM Descriptions | Contextual descriptions in image_index.json |
| Semantic Embeddings | 384-dim embeddings in image_embeddings.npy |
- Python 3.10+ with pip
- Ollama installed and running (download from ollama.com/download)
- LLaVA model for VLM descriptions (pulled automatically by install script)
For a fully automated installation:
```bash
git clone https://github.com/morkev/vlm-yolo-detector.git
cd vlm-yolo-detector
install.bat
```

This will:
- Install uv package manager and Python dependencies
- Install PyMuPDF and sentence-transformers
- Pull the LLaVA 13B model for VLM descriptions
- Optionally process any PDFs in data/manuals/
```bash
git clone https://github.com/morkev/vlm-yolo-detector.git
cd vlm-yolo-detector

pip install uv
uv sync

# Or using pip directly
pip install -r requirements.txt
pip install pymupdf sentence-transformers

ollama pull llava:13b
```

Place your PDF files in the data/manuals/ directory:
```bash
cp /path/to/your/manuals/*.pdf data/manuals/
```

```bash
# Step 1: Extract images from PDFs
python scripts/extract_all_images.py --pdf-dir data/manuals --output-dir data/processed --min-size 150

# Step 2: Generate VLM descriptions (requires Ollama with llava:13b)
python scripts/describe_images_vlm.py --index data/processed/image_index.json

# Step 3: Build semantic embeddings
python scripts/build_embeddings.py
```

```
vlm-yolo-detector/
├── data/
│   ├── manuals/                     # Source PDF files
│   └── processed/
│       ├── images/                  # Extracted images (train/val split)
│       ├── labels/                  # YOLO format labels
│       ├── image_index.json         # Image metadata + VLM descriptions
│       ├── image_embeddings.npy     # Semantic embeddings (384-dim)
│       └── embedding_mapping.json   # Filename to index mapping
├── scripts/
│   ├── extract_all_images.py        # PDF to images
│   ├── describe_images_vlm.py       # Generate VLM descriptions
│   ├── build_embeddings.py          # Create semantic embeddings
│   ├── auto_label_images.py         # YOLO auto-labeling
│   └── train_classifier.py          # Train YOLO classifier
├── runs/                            # Trained model weights
├── configs/                         # YOLO configuration files
├── yologen/                         # Python package
├── install.bat                      # Automated installation script
├── requirements.txt                 # Python dependencies
└── pyproject.toml                   # Project configuration
```
The agentic-rag repository uses this data for semantic image search:
```python
# agentic-rag/app/backend/api/tools/image_search.py
YOLOGEN_DIR = Path("..") / "vlm-yolo-detector"
IMAGE_INDEX_PATH = YOLOGEN_DIR / "data" / "processed" / "image_index.json"
EMBEDDING_NPY_PATH = YOLOGEN_DIR / "data" / "processed" / "image_embeddings.npy"
```

Setup for integration:
- Clone this repository alongside agentic-rag in the same parent directory
- Run `install.bat` to process PDFs and generate embeddings
- The agentic-rag system will automatically find the embeddings
Example directory structure:
```
Repositories/
├── agentic-rag/
└── vlm-yolo-detector/
```
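On the agentic-rag side, retrieval boils down to cosine similarity between a query embedding and the rows of image_embeddings.npy. The sketch below illustrates the idea with small stand-in vectors (the real pipeline uses 384-dimensional MiniLM embeddings, and `search` is an illustrative helper, not the repository's actual function):

```python
import numpy as np

def search(query_vec, embeddings, mapping, top_k=3):
    """Return the top_k (filename, score) pairs most similar to the query."""
    # Normalize so that a dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    m = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = m @ q
    index_to_name = {v: k for k, v in mapping.items()}
    top = np.argsort(scores)[::-1][:top_k]
    return [(index_to_name[int(i)], float(scores[i])) for i in top]

# Stand-in data: 3 images with 4-dim embeddings (real ones are 384-dim)
embeddings = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.9, 0.1, 0.0, 0.0]])
mapping = {"img_a.png": 0, "img_b.png": 1, "img_c.png": 2}
results = search(np.array([1.0, 0.0, 0.0, 0.0]), embeddings, mapping, top_k=2)
# img_a.png ranks first (exact match), img_c.png second (nearby direction)
```

In the actual integration the query vector comes from encoding the user's question with the same sentence-transformers model used by build_embeddings.py, so query and image descriptions live in the same vector space.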
Extracts all visual content from PDFs including embedded images, vector graphics rendered as images, and full pages with diagrams.
```bash
python scripts/extract_all_images.py --pdf-dir data/manuals --output-dir data/processed --min-size 150
```

Options:
- `--pdf-dir`: Directory containing PDF files
- `--output-dir`: Output directory for extracted images
- `--min-size`: Minimum image dimension in pixels (default: 150)
- `--render-all`: Render all pages as images
Uses LLaVA via Ollama to generate contextual descriptions for each image.
```bash
python scripts/describe_images_vlm.py --index data/processed/image_index.json
```

Options:
- `--index`: Path to image_index.json
- `--batch-size`: Number of images to process in parallel (default: 5)
- `--model`: Ollama VLM model to use (default: llava:13b)
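Under the hood, each description request goes to Ollama's `/api/generate` endpoint, which accepts base64-encoded images alongside the prompt. A minimal sketch of the payload (the prompt wording and helper name are illustrative, not the script's exact code):

```python
import base64
import json

def build_llava_request(image_bytes: bytes, page_text: str, model: str = "llava:13b") -> dict:
    """Build a request body for Ollama's /api/generate endpoint.

    Ollama's API takes base64-encoded images in the "images" field.
    The prompt below is illustrative; the real script's prompt may differ.
    """
    return {
        "model": model,
        "prompt": (
            "Describe this image from an equipment manual. "
            f"Surrounding page text for context: {page_text[:500]}"
        ),
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

payload = build_llava_request(b"\x89PNG...", "Control panel overview")
# To send: requests.post("http://localhost:11434/api/generate", json=payload)
print(json.dumps({k: type(v).__name__ for k, v in payload.items()}, indent=2))
```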
Creates semantic embeddings from VLM descriptions using sentence-transformers.
```bash
python scripts/build_embeddings.py
```

Options:
- `--model`: Embedding model (default: sentence-transformers/all-MiniLM-L6-v2)
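Conceptually, the script encodes each entry's `vlm_description` and writes two aligned artifacts: the embedding matrix and the filename-to-row mapping. A minimal sketch, with a random stand-in encoder in place of the real sentence-transformers model:

```python
import json
import numpy as np

def build_embedding_artifacts(index_entries, encode):
    """Encode each image's VLM description and build the filename->row mapping.

    `encode` stands in for SentenceTransformer("all-MiniLM-L6-v2").encode;
    the invariant that matters is that row i of the matrix corresponds to
    entry i, which is exactly what embedding_mapping.json records.
    """
    texts = [e["vlm_description"] for e in index_entries]
    embeddings = np.asarray(encode(texts), dtype=np.float32)  # shape (N, 384)
    mapping = {e["filename"]: i for i, e in enumerate(index_entries)}
    return embeddings, mapping

# Stand-in encoder: random 384-dim vectors (the real script loads MiniLM)
fake_encode = lambda texts: np.random.rand(len(texts), 384)

entries = [
    {"filename": "a_page_1_img_1.png", "vlm_description": "A wiring diagram"},
    {"filename": "a_page_2_img_1.png", "vlm_description": "A control panel"},
]
emb, mapping = build_embedding_artifacts(entries, fake_encode)
np.save("image_embeddings.npy", emb)            # -> image_embeddings.npy
with open("embedding_mapping.json", "w") as f:  # -> embedding_mapping.json
    json.dump(mapping, f)
```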
image_index.json contains metadata for each extracted image:
```json
{
  "filename": "APSX-PIM_page_5_img_1.png",
  "pdf_source": "APSX-PIM-Manual.pdf",
  "pdf_name": "APSX-PIM-Manual",
  "page": 5,
  "width": 800,
  "height": 600,
  "extraction_type": "embedded",
  "page_text": "...",
  "vlm_description": "This image shows the control panel of the APSX-PIM injection molding machine..."
}
```

image_embeddings.npy is a NumPy array of 384-dimensional embeddings, one per image, indexed by filename in embedding_mapping.json.
embedding_mapping.json maps image filenames to their index in the embeddings array:
```json
{
  "APSX-PIM_page_5_img_1.png": 0,
  "APSX-PIM_page_6_img_2.png": 1,
  ...
}
```

Problem: Failed to connect to Ollama
Solution:
- Verify Ollama is running: `ollama list`
- Start Ollama: `ollama serve`
- Pull the LLaVA model: `ollama pull llava:13b`
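You can also check connectivity from Python. This sketch assumes Ollama's default local endpoint (`http://localhost:11434`); `/api/tags` lists the locally pulled models:

```python
import json
import urllib.error
import urllib.request

def ollama_status(base_url="http://localhost:11434"):
    """Return names of locally pulled models, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (urllib.error.URLError, OSError):
        return None

models = ollama_status()
if models is None:
    print("Ollama is not reachable - start it with `ollama serve`")
elif not any(name.startswith("llava") for name in models):
    print("LLaVA is missing - run `ollama pull llava:13b`")
else:
    print("Ollama is ready:", models)
```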
Problem: extract_all_images.py finds no images
Solutions:
- Verify PDFs are in `data/manuals/`
- Try the `--render-all` flag to render pages as images
- Lower the `--min-size` threshold
Problem: build_embeddings.py fails
Solutions:
- Verify sentence-transformers is installed: `pip install sentence-transformers`
- Check that image_index.json exists and has VLM descriptions
- Ensure sufficient disk space for embeddings
Problem: image_index.json has empty vlm_description fields
Solutions:
- Verify Ollama is running and LLaVA model is pulled
- Re-run describe_images_vlm.py
- Check Ollama logs for errors
- Python 3.10+
- PyMuPDF (for PDF processing)
- sentence-transformers (for embeddings)
- Pillow (for image processing)
- NumPy
- ultralytics
- torch
- torchvision
- Ollama with LLaVA model
- agentic-rag: Main RAG pipeline that uses this repository for image search
- Repository: https://github.com/Manufacturing-Demonstration-Facility/agentic-rag
- Uses the embeddings from this repository for semantic image retrieval
MIT License