Minimalistic Vector Database in Python

A lightweight, pure-Python vector database built from scratch.

This project explores the mechanics of vector similarity search by implementing a custom indexer based on the Vamana graph algorithm (the index behind DiskANN). It is designed for educational purposes and lightweight use cases, including semantic search and Retrieval-Augmented Generation (RAG).

Key Features ✨

⚠️ Note: This project is a work in progress. Some features are incomplete and subject to change.

🧭 Vamana Graph Indexing: Implements the algorithm behind DiskANN. Vamana optimizes the graph with long-range shortcuts, allowing the search to navigate huge datasets efficiently by jumping quickly toward the target rather than stepping slowly between neighbors.
⚡ Pure Python, C-Level Speed: By leveraging Numba JIT compilation, this project achieves indexing and search performance comparable to C while maintaining a readable, hackable Python codebase.
💾 Persistence: Data isn't just dumped into binary files. Metadata and vectors are stored reliably in SQLite, orchestrated by SQLAlchemy and Alembic, ensuring portability and crash-safety.
🐍 Data Science Ready SDK: A lightweight client designed for the Python ecosystem. It supports NumPy arrays natively and handles automatic request batching behind the scenes to maximize throughput.
🔌 Modern REST API: Powered by FastAPI, providing asynchronous request handling, rigorous type safety, and automatic interactive documentation (Swagger UI).
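
The "jumping quickly toward the target" behaviour described for Vamana comes from a greedy, best-first search over the graph. The sketch below illustrates that idea in plain Python on a hand-built toy graph. It is not the project's implementation: the names `greedy_search`, `graph`, `entry`, and `list_size` are hypothetical, and the real indexer uses Numba rather than Python lists.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_search(graph, vectors, query, entry, k, list_size):
    """Best-first search over an adjacency-list graph.

    graph:     dict mapping node id -> list of neighbour ids
    vectors:   dict mapping node id -> vector
    list_size: size of the candidate list, must be >= k
    """
    dist = {entry: l2(vectors[entry], query)}
    candidates = [entry]  # kept sorted by distance, at most list_size long
    visited = set()

    while True:
        # Pick the closest candidate that has not been expanded yet.
        frontier = [n for n in candidates if n not in visited]
        if not frontier:
            break
        current = frontier[0]
        visited.add(current)

        # Score unseen neighbours, then keep only the best list_size candidates.
        for nb in graph[current]:
            if nb not in dist:
                dist[nb] = l2(vectors[nb], query)
                candidates.append(nb)
        candidates.sort(key=dist.get)
        candidates = candidates[:list_size]

    return candidates[:k]


# Toy data: eight points on a line, chained to their immediate neighbours,
# plus one long-range edge from node 0 to node 6.
vectors = {i: [float(i)] for i in range(8)}
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
graph[0].append(6)

nearest = greedy_search(graph, vectors, query=[6.2], entry=0, k=2, list_size=3)
print(nearest)  # [6, 7]
```

Starting from node 0, the search reaches node 6 in a single hop through the long-range edge instead of walking the chain one neighbour at a time.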

📊 Benchmarks

Metric	Dataset	Result
Indexing Throughput	10k vectors (384d)	1176.29 vec/s
Query Latency (Avg)	10k vectors (384d)	6.84 ms
Query Latency (P95)	10k vectors (384d)	9.87 ms
Recall@10	SIFT-Small (10k vectors, 128d)	0.9960
Recall@10	SIFT1M (1M vectors, 128d)	0.9573
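
For readers unfamiliar with the Recall@10 metric in the table above, the sketch below shows the common definition: the fraction of the true k nearest neighbours that appear in the returned top k, averaged over all queries. The exact computation used by the project's benchmark scripts is not shown here, so treat `recall_at_k` and its arguments as hypothetical names.

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Mean fraction of the true top-k neighbours found in the retrieved top-k.

    retrieved:    one list of result ids per query (approximate search)
    ground_truth: one list of exact nearest-neighbour ids per query
    """
    if len(retrieved) != len(ground_truth):
        raise ValueError("one result list per query is required")
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        total += len(set(got[:k]) & set(truth[:k])) / k
    return total / len(retrieved)


# Two queries with k = 2: the first is fully correct, the second finds one of two.
retrieved = [[4, 7], [1, 9]]
ground_truth = [[7, 4], [1, 2]]
score = recall_at_k(retrieved, ground_truth, k=2)
print(score)  # 0.75
```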

These benchmarks were achieved using the following Vamana graph indexing config:

Parameter	Value	Description
`VAMANA_R`	`40`	Maximum graph degree
`VAMANA_L_BUILD`	`100`	Search list size during index building
`VAMANA_L_SEARCH`	`60`	Search list size during querying
`VAMANA_ALPHA_FIRST_PASS`	`1.0`	Distance multiplier (First pass)
`VAMANA_ALPHA_SECOND_PASS`	`1.2`	Distance multiplier (Second pass)

Note: These parameters can be modified in the src/common/config.py file.
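
To give a feel for what the maximum degree and the two alpha values control, here is a minimal sketch of the pruning rule described in the DiskANN paper: a candidate is dropped when an already selected neighbour is closer to it, by a factor of alpha, than the point being pruned. The function name `robust_prune` and its arguments are hypothetical and the code is illustrative only. It is not taken from this repository.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def robust_prune(point, candidates, vectors, alpha, max_degree):
    """Select up to max_degree out-neighbours for `point` from `candidates`."""
    remaining = sorted(
        (c for c in set(candidates) if c != point),
        key=lambda c: l2(vectors[point], vectors[c]),
    )
    selected = []
    while remaining and len(selected) < max_degree:
        best = remaining.pop(0)  # closest remaining candidate
        selected.append(best)
        # Drop candidates that `best` already covers, i.e. those for which
        # alpha * d(best, c) <= d(point, c).
        remaining = [
            c for c in remaining
            if alpha * l2(vectors[best], vectors[c]) > l2(vectors[point], vectors[c])
        ]
    return selected


# Toy data on a line: node 0 is being pruned, node 3 is far away.
vectors = {0: [0.0], 1: [1.0], 2: [2.0], 3: [10.0]}

strict = robust_prune(0, [1, 2, 3], vectors, alpha=1.0, max_degree=2)
relaxed = robust_prune(0, [1, 2, 3], vectors, alpha=1.2, max_degree=2)
print(strict)   # [1]
print(relaxed)  # [1, 3]
```

With alpha at 1.0 only the nearest neighbour survives, while alpha at 1.2 also keeps the distant node 3, which is the kind of long-range edge the second pass is meant to retain.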

🚀 Getting Started

1. Start the Server

Option A: Using Docker (Recommended)

This is the fastest way to get the server running without installing dependencies.

docker compose up --build

Option B: Local Development

If you prefer running the server natively:

Install Dependencies:

pip install -r requirements.txt

Start the Server:

uvicorn src.main:app --host 0.0.0.0 --port 8000 --reload

The API will be available at the default URL, http://localhost:8000.
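
A quick way to confirm the server is reachable is to request the interactive documentation page. The sketch below assumes the FastAPI default docs path `/docs` has not been changed or disabled in this project. The helper name `is_server_up` is hypothetical.

```python
import urllib.request


def is_server_up(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET <base_url>/docs answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/docs", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib.error.URLError and HTTPError are OSError subclasses).
        return False


if __name__ == "__main__":
    print("server reachable:", is_server_up())
```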

2. Install the Client

Install the Python SDK from the sdk directory:

pip install -e sdk/

3. Client Usage

For detailed documentation, see the SDK README.

from vectordb.client import Client

with Client() as client:
    # Create the collection, or fetch it if it already exists
    collection = client.get_or_create_collection("demo", dimension=3, metric="cosine")

    # Insert (or update) three 3-dimensional vectors, keyed by string IDs
    collection.upsert(
        ids=["1", "2", "3"],
        vectors=[
            [0.1, 0.2, 0.3],
            [0.9, 0.8, 0.7],
            [0.2, 0.4, 0.4],
        ],
    )

    # Search for the stored vectors nearest to the query
    results = collection.search(query=[0.1, 0.2, 0.3])
    print(results)
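
As a sanity check for the example above, the exact cosine ranking of the three demo vectors can be computed by brute force without the server. This sketch is independent of the SDK. The format of what `collection.search` returns is not shown here, so only the expected order of IDs is compared.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vectors = {"1": [0.1, 0.2, 0.3], "2": [0.9, 0.8, 0.7], "3": [0.2, 0.4, 0.4]}
query = [0.1, 0.2, 0.3]

# Sort IDs from most to least similar to the query.
ranking = sorted(
    vectors,
    key=lambda vid: cosine_similarity(query, vectors[vid]),
    reverse=True,
)
print(ranking)  # ['1', '3', '2']
```

An approximate index should ideally return the IDs in this order for such a small collection.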

📂 Examples

Check out the examples folder in the root of the repository for detailed usage:

Tutorial Notebook: Interactive guide using Pandas and HuggingFace models.
Large Dataset Benchmark: A stress test loading 50,000+ DBpedia articles for RAG.

Acknowledgements 📖

This project was built with reference to the following research and implementations:

DiskANN: Subramanya, S. J., et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. Advances in Neural Information Processing Systems (NeurIPS).
FreshDiskANN: Singh, A., et al. (2021). FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search. arXiv preprint arXiv:2105.09613.
Vamana Visualization: sushrut141/vamana - A helpful repo demonstrating the Vamana algorithm.

License

This project is licensed under the MIT License. See the LICENSE file for details.
