This repo contains a from-scratch RAG pipeline built in Google Colab, using:
- A 1200+ page human nutrition PDF as the knowledge source
- Sentence-based chunking (10 sentences per chunk)
- `all-mpnet-base-v2` embeddings from Sentence-Transformers
- A local instruction-tuned LLM (`google/gemma-2b-it`) loaded with 4-bit quantization
- A retrieval → augmentation → generation loop closely following the “Production Level RAG Workshop (Part 2)” by Vizuara
The goal is to show the full RAG workflow without LangChain / LlamaIndex.
- **Ingestion**
  - Load the Human Nutrition PDF with `PyMuPDF`
  - Clean page text (remove newlines, extra spaces)
  - Optionally skip the first 41 pages (front matter)
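A minimal sketch of the ingestion step. The helper names (`clean_page_text`, `load_pdf_pages`) and the `skip_first` parameter are illustrative, not the notebook's exact code; the PyMuPDF import is deferred so the text cleaner works on its own:

```python
def clean_page_text(text: str) -> str:
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join(text.split())

def load_pdf_pages(pdf_path: str, skip_first: int = 41) -> list[dict]:
    """Read a PDF with PyMuPDF and return cleaned text per page.

    skip_first drops front matter (cover, table of contents, preface).
    """
    import fitz  # PyMuPDF; imported lazily so clean_page_text has no dependency

    pages = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc):
            if page_number < skip_first:
                continue
            pages.append({
                "page_number": page_number - skip_first,  # re-zero after front matter
                "text": clean_page_text(page.get_text()),
            })
    return pages
```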
- **Chunking**
  - Use spaCy’s `English()` + `sentencizer` to split each page into sentences
  - Group sentences into chunks of 10 sentences
  - Store for each chunk: page number, chunk index, and sentence text (`sentence_chunk`)
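The chunking step can be sketched as below. Function names are hypothetical; the spaCy import is deferred to the sentence splitter so the grouping logic stands alone:

```python
def split_into_sentences(text: str) -> list[str]:
    """Sentence-split a page with spaCy's rule-based sentencizer."""
    from spacy.lang.en import English  # lazy import; no full pipeline needed

    nlp = English()
    nlp.add_pipe("sentencizer")
    return [sent.text.strip() for sent in nlp(text).sents]

def chunk_sentences(sentences: list[str], chunk_size: int = 10) -> list[list[str]]:
    """Group sentences into fixed-size chunks; the last chunk may be shorter."""
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]

def make_chunks(pages: list[dict], chunk_size: int = 10) -> list[dict]:
    """Build the per-chunk records: page number, chunk index, sentence_chunk."""
    chunks = []
    for page in pages:
        sentences = split_into_sentences(page["text"])
        for idx, group in enumerate(chunk_sentences(sentences, chunk_size)):
            chunks.append({
                "page_number": page["page_number"],
                "chunk_index": idx,
                "sentence_chunk": " ".join(group),
            })
    return chunks
```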
- **Embedding**
  - Load `all-mpnet-base-v2` from `sentence-transformers`
  - Encode all chunks into 768-dim embeddings
  - Keep embeddings as a torch tensor on GPU for fast retrieval
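A sketch of the embedding step, assuming chunk records shaped as above (`embed_chunks` is an illustrative name, not the notebook's):

```python
def embed_chunks(chunks: list[dict],
                 model_name: str = "all-mpnet-base-v2",
                 device: str = "cuda",
                 batch_size: int = 32):
    """Encode every sentence_chunk into a (num_chunks, 768) torch tensor.

    convert_to_tensor=True keeps the result as a single tensor on `device`,
    so retrieval can run as one matrix multiply on the GPU.
    """
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_name, device=device)
    texts = [chunk["sentence_chunk"] for chunk in chunks]
    return model.encode(texts,
                        batch_size=batch_size,
                        convert_to_tensor=True,
                        device=device)
```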
- **Retrieval**
  - For a user query:
    - Embed the query with the same MPNet model
    - Compute dot-product (cosine) similarity between the query and all chunk embeddings
    - Take the top-k chunks as context
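The scoring math is simple enough to show in plain Python (the notebook uses a torch dot product on GPU; this dependency-free sketch shows the same computation):

```python
def top_k_by_dot_product(query_vec, chunk_vecs, k=5):
    """Score every chunk embedding against the query and return the k best
    as (score, chunk_index) pairs, highest score first.

    With unit-normalized embeddings (all-mpnet-base-v2 normalizes its
    output), the dot product equals cosine similarity.
    """
    scores = [
        (sum(q * c for q, c in zip(query_vec, vec)), i)
        for i, vec in enumerate(chunk_vecs)
    ]
    return sorted(scores, reverse=True)[:k]
```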
- **Augmentation (Prompt Formatting)**
  - Combine the top-k chunks into a bullet list
  - Insert them into a long instruction prompt with a few example Q&A pairs
  - Wrap the final text using `tokenizer.apply_chat_template(...)` for Gemma
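A condensed sketch of the prompt assembly. The wording of the instruction is illustrative; the notebook's actual prompt is longer and includes example Q&A pairs:

```python
def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Format retrieved chunks as a bullet list inside an instruction prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Based on the following context items, answer the query.\n"
        f"Context items:\n{context}\n"
        f"Query: {query}\n"
        "Answer:"
    )

# With the Gemma tokenizer, this prompt is then wrapped as a single user turn:
# tokenizer.apply_chat_template([{"role": "user", "content": prompt}],
#                               tokenize=False, add_generation_prompt=True)
```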
- **Generation (Local LLM)**
  - Load `google/gemma-2b-it` with:
    - 4-bit quantization via `bitsandbytes`
    - `flash_attention_2` when available, otherwise `sdpa`
  - Tokenize the prompt, send it to the GPU, and call `generate(...)`
  - Decode the output and strip the original prompt to get the final RAG answer
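The generation step can be sketched as follows. Helper names and generation parameters (e.g. `max_new_tokens`) are illustrative; imports are deferred since transformers/torch are only needed at call time:

```python
def load_gemma_4bit(model_id: str = "google/gemma-2b-it"):
    """Load tokenizer + model with 4-bit quantization via bitsandbytes."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        attn_implementation="sdpa",  # swap in "flash_attention_2" if installed
    )
    return tokenizer, model

def generate_answer(tokenizer, model, prompt: str, max_new_tokens: int = 256) -> str:
    """Tokenize, generate on the model's device, and strip the echoed prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    full_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # generate() echoes the prompt; keep only the newly generated answer
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    return full_text.strip()
```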
```bash
git clone https://github.com/sgundala/RAG-from-Scratch.git
cd RAG-from-Scratch
```