This repository contains two interconnected challenges for building a complete document processing and RAG (Retrieval-Augmented Generation) system for academic research papers.
The two challenges are:
- Document Parsing: Convert academic PDFs into clean, structured markdown, preserving academic sections, citations, and logical hierarchy. This prepares your data for downstream AI tasks.
- RAG System: Build a Retrieval-Augmented Generation (RAG) pipeline that uses your parsed documents to answer academic questions with accurate, cited responses.
twiga-challenge-1/
├── README.md           # This file - Main overview
├── parsing-challenge/  # Challenge 1: Document Parsing
│   ├── README.md       # Parsing challenge documentation
│   └── strategy1_llamaparse_direct.ipynb
├── rag-challenge/      # Challenge 2: RAG Implementation
│   ├── README.md       # RAG challenge documentation
│   └── strategy1_chromadb_basic.ipynb
├── data/
│   ├── papers/         # Original PDF files
│   ├── input_papers/   # Parsed markdown files
│   └── vector_store/   # Vector database storage
└── LICENSE
- Challenge 1 (Document Parsing):
  - Notebook: parsing-challenge/strategy1_llamaparse_direct.ipynb
  - Goal: Parse academic PDFs into structured markdown for RAG.
- Challenge 2 (RAG System):
  - Notebook: rag-challenge/strategy1_chromadb_basic.ipynb
  - Goal: Build a RAG pipeline for question answering over your parsed documents.
# Core dependencies
pip3 install llama_parse pypdf together pydantic
# RAG dependencies
pip3 install chromadb sentence-transformers langchain openai
pip3 install faiss-cpu numpy pandas matplotlib
Create a .env file in the project root based on .env.example.
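For reference, the keys in a .env file can be read with a few lines of standard-library Python. This is a minimal sketch, not the notebooks' actual loading code: the `load_env` helper is hypothetical, and the real variable names are whatever `.env.example` defines.

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Read KEY=VALUE pairs from a .env-style file into os.environ.

    Hypothetical helper for illustration. Blank lines, comment lines
    starting with '#', and lines without '=' are skipped. Variables that
    are already set in the environment are not overwritten.
    """
    loaded = {}
    env_file = Path(path)
    if not env_file.exists():
        return loaded
    for raw_line in env_file.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop optional surrounding quotes around the value.
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

Calling `load_env()` once at the top of a notebook makes the keys available through `os.environ` for the rest of the session.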
- Phase 1 (Document Parsing):
  - Navigate to parsing-challenge/
  - Open strategy1_llamaparse_direct.ipynb
  - Set up API keys and dependencies
  - Parse your research paper(s)
  - Validate output quality
- Phase 2 (RAG System):
  - Navigate to rag-challenge/
  - Open strategy1_chromadb_basic.ipynb
  - Set up API keys and dependencies
  - Use parsed markdown from Phase 1
  - Test with academic questions
- Mobile-Based_Deep_Learning_Models_for_Banana_Disease.pdf - Deep learning for banana disease detection
- Examining_the_Awareness_of_Mobile_Money_Users_on_S.pdf - Mobile money user awareness study
- Practical Machine Learning_25_05_04_14_32_34.pdf - Practical machine learning applications
These papers contain rich academic content, including:
- Complex figures and tables
- Mathematical equations
- Extensive citations and references
- Multi-level section hierarchies
- Content Accuracy (25%) - Text extraction quality
- Structure Preservation (25%) - Academic section identification
- Figure/Table Detection (20%) - Visual element handling
- Citation Handling (15%) - Reference preservation
- Chunk Quality (15%) - Logical segmentation
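The five parsing weights above sum to 100%, so an overall parsing score can be computed as a weighted average of per-criterion scores. The sketch below assumes each criterion is scored on a 0 to 1 scale. That scale, the dictionary keys, and the `weighted_score` helper are illustrative assumptions, not the challenge's official grading procedure.

```python
# Weights taken from the parsing criteria listed above.
PARSING_WEIGHTS = {
    "content_accuracy": 0.25,
    "structure_preservation": 0.25,
    "figure_table_detection": 0.20,
    "citation_handling": 0.15,
    "chunk_quality": 0.15,
}


def weighted_score(scores, weights):
    """Return the weighted average of per-criterion scores.

    Hypothetical helper: `scores` maps each criterion name to a value in
    [0, 1], and `weights` maps the same names to their rubric weights.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight
```

Dividing by the total weight keeps the result on the same 0 to 1 scale even if the weights are later adjusted and no longer sum exactly to 1.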
- Retrieval Accuracy (30%) - Relevant chunk identification
- Answer Quality (25%) - Generated response accuracy
- Context Preservation (20%) - Academic context maintenance
- System Performance (15%) - Query response time
- User Experience (10%) - Interface and usability
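The rubric does not say how retrieval accuracy is measured. One common choice is recall@k: the fraction of known-relevant chunks that appear among the top k retrieved results. The sketch below assumes you have hand-labeled the relevant chunk IDs for each test question; the `recall_at_k` helper and the ID format are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant chunk IDs found in the top-k retrieved IDs.

    `retrieved_ids` is an ordered list (best match first) and
    `relevant_ids` is any iterable of IDs labeled as relevant.
    Returns 0.0 when no relevant IDs are given.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)
```

Averaging this value over a fixed set of test questions gives a single number you can track while changing chunking or embedding settings.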
✅ Successfully parse all research papers
✅ Generate clean, structured markdown output
✅ Preserve academic formatting and citations
✅ Create logical, RAG-ready text chunks
✅ Build functional vector database from parsed content
✅ Implement semantic search capabilities
✅ Generate accurate, contextual answers
✅ Handle complex academic queries effectively
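To make the retrieval step concrete, here is a toy sketch of similarity search over text chunks. It is a stand-in, not the notebook's implementation: it ranks chunks by cosine similarity between bag-of-words count vectors using only the standard library, which is lexical rather than truly semantic. A real pipeline would replace `embed` with dense vectors from an embedding model and store them in a vector database. The `embed`, `cosine`, and `search` helpers are hypothetical, and the sample chunks are made up rather than quoted from the papers.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


# Made-up sample chunks for illustration only.
sample_chunks = [
    "Convolutional networks detect banana leaf disease from phone images.",
    "Survey respondents reported low awareness of mobile money security.",
    "References are listed in the order they are cited.",
]
best = search("banana disease detection", sample_chunks, top_k=1)
```

The ranking logic stays the same when you swap in real embeddings; only `embed` and the storage layer change.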
- Focus on academic structure preservation
- Test different prompt engineering approaches
- Validate output against source PDFs
- Optimize for downstream RAG usage
- Experiment with chunk size and overlap
- Try different embedding models
- Implement query expansion techniques
- Focus on citation preservation in responses
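For the tip about chunk size and overlap, the sketch below shows a simple character-based sliding window so you can see how the two settings interact. The `chunk_text` helper and its default values are illustrative assumptions; the notebook may chunk differently, for example by tokens or by section.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into character windows that share `overlap` characters.

    Each chunk starts `chunk_size - overlap` characters after the previous
    one, so consecutive chunks repeat the last `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and less than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A larger overlap reduces the chance that a sentence or citation is cut in half at a chunk boundary, at the cost of storing and embedding more duplicated text.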
- Start with Phase 1: Open the parsing notebook and process your PDFs
- Move to Phase 2: Open the RAG notebook and build your Q&A system
- Test end-to-end with academic questions
- Check individual challenge READMEs for detailed instructions
- Review baseline implementations before optimizing
- Test incrementally and document your findings
- Focus on one challenge at a time for best results
Ready to build the future of academic document processing? Let's go! 🚀