<h3 align="center">LLM-Powered Wiki Knowledge Base</h3>
<p align="center"><i>Scale to long documents • Reasoning-based retrieval • Native multi-modality • No Vector DB</i></p>
</div>
---
# 📑 Introduction to OpenKB
Andrej Karpathy [described](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) a workflow where LLMs compile raw documents into a structured, interlinked markdown wiki: summaries, concept pages, cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.
**OpenKB (Open Knowledge Base)** is an open-source CLI that implements this workflow, powered by [**PageIndex**](https://github.com/VectifyAI/PageIndex) for vectorless long document retrieval.
### Why not just traditional RAG?
Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed.
### Features
- **Any format** — PDF, Word, PowerPoint, Excel, HTML, Markdown, text, CSV, and more via markitdown
- **Scale to long documents** — Long and complex documents are handled via [PageIndex](https://github.com/VectifyAI/PageIndex) tree indexing, enabling better long-context retrieval
- **Native multi-modality** — Retrieves and understands figures, tables, and images, not just text
- **Auto wiki** — LLM generates summaries, concept pages, and cross-links. You curate sources; the LLM does the rest
- **Query** — Ask questions against your wiki. The LLM navigates your compiled knowledge to answer
- **Lint** — Health checks find contradictions, gaps, orphans, and stale content
```bash
# 1. Create a knowledge base
mkdir my-kb && cd my-kb

# 2. Initialize
okb init

# 3. Add documents
okb add paper.pdf
okb add ~/papers/    # Add a whole directory
okb add article.html

# 4. Ask questions
okb query "What are the main findings?"

# 5. Check wiki health
okb lint
```
### Set up your LLM
OpenKB comes with [multi-LLM support](https://docs.litellm.ai/docs/providers) (e.g., OpenAI, Claude, Gemini) via [LiteLLM](https://github.com/BerriAI/litellm) (pinned to a [safe version](https://docs.litellm.ai/blog/security-update-march-2026)).
Create a `.env` file with your LLM API key. Choose your LLM during `okb init` or edit [`.okb/config.yaml`](#configuration).
```bash
LLM_API_KEY=your_llm_api_key
```
# 🧩 How It Works
```
raw/                         You drop files here
 │
 ├─ Short docs ──→ markitdown ──→ LLM reads full text
 │                                     │
 ├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
 │                                     │
 │                                     ▼
 │                              Wiki Compilation
 │
wiki/
```
### Two paths, one wiki
| | Short documents | Long documents (PDF ≥ 20 pages) |
|---|---|---|
| **Convert** | markitdown → Markdown | PageIndex → tree index + summaries |
| **Images** | Extracted inline (pymupdf) | Extracted by PageIndex |
Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries. The LLM reads the tree instead of the full text, avoiding context window limits while retaining structural understanding.
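
The two-path routing above can be sketched as a small function. This is only an illustration of the rule described in the table; the function name, return values, and the assumption that the page count is already known are not OpenKB's actual internals:

```python
# Hypothetical sketch of the short-vs-long routing described above.
# Names and return values are illustrative; OpenKB's internals may differ.
def choose_path(filename, page_count, threshold=20):
    """Route PDFs at or above the page threshold to PageIndex, else markitdown."""
    if filename.lower().endswith(".pdf") and page_count >= threshold:
        return "pageindex"    # tree index + summaries
    return "markitdown"       # full-text Markdown conversion

print(choose_path("report.pdf", 120))  # pageindex
print(choose_path("notes.md", 3))      # markitdown
```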
### The wiki compiles knowledge
When you add a document, the LLM:
3. Creates or updates concepts with cross-document synthesis
4. Updates the **index** and **log**
A single source might touch 10-15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.
# 📦 Usage
Generated by `okb init`, stored in `.okb/config.yaml`:
```yaml
model: gpt-5.4 # LLM model (any LiteLLM-supported provider)
api_key_env: LLM_API_KEY    # Environment variable for LLM API key
language: en # Wiki output language
pageindex_threshold: 20     # PDF pages threshold for PageIndex
pageindex_api_key_env: ""   # Env var name for PageIndex Cloud API key (default: auto-detect PAGEINDEX_API_KEY)
```
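
As an illustration of the `api_key_env` indirection above (the config stores the *name* of an environment variable rather than the key itself), here is a minimal sketch; `resolve_api_key` is a hypothetical helper, not part of OpenKB:

```python
import os

# Minimal sketch of api_key_env indirection: the config names an
# environment variable, and the key is read from the environment.
# resolve_api_key is a hypothetical helper, not OpenKB's API.
config = {"model": "gpt-5.4", "api_key_env": "LLM_API_KEY"}

def resolve_api_key(cfg):
    return os.environ.get(cfg["api_key_env"])

os.environ["LLM_API_KEY"] = "sk-demo"
print(resolve_api_key(config))  # sk-demo
```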
### PageIndex integration
For long documents, relying solely on summaries often leads to information loss.
We integrate [PageIndex](https://github.com/VectifyAI/PageIndex) into the knowledge base to provide structured, context-aware retrieval for long documents, avoiding this loss.
154
+
155
+
By default, PageIndex runs locally using the open-source version, with no external dependencies required.
156
+
157
+
#### Optional: Cloud Support
158
+
159
+
For large or complex PDFs, [PageIndex Cloud](https://docs.pageindex.ai/) can be used to access additional capabilities, including:
160
+
161
+
- OCR support for scanned PDFs (via hosted VLM models)
- Faster structure generation
- Scalable indexing for large documents
Set `PAGEINDEX_API_KEY` in your `.env` to enable cloud features:
```bash
PAGEINDEX_API_KEY=your_pageindex_api_key
```
### AGENTS.md
The `wiki/AGENTS.md` file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized.
At runtime, the LLM reads `AGENTS.md` from disk, so your edits take effect immediately.
### Using with Obsidian
OpenKB's wiki is a directory of Markdown files with `[[wikilinks]]`. Obsidian renders it natively.
1. Open `wiki/` as an Obsidian vault
2. Browse summaries, concepts, and explorations
| Supported formats | Web clipper → .md | PDF, Word, PPT, Excel, HTML, text, CSV, .md |
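
Because the wiki is plain Markdown, external tools can work with the `[[wikilinks]]` directly. A minimal parsing sketch, for example as a starting point for custom tooling (the regex and helper name are illustrative, not OpenKB's lint implementation):

```python
import re

# Match [[Target]] and [[Target|display text]], capturing the target page.
# Illustrative only; OpenKB's own link handling may differ.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]")

def extract_links(markdown):
    """Return the target pages of all wikilinks in a Markdown string."""
    return WIKILINK.findall(markdown)

page = "See [[PageIndex]] and [[RAG|traditional RAG]] for background."
print(extract_links(page))  # ['PageIndex', 'RAG']
```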
Contributions are welcome! Please submit a pull request, or open an [issue](https://github.com/VectifyAI/OpenKB/issues) for bugs or feature requests. For larger changes, consider opening an issue first to discuss the approach.