<p align="center"><i>Scale to long documents • Reasoning-based retrieval • Native multi-modality • No Vector DB</i></p>
</div>
---
# 📑 What is OpenKB
**OpenKB (Open Knowledge Base)** is an open-source CLI that compiles raw documents into a structured, interlinked wiki-style knowledge base using LLMs, powered by [**PageIndex**](https://github.com/VectifyAI/PageIndex) for vectorless long document retrieval.
The idea is based on a [concept](https://x.com/karpathy/status/2039805659525644595) described by Andrej Karpathy: LLMs generate summaries, concept pages, and cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.
### Why not just traditional RAG?
Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates.
### Features
- **Any format** — PDF, Word, PowerPoint, Excel, HTML, Markdown, text, CSV, and more via markitdown
- **Scale to long documents** — Long and complex documents are handled via [PageIndex](https://github.com/VectifyAI/PageIndex) tree indexing, enabling accurate, vectorless long-context retrieval
- **Native multi-modality** — Retrieves and understands figures, tables, and images, not just text
- **Auto wiki** — LLM generates summaries, concept pages, and cross-links. You curate sources; the LLM does the rest
- **Query** — Ask questions against your wiki. The LLM navigates your compiled knowledge to answer
Install via pip: `pip install openkb`
### Quick start
```bash
# 1. Create a directory for your knowledge base
mkdir my-kb && cd my-kb
# 2. Initialize the knowledge base
51
+
openkb init
# 3. Add documents
openkb add paper.pdf
openkb add ~/papers/ # Add a whole directory
openkb add article.html
# 4. Ask questions
openkb query "What are the main findings?"
# 5. Check wiki health
openkb lint
```
### Set up your LLM
OpenKB comes with [multi-LLM support](https://docs.litellm.ai/docs/providers) (e.g., OpenAI, Claude, Gemini) via [LiteLLM](https://github.com/BerriAI/litellm) (pinned to a [safe version](https://docs.litellm.ai/blog/security-update-march-2026)).
Set your model during `openkb init`, or in [`.openkb/config.yaml`](#configuration), using `provider/model` LiteLLM format (like `anthropic/claude-sonnet-4-6`). OpenAI models can omit the prefix (like `gpt-5.4`).
Create a `.env` file with your LLM API key:
```bash
LLM_API_KEY=your_llm_api_key
```
# 🧩 How It Works
### Architecture
```
raw/                          You drop files here
│
├─ Short docs ──→ markitdown ──→ LLM reads in full
├─ Long PDFs ───→ PageIndex ───→ LLM reads document trees
│                        │
│                        ▼
│              Wiki Compilation (using LLM)
│                        │
▼                        ▼
wiki/
└── reports/                  Lint reports
```
### Short vs. long document handling
| | Short documents | Long documents (PDF ≥ 20 pages) |
|---|---|---|
| **Convert** | markitdown → Markdown | PageIndex → tree index + summaries |
| **Images** | Extracted inline (pymupdf) | Extracted by PageIndex |
Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries. The LLM reads the tree instead of the full text, enabling better retrieval from long documents.
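The split above amounts to a simple routing rule. The function and parameter names below are illustrative, not OpenKB's internals; the 20-page cutoff matches the default `pageindex_threshold`:

```python
PAGEINDEX_THRESHOLD = 20  # default pageindex_threshold in .openkb/config.yaml

def route(filename: str, page_count: int = 0) -> str:
    """Pick the pipeline for a document: tree-index long PDFs, convert the rest."""
    if filename.lower().endswith(".pdf") and page_count >= PAGEINDEX_THRESHOLD:
        return "pageindex"   # tree index + summaries
    return "markitdown"      # direct conversion to Markdown

print(route("paper.pdf", page_count=45))  # pageindex
print(route("notes.md"))                  # markitdown
```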
### The wiki compiles knowledge
A single source might touch 10-15 wiki pages. Knowledge accumulates: each document…
| Command | Description |
|---|---|
| `openkb init` | Initialize a new knowledge base (interactive) |
| `openkb add <file_or_dir>` | Add documents and compile to wiki |
| `openkb query "question"` | Ask a question against the knowledge base |
| `openkb query "question" --save` | Ask and save the answer to `wiki/explorations/` |
| `openkb watch` | Watch `raw/` and auto-compile new files |
| `openkb lint` | Run structural + knowledge health checks |
| `openkb list` | List indexed documents and concepts |
| `openkb status` | Show knowledge base stats |

<!-- | `openkb lint --fix` | Auto-fix what it can | -->
### Configuration
Settings are initialized by `openkb init` and stored in `.openkb/config.yaml`:
```yaml
model: gpt-5.4           # LLM model (any LiteLLM-supported provider)
language: en             # Wiki output language
pageindex_threshold: 20  # PDF pages threshold for PageIndex
```
Model names use `provider/model` LiteLLM [format](https://docs.litellm.ai/docs/providers) (OpenAI models can omit the prefix):

| Provider | Model example |
|---|---|
| OpenAI | `gpt-5.4` |
| Anthropic | `anthropic/claude-sonnet-4-6` |
| Gemini | `gemini/gemini-3.1-pro-preview` |
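The prefix rule is easy to reason about: a model string splits on the first `/`, and a bare name is treated as OpenAI. This sketch illustrates the convention only — it is not LiteLLM's actual resolution logic:

```python
def split_model(model: str) -> tuple[str, str]:
    """Split 'provider/model'; names without a prefix default to OpenAI."""
    provider, sep, name = model.partition("/")
    return (provider, name) if sep else ("openai", model)

print(split_model("anthropic/claude-sonnet-4-6"))  # ('anthropic', 'claude-sonnet-4-6')
print(split_model("gpt-5.4"))                      # ('openai', 'gpt-5.4')
```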
### PageIndex integration
Long documents are challenging for LLMs due to context limits, context rot, and summarization loss.
[PageIndex](https://github.com/VectifyAI/PageIndex) solves this with vectorless, reasoning-based retrieval — it builds a hierarchical tree index that the LLM reasons over to locate the relevant context.
PageIndex runs locally by default using the [open-source version](https://github.com/VectifyAI/PageIndex), with no external dependencies required.
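To make the idea concrete, here is a toy illustration of a hierarchical tree index and how it flattens into the outline an LLM reasons over. This is a sketch only, not PageIndex's actual schema:

```python
# Toy tree index: each node holds a section title, an LLM-written summary,
# and child sections (illustrative structure, not PageIndex's real format).
tree = {
    "title": "Annual Report",
    "summary": "Company performance for the fiscal year.",
    "children": [
        {"title": "Financials", "summary": "Revenue, costs, margins.", "children": []},
        {"title": "Risk Factors", "summary": "Market and regulatory risks.", "children": []},
    ],
}

def outline(node: dict, depth: int = 0) -> list[str]:
    """Flatten the tree into an indented outline; the LLM reads this, not the raw text."""
    lines = ["  " * depth + f"{node['title']}: {node['summary']}"]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(tree)))
```

Because the outline is tiny compared to the source document, a long PDF never has to fit in the context window; the LLM drills into only the sections whose summaries look relevant.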
#### Optional: Cloud Support
OpenKB's wiki is a directory of Markdown files with `[[wikilinks]]`. Obsidian renders it natively:
3. Use graph view to see knowledge connections
4. Use Obsidian Web Clipper to add web articles to `raw/`
Contributions are welcome! Please submit a pull request, or open an [issue](https://github.com/VectifyAI/OpenKB/issues) for bugs or feature requests. For larger changes, consider opening an issue first to discuss the approach.
Apache 2.0. See [LICENSE](LICENSE).
### Support Us
If you find OpenKB useful, give us a star 🌟 — and check out [PageIndex](https://github.com/VectifyAI/PageIndex) too!