
Commit dde8dfc: feat: unified LLM_API_KEY and docs updates
1 parent d4c4b8d
8 files changed: 82 additions & 59 deletions

.env.example

Lines changed: 5 additions & 2 deletions
@@ -1,5 +1,8 @@
-# LLM API key (required)
-OPENAI_API_KEY=sk-your-key-here
+# LLM API key (required) — works with any LiteLLM-supported provider
+# OpenAI: LLM_API_KEY=sk-...
+# Anthropic: LLM_API_KEY=sk-ant-...
+# Gemini: LLM_API_KEY=AIza...
+LLM_API_KEY=your-key-here
 
 # PageIndex Cloud API key (optional, leave empty for local PageIndex)
 # Get your key at https://pageindex.dev
README.md

Lines changed: 52 additions & 51 deletions
@@ -1,33 +1,32 @@
 <div align="center">
 
 <a href="https://openkb.ai">
-  <img src="https://docs.pageindex.ai/images/general/openkb.png" alt="OpenKB (by PageIndex)" />
+  <img src="https://docs.pageindex.ai/images/openkb.png" alt="OpenKB (by PageIndex)" />
 </a>
 
-# OpenKB (Open Knowledge Base)
+# OpenKB: Open LLM Knowledge Base
 
-<h3 align="center">LLM-Powered Wiki Knowledge Base</h3>
-
-<p align="center"><i>Scale to long documents&nbsp;&nbsp;Reasoning-based retrieval&nbsp;&nbsp;Native multimodality support&nbsp;&nbsp;No Vector DB</i></p>
+<p align="center"><i>Scale to long documents&nbsp;&nbsp;Reasoning-based retrieval&nbsp;&nbsp;Native multi-modality&nbsp;&nbsp;No Vector DB</i></p>
 
 </div>
 
 ---
 
 # 📑 Introduction to OpenKB
 
-Andrej Karpathy [described](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) a workflow where LLMs compile raw documents into a structured, interlinked markdown wikisummaries, concept pages, cross-references all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.
+Andrej Karpathy [described](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) a workflow where LLMs compile raw documents into a structured, interlinked markdown wiki; summaries, concept pages, cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.
 
-**OpenKB** (Open Knowledge Base) is an open-source CLI that implements this workflow, powered by [PageIndex](https://github.com/VectifyAI/PageIndex) for long document understanding and [markitdown](https://github.com/microsoft/markitdown) for broad format support.
+**OpenKB (Open Knowledge Base)** is an open-source CLI that implements this workflow, powered by [**PageIndex**](https://github.com/VectifyAI/PageIndex) for vectorless long document retrieval.
 
-### Why not just RAG?
+### Why not just traditional RAG?
 
-RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed.
+Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed.
 
 ### Features
 
 - **Any format** — PDF, Word, PowerPoint, Excel, HTML, Markdown, text, CSV, and more via markitdown
-- **Long documents** — Books and reports that exceed LLM context windows are handled via [PageIndex](https://github.com/VectifyAI/PageIndex) tree indexing
+- **Scale to long documents** — Long and complex documents are handled via [PageIndex](https://github.com/VectifyAI/PageIndex) tree indexing, enabling better long-context retrieval
+- **Native multi-modality** — Retrieves and understands figures, tables, and images, not just text
 - **Auto wiki** — LLM generates summaries, concept pages, and cross-links. You curate sources; the LLM does the rest
 - **Query** — Ask questions against your wiki. The LLM navigates your compiled knowledge to answer
 - **Lint** — Health checks find contradictions, gaps, orphans, and stale content
@@ -47,38 +46,40 @@ pip install openkb
 ```bash
 # 1. Create a knowledge base
 mkdir my-kb && cd my-kb
+
+# 2. Initialize
 okb init
 
-# 2. Add documents
+# 3. Add documents
 okb add paper.pdf
 okb add ~/papers/       # Add a whole directory
 okb add article.html
 
-# 3. Ask questions
+# 4. Ask questions
 okb query "What are the main findings?"
 
-# 4. Check wiki health
+# 5. Check wiki health
 okb lint
 ```
 
 ### Set up your LLM
 
-Create a `.env` file with your API key:
+OpenKB comes with [multi-LLM support](https://docs.litellm.ai/docs/providers) (e.g., OpenAI, Claude, Gemini) via [LiteLLM](https://github.com/BerriAI/litellm) (pinned to a [safe version](https://docs.litellm.ai/blog/security-update-march-2026)).
+
+Create a `.env` file with your LLM API key. Choose your LLM during `okb init` or edit [`.okb/config.yaml`](#configuration).
 
 ```bash
-OPENAI_API_KEY=sk-...
+LLM_API_KEY=your_llm_api_key
 ```
 
-OpenKB uses [LiteLLM](https://docs.litellm.ai/docs/providers) — any provider works. Set the model during `okb init` or edit `.okb/config.yaml`.
-
 # 🧩 How It Works
 
 ```
 raw/                             You drop files here
  │
 ├─ Short docs ──→ markitdown ──→ LLM reads full text
 │                                      │
-├─ Long PDFs ──→ PageIndex ────→ LLM reads tree summaries
+├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
 │                                      │
 │                                      ▼
 │                               Wiki Compilation
@@ -98,38 +99,14 @@ wiki/
 
 ### Two paths, one wiki
 
-| | Short documents | Long documents (PDF ≥ 50 pages) |
+| | Short documents | Long documents (PDF ≥ 20 pages) |
 |---|---|---|
 | **Convert** | markitdown → Markdown | PageIndex → tree index + summaries |
 | **Images** | Extracted inline (pymupdf) | Extracted by PageIndex |
 | **LLM reads** | Full text | Tree summaries only |
 | **Result** | summary + concepts | summary + concepts |
 
-Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries — the LLM reads the tree instead of the full text, avoiding context window limits while retaining structural understanding.
-
-
-# PageIndex integration
-For long documents, relying solely on summaries often leads to information loss.
-We integrate [PageIndex](https://github.com/VectifyAI/PageIndex) into the knowledge base to provide structured, context-aware retrieval for long documents—avoiding the information loss common in summary-based approaches.
-
-By default, PageIndex runs locally using the open-source version, with no external dependencies required.
-
-### Optional: Cloud Support
-
-For large or complex PDFs, [PageIndex Cloud](https://docs.pageindex.ai/) can be used to access additional capabilities, including:
-
-- OCR support for scanned PDFs (via hosted VLM models)
-- Faster structure generation
-- Scalable indexing for large documents
-
-
-Set `PAGEINDEX_API_KEY` in your `.env` to enable cloud features:
-
-```
-PAGEINDEX_API_KEY=your_api_key
-```
-
----
+Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries. The LLM reads the tree instead of the full text, avoiding context window limits while retaining structural understanding.
 
 ### The wiki compiles knowledge
 
@@ -140,7 +117,7 @@ When you add a document, the LLM:
 3. Creates or updates concepts with cross-document synthesis
 4. Updates the **index** and **log**
 
-A single source might touch 10-15 wiki pages. Knowledge accumulates each document enriches the existing wiki rather than sitting in isolation.
+A single source might touch 10-15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.
 
 # 📦 Usage
 
@@ -164,21 +141,42 @@ Generated by `okb init`, stored in `.okb/config.yaml`:
 
 ```yaml
 model: gpt-5.4                # LLM model (any LiteLLM-supported provider)
-api_key_env: OPENAI_API_KEY   # Environment variable for API key
+api_key_env: LLM_API_KEY      # Environment variable for LLM API key
 language: en                  # Wiki output language
-pageindex_threshold: 50       # PDF pages threshold for PageIndex
+pageindex_threshold: 20       # PDF pages threshold for PageIndex
 pageindex_api_key_env: ""     # Env var name for PageIndex Cloud API key (default: auto-detect PAGEINDEX_API_KEY)
 ```
 
+### PageIndex integration
+
+For long documents, relying solely on summaries often leads to information loss.
+We integrate [PageIndex](https://github.com/VectifyAI/PageIndex) into the knowledge base to provide structured, context-aware retrieval for long documents, avoiding the information loss common in summary-based approaches.
+
+By default, PageIndex runs locally using the open-source version, with no external dependencies required.
+
+#### Optional: Cloud Support
+
+For large or complex PDFs, [PageIndex Cloud](https://docs.pageindex.ai/) can be used to access additional capabilities, including:
+
+- OCR support for scanned PDFs (via hosted VLM models)
+- Faster structure generation
+- Scalable indexing for large documents
+
+Set `PAGEINDEX_API_KEY` in your `.env` to enable cloud features:
+
+```
+PAGEINDEX_API_KEY=your_pageindex_api_key
+```
+
 ### AGENTS.md
 
 The `wiki/AGENTS.md` file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized.
 
-At runtime, the LLM reads `AGENTS.md` from disk your edits take effect immediately.
+At runtime, the LLM reads `AGENTS.md` from disk, so your edits take effect immediately.
 
 ### Using with Obsidian
 
-OpenKB's wiki is a directory of Markdown files with `[[wikilinks]]` Obsidian renders it natively.
+OpenKB's wiki is a directory of Markdown files with `[[wikilinks]]`. Obsidian renders it natively.
 
 1. Open `wiki/` as an Obsidian vault
 2. Browse summaries, concepts, and explorations
@@ -196,7 +194,6 @@ OpenKB's wiki is a directory of Markdown files with `[[wikilinks]]` — Obsidian
 | Supported formats | Web clipper → .md | PDF, Word, PPT, Excel, HTML, text, CSV, .md |
 | Wiki compilation | LLM agent | LLM agent (same) |
 | Q&A | Query over wiki | Wiki + PageIndex retrieval |
-| Open source | No | Yes |
 
 ### Tech Stack
 
@@ -207,9 +204,13 @@ OpenKB's wiki is a directory of Markdown files with `[[wikilinks]]` — Obsidian
 - [Click](https://click.palletsprojects.com/) — CLI framework
 - [watchdog](https://github.com/gorakhargosh/watchdog) — Filesystem monitoring
 
+### Contributing
+
+Contributions are welcome! Please submit a pull request, or open an [issue](https://github.com/VectifyAI/OpenKB/issues) for bugs or feature requests. For larger changes, consider opening an issue first to discuss the approach.
+
 ### License
 
-Apache 2.0 — see [LICENSE](LICENSE)
+Apache 2.0. See [LICENSE](LICENSE).
 
 ### Acknowledgments
config.yaml.example

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+model: gpt-5.4                # LLM model (any LiteLLM-supported provider)
+api_key_env: LLM_API_KEY      # Environment variable for API key
+language: en                  # Wiki output language
+pageindex_threshold: 20       # PDF pages threshold for PageIndex
+pageindex_api_key_env: ""     # Env var name for PageIndex Cloud API key
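Editor's note: this example file mirrors `DEFAULT_CONFIG` in `openkb/config.py`. A sketch of how missing keys could fall back to those defaults (the naive `key: value` parsing here is an assumption for illustration; the real `load_config` presumably uses a YAML parser):

```python
from pathlib import Path

DEFAULT_CONFIG = {
    "model": "gpt-5.4",
    "api_key_env": "LLM_API_KEY",
    "language": "en",
    "pageindex_threshold": 20,
    "pageindex_api_key_env": "",
}

def load_config(path: Path) -> dict:
    """Return defaults overlaid with whatever keys the file provides."""
    config = dict(DEFAULT_CONFIG)
    if path.exists():
        for raw in path.read_text().splitlines():
            line = raw.split("#", 1)[0].strip()   # drop trailing comments
            if ":" in line:
                key, _, value = line.partition(":")
                config[key.strip()] = value.strip()
    return config

# A missing file yields pure defaults, matching test_load_missing_file_returns_defaults:
print(load_config(Path("no-such-config.yaml"))["pageindex_threshold"])  # 20
```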

openkb/cli.py

Lines changed: 14 additions & 0 deletions
@@ -6,7 +6,10 @@
 import time
 from pathlib import Path
 
+import os
+
 import click
+import litellm
 from dotenv import load_dotenv
 
 from openkb.config import DEFAULT_CONFIG, load_config, save_config
@@ -16,6 +19,14 @@
 
 load_dotenv()
 
+
+def _setup_llm_key(config: dict) -> None:
+    """Set LiteLLM API key from the configured env var (default: LLM_API_KEY)."""
+    env_var = config.get("api_key_env", DEFAULT_CONFIG["api_key_env"])
+    api_key = os.environ.get(env_var, "")
+    if api_key:
+        litellm.api_key = api_key
+
 # Supported document extensions for the `add` command
 SUPPORTED_EXTENSIONS = {
     ".pdf", ".md", ".markdown", ".docx", ".pptx", ".xlsx",
@@ -65,6 +76,7 @@ def _add_single_file(file_path: Path, kb_dir: Path) -> None:
 
     okb_dir = kb_dir / ".okb"
     config = load_config(okb_dir / "config.yaml")
+    _setup_llm_key(config)
     model: str = config.get("model", DEFAULT_CONFIG["model"])
     registry = HashRegistry(okb_dir / "hashes.json")
 
@@ -243,6 +255,7 @@ def query(question, save):
 
     okb_dir = kb_dir / ".okb"
     config = load_config(okb_dir / "config.yaml")
+    _setup_llm_key(config)
     model: str = config.get("model", DEFAULT_CONFIG["model"])
 
     try:
@@ -307,6 +320,7 @@ def lint(fix):
 
     okb_dir = kb_dir / ".okb"
     config = load_config(okb_dir / "config.yaml")
+    _setup_llm_key(config)
     model: str = config.get("model", DEFAULT_CONFIG["model"])
 
     # Structural lint
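Editor's note: the `_setup_llm_key` helper added above resolves the key indirectly, since the config names an env var and only a non-empty value is pushed into `litellm.api_key`. A self-contained sketch of the same pattern, with a stand-in object replacing the real `litellm` module so the sketch runs without the dependency:

```python
import os

DEFAULT_CONFIG = {"api_key_env": "LLM_API_KEY"}

class _FakeLiteLLM:
    """Stand-in for the litellm module's module-level `api_key` setting."""
    api_key = None

litellm = _FakeLiteLLM()

def setup_llm_key(config: dict) -> None:
    """Read the API key from whichever env var the config names."""
    env_var = config.get("api_key_env", DEFAULT_CONFIG["api_key_env"])
    api_key = os.environ.get(env_var, "")
    if api_key:  # an unset or empty var leaves the existing key untouched
        litellm.api_key = api_key

# A config can point at any variable name, not just LLM_API_KEY:
os.environ["MY_PROVIDER_KEY"] = "sk-demo"
setup_llm_key({"api_key_env": "MY_PROVIDER_KEY"})
print(litellm.api_key)  # sk-demo
```

Because the helper is called at the top of `add`, `query`, and `lint`, a key change in `.env` takes effect on the next command without any other state to refresh.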

openkb/config.py

Lines changed: 2 additions & 2 deletions
@@ -7,9 +7,9 @@
 
 DEFAULT_CONFIG: dict[str, Any] = {
     "model": "gpt-5.4",
-    "api_key_env": "OPENAI_API_KEY",
+    "api_key_env": "LLM_API_KEY",
     "language": "en",
-    "pageindex_threshold": 50,
+    "pageindex_threshold": 20,
     "pageindex_api_key_env": "",  # Set to env var name (e.g. PAGEINDEX_API_KEY) to use cloud PageIndex
 }
 

openkb/converter.py

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ def convert_document(src: Path, kb_dir: Path) -> ConvertResult:
     # ------------------------------------------------------------------
     okb_dir = kb_dir / ".okb"
     config = load_config(okb_dir / "config.yaml")
-    threshold: int = config.get("pageindex_threshold", 50)
+    threshold: int = config.get("pageindex_threshold", 20)
     registry = HashRegistry(okb_dir / "hashes.json")
 
     # ------------------------------------------------------------------
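Editor's note: the lowered default changes which converter a PDF is routed to. A minimal sketch of the routing decision, assuming the "PDF ≥ 20 pages" rule stated in the README table (the function name is hypothetical; the real logic lives inside `convert_document`):

```python
def choose_converter(page_count: int, threshold: int = 20) -> str:
    """Route PDFs at or above the page threshold to PageIndex tree indexing."""
    return "pageindex" if page_count >= threshold else "markitdown"

print(choose_converter(5))    # markitdown  (matches the patched test fixture)
print(choose_converter(120))  # pageindex
```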

tests/test_config.py

Lines changed: 2 additions & 2 deletions
@@ -13,9 +13,9 @@ def test_default_config_keys():
 
 def test_default_config_values():
     assert DEFAULT_CONFIG["model"] == "gpt-5.4"
-    assert DEFAULT_CONFIG["api_key_env"] == "OPENAI_API_KEY"
+    assert DEFAULT_CONFIG["api_key_env"] == "LLM_API_KEY"
     assert DEFAULT_CONFIG["language"] == "en"
-    assert DEFAULT_CONFIG["pageindex_threshold"] == 50
+    assert DEFAULT_CONFIG["pageindex_threshold"] == 20
 
 
 def test_load_missing_file_returns_defaults(tmp_path):

tests/test_converter.py

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ def test_short_pdf_converted_via_markitdown(self, kb_dir, tmp_path):
         patch("openkb.converter.MarkItDown") as mock_mid_cls,
     ):
         fake_doc = MagicMock()
-        fake_doc.page_count = 5  # below default threshold of 50
+        fake_doc.page_count = 5  # below default threshold of 20
         fake_doc.__enter__ = MagicMock(return_value=fake_doc)
         fake_doc.__exit__ = MagicMock(return_value=False)
         mock_mu.return_value = fake_doc
