<h3 align="center">LLM-Powered Wiki Knowledge Base</h3>
<p align="center"><i>Scale to long documents • Reasoning-based retrieval • Native multi-modality • No Vector DB</i></p>
</div>
---
# 📑 Introduction to OpenKB
Andrej Karpathy [described](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) a workflow where LLMs compile raw documents into a structured, interlinked markdown wiki: summaries, concept pages, cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.
**OpenKB (Open Knowledge Base)** is an open-source CLI that implements this workflow, powered by [**PageIndex**](https://github.com/VectifyAI/PageIndex) for vectorless long document retrieval.
### Why not just traditional RAG?
Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed.
### Features
- **Any format** — PDF, Word, PowerPoint, Excel, HTML, Markdown, text, CSV, and more via markitdown
- **Scale to long documents** — Long and complex documents are handled via [PageIndex](https://github.com/VectifyAI/PageIndex) tree indexing, enabling better long-context retrieval
- **Native multi-modality** — Retrieves and understands figures, tables, and images, not just text
- **Auto wiki** — LLM generates summaries, concept pages, and cross-links. You curate sources; the LLM does the rest
- **Query** — Ask questions against your wiki. The LLM navigates your compiled knowledge to answer
- **Lint** — Health checks find contradictions, gaps, orphans, and stale content
```bash
# 1. Create a knowledge base
mkdir my-kb && cd my-kb

# 2. Initialize
okb init

# 3. Add documents
okb add paper.pdf
okb add ~/papers/    # Add a whole directory
okb add article.html

# 4. Ask questions
okb query "What are the main findings?"

# 5. Check wiki health
okb lint
```
### Set up your LLM
OpenKB comes with [multi-LLM support](https://docs.litellm.ai/docs/providers) (e.g., OpenAI, Claude, Gemini) via [LiteLLM](https://github.com/BerriAI/litellm) (pinned to a [safe version](https://docs.litellm.ai/blog/security-update-march-2026)).
Create a `.env` file with your LLM API key. Choose your LLM during `okb init` or edit [`.okb/config.yaml`](#configuration).
```bash
LLM_API_KEY=your_llm_api_key
```
# 🧩 How It Works
```
raw/                         You drop files here
 │
 ├─ Short docs ──→ markitdown ──→ LLM reads full text
 │                                     │
 ├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
 │                                     │
 │                                     ▼
 │                              Wiki Compilation
 │
wiki/
```
### Two paths, one wiki
| | Short documents | Long documents (PDF ≥ 20 pages) |
|---|---|---|
| **Convert** | markitdown → Markdown | PageIndex → tree index + summaries |
| **Images** | Extracted inline (pymupdf) | Extracted by PageIndex |
Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries. The LLM reads the tree instead of the full text, avoiding context window limits while retaining structural understanding.
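
The two-path routing above can be sketched as a small function. This is only an illustration of the rule described in the table; the function name, return values, and the assumption that the page count is already known are not OpenKB's actual internals:

```python
# Hypothetical sketch of the short-vs-long routing described above.
# Names and return values are illustrative; OpenKB's internals may differ.
def choose_path(filename, page_count, threshold=20):
    """Route PDFs at or above the page threshold to PageIndex, else markitdown."""
    if filename.lower().endswith(".pdf") and page_count >= threshold:
        return "pageindex"    # tree index + summaries
    return "markitdown"       # full-text Markdown conversion

print(choose_path("report.pdf", 120))  # pageindex
print(choose_path("notes.md", 3))      # markitdown
```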
### The wiki compiles knowledge
When you add a document, the LLM:
3. Creates or updates concepts with cross-document synthesis
4. Updates the **index** and **log**
A single source might touch 10-15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.
# 📦 Usage
Generated by `okb init`, stored in `.okb/config.yaml`:
```yaml
model: gpt-5.4 # LLM model (any LiteLLM-supported provider)
api_key_env: LLM_API_KEY    # Environment variable for LLM API key
language: en # Wiki output language
pageindex_threshold: 20     # PDF pages threshold for PageIndex
pageindex_api_key_env: ""   # Env var name for PageIndex Cloud API key (default: auto-detect PAGEINDEX_API_KEY)
```
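
As an illustration of the `api_key_env` indirection above (the config stores the *name* of an environment variable rather than the key itself), here is a minimal sketch; `resolve_api_key` is a hypothetical helper, not part of OpenKB:

```python
import os

# Minimal sketch of api_key_env indirection: the config names an
# environment variable, and the key is read from the environment.
# resolve_api_key is a hypothetical helper, not OpenKB's API.
config = {"model": "gpt-5.4", "api_key_env": "LLM_API_KEY"}

def resolve_api_key(cfg):
    return os.environ.get(cfg["api_key_env"])

os.environ["LLM_API_KEY"] = "sk-demo"
print(resolve_api_key(config))  # sk-demo
```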
### PageIndex integration
For long documents, relying solely on summaries often leads to information loss.
We integrate [PageIndex](https://github.com/VectifyAI/PageIndex) into the knowledge base to provide structured, context-aware retrieval for long documents, avoiding this loss.
154
+
155
+
By default, PageIndex runs locally using the open-source version, with no external dependencies required.
156
+
157
+
#### Optional: Cloud Support
158
+
159
+
For large or complex PDFs, [PageIndex Cloud](https://docs.pageindex.ai/) can be used to access additional capabilities, including:
160
+
161
+
- OCR support for scanned PDFs (via hosted VLM models)
- Faster structure generation
- Scalable indexing for large documents
Set `PAGEINDEX_API_KEY` in your `.env` to enable cloud features:
```bash
PAGEINDEX_API_KEY=your_pageindex_api_key
```
### AGENTS.md
The `wiki/AGENTS.md` file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized.
At runtime, the LLM reads `AGENTS.md` from disk, so your edits take effect immediately.
### Using with Obsidian
OpenKB's wiki is a directory of Markdown files with `[[wikilinks]]`. Obsidian renders it natively.
1. Open `wiki/` as an Obsidian vault
2. Browse summaries, concepts, and explorations
| Supported formats | Web clipper → .md | PDF, Word, PPT, Excel, HTML, text, CSV, .md |
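
Because the wiki is plain Markdown, external tools can work with the `[[wikilinks]]` directly. A minimal parsing sketch, for example as a starting point for custom tooling (the regex and helper name are illustrative, not OpenKB's lint implementation):

```python
import re

# Match [[Target]] and [[Target|display text]], capturing the target page.
# Illustrative only; OpenKB's own link handling may differ.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]")

def extract_links(markdown):
    """Return the target pages of all wikilinks in a Markdown string."""
    return WIKILINK.findall(markdown)

page = "See [[PageIndex]] and [[RAG|traditional RAG]] for background."
print(extract_links(page))  # ['PageIndex', 'RAG']
```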
Contributions are welcome! Please submit a pull request, or open an [issue](https://github.com/VectifyAI/OpenKB/issues) for bugs or feature requests. For larger changes, consider opening an issue first to discuss the approach.