ZSE is free and open source. If it helps you, consider sponsoring us.
Sponsors get exclusive business perks. Email us at zse@zyoralabs.com for details.
Ultra memory-efficient LLM inference engine with native INT4 CUDA kernels.
Run 32B models on 24GB GPUs. Run 7B models on 8GB GPUs. Fast cold starts, single-file deployment.
Train 7B models on 8GB GPUs. Train 70B models on 48GB GPUs.
from zse.format import load_zse_model
from zse.training import LoRAConfig, add_lora_to_model, save_lora_adapter
# Load INT4 model (uses ~6GB for 7B)
model, tokenizer, info = load_zse_model("model.zse", device="cuda")
# Add LoRA adapters (~1% trainable params)
config = LoRAConfig(rank=16, alpha=32)
model = add_lora_to_model(model, config)
# Train as usual with PyTorch/HuggingFace
# ... your training loop ...
# Save adapter (tiny file, ~25MB for 7B)
save_lora_adapter(model, "my_adapter.safetensors")

Install training dependencies: pip install zllm-zse[training]
Custom kernel backend:

| Model | File Size | VRAM | Speed | Cold Start | GPU |
|---|---|---|---|---|---|
| Qwen 7B | 5.57 GB | 5.67 GB | 37.2 tok/s | 5.7s | H200 |
| Qwen 14B | 9.95 GB | 10.08 GB | 20.8 tok/s | 10.5s | H200 |
| Qwen 32B | 19.23 GB | 19.47 GB | 10.9 tok/s | 20.4s | H200 |
| Qwen 72B | 41.21 GB | 41.54 GB | 6.3 tok/s | 51.8s | H200 |
bnb backend:

| Model | VRAM | Speed | Cold Start |
|---|---|---|---|
| Qwen 7B | 6.57 GB | 45.6 tok/s | 6.0s |
| Qwen 14B | 11.39 GB | 27.6 tok/s | 7.1s |
| Qwen 32B | 22.27 GB | 20.4 tok/s | 20.8s |
| Qwen 72B | 47.05 GB | 16.4 tok/s | 53.0s |
| Model | Custom Kernel VRAM | bnb Backend VRAM | Savings |
|---|---|---|---|
| 7B | 5.67 GB | 6.57 GB | -0.90 GB (14%) |
| 14B | 10.08 GB | 11.39 GB | -1.31 GB (12%) |
| 32B | 19.47 GB | 22.27 GB | -2.80 GB (13%) |
| 72B | 41.54 GB | 47.05 GB | -5.51 GB (12%) |
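The Savings column follows directly from the two VRAM columns; a quick sanity check in Python (values copied from the table above):

```python
# Reproduce the "Savings" column: custom kernel VRAM vs bnb backend VRAM.
pairs = {  # model: (custom_kernel_gb, bnb_gb), from the table above
    "7B": (5.67, 6.57),
    "14B": (10.08, 11.39),
    "32B": (19.47, 22.27),
    "72B": (41.54, 47.05),
}
for model, (custom, bnb) in pairs.items():
    saved = bnb - custom
    pct = saved / bnb * 100
    print(f"{model}: -{saved:.2f} GB ({pct:.0f}%)")
```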
| GPU | VRAM | Max Model |
|---|---|---|
| RTX 3070/4070 | 8GB | 7B |
| RTX 3080 | 12GB | 14B |
| RTX 3090/4090 | 24GB | 32B |
| A100-40GB | 40GB | 32B |
| A100-80GB / H200 | 80-141GB | 72B |
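As a rough rule of thumb, the table above can be expressed as a lookup. A sketch (`max_model_for_vram` is a hypothetical helper for illustration, not part of the ZSE API; thresholds follow the GPU classes above):

```python
# Map available VRAM (GB) to the largest model class from the table above.
def max_model_for_vram(vram_gb: float) -> str:
    tiers = [(8, "7B"), (12, "14B"), (24, "32B"), (48, "72B")]
    best = "none"
    for needed, model in tiers:
        if vram_gb >= needed:
            best = model
    return best

print(max_model_for_vram(24))  # 32B on a 24 GB card, per the table
```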
- Single .zse File: Model + tokenizer + config in one file
- No Network Calls: Everything embedded, works offline
- ZSE Custom Kernel: Native INT4 inference with maximum VRAM efficiency
- Memory Efficient: 72B in 41GB, 32B in 19GB, 7B in 5.7GB VRAM
- Fast Cold Start: 5.7s for 7B, 20s for 32B, 52s for 72B
- Dual Backend: Custom kernel (default) or bnb backend (alternate)
- QLoRA Training: Fine-tune INT4 models with LoRA adapters (NEW in v1.4.0)
- Built-in RAG with .zpf: 25% fewer LLM tokens at 100% accuracy vs plain chunking
pip install zllm-zse

Requirements:
- Python 3.11+
- CUDA GPU (8GB+ VRAM recommended)
- bitsandbytes (auto-installed)
# Download ready-to-use .zse model (no GPU needed for conversion)
zse pull qwen-7b # 5.18 GB download
zse pull mistral-7b # 3.86 GB download
zse pull qwen-0.5b # 0.69 GB download
# Browse available models
zse list
zse list --cached

# Convert any HuggingFace model
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen7b.zse
zse convert Qwen/Qwen2.5-32B-Instruct -o qwen32b.zse
# Or in Python
from zse.format.writer import convert_model
convert_model("Qwen/Qwen2.5-7B-Instruct", "qwen7b.zse", quantization="int4")

from zse.format.reader_v2 import load_zse_model
# Load model (auto-detects optimal settings)
model, tokenizer, info = load_zse_model("qwen7b.zse")
# Generate
inputs = tokenizer("Write a poem about AI:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

zse serve qwen7b.zse --port 8000

import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")
response = client.chat.completions.create(
model="qwen7b",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

| Feature | HuggingFace | .zse Format |
|---|---|---|
| Cold start (7B) | 45s | 9s |
| Cold start (32B) | 120s | 24s |
| Network calls on load | Yes | No |
| Files to manage | Many | One |
| Quantization time | Runtime | Pre-done |
.zpf delivers 100% retrieval accuracy at 25% lower LLM token cost than plain chunking, tested on real noisy web content.
.zpf (Z Packed Format) is ZSE's built-in semantic document format for RAG. It compresses documents at write-time, stripping noise (cookie banners, nav, boilerplate, filler prose) while preserving what the LLM needs.
| Metric | .zpf | Plain Chunking |
|---|---|---|
| Correct answers | 10/10 | 10/10 |
| Tokens sent to LLM (avg/query) | 943 | 1,257 |
| Token reduction | 25% | – |
| Cost per 1M queries (GPT-4o) | $2,357 | $3,143 |
| Annual savings at 1M queries | $786 | – |
| Break-even | 1 query per document | – |
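The dollar rows follow from the token counts. A sketch assuming GPT-4o input pricing of $2.50 per million tokens (the rate implied by the table's figures); the results land within a dollar of the table's rounded values:

```python
# Reproduce the .zpf cost rows from the average tokens sent per query.
PRICE_PER_TOKEN = 2.50 / 1_000_000  # assumed GPT-4o input price, $/token
zpf_tokens, plain_tokens = 943, 1257

reduction = (plain_tokens - zpf_tokens) / plain_tokens
zpf_cost = zpf_tokens * PRICE_PER_TOKEN * 1_000_000      # cost per 1M queries
plain_cost = plain_tokens * PRICE_PER_TOKEN * 1_000_000

print(f"token reduction: {reduction:.0%}")
print(f"cost/1M queries: ${zpf_cost:,.2f} vs ${plain_cost:,.2f}")
print(f"annual savings at 1M queries: ${plain_cost - zpf_cost:,.2f}")
```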
# Ingest a document into RAG store
zse rag add paper.pdf --title "ML Paper"
# Semantic search
zse rag search "What is batch normalization?" -k 5
# Get token-budgeted LLM context
zse rag context "What is batch normalization?" --max-tokens 500
# List all documents
zse rag list
# Export to open formats (zero vendor lock-in)
zse rag export paper.zpf -f markdown -o paper.md
# Inspect .zpf file metadata
zse rag inspect paper.zpf
# Show store stats
zse rag stats
# Re-embed with a different model
zse rag reindex --model all-MiniLM-L6-v2
# Remove a document
zse rag remove <doc_id>

from zse.core.zrag.pipeline import RAGPipeline
pipeline = RAGPipeline(store_dir="./my_store")
pipeline.ingest("noisy_article.md")
# Get LLM-ready context (token-budgeted)
context = pipeline.get_context("What is transfer learning?", max_tokens=500)
# ~25% fewer tokens than plain chunking, same answer quality

- Semantic chunking: splits by content type (11 block types: CODE, TABLE, DEFINITION, PROCEDURE, etc.)
- 10-layer compression: strips filler phrases, verbose patterns, redundant qualifiers, and noise lines
- Contextual embedding: each block embeds with its parent section hierarchy for cross-section retrieval
- Hybrid retrieval: 0.6× embedding similarity + 0.4× BM25, with size normalization, block-type boosting, and identifier-aware scoring
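The hybrid weighting can be illustrated with a minimal sketch (illustrative only: `hybrid_score` is not part of the ZSE API, scores are assumed pre-normalized to [0, 1], and the size-normalization and boosting terms are omitted):

```python
# Minimal sketch of hybrid retrieval scoring:
# 0.6 x embedding similarity + 0.4 x BM25.
def hybrid_score(embedding_sim: float, bm25_score: float) -> float:
    return 0.6 * embedding_sim + 0.4 * bm25_score

candidates = {
    "block_a": (0.90, 0.20),  # strong semantic match, weak keyword match
    "block_b": (0.50, 0.95),  # weak semantic match, strong keyword match
}
ranked = sorted(candidates, key=lambda b: hybrid_score(*candidates[b]), reverse=True)
print(ranked[0])  # block_b: the keyword signal tips the combined score
```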
Run models across multiple GPUs to reduce per-GPU VRAM or serve larger models.
# Tensor Parallelism: shard weights across GPUs (NCCL all-reduce)
zse serve qwen-32b -tp 2 --port 8000
# Pipeline Parallelism: split layers across GPUs (NCCL send/recv)
zse serve qwen-72b -pp 4 --port 8000
# Hybrid TP+PP: 2D process grid
zse serve qwen-72b -tp 2 -pp 2 --port 8000
# Auto-detect: let ZSE pick the optimal layout
zse serve qwen-72b --multi-gpu --port 8000

| Config | Speed | VRAM/GPU | vs Single GPU |
|---|---|---|---|
| Single GPU | 23.1 tok/s | 5.22 GB | baseline |
| TP=2 | 18.6 tok/s | 3.76 GB | -28% VRAM |
| PP=2 | 23.1 tok/s | 2.67 GB | -49% VRAM |
| TP=2 × PP=2 | 20.6 tok/s | 1.6-2.1 GB | -65% VRAM |
When to use what:
- TP: Fastest inference, splits every layer. Best when GPUs are connected via NVLink.
- PP: Best VRAM reduction, near-perfect memory split. Works over PCIe.
- TP+PP: Maximum scale; run 72B on 4× consumer GPUs.
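The VRAM/GPU column is close to what a back-of-the-envelope estimate predicts: weight memory divided across GPUs plus a fixed per-GPU overhead. A sketch with assumed numbers, not ZSE's actual allocator (TP also keeps some tensors replicated, so its real figures sit a little higher than this estimate):

```python
# Rough per-GPU VRAM estimate for sharded inference: weights split across
# GPUs, plus a per-GPU overhead (activations, KV cache, CUDA context).
def vram_per_gpu(weights_gb: float, num_gpus: int, overhead_gb: float = 0.8) -> float:
    return weights_gb / num_gpus + overhead_gb

# E.g. an INT4 model with ~4.4 GB of weights (assumed value):
for gpus in (1, 2, 4):
    print(f"{gpus} GPU(s): ~{vram_per_gpu(4.4, gpus):.1f} GB/GPU")
```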
Train any model with QLoRA: LoRA adapters on top of a quantized INT4 base model.
# Install training dependencies
pip install zllm-zse[training]

from zse.format import load_zse_model, convert_model
from zse.training import (
LoRAConfig,
add_lora_to_model,
save_lora_adapter,
load_lora_adapter
)
import torch
# 1. Convert model to .zse (one-time)
convert_model("meta-llama/Llama-3-8B", "llama8b.zse", quantization="int4")
# 2. Load INT4 model
model, tokenizer, info = load_zse_model("llama8b.zse", device="cuda")
# 3. Add LoRA adapters
config = LoRAConfig(
rank=16, # LoRA rank (higher = more capacity)
alpha=32, # LoRA alpha (scaling factor)
dropout=0.05, # Dropout for regularization
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"] # Which layers
)
model = add_lora_to_model(model, config)
# 4. Train with standard PyTorch
optimizer = torch.optim.AdamW(
[p for p in model.parameters() if p.requires_grad],
lr=2e-4
)
for batch in dataloader:
loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
# 5. Save adapter (~25MB for 7B model)
save_lora_adapter(model, "my_adapter.safetensors")
# 6. Load adapter for inference
model, tokenizer, info = load_zse_model("llama8b.zse", lora="my_adapter.safetensors")

QLoRA VRAM Usage:
| Model | Base VRAM | + LoRA Training | Trainable Params |
|---|---|---|---|
| 7B | 6 GB | ~8 GB | 12M (0.2%) |
| 14B | 11 GB | ~14 GB | 25M (0.2%) |
| 32B | 20 GB | ~26 GB | 50M (0.2%) |
| 70B | 42 GB | ~52 GB | 100M (0.1%) |
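The trainable-parameter counts follow from the LoRA formula: adapting a d_out × d_in weight with rank r adds r·(d_in + d_out) parameters. A sketch with assumed Llama-7B-like dimensions (hidden size 4096, 32 layers); real counts come out lower when k/v projections use grouped-query attention with smaller output dims, which is roughly why the table shows ~12M for 7B:

```python
# LoRA adds two low-rank matrices A (r x d_in) and B (d_out x r) per
# adapted weight, i.e. r * (d_in + d_out) extra trainable parameters.
def lora_params(rank: int, d_in: int, d_out: int) -> int:
    return rank * (d_in + d_out)

hidden, layers, rank = 4096, 32, 16  # assumed dims; rank from the quickstart
targets = ["q_proj", "k_proj", "v_proj", "o_proj"]  # all hidden -> hidden here

total = layers * sum(lora_params(rank, hidden, hidden) for _ in targets)
print(f"~{total / 1e6:.1f}M trainable params")  # ~16.8M with these dims
```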
# Auto (default): Detect VRAM, pick optimal strategy
model, tok, info = load_zse_model("qwen7b.zse", cache_weights="auto")
# Force bnb mode (low VRAM, fast inference)
model, tok, info = load_zse_model("qwen7b.zse", cache_weights=False)
# Force FP16 cache (max speed, high VRAM)
model, tok, info = load_zse_model("qwen7b.zse", cache_weights=True)

# Full benchmark
python3 -c "
import time, torch
from zse.format.reader_v2 import load_zse_model
t0 = time.time()
model, tokenizer, info = load_zse_model('qwen7b.zse')
print(f'Load: {time.time()-t0:.1f}s, VRAM: {torch.cuda.memory_allocated()/1e9:.1f}GB')
inputs = tokenizer('Hello', return_tensors='pt').to('cuda')
model.generate(**inputs, max_new_tokens=10) # Warmup
prompt = 'Write a detailed essay about AI.'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
torch.cuda.synchronize()
t0 = time.time()
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
torch.cuda.synchronize()
tokens = out.shape[1] - inputs['input_ids'].shape[1]
print(f'{tokens} tokens in {time.time()-t0:.2f}s = {tokens/(time.time()-t0):.1f} tok/s')
"

# Model Management
zse pull <model> # Download pre-converted .zse model
zse list # Browse available models
zse list --cached # Show locally cached models
zse cached # Show cache details
zse convert <model_id> -o out.zse # Convert HuggingFace model
zse info <model> # Show model info
zse hardware # Check GPU/VRAM
# Serving
zse serve <model> --port 8000 # Single GPU
zse serve <model> -tp 2 --port 8000 # Tensor parallel (2 GPUs)
zse serve <model> -pp 2 --port 8000 # Pipeline parallel (2 GPUs)
zse serve <model> -tp 2 -pp 2 # Hybrid TP+PP (4 GPUs)
zse serve <model> --multi-gpu # Auto-detect optimal layout
zse chat <model> # Interactive chat
# RAG (.zpf)
zse rag add <file> [--title <t>] # Ingest document
zse rag search <query> [-k <n>] # Semantic search
zse rag context <query> [--max-tokens <n>] # Token-budgeted context
zse rag list # List documents
zse rag remove <doc_id> # Remove document
zse rag convert <file> [-o out.zpf] # Convert to .zpf only
zse rag inspect <zpf_path> # Show .zpf metadata
zse rag export <zpf> [-f fmt] # Export (markdown/jsonl/json)
zse rag stats # Store statistics
zse rag reindex [--model <name>] # Re-embed with new model
# Auth (for gated models)
zse login
zse logout

13 models ready for instant download from huggingface.co/zse-zllm:
| Model | Size | Pull Command |
|---|---|---|
| Qwen2.5-0.5B-Instruct | 0.69 GB | zse pull qwen-0.5b |
| TinyLlama-1.1B-Chat | 0.71 GB | zse pull tinyllama-1.1b |
| Qwen2.5-1.5B-Instruct | 1.51 GB | zse pull qwen-1.5b |
| Qwen2.5-3B-Instruct | 2.51 GB | zse pull qwen-3b |
| Qwen2.5-Coder-1.5B | 1.51 GB | zse pull qwen-coder-1.5b |
| DeepSeek-Coder-6.7B | 3.61 GB | zse pull deepseek-6.7b |
| Mistral-7B-Instruct | 3.86 GB | zse pull mistral-7b |
| Qwen2.5-7B-Instruct | 5.18 GB | zse pull qwen-7b |
| Qwen2.5-Coder-7B | 5.18 GB | zse pull qwen-coder-7b |
| Qwen2.5-14B-Instruct | 9.26 GB | zse pull qwen-14b |
| Qwen2.5-32B-Instruct | 17.9 GB | zse pull qwen-32b |
| Mixtral-8x7B-Instruct | 85.14 GB | zse pull mixtral-8x7b |
| Qwen2.5-72B-Instruct | 38.38 GB | zse pull qwen-72b |
- Conversion: Quantize HF model to INT4, pack weights, embed tokenizer + config
- Loading: Memory-map .zse file, load INT4 weights directly to GPU
- Inference: ZSE custom kernel (default) or bnb backend for matmul
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  HuggingFace    │────▶│   .zse File     │────▶│   GPU Model     │
│  Model (FP16)   │     │  (INT4 + tok)   │     │  (ZSE Kernel)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
   One-time               Single file             12% less VRAM
   conversion             ~0.5 bytes/param        vs bitsandbytes
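The ~0.5 bytes/param figure comes from packing two 4-bit weights into each byte. A minimal pack/unpack sketch (illustrative only; ZSE's actual layout, with quantization scales and zero-points, is more involved):

```python
# Pack pairs of INT4 values (0..15) into single bytes, low nibble first.
def pack_int4(values: list[int]) -> bytes:
    assert len(values) % 2 == 0 and all(0 <= v <= 15 for v in values)
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))

def unpack_int4(data: bytes) -> list[int]:
    out = []
    for byte in data:
        out.append(byte & 0x0F)         # low nibble
        out.append((byte >> 4) & 0x0F)  # high nibble
    return out

weights = [3, 12, 0, 15]
packed = pack_int4(weights)  # 2 bytes for 4 values -> 0.5 bytes/param
assert unpack_int4(packed) == weights
```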
Run local models with OpenClaw, the 24/7 AI assistant by @steipete.
# Start ZSE server
zse serve <model-name> --port 8000
# Configure OpenClaw to use local ZSE
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=zse

Or in OpenClaw's config.yaml:
llm:
  provider: openai-compatible
  api_base: http://localhost:8000/v1
  api_key: zse
  model: default

Benefits: 100% private, zero API costs, works offline, run ANY model.
# CPU
docker run -p 8000:8000 ghcr.io/zyora-dev/zse:latest
# GPU (NVIDIA)
docker run --gpus all -p 8000:8000 ghcr.io/zyora-dev/zse:gpu
# With model pre-loaded
docker run -p 8000:8000 -e ZSE_MODEL=Qwen/Qwen2.5-0.5B-Instruct ghcr.io/zyora-dev/zse:latest

See deploy/DEPLOY.md for the full deployment guide, including Runpod, Vast.ai, Railway, Render, and Kubernetes.
Apache 2.0
This project is supported by:
- Website: zllm.in
- Company: Zyora Labs
- Email: zse@zyoralabs.com
Made with ❤️ by Zyora Labs