A local benchmarking and orchestration tool that answers one question:
How many small Gemma models can my machine run concurrently before inference speed becomes unacceptable?
It launches N independent model worker processes, fires concurrent prompts at
each, samples GPU memory and CPU/RSS while they run, and produces JSON / CSV /
Markdown reports classifying each swarm size as interactive, usable, or
background-only based on configurable thresholds.
One crashed model (OOM, segfault inside llama.cpp, GGUF parse error) should
not kill the benchmark. Each worker is a separate python -m src.model_worker
process that loads its own model and exposes it via FastAPI on a private port.
The orchestrator talks HTTP to them and treats process death as data — failed
workers are recorded and the run continues.
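A minimal sketch of that failure handling, assuming an httpx-based request loop like the orchestrator's (names here are illustrative; the real logic lives in src/benchmark.py):

```python
import httpx

async def fire_request(client: httpx.AsyncClient, port: int, prompt: str,
                       results: list, failures: list) -> None:
    """Send one /generate request and record errors instead of raising."""
    try:
        resp = await client.post(f"http://127.0.0.1:{port}/generate",
                                 json={"prompt": prompt}, timeout=120.0)
        resp.raise_for_status()
        results.append(resp.json())
    except httpx.HTTPError as exc:
        # Connection refused, timeout, or 5xx: the worker is dead or wedged.
        # Record it as a data point and let the rest of the swarm keep going.
        failures.append({"port": port, "error": repr(exc)})
```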
```
┌───────────────┐   spawns N    ┌──────────────────┐
│ benchmark.py  │ ────────────► │ model_worker.py  │  (one per swarm slot)
│ (orchestrator)│               │  ├ FastAPI :PORT │
│               │ HTTP /generate│  └ Backend (llama_cpp / ollama)
│               │ ◄──────────── │    loads model on a chosen GPU
└──────┬────────┘               └──────────────────┘
       │ samples
       ▼
┌───────────────┐
│ gpu_monitor   │   pynvml (or nvidia-smi fallback) every 0.5s
└───────────────┘
```
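A worker, reduced to its skeleton, looks roughly like this (a sketch, not src/model_worker.py; the /health route and the environment-variable wiring are assumptions about how readiness and backend selection could work):

```python
# Hypothetical worker skeleton: one model per process, served over FastAPI.
import os
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
backend = None  # a Backend instance, created at startup

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 64

@app.on_event("startup")
def load_model() -> None:
    global backend
    from src.backends.base import get_backend                           # factory sketched below
    backend = get_backend(os.environ.get("SWARM_BACKEND", "llama_cpp"))  # assumed wiring
    backend.load()

@app.get("/health")
def health() -> dict:
    # Polled by the orchestrator until the worker counts as "ready".
    return {"ready": backend is not None}

@app.post("/generate")
def generate(req: GenerateRequest):
    return backend.generate(req.prompt, max_tokens=req.max_tokens)
```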
Backend is an abstract class with two concrete implementations
(llama_cpp_backend.py, ollama_backend.py). Adding vLLM later is a third
file plus one branch in get_backend().
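A sketch of what that interface amounts to (class and field names are inferred from the metrics and file names in this README, not copied from src/backends/base.py):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class GenerationResult:
    text: str
    prompt_tokens: int
    completion_tokens: int
    first_token_latency_s: float
    total_latency_s: float

class Backend(ABC):
    @abstractmethod
    def load(self) -> None: ...

    @abstractmethod
    def unload(self) -> None: ...

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int) -> GenerationResult: ...

def get_backend(name: str) -> Backend:
    # One branch per concrete implementation; adding vLLM means one more branch here.
    if name == "llama_cpp":
        from .llama_cpp_backend import LlamaCppBackend   # class names are hypothetical
        return LlamaCppBackend()
    if name == "ollama":
        from .ollama_backend import OllamaBackend
        return OllamaBackend()
    raise ValueError(f"unknown backend: {name}")
```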
```bash
cd swarm-bench
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Pick at least one backend:

# llama-cpp-python with CUDA (matches your CUDA version):
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

# OR Ollama (separate daemon):
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3:270m
```

See `scripts/download_models.md` for model URLs.
```bash
# 1) Confirm GPUs are visible
python -m src.main list-gpus

# 2) Single-config run (uses config.yaml)
python -m src.main benchmark --config config.yaml --report

# 3) Override config from the CLI (JSON-parsed values)
python -m src.main benchmark \
  --override 'swarm.workers=[1,2,4]' \
  --override 'generation.max_tokens=64' \
  --tag quick

# 4) Sweep every model in models.yaml
python -m src.main sweep --models models.yaml --max-workers 32

# 5) Re-render a report from a results JSON
python -m src.main report --results results/latest.json
```
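Each `--override` is a dotted key with a JSON-parsed value, so lists and numbers survive the command line intact. A minimal parser in that spirit (the real one lives in src/config.py and may differ):

```python
import json

def apply_override(config: dict, override: str) -> None:
    """Apply one 'dotted.key=json_value' override in place,
    e.g. 'swarm.workers=[1,2,4]' or 'generation.max_tokens=64'."""
    key, _, raw = override.partition("=")
    try:
        value = json.loads(raw)      # lists, numbers, booleans, quoted strings
    except json.JSONDecodeError:
        value = raw                  # fall back to the bare string
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
```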
Per-request:
- `first_token_latency_s` - wall-clock from request send to first stream chunk
- `total_latency_s` - wall-clock to the final chunk
- `tokens_per_second` - completion tokens / generation window (excludes prefill)
- `prompt_tokens`, `completion_tokens`
Per swarm-size run:
- p50 / p95 / max first-token latency across all requests
- p50 / min / avg tokens-per-second across all requests
- Peak VRAM used per GPU (sampled every 0.5 s while the run executes)
- Peak GPU utilization
- Peak CPU% and peak summed RSS across worker processes
- Failure count (HTTP errors + workers that never became ready)
- Classification: interactive / usable / background / failed
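In code, the per-request numbers above reduce to a handful of timestamps around the token stream. A sketch under those definitions (not the implementation in src/benchmark.py; it assumes one streamed chunk is roughly one token):

```python
import time

def measure_stream(chunks, prompt_tokens: int) -> dict:
    """Consume a token stream and derive the per-request metrics defined above."""
    t_send = time.perf_counter()
    t_first = None
    completion_tokens = 0
    for _chunk in chunks:                        # each chunk ~ one streamed token
        if t_first is None:
            t_first = time.perf_counter()
        completion_tokens += 1
    t_done = time.perf_counter()
    gen_window = t_done - (t_first or t_done)    # generation only, prefill excluded
    return {
        "first_token_latency_s": (t_first or t_done) - t_send,
        "total_latency_s": t_done - t_send,
        "tokens_per_second": completion_tokens / gen_window if gen_window > 0 else 0.0,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
```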
```
results/
  benchmark_20260430-153012_quick.json
  benchmark_20260430-153012_quick.csv
  latest.json              → most recent JSON (symlink)
  _worker_logs/
    w00_n8.log             ← raw stdout of each worker, per swarm size
reports/
  benchmark_20260430-153012_quick.md
```
Configured in config.yaml. Defaults match the problem statement:
| Class       | First-token p95 | Tokens/s p50 |
|-------------|-----------------|--------------|
| interactive | ≤ 2.0 s         | ≥ 20         |
| usable      | ≤ 5.0 s         | ≥ 10         |
| background  | anything else that didn't fail | |
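The classification itself is just threshold checks on the per-run aggregates; a sketch that mirrors the table (the failure rule and parameter names are assumptions):

```python
def classify(first_token_p95_s: float, tps_p50: float,
             failures: int, total_requests: int) -> str:
    """Map one swarm-size run to interactive / usable / background / failed."""
    if total_requests == 0 or failures >= total_requests:   # assumed "failed" rule
        return "failed"
    if first_token_p95_s <= 2.0 and tps_p50 >= 20:
        return "interactive"
    if first_token_p95_s <= 5.0 and tps_p50 >= 10:
        return "usable"
    return "background"
```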
`gpu.strategy` controls how workers map to GPUs:
- `auto` / `round_robin`: assign workers to GPUs round-robin (worker i → GPU `i % n_gpus`) using `CUDA_VISIBLE_DEVICES`
- `pin`: use the `gpu.pin: [0, 0, 1, 1]` list directly (one entry per worker)
When the same GPU hosts multiple workers, `allow_multi_model_per_gpu: true` must stay enabled (it is the default). The benchmark does not enforce a per-worker VRAM budget; that is reported, not policed.
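What the two strategies boil down to when workers are spawned, as a sketch (the real plumbing is src/benchmark.py:_build_worker_cmd(); the --port flag and port numbering here are illustrative):

```python
import os
import subprocess
import sys

def spawn_workers(n_workers: int, n_gpus: int, strategy: str = "round_robin",
                  pin: list[int] | None = None) -> list[subprocess.Popen]:
    """Launch one model_worker process per swarm slot, each pinned to a GPU."""
    procs = []
    for i in range(n_workers):
        if strategy == "pin" and pin:
            gpu = pin[i]                       # gpu.pin: one entry per worker
        else:                                  # auto / round_robin
            gpu = i % n_gpus                   # worker i -> gpu (i % n_gpus)
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        cmd = [sys.executable, "-m", "src.model_worker", "--port", str(8000 + i)]
        procs.append(subprocess.Popen(cmd, env=env))
    return procs
```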
The DGX Spark's GB10 reports a single "GPU" to NVML, but it shares memory
with the Grace CPU side. peak_used_mb from pynvml reflects what NVML thinks
is GPU-resident; on a unified board you should also watch the worker
processes' RSS (also reported as rss_peak_mb). Don't be surprised when
"VRAM" plus "RSS" exceeds physical RAM — they overlap on these boards.
Round-robin pinning across GPUs degenerates to a single device on this
machine. To still test isolation between concurrent workers there, set
gpu.strategy: auto (everyone shares GPU 0) and rely on the default
multi-model-per-gpu path.
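On a unified-memory board it helps to watch both numbers from the same loop. Roughly what one 0.5 s sample gathers (the pynvml and psutil calls are standard; the aggregation is a sketch, not src/gpu_monitor.py):

```python
import psutil
import pynvml

pynvml.nvmlInit()

def sample(worker_pids: list[int]) -> dict:
    """One sample: NVML-visible memory/utilization per GPU plus summed worker RSS."""
    gpus = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        gpus.append({"gpu": i, "used_mb": mem.used / 2**20, "util_pct": util.gpu})
    rss_mb = sum(psutil.Process(pid).memory_info().rss
                 for pid in worker_pids if psutil.pid_exists(pid)) / 2**20
    return {"gpus": gpus, "rss_total_mb": rss_mb}
```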
To add a new backend:
- Create `src/backends/<name>_backend.py` subclassing `Backend`.
- Implement `load`, `unload`, and `generate` returning a `GenerationResult`.
- Add a branch in `src/backends/base.py:get_backend()`.
- Add the spawn-args plumbing in `src/benchmark.py:_build_worker_cmd()`.
vLLM is a natural fit: open an OpenAI-style HTTP endpoint per worker and
hit /v1/completions with stream=true. The plumbing mirrors the Ollama
backend.
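A sketch of what that generate path could look like against an OpenAI-style server (the URL path and SSE framing follow the OpenAI completions convention; the model name, ports, and timeouts are placeholders):

```python
import json
import time
import httpx

def generate_openai_style(base_url: str, model: str, prompt: str, max_tokens: int) -> dict:
    """Stream /v1/completions and time first token vs. total, like the other backends."""
    t_send = time.perf_counter()
    t_first, pieces = None, []
    with httpx.Client(timeout=None) as client:
        with client.stream("POST", f"{base_url}/v1/completions",
                           json={"model": model, "prompt": prompt,
                                 "max_tokens": max_tokens, "stream": True}) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                if t_first is None:
                    t_first = time.perf_counter()
                pieces.append(json.loads(line[len("data: "):])["choices"][0]["text"])
    t_done = time.perf_counter()
    return {"text": "".join(pieces),
            "first_token_latency_s": (t_first or t_done) - t_send,
            "total_latency_s": t_done - t_send}
```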
```
swarm-bench/
  README.md
  requirements.txt
  config.yaml            ← single-run config
  models.yaml            ← multi-model sweep config
  prompts/basic.txt
  src/
    main.py              ← CLI: list-gpus, benchmark, sweep, report
    config.py            ← YAML loader + override parser
    gpu_monitor.py       ← pynvml / nvidia-smi sampler
    benchmark.py         ← orchestrator (asyncio + httpx + subprocess)
    model_worker.py      ← FastAPI worker (entry: python -m src.model_worker)
    report.py            ← Markdown renderer
    backends/
      base.py
      llama_cpp_backend.py
      ollama_backend.py
  scripts/
    run_baseline.sh
    run_swarm.sh
    download_models.md
  results/
  reports/
```