A local benchmarking and orchestration tool that answers one question:
How many small Gemma models can my machine run concurrently before inference speed becomes unacceptable?
It launches N independent model worker processes, fires concurrent prompts at
each, samples GPU memory and CPU/RSS while they run, and produces JSON / CSV /
Markdown reports classifying each swarm size as interactive, usable, or
background-only based on configurable thresholds.
One crashed model (OOM, segfault inside llama.cpp, GGUF parse error) should
not kill the benchmark. Each worker is a separate python -m src.model_worker
process that loads its own model and exposes it via FastAPI on a private port.
The orchestrator talks HTTP to them and treats process death as data — failed
workers are recorded and the run continues.
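A minimal sketch of that failure handling, assuming an httpx-based request loop like the orchestrator's (names here are illustrative; the real logic lives in src/benchmark.py):

```python
import httpx

async def fire_request(client: httpx.AsyncClient, port: int, prompt: str,
                       results: list, failures: list) -> None:
    """Send one /generate request and record errors instead of raising."""
    try:
        resp = await client.post(f"http://127.0.0.1:{port}/generate",
                                 json={"prompt": prompt}, timeout=120.0)
        resp.raise_for_status()
        results.append(resp.json())
    except httpx.HTTPError as exc:
        # Connection refused, timeout, or 5xx: the worker is dead or wedged.
        # Record it as a data point and let the rest of the swarm keep going.
        failures.append({"port": port, "error": repr(exc)})
```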
```
┌───────────────┐   spawns N    ┌──────────────────┐
│ benchmark.py  │ ────────────► │ model_worker.py  │  (one per swarm slot)
│ (orchestrator)│               │  ├ FastAPI :PORT │
│               │ HTTP /generate│  └ Backend (llama_cpp / ollama)
│               │ ◄──────────── │    loads model on a chosen GPU
└──────┬────────┘               └──────────────────┘
       │ samples
       ▼
┌───────────────┐
│ gpu_monitor   │   pynvml (or nvidia-smi fallback) every 0.5s
└───────────────┘
```
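A worker, reduced to its skeleton, looks roughly like this (a sketch, not src/model_worker.py; the /health route and the environment-variable wiring are assumptions about how readiness and backend selection could work):

```python
# Hypothetical worker skeleton: one model per process, served over FastAPI.
import os
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
backend = None  # a Backend instance, created at startup

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 64

@app.on_event("startup")
def load_model() -> None:
    global backend
    from src.backends.base import get_backend                           # factory sketched below
    backend = get_backend(os.environ.get("SWARM_BACKEND", "llama_cpp"))  # assumed wiring
    backend.load()

@app.get("/health")
def health() -> dict:
    # Polled by the orchestrator until the worker counts as "ready".
    return {"ready": backend is not None}

@app.post("/generate")
def generate(req: GenerateRequest):
    return backend.generate(req.prompt, max_tokens=req.max_tokens)
```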
Backend is an abstract class with two concrete implementations
(llama_cpp_backend.py, ollama_backend.py). Adding vLLM later is a third
file plus one branch in get_backend().
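A sketch of what that interface amounts to (class and field names are inferred from the metrics and file names in this README, not copied from src/backends/base.py):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class GenerationResult:
    text: str
    prompt_tokens: int
    completion_tokens: int
    first_token_latency_s: float
    total_latency_s: float

class Backend(ABC):
    @abstractmethod
    def load(self) -> None: ...

    @abstractmethod
    def unload(self) -> None: ...

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int) -> GenerationResult: ...

def get_backend(name: str) -> Backend:
    # One branch per concrete implementation; adding vLLM means one more branch here.
    if name == "llama_cpp":
        from .llama_cpp_backend import LlamaCppBackend   # class names are hypothetical
        return LlamaCppBackend()
    if name == "ollama":
        from .ollama_backend import OllamaBackend
        return OllamaBackend()
    raise ValueError(f"unknown backend: {name}")
```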
```bash
cd swarm-bench
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Pick at least one backend:

# llama-cpp-python with CUDA (matches your CUDA version):
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

# OR Ollama (separate daemon):
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3:270m
```

See `scripts/download_models.md` for model URLs.
```bash
# 1) Confirm GPUs are visible
python -m src.main list-gpus

# 2) Single-config run (uses config.yaml)
python -m src.main benchmark --config config.yaml --report

# 3) Override config from the CLI (JSON-parsed values)
python -m src.main benchmark \
  --override 'swarm.workers=[1,2,4]' \
  --override 'generation.max_tokens=64' \
  --tag quick

# 4) Sweep every model in models.yaml
python -m src.main sweep --models models.yaml --max-workers 32

# 5) Re-render a report from a results JSON
python -m src.main report --results results/latest.json
```
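Each `--override` is a dotted key with a JSON-parsed value, so lists and numbers survive the command line intact. A minimal parser in that spirit (the real one lives in src/config.py and may differ):

```python
import json

def apply_override(config: dict, override: str) -> None:
    """Apply one 'dotted.key=json_value' override in place,
    e.g. 'swarm.workers=[1,2,4]' or 'generation.max_tokens=64'."""
    key, _, raw = override.partition("=")
    try:
        value = json.loads(raw)      # lists, numbers, booleans, quoted strings
    except json.JSONDecodeError:
        value = raw                  # fall back to the bare string
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
```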
Per-request:
- `first_token_latency_s` - wall-clock from request send to first stream chunk
- `total_latency_s` - wall-clock to the final chunk
- `tokens_per_second` - completion tokens / generation window (excludes prefill)
- `prompt_tokens`, `completion_tokens`
Per swarm-size run:
- p50 / p95 / max first-token latency across all requests
- p50 / min / avg tokens-per-second across all requests
- Peak VRAM used per GPU (sampled every 0.5 s while the run executes)
- Peak GPU utilization
- Peak CPU% and peak summed RSS across worker processes
- Failure count (HTTP errors + workers that never became ready)
- Classification: interactive / usable / background / failed
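In code, the per-request numbers above reduce to a handful of timestamps around the token stream. A sketch under those definitions (not the implementation in src/benchmark.py; it assumes one streamed chunk is roughly one token):

```python
import time

def measure_stream(chunks, prompt_tokens: int) -> dict:
    """Consume a token stream and derive the per-request metrics defined above."""
    t_send = time.perf_counter()
    t_first = None
    completion_tokens = 0
    for _chunk in chunks:                        # each chunk ~ one streamed token
        if t_first is None:
            t_first = time.perf_counter()
        completion_tokens += 1
    t_done = time.perf_counter()
    gen_window = t_done - (t_first or t_done)    # generation only, prefill excluded
    return {
        "first_token_latency_s": (t_first or t_done) - t_send,
        "total_latency_s": t_done - t_send,
        "tokens_per_second": completion_tokens / gen_window if gen_window > 0 else 0.0,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
```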
```
results/
  benchmark_20260430-153012_quick.json
  benchmark_20260430-153012_quick.csv
  latest.json              → most recent JSON (symlink)
  _worker_logs/
    w00_n8.log             ← raw stdout of each worker, per swarm size
reports/
  benchmark_20260430-153012_quick.md
```
Configured in config.yaml. Defaults match the problem statement:
| Class       | First-token p95 | Tokens/s p50 |
|-------------|-----------------|--------------|
| interactive | ≤ 2.0 s         | ≥ 20         |
| usable      | ≤ 5.0 s         | ≥ 10         |
| background  | anything else that didn't fail | |
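The classification itself is just threshold checks on the per-run aggregates; a sketch that mirrors the table (the failure rule and parameter names are assumptions):

```python
def classify(first_token_p95_s: float, tps_p50: float,
             failures: int, total_requests: int) -> str:
    """Map one swarm-size run to interactive / usable / background / failed."""
    if total_requests == 0 or failures >= total_requests:   # assumed "failed" rule
        return "failed"
    if first_token_p95_s <= 2.0 and tps_p50 >= 20:
        return "interactive"
    if first_token_p95_s <= 5.0 and tps_p50 >= 10:
        return "usable"
    return "background"
```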
`gpu.strategy` controls how workers map to GPUs:
- `auto` / `round_robin`: assign workers to GPUs round-robin (worker i → GPU `i % n_gpus`) using `CUDA_VISIBLE_DEVICES`
- `pin`: use the `gpu.pin: [0, 0, 1, 1]` list directly (one entry per worker)
When the same GPU hosts multiple workers, `allow_multi_model_per_gpu: true` must stay enabled (it is the default). The benchmark does not enforce a per-worker VRAM budget; that is reported, not policed.
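What the two strategies boil down to when workers are spawned, as a sketch (the real plumbing is src/benchmark.py:_build_worker_cmd(); the --port flag and port numbering here are illustrative):

```python
import os
import subprocess
import sys

def spawn_workers(n_workers: int, n_gpus: int, strategy: str = "round_robin",
                  pin: list[int] | None = None) -> list[subprocess.Popen]:
    """Launch one model_worker process per swarm slot, each pinned to a GPU."""
    procs = []
    for i in range(n_workers):
        if strategy == "pin" and pin:
            gpu = pin[i]                       # gpu.pin: one entry per worker
        else:                                  # auto / round_robin
            gpu = i % n_gpus                   # worker i -> gpu (i % n_gpus)
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        cmd = [sys.executable, "-m", "src.model_worker", "--port", str(8000 + i)]
        procs.append(subprocess.Popen(cmd, env=env))
    return procs
```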
The DGX Spark's GB10 reports a single "GPU" to NVML, but it shares memory
with the Grace CPU side. peak_used_mb from pynvml reflects what NVML thinks
is GPU-resident; on a unified board you should also watch the worker
processes' RSS (also reported as rss_peak_mb). Don't be surprised when
"VRAM" plus "RSS" exceeds physical RAM — they overlap on these boards.
Round-robin pinning across GPUs degenerates to a single device on this
machine. To still test isolation between concurrent workers there, set
gpu.strategy: auto (everyone shares GPU 0) and rely on the default
multi-model-per-gpu path.
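On a unified-memory board it helps to watch both numbers from the same loop. Roughly what one 0.5 s sample gathers (the pynvml and psutil calls are standard; the aggregation is a sketch, not src/gpu_monitor.py):

```python
import psutil
import pynvml

pynvml.nvmlInit()

def sample(worker_pids: list[int]) -> dict:
    """One sample: NVML-visible memory/utilization per GPU plus summed worker RSS."""
    gpus = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        gpus.append({"gpu": i, "used_mb": mem.used / 2**20, "util_pct": util.gpu})
    rss_mb = sum(psutil.Process(pid).memory_info().rss
                 for pid in worker_pids if psutil.pid_exists(pid)) / 2**20
    return {"gpus": gpus, "rss_total_mb": rss_mb}
```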
To add a new backend:
- Create `src/backends/<name>_backend.py` subclassing `Backend`.
- Implement `load`, `unload`, and `generate` returning a `GenerationResult`.
- Add a branch in `src/backends/base.py:get_backend()`.
- Add the spawn-args plumbing in `src/benchmark.py:_build_worker_cmd()`.
vLLM is a natural fit: open an OpenAI-style HTTP endpoint per worker and
hit /v1/completions with stream=true. The plumbing mirrors the Ollama
backend.
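A sketch of what that generate path could look like against an OpenAI-style server (the URL path and SSE framing follow the OpenAI completions convention; the model name, ports, and timeouts are placeholders):

```python
import json
import time
import httpx

def generate_openai_style(base_url: str, model: str, prompt: str, max_tokens: int) -> dict:
    """Stream /v1/completions and time first token vs. total, like the other backends."""
    t_send = time.perf_counter()
    t_first, pieces = None, []
    with httpx.Client(timeout=None) as client:
        with client.stream("POST", f"{base_url}/v1/completions",
                           json={"model": model, "prompt": prompt,
                                 "max_tokens": max_tokens, "stream": True}) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                if t_first is None:
                    t_first = time.perf_counter()
                pieces.append(json.loads(line[len("data: "):])["choices"][0]["text"])
    t_done = time.perf_counter()
    return {"text": "".join(pieces),
            "first_token_latency_s": (t_first or t_done) - t_send,
            "total_latency_s": t_done - t_send}
```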
```
swarm-bench/
  README.md
  requirements.txt
  config.yaml            ← single-run config
  models.yaml            ← multi-model sweep config
  prompts/basic.txt
  src/
    main.py              ← CLI: list-gpus, benchmark, sweep, report
    config.py            ← YAML loader + override parser
    gpu_monitor.py       ← pynvml / nvidia-smi sampler
    benchmark.py         ← orchestrator (asyncio + httpx + subprocess)
    model_worker.py      ← FastAPI worker (entry: python -m src.model_worker)
    report.py            ← Markdown renderer
    backends/
      base.py
      llama_cpp_backend.py
      ollama_backend.py
  scripts/
    run_baseline.sh
    run_swarm.sh
    download_models.md
  results/
  reports/
```