swarm-bench

A local benchmarking and orchestration tool that answers one question:

How many small Gemma models can my machine run concurrently before inference speed becomes unacceptable?

It launches N independent model worker processes and fires concurrent prompts at each, sampling GPU memory and CPU/RSS while they run. It then produces JSON / CSV / Markdown reports that classify each swarm size as interactive, usable, or background-only, based on configurable thresholds.

Why subprocesses

One crashed model (OOM, segfault inside llama.cpp, GGUF parse error) should not kill the benchmark. Each worker is a separate python -m src.model_worker process that loads its own model and exposes it via FastAPI on a private port. The orchestrator talks HTTP to them and treats process death as data — failed workers are recorded and the run continues.
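
A minimal sketch of that spawn-and-probe loop, assuming a --port flag on the worker and a /health endpoint (both assumptions; the real logic lives in src/benchmark.py):

import os
import subprocess
import sys
import time

import httpx

def spawn_worker(slot: int, gpu: int, port: int) -> subprocess.Popen:
    # CUDA_VISIBLE_DEVICES confines the worker to one GPU; stdout goes to
    # the per-worker log file under results/_worker_logs/.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    return subprocess.Popen(
        [sys.executable, "-m", "src.model_worker", "--port", str(port)],
        env=env,
        stdout=open(f"results/_worker_logs/w{slot:02d}.log", "wb"),
        stderr=subprocess.STDOUT,
    )

def wait_ready(port: int, timeout: float = 120.0) -> bool:
    # Poll until the worker's FastAPI server answers. A worker that dies
    # or hangs here is recorded as a failure; the run continues without it.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            r = httpx.get(f"http://127.0.0.1:{port}/health", timeout=2.0)
            if r.status_code == 200:
                return True
        except httpx.HTTPError:
            pass
        time.sleep(0.5)
    return False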

Architecture

   ┌───────────────┐  spawns N     ┌──────────────────────────────┐
   │ benchmark.py  │ ────────────► │ model_worker.py              │  (one per swarm slot)
   │ (orchestrator)│               │  ├ FastAPI :PORT             │
   │               │ HTTP /generate│  └ Backend (llama_cpp/ollama)│
   │               │ ◄──────────── │    loads model on chosen GPU │
   └───────┬───────┘               └──────────────────────────────┘
           │ samples
           ▼
   ┌──────────────┐
   │ gpu_monitor  │  pynvml (or nvidia-smi fallback) every 0.5 s
   └──────────────┘

Backend is an abstract class with two concrete implementations (llama_cpp_backend.py, ollama_backend.py). Adding vLLM later is a third file plus one branch in get_backend().
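
For orientation, here is a sketch of what that contract might look like; the method names come from the steps under "Adding a backend" below, while the exact signatures and GenerationResult fields are assumptions:

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class GenerationResult:
    # Assumed fields, mirroring the per-request metrics listed below.
    text: str
    prompt_tokens: int
    completion_tokens: int
    first_token_latency_s: float
    total_latency_s: float

class Backend(ABC):
    @abstractmethod
    def load(self, model: str, **kwargs) -> None: ...

    @abstractmethod
    def unload(self) -> None: ...

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int) -> GenerationResult: ...

def get_backend(name: str):
    # One branch per concrete backend; adding vLLM means one more branch.
    if name == "llama_cpp":
        from .llama_cpp_backend import LlamaCppBackend
        return LlamaCppBackend
    if name == "ollama":
        from .ollama_backend import OllamaBackend
        return OllamaBackend
    raise ValueError(f"unknown backend: {name!r}")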

Install

cd swarm-bench
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Pick at least one backend:
# llama-cpp-python with CUDA (matches your CUDA version):
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

# OR Ollama (separate daemon):
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3:270m

See scripts/download_models.md for model URLs.

Usage

# 1) Confirm GPUs are visible
python -m src.main list-gpus

# 2) Single-config run (uses config.yaml)
python -m src.main benchmark --config config.yaml --report

# 3) Override config from the CLI (JSON-parsed values)
python -m src.main benchmark \
  --override 'swarm.workers=[1,2,4]' \
  --override 'generation.max_tokens=64' \
  --tag quick

# 4) Sweep every model in models.yaml
python -m src.main sweep --models models.yaml --max-workers 32

# 5) Re-render a report from a results JSON
python -m src.main report --results results/latest.json
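
An --override is a dotted key plus a JSON-parsed value. A sketch of how src/config.py might apply one (hypothetical helper; the shipped parser may differ):

import json

def apply_override(config: dict, override: str) -> None:
    # 'swarm.workers=[1,2,4]' -> config["swarm"]["workers"] = [1, 2, 4]
    key, _, raw = override.partition("=")
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = json.loads(raw)

cfg = {}
apply_override(cfg, 'swarm.workers=[1,2,4]')
apply_override(cfg, 'generation.max_tokens=64')
assert cfg == {"swarm": {"workers": [1, 2, 4]}, "generation": {"max_tokens": 64}}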

What gets measured

Per-request:

  • first_token_latency_s — wall-clock from request send to first stream chunk
  • total_latency_s — wall-clock to the final chunk
  • tokens_per_second — completion tokens / generation window (excludes prefill); see the sketch after these lists
  • prompt_tokens, completion_tokens

Per swarm-size run:

  • p50 / p95 / max first-token latency across all requests
  • p50 / min / avg tokens-per-second across all requests
  • Peak VRAM used per GPU (sampled every 0.5 s while the run executes)
  • Peak GPU utilization
  • Peak CPU% and peak summed RSS across worker processes
  • Failure count (HTTP errors + workers that never became ready)
  • Classification: interactive / usable / background / failed
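
A sketch of how the per-request numbers could fall out of a streamed response, assuming one completion token per chunk for simplicity (the real accounting lives in src/benchmark.py):

import time

async def timed_stream(chunks) -> dict:
    # Consume an async stream of completion chunks and derive the metrics.
    # tokens_per_second deliberately excludes prefill: its clock starts at
    # the first streamed chunk, not at request send.
    start = time.monotonic()
    first = None
    n_tokens = 0
    async for _chunk in chunks:
        if first is None:
            first = time.monotonic() - start
        n_tokens += 1
    total = time.monotonic() - start
    window = total - (first or 0.0)
    return {
        "first_token_latency_s": first,
        "total_latency_s": total,
        "tokens_per_second": n_tokens / window if window > 0 else None,
    }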

Outputs

results/
  benchmark_20260430-153012_quick.json
  benchmark_20260430-153012_quick.csv
  latest.json   →   most recent JSON (symlink)
  _worker_logs/
    w00_n8.log  ← raw stdout of each worker, per swarm size
reports/
  benchmark_20260430-153012_quick.md

Classification thresholds

Configured in config.yaml. Defaults match the problem statement:

Class         First-token p95   tps p50
interactive   ≤ 2.0 s           ≥ 20
usable        ≤ 5.0 s           ≥ 10
background    anything else that didn't fail
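
In config.yaml that might look like the following (the key names here are an assumption; check the shipped file):

# Hypothetical shape of the thresholds block in config.yaml.
thresholds:
  interactive:
    first_token_p95_s: 2.0
    tps_p50: 20
  usable:
    first_token_p95_s: 5.0
    tps_p50: 10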

GPU strategies (config.yaml → gpu.strategy)

  • auto / round_robin: assign workers to GPUs round-robin (worker i → gpu (i % n_gpus)) using CUDA_VISIBLE_DEVICES
  • pin: use the gpu.pin: [0, 0, 1, 1] list directly (one entry per worker)

When the same GPU hosts multiple workers, allow_multi_model_per_gpu: true must remain set (default). The benchmark does not enforce a per-worker VRAM budget — that's reported, not policed.
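
Put together, the gpu block might read as follows (the nesting of allow_multi_model_per_gpu is an assumption):

gpu:
  strategy: pin                     # auto | round_robin | pin
  pin: [0, 0, 1, 1]                 # read only when strategy: pin; one GPU id per worker
  allow_multi_model_per_gpu: true   # default; required when a GPU hosts several workers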

DGX Spark / unified-memory note

The DGX Spark's GB10 reports a single "GPU" to NVML, but it shares memory with the Grace CPU side. peak_used_mb from pynvml reflects what NVML thinks is GPU-resident; on a unified board you should also watch the worker processes' RSS (also reported as rss_peak_mb). Don't be surprised when "VRAM" plus "RSS" exceeds physical RAM — they overlap on these boards.

Round-robin pinning across GPUs degenerates to a single device on this machine. To still test isolation between concurrent workers there, set gpu.strategy: auto (everyone shares GPU 0) and rely on the default multi-model-per-gpu path.

Adding a backend

  1. Create src/backends/<name>_backend.py subclassing Backend.
  2. Implement load, unload, generate returning a GenerationResult.
  3. Add a branch in src/backends/base.py:get_backend().
  4. Add the spawn-args plumbing in src/benchmark.py:_build_worker_cmd().

vLLM is a natural fit: open an OpenAI-style HTTP endpoint per worker and hit /v1/completions with stream=true. The plumbing mirrors the Ollama backend.
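
A sketch of that plumbing, assuming the Backend and GenerationResult interface from src/backends/base.py and a vLLM server already listening on each worker's port:

import json
import time

import httpx

from .base import Backend, GenerationResult  # assumed exports

class VLLMBackend(Backend):
    def load(self, model: str, port: int = 8000, **kwargs) -> None:
        self.model = model
        self.url = f"http://127.0.0.1:{port}/v1/completions"

    def unload(self) -> None:
        pass  # the vLLM server's lifetime is owned by the orchestrator

    def generate(self, prompt: str, max_tokens: int) -> GenerationResult:
        payload = {"model": self.model, "prompt": prompt,
                   "max_tokens": max_tokens, "stream": True}
        start, first, parts = time.monotonic(), None, []
        with httpx.stream("POST", self.url, json=payload, timeout=None) as r:
            for line in r.iter_lines():
                # OpenAI-style SSE: 'data: {json}' per chunk, 'data: [DONE]' last.
                if not line.startswith("data: ") or line.endswith("[DONE]"):
                    continue
                if first is None:
                    first = time.monotonic() - start
                parts.append(json.loads(line[len("data: "):])["choices"][0]["text"])
        return GenerationResult(
            text="".join(parts),
            prompt_tokens=0,               # usage arrives on the final chunk; skipped here
            completion_tokens=len(parts),
            first_token_latency_s=first or 0.0,
            total_latency_s=time.monotonic() - start,
        )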

Project layout

swarm-bench/
  README.md
  requirements.txt
  config.yaml          ← single-run config
  models.yaml          ← multi-model sweep config
  prompts/basic.txt
  src/
    main.py            ← CLI: list-gpus, benchmark, sweep, report
    config.py          ← YAML loader + override parser
    gpu_monitor.py     ← pynvml / nvidia-smi sampler
    benchmark.py       ← orchestrator (asyncio + httpx + subprocess)
    model_worker.py    ← FastAPI worker (entry: python -m src.model_worker)
    report.py          ← Markdown renderer
    backends/
      base.py
      llama_cpp_backend.py
      ollama_backend.py
  scripts/
    run_baseline.sh
    run_swarm.sh
    download_models.md
  results/
  reports/
