SimpleTuner's worker orchestration allows you to distribute training jobs across multiple GPU machines. Workers register with a central orchestrator, receive job dispatch events in real-time, and report status back.
The orchestrator/worker architecture enables:
- Distributed training - Run jobs on any machine with GPUs, anywhere
- Auto-discovery - Workers self-register with GPU capabilities
- Real-time dispatch - Jobs dispatched via SSE (Server-Sent Events)
- Mixed fleet - Combine cloud-launched ephemeral workers with persistent on-prem machines
- Fault tolerance - Orphaned jobs are automatically requeued
| Type | Lifecycle | Use Case |
|---|---|---|
| Ephemeral | Shuts down after job completion | Cloud spot instances (RunPod, Vast.ai) |
| Persistent | Stays online between jobs | On-prem GPUs, reserved instances |
Run the SimpleTuner server on your central machine:
```bash
simpletuner server --host 0.0.0.0 --port 8001
```

For production, enable SSL:

```bash
simpletuner server --host 0.0.0.0 --port 8001 --ssl
```

Via Web UI: Administration → Workers → Create Worker
Via API:
```bash
curl -s -X POST http://localhost:8001/api/admin/workers \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpu-worker-1",
    "worker_type": "persistent",
    "labels": {"location": "datacenter-a", "gpu_type": "a100"}
  }'
```

Response includes the token (shown only once):
```json
{
  "worker_id": "w_abc123",
  "token": "wt_xxxxxxxxxxxx",
  "name": "gpu-worker-1"
}
```

On the GPU machine:
```bash
simpletuner worker \
  --orchestrator-url https://orchestrator.example.com:8001 \
  --worker-token wt_xxxxxxxxxxxx \
  --name gpu-worker-1 \
  --persistent
```

Or via environment variables:
```bash
export SIMPLETUNER_ORCHESTRATOR_URL=https://orchestrator.example.com:8001
export SIMPLETUNER_WORKER_TOKEN=wt_xxxxxxxxxxxx
export SIMPLETUNER_WORKER_NAME=gpu-worker-1
export SIMPLETUNER_WORKER_PERSISTENT=true

simpletuner worker
```

The worker will:
- Connect to the orchestrator
- Report GPU capabilities (auto-detected)
- Enter the job dispatch loop
- Send heartbeats every 30 seconds
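The heartbeat step can be sketched as a small timing helper. This is an illustration only: the payload fields (`name`, `status`, `ts`) are assumptions, not SimpleTuner's actual wire format.

```python
import time

HEARTBEAT_INTERVAL = 30  # seconds, matching the dispatch loop described above

def heartbeat_due(last_sent: float, now: float,
                  interval: float = HEARTBEAT_INTERVAL) -> bool:
    """True once `interval` seconds have elapsed since the last heartbeat."""
    return now - last_sent >= interval

def heartbeat_payload(worker_name: str, status: str) -> dict:
    """Build an illustrative heartbeat body; the real format may differ."""
    return {"name": worker_name, "status": status, "ts": time.time()}
```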
Via Web UI: Configure your training, then click Train in Cloud → select Worker as target.
Via API:
```bash
curl -s -X POST http://localhost:8001/api/queue/submit \
  -H "Content-Type: application/json" \
  -d '{
    "config_name": "my-training-config",
    "target": "worker"
  }'
```

Target options:
| Target | Behavior |
|---|---|
| `worker` | Dispatch only to remote workers |
| `local` | Run on orchestrator's GPUs |
| `auto` | Prefer worker if available, fall back to local |
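The `auto` fallback can be sketched as follows. This is a simplification: the real scheduler also weighs GPU requirements and labels.

```python
def resolve_target(target: str, idle_workers: int) -> str:
    """Map a submit target to where the job runs, per the table above."""
    if target == "worker":
        return "worker"
    if target == "local":
        return "local"
    if target == "auto":
        # Prefer a remote worker; fall back to the orchestrator's GPUs.
        return "worker" if idle_workers > 0 else "local"
    raise ValueError(f"unknown target: {target!r}")
```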
```
simpletuner worker [OPTIONS]

OPTIONS:
  --orchestrator-url URL   Orchestrator panel URL (or SIMPLETUNER_ORCHESTRATOR_URL)
  --worker-token TOKEN     Authentication token (or SIMPLETUNER_WORKER_TOKEN)
  --name NAME              Worker name (default: hostname)
  --persistent             Stay online between jobs (default: ephemeral)
  -v, --verbose            Enable debug logging
```
Ephemeral (default):
- Worker shuts down after completing one job
- Ideal for cloud spot instances that bill per minute
- Orchestrator cleans up offline ephemeral workers after 1 hour
Persistent (--persistent):
- Worker stays online waiting for new jobs
- Reconnects automatically if connection drops
- Use for on-prem GPUs or reserved instances
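Automatic reconnection typically uses exponential backoff. SimpleTuner's exact retry schedule isn't documented here, so the values below are hypothetical:

```python
def backoff_delays(attempts: int = 6, base: float = 1.0,
                   cap: float = 60.0) -> list[float]:
    """Exponential backoff with a cap: 1, 2, 4, 8, ... up to `cap` seconds.

    Illustrative only -- the worker's real retry policy may differ.
    """
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```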
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ CONNECTING  │ ──▶ │    IDLE     │ ──▶ │    BUSY     │
└─────────────┘     └─────────────┘     └─────────────┘
                           │                   │
                           │                   │
                           ▼                   ▼
                    ┌─────────────┐     ┌─────────────┐
                    │  DRAINING   │     │   OFFLINE   │
                    └─────────────┘     └─────────────┘
```
| Status | Description |
|---|---|
| `CONNECTING` | Worker establishing connection |
| `IDLE` | Ready to receive jobs |
| `BUSY` | Currently running a job |
| `DRAINING` | Finishing current job, then shutting down |
| `OFFLINE` | Disconnected (heartbeat timeout) |
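The lifecycle can be encoded as a small transition map. The allowed edges here are assumptions read from the diagram and status descriptions; the implementation may permit others:

```python
# Allowed status transitions, as inferred from the lifecycle diagram above.
ALLOWED_TRANSITIONS = {
    "connecting": {"idle", "offline"},
    "idle": {"busy", "draining", "offline"},
    "busy": {"idle", "draining", "offline"},
    "draining": {"offline"},
    "offline": {"connecting"},
}

def can_transition(src: str, dst: str) -> bool:
    """Check whether a status change is one of the inferred edges."""
    return dst in ALLOWED_TRANSITIONS.get(src, set())
```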
The orchestrator monitors worker health:
- Heartbeat interval: 30 seconds (worker → orchestrator)
- Timeout threshold: 120 seconds without heartbeat → mark offline
- Health check loop: Runs every 60 seconds on orchestrator
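The health-check sweep amounts to comparing each worker's last heartbeat timestamp against the 120-second threshold; a minimal sketch:

```python
HEARTBEAT_TIMEOUT = 120.0  # seconds without a heartbeat before marking offline

def sweep_offline(last_heartbeats: dict[str, float], now: float,
                  timeout: float = HEARTBEAT_TIMEOUT) -> list[str]:
    """Return ids of workers whose last heartbeat is older than `timeout`."""
    return [wid for wid, last in last_heartbeats.items()
            if now - last > timeout]
```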
Worker goes offline during a job:
- Job marked as failed after heartbeat timeout
- If retries remaining (default: 3), job requeued
- Next available worker picks up the job
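The requeue decision for a lost worker can be sketched as follows; the field names (`retries`, `status`) are illustrative assumptions:

```python
MAX_RETRIES = 3  # default retry budget noted above

def on_worker_lost(job: dict, max_retries: int = MAX_RETRIES) -> dict:
    """Handle a job whose worker hit the heartbeat timeout."""
    retries = job.get("retries", 0)
    if retries < max_retries:
        job["retries"] = retries + 1
        job["status"] = "queued"   # requeued; next idle worker picks it up
    else:
        job["status"] = "failed"   # retry budget exhausted
    return job
```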
Orchestrator restarts:
- Workers automatically reconnect
- Workers report any in-progress jobs
- Orchestrator reconciles state and resumes
Workers report their GPU capabilities on registration:
```json
{
  "gpu_count": 2,
  "gpu_name": "NVIDIA A100-SXM4-80GB",
  "gpu_vram_gb": 80,
  "accelerator_type": "cuda"
}
```

Jobs can specify GPU requirements:
```bash
curl -s -X POST http://localhost:8001/api/queue/submit \
  -H "Content-Type: application/json" \
  -d '{
    "config_name": "my-config",
    "target": "worker",
    "worker_labels": {"gpu_type": "a100*"}
  }'
```

The scheduler matches jobs to workers based on:
- GPU count requirements
- Label matching (glob patterns supported)
- Worker availability (IDLE status)
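Glob-style label matching maps naturally onto Python's `fnmatch`; a minimal sketch of the matching rule, assuming every required label must be present and match:

```python
from fnmatch import fnmatch

def worker_matches(worker_labels: dict[str, str],
                   required: dict[str, str]) -> bool:
    """True if every required label exists on the worker and matches,
    with glob patterns (e.g. "a100*") allowed in the required value."""
    return all(
        key in worker_labels and fnmatch(worker_labels[key], pattern)
        for key, pattern in required.items()
    )
```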
Labels provide flexible worker selection:
Assign labels on worker creation:
```bash
curl -s -X POST http://localhost:8001/api/admin/workers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "worker-1",
    "labels": {
      "location": "us-west",
      "gpu_type": "a100",
      "team": "nlp"
    }
  }'
```

Select workers by label:
```bash
# Match workers with team=nlp
curl -s -X POST http://localhost:8001/api/queue/submit \
  -d '{"config_name": "my-config", "worker_labels": {"team": "nlp"}}'

# Match workers with gpu_type starting with "a100"
curl -s -X POST http://localhost:8001/api/queue/submit \
  -d '{"config_name": "my-config", "worker_labels": {"gpu_type": "a100*"}}'
```

List registered workers:

```bash
curl -s http://localhost:8001/api/admin/workers | jq
```

Response:
```json
{
  "workers": [
    {
      "id": "w_abc123",
      "name": "gpu-worker-1",
      "status": "idle",
      "worker_type": "persistent",
      "gpu_count": 2,
      "gpu_name": "A100",
      "labels": {"location": "datacenter-a"},
      "last_heartbeat": "2024-01-15T10:30:00Z"
    }
  ]
}
```

Gracefully finish current job and prevent new dispatch:
```bash
curl -s -X POST http://localhost:8001/api/admin/workers/w_abc123/drain
```

The worker will:
- Complete any running job
- Enter DRAINING status
- Refuse new jobs
- Disconnect after job completion (ephemeral) or remain in draining state (persistent)
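The drain behavior above can be sketched as a state update; the field names (`accepting_jobs`, `worker_type`, `status`) are assumptions for illustration:

```python
def drain(worker: dict) -> dict:
    """Mark a worker as draining: refuse new jobs, finish what's running."""
    worker["accepting_jobs"] = False
    worker["status"] = "draining"
    return worker

def on_job_finished(worker: dict) -> dict:
    """After the last job: ephemeral workers disconnect, while
    persistent workers remain in the draining state."""
    if worker["status"] == "draining" and worker["worker_type"] == "ephemeral":
        worker["status"] = "offline"
    return worker
```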
Regenerate a worker's authentication token:
```bash
curl -s -X POST http://localhost:8001/api/admin/workers/w_abc123/token
```

The old token is immediately invalidated. Update the worker with the new token.
```bash
curl -s -X DELETE http://localhost:8001/api/admin/workers/w_abc123
```

Only works if the worker is offline.
- Workers authenticate via the `X-Worker-Token` header
- Tokens are SHA-256 hashed before storage
- Tokens never leave the orchestrator after creation
- Rotate tokens periodically for security
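The hashed-token scheme can be illustrated with the standard library. The token format here is a guess based on the `wt_` prefix shown earlier, not SimpleTuner's actual generator:

```python
import hashlib
import secrets

def new_worker_token() -> tuple[str, str]:
    """Generate a raw token plus the SHA-256 digest the orchestrator stores.

    Only the digest is persisted; the raw token is shown exactly once.
    """
    token = "wt_" + secrets.token_hex(16)
    digest = hashlib.sha256(token.encode()).hexdigest()
    return token, digest

def verify_token(presented: str, stored_digest: str) -> bool:
    """Hash the presented token and compare in constant time."""
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(candidate, stored_digest)
```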
For production:

- Use the `--ssl` flag or terminate TLS at a reverse proxy
- Restrict worker registration to trusted networks
- Use firewall rules to limit access to the `/api/workers/*` endpoints
All worker actions are logged:
- Registration attempts (success/failure)
- Job dispatch events
- Status transitions
- Token rotations
- Admin operations
See Audit Guide for log access.
"Connection refused"
- Verify orchestrator URL and port
- Check firewall rules allow inbound connections
- Ensure the orchestrator is running with `--host 0.0.0.0`
"Invalid token"
- Token may have been rotated—request a new one
- Check for whitespace in token string
"SSL certificate verify failed"
- Use `--ssl-no-verify` for self-signed certs (development only)
- Or add the CA certificate to the system trust store
Heartbeat timeout (120s)
- Check network stability between worker and orchestrator
- Look for resource exhaustion (CPU/memory) on worker
- Increase `SIMPLETUNER_HEARTBEAT_TIMEOUT` if on an unreliable network
Process crash
- Check worker logs for Python exceptions
- Verify GPU drivers are functioning (`nvidia-smi`)
- Ensure sufficient disk space for training
No idle workers
- Check worker status in admin panel
- Verify workers are connected and IDLE
- Check for label mismatch between job and workers
GPU requirements not met
- Job requires more GPUs than any worker has
- Adjust `--num_processes` in the training config
| Endpoint | Method | Description |
|---|---|---|
| `/api/workers/register` | POST | Register and report capabilities |
| `/api/workers/stream` | GET | SSE stream for job dispatch |
| `/api/workers/heartbeat` | POST | Periodic keepalive |
| `/api/workers/job/{id}/status` | POST | Report job progress |
| `/api/workers/disconnect` | POST | Graceful shutdown notification |
| Endpoint | Method | Description |
|---|---|---|
| `/api/admin/workers` | GET | List all workers |
| `/api/admin/workers` | POST | Create worker token |
| `/api/admin/workers/{id}` | DELETE | Remove worker |
| `/api/admin/workers/{id}/drain` | POST | Drain worker |
| `/api/admin/workers/{id}/token` | POST | Rotate token |
- Enterprise Guide - SSO, quotas, approval workflows
- Job Queue - Queue scheduling and priorities
- Cloud Training - Cloud provider integration
- API Tutorial - Local training via REST API