SimpleTuner's worker orchestration allows you to distribute training jobs across multiple GPU machines. Workers register with a central orchestrator, receive job dispatch events in real-time, and report status back.
The orchestrator/worker architecture enables:
- Distributed training - Run jobs on any machine with GPUs, anywhere
- Auto-discovery - Workers self-register with GPU capabilities
- Real-time dispatch - Jobs dispatched via SSE (Server-Sent Events)
- Mixed fleet - Combine cloud-launched ephemeral workers with persistent on-prem machines
- Fault tolerance - Orphaned jobs are automatically requeued
| Type | Lifecycle | Use Case |
|---|---|---|
| Ephemeral | Shuts down after job completion | Cloud spot instances (RunPod, Vast.ai) |
| Persistent | Stays online between jobs | On-prem GPUs, reserved instances |
Run the SimpleTuner server on your central machine:
```bash
simpletuner server --host 0.0.0.0 --port 8001
```

For production, enable SSL:

```bash
simpletuner server --host 0.0.0.0 --port 8001 --ssl
```

Via Web UI: Administration → Workers → Create Worker
Via API:
```bash
curl -s -X POST http://localhost:8001/api/admin/workers \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpu-worker-1",
    "worker_type": "persistent",
    "labels": {"location": "datacenter-a", "gpu_type": "a100"}
  }'
```

Response includes the token (shown only once):
```json
{
  "worker_id": "w_abc123",
  "token": "wt_xxxxxxxxxxxx",
  "name": "gpu-worker-1"
}
```

On the GPU machine:
```bash
simpletuner worker \
  --orchestrator-url https://orchestrator.example.com:8001 \
  --worker-token wt_xxxxxxxxxxxx \
  --name gpu-worker-1 \
  --persistent
```

Or via environment variables:
```bash
export SIMPLETUNER_ORCHESTRATOR_URL=https://orchestrator.example.com:8001
export SIMPLETUNER_WORKER_TOKEN=wt_xxxxxxxxxxxx
export SIMPLETUNER_WORKER_NAME=gpu-worker-1
export SIMPLETUNER_WORKER_PERSISTENT=true

simpletuner worker
```

The worker will:
- Connect to the orchestrator
- Report GPU capabilities (auto-detected)
- Enter the job dispatch loop
- Send heartbeats every 30 seconds
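The heartbeat step can be sketched as a small timing helper. This is an illustration only: the payload fields (`name`, `status`, `ts`) are assumptions, not SimpleTuner's actual wire format.

```python
import time

HEARTBEAT_INTERVAL = 30  # seconds, matching the dispatch loop described above

def heartbeat_due(last_sent: float, now: float,
                  interval: float = HEARTBEAT_INTERVAL) -> bool:
    """True once `interval` seconds have elapsed since the last heartbeat."""
    return now - last_sent >= interval

def heartbeat_payload(worker_name: str, status: str) -> dict:
    """Build an illustrative heartbeat body; the real format may differ."""
    return {"name": worker_name, "status": status, "ts": time.time()}
```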
Via Web UI: Configure your training, then click Train in Cloud → select Worker as target.
Via API:
```bash
curl -s -X POST http://localhost:8001/api/queue/submit \
  -H "Content-Type: application/json" \
  -d '{
    "config_name": "my-training-config",
    "target": "worker"
  }'
```

Target options:
| Target | Behavior |
|---|---|
| `worker` | Dispatch only to remote workers |
| `local` | Run on orchestrator's GPUs |
| `auto` | Prefer worker if available, fall back to local |
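The `auto` fallback can be sketched as follows. This is a simplification: the real scheduler also weighs GPU requirements and labels.

```python
def resolve_target(target: str, idle_workers: int) -> str:
    """Map a submit target to where the job runs, per the table above."""
    if target == "worker":
        return "worker"
    if target == "local":
        return "local"
    if target == "auto":
        # Prefer a remote worker; fall back to the orchestrator's GPUs.
        return "worker" if idle_workers > 0 else "local"
    raise ValueError(f"unknown target: {target!r}")
```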
```
simpletuner worker [OPTIONS]

OPTIONS:
  --orchestrator-url URL   Orchestrator panel URL (or SIMPLETUNER_ORCHESTRATOR_URL)
  --worker-token TOKEN     Authentication token (or SIMPLETUNER_WORKER_TOKEN)
  --name NAME              Worker name (default: hostname)
  --persistent             Stay online between jobs (default: ephemeral)
  -v, --verbose            Enable debug logging
```
Ephemeral (default):
- Worker shuts down after completing one job
- Ideal for cloud spot instances that bill per minute
- Orchestrator cleans up offline ephemeral workers after 1 hour
Persistent (--persistent):
- Worker stays online waiting for new jobs
- Reconnects automatically if connection drops
- Use for on-prem GPUs or reserved instances
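Automatic reconnection typically uses exponential backoff. SimpleTuner's exact retry schedule isn't documented here, so the values below are hypothetical:

```python
def backoff_delays(attempts: int = 6, base: float = 1.0,
                   cap: float = 60.0) -> list[float]:
    """Exponential backoff with a cap: 1, 2, 4, 8, ... up to `cap` seconds.

    Illustrative only -- the worker's real retry policy may differ.
    """
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```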
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ CONNECTING  │ ──▶ │    IDLE     │ ──▶ │    BUSY     │
└─────────────┘     └─────────────┘     └─────────────┘
                           │                   │
                           │                   │
                           ▼                   ▼
                    ┌─────────────┐     ┌─────────────┐
                    │  DRAINING   │     │   OFFLINE   │
                    └─────────────┘     └─────────────┘
```
| Status | Description |
|---|---|
| `CONNECTING` | Worker establishing connection |
| `IDLE` | Ready to receive jobs |
| `BUSY` | Currently running a job |
| `DRAINING` | Finishing current job, then shutting down |
| `OFFLINE` | Disconnected (heartbeat timeout) |
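The lifecycle can be encoded as a small transition map. The allowed edges here are assumptions read from the diagram and status descriptions; the implementation may permit others:

```python
# Allowed status transitions, as inferred from the lifecycle diagram above.
ALLOWED_TRANSITIONS = {
    "connecting": {"idle", "offline"},
    "idle": {"busy", "draining", "offline"},
    "busy": {"idle", "draining", "offline"},
    "draining": {"offline"},
    "offline": {"connecting"},
}

def can_transition(src: str, dst: str) -> bool:
    """Check whether a status change is one of the inferred edges."""
    return dst in ALLOWED_TRANSITIONS.get(src, set())
```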
The orchestrator monitors worker health:
- Heartbeat interval: 30 seconds (worker → orchestrator)
- Timeout threshold: 120 seconds without heartbeat → mark offline
- Health check loop: Runs every 60 seconds on orchestrator
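The health-check sweep amounts to comparing each worker's last heartbeat timestamp against the 120-second threshold; a minimal sketch:

```python
HEARTBEAT_TIMEOUT = 120.0  # seconds without a heartbeat before marking offline

def sweep_offline(last_heartbeats: dict[str, float], now: float,
                  timeout: float = HEARTBEAT_TIMEOUT) -> list[str]:
    """Return ids of workers whose last heartbeat is older than `timeout`."""
    return [wid for wid, last in last_heartbeats.items()
            if now - last > timeout]
```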
Worker goes offline during a job:
- Job marked as failed after heartbeat timeout
- If retries remaining (default: 3), job requeued
- Next available worker picks up the job
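The requeue decision for a lost worker can be sketched as follows; the field names (`retries`, `status`) are illustrative assumptions:

```python
MAX_RETRIES = 3  # default retry budget noted above

def on_worker_lost(job: dict, max_retries: int = MAX_RETRIES) -> dict:
    """Handle a job whose worker hit the heartbeat timeout."""
    retries = job.get("retries", 0)
    if retries < max_retries:
        job["retries"] = retries + 1
        job["status"] = "queued"   # requeued; next idle worker picks it up
    else:
        job["status"] = "failed"   # retry budget exhausted
    return job
```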
Orchestrator restarts:
- Workers automatically reconnect
- Workers report any in-progress jobs
- Orchestrator reconciles state and resumes
Workers report their GPU capabilities on registration:
```json
{
  "gpu_count": 2,
  "gpu_name": "NVIDIA A100-SXM4-80GB",
  "gpu_vram_gb": 80,
  "accelerator_type": "cuda"
}
```

Jobs can specify GPU requirements:
```bash
curl -s -X POST http://localhost:8001/api/queue/submit \
  -H "Content-Type: application/json" \
  -d '{
    "config_name": "my-config",
    "target": "worker",
    "worker_labels": {"gpu_type": "a100*"}
  }'
```

The scheduler matches jobs to workers based on:
- GPU count requirements
- Label matching (glob patterns supported)
- Worker availability (IDLE status)
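Glob-style label matching maps naturally onto Python's `fnmatch`; a minimal sketch of the matching rule, assuming every required label must be present and match:

```python
from fnmatch import fnmatch

def worker_matches(worker_labels: dict[str, str],
                   required: dict[str, str]) -> bool:
    """True if every required label exists on the worker and matches,
    with glob patterns (e.g. "a100*") allowed in the required value."""
    return all(
        key in worker_labels and fnmatch(worker_labels[key], pattern)
        for key, pattern in required.items()
    )
```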
Labels provide flexible worker selection:
Assign labels on worker creation:
```bash
curl -s -X POST http://localhost:8001/api/admin/workers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "worker-1",
    "labels": {
      "location": "us-west",
      "gpu_type": "a100",
      "team": "nlp"
    }
  }'
```

Select workers by label:
```bash
# Match workers with team=nlp
curl -s -X POST http://localhost:8001/api/queue/submit \
  -d '{"config_name": "my-config", "worker_labels": {"team": "nlp"}}'

# Match workers with gpu_type starting with "a100"
curl -s -X POST http://localhost:8001/api/queue/submit \
  -d '{"config_name": "my-config", "worker_labels": {"gpu_type": "a100*"}}'
```

List registered workers:

```bash
curl -s http://localhost:8001/api/admin/workers | jq
```

Response:
```json
{
  "workers": [
    {
      "id": "w_abc123",
      "name": "gpu-worker-1",
      "status": "idle",
      "worker_type": "persistent",
      "gpu_count": 2,
      "gpu_name": "A100",
      "labels": {"location": "datacenter-a"},
      "last_heartbeat": "2024-01-15T10:30:00Z"
    }
  ]
}
```

Gracefully finish current job and prevent new dispatch:
```bash
curl -s -X POST http://localhost:8001/api/admin/workers/w_abc123/drain
```

The worker will:
- Complete any running job
- Enter DRAINING status
- Refuse new jobs
- Disconnect after job completion (ephemeral) or remain in draining state (persistent)
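The drain behavior above can be sketched as a state update; the field names (`accepting_jobs`, `worker_type`, `status`) are assumptions for illustration:

```python
def drain(worker: dict) -> dict:
    """Mark a worker as draining: refuse new jobs, finish what's running."""
    worker["accepting_jobs"] = False
    worker["status"] = "draining"
    return worker

def on_job_finished(worker: dict) -> dict:
    """After the last job: ephemeral workers disconnect, while
    persistent workers remain in the draining state."""
    if worker["status"] == "draining" and worker["worker_type"] == "ephemeral":
        worker["status"] = "offline"
    return worker
```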
Regenerate a worker's authentication token:
```bash
curl -s -X POST http://localhost:8001/api/admin/workers/w_abc123/token
```

The old token is immediately invalidated. Update the worker with the new token.
```bash
curl -s -X DELETE http://localhost:8001/api/admin/workers/w_abc123
```

Only works if the worker is offline.
- Workers authenticate via the `X-Worker-Token` header
- Tokens are SHA-256 hashed before storage
- Tokens never leave the orchestrator after creation
- Rotate tokens periodically for security
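The hashed-token scheme can be illustrated with the standard library. The token format here is a guess based on the `wt_` prefix shown earlier, not SimpleTuner's actual generator:

```python
import hashlib
import secrets

def new_worker_token() -> tuple[str, str]:
    """Generate a raw token plus the SHA-256 digest the orchestrator stores.

    Only the digest is persisted; the raw token is shown exactly once.
    """
    token = "wt_" + secrets.token_hex(16)
    digest = hashlib.sha256(token.encode()).hexdigest()
    return token, digest

def verify_token(presented: str, stored_digest: str) -> bool:
    """Hash the presented token and compare in constant time."""
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(candidate, stored_digest)
```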
For production:

- Use the `--ssl` flag or terminate TLS at a reverse proxy
- Restrict worker registration to trusted networks
- Use firewall rules to limit access to the `/api/workers/*` endpoints
All worker actions are logged:
- Registration attempts (success/failure)
- Job dispatch events
- Status transitions
- Token rotations
- Admin operations
See Audit Guide for log access.
"Connection refused"
- Verify orchestrator URL and port
- Check firewall rules allow inbound connections
- Ensure the orchestrator is running with `--host 0.0.0.0`
"Invalid token"
- Token may have been rotated—request a new one
- Check for whitespace in token string
"SSL certificate verify failed"
- Use `--ssl-no-verify` for self-signed certs (development only)
- Or add the CA certificate to the system trust store
Heartbeat timeout (120s)
- Check network stability between worker and orchestrator
- Look for resource exhaustion (CPU/memory) on worker
- Increase `SIMPLETUNER_HEARTBEAT_TIMEOUT` if on an unreliable network
Process crash
- Check worker logs for Python exceptions
- Verify GPU drivers are functioning (`nvidia-smi`)
- Ensure sufficient disk space for training
No idle workers
- Check worker status in admin panel
- Verify workers are connected and IDLE
- Check for label mismatch between job and workers
GPU requirements not met
- Job requires more GPUs than any worker has
- Adjust `--num_processes` in the training config
| Endpoint | Method | Description |
|---|---|---|
| `/api/workers/register` | POST | Register and report capabilities |
| `/api/workers/stream` | GET | SSE stream for job dispatch |
| `/api/workers/heartbeat` | POST | Periodic keepalive |
| `/api/workers/job/{id}/status` | POST | Report job progress |
| `/api/workers/disconnect` | POST | Graceful shutdown notification |
| Endpoint | Method | Description |
|---|---|---|
| `/api/admin/workers` | GET | List all workers |
| `/api/admin/workers` | POST | Create worker token |
| `/api/admin/workers/{id}` | DELETE | Remove worker |
| `/api/admin/workers/{id}/drain` | POST | Drain worker |
| `/api/admin/workers/{id}/token` | POST | Rotate token |
- Enterprise Guide - SSO, quotas, approval workflows
- Job Queue - Queue scheduling and priorities
- Cloud Training - Cloud provider integration
- API Tutorial - Local training via REST API