This document covers deploying SimpleTuner in multi-user environments with authentication, approval workflows, and quota management.
Most enterprise features can be configured via the Web UI (Administration panel) or REST API. A few infrastructure-level settings require a config file or environment variables.
| Feature | Web UI | API | Config File |
|---|---|---|---|
| OIDC/LDAP providers | ✓ | ✓ | ✓ |
| Users & roles | ✓ | ✓ | |
| Approval rules | ✓ | ✓ | |
| Quotas | ✓ | ✓ | |
| Notifications | ✓ | ✓ | |
| Network bypass (trusted proxies) | | | ✓ |
| Background job polling | | | ✓ |
| TLS settings | | | ✓ |
The config file (`simpletuner-enterprise.yaml` or `.json`) is only needed for infrastructure settings that must be known at startup. SimpleTuner searches these locations:
- `$SIMPLETUNER_ENTERPRISE_CONFIG` (environment variable)
- `./simpletuner-enterprise.yaml` (current directory)
- `~/.config/simpletuner/enterprise.yaml`
- `/etc/simpletuner/enterprise.yaml`
The file supports environment variable interpolation with ${VAR} syntax.
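As a rough illustration of the lookup and `${VAR}` interpolation described above, the sketch below walks the documented locations and substitutes variables in the file text. The function names are hypothetical, and two behaviors are assumptions rather than documented facts: the first existing file wins, and unset variables are left untouched.

```python
import os
import re
from pathlib import Path
from typing import Mapping, Optional

_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_enterprise_config(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the first existing config file from the documented locations."""
    candidates = [
        env.get("SIMPLETUNER_ENTERPRISE_CONFIG"),
        "./simpletuner-enterprise.yaml",
        "~/.config/simpletuner/enterprise.yaml",
        "/etc/simpletuner/enterprise.yaml",
    ]
    for candidate in candidates:
        if not candidate:
            continue
        path = Path(candidate).expanduser()
        if path.is_file():
            return path
    return None


def interpolate(text: str, env: Mapping[str, str] = os.environ) -> str:
    """Replace ${VAR} with its value; unset variables stay as-is (assumption)."""
    return _VAR_PATTERN.sub(lambda m: env.get(m.group(1), m.group(0)), text)
```

Passing `env` explicitly keeps both helpers easy to test without touching the real process environment.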
- Start SimpleTuner: `simpletuner server` (or `--webui` for local use)
- Configure via UI: Navigate to Administration panel to set up users, SSO, quotas
- Health Checks (for production):
  - Liveness: `GET /api/cloud/health/live` (200 OK)
  - Readiness: `GET /api/cloud/health/ready` (200 OK)
  - Deep Check: `GET /api/cloud/health` (includes provider connectivity)
Configuring trusted proxies and internal network bypass (config file required)
In corporate environments (VPNs, private VPCs), you may want to trust internal traffic or offload authentication to a gateway.
`simpletuner-enterprise.yaml`:

```yaml
network:
  # Trust headers from your load balancer (e.g., AWS ALB, Nginx)
  trust_proxy_headers: true
  trusted_proxies:
    - "10.0.0.0/8"
    - "192.168.0.0/16"

  # Optional: Trust specific internal subnets to bypass login
  bypass_auth_for_internal: true
  internal_networks:
    - "10.10.0.0/16"  # VPN Clients

auth:
  # Always allow health checks without auth
  bypass_paths:
    - "/health"
    - "/api/cloud/health"
    - "/api/cloud/metrics/prometheus"
```

SimpleTuner expects an upstream reverse proxy for TLS termination.
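To make the trusted-proxy and internal-network settings more concrete, here is a minimal sketch of how such checks are commonly implemented. It is not SimpleTuner's actual code: the function names are hypothetical, a single proxy hop is assumed (so the right-most `X-Forwarded-For` entry is the address the proxy saw), and prefix matching for `bypass_paths` is an assumption.

```python
import ipaddress
from typing import Iterable


def _in_any(ip: str, cidrs: Iterable[str]) -> bool:
    """True if ip falls inside at least one of the CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def client_ip(peer_ip: str, forwarded_for: str, trusted_proxies: Iterable[str]) -> str:
    """Honor X-Forwarded-For only when the direct peer is a trusted proxy."""
    if not forwarded_for or not _in_any(peer_ip, trusted_proxies):
        return peer_ip
    hops = [hop.strip() for hop in forwarded_for.split(",") if hop.strip()]
    # Assumption: one proxy hop, so the last entry was appended by that proxy.
    return hops[-1] if hops else peer_ip


def bypasses_auth(ip: str, path: str, internal_networks: Iterable[str],
                  bypass_paths: Iterable[str]) -> bool:
    """Skip login for listed paths (prefix match) or internal client addresses."""
    for prefix in bypass_paths:
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            return True
    return _in_any(ip, internal_networks)
```

Ignoring forwarded headers from untrusted peers is the important design point: otherwise any client could claim an internal address and skip login.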
nginx reverse proxy example
```nginx
server {
    listen 443 ssl http2;
    server_name trainer.internal;

    ssl_certificate     /etc/ssl/certs/simpletuner.crt;
    ssl_certificate_key /etc/ssl/private/simpletuner.key;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support for real-time logs/SSE
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400;
    }
}
```

Metrics:
Scrape `GET /api/cloud/metrics/prometheus` for operational insights.
- `simpletuner_jobs_active`: Current queue depth.
- `simpletuner_cost_total_usd`: Spend tracking.
- `simpletuner_uptime_seconds`: Availability.
Logging:
Set SIMPLETUNER_LOG_FORMAT=json for ingestion into Splunk/Datadog/ELK.
Data Retention Configuration
Configure retention periods for compliance requirements via environment variables:
| Variable | Default | Description |
|---|---|---|
| `SIMPLETUNER_JOB_RETENTION_DAYS` | 90 | Days to retain completed job records |
| `SIMPLETUNER_AUDIT_RETENTION_DAYS` | 90 | Days to retain audit log entries |
```bash
# SOC 2 / HIPAA: 1 year retention
export SIMPLETUNER_JOB_RETENTION_DAYS=365
export SIMPLETUNER_AUDIT_RETENTION_DAYS=365

# Disable automatic cleanup (manual management)
export SIMPLETUNER_JOB_RETENTION_DAYS=0
```

Setting a variable to 0 disables automatic cleanup for that record type. Cleanup runs daily.
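The retention semantics can be pictured as a cutoff timestamp computed on each daily run. The helper below is a hypothetical sketch, not SimpleTuner's implementation; it only mirrors the documented default of 90 days and the rule that 0 disables cleanup.

```python
import os
from datetime import datetime, timedelta
from typing import Mapping, Optional


def retention_cutoff(variable: str, now: datetime,
                     env: Mapping[str, str] = os.environ) -> Optional[datetime]:
    """Return the timestamp before which records may be purged, or None if disabled."""
    days = int(env.get(variable, "90"))  # documented default: 90 days
    if days <= 0:
        return None  # 0 disables automatic cleanup
    return now - timedelta(days=days)
```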
SimpleTuner supports OIDC (OpenID Connect) and LDAP for SSO with Okta, Azure AD, Keycloak, and Active Directory.
Via Web UI: Navigate to Administration → Auth to add and configure providers.
Via API: See the API Cookbook for curl examples.
Via config file (for IaC/GitOps workflows)
Add to your simpletuner-enterprise.yaml:
```yaml
oidc:
  enabled: true
  provider: "okta"  # or "azure_ad", "google"
  client_id: "0oa1234567890abcdef"
  client_secret: "${OIDC_CLIENT_SECRET}"
  issuer_url: "https://your-org.okta.com/oauth2/default"
  scopes: ["openid", "email", "profile", "groups"]

  # Map Identity Provider groups to SimpleTuner Roles
  role_mapping:
    claim: "groups"
    admin_groups: ["ML-Platform-Admins"]
    user_groups: ["ML-Researchers"]
```

Cross-Worker OAuth State Validation
When using OIDC authentication in multi-worker deployments (e.g., behind a load balancer with multiple Gunicorn workers), OAuth state validation must work across all workers. SimpleTuner handles this automatically by storing OAuth state in the database.
How it works:
- State Generation: When a user initiates OIDC login, a cryptographically random state token is generated and stored in the database with the provider name, redirect URI, and a 10-minute expiration.
- State Validation: When the callback arrives (potentially to a different worker), the state is looked up and atomically consumed (single-use).
- Cleanup: Expired states are automatically purged during normal operations.
No additional configuration is needed. OAuth state storage uses the same database as jobs and users.
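The generate, validate, and consume steps can be sketched with SQLite as follows. This is an illustrative model only: the table name, column names, and function names are hypothetical, and SimpleTuner's real schema is not shown in this document.

```python
import secrets
import sqlite3
import time
from typing import Optional, Tuple

STATE_TTL_SECONDS = 600  # 10-minute expiration, as described above


def init_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS oauth_state ("
        "state TEXT PRIMARY KEY, provider TEXT NOT NULL, "
        "redirect_uri TEXT NOT NULL, expires_at REAL NOT NULL)"
    )


def create_state(conn: sqlite3.Connection, provider: str, redirect_uri: str,
                 now: Optional[float] = None) -> str:
    """Store a random state token so any worker can validate the callback."""
    now = time.time() if now is None else now
    state = secrets.token_urlsafe(32)
    with conn:  # commits on success
        conn.execute(
            "INSERT INTO oauth_state VALUES (?, ?, ?, ?)",
            (state, provider, redirect_uri, now + STATE_TTL_SECONDS),
        )
    return state


def consume_state(conn: sqlite3.Connection, state: str,
                  now: Optional[float] = None) -> Optional[Tuple[str, str]]:
    """Return (provider, redirect_uri) once; later or expired lookups return None."""
    now = time.time() if now is None else now
    with conn:
        row = conn.execute(
            "SELECT provider, redirect_uri FROM oauth_state "
            "WHERE state = ? AND expires_at > ?",
            (state, now),
        ).fetchone()
        if row is None:
            return None
        # Only the worker whose DELETE removes the row has consumed the state.
        deleted = conn.execute(
            "DELETE FROM oauth_state WHERE state = ? AND expires_at > ?",
            (state, now),
        ).rowcount
    return row if deleted == 1 else None
```

Checking the `DELETE` row count is what makes the token single-use: if two workers race on the same callback, only one of them sees a count of 1.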
Troubleshooting "Invalid OAuth state" errors:
- Check if the callback arrived within 10 minutes of login initiation
- Verify all workers share the same database path
- Check database write permissions
- Look for "Failed to store OAuth state" errors in logs
SimpleTuner uses a hierarchical role system. Users can be managed via GET/POST /api/users.
| Role | Priority | Description |
|---|---|---|
| Viewer | 10 | Read-only access to job history and logs. |
| Researcher | 20 | Standard access. Can submit jobs and manage their own API keys. |
| Lead | 30 | Can approve pending jobs and view team resource usage. |
| Admin | 100 | Full system access, including user management and rule configuration. |
Control costs and resource usage by requiring approvals for specific criteria. Rules are evaluated at submission time.
Workflow:
- User submits job -> Status becomes `pending_approval`.
- Leads check `GET /api/approvals/requests`.
- Lead calls `POST /.../approve` or `reject`.
- Job automatically proceeds to queue or is cancelled.
Approval Rules Engine
The rules engine evaluates job submissions against configured rules. Rules are processed in priority order; the first matching rule triggers the approval requirement.
Available Rule Conditions:
| Condition | Description |
|---|---|
| `cost_exceeds` | Triggers when estimated cost exceeds threshold (USD) |
| `hardware_type` | Matches hardware type (glob pattern, e.g., `a100*`) |
| `daily_jobs_exceed` | Triggers when user's daily job count exceeds threshold |
| `first_job` | Triggers for a user's very first job |
| `config_pattern` | Matches config name patterns |
| `provider` | Matches specific provider name |
Example: Require approval for jobs over $50:
```bash
curl -X POST http://localhost:8080/api/approvals/rules \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "high_cost",
    "condition": "cost_exceeds",
    "threshold": "50",
    "required_approver_level": "lead",
    "exempt_levels": ["admin"]
  }'
```

Rules can specify `exempt_levels` to allow certain users to bypass approval, and `applies_to_provider`/`applies_to_level` to scope rules.
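The first-match evaluation can be modeled roughly as below. This is a sketch under assumptions, not the engine's real code: the `Rule` and `Job` classes and their field names are hypothetical (only `condition`, `threshold`, and `exempt_levels` come from the example request), and lower `priority` values are assumed to be evaluated first.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class Rule:
    name: str
    condition: str
    threshold: str = ""
    priority: int = 0  # assumption: lower values are evaluated first
    exempt_levels: List[str] = field(default_factory=list)


@dataclass
class Job:
    user_level: str
    estimated_cost: float = 0.0
    hardware_type: str = ""
    provider: str = ""
    config_name: str = ""
    jobs_today: int = 0
    is_first_job: bool = False


def _matches(rule: Rule, job: Job) -> bool:
    if rule.condition == "cost_exceeds":
        return job.estimated_cost > float(rule.threshold)
    if rule.condition == "hardware_type":
        return fnmatchcase(job.hardware_type, rule.threshold)
    if rule.condition == "daily_jobs_exceed":
        return job.jobs_today > int(rule.threshold)
    if rule.condition == "first_job":
        return job.is_first_job
    if rule.condition == "config_pattern":
        return fnmatchcase(job.config_name, rule.threshold)
    if rule.condition == "provider":
        return job.provider == rule.threshold
    return False


def first_matching_rule(rules: List[Rule], job: Job) -> Optional[Rule]:
    """Return the rule that triggers approval, or None if the job can proceed."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if job.user_level in rule.exempt_levels:
            continue  # exempt users skip this rule entirely
        if _matches(rule, job):
            return rule
    return None
```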
Email-Based Approval (IMAP Workflow)
For teams that prefer email-based workflows, SimpleTuner supports approval via email replies using IMAP IDLE.
How It Works:
- Job submission triggers approval requirement
- Notification email sent to approvers with unique response token
- IMAP handler monitors inbox using IDLE (push notifications)
- Approver replies with "approve" or "reject" (or aliases like `yes`, `lgtm`, `+1`)
- System parses response and processes approval
Configure via Administration → Notifications or API. Response tokens expire after 24 hours and are single-use.
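A reply parser for this workflow might look like the sketch below. The keyword sets contain only the words named above; the parsing strategy (first non-quoted line, first word) and the function names are assumptions made for illustration.

```python
from typing import Optional

APPROVE_WORDS = {"approve", "yes", "lgtm", "+1"}
REJECT_WORDS = {"reject"}
TOKEN_TTL_SECONDS = 24 * 60 * 60  # response tokens expire after 24 hours


def parse_reply(body: str) -> Optional[str]:
    """Classify an email reply as 'approve', 'reject', or None if unclear."""
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and quoted original text
        word = line.split()[0].lower().strip(".,!:;")
        if word in APPROVE_WORDS:
            return "approve"
        if word in REJECT_WORDS:
            return "reject"
        return None  # first meaningful line is not a recognized decision
    return None


def token_is_usable(issued_at: float, already_used: bool, now: float) -> bool:
    """Tokens are single-use and valid for 24 hours after issue."""
    return not already_used and now - issued_at <= TOKEN_TTL_SECONDS
```

Returning `None` for unrecognized replies lets the caller leave the request pending instead of guessing a decision.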
The scheduler manages fair usage of resources. See its dedicated documentation for details.
- Priority: Admins > Leads > Researchers > Viewers.
- Concurrency: Limits are enforced globally and per-user.
  - Update limits via UI: Cloud tab → Job Queue panel (admin only)
  - Update limits via API: `POST /api/queue/concurrency` with `{"max_concurrent": 10, "user_max_concurrent": 3}`
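To illustrate how role priority and the two concurrency limits could interact, here is a simplified selection function. It is only a sketch: the real scheduler is covered in its own documentation and may weigh other factors. The `QueuedJob` type is hypothetical, and the numeric priorities are taken from the role table earlier in this document.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

ROLE_PRIORITY = {"viewer": 10, "researcher": 20, "lead": 30, "admin": 100}


@dataclass
class QueuedJob:
    job_id: str
    user: str
    role: str
    submitted_at: float


def next_runnable(pending: List[QueuedJob], running: List[QueuedJob],
                  max_concurrent: int, user_max_concurrent: int) -> Optional[QueuedJob]:
    """Pick the next job to start, or None if limits leave no room."""
    if len(running) >= max_concurrent:
        return None  # global limit reached
    per_user = Counter(job.user for job in running)
    # Higher role priority first; earlier submissions break ties.
    ordered = sorted(pending, key=lambda j: (-ROLE_PRIORITY[j.role], j.submitted_at))
    for job in ordered:
        if per_user[job.user] < user_max_concurrent:
            return job
    return None
```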
For secure environments where public webhooks are impossible, SimpleTuner includes a background poller.
Add to simpletuner-enterprise.yaml:
```yaml
background:
  job_status_polling:
    enabled: true
    interval_seconds: 30
```

This service queries the provider API every 30s and updates the local database, emitting real-time events to the UI via SSE.
Securely manage cloud provider credentials. See API Cookbook for rotation scripts and provider-specific details in the Cloud Training documentation.
OIDC/LDAP Configuration Examples
Keycloak (OIDC):
```bash
curl -X POST http://localhost:8080/api/cloud/external-auth/providers \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "keycloak",
    "provider_type": "oidc",
    "enabled": true,
    "config": {
      "issuer": "https://keycloak.example.com/realms/ml-training",
      "client_id": "simpletuner",
      "client_secret": "your-client-secret",
      "scopes": ["openid", "email", "profile", "roles"],
      "roles_claim": "realm_access.roles"
    }
  }'
```

LDAP / Active Directory:

```bash
curl -X POST http://localhost:8080/api/cloud/external-auth/providers \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "corporate-ad",
    "provider_type": "ldap",
    "enabled": true,
    "level_mapping": {
      "CN=ML-Admins,OU=Groups,DC=corp,DC=com": ["admin"]
    },
    "config": {
      "server": "ldaps://ldap.corp.com:636",
      "base_dn": "DC=corp,DC=com",
      "bind_dn": "CN=svc-simpletuner,OU=Service Accounts,DC=corp,DC=com",
      "bind_password": "service-account-password",
      "user_search_filter": "(sAMAccountName={username})",
      "use_ssl": true
    }
  }'
```

User Administration Examples
Create a Researcher:

```bash
curl -X POST http://localhost:8080/api/users \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "email": "researcher@company.com",
    "username": "jsmith",
    "password": "secure_password_123",
    "level_names": ["researcher"]
  }'
```

Grant Custom Permission:

```bash
curl -X POST http://localhost:8080/api/users/123/permissions \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"permission_name": "admin.approve", "granted": true}'
```

Credential Management
SimpleTuner includes credential lifecycle management for tracking, rotating, and auditing API credentials.
Credential Resolution: When submitting jobs, SimpleTuner checks per-user credentials first, then falls back to global credentials (environment variables).
| Scenario | Per-User | Global | Behavior |
|---|---|---|---|
| Shared org key | ❌ | ✅ | All users share the org's API key |
| BYOK | ✅ | ❌ | Each user provides their own key |
| Hybrid | Some | ✅ | Users with keys use theirs, others use global |
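The resolution order in the table reduces to a simple fallback. In the sketch below, the function name and the `<PROVIDER>_API_TOKEN` environment variable convention are hypothetical placeholders; the actual variable names depend on the provider and are not listed in this document.

```python
import os
from typing import Mapping, Optional


def resolve_credential(provider: str, user_credentials: Mapping[str, str],
                       env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer the user's own key; otherwise fall back to the global one."""
    if provider in user_credentials:
        return user_credentials[provider]
    # Hypothetical naming convention for the global credential.
    return env.get(f"{provider.upper()}_API_TOKEN")
```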
Rotation: Navigate to Admin > Auth → user → Manage Credentials → Rotate. Stale credentials (>90 days) display a warning badge.
Airflow example
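```python
import os
import time

import requests

# Hypothetical variable name; supply the API token however your DAG manages secrets.
TOKEN = os.environ["SIMPLETUNER_API_TOKEN"]

def submit_and_wait(job_config, provider="replicate", **context):
    headers = {"Authorization": f"Bearer {TOKEN}"}
    resp = requests.post(
        f"http://localhost:8080/api/cloud/{provider}/submit",
        json=job_config,
        headers=headers,
    )
    resp.raise_for_status()  # fail the task on a rejected submission
    job_id = resp.json()["job_id"]
    # Poll until the job reaches a terminal state.
    while True:
        status = requests.get(f"http://localhost:8080/api/cloud/jobs/{job_id}", headers=headers)
        status.raise_for_status()
        state = status.json()["status"]
        if state in ("completed", "failed", "cancelled"):
            return status.json()
        time.sleep(30)
```

Health Check Failures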
- 503 Service Unavailable: Check database connectivity.
- Degraded: Usually means an optional component (like a cloud provider API) is unreachable or unconfigured.
Authentication Issues
- OIDC Redirect Loop: Verify `issuer_url` matches exactly what the provider expects (trailing slashes matter!).
- Internal Auth Bypass: Check server logs for "Auth bypassed for IP..." to confirm your load balancer is passing the correct `X-Real-IP`.
Job Updates Stalled
- If webhooks are blocked, ensure Job Status Polling is enabled in `simpletuner-enterprise.yaml`.
- Check `GET /api/cloud/metrics/prometheus` for `simpletuner_jobs_active` to see if the internal state thinks jobs are running.
Missing Metrics
- Ensure your Prometheus scraper is configured to hit `/api/cloud/metrics/prometheus` and not just `/metrics`.
SimpleTuner supports hierarchical organizations and teams with ceiling-based quota enforcement.
Organization (quota ceiling)
└── Team (quota ceiling, bounded by org)
└── User (limit, bounded by team and org)
Quotas use a ceiling model where org limits are absolute ceilings:
- Org quota: Absolute ceiling for all members
- Team quota: Ceiling for team members (cannot exceed org)
- User/Level quota: Specific limits (bounded by team and org)
Example:
- Org ceiling: 100 concurrent jobs
- Team ceiling: 20 concurrent jobs
- User limit: 50 concurrent jobs → Effective: 20 (team ceiling applies)
Enforcement Rules:
- Team quotas are validated at set-time: attempting to set a team quota higher than the org ceiling returns HTTP 400
- User quotas are validated at runtime: effective limit is the minimum of user, team, and org ceilings
- Reducing an org ceiling does not automatically reduce existing team ceilings (admin must update manually)
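The ceiling model comes down to taking a minimum at runtime and validating team ceilings when they are set. The sketch below uses hypothetical function names and treats `None` as "no limit configured", which is an assumption.

```python
from typing import Optional


def effective_limit(user_limit: Optional[int], team_ceiling: Optional[int],
                    org_ceiling: Optional[int]) -> Optional[int]:
    """Runtime check: the tightest configured limit applies."""
    configured = [v for v in (user_limit, team_ceiling, org_ceiling) if v is not None]
    return min(configured) if configured else None


def validate_team_ceiling(team_ceiling: int, org_ceiling: Optional[int]) -> None:
    """Set-time check: a team ceiling may not exceed the org ceiling."""
    if org_ceiling is not None and team_ceiling > org_ceiling:
        # The API reports this case as HTTP 400.
        raise ValueError(f"team ceiling {team_ceiling} exceeds org ceiling {org_ceiling}")
```

With the example values above, `effective_limit(50, 20, 100)` yields 20 because the team ceiling is the tightest bound.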
API Examples
Create Organization:
```bash
curl -X POST http://localhost:8080/api/orgs \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "ML Research", "slug": "ml-research"}'
```

Set Org Quota Ceiling:

```bash
curl -X POST http://localhost:8080/api/orgs/1/quotas \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"quota_type": "concurrent_jobs", "limit_value": 100, "action": "block"}'
```

Create Team:

```bash
curl -X POST http://localhost:8080/api/orgs/1/teams \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "NLP Team", "slug": "nlp"}'
```

Add User to Team:

```bash
curl -X POST http://localhost:8080/api/orgs/1/teams/1/members \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"user_id": 123, "role": "member"}'
```

When a quota or cost limit is reached, the configured action determines behavior:
| Action | Behavior |
|---|---|
| `warn` | Job proceeds with warning in logs/UI |
| `block` | Job submission rejected |
| `notify` | Job proceeds, admins alerted |
Cost Limit Configuration
Cost limits can be configured per-provider via Cloud tab → Settings or API:
```bash
curl -X PUT http://localhost:8080/api/cloud/providers/<provider>/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "cost_limit_enabled": true,
    "cost_limit_amount": 500.00,
    "cost_limit_period": "monthly",
    "cost_limit_action": "warn"
  }'
```

Check status: `GET /api/cloud/metrics/cost-limit-status`
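The three limit actions can be summarized as a small decision function. This is an illustrative sketch: `LimitDecision` and `apply_limit_action` are hypothetical names, and treating "reached" as spend greater than or equal to the limit is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LimitDecision:
    allowed: bool
    warn: bool = False
    notify_admins: bool = False


def apply_limit_action(spent: float, limit: float, action: str) -> LimitDecision:
    """Map the configured action to what happens once the limit is reached."""
    if spent < limit:
        return LimitDecision(allowed=True)
    if action == "block":
        return LimitDecision(allowed=False)
    if action == "warn":
        return LimitDecision(allowed=True, warn=True)
    if action == "notify":
        return LimitDecision(allowed=True, notify_admins=True)
    raise ValueError(f"unknown action: {action!r}")
```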
SimpleTuner does not support job dependencies or multi-step workflows where one job's output feeds into another. Each cloud job is independent.
Recommended approach: Use external orchestration tools like Airflow, Prefect, or Dagster to chain jobs via the REST API. See the Airflow example in the API Cookbook above.
There is no built-in support for resuming interrupted, failed, or cancelled training runs. Cloud jobs do not automatically recover from checkpoints.
Workarounds:
- Configure frequent HuggingFace Hub pushes (`--push_checkpoints_to_hub`) to save intermediate state
- Implement custom checkpoint management by downloading outputs and using them as starting points for new jobs
- For mission-critical workloads, consider breaking long training runs into smaller segments
UI Feature Reference
| Feature | UI Location | API |
|---|---|---|
| Organizations & Teams | Administration → Orgs | /api/orgs |
| Quotas | Administration → Quotas | /api/orgs/{id}/quotas |
| OIDC/LDAP | Administration → Auth | /api/cloud/external-auth/providers |
| Users | Administration → Users | /api/users |
| Audit Logs | Sidebar → Audit Log | /api/audit |
| Queue | Cloud tab → Job Queue | /api/queue/concurrency |
| Approvals | Administration → Approvals | /api/approvals/requests |
The Administration section is visible when no auth is configured (single-user mode) or the user has admin privileges.
Enterprise Onboarding Flow
The Admin panel includes a guided onboarding that helps set up authentication, organizations, teams, quotas, and credentials in order.
| Step | Feature |
|---|---|
| 1 | Authentication (OIDC/LDAP) |
| 2 | Organization |
| 3 | Teams |
| 4 | Quotas |
| 5 | Credentials |
Each step can be completed or skipped. State persists in browser localStorage.
SimpleTuner includes a multi-channel notification system for job status, approvals, quotas, and system events.
| Channel | Use Case |
|---|---|
| Email | Approval requests, job completion (SMTP/IMAP) |
| Webhook | CI/CD integration (JSON + HMAC signatures) |
| Slack | Team notifications (incoming webhooks) |
Configure via Administration → Notifications or API.
Event Types
| Category | Events |
|---|---|
| Approval | approval.required, approval.granted, approval.rejected, approval.expired |
| Job | job.submitted, job.started, job.completed, job.failed, job.cancelled |
| Quota | quota.warning, quota.exceeded, cost.warning, cost.exceeded |
| System | system.provider_error, system.provider_degraded, system.webhook_failure |
| Auth | auth.login_failure, auth.new_device |
Channel Configuration Examples
Email:
```bash
curl -X POST http://localhost:8080/api/cloud/notifications/channels \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "channel_type": "email",
    "name": "Primary Email",
    "smtp_host": "smtp.gmail.com",
    "smtp_port": 587,
    "smtp_use_tls": true
  }'
```

Slack:

```bash
curl -X POST http://localhost:8080/api/cloud/notifications/channels \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "channel_type": "slack",
    "name": "Training Alerts",
    "webhook_url": "https://hooks.slack.com/services/T00/B00/xxxx"
  }'
```

Webhook: Payloads are signed with HMAC-SHA256 (`X-SimpleTuner-Signature` header).
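A webhook receiver would typically verify the signature before trusting the payload. The sketch below assumes the header carries the hex-encoded HMAC-SHA256 digest of the raw request body; that encoding is an assumption, so check a real delivery before relying on it. The function names are hypothetical.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Compare in constant time to avoid leaking digest prefixes."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, header_value.strip())
```

Verify against the raw bytes as received, since re-serializing the parsed JSON can change whitespace or key order and invalidate the digest.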
Resource rules provide fine-grained access control for configs, hardware types, and output paths using glob patterns.
| Type | Example Pattern |
|---|---|
| `config` | `team-x-*`, `production-*` |
| `hardware` | `gpu-a100*`, `*-80gb` |
| `provider` | `replicate`, `runpod` |
Rules use allow/deny actions with "most permissive wins" logic. Configure via Administration → Rules.
Rule Examples
Team Isolation: Researchers can only use configs starting with "team-x-"
Level: researcher
Rules:
- config: "team-x-*" → allow
- config: "*" → deny
Hardware Restrictions: Researchers limited to T4/V100, leads can use any hardware
Level: researcher → hardware: "gpu-t4*" allow, "gpu-v100*" allow
Level: lead → hardware: "*" allow
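The "most permissive wins" evaluation in these examples can be sketched with glob matching. The rule tuple layout and function name are hypothetical, and denying access when no rule matches at all is an assumption inferred from the hardware example.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Tuple

ResourceRule = Tuple[str, str, str]  # (resource type, glob pattern, "allow" or "deny")


def is_allowed(rules: Iterable[ResourceRule], resource_type: str, value: str) -> bool:
    """Allow if any matching rule says allow, even when a deny rule also matches."""
    actions = [
        action
        for rtype, pattern, action in rules
        if rtype == resource_type and fnmatchcase(value, pattern)
    ]
    # Assumption: with no matching rule, access is denied.
    return "allow" in actions
```

In the team isolation example, `team-x-lora` matches both the allow and the catch-all deny rule and is permitted, while any other config name only matches the deny rule.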
Full Permission Matrix
| Permission | Viewer | Researcher | Lead | Admin |
|---|---|---|---|---|
| `job.submit` | | ✓ | ✓ | ✓ |
| `job.view.own` | ✓ | ✓ | ✓ | ✓ |
| `job.view.all` | | | ✓ | ✓ |
| `job.cancel.own` | | ✓ | ✓ | ✓ |
| `job.cancel.all` | | | | ✓ |
| `job.priority.high` | | | ✓ | ✓ |
| `job.bypass.queue` | | | | ✓ |
| `job.bypass.approval` | | | | ✓ |

| Permission | Viewer | Researcher | Lead | Admin |
|---|---|---|---|---|
| `config.view` | ✓ | ✓ | ✓ | ✓ |
| `config.create` | | ✓ | ✓ | ✓ |
| `config.edit.own` | | ✓ | ✓ | ✓ |
| `config.edit.all` | | | | ✓ |

| Permission | Viewer | Researcher | Lead | Admin |
|---|---|---|---|---|
| `admin.users` | | | | ✓ |
| `admin.approve` | | | ✓ | ✓ |
| `admin.audit` | | | ✓ | ✓ |
| `admin.config` | | | | ✓ |
| `queue.approve` | | | ✓ | ✓ |
| `queue.manage` | | | | ✓ |

| Permission | Viewer | Researcher | Lead | Admin |
|---|---|---|---|---|
| `org.view` | | | ✓ | ✓ |
| `org.create` | | | | ✓ |
| `team.view` | | | ✓ | ✓ |
| `team.create` | | | ✓ | ✓ |
| `team.manage.members` | | | ✓ | ✓ |
Permission Overrides: Individual users can have permissions granted or denied via Administration → Users → Permission Overrides.
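Conceptually, an override adds or removes a single permission on top of the role's defaults. The helper below is a hypothetical sketch of that merge, with grants and denials expressed as the same `granted` boolean used by the permissions API example earlier.

```python
from typing import Dict, Iterable, Set


def effective_permissions(role_defaults: Iterable[str],
                          overrides: Dict[str, bool]) -> Set[str]:
    """Start from the role's permissions, then apply per-user grants and denials."""
    permissions = set(role_defaults)
    for name, granted in overrides.items():
        if granted:
            permissions.add(name)
        else:
            permissions.discard(name)
    return permissions
```

For example, a researcher whose defaults include `job.submit` and `job.view.own` ends up with `admin.approve` as well after an override of `{"admin.approve": True}`.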