Skip to content

Latest commit

 

History

History
734 lines (534 loc) · 22.6 KB

File metadata and controls

734 lines (534 loc) · 22.6 KB

Enterprise Guide

This document covers deploying SimpleTuner in multi-user environments with authentication, approval workflows, and quota management.

1. Deployment & Infrastructure

Configuration Methods

Most enterprise features can be configured via the Web UI (Administration panel) or REST API. A few infrastructure-level settings require a config file or environment variables.

Feature Web UI API Config File
OIDC/LDAP providers
Users & roles
Approval rules
Quotas
Notifications
Network bypass (trusted proxies)
Background job polling
TLS settings

Config file (simpletuner-enterprise.yaml or .json) is only needed for infrastructure settings that must be known at startup. SimpleTuner searches these locations:

  1. $SIMPLETUNER_ENTERPRISE_CONFIG (environment variable)
  2. ./simpletuner-enterprise.yaml (current directory)
  3. ~/.config/simpletuner/enterprise.yaml
  4. /etc/simpletuner/enterprise.yaml

The file supports environment variable interpolation with ${VAR} syntax.

Quick Start Checklist

  1. Start SimpleTuner: simpletuner server (or --webui for local use)
  2. Configure via UI: Navigate to Administration panel to set up users, SSO, quotas
  3. Health Checks (for production):
    • Liveness: GET /api/cloud/health/live (200 OK)
    • Readiness: GET /api/cloud/health/ready (200 OK)
    • Deep Check: GET /api/cloud/health (includes provider connectivity)

Network Security & Authentication Bypass

Configuring trusted proxies and internal network bypass (config file required)

In corporate environments (VPNs, private VPCs), you may want to trust internal traffic or offload authentication to a gateway.

simpletuner-enterprise.yaml:

network:
  # Trust headers from your load balancer (e.g., AWS ALB, Nginx)
  trust_proxy_headers: true
  trusted_proxies:
    - "10.0.0.0/8"
    - "192.168.0.0/16"

  # Optional: Trust specific internal subnets to bypass login
  bypass_auth_for_internal: true
  internal_networks:
    - "10.10.0.0/16"  # VPN Clients

auth:
  # Always allow health checks without auth
  bypass_paths:
    - "/health"
    - "/api/cloud/health"
    - "/api/cloud/metrics/prometheus"

Load Balancer & TLS Configuration

SimpleTuner expects an upstream reverse proxy for TLS termination.

nginx reverse proxy example
server {
    listen 443 ssl http2;
    server_name trainer.internal;

    ssl_certificate /etc/ssl/certs/simpletuner.crt;
    ssl_certificate_key /etc/ssl/private/simpletuner.key;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support for real-time logs/SSE
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400;
    }
}

Observability (Prometheus & Logging)

Metrics: Scrape GET /api/cloud/metrics/prometheus for operational insights.

  • simpletuner_jobs_active: Current queue depth.
  • simpletuner_cost_total_usd: Spend tracking.
  • simpletuner_uptime_seconds: Availability.

Logging: Set SIMPLETUNER_LOG_FORMAT=json for ingestion into Splunk/Datadog/ELK.

Data Retention Configuration

Configure retention periods for compliance requirements via environment variables:

Variable Default Description
SIMPLETUNER_JOB_RETENTION_DAYS 90 Days to retain completed job records
SIMPLETUNER_AUDIT_RETENTION_DAYS 90 Days to retain audit log entries
# SOC 2 / HIPAA: 1 year retention
export SIMPLETUNER_JOB_RETENTION_DAYS=365
export SIMPLETUNER_AUDIT_RETENTION_DAYS=365

# Disable automatic cleanup (manual management)
export SIMPLETUNER_JOB_RETENTION_DAYS=0

Setting to 0 disables automatic cleanup. Cleanup runs daily.


2. Identity & Access Management (SSO)

SimpleTuner supports OIDC (OpenID Connect) and LDAP for SSO with Okta, Azure AD, Keycloak, and Active Directory.

Configuring Providers

Via Web UI: Navigate to Administration → Auth to add and configure providers.

Via API: See the API Cookbook for curl examples.

Via config file (for IaC/GitOps workflows)

Add to your simpletuner-enterprise.yaml:

oidc:
  enabled: true
  provider: "okta"  # or "azure_ad", "google"

  client_id: "0oa1234567890abcdef"
  client_secret: "${OIDC_CLIENT_SECRET}"
  issuer_url: "https://your-org.okta.com/oauth2/default"

  scopes: ["openid", "email", "profile", "groups"]

  # Map Identity Provider groups to SimpleTuner Roles
  role_mapping:
    claim: "groups"
    admin_groups: ["ML-Platform-Admins"]
    user_groups: ["ML-Researchers"]
Cross-Worker OAuth State Validation

When using OIDC authentication in multi-worker deployments (e.g., behind a load balancer with multiple Gunicorn workers), OAuth state validation must work across all workers. SimpleTuner handles this automatically by storing OAuth state in the database.

How it works:

  1. State Generation: When a user initiates OIDC login, a cryptographically random state token is generated and stored in the database with the provider name, redirect URI, and a 10-minute expiration.

  2. State Validation: When the callback arrives (potentially to a different worker), the state is looked up and atomically consumed (single-use).

  3. Cleanup: Expired states are automatically purged during normal operations.

No additional configuration is needed. OAuth state storage uses the same database as jobs and users.

Troubleshooting "Invalid OAuth state" errors:

  1. Check if the callback arrived within 10 minutes of login initiation
  2. Verify all workers share the same database path
  3. Check database write permissions
  4. Look for "Failed to store OAuth state" errors in logs

User Management & Roles

SimpleTuner uses a hierarchical role system. Users can be managed via GET/POST /api/users.

Role Priority Description
Viewer 10 Read-only access to job history and logs.
Researcher 20 Standard access. Can submit jobs and manage their own API keys.
Lead 30 Can approve pending jobs and view team resource usage.
Admin 100 Full system access, including user management and rule configuration.

3. Governance & Operations

Approval Workflows

Control costs and resource usage by requiring approvals for specific criteria. Rules are evaluated at submission time.

Workflow:

  1. User submits job -> Status becomes pending_approval.
  2. Leads check GET /api/approvals/requests.
  3. Lead calls POST /.../approve or reject.
  4. Job automatically proceeds to queue or is cancelled.
Approval Rules Engine

The rules engine evaluates job submissions against configured rules. Rules are processed in priority order; the first matching rule triggers the approval requirement.

Available Rule Conditions:

Condition Description
cost_exceeds Triggers when estimated cost exceeds threshold (USD)
hardware_type Matches hardware type (glob pattern, e.g., a100*)
daily_jobs_exceed Triggers when user's daily job count exceeds threshold
first_job Triggers for a user's very first job
config_pattern Matches config name patterns
provider Matches specific provider name

Example: Require approval for jobs over $50:

curl -X POST http://localhost:8080/api/approvals/rules \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{
    "name": "high_cost",
    "condition": "cost_exceeds",
    "threshold": "50",
    "required_approver_level": "lead",
    "exempt_levels": ["admin"]
  }'

Rules can specify exempt_levels to allow certain users to bypass approval, and applies_to_provider/applies_to_level to scope rules.

Email-Based Approval (IMAP Workflow)

For teams that prefer email-based workflows, SimpleTuner supports approval via email replies using IMAP IDLE.

How It Works:

  1. Job submission triggers approval requirement
  2. Notification email sent to approvers with unique response token
  3. IMAP handler monitors inbox using IDLE (push notifications)
  4. Approver replies with "approve" or "reject" (or aliases like yes, lgtm, +1)
  5. System parses response and processes approval

Configure via Administration → Notifications or API. Response tokens expire after 24 hours and are single-use.

Job Queue & Concurrency

The scheduler manages fair usage of resources. See its dedicated documentation for details.

  • Priority: Admins > Leads > Researchers > Viewers.
  • Concurrency: Limits are enforced globally and per-user.
    • Update limits via UI: Cloud tab → Job Queue panel (admin only)
    • Update limits via API: POST /api/queue/concurrency with {"max_concurrent": 10, "user_max_concurrent": 3}

Job Status Polling (No Webhooks Required)

For secure environments where public webhooks are impossible, SimpleTuner includes a background poller.

Add to simpletuner-enterprise.yaml:

background:
  job_status_polling:
    enabled: true
    interval_seconds: 30

This service queries the provider API every 30s and updates the local database, emitting real-time events to the UI via SSE.

API Key Rotation

Securely manage cloud provider credentials. See API Cookbook for rotation scripts and provider-specific details in the Cloud Training documentation.


4. API Cookbook

OIDC/LDAP Configuration Examples

Keycloak (OIDC):

curl -X POST http://localhost:8080/api/cloud/external-auth/providers \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{
    "name": "keycloak",
    "provider_type": "oidc",
    "enabled": true,
    "config": {
      "issuer": "https://keycloak.example.com/realms/ml-training",
      "client_id": "simpletuner",
      "client_secret": "your-client-secret",
      "scopes": ["openid", "email", "profile", "roles"],
      "roles_claim": "realm_access.roles"
    }
  }'

LDAP / Active Directory:

curl -X POST http://localhost:8080/api/cloud/external-auth/providers \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{
    "name": "corporate-ad",
    "provider_type": "ldap",
    "enabled": true,
    "level_mapping": {
      "CN=ML-Admins,OU=Groups,DC=corp,DC=com": ["admin"]
    },
    "config": {
      "server": "ldaps://ldap.corp.com:636",
      "base_dn": "DC=corp,DC=com",
      "bind_dn": "CN=svc-simpletuner,OU=Service Accounts,DC=corp,DC=com",
      "bind_password": "service-account-password",
      "user_search_filter": "(sAMAccountName={username})",
      "use_ssl": true
    }
  }'
User Administration Examples

Create a Researcher:

curl -X POST http://localhost:8080/api/users \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{
    "email": "researcher@company.com",
    "username": "jsmith",
    "password": "secure_password_123",
    "level_names": ["researcher"]
  }'

Grant Custom Permission:

curl -X POST http://localhost:8080/api/users/123/permissions \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"permission_name": "admin.approve", "granted": true}'
Credential Management

SimpleTuner includes credential lifecycle management for tracking, rotating, and auditing API credentials.

Credential Resolution: When submitting jobs, SimpleTuner checks per-user credentials first, then falls back to global credentials (environment variables).

Scenario Per-User Global Behavior
Shared org key All users share the org's API key
BYOK Each user provides their own key
Hybrid Some Users with keys use theirs, others use global

Rotation: Navigate to Admin > Auth → user → Manage CredentialsRotate. Stale credentials (>90 days) display a warning badge.

External Orchestration {#external-orchestration-airflow}

Airflow example
def submit_and_wait(job_config, provider="replicate", **context):
    resp = requests.post(
        f"http://localhost:8080/api/cloud/{provider}/submit",
        json=job_config,
        headers={"Authorization": f"Bearer {TOKEN}"}
    )
    job_id = resp.json()["job_id"]

    while True:
        status = requests.get(f"http://localhost:8080/api/cloud/jobs/{job_id}")
        state = status.json()["status"]
        if state in ("completed", "failed", "cancelled"):
            return status.json()
        time.sleep(30)

5. Troubleshooting

Health Check Failures

  • 503 Service Unavailable: Check database connectivity.
  • Degraded: Usually means an optional component (like a cloud provider API) is unreachable or unconfigured.

Authentication Issues

  • OIDC Redirect Loop: Verify issuer_url matches exactly what the provider expects (trailing slashes matter!).
  • Internal Auth Bypass: Check server logs for "Auth bypassed for IP..." to confirm your load balancer is passing the correct X-Real-IP.

Job Updates Stalled

  • If webhooks are blocked, ensure Job Status Polling is enabled in simpletuner-enterprise.yaml.
  • Check GET /api/cloud/metrics/prometheus for simpletuner_jobs_active to see if the internal state thinks jobs are running.

Missing Metrics

  • Ensure your Prometheus scraper is configured to hit /api/cloud/metrics/prometheus and not just /metrics.

6. Organizations & Team Quotas

SimpleTuner supports hierarchical organizations and teams with ceiling-based quota enforcement.

Hierarchy

Organization (quota ceiling)
    └── Team (quota ceiling, bounded by org)
         └── User (limit, bounded by team and org)

Ceiling Model

Quotas use a ceiling model where org limits are absolute ceilings:

  • Org quota: Absolute ceiling for all members
  • Team quota: Ceiling for team members (cannot exceed org)
  • User/Level quota: Specific limits (bounded by team and org)

Example:

  • Org ceiling: 100 concurrent jobs
  • Team ceiling: 20 concurrent jobs
  • User limit: 50 concurrent jobs → Effective: 20 (team ceiling applies)

Enforcement Rules:

  • Team quotas are validated at set-time: attempting to set a team quota higher than the org ceiling returns HTTP 400
  • User quotas are validated at runtime: effective limit is the minimum of user, team, and org ceilings
  • Reducing an org ceiling does not automatically reduce existing team ceilings (admin must update manually)
API Examples

Create Organization:

curl -X POST http://localhost:8080/api/orgs \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"name": "ML Research", "slug": "ml-research"}'

Set Org Quota Ceiling:

curl -X POST http://localhost:8080/api/orgs/1/quotas \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"quota_type": "concurrent_jobs", "limit_value": 100, "action": "block"}'

Create Team:

curl -X POST http://localhost:8080/api/orgs/1/teams \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"name": "NLP Team", "slug": "nlp"}'

Add User to Team:

curl -X POST http://localhost:8080/api/orgs/1/teams/1/members \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"user_id": 123, "role": "member"}'

Quota and Cost Limit Actions

When a quota or cost limit is reached, the configured action determines behavior:

Action Behavior
warn Job proceeds with warning in logs/UI
block Job submission rejected
notify Job proceeds, admins alerted
Cost Limit Configuration

Cost limits can be configured per-provider via Cloud tab → Settings or API:

curl -X PUT http://localhost:8080/api/cloud/providers/<provider>/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{
    "cost_limit_enabled": true,
    "cost_limit_amount": 500.00,
    "cost_limit_period": "monthly",
    "cost_limit_action": "warn"
  }'

Check status: GET /api/cloud/metrics/cost-limit-status


7. Limitations

Workflow / Pipeline Jobs (DAGs)

SimpleTuner does not support job dependencies or multi-step workflows where one job's output feeds into another. Each cloud job is independent.

Recommended approach: Use external orchestration tools like Airflow, Prefect, or Dagster to chain jobs via the REST API. See the Airflow example in the API Cookbook above.

Resuming Training Runs

There is no built-in support for resuming interrupted, failed, or cancelled training runs. Cloud jobs do not automatically recover from checkpoints.

Workarounds:

  • Configure frequent HuggingFace Hub pushes (--push_checkpoints_to_hub) to save intermediate state
  • Implement custom checkpoint management by downloading outputs and using them as starting points for new jobs
  • For mission-critical workloads, consider breaking long training runs into smaller segments
UI Feature Reference
Feature UI Location API
Organizations & Teams Administration → Orgs /api/orgs
Quotas Administration → Quotas /api/orgs/{id}/quotas
OIDC/LDAP Administration → Auth /api/cloud/external-auth/providers
Users Administration → Users /api/users
Audit Logs Sidebar → Audit Log /api/audit
Queue Cloud tab → Job Queue /api/queue/concurrency
Approvals Administration → Approvals /api/approvals/requests

The Administration section is visible when no auth is configured (single-user mode) or the user has admin privileges.

Enterprise Onboarding Flow

The Admin panel includes a guided onboarding that helps set up authentication, organizations, teams, quotas, and credentials in order.

Step Feature
1 Authentication (OIDC/LDAP)
2 Organization
3 Teams
4 Quotas
5 Credentials

Each step can be completed or skipped. State persists in browser localStorage.


8. Notification System

SimpleTuner includes a multi-channel notification system for job status, approvals, quotas, and system events.

Channel Use Case
Email Approval requests, job completion (SMTP/IMAP)
Webhook CI/CD integration (JSON + HMAC signatures)
Slack Team notifications (incoming webhooks)

Configure via Administration → Notifications or API.

Event Types
Category Events
Approval approval.required, approval.granted, approval.rejected, approval.expired
Job job.submitted, job.started, job.completed, job.failed, job.cancelled
Quota quota.warning, quota.exceeded, cost.warning, cost.exceeded
System system.provider_error, system.provider_degraded, system.webhook_failure
Auth auth.login_failure, auth.new_device
Channel Configuration Examples

Email:

curl -X POST http://localhost:8080/api/cloud/notifications/channels \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{
    "channel_type": "email",
    "name": "Primary Email",
    "smtp_host": "smtp.gmail.com",
    "smtp_port": 587,
    "smtp_use_tls": true
  }'

Slack:

curl -X POST http://localhost:8080/api/cloud/notifications/channels \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{
    "channel_type": "slack",
    "name": "Training Alerts",
    "webhook_url": "https://hooks.slack.com/services/T00/B00/xxxx"
  }'

Webhook: Payloads signed with HMAC-SHA256 (X-SimpleTuner-Signature header).


9. Resource Rules

Resource rules provide fine-grained access control for configs, hardware types, and output paths using glob patterns.

Type Example Pattern
config team-x-*, production-*
hardware gpu-a100*, *-80gb
provider replicate, runpod

Rules use allow/deny actions with "most permissive wins" logic. Configure via Administration → Rules.

Rule Examples

Team Isolation: Researchers can only use configs starting with "team-x-"

Level: researcher
Rules:
  - config: "team-x-*" → allow
  - config: "*" → deny

Hardware Restrictions: Researchers limited to T4/V100, leads can use any hardware

Level: researcher → hardware: "gpu-t4*" allow, "gpu-v100*" allow
Level: lead → hardware: "*" allow

10. Permission Matrix

Full Permission Matrix

Job Permissions

Permission Viewer Researcher Lead Admin
job.submit
job.view.own
job.view.all
job.cancel.own
job.cancel.all
job.priority.high
job.bypass.queue
job.bypass.approval

Config Permissions

Permission Viewer Researcher Lead Admin
config.view
config.create
config.edit.own
config.edit.all

Admin Permissions

Permission Viewer Researcher Lead Admin
admin.users
admin.approve
admin.audit
admin.config
queue.approve
queue.manage

Org/Team Permissions

Permission Viewer Researcher Lead Admin
org.view
org.create
team.view
team.create
team.manage.members

Permission Overrides: Individual users can have permissions granted or denied via Administration → Users → Permission Overrides.