Documentation class: live truth. Default recovery path: README -> PROJECT_STATE_AUDIT -> LATEST_EVIDENCE_INDEX. Most authoritative source for the current state in this scope: this file. Updated: 2026-04-07
Self-hosted AI workspace for teams
AI-Stack is a self-hosted AI workspace built for private LLM deployment. It combines a team-facing web chat experience and admin/runtime controls on top, with an OpenAI-compatible runtime governance layer underneath.
It is designed for teams that want more than "a model that runs":
- a usable internal AI entrypoint
- centralized model access
- tenant-aware policy and routing
- traceable and observable runtime behavior
- safer benchmark, regression, and rollout workflows
The public-facing project name is AI-Stack.
Some repository paths and historical runtime identifiers still use ai-stack, which is normal for the current codebase.
- web chat for internal AI usage
- a product surface suitable for private deployment
- a foundation for multi-model conversations and controlled rollout
- file-grounded chat, browser-grounded chat, and direct vision entrypoints on the same /chat surface
- tenant-aware runtime policy
- rate, inflight, and timeout controls
- static and weighted gray routing
- runtime health, readiness, and metrics endpoints
- trace output for request-level diagnosis
- deployment publication, route binding, quarantine, rebound, and catalog-backed routing
- OpenAI-compatible chat serving path
- OpenAI-compatible embeddings path
- vLLM as the primary serving baseline
- SGLang as a secondary comparison and fast-lane path
- benchmark, repeated-run comparison, and regression workflows
- K8s smoke, observability export, rollback, and signoff workflows
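The static and weighted gray routing mentioned above can be pictured as weighted lane selection. This is a minimal Python sketch, not the actual gateway-go implementation; the lane names and weights are hypothetical:

```python
import random

def pick_lane(lanes, rng):
    """Weighted gray routing sketch: pick a serving lane proportionally to its weight.

    `lanes` maps lane name -> integer weight; names below are hypothetical.
    """
    total = sum(lanes.values())
    roll = rng.uniform(0, total)
    acc = 0.0
    for name, weight in lanes.items():
        acc += weight
        if roll < acc:
            return name
    return name  # numeric edge case: fall back to the last lane

# Example: a 90/10 split between the stable route and a gray candidate.
lanes = {"stable": 90, "canary": 10}
rng = random.Random(42)  # seeded so the distribution is reproducible
counts = {"stable": 0, "canary": 0}
for _ in range(1000):
    counts[pick_lane(lanes, rng)] += 1
print(counts)
```

Rolling back is then just setting the canary weight to zero, which is the shape of the "gray routing + rollback" story the gateway layer provides.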
Many teams can already launch a model server.
The harder part starts after that:
- turning model access into a usable internal product
- controlling runtime behavior across teams or tenants
- diagnosing failures and stream issues
- comparing backend performance with real reports
- validating rollouts before they hit real users
- keeping deployment and release evidence organized
AI-Stack focuses on that layer.
AI-Stack is built for:
- teams deploying internal AI assistants in private environments
- engineers building self-hosted LLM platforms
- organizations that need governance, observability, and safer rollout workflows
- teams that want a workspace on top and a controllable runtime underneath
- OpenAI-compatible gateway
- Tenant-aware policy enforcement
- Gray routing and rollback
- SSE passthrough for chat streaming
- Trace JSON write-out for request diagnosis
- Prometheus-friendly metrics
- vLLM primary baseline
- SGLang comparison and fast-lane path
- Benchmark / regression / signoff workflows
- Kubernetes smoke and observability export
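The SSE passthrough above carries OpenAI-style `data:` lines end to end. A minimal consumer-side sketch (illustrative, not gateway code) looks like this:

```python
import json

def parse_sse_chunks(lines):
    """Parse OpenAI-style SSE lines from a streamed chat-completions response.

    Yields decoded JSON chunks and stops at the `[DONE]` sentinel.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Example stream as a gateway would pass it through (content abbreviated).
stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse_chunks(stream))
print(text)  # -> Hello
```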
```mermaid
flowchart TB
  subgraph Product["Team-facing product layer"]
    Chat["Web Chat Workspace"]
    Admin["Admin Console / Runtime UI"]
  end
  subgraph Governance["OpenAI-compatible runtime governance layer"]
    Gateway["gateway-go\npolicy / routing / SSE / traces / metrics"]
    CP["control-plane + deployment-controller\nidentity / placement / publish"]
    HA["host-agent\nruntime lifecycle"]
  end
  subgraph Serving["Serving backend layer"]
    VLLM["vLLM\nprimary baseline"]
    SGLang["SGLang\nsecondary compare / fast lane"]
    Llama["llama.cpp\nintegrated runtime path"]
  end
  subgraph Ops["Validation, delivery, and operations"]
    Bench["benchmark / aggregates / regression"]
    K8s["K8s smoke / observability / rollback / signoff"]
  end
  Product --> Gateway
  Gateway --> CP
  CP --> HA
  HA --> VLLM
  HA --> SGLang
  HA --> Llama
  Gateway --> Bench
  CP --> Bench
  HA --> Bench
  Bench --> K8s
```
The current /chat surface is designed as a governed workspace, not just a model picker.
It separates evidence collection from answer generation so teams can keep one chat entrypoint while still routing requests through different grounded paths.
```mermaid
flowchart LR
  User["User prompt / attachment"] --> Mode["Evidence mode\nAuto / Chat / File / Browser / Vision Direct"]
  Mode -->|Auto| Intent["Intent resolver"]
  Mode -->|Explicit File| FileChain["File evidence chain"]
  Mode -->|Explicit Browser| BrowserChain["Browser evidence chain"]
  Mode -->|Explicit Vision| VisionDirect["Vision direct path"]
  Intent --> ChatPath["Plain chat path"]
  Intent --> FileChain
  Intent --> BrowserChain
  Intent --> Utility["Live utility guardrail"]
  ChatPath --> Answer["Answer model\nMain / Fast / Vision"]
  FileChain --> Answer
  BrowserChain --> Answer
  VisionDirect --> Answer
  Utility --> Answer
  Answer --> UX["Citations / evidence rail / session context"]
```
Current workspace design principles:
| Design axis | Current behavior | Why it matters |
|---|---|---|
| Unified entrypoint | plain chat, file-grounded chat, browser-grounded chat, and Vision Direct live on the same /chat surface | teams do not need separate products for each interaction mode |
| Evidence modes | Auto, Chat, File, Browser, and Vision Direct are live now; Mixed is visible as the next mode but not wired end-to-end yet | the UI exposes real capability boundaries instead of hiding them |
| Governed routing | Auto resolves intent first; explicit web-search directives force Browser mode, and current date/time prompts can resolve to utility_live | the workspace can escalate into grounded flows without relying only on prompt wording |
| Answer-model split | Main, Fast, or Vision can be chosen independently from the evidence chain | teams can trade off latency, quality, and modality without changing the product flow |
| Grounded UX | citations, evidence rail metadata, source chips, and attachment status are part of the page model | grounded answers stay inspectable instead of feeling like a black box |
| Operator defaults | same-origin console proxy, concise default answers, and session-local context are already part of the current experience | the workspace behaves like an internal product surface, not just a raw API tester |
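The governed-routing row above can be sketched as a small mode resolver. This is illustrative only: the keyword checks stand in for the real intent resolver, and only the mode names (`utility_live`, etc.) come from the table:

```python
def resolve_mode(prompt, explicit_mode="Auto"):
    """Illustrative sketch of governed routing: explicit modes win;
    in Auto, directives and freshness cues escalate into grounded flows.

    The substring checks below are stand-ins for the real intent resolver.
    """
    if explicit_mode != "Auto":
        return explicit_mode.lower()          # File / Browser / Vision Direct
    lowered = prompt.lower()
    if lowered.startswith("/web") or "search the web" in lowered:
        return "browser"                      # explicit directive forces Browser mode
    if "current time" in lowered or "today's date" in lowered:
        return "utility_live"                 # live utility guardrail
    return "chat"                             # plain chat path

print(resolve_mode("search the web for vLLM releases"))   # -> browser
print(resolve_mode("what is the current time?"))          # -> utility_live
print(resolve_mode("summarize this design"))              # -> chat
```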
The current admin/runtime console is already split into concrete operational surfaces:
| Surface | Current purpose |
|---|---|
| Overview | project health, route mix, recent releases, and operator shortcuts |
| Models & Routing | model inventory, route selection, candidate lanes, gray rollout, and publish/rollback workflows |
| Deployments | host view, placement decisions, deployment health, and release state |
| Observability | events, trace-oriented investigation, logs, and benchmark context |
| Access & Credentials | service accounts, API keys, and OAuth client management |
| Cost & Policy | budget posture, policy triggers, tenant or project cost views, and current enforcement posture |
| Settings | workspace defaults, shortcuts, and console-level preferences |
AI-Stack now reaches beyond "serve a model" into a broader team and platform capability layer:
| Capability area | Current business-facing outcome |
|---|---|
| Internal AI workspace | a single team-facing chat entrypoint for everyday internal AI usage |
| File-grounded Q&A | upload, process, retrieve, and answer with citations on top of a file evidence chain |
| Browser-grounded research | search, read, rerank, and answer with web citations for freshness-seeking prompts |
| Vision interaction | explicit Vision Direct path for image-aware requests on the same workspace surface |
| Governed model access | centralized main, fast, vision, and embeddings access behind API keys and aliases |
| Runtime governance | tenant-aware policy, routing, rate controls, timeout controls, and gray rollout |
| Deployment operations | placement, publish, quarantine, rebound, route-binding correction, and catalog-backed routing |
| Access management | service-account, API-key, and OAuth-client control for team or project use |
| Runtime observability | health, readiness, metrics, events, traces, and evidence-oriented debugging views |
| External integration | OpenAI-compatible APIs for other internal tools and apps |
| Release validation | benchmark, repeated-run comparison, regression, kind smoke, rollback, and signoff workflows |
```bash
# Dev entrypoint (the Go toolchain path is environment-specific)
GO_BIN=/home/fiscan/.local/tools/go/bin/go bash ai-stack/scripts/dev.sh

# Local steady FP8 serving stack
cd ai-stack
bash scripts/steady_fp8_stack.sh start

# Kubernetes baseline: install, then run the smoke workflows
bash infra/k8s/scripts/install.sh
bash infra/k8s/scripts/smoke_k8s.sh
bash infra/k8s/scripts/kind_ci_smoke.sh
```

| Area | Current status |
|---|---|
| Team-facing chat workspace | Available |
| Admin console / runtime UI | Available |
| Governed Auto routing with answer-model selection | Available |
| OpenAI-compatible chat completions | Available |
| SSE streaming path | Available |
| Models and routing console | Available |
| Deployment and release console | Available |
| Access and credential management | Available |
| Cost and policy console | Available |
| Observability and events console | Available |
| Tenant-aware policy enforcement | Available |
| Gray routing / rollback | Available |
| Tracing and metrics | Available |
| vLLM primary serving baseline | Available |
| SGLang comparison / fast-lane path | Available |
| File-grounded chat | Available as a development baseline |
| Browser-grounded chat | Available as a development baseline |
| Direct vision path | Available |
| Embeddings runtime support | Available as a development baseline |
| Benchmark / repeated-run reports | Available |
| K8s smoke / observability / rollback workflows | Available |
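The tracing row above refers to per-request trace JSON write-out. A record of that kind might look like the following; every field name here is an assumption for illustration, not the actual gateway-go schema:

```python
import json

# Illustrative per-request trace record for stream diagnosis.
# All field names below are assumptions, not the gateway-go schema.
trace = {
    "request_id": "req-0001",        # hypothetical id
    "tenant": "team-a",              # hypothetical tenant
    "route": {"alias": "chat-default", "lane": "main", "backend": "vllm"},
    "timing_ms": {"queue": 3, "first_token": 180, "total": 2450},
    "stream": {"chunks": 42, "finished": True},
    "status": 200,
}
line = json.dumps(trace)             # one JSON object per request
parsed = json.loads(line)
print(parsed["route"]["lane"])       # -> main
```

The point is the shape: one self-contained JSON object per request, carrying routing, timing, and stream outcome, so a failed or stalled stream can be diagnosed from the record alone.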
| Problem | Typical setup | AI-Stack |
|---|---|---|
| Internal AI entrypoint | Separate demo UI or raw API | Workspace-oriented product surface |
| Runtime governance | Ad hoc scripts | Gateway-based policy and routing |
| Multi-tenant control | Usually missing | Built into runtime story |
| Streaming diagnosis | Hard to inspect | Trace + metrics friendly |
| Backend comparison | Manual experiments | Benchmark / aggregate / regression workflows |
| Safer rollout | One-shot switch | Gray routing + rollback path |
| K8s verification | Hand-built scripts | Smoke and observability workflows included |
The current local serving baseline is the steady FP8 workstation stack.
Current verified ports:
| Component | Port |
|---|---|
| control-plane | 8030 |
| gateway | 8020 |
| console | 4173 |
| qdrant | 6333 |
| main | 8100 |
| vision | 8101 |
| fast | 8102 |
| embed | 8103 |
| rerank | 8104 |
| degrade | 8105 |
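A quick way to use the port table is to generate probe URLs for each component. The `/healthz` path below is an assumption (the stack exposes health and readiness endpoints, but the exact paths are not listed here); substitute the real probe paths:

```python
# Verified ports from the table above; /healthz is an assumed probe path.
PORTS = {
    "control-plane": 8030,
    "gateway": 8020,
    "console": 4173,
    "qdrant": 6333,
    "main": 8100,
    "vision": 8101,
    "fast": 8102,
    "embed": 8103,
    "rerank": 8104,
    "degrade": 8105,
}

def probe_urls(host="127.0.0.1"):
    """Build one health-probe URL per component on the local baseline host."""
    return {name: f"http://{host}:{port}/healthz" for name, port in PORTS.items()}

urls = probe_urls()
print(urls["gateway"])  # -> http://127.0.0.1:8020/healthz
```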
Current ready public baseline:
`main`, `fast`, `vision`, `embed`, `rerank`, `qdrant`
Current non-default / non-ready public lane:
`degrade`
The current Kubernetes baseline is a minimal multi-node kind cluster:
- 1 control-plane node
- 2 worker nodes
- active components: `control-plane`, `deployment-controller`, `gateway`, `prometheus`, `grafana`, and `vllm`
- passing `smoke_k8s.sh`
This is a verified development baseline, not a production-grade HA cluster story.
Current verified public-facing APIs:
- `GET /v1/models`
- `POST /v1/chat/completions`
- `POST /v1/embeddings`
- `GET /v1/files`
- `POST /v1/files`
Current note:
- the gateway code exposes `/v1/responses`, but the current verified running instance still returns `404` there, so it should not be treated as part of today's promised external contract
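Since the chat path is OpenAI-compatible, any standard client works. A minimal stdlib sketch for building a request against the verified `/v1/chat/completions` endpoint (the gateway host/port and API key below are assumptions):

```python
import json
import urllib.request

def chat_request(base_url, api_key, model, prompt, stream=False):
    """Build an OpenAI-compatible chat-completions request for the gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

# Assumed local gateway address (port 8020) and a placeholder key.
req = chat_request("http://127.0.0.1:8020", "sk-example", "chat-default", "hello")
print(req.full_url)
# Sending it is one line once the stack is up:
#   resp = urllib.request.urlopen(req).read()
```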
Current verified aliases:
`chat-default`, `qwen-local-main`, `qwen-local-fast`, `chat-vision`, `chat-file`, `chat-browser`, `retrieval-embed-dev`
Important boundary:
- `chat-default` and `qwen-local-main` currently route to the same main lane
- `qwen-local-fast` and `chat-vision` are distinct backends
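The boundary can be expressed as an alias-to-lane map. The `chat-file` and `chat-browser` lane assignments below are assumptions (the source only verifies the `chat-default`/`qwen-local-main` overlap and the distinctness of `qwen-local-fast` and `chat-vision`):

```python
# Verified alias -> lane mapping; chat-file / chat-browser lanes are assumed.
ALIAS_LANES = {
    "chat-default": "main",
    "qwen-local-main": "main",     # verified: same main lane as chat-default
    "qwen-local-fast": "fast",     # verified: distinct backend
    "chat-vision": "vision",       # verified: distinct backend
    "chat-file": "main",           # assumption
    "chat-browser": "main",        # assumption
    "retrieval-embed-dev": "embed",
}

assert ALIAS_LANES["chat-default"] == ALIAS_LANES["qwen-local-main"]
assert ALIAS_LANES["qwen-local-fast"] != ALIAS_LANES["chat-vision"]
```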
- `contracts/`: source-of-truth contracts
- `infra/gateway-go/`: governance gateway
- `infra/mock-openai/`: local smoke and CI backend
- `infra/shared-gateway/`: stable shared-host edge pattern
- `infra/observability/`: Prometheus, Grafana, runbooks, chaos assets
- `infra/k8s/`: deploy, smoke, rollback, and observability workflows
- `services/control-plane/`: identity, placement, deployment, catalog, retrieval, workflow resources
- `services/host-agent/`: runtime lifecycle and engine adapters
- `serving/vllm/`: primary serving baseline
- `serving/sglang/`: secondary engine comparison path
- `web/console/`: team-facing workspace and runtime UI
- `bench/`: reports, baselines, aggregates, trends
- `scripts/`: dev, smoke, release, and maintenance scripts
- `docs/`: architecture, boundaries, evidence, and state-audit docs
AI-Stack currently focuses on a chat-first workspace and runtime governance baseline.
Current mainline scope includes:
- gateway-based chat serving
- governance and routing
- backend serving baselines
- benchmark and regression workflows
- observability and deployment validation
- file-grounded chat, browser-grounded chat, and direct vision entrypoints
Current boundaries are explicit:
- retrieval and embeddings are real, but still a development-baseline capability plane rather than a production retrieval platform
- multimodal is real, but not yet a production VLM/OCR platform
- workflow, batch/async, assistant-v2, and fit-v2 are active starter planes rather than finished product surfaces
- the Python orchestrator path is currently parked legacy
- the K8s story is a real baseline, but not yet a production-grade platform
This keeps the project honest, inspectable, and easier to adopt.
AI-Stack is a strong fit if you want to:
- build an internal AI workspace on top of self-hosted models
- standardize team access behind an OpenAI-compatible gateway
- add governance, traces, and metrics to an existing serving stack
- compare serving backends with repeatable benchmark evidence
- validate K8s delivery paths with smoke and signoff workflows
Natural next steps from the current baseline:
- deeper workspace polish
- richer admin controls
- stronger product packaging for private deployment
- broader runtime integrations
- tighter release and evidence automation
Recommended entry points:
- `README.md`
- `docs/PROJECT_STATE_AUDIT.md`
- `docs/LATEST_EVIDENCE_INDEX.md`
- `docs/TECHNICAL_OVERVIEW.md`
- `docs/PLATFORM_BOUNDARIES.md`
- `docs/PROJECT_PITCH.md`
- `docs/AI_STACK_TECH_PRODUCT_GUIDE_CN.md`
- `infra/gateway-go/README.md`
- `serving/vllm/README.md`
- `infra/k8s/README.md`
This repository currently uses the Apache-2.0 license.
Issues, discussions, and architecture feedback are welcome.
If you are working on:
- self-hosted LLM deployment
- runtime governance
- AI workspace UX
- observability and rollout safety
AI-Stack is built for exactly that layer.


