Documentation class: live truth. Default recovery path: README -> PROJECT_STATE_AUDIT -> LATEST_EVIDENCE_INDEX. Most authoritative source for the current state in this scope: this file. Updated: 2026-04-07
Self-hosted AI workspace for teams
AI-Stack is a self-hosted AI workspace built for private LLM deployment. It combines a team-facing web chat experience and admin/runtime controls on top, with an OpenAI-compatible runtime governance layer underneath.
It is designed for teams that want more than "a model that runs":
- a usable internal AI entrypoint
- centralized model access
- tenant-aware policy and routing
- traceable and observable runtime behavior
- safer benchmark, regression, and rollout workflows
The public-facing project name is AI-Stack.
Some repository paths and historical runtime identifiers still use ai-stack, which is normal for the current codebase.
- web chat for internal AI usage
- a product surface suitable for private deployment
- a foundation for multi-model conversations and controlled rollout
- file-grounded chat, browser-grounded chat, and direct vision entrypoints on the same /chat surface
- tenant-aware runtime policy
- rate, inflight, and timeout controls
- static and weighted gray routing
- runtime health, readiness, and metrics endpoints
- trace output for request-level diagnosis
- deployment publication, route binding, quarantine, rebound, and catalog-backed routing
- OpenAI-compatible chat serving path
- OpenAI-compatible embeddings path
- vLLM as the primary serving baseline
- SGLang as a secondary comparison and fast-lane path
- benchmark, repeated-run comparison, and regression workflows
- K8s smoke, observability export, rollback, and signoff workflows
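The static and weighted gray routing mentioned above can be pictured as weighted lane selection. This is a minimal Python sketch, not the actual gateway-go implementation; the lane names and weights are hypothetical:

```python
import random

def pick_lane(lanes, rng):
    """Weighted gray routing sketch: pick a serving lane proportionally to its weight.

    `lanes` maps lane name -> integer weight; names below are hypothetical.
    """
    total = sum(lanes.values())
    roll = rng.uniform(0, total)
    acc = 0.0
    for name, weight in lanes.items():
        acc += weight
        if roll < acc:
            return name
    return name  # numeric edge case: fall back to the last lane

# Example: a 90/10 split between the stable route and a gray candidate.
lanes = {"stable": 90, "canary": 10}
rng = random.Random(42)  # seeded so the distribution is reproducible
counts = {"stable": 0, "canary": 0}
for _ in range(1000):
    counts[pick_lane(lanes, rng)] += 1
print(counts)
```

Rolling back is then just setting the canary weight to zero, which is the shape of the "gray routing + rollback" story the gateway layer provides.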
Many teams can already launch a model server.
The harder part starts after that:
- turning model access into a usable internal product
- controlling runtime behavior across teams or tenants
- diagnosing failures and stream issues
- comparing backend performance with real reports
- validating rollouts before they hit real users
- keeping deployment and release evidence organized
AI-Stack focuses on that layer.
AI-Stack is built for:
- teams deploying internal AI assistants in private environments
- engineers building self-hosted LLM platforms
- organizations that need governance, observability, and safer rollout workflows
- teams that want a workspace on top and a controllable runtime underneath
- OpenAI-compatible gateway
- Tenant-aware policy enforcement
- Gray routing and rollback
- SSE passthrough for chat streaming
- Trace JSON write-out for request diagnosis
- Prometheus-friendly metrics
- vLLM primary baseline
- SGLang comparison and fast-lane path
- Benchmark / regression / signoff workflows
- Kubernetes smoke and observability export
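The SSE passthrough above carries OpenAI-style `data:` lines end to end. A minimal consumer-side sketch (illustrative, not gateway code) looks like this:

```python
import json

def parse_sse_chunks(lines):
    """Parse OpenAI-style SSE lines from a streamed chat-completions response.

    Yields decoded JSON chunks and stops at the `[DONE]` sentinel.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Example stream as a gateway would pass it through (content abbreviated).
stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse_chunks(stream))
print(text)  # -> Hello
```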
```mermaid
flowchart TB
  subgraph Product["Team-facing product layer"]
    Chat["Web Chat Workspace"]
    Admin["Admin Console / Runtime UI"]
  end
  subgraph Governance["OpenAI-compatible runtime governance layer"]
    Gateway["gateway-go\npolicy / routing / SSE / traces / metrics"]
    CP["control-plane + deployment-controller\nidentity / placement / publish"]
    HA["host-agent\nruntime lifecycle"]
  end
  subgraph Serving["Serving backend layer"]
    VLLM["vLLM\nprimary baseline"]
    SGLang["SGLang\nsecondary compare / fast lane"]
    Llama["llama.cpp\nintegrated runtime path"]
  end
  subgraph Ops["Validation, delivery, and operations"]
    Bench["benchmark / aggregates / regression"]
    K8s["K8s smoke / observability / rollback / signoff"]
  end
  Product --> Gateway
  Gateway --> CP
  CP --> HA
  HA --> VLLM
  HA --> SGLang
  HA --> Llama
  Gateway --> Bench
  CP --> Bench
  HA --> Bench
  Bench --> K8s
```
The current /chat surface is designed as a governed workspace, not just a model picker.
It separates evidence collection from answer generation so teams can keep one chat entrypoint while still routing requests through different grounded paths.
```mermaid
flowchart LR
  User["User prompt / attachment"] --> Mode["Evidence mode\nAuto / Chat / File / Browser / Vision Direct"]
  Mode -->|Auto| Intent["Intent resolver"]
  Mode -->|Explicit File| FileChain["File evidence chain"]
  Mode -->|Explicit Browser| BrowserChain["Browser evidence chain"]
  Mode -->|Explicit Vision| VisionDirect["Vision direct path"]
  Intent --> ChatPath["Plain chat path"]
  Intent --> FileChain
  Intent --> BrowserChain
  Intent --> Utility["Live utility guardrail"]
  ChatPath --> Answer["Answer model\nMain / Fast / Vision"]
  FileChain --> Answer
  BrowserChain --> Answer
  VisionDirect --> Answer
  Utility --> Answer
  Answer --> UX["Citations / evidence rail / session context"]
```
Current workspace design principles:
| Design axis | Current behavior | Why it matters |
|---|---|---|
| Unified entrypoint | plain chat, file-grounded chat, browser-grounded chat, and Vision Direct live on the same /chat surface | teams do not need separate products for each interaction mode |
| Evidence modes | Auto, Chat, File, Browser, and Vision Direct are live now; Mixed is visible as the next mode but not wired end-to-end yet | the UI exposes real capability boundaries instead of hiding them |
| Governed routing | Auto resolves intent first; explicit web-search directives force Browser mode, and current date/time prompts can resolve to utility_live | the workspace can escalate into grounded flows without relying only on prompt wording |
| Answer-model split | Main, Fast, or Vision can be chosen independently from the evidence chain | teams can trade off latency, quality, and modality without changing the product flow |
| Grounded UX | citations, evidence rail metadata, source chips, and attachment status are part of the page model | grounded answers stay inspectable instead of feeling like a black box |
| Operator defaults | same-origin console proxy, concise default answers, and session-local context are already part of the current experience | the workspace behaves like an internal product surface, not just a raw API tester |
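The governed-routing row above can be sketched as a small mode resolver. This is illustrative only: the keyword checks stand in for the real intent resolver, and only the mode names (`utility_live`, etc.) come from the table:

```python
def resolve_mode(prompt, explicit_mode="Auto"):
    """Illustrative sketch of governed routing: explicit modes win;
    in Auto, directives and freshness cues escalate into grounded flows.

    The substring checks below are stand-ins for the real intent resolver.
    """
    if explicit_mode != "Auto":
        return explicit_mode.lower()          # File / Browser / Vision Direct
    lowered = prompt.lower()
    if lowered.startswith("/web") or "search the web" in lowered:
        return "browser"                      # explicit directive forces Browser mode
    if "current time" in lowered or "today's date" in lowered:
        return "utility_live"                 # live utility guardrail
    return "chat"                             # plain chat path

print(resolve_mode("search the web for vLLM releases"))   # -> browser
print(resolve_mode("what is the current time?"))          # -> utility_live
print(resolve_mode("summarize this design"))              # -> chat
```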
The current admin/runtime console is already split into concrete operational surfaces:
| Surface | Current purpose |
|---|---|
| Overview | project health, route mix, recent releases, and operator shortcuts |
| Models & Routing | model inventory, route selection, candidate lanes, gray rollout, and publish/rollback workflows |
| Deployments | host view, placement decisions, deployment health, and release state |
| Observability | events, trace-oriented investigation, logs, and benchmark context |
| Access & Credentials | service accounts, API keys, and OAuth client management |
| Cost & Policy | budget posture, policy triggers, tenant or project cost views, and current enforcement posture |
| Settings | workspace defaults, shortcuts, and console-level preferences |
AI-Stack now reaches beyond "serve a model" into a broader team and platform capability layer:
| Capability area | Current business-facing outcome |
|---|---|
| Internal AI workspace | a single team-facing chat entrypoint for everyday internal AI usage |
| File-grounded Q&A | upload, process, retrieve, and answer with citations on top of a file evidence chain |
| Browser-grounded research | search, read, rerank, and answer with web citations for freshness-seeking prompts |
| Vision interaction | explicit Vision Direct path for image-aware requests on the same workspace surface |
| Governed model access | centralized main, fast, vision, and embeddings access behind API keys and aliases |
| Runtime governance | tenant-aware policy, routing, rate controls, timeout controls, and gray rollout |
| Deployment operations | placement, publish, quarantine, rebound, route-binding correction, and catalog-backed routing |
| Access management | service-account, API-key, and OAuth-client control for team or project use |
| Runtime observability | health, readiness, metrics, events, traces, and evidence-oriented debugging views |
| External integration | OpenAI-compatible APIs for other internal tools and apps |
| Release validation | benchmark, repeated-run comparison, regression, kind smoke, rollback, and signoff workflows |
```bash
# Dev entrypoint (the Go toolchain path is environment-specific)
GO_BIN=/home/fiscan/.local/tools/go/bin/go bash ai-stack/scripts/dev.sh

# Local steady FP8 serving stack
cd ai-stack
bash scripts/steady_fp8_stack.sh start

# Kubernetes baseline: install, then run the smoke workflows
bash infra/k8s/scripts/install.sh
bash infra/k8s/scripts/smoke_k8s.sh
bash infra/k8s/scripts/kind_ci_smoke.sh
```

| Area | Current status |
|---|---|
| Team-facing chat workspace | Available |
| Admin console / runtime UI | Available |
| Governed Auto routing with answer-model selection | Available |
| OpenAI-compatible chat completions | Available |
| SSE streaming path | Available |
| Models and routing console | Available |
| Deployment and release console | Available |
| Access and credential management | Available |
| Cost and policy console | Available |
| Observability and events console | Available |
| Tenant-aware policy enforcement | Available |
| Gray routing / rollback | Available |
| Tracing and metrics | Available |
| vLLM primary serving baseline | Available |
| SGLang comparison / fast-lane path | Available |
| File-grounded chat | Available as a development baseline |
| Browser-grounded chat | Available as a development baseline |
| Direct vision path | Available |
| Embeddings runtime support | Available as a development baseline |
| Benchmark / repeated-run reports | Available |
| K8s smoke / observability / rollback workflows | Available |
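The tracing row above refers to per-request trace JSON write-out. A record of that kind might look like the following; every field name here is an assumption for illustration, not the actual gateway-go schema:

```python
import json

# Illustrative per-request trace record for stream diagnosis.
# All field names below are assumptions, not the gateway-go schema.
trace = {
    "request_id": "req-0001",        # hypothetical id
    "tenant": "team-a",              # hypothetical tenant
    "route": {"alias": "chat-default", "lane": "main", "backend": "vllm"},
    "timing_ms": {"queue": 3, "first_token": 180, "total": 2450},
    "stream": {"chunks": 42, "finished": True},
    "status": 200,
}
line = json.dumps(trace)             # one JSON object per request
parsed = json.loads(line)
print(parsed["route"]["lane"])       # -> main
```

The point is the shape: one self-contained JSON object per request, carrying routing, timing, and stream outcome, so a failed or stalled stream can be diagnosed from the record alone.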
| Problem | Typical setup | AI-Stack |
|---|---|---|
| Internal AI entrypoint | Separate demo UI or raw API | Workspace-oriented product surface |
| Runtime governance | Ad hoc scripts | Gateway-based policy and routing |
| Multi-tenant control | Usually missing | Built into runtime story |
| Streaming diagnosis | Hard to inspect | Trace + metrics friendly |
| Backend comparison | Manual experiments | Benchmark / aggregate / regression workflows |
| Safer rollout | One-shot switch | Gray routing + rollback path |
| K8s verification | Hand-built scripts | Smoke and observability workflows included |
The current local serving baseline is the steady FP8 workstation stack.
Current verified ports:
| Component | Port |
|---|---|
| control-plane | 8030 |
| gateway | 8020 |
| console | 4173 |
| qdrant | 6333 |
| main | 8100 |
| vision | 8101 |
| fast | 8102 |
| embed | 8103 |
| rerank | 8104 |
| degrade | 8105 |
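A quick way to use the port table is to generate probe URLs for each component. The `/healthz` path below is an assumption (the stack exposes health and readiness endpoints, but the exact paths are not listed here); substitute the real probe paths:

```python
# Verified ports from the table above; /healthz is an assumed probe path.
PORTS = {
    "control-plane": 8030,
    "gateway": 8020,
    "console": 4173,
    "qdrant": 6333,
    "main": 8100,
    "vision": 8101,
    "fast": 8102,
    "embed": 8103,
    "rerank": 8104,
    "degrade": 8105,
}

def probe_urls(host="127.0.0.1"):
    """Build one health-probe URL per component on the local baseline host."""
    return {name: f"http://{host}:{port}/healthz" for name, port in PORTS.items()}

urls = probe_urls()
print(urls["gateway"])  # -> http://127.0.0.1:8020/healthz
```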
Current ready public baseline:
`main`, `fast`, `vision`, `embed`, `rerank`, `qdrant`
Current non-default / non-ready public lane:
`degrade`
The current Kubernetes baseline is a minimal multi-node kind cluster:
- 1 control-plane node
- 2 worker nodes
- active components: `control-plane`, `deployment-controller`, `gateway`, `prometheus`, `grafana`, and `vllm`
- passing `smoke_k8s.sh`
This is a verified development baseline, not a production-grade HA cluster story.
Current verified public-facing APIs:
- `GET /v1/models`
- `POST /v1/chat/completions`
- `POST /v1/embeddings`
- `GET /v1/files`
- `POST /v1/files`
Current note:
- the gateway code exposes `/v1/responses`, but the current verified running instance still returns `404` there, so it should not be treated as part of today's promised external contract
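Since the chat path is OpenAI-compatible, any standard client works. A minimal stdlib sketch for building a request against the verified `/v1/chat/completions` endpoint (the gateway host/port and API key below are assumptions):

```python
import json
import urllib.request

def chat_request(base_url, api_key, model, prompt, stream=False):
    """Build an OpenAI-compatible chat-completions request for the gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

# Assumed local gateway address (port 8020) and a placeholder key.
req = chat_request("http://127.0.0.1:8020", "sk-example", "chat-default", "hello")
print(req.full_url)
# Sending it is one line once the stack is up:
#   resp = urllib.request.urlopen(req).read()
```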
Current verified aliases:
`chat-default`, `qwen-local-main`, `qwen-local-fast`, `chat-vision`, `chat-file`, `chat-browser`, `retrieval-embed-dev`
Important boundary:
- `chat-default` and `qwen-local-main` currently route to the same main lane
- `qwen-local-fast` and `chat-vision` are distinct backends
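The boundary can be expressed as an alias-to-lane map. The `chat-file` and `chat-browser` lane assignments below are assumptions (the source only verifies the `chat-default`/`qwen-local-main` overlap and the distinctness of `qwen-local-fast` and `chat-vision`):

```python
# Verified alias -> lane mapping; chat-file / chat-browser lanes are assumed.
ALIAS_LANES = {
    "chat-default": "main",
    "qwen-local-main": "main",     # verified: same main lane as chat-default
    "qwen-local-fast": "fast",     # verified: distinct backend
    "chat-vision": "vision",       # verified: distinct backend
    "chat-file": "main",           # assumption
    "chat-browser": "main",        # assumption
    "retrieval-embed-dev": "embed",
}

assert ALIAS_LANES["chat-default"] == ALIAS_LANES["qwen-local-main"]
assert ALIAS_LANES["qwen-local-fast"] != ALIAS_LANES["chat-vision"]
```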
- `contracts/`: source-of-truth contracts
- `infra/gateway-go/`: governance gateway
- `infra/mock-openai/`: local smoke and CI backend
- `infra/shared-gateway/`: stable shared-host edge pattern
- `infra/observability/`: Prometheus, Grafana, runbooks, chaos assets
- `infra/k8s/`: deploy, smoke, rollback, and observability workflows
- `services/control-plane/`: identity, placement, deployment, catalog, retrieval, workflow resources
- `services/host-agent/`: runtime lifecycle and engine adapters
- `serving/vllm/`: primary serving baseline
- `serving/sglang/`: secondary engine comparison path
- `web/console/`: team-facing workspace and runtime UI
- `bench/`: reports, baselines, aggregates, trends
- `scripts/`: dev, smoke, release, and maintenance scripts
- `docs/`: architecture, boundaries, evidence, and state-audit docs
AI-Stack currently focuses on a chat-first workspace and runtime governance baseline.
Current mainline scope includes:
- gateway-based chat serving
- governance and routing
- backend serving baselines
- benchmark and regression workflows
- observability and deployment validation
- file-grounded chat, browser-grounded chat, and direct vision entrypoints
Current boundaries are explicit:
- retrieval and embeddings are real, but still a development-baseline capability plane rather than a production retrieval platform
- multimodal is real, but not yet a production VLM/OCR platform
- workflow, batch/async, assistant-v2, and fit-v2 are active starter planes rather than finished product surfaces
- the Python orchestrator path is currently parked legacy
- the K8s story is a real baseline, but not yet a production-grade platform
This keeps the project honest, inspectable, and easier to adopt.
AI-Stack is a strong fit if you want to:
- build an internal AI workspace on top of self-hosted models
- standardize team access behind an OpenAI-compatible gateway
- add governance, traces, and metrics to an existing serving stack
- compare serving backends with repeatable benchmark evidence
- validate K8s delivery paths with smoke and signoff workflows
Natural next steps from the current baseline:
- deeper workspace polish
- richer admin controls
- stronger product packaging for private deployment
- broader runtime integrations
- tighter release and evidence automation
Recommended entry points:
- `README.md`
- `docs/PROJECT_STATE_AUDIT.md`
- `docs/LATEST_EVIDENCE_INDEX.md`
- `docs/TECHNICAL_OVERVIEW.md`
- `docs/PLATFORM_BOUNDARIES.md`
- `docs/PROJECT_PITCH.md`
- `docs/AI_STACK_TECH_PRODUCT_GUIDE_CN.md`
- `infra/gateway-go/README.md`
- `serving/vllm/README.md`
- `infra/k8s/README.md`
This repository currently uses the Apache-2.0 license.
Issues, discussions, and architecture feedback are welcome.
If you are working on:
- self-hosted LLM deployment
- runtime governance
- AI workspace UX
- observability and rollout safety
AI-Stack is built for exactly that layer.


