
Documentation class: live truth.
Default recovery path: README -> PROJECT_STATE_AUDIT -> LATEST_EVIDENCE_INDEX.
Most authoritative source for the current state in this scope: this file.
Updated: 2026-04-07

AI-Stack

Self-hosted AI workspace for teams

AI-Stack is a self-hosted AI workspace built for private LLM deployment. It combines a team-facing web chat experience and admin/runtime controls on top, with an OpenAI-compatible runtime governance layer underneath.

It is designed for teams that want more than "a model that runs":

  • a usable internal AI entrypoint
  • centralized model access
  • tenant-aware policy and routing
  • traceable and observable runtime behavior
  • safer benchmark, regression, and rollout workflows

The public-facing project name is AI-Stack. Some repository paths and historical runtime identifiers still use ai-stack; this is expected in the current codebase.

What AI-Stack Provides

Team-facing workspace

  • web chat for internal AI usage
  • a product surface suitable for private deployment
  • a foundation for multi-model conversations and controlled rollout
  • file-grounded chat, browser-grounded chat, and direct vision entrypoints on the same /chat surface

Admin and governance layer

  • tenant-aware runtime policy
  • rate, inflight, and timeout controls
  • static and weighted gray routing
  • runtime health, readiness, and metrics endpoints
  • trace output for request-level diagnosis
  • deployment publication, route binding, quarantine, rebound, and catalog-backed routing
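
Weighted gray routing can be pictured as a weighted choice between a stable lane and a candidate lane. The sketch below is illustrative only; the lane names and weights are invented for the example, and this is not the gateway's actual routing code.

```python
import random

def pick_lane(lanes, rng=random.random):
    """Pick a lane by normalized weight, e.g. a 90/10 gray rollout.

    `lanes` is a list of (name, weight) pairs; weights need not sum to 1.
    """
    total = sum(w for _, w in lanes)
    r = rng() * total
    acc = 0.0
    for name, weight in lanes:
        acc += weight
        if r < acc:
            return name
    return lanes[-1][0]  # guard against float rounding at the upper edge

# Illustrative 90/10 split between a stable lane and a gray candidate.
lanes = [("vllm-main", 90), ("sglang-candidate", 10)]
```

Rollback in this picture is simply setting the candidate weight to zero, which is why weighted routing and rollback pair naturally.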

Serving and delivery toolchain

  • OpenAI-compatible chat serving path
  • OpenAI-compatible embeddings path
  • vLLM as the primary serving baseline
  • SGLang as a secondary comparison and fast-lane path
  • benchmark, repeated-run comparison, and regression workflows
  • K8s smoke, observability export, rollback, and signoff workflows

Why AI-Stack

Many teams can already launch a model server.

The harder part starts after that:

  • turning model access into a usable internal product
  • controlling runtime behavior across teams or tenants
  • diagnosing failures and stream issues
  • comparing backend performance with real reports
  • validating rollouts before they hit real users
  • keeping deployment and release evidence organized

AI-Stack focuses on that layer.

Who It Is For

AI-Stack is built for:

  • teams deploying internal AI assistants in private environments
  • engineers building self-hosted LLM platforms
  • organizations that need governance, observability, and safer rollout workflows
  • teams that want a workspace on top and a controllable runtime underneath

Core Highlights

  • OpenAI-compatible gateway
  • Tenant-aware policy enforcement
  • Gray routing and rollback
  • SSE passthrough for chat streaming
  • Trace JSON write-out for request diagnosis
  • Prometheus-friendly metrics
  • vLLM primary baseline
  • SGLang comparison and fast-lane path
  • Benchmark / regression / signoff workflows
  • Kubernetes smoke and observability export
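
SSE passthrough means the gateway relays the backend's `data:` event stream to the client unmodified. A minimal client-side parser for such a stream might look like the following sketch; the chunk shape assumes the common OpenAI-style `data: {...}` / `data: [DONE]` framing and is not code from this repository.

```python
import json

def parse_sse_chunks(lines):
    """Yield decoded JSON payloads from OpenAI-style SSE lines.

    Stops at the `[DONE]` sentinel; skips blank keep-alive lines.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Example stream as it would arrive over the passthrough path.
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse_chunks(stream))
```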

Architecture At A Glance

```mermaid
flowchart TB
  subgraph Product["Team-facing product layer"]
    Chat["Web Chat Workspace"]
    Admin["Admin Console / Runtime UI"]
  end

  subgraph Governance["OpenAI-compatible runtime governance layer"]
    Gateway["gateway-go\npolicy / routing / SSE / traces / metrics"]
    CP["control-plane + deployment-controller\nidentity / placement / publish"]
    HA["host-agent\nruntime lifecycle"]
  end

  subgraph Serving["Serving backend layer"]
    VLLM["vLLM\nprimary baseline"]
    SGLang["SGLang\nsecondary compare / fast lane"]
    Llama["llama.cpp\nintegrated runtime path"]
  end

  subgraph Ops["Validation, delivery, and operations"]
    Bench["benchmark / aggregates / regression"]
    K8s["K8s smoke / observability / rollback / signoff"]
  end

  Product --> Gateway
  Gateway --> CP
  CP --> HA
  HA --> VLLM
  HA --> SGLang
  HA --> Llama
  Gateway --> Bench
  CP --> Bench
  HA --> Bench
  Bench --> K8s
```

Screenshots

Chat Workspace

AI-Stack chat workspace

Admin Console

AI-Stack admin console overview

Observability / Events

AI-Stack observability events view

Chat Workspace Design

The current /chat surface is designed as a governed workspace, not just a model picker. It separates evidence collection from answer generation so teams can keep one chat entrypoint while still routing requests through different grounded paths.

```mermaid
flowchart LR
  User["User prompt / attachment"] --> Mode["Evidence mode\nAuto / Chat / File / Browser / Vision Direct"]
  Mode -->|Auto| Intent["Intent resolver"]
  Mode -->|Explicit File| FileChain["File evidence chain"]
  Mode -->|Explicit Browser| BrowserChain["Browser evidence chain"]
  Mode -->|Explicit Vision| VisionDirect["Vision direct path"]
  Intent --> ChatPath["Plain chat path"]
  Intent --> FileChain
  Intent --> BrowserChain
  Intent --> Utility["Live utility guardrail"]
  ChatPath --> Answer["Answer model\nMain / Fast / Vision"]
  FileChain --> Answer
  BrowserChain --> Answer
  VisionDirect --> Answer
  Utility --> Answer
  Answer --> UX["Citations / evidence rail / session context"]
```

Current workspace design principles:

| Design axis | Current behavior | Why it matters |
| --- | --- | --- |
| Unified entrypoint | plain chat, file-grounded chat, browser-grounded chat, and Vision Direct live on the same /chat surface | teams do not need separate products for each interaction mode |
| Evidence modes | Auto, Chat, File, Browser, and Vision Direct are live now; Mixed is visible as the next mode but not wired end-to-end yet | the UI exposes real capability boundaries instead of hiding them |
| Governed routing | Auto resolves intent first; explicit web-search directives force Browser mode, and current date/time prompts can resolve to utility_live | the workspace can escalate into grounded flows without relying only on prompt wording |
| Answer-model split | Main, Fast, or Vision can be chosen independently from the evidence chain | teams can trade off latency, quality, and modality without changing the product flow |
| Grounded UX | citations, evidence rail metadata, source chips, and attachment status are part of the page model | grounded answers stay inspectable instead of feeling like a black box |
| Operator defaults | same-origin console proxy, concise default answers, and session-local context are already part of the current experience | the workspace behaves like an internal product surface, not just a raw API tester |
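
The governed-routing precedence described above (explicit modes win; Auto falls back to intent resolution, with web-search directives forcing Browser mode and date/time prompts resolving to utility_live) can be sketched roughly as follows. The trigger phrases are invented for illustration; the real intent resolver is more involved than keyword matching.

```python
def resolve_route(mode, prompt):
    """Toy resolver mirroring the documented precedence:
    explicit mode > web-search directive > live-utility trigger > plain chat.
    """
    if mode in ("Chat", "File", "Browser", "Vision Direct"):
        return mode  # explicit modes bypass Auto resolution entirely
    # Auto mode: hypothetical trigger phrases, for illustration only.
    text = prompt.lower()
    if "search the web" in text:
        return "Browser"
    if "what time is it" in text or "today's date" in text:
        return "utility_live"
    return "Chat"
```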

Admin Console Surfaces

The current admin/runtime console is already split into concrete operational surfaces:

| Surface | Current purpose |
| --- | --- |
| Overview | project health, route mix, recent releases, and operator shortcuts |
| Models & Routing | model inventory, route selection, candidate lanes, gray rollout, and publish/rollback workflows |
| Deployments | host view, placement decisions, deployment health, and release state |
| Observability | events, trace-oriented investigation, logs, and benchmark context |
| Access & Credentials | service accounts, API keys, and OAuth client management |
| Cost & Policy | budget posture, policy triggers, tenant or project cost views, and current enforcement posture |
| Settings | workspace defaults, shortcuts, and console-level preferences |

Expanded Capability Surface

AI-Stack now reaches beyond "serve a model" into a broader team and platform capability layer:

| Capability area | Current business-facing outcome |
| --- | --- |
| Internal AI workspace | a single team-facing chat entrypoint for everyday internal AI usage |
| File-grounded Q&A | upload, process, retrieve, and answer with citations on top of a file evidence chain |
| Browser-grounded research | search, read, rerank, and answer with web citations for freshness-seeking prompts |
| Vision interaction | explicit Vision Direct path for image-aware requests on the same workspace surface |
| Governed model access | centralized main, fast, vision, and embeddings access behind API keys and aliases |
| Runtime governance | tenant-aware policy, routing, rate controls, timeout controls, and gray rollout |
| Deployment operations | placement, publish, quarantine, rebound, route-binding correction, and catalog-backed routing |
| Access management | service-account, API-key, and OAuth-client control for team or project use |
| Runtime observability | health, readiness, metrics, events, traces, and evidence-oriented debugging views |
| External integration | OpenAI-compatible APIs for other internal tools and apps |
| Release validation | benchmark, repeated-run comparison, regression, kind smoke, rollback, and signoff workflows |

Quick Start

1. Start the local gateway development path

```bash
GO_BIN=/home/fiscan/.local/tools/go/bin/go bash ai-stack/scripts/dev.sh
```

2. Start the steady FP8 workstation stack

```bash
cd ai-stack
bash scripts/steady_fp8_stack.sh start
```

3. Run the K8s smoke path

```bash
bash infra/k8s/scripts/install.sh
bash infra/k8s/scripts/smoke_k8s.sh
```

4. Run the non-GPU kind CI smoke lane

```bash
bash infra/k8s/scripts/kind_ci_smoke.sh
```

Current Capability Map

| Area | Current status |
| --- | --- |
| Team-facing chat workspace | Available |
| Admin console / runtime UI | Available |
| Governed Auto routing with answer-model selection | Available |
| OpenAI-compatible chat completions | Available |
| SSE streaming path | Available |
| Models and routing console | Available |
| Deployment and release console | Available |
| Access and credential management | Available |
| Cost and policy console | Available |
| Observability and events console | Available |
| Tenant-aware policy enforcement | Available |
| Gray routing / rollback | Available |
| Tracing and metrics | Available |
| vLLM primary serving baseline | Available |
| SGLang comparison / fast-lane path | Available |
| File-grounded chat | Available as a development baseline |
| Browser-grounded chat | Available as a development baseline |
| Direct vision path | Available |
| Embeddings runtime support | Available as a development baseline |
| Benchmark / repeated-run reports | Available |
| K8s smoke / observability / rollback workflows | Available |

What Makes AI-Stack Different

| Problem | Typical setup | AI-Stack |
| --- | --- | --- |
| Internal AI entrypoint | Separate demo UI or raw API | Workspace-oriented product surface |
| Runtime governance | Ad hoc scripts | Gateway-based policy and routing |
| Multi-tenant control | Usually missing | Built into runtime story |
| Streaming diagnosis | Hard to inspect | Trace + metrics friendly |
| Backend comparison | Manual experiments | Benchmark / aggregate / regression workflows |
| Safer rollout | One-shot switch | Gray routing + rollback path |
| K8s verification | Hand-built scripts | Smoke and observability workflows included |

Current Verified Runtime Baselines

Workstation baseline

The current local serving baseline is the steady FP8 workstation stack.

Current verified ports:

| Component | Port |
| --- | --- |
| control-plane | 8030 |
| gateway | 8020 |
| console | 4173 |
| qdrant | 6333 |
| main | 8100 |
| vision | 8101 |
| fast | 8102 |
| embed | 8103 |
| rerank | 8104 |
| degrade | 8105 |
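
The ports above can be spot-checked with a plain TCP probe once the steady FP8 stack is running locally. This is only a liveness sketch (is anything listening?), not a readiness check against the runtime's actual health endpoints.

```python
import socket

def port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Component/port pairs taken from the table above.
baseline_ports = {"control-plane": 8030, "gateway": 8020, "console": 4173, "qdrant": 6333}
status = {name: port_open("127.0.0.1", port) for name, port in baseline_ports.items()}
```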

Current ready public baseline:

  • main
  • fast
  • vision
  • embed
  • rerank
  • qdrant

Current non-default / non-ready public lane:

  • degrade

K8s baseline

The current Kubernetes baseline is a minimal multi-node kind cluster:

  • 1 control-plane node
  • 2 worker nodes
  • active control-plane, deployment-controller, gateway, prometheus, grafana, and vllm
  • passing smoke_k8s.sh

This is a verified development baseline, not a production-grade HA cluster story.

Current Public API Surface

Current verified public-facing APIs:

  • GET /v1/models
  • POST /v1/chat/completions
  • POST /v1/embeddings
  • GET /v1/files
  • POST /v1/files

Current note:

  • the gateway code exposes /v1/responses, but the current verified running instance still returns 404 there, so it should not be treated as part of today's promised external contract

Current verified aliases:

  • chat-default
  • qwen-local-main
  • qwen-local-fast
  • chat-vision
  • chat-file
  • chat-browser
  • retrieval-embed-dev

Important boundary:

  • chat-default and qwen-local-main currently route to the same main lane
  • qwen-local-fast and chat-vision are distinct backends
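
A request against the verified surface can be assembled as follows. The payload shape follows the OpenAI-compatible contract described above; the base URL, API key, and message are placeholders, while `chat-default` and the gateway port 8020 come from the documented baseline.

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, messages, stream=False):
    """Build an OpenAI-compatible POST /v1/chat/completions request (not sent here)."""
    body = json.dumps({"model": model, "messages": messages, "stream": stream}).encode()
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request(
    "http://127.0.0.1:8020",              # gateway port from the workstation baseline
    "YOUR_API_KEY",                        # placeholder credential
    "chat-default",                        # documented default alias
    [{"role": "user", "content": "ping"}],
)
# Send with urllib.request.urlopen(req) against a running gateway.
```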

Repository Layout

Scope And Current Boundaries

AI-Stack currently focuses on a chat-first workspace and runtime governance baseline.

Current mainline scope includes:

  • gateway-based chat serving
  • governance and routing
  • backend serving baselines
  • benchmark and regression workflows
  • observability and deployment validation
  • file-grounded chat, browser-grounded chat, and direct vision entrypoints

Current boundaries are explicit:

  • retrieval and embeddings are real, but still a development-baseline capability plane rather than a production retrieval platform
  • multimodal is real, but not yet a production VLM/OCR platform
  • workflow, batch/async, assistant-v2, and fit-v2 are active starter planes rather than finished product surfaces
  • the Python orchestrator path is currently parked as legacy
  • the K8s story is a real baseline, but not yet a production-grade platform

This keeps the project honest, inspectable, and easier to adopt.

Good Fits

AI-Stack is a strong fit if you want to:

  • build an internal AI workspace on top of self-hosted models
  • standardize team access behind an OpenAI-compatible gateway
  • add governance, traces, and metrics to an existing serving stack
  • compare serving backends with repeatable benchmark evidence
  • validate K8s delivery paths with smoke and signoff workflows

Roadmap Direction

Natural next steps from the current baseline:

  • deeper workspace polish
  • richer admin controls
  • stronger product packaging for private deployment
  • broader runtime integrations
  • tighter release and evidence automation

Documentation

Recommended entry points:

License

This repository currently uses the Apache-2.0 license.

Contributing

Issues, discussions, and architecture feedback are welcome.

If you are working on:

  • self-hosted LLM deployment
  • runtime governance
  • AI workspace UX
  • observability and rollout safety

AI-Stack is built for exactly that layer.