This project simulates AI/ML infrastructure scaling for prompt processing workloads on Kubernetes. It demonstrates how compute-intensive AI applications (like language model inference) can be deployed and scaled using Horizontal Pod Autoscaling (HPA) based on CPU utilization.
Since real GPU infrastructure can be expensive and complex to set up, this simulation uses CPU-based matrix operations to mimic the computational patterns of GPU-accelerated AI workloads (e.g., attention mechanisms in transformers).
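The simulated compute can be as simple as repeated NumPy matrix multiplications. A minimal sketch of the idea (the function name, iteration count, and prompt-seeding scheme are illustrative, not taken from the project source):

```python
import hashlib

import numpy as np

def simulate_inference(prompt: str, size: int = 100, iterations: int = 10) -> float:
    """Burn CPU with dense matrix multiplications to mimic GPU-style compute."""
    # Seed the matrices from the prompt so identical prompts produce
    # identical "results" (which makes caching meaningful).
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    a = rng.random((size, size))
    b = rng.random((size, size))
    result = a
    for _ in range(iterations):
        result = result @ b  # the CPU-bound hot loop
    return float(result.sum())
```

Each call is deterministic per prompt but expensive enough in aggregate to push CPU utilization up under concurrent load, which is what the HPA reacts to.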
- AI Workload Simulation: CPU-based matrix multiplications (100x100) to simulate GPU compute for prompt processing.
- Prompt Caching: In-memory storage of prompts with computed results.
- Intelligent Routing: Load balancing based on cached prompts.
- Kubernetes Autoscaling: HPA scales pods when CPU > 70%.
- Load Testing: Concurrent request simulation to trigger scaling.
- Mermaid Diagrams: Architecture visualization in Markdown.
- Flask Web Service: REST API for prompt caching and routing.
- NumPy Compute: Matrix operations simulating AI inference.
- Docker Containerization: Portable deployment.
- Kubernetes Deployment: LoadBalancer service, HPA, and rolling updates.
- Monitoring: Health probes, metrics-server integration.
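Stripped of the Flask layer, the caching and routing logic amounts to a dictionary keyed by prompt. The sketch below shows that core under illustrative names (`PromptCache`, `cached-pool`, `compute-pool` are assumptions, not identifiers from the project source):

```python
class PromptCache:
    """In-memory prompt cache with hit-aware routing."""

    def __init__(self):
        self._cache = {}  # prompt -> computed result
        self.hits = 0
        self.misses = 0

    def cache(self, prompt: str, compute) -> float:
        # Return the cached result, or compute and store it on a miss.
        if prompt in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[prompt] = compute(prompt)
        return self._cache[prompt]

    def route(self) -> str:
        # Route to the warm path once at least one prompt is cached.
        return "cached-pool" if self._cache else "compute-pool"
```

A Flask handler for `POST /cache` would then just parse the JSON body and delegate to `cache()`, keeping the web layer thin.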
Clone the repository:
```bash
git clone https://github.com/your-username/ai-infra-scaling-simulator.git
cd ai-infra-scaling-simulator
```
Build and run locally:
```bash
docker build -t ai-infra-simulator .
docker run -p 5000:5000 ai-infra-simulator
```
Test the API:
```bash
curl -X POST http://localhost:5000/cache -H "Content-Type: application/json" -d '{"prompt": "Hello AI"}'
curl http://localhost:5000/route
```
Prerequisites:
- Kubernetes cluster (e.g., DigitalOcean DOKS, minikube)
- kubectl configured
- Docker registry access
Deploy:
```bash
# Update image in k8s/deployment.yaml
kubectl apply -f k8s/
kubectl get services  # Note LoadBalancer IP
```
Load Test and Monitor:
```bash
./load_test.sh http://<load-balancer-ip>
kubectl get hpa --watch   # Monitor scaling
kubectl get pods --watch
```
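`load_test.sh` drives the scaling demo by firing many requests in parallel. The same pattern in Python looks roughly like this; `send_request` here is a stand-in for an HTTP `POST` to `/cache` (swap in a real HTTP call against the LoadBalancer IP to exercise the cluster):

```python
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt: str) -> str:
    # Stand-in for: requests.post(f"{base_url}/cache", json={"prompt": prompt})
    return f"cached:{prompt}"

def load_test(num_requests: int = 100, workers: int = 20) -> list:
    """Send num_requests prompts concurrently across a thread pool."""
    prompts = [f"prompt-{i}" for i in range(num_requests)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order, so results line up with prompts
        return list(pool.map(send_request, prompts))
```

Because each request triggers the matrix-multiplication workload server-side, sustained concurrency is what drives CPU utilization past the 70% HPA threshold.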
- `GET /`: Health check
- `POST /cache`: Cache a prompt with simulated compute
- `GET /route`: Route based on cached prompts
The load test sends concurrent requests, each triggering matrix operations on the server. When average CPU utilization exceeds 70%, the HPA automatically scales the deployment from 2 up to 10 replicas.
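A matching HPA manifest using the `autoscaling/v2` API would look like the sketch below; the thresholds mirror the behavior described above, but the resource names are illustrative and may differ from what ships in `k8s/`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-infra-simulator
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-infra-simulator
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that `Utilization` targets require CPU requests to be set on the pod spec and metrics-server to be running in the cluster.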
Monitor in real-time:
```bash
kubectl get hpa --watch
kubectl get pods --watch
```

- Real GPU Deployment: For actual AI/ML workloads, deploy on GPU-enabled clusters (AWS EKS with GPU instances, GCP GKE, etc.).
- Metrics: In production, use GPU metrics (e.g., via Prometheus) instead of CPU.
- Persistence: Add databases for prompt caching.
- Security: Implement authentication, rate limiting, and monitoring.
View ARCHITECTURE.md for Mermaid diagrams of the deployment architecture.
- Fork the repository
- Create a feature branch
- Submit a pull request
MIT License - see LICENSE file for details.