This project simulates AI/ML infrastructure scaling for prompt processing workloads on Kubernetes. It demonstrates how compute-intensive AI applications (like language model inference) can be deployed and scaled using Horizontal Pod Autoscaling (HPA) based on CPU utilization.
Since real GPU infrastructure can be expensive and complex to set up, this simulation uses CPU-based matrix operations to mimic the computational patterns of GPU-accelerated AI workloads (e.g., attention mechanisms in transformers).
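The simulated compute can be as simple as repeated NumPy matrix multiplications. A minimal sketch of the idea (the function name, iteration count, and prompt-seeding scheme are illustrative, not taken from the project source):

```python
import hashlib

import numpy as np

def simulate_inference(prompt: str, size: int = 100, iterations: int = 10) -> float:
    """Burn CPU with dense matrix multiplications to mimic GPU-style compute."""
    # Seed the matrices from the prompt so identical prompts produce
    # identical "results" (which makes caching meaningful).
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    a = rng.random((size, size))
    b = rng.random((size, size))
    result = a
    for _ in range(iterations):
        result = result @ b  # the CPU-bound hot loop
    return float(result.sum())
```

Each call is deterministic per prompt but expensive enough in aggregate to push CPU utilization up under concurrent load, which is what the HPA reacts to.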
- AI Workload Simulation: CPU-based matrix multiplications (100x100) to simulate GPU compute for prompt processing.
- Prompt Caching: In-memory storage of prompts with computed results.
- Intelligent Routing: Load balancing based on cached prompts.
- Kubernetes Autoscaling: HPA scales pods when CPU > 70%.
- Load Testing: Concurrent request simulation to trigger scaling.
- Mermaid Diagrams: Architecture visualization in Markdown.
- Flask Web Service: REST API for prompt caching and routing.
- NumPy Compute: Matrix operations simulating AI inference.
- Docker Containerization: Portable deployment.
- Kubernetes Deployment: LoadBalancer service, HPA, and rolling updates.
- Monitoring: Health probes, metrics-server integration.
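Stripped of the Flask layer, the caching and routing logic amounts to a dictionary keyed by prompt. The sketch below shows that core under illustrative names (`PromptCache`, `cached-pool`, `compute-pool` are assumptions, not identifiers from the project source):

```python
class PromptCache:
    """In-memory prompt cache with hit-aware routing."""

    def __init__(self):
        self._cache = {}  # prompt -> computed result
        self.hits = 0
        self.misses = 0

    def cache(self, prompt: str, compute) -> float:
        # Return the cached result, or compute and store it on a miss.
        if prompt in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[prompt] = compute(prompt)
        return self._cache[prompt]

    def route(self) -> str:
        # Route to the warm path once at least one prompt is cached.
        return "cached-pool" if self._cache else "compute-pool"
```

A Flask handler for `POST /cache` would then just parse the JSON body and delegate to `cache()`, keeping the web layer thin.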
Clone the repository:
```bash
git clone https://github.com/your-username/ai-infra-scaling-simulator.git
cd ai-infra-scaling-simulator
```
Build and run locally:
```bash
docker build -t ai-infra-simulator .
docker run -p 5000:5000 ai-infra-simulator
```
Test the API:
```bash
curl -X POST http://localhost:5000/cache -H "Content-Type: application/json" -d '{"prompt": "Hello AI"}'
curl http://localhost:5000/route
```
Prerequisites:
- Kubernetes cluster (e.g., DigitalOcean DOKS, minikube)
- kubectl configured
- Docker registry access
Deploy:
```bash
# Update image in k8s/deployment.yaml
kubectl apply -f k8s/
kubectl get services  # Note LoadBalancer IP
```
Load Test and Monitor:
```bash
./load_test.sh http://<load-balancer-ip>
kubectl get hpa --watch   # Monitor scaling
kubectl get pods --watch
```
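`load_test.sh` drives the scaling demo by firing many requests in parallel. The same pattern in Python looks roughly like this; `send_request` here is a stand-in for an HTTP `POST` to `/cache` (swap in a real HTTP call against the LoadBalancer IP to exercise the cluster):

```python
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt: str) -> str:
    # Stand-in for: requests.post(f"{base_url}/cache", json={"prompt": prompt})
    return f"cached:{prompt}"

def load_test(num_requests: int = 100, workers: int = 20) -> list:
    """Send num_requests prompts concurrently across a thread pool."""
    prompts = [f"prompt-{i}" for i in range(num_requests)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order, so results line up with prompts
        return list(pool.map(send_request, prompts))
```

Because each request triggers the matrix-multiplication workload server-side, sustained concurrency is what drives CPU utilization past the 70% HPA threshold.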
- `GET /`: Health check
- `POST /cache`: Cache a prompt with simulated compute
- `GET /route`: Route based on cached prompts
The load test sends concurrent requests, each triggering matrix operations on the server. When average CPU utilization exceeds 70%, the HPA automatically scales the deployment from 2 up to 10 replicas.
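A matching HPA manifest using the `autoscaling/v2` API would look like the sketch below; the thresholds mirror the behavior described above, but the resource names are illustrative and may differ from what ships in `k8s/`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-infra-simulator
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-infra-simulator
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that `Utilization` targets require CPU requests to be set on the pod spec and metrics-server to be running in the cluster.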
Monitor in real-time:
```bash
kubectl get hpa --watch
kubectl get pods --watch
```

- Real GPU Deployment: For actual AI/ML workloads, deploy on GPU-enabled clusters (AWS EKS with GPU instances, GCP GKE, etc.).
- Metrics: In production, use GPU metrics (e.g., via Prometheus) instead of CPU.
- Persistence: Add databases for prompt caching.
- Security: Implement authentication, rate limiting, and monitoring.
View ARCHITECTURE.md for Mermaid diagrams of the deployment architecture.
- Fork the repository
- Create a feature branch
- Submit a pull request
MIT License - see LICENSE file for details.