Anjali1425/ai-infra-scaling-simulator

AI Infrastructure Scaling Simulator

Overview

This project simulates the scaling of AI/ML prompt-processing workloads on Kubernetes. It demonstrates how compute-intensive AI applications (such as language-model inference) can be deployed and scaled with the Kubernetes Horizontal Pod Autoscaler (HPA) based on CPU utilization.

Since real GPU infrastructure can be expensive and complex to set up, this simulation uses CPU-based matrix operations to mimic the computational patterns of GPU-accelerated AI workloads (e.g., attention mechanisms in transformers).
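
The simulated "inference" step can be sketched as a CPU-bound NumPy matrix multiplication. This is a minimal sketch, not the repository's exact code: the function name `simulate_inference`, the prompt-derived seed, and the returned scalar are all illustrative assumptions.

```python
import time
import numpy as np

def simulate_inference(prompt: str, size: int = 100) -> float:
    """Burn CPU with a size x size matrix multiplication to stand in
    for GPU compute. Hypothetical helper; the repo's Flask handler
    may be structured differently."""
    rng = np.random.default_rng(seed=len(prompt))  # prompt-dependent seed
    a = rng.random((size, size))
    b = rng.random((size, size))
    # Reduce to a scalar so the work cannot be optimized away.
    return float((a @ b).sum())

start = time.perf_counter()
result = simulate_inference("Hello AI")
elapsed = time.perf_counter() - start
print(f"simulated compute result={result:.2f} in {elapsed * 1000:.1f} ms")
```

Because the seed is derived from the prompt, repeated calls with the same prompt return the same result, which is what makes caching the computed value worthwhile.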

Features

  • AI Workload Simulation: CPU-based matrix multiplications (100x100) to simulate GPU compute for prompt processing.
  • Prompt Caching: In-memory storage of prompts with computed results.
  • Intelligent Routing: Load balancing based on cached prompts.
  • Kubernetes Autoscaling: HPA scales pods when CPU > 70%.
  • Load Testing: Concurrent request simulation to trigger scaling.
  • Mermaid Diagrams: Architecture visualization in Markdown.

Architecture

  • Flask Web Service: REST API for prompt caching and routing.
  • NumPy Compute: Matrix operations simulating AI inference.
  • Docker Containerization: Portable deployment.
  • Kubernetes Deployment: LoadBalancer service, HPA, and rolling updates.
  • Monitoring: Health probes, metrics-server integration.

Quick Start

Local Development

  1. Clone the repository:

    git clone https://github.com/your-username/ai-infra-scaling-simulator.git
    cd ai-infra-scaling-simulator
  2. Build and run locally:

    docker build -t ai-infra-simulator .
    docker run -p 5000:5000 ai-infra-simulator
  3. Test the API:

    curl -X POST http://localhost:5000/cache -H "Content-Type: application/json" -d '{"prompt": "Hello AI"}'
    curl http://localhost:5000/route

Kubernetes Deployment

  1. Prerequisites:

    • Kubernetes cluster (e.g., DigitalOcean DOKS, minikube)
    • kubectl configured
    • Docker registry access
  2. Deploy:

    # Update image in k8s/deployment.yaml
    kubectl apply -f k8s/
    kubectl get services  # Note LoadBalancer IP
  3. Load Test and Monitor:

    ./load_test.sh http://<load-balancer-ip>
    kubectl get hpa --watch  # Monitor scaling
    kubectl get pods --watch

API Endpoints

  • GET /: Health check
  • POST /cache: Cache a prompt with simulated compute
  • GET /route: Route based on cached prompts
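
The caching and routing behavior behind these endpoints can be sketched as plain Python. This is illustrative only: the in-memory dict, the `cache_prompt`/`route` names, and the hash-based backend selection are assumptions, not the repository's actual Flask implementation.

```python
import hashlib

# In-memory prompt cache: prompt text -> simulated compute result.
prompt_cache: dict[str, float] = {}

def cache_prompt(prompt: str, result: float) -> bool:
    """POST /cache analogue: store a prompt with its computed result.
    Returns True on a cache miss (new entry), False if already cached."""
    if prompt in prompt_cache:
        return False
    prompt_cache[prompt] = result
    return True

def route(prompt: str, n_backends: int = 2) -> int:
    """GET /route analogue: pick a backend deterministically so
    repeated prompts land on the same pod (and its warm cache)."""
    digest = hashlib.sha256(prompt.encode()).hexdigest()
    return int(digest, 16) % n_backends

cache_prompt("Hello AI", 42.0)
print(route("Hello AI"), route("Hello AI"))  # same backend both times
```

Deterministic hashing keeps routing stable for cached prompts; note that an in-memory cache is per-pod and is lost on restart (see Production Considerations).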

Scaling Demonstration

The load test sends concurrent requests, each performing matrix operations. When average CPU utilization exceeds 70%, the HPA automatically scales the deployment from 2 replicas up to a maximum of 10.
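
An HPA manifest matching this behavior might look like the following sketch. The resource names assume the Deployment in k8s/deployment.yaml is called `ai-infra-simulator`; adjust to match the actual manifests.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-infra-simulator
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-infra-simulator   # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

CPU-utilization targets require metrics-server (noted under Monitoring above), and the pod spec must set CPU requests for utilization percentages to be computed.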

Monitor in real-time:

kubectl get hpa --watch
kubectl get pods --watch

Production Considerations

  • Real GPU Deployment: For real AI/ML workloads, deploy on GPU-enabled clusters (e.g., AWS EKS with GPU instances or GKE with GPU node pools).
  • Metrics: In production, scale on GPU metrics (e.g., exported to Prometheus) rather than CPU utilization.
  • Persistence: Back the prompt cache with an external store so cached entries survive pod restarts; the in-memory cache here is per-pod.
  • Security: Implement authentication, rate limiting, and monitoring.

Architecture Diagrams

See ARCHITECTURE.md for Mermaid diagrams of the deployment architecture.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

License

MIT License - see LICENSE file for details.
