Reliability, SLOs, and Incident Management for GenAI Systems

This repository contains the hands-on demos for the Pluralsight course:

Reliability, SLOs, and Incident Management for GenAI Systems

What you will learn

By running these demos you will learn how to operate GenAI systems with an SRE mindset:

Define service health beyond HTTP 200
Instrument GenAI SLIs with Prometheus and FastAPI
Build SLO compliance dashboards with burn-rate alerts and feature freeze triggers
Implement timeouts, retries with exponential backoff, and circuit breakers
Run chaos experiments with Toxiproxy and Redis fallbacks
Build alerting and escalation pipelines with Alertmanager
Conduct blameless postmortems and track MTTR and reliability backlogs

Repository structure

.
├── README.md
├── course_setup/
│   ├── README.md
│   └── setup_all.sh
├── demos/
│   └── README.md                          ← course-wide demo index
├── module-01-production-reliability-fundamentals/
│   └── demo-clip4-otel-health-probes/
├── module-02-slis-slos-slas/
│   ├── demo-clip2-prom-metrics-fastapi/
│   └── demo-clip4-slo-burnrate-freeze/
├── module-03-failure-handling-resilience/
│   ├── demo-clip2-retries-circuit-breakers/
│   └── demo-clip4-chaos-toxiproxy-redis-fallbacks/
└── module-04-incident-response-operational-excellence/
    ├── demo-clip2-alerting-escalation-alertmanager/
    └── demo-clip4-blameless-postmortems-mttr-backlog/

Requirements

macOS
Docker Desktop
Python 3.9 or higher
Git, curl, jq

One-time setup

From the repository root:

./course_setup/setup_all.sh

Start Docker Desktop and verify:

open -a Docker
docker version
docker compose version

How to run any demo

Every demo follows the same three-step pattern:

Step 1 — Pre-flight check

Run this before every session. It verifies Docker, ports, services, and the dashboard — and opens an HTML report with fix instructions for anything that fails.

./scripts/preflight_check.sh

All checks must pass before you proceed.

Step 2 — Start and run

./scripts/demo_up.sh
./scripts/run_story.sh

Step 3 — Stop

./scripts/demo_down.sh

See demos/README.md for the full index of all demos with paths, LOs, and port references.

Demo index

Module	Clip	Demo	LO
1	4	OpenTelemetry Health Checks and Synthetic Probes	1c
2	2	Instrument GenAI SLIs with Prometheus and FastAPI	2a
2	4	SLO Dashboards, Burn-Rate Alerts, and Feature Freeze Triggers	2d
3	2	Retries and Circuit Breakers with HTTPX	3a
3	4	Chaos Testing with Toxiproxy, Redis, and Fallbacks	3b, 3c, 3d
4	2	Alerting and Escalation with Prometheus Alertmanager	4b
4	4	Blameless Postmortems, MTTR Metrics, and Reliability Backlogs	5a, 5b, 5c

Troubleshooting

Docker not responding

Hard restart Docker Desktop on Mac:

killall Docker\ Desktop 2>/dev/null; killall com.docker.backend 2>/dev/null
sleep 5
open -a Docker

Wait for the whale icon to stop animating, then:

docker context use default
docker info | grep "Server Version"

Port already in use

Find and kill the process holding the port:

lsof -nP -iTCP:<PORT> -sTCP:LISTEN
kill -9 <PID>

App does not start

Check the application log:

tail -n 50 .run/app.log

Safety notes

All demos run locally using Docker and local Python processes. No cloud resources are created. No API keys are required.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reliability, SLOs, and Incident Management for GenAI Systems

What you will learn

Repository structure

Requirements

One-time setup

How to run any demo

Step 1 — Pre-flight check

Step 2 — Start and run

Step 3 — Stop

Demo index

Troubleshooting

Docker not responding

Port already in use

App does not start

Safety notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.vscode		.vscode
course_setup		course_setup
demos		demos
module-01-production-reliability-fundamentals		module-01-production-reliability-fundamentals
module-02-slis-slos-slas		module-02-slis-slos-slas
module-03-failure-handling-resilience		module-03-failure-handling-resilience
module-04-incident-response-operational-excellence		module-04-incident-response-operational-excellence
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Reliability, SLOs, and Incident Management for GenAI Systems

What you will learn

Repository structure

Requirements

One-time setup

How to run any demo

Step 1 — Pre-flight check

Step 2 — Start and run

Step 3 — Stop

Demo index

Troubleshooting

Docker not responding

Port already in use

App does not start

Safety notes

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages