
Commit bed041b

bnsoni and claude committed
docs: add AMI-based launch pipeline documentation
Complete documentation of the automated pipeline for launching isolated staging environments from pre-built AMIs, running Playwright E2E tests via OCI, and tearing down.

Covers: architecture, AMI contents, all 6 pipeline steps in detail, timing, repository map, secrets/variables, IAM policy, known behaviors (DM collectstatic, Mentor SPA, ALB split-brain, OAuth sync), and AMI creation process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 025fbf8 commit bed041b

1 file changed

Lines changed: 252 additions & 0 deletions

@@ -0,0 +1,252 @@
1+
# AMI-Based Launch Pipeline
2+
3+
Automated pipeline for launching isolated staging environments from pre-built AMIs, running E2E tests, and tearing down — all via GitHub Actions.
4+
5+
## Overview
6+
```
┌─────────────────────────────────────────────────────────────────┐
│                     GitHub Actions Workflow                     │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐                           │
│  │ Build        │    │ Launch EC2   │                           │
│  │ Playwright   │    │ from AMI     │  (parallel)               │
│  │ Image (OCIR) │    │ + Service    │                           │
│  └──────┬───────┘    │ Update       │                           │
│         │            └──────┬───────┘                           │
│         │                   │                                   │
│         └─────────┬─────────┘                                   │
│                   ▼                                             │
│           ┌────────────────┐                                    │
│           │ Run Playwright │  (OCI Container Instances          │
│           │ Tests          │   hit mentorai.stgX.iblai.org)     │
│           └───────┬────────┘                                    │
│                   ▼                                             │
│           ┌────────────────┐                                    │
│           │ Terminate      │                                    │
│           │ EC2 Instance   │                                    │
│           └────────────────┘                                    │
└─────────────────────────────────────────────────────────────────┘
```
31+
32+
## Architecture
33+
34+
Each staging environment (stg1–stg4) has permanent AWS infrastructure:
35+
| Resource | Purpose | Persists between launches |
|----------|---------|--------------------------|
| VPC + Subnets | Networking | Yes |
| ALB + Target Group | Load balancer with TLS termination | Yes |
| ACM Certificates | SSL for `*.stgX.iblai.org` | Yes |
| Route53 Records | DNS → ALB | Yes |
| Security Groups | Firewall rules | Yes |
| S3 Buckets | Media + static storage | Yes |
| **EC2 Instance** | **Platform server** | **No (ephemeral)** |
45+
46+
The EC2 instance is the only component created and destroyed per pipeline run. Everything else is pre-provisioned via Terraform and reused.
47+
48+
## Pre-Built AMI Contents
49+
50+
Each AMI is a snapshot of a fully configured staging environment:
51+
- **OS**: Ubuntu 22.04 with Docker, pyenv, Python 3.11.8, AWS CLI
- **Platform CLI**: iblai-cli-ops installed via iblai-prod-images
- **Services** (Docker containers):
  - iblai-dm-pro (Django, PostgreSQL, Redis, Celery, Langfuse, ClickHouse, MinIO)
  - iblai-edx-pro (LMS, CMS, MySQL, MongoDB, Redis, Elasticsearch, MFE)
  - Auth SPA, Mentor SPA, Skills SPA
  - Nginx reverse proxy
- **Data**: Test platforms, users, RBAC, analytics views pre-seeded
- **Config**: S3 buckets, AWS credentials, TimescaleDB enabled
61+
62+
## Pipeline Steps — Detailed
63+
64+
### Step 1: Build Playwright Image
65+
66+
**What**: Builds a Docker image containing the Playwright test suite from the mentorai repo and pushes it to Oracle Cloud Container Registry (OCIR).
67+
68+
**Where**: GitHub Actions runner (ubuntu-latest) → OCIR
69+
70+
**Image**: `iad.ocir.io/idcwyla5j5cr/ibl-mentor-playwright:{tag}`
71+
72+
**Contents**: Playwright browsers (Chromium, Firefox, WebKit), test specs from `e2e/journeys/`, page objects, test utilities, AWS CLI for S3 log upload.
73+
74+
**Caching**: If an image with the same tag already exists in OCIR, the build is skipped.
75+
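The tag check could look roughly like the sketch below. The document does not show how the workflow performs the check, so this is an assumption: it relies on `docker manifest inspect`, which exits non-zero when the reference cannot be found, and it assumes the runner is already logged in to OCIR. The names `image_exists` and `should_build` are hypothetical.

```python
import subprocess
from typing import Callable


def image_exists(ref: str, run: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> bool:
    """Return True if the registry already holds `ref` (assumed check, see lead-in)."""
    result = run(
        ["docker", "manifest", "inspect", ref],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # a missing image is an expected outcome, not an error
    )
    return result.returncode == 0


def should_build(ref: str, run: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> bool:
    """Build only when no image with this tag exists yet."""
    return not image_exists(ref, run)
```

The `run` parameter exists only so the check can be exercised without a Docker daemon; in a workflow it would be left at its default.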
76+
### Step 2: Launch EC2 from AMI
77+
78+
**What**: Provisions a fresh EC2 instance from the pre-built AMI into the existing VPC/subnet/security group.
79+
80+
**How** (via boto3 in the iblai-infra-cli tool):
81+
1. `ec2:RunInstances` with the AMI ID, instance type (t3.2xlarge), 200GB gp3 volume
82+
2. Wait for instance to enter `running` state
83+
3. Get public IP address
84+
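The three steps above can be sketched with boto3 as follows. This is an illustration, not the code in iblai-infra-cli: the function name `launch_from_ami` is hypothetical, the client is passed in (for example `boto3.client("ec2")`), and the root device name `/dev/sda1` is an assumption that should be checked against the AMI.

```python
from typing import Any, Optional, Tuple


def launch_from_ami(
    ec2: Any,  # a boto3 EC2 client, e.g. boto3.client("ec2")
    ami_id: str,
    subnet_id: str,
    sg_id: str,
    key_name: str,
    instance_type: str = "t3.2xlarge",
    volume_gb: int = 200,
    root_device: str = "/dev/sda1",  # assumption: root device name of the AMI
) -> Tuple[str, Optional[str]]:
    """Launch one instance, wait for `running`, return (instance_id, public_ip)."""
    resp = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        KeyName=key_name,
        SubnetId=subnet_id,
        SecurityGroupIds=[sg_id],
        BlockDeviceMappings=[
            {"DeviceName": root_device, "Ebs": {"VolumeSize": volume_gb, "VolumeType": "gp3"}}
        ],
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Step 2: block until EC2 reports the instance as running.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Step 3: the public IP is only assigned once the instance is up.
    desc = ec2.describe_instances(InstanceIds=[instance_id])
    instance = desc["Reservations"][0]["Instances"][0]
    return instance_id, instance.get("PublicIpAddress")
```

Taking the client as a parameter keeps the function free of credentials handling and makes it easy to try against a stub.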
85+
**Security**: The workflow opens port 22 on the security group for the GitHub Actions runner IP, and revokes it after completion (always, even on failure).
86+
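The open-then-always-revoke behavior maps naturally onto a context manager. The sketch below assumes a boto3 EC2 client is passed in; `temporary_ssh_access` is a hypothetical name, and the workflow itself may implement this as separate steps rather than in Python.

```python
from contextlib import contextmanager
from typing import Any, Iterator, List


def _ssh_rule(cidr: str) -> List[dict]:
    """Ingress rule allowing TCP port 22 from a single CIDR."""
    return [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": cidr}]}]


@contextmanager
def temporary_ssh_access(ec2: Any, sg_id: str, runner_ip: str) -> Iterator[None]:
    """Open port 22 for one runner IP and revoke it on exit, even on failure."""
    rule = _ssh_rule(f"{runner_ip}/32")
    ec2.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=rule)
    try:
        yield
    finally:
        # Runs whether the body succeeded or raised.
        ec2.revoke_security_group_ingress(GroupId=sg_id, IpPermissions=rule)
```

Usage would be `with temporary_ssh_access(ec2, sg_id, runner_ip): ...`, with the SSH-dependent work inside the block.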
87+
### Step 3: Service Update (Ansible)
88+
89+
**What**: Ensures all services on the launched EC2 are running and configured correctly.
90+
91+
**Tool**: `iblai infra service-update --host <IP>` from [iblai-infra-cli](https://github.com/iblai/iblai-infra-cli)
92+
93+
**Ansible Playbook** (`service_update_playbook.yml`, 2 roles):
94+
95+
#### Role: ibl_cli_ops
96+
- Installs latest `iblai-prod-images` package from `iblai/iblai-prod-images@main`
97+
- This pins all container image versions and includes ibl-cli-ops
98+
99+
#### Role: ibl_service_update
1. **Restore postgres data dir ownership** to uid 999 (fixes chown from pre-tasks)
2. **ECR login**: authenticate Docker with AWS ECR (using the server's existing AWS creds)
3. **Save platform config**: `ibl config save` regenerates all compose files
4. **Save edX tutor config**: `ibl tutor config save`
5. **Ensure edX running**: `ibl edx start -d`
6. **Wait for LMS**: curl `localhost:8600/heartbeat` (40 retries × 15s = 10 min max)
7. **Ensure DM containers running**: `docker compose up -d` in background (avoids timeout on collectstatic)
8. **Wait for DM**: curl `localhost:8400` (60 retries × 15s = 15 min max for collectstatic)
9. **Run DM migrations**: `docker compose exec web ./manage.py migrate --noinput`
10. **Restart SPAs**: `docker compose down; docker compose up -d` for auth, mentor, skills (with auto-restart for Mentor empty reply)
11. **OAuth/OIDC integrations**: `ibl launch --ibl-oauth --ibl-oidc --ibl-edx-manager` + `ibl dm auth-setup`
12. **Sync edX users**: `ibl edx sync-with-manager --users`
13. **Sync SSO credentials**: reads `spa-sso` and `ibl_web` client IDs from LMS database, writes to config, restarts Auth SPA
14. **Reload proxy + restart nginx**
114+
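The two "wait" tasks follow a fixed retry budget (40 × 15s for the LMS, 60 × 15s for DM). The playbook does this with curl inside Ansible; the Python sketch below only illustrates the same polling pattern, with hypothetical names `http_ok` and `wait_until_ready` and `urllib` standing in for curl.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with a successful response within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (OSError, http.client.HTTPException):
        # Covers connection refused, HTTP errors, timeouts and empty replies.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    retries: int,
    delay_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` up to `retries` times, sleeping `delay_s` after each failure."""
    for _ in range(retries):
        if check():
            return True
        sleep(delay_s)
    return False
```

With these helpers the LMS wait would read `wait_until_ready(lambda: http_ok("http://localhost:8600/heartbeat"), retries=40, delay_s=15)`, which gives up after roughly 10 minutes.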
115+
### Step 4: Register in ALB Target Group
116+
117+
**What**: Deregisters any existing targets from the ALB target group, then registers the new EC2 instance.
118+
119+
**Why deregister first**: Prevents split-brain routing where the ALB sends some requests to an old instance with stale OAuth credentials.
120+
121+
**Health check**: ALB verifies the instance returns HTTP 200-399 on `/` before routing traffic.
122+
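A minimal sketch of the deregister-then-register sequence, assuming a boto3 `elbv2` client is passed in. The function name `swap_target` is hypothetical, and the real pipeline may use the AWS CLI instead.

```python
from typing import Any, List


def swap_target(elbv2: Any, tg_arn: str, new_instance_id: str) -> List[str]:
    """Remove every current target from the group, then register the new instance.

    Returns the IDs that were deregistered.
    """
    health = elbv2.describe_target_health(TargetGroupArn=tg_arn)
    # Keep Id (and Port, when reported) so targets with a port override match.
    old = [
        {key: desc["Target"][key] for key in ("Id", "Port") if key in desc["Target"]}
        for desc in health["TargetHealthDescriptions"]
    ]
    if old:  # deregister_targets rejects an empty target list
        elbv2.deregister_targets(TargetGroupArn=tg_arn, Targets=old)
    elbv2.register_targets(TargetGroupArn=tg_arn, Targets=[{"Id": new_instance_id}])
    return [target["Id"] for target in old]
```

Deregistering before registering means the group is briefly empty, which is acceptable here because the environment serves only the test run.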
123+
### Step 5: Run Playwright Tests (OCI)
124+
125+
**What**: Launches Docker containers on Oracle Cloud Infrastructure (OCI) Container Instances that run the Playwright test suite against the staging environment.
126+
127+
**Test target**: `mentorai.stgX.iblai.org` (via ALB → EC2)
128+
129+
**Configuration**:
130+
- Browsers: chrome, firefox, safari, edge (configurable; by default all 4 run in parallel)
131+
- Workers: 3 per browser
132+
- Max wait: 5400s (90 minutes)
133+
- Retries: 2 per test
134+
135+
**Test users**: Each browser has its own dedicated test user to avoid conflicts:
136+
- Chrome: `iblaiuserchromenew`
137+
- Firefox: `iblaiuserfirefoxnew`
138+
- Safari: `iblaiusersafarinew`
139+
- Edge: `iblaiuseredgenew`
140+
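The per-browser settings above can be captured in a small plan object. This is only an illustration of the one-user-per-browser rule; `BrowserRun` and `plan_runs` are hypothetical names and do not come from the test runner.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Dedicated test user per browser, as listed above.
TEST_USERS = {
    "chrome": "iblaiuserchromenew",
    "firefox": "iblaiuserfirefoxnew",
    "safari": "iblaiusersafarinew",
    "edge": "iblaiuseredgenew",
}


@dataclass(frozen=True)
class BrowserRun:
    browser: str
    user: str
    workers: int = 3   # workers per browser
    retries: int = 2   # retries per test


def plan_runs(browsers: Iterable[str] = ("chrome", "firefox", "safari", "edge")) -> List[BrowserRun]:
    """Build one run per requested browser, each with its own test user."""
    runs = []
    for browser in browsers:
        if browser not in TEST_USERS:
            raise ValueError(f"no dedicated test user for browser {browser!r}")
        runs.append(BrowserRun(browser=browser, user=TEST_USERS[browser]))
    return runs
```

Because each run carries a distinct user, parallel browsers never share session state on the staging platform.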
141+
**Results**: Uploaded to S3 for resumption on subsequent runs.
142+
143+
### Step 6: Terminate EC2
144+
145+
**What**: `aws ec2 terminate-instances --instance-ids <id>`
146+
147+
**When**: Always runs, even if tests fail. The `if: always()` condition ensures cleanup.
148+
149+
**What persists**: VPC, ALB, Route53, S3 buckets — all reused on next launch.
150+
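Putting the six steps together, the control flow is: build the image while the instance is launched and updated, run the tests once both are done, and terminate no matter what happened. The real pipeline expresses this as GitHub Actions jobs with `if: always()`; the Python sketch below is only an analogy, and every callable passed to the hypothetical `run_pipeline` is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_pipeline(
    build_image: Callable[[], str],          # Step 1, returns the image reference
    launch: Callable[[], str],               # Step 2, returns the instance ID
    service_update: Callable[[str], None],   # Step 3
    register: Callable[[str], None],         # Step 4
    run_tests: Callable[[str], bool],        # Step 5, receives the image reference
    terminate: Callable[[str], None],        # Step 6
) -> bool:
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The image build runs in the background while the instance comes up.
        build = pool.submit(build_image)
        instance_id = launch()
        try:
            service_update(instance_id)
            register(instance_id)
            image = build.result()  # wait for the parallel branch
            return run_tests(image)
        finally:
            # Equivalent of `if: always()`: clean up even if a step raised.
            terminate(instance_id)
```

Launching outside the `try` block is deliberate: if no instance was created, there is nothing to terminate.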
151+
## Timing
152+
| Step | Duration |
|------|----------|
| Build Playwright image | 2-5 min (cached: instant) |
| Launch EC2 | ~20s |
| SSH ready | ~45s |
| Service update (Ansible) | 20-40 min (DM collectstatic dominates) |
| ALB health check | ~30s |
| Playwright tests (4 browsers) | 15-90 min |
| Terminate | instant |
| **Total** | **40-90 min** |
163+
164+
## Repository Map
165+
| Repository | Role |
|------------|------|
| [iblai-infra-cli](https://github.com/iblai/iblai-infra-cli) | CLI tool with `service-update` command, Ansible playbooks, Terraform templates |
| [iblai-web-ops](https://github.com/iblai/iblai-web-ops) | Reusable GitHub Actions workflows (OCI test runner, Docker builds, domain locking) |
| [iblai-prod-images](https://github.com/iblai/iblai-prod-images) | Container image version pins (DM, edX, SPAs) |
| [mentorai](https://github.com/iblai/mentorai) | SPA source code, Playwright tests, PR validation workflows |
172+
173+
## Secrets & Variables
174+
175+
### Variables (on mentorai repo)
176+
| Variable | Example |
|----------|---------|
| `STG1_AMI_ID` | `ami-02dff3992891505ba` |
| `STG1_SUBNET_ID` | `subnet-022ff062fe90b23b1` |
| `STG1_SG_ID` | `sg-0d56a7433d4b2a364` |
| `STG1_TG_ARN` | `arn:aws:elasticloadbalancing:...` |
| `STG1_KEY_PAIR` | `stg1-staging-key` |
184+
185+
Repeat for STG2, STG3, STG4.
186+
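Since the five variables repeat with only the environment number changing, a script can resolve them by prefix. The sketch below assumes the workflow exposes the repository variables as environment variables of the same name; `StagingConfig` and `load_staging_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StagingConfig:
    ami_id: str
    subnet_id: str
    sg_id: str
    tg_arn: str
    key_pair: str


# Field name -> variable suffix, matching the table above.
_SUFFIXES = {
    "ami_id": "AMI_ID",
    "subnet_id": "SUBNET_ID",
    "sg_id": "SG_ID",
    "tg_arn": "TG_ARN",
    "key_pair": "KEY_PAIR",
}


def load_staging_config(n: int, env: Mapping[str, str] = os.environ) -> StagingConfig:
    """Read the STG<n>_* variables for staging environment n (1 to 4)."""
    if n not in (1, 2, 3, 4):
        raise ValueError(f"unknown staging environment: stg{n}")
    names = {field: f"STG{n}_{suffix}" for field, suffix in _SUFFIXES.items()}
    missing = [name for name in names.values() if name not in env]
    if missing:
        raise KeyError(f"missing variables: {', '.join(missing)}")
    return StagingConfig(**{field: env[name] for field, name in names.items()})
```

Failing fast on a missing variable avoids launching an instance with an incomplete network configuration.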
187+
### Secrets
188+
| Secret | Purpose |
|--------|---------|
| `SERVICE_UPDATE_ACCESS_KEY` | AWS IAM key for EC2 launch/terminate + SG rule management |
| `SERVICE_UPDATE_SECRET_KEY` | AWS IAM secret |
| `STG1_SSH_KEY` to `STG4_SSH_KEY` | SSH private keys for each stg environment |
| `GIT_TOKEN` | GitHub PAT for private repo access |
| `SSH_PRIVATE_DEPLOY_OPS` | SSH key for OCI/deployment operations |
| OCI secrets | Oracle Cloud credentials for container instances |
| S3 secrets | AWS credentials for test log storage |
198+
199+
### IAM Policy (SERVICE_UPDATE keys)
200+
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances", "ec2:DescribeInstances", "ec2:DescribeImages",
        "ec2:CreateTags", "ec2:TerminateInstances",
        "ec2:AuthorizeSecurityGroupIngress", "ec2:RevokeSecurityGroupIngress"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:RegisterTargets",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:DescribeTargetHealth"
      ],
      "Resource": "*"
    }
  ]
}
```
223+
224+
## Known Behaviors
225+
226+
### DM collectstatic (15-20 min cold boot)
227+
The DM container entrypoint runs `collectstatic --noinput` before starting gunicorn. This takes 15-20 minutes on a fresh AMI boot at 100% CPU. The service-update flow uses `docker compose up -d` (idempotent, no recreate) to avoid triggering collectstatic unnecessarily.
228+
229+
### Mentor SPA empty reply
230+
Mentor SPA occasionally returns empty HTTP replies for 60-90s after startup despite reporting "Ready". The service-update role detects this and auto-restarts the container, with `ignore_errors` so the pipeline continues.
231+
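The detect-and-restart behavior can be pictured as the sketch below. It is not the Ansible role: `ensure_responding`, `probe`, and `restart` are hypothetical, and the 90-second settle time is taken from the upper end of the 60-90s window described above.

```python
import time
from typing import Callable


def ensure_responding(
    probe: Callable[[], bool],     # True when the SPA returns a non-empty reply
    restart: Callable[[], None],   # restarts the Mentor SPA container
    settle_s: float = 90.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Restart the service once if the probe fails; never raise on restart errors."""
    if probe():
        return True
    try:
        restart()
    except Exception:
        # Mirrors `ignore_errors`: a failed restart must not stop the pipeline.
        pass
    sleep(settle_s)
    return probe()
```

The return value tells the caller whether the service recovered, so a later step can still decide to fail the run.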
232+
### ALB split-brain routing
233+
If old EC2 instances remain registered in the ALB target group, the ALB load-balances between old and new instances with different OAuth credentials — causing intermittent 409 auth errors. The pipeline deregisters all existing targets before registering the new instance.
234+
235+
### OAuth credential sync
236+
`ibl config save` regenerates `auth.yml` but doesn't preserve SSO credentials. The pipeline reads `spa-sso` and `ibl_web` client credentials directly from the LMS database and writes them to config before restarting the Auth SPA.
237+
238+
## Creating New AMIs
239+
240+
When the platform or test data changes, create new AMIs:
241+
242+
1. Launch a stg env from an existing AMI
243+
2. Make changes (add platforms, users, config)
244+
3. Verify all services healthy
245+
4. Create AMI from the EC2 instance
246+
5. Update `STGx_AMI_ID` variables on mentorai (and skillsai)
247+
248+
AMI requirements:
249+
- All containers must be in a startable state (they may not be running — the service-update handles startup)
250+
- S3 config must be baked in (`ENABLE_S3_BUCKET_STORAGE=True`, bucket names, region, credentials)
251+
- Test platforms and users must be pre-seeded
252+
- `iblai-cli-ops` virtualenv must exist with pyenv
