|
| 1 | +# AMI-Based Launch Pipeline |
| 2 | + |
| 3 | +Automated pipeline for launching isolated staging environments from pre-built AMIs, running E2E tests, and tearing down — all via GitHub Actions. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +``` |
| 8 | +┌─────────────────────────────────────────────────────────────────┐ |
| 9 | +│ GitHub Actions Workflow │ |
| 10 | +│ │ |
| 11 | +│ ┌──────────────┐ ┌──────────────┐ │ |
| 12 | +│ │ Build │ │ Launch EC2 │ │ |
| 13 | +│ │ Playwright │ │ from AMI │ (parallel) │ |
| 14 | +│ │ Image (OCIR) │ │ + Service │ │ |
| 15 | +│ └──────┬───────┘ │ Update │ │ |
| 16 | +│ │ └──────┬───────┘ │ |
| 17 | +│ │ │ │ |
| 18 | +│ └────────┬─────────┘ │ |
| 19 | +│ ▼ │ |
| 20 | +│ ┌──────────────┐ │ |
| 21 | +│ │ Run Playwright│ (OCI Container Instances │ |
| 22 | +│ │ Tests │ hit mentorai.stgX.iblai.org) │ |
| 23 | +│ └──────┬───────┘ │ |
| 24 | +│ ▼ │ |
| 25 | +│ ┌──────────────┐ │ |
| 26 | +│ │ Terminate │ │ |
| 27 | +│ │ EC2 Instance │ │ |
| 28 | +│ └──────────────┘ │ |
| 29 | +└─────────────────────────────────────────────────────────────────┘ |
| 30 | +``` |
| 31 | + |
| 32 | +## Architecture |
| 33 | + |
| 34 | +Each staging environment (stg1–stg4) has permanent AWS infrastructure: |
| 35 | + |
| 36 | +| Resource | Purpose | Persists between launches | |
| 37 | +|----------|---------|--------------------------| |
| 38 | +| VPC + Subnets | Networking | Yes | |
| 39 | +| ALB + Target Group | Load balancer with TLS termination | Yes | |
| 40 | +| ACM Certificates | SSL for `*.stgX.iblai.org` | Yes | |
| 41 | +| Route53 Records | DNS → ALB | Yes | |
| 42 | +| Security Groups | Firewall rules | Yes | |
| 43 | +| S3 Buckets | Media + static storage | Yes | |
| 44 | +| **EC2 Instance** | **Platform server** | **No — ephemeral** | |
| 45 | + |
| 46 | +The EC2 instance is the only component created and destroyed on each pipeline run. Everything else is pre-provisioned via Terraform and reused. |
| 47 | + |
| 48 | +## Pre-Built AMI Contents |
| 49 | + |
| 50 | +Each AMI is a snapshot of a fully configured staging environment: |
| 51 | + |
| 52 | +- **OS**: Ubuntu 22.04 with Docker, pyenv, Python 3.11.8, AWS CLI |
| 53 | +- **Platform CLI**: iblai-cli-ops installed via iblai-prod-images |
| 54 | +- **Services** (Docker containers): |
| 55 | + - iblai-dm-pro (Django, PostgreSQL, Redis, Celery, Langfuse, ClickHouse, MinIO) |
| 56 | + - iblai-edx-pro (LMS, CMS, MySQL, MongoDB, Redis, Elasticsearch, MFE) |
| 57 | + - Auth SPA, Mentor SPA, Skills SPA |
| 58 | + - Nginx reverse proxy |
| 59 | +- **Data**: Test platforms, users, RBAC, analytics views pre-seeded |
| 60 | +- **Config**: S3 buckets, AWS credentials, TimescaleDB enabled |
| 61 | + |
| 62 | +## Pipeline Steps — Detailed |
| 63 | + |
| 64 | +### Step 1: Build Playwright Image |
| 65 | + |
| 66 | +**What**: Builds a Docker image containing the Playwright test suite from the mentorai repo and pushes it to Oracle Cloud Container Registry (OCIR). |
| 67 | + |
| 68 | +**Where**: GitHub Actions runner (ubuntu-latest) → OCIR |
| 69 | + |
| 70 | +**Image**: `iad.ocir.io/idcwyla5j5cr/ibl-mentor-playwright:{tag}` |
| 71 | + |
| 72 | +**Contents**: Playwright browsers (Chromium, Firefox, WebKit), test specs from `e2e/journeys/`, page objects, test utilities, AWS CLI for S3 log upload. |
| 73 | + |
| 74 | +**Caching**: The job checks whether an image with the same tag already exists in OCIR and skips the build if it does. |
| 75 | + |
| 76 | +### Step 2: Launch EC2 from AMI |
| 77 | + |
| 78 | +**What**: Provisions a fresh EC2 instance from the pre-built AMI into the existing VPC/subnet/security group. |
| 79 | + |
| 80 | +**How** (via boto3 in the iblai-infra-cli tool): |
| 81 | +1. Call `ec2:RunInstances` with the AMI ID, instance type (t3.2xlarge), and a 200 GB gp3 volume |
| 82 | +2. Wait for the instance to enter the `running` state |
| 83 | +3. Read the instance's public IP address |
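
The three launch steps above can be sketched with boto3 as follows. This is a minimal illustration, not the actual iblai-infra-cli code: the function names, the `root_device` default of `/dev/sda1`, and the choice to pass the key pair and security group at launch are assumptions. The `ec2` argument is expected to be a `boto3.client("ec2")` object.

```python
from __future__ import annotations

from typing import Any


def build_run_instances_params(
    ami_id: str,
    subnet_id: str,
    security_group_id: str,
    key_name: str,
    instance_type: str = "t3.2xlarge",
    volume_gb: int = 200,
    root_device: str = "/dev/sda1",  # assumption: root device name of the AMI
) -> dict[str, Any]:
    """Build keyword arguments for the EC2 client's run_instances() call."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "SubnetId": subnet_id,
        "SecurityGroupIds": [security_group_id],
        "KeyName": key_name,
        "BlockDeviceMappings": [
            {
                "DeviceName": root_device,
                "Ebs": {"VolumeSize": volume_gb, "VolumeType": "gp3"},
            }
        ],
    }


def launch_instance(ec2: Any, params: dict[str, Any]) -> tuple[str, str | None]:
    """Launch one instance, wait for `running`, return (instance_id, public_ip)."""
    instance_id = ec2.run_instances(**params)["Instances"][0]["InstanceId"]
    # Built-in boto3 waiter that polls until the instance state is `running`.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    described = ec2.describe_instances(InstanceIds=[instance_id])
    instance = described["Reservations"][0]["Instances"][0]
    # The public IP is absent if the subnet does not assign one.
    return instance_id, instance.get("PublicIpAddress")
```

Keeping the parameter construction separate from the API call makes the launch settings easy to inspect or unit-test without AWS credentials.
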
| 84 | + |
| 85 | +**Security**: The workflow opens port 22 on the security group for the GitHub Actions runner IP, and revokes it after completion (always, even on failure). |
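
The open-then-always-revoke behavior described above maps naturally onto a `try`/`finally` block. The sketch below assumes a boto3 EC2 client and a runner IP that has already been discovered; the helper name `temporary_ssh_access` is hypothetical and the real workflow may implement this in shell steps instead.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def temporary_ssh_access(ec2: Any, group_id: str, runner_ip: str) -> Iterator[None]:
    """Open port 22 for a single IP and revoke the rule on exit, even on failure."""
    rule = {
        "GroupId": group_id,
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "CidrIp": f"{runner_ip}/32",  # /32 restricts the rule to the runner only
    }
    ec2.authorize_security_group_ingress(**rule)
    try:
        yield
    finally:
        # Runs whether the body succeeded or raised.
        ec2.revoke_security_group_ingress(**rule)
```

A caller would wrap the SSH-dependent work in `with temporary_ssh_access(ec2, sg_id, ip): ...`, so the rule never outlives the run.
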
| 86 | + |
| 87 | +### Step 3: Service Update (Ansible) |
| 88 | + |
| 89 | +**What**: Ensures all services on the launched EC2 are running and configured correctly. |
| 90 | + |
| 91 | +**Tool**: `iblai infra service-update --host <IP>` from [iblai-infra-cli](https://github.com/iblai/iblai-infra-cli) |
| 92 | + |
| 93 | +**Ansible Playbook** (`service_update_playbook.yml`, 2 roles): |
| 94 | + |
| 95 | +#### Role: ibl_cli_ops |
| 96 | +- Installs latest `iblai-prod-images` package from `iblai/iblai-prod-images@main` |
| 97 | +- This pins all container image versions and includes ibl-cli-ops |
| 98 | + |
| 99 | +#### Role: ibl_service_update |
| 100 | +1. **Restore postgres data dir ownership** to uid 999 (fixes chown from pre-tasks) |
| 101 | +2. **ECR login** — authenticate Docker with AWS ECR (using server's existing AWS creds) |
| 102 | +3. **Save platform config** — `ibl config save` regenerates all compose files |
| 103 | +4. **Save edX tutor config** — `ibl tutor config save` |
| 104 | +5. **Ensure edX running** — `ibl edx start -d` |
| 105 | +6. **Wait for LMS** — curl `localhost:8600/heartbeat` (40 retries × 15s) |
| 106 | +7. **Ensure DM containers running** — `docker compose up -d` in background (avoids timeout on collectstatic) |
| 107 | +8. **Wait for DM** — curl `localhost:8400` (60 retries × 15s = 15 min max for collectstatic) |
| 108 | +9. **Run DM migrations** — `docker compose exec web ./manage.py migrate --noinput` |
| 109 | +10. **Restart SPAs** — `docker compose down; docker compose up -d` for auth, mentor, skills (with auto-restart for Mentor empty reply) |
| 110 | +11. **OAuth/OIDC integrations** — `ibl launch --ibl-oauth --ibl-oidc --ibl-edx-manager` + `ibl dm auth-setup` |
| 111 | +12. **Sync edX users** — `ibl edx sync-with-manager --users` |
| 112 | +13. **Sync SSO credentials** — reads `spa-sso` and `ibl_web` client IDs from LMS database, writes to config, restarts Auth SPA |
| 113 | +14. **Reload proxy + restart nginx** |
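
The two wait steps in this role (6 and 8) follow the same retry pattern: 40 × 15 s gives the LMS up to 10 minutes, and 60 × 15 s gives DM up to 15 minutes. The real role does this with Ansible and curl; the Python sketch below only illustrates the loop. The function names are hypothetical, and treating any 2xx/3xx response as "ready" is an assumption.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False


def wait_until_ready(
    url: str,
    retries: int,
    delay_s: float,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe `url` up to `retries` times, sleeping `delay_s` between attempts."""
    for attempt in range(1, retries + 1):
        if probe(url):
            return True
        if attempt < retries:
            sleep(delay_s)
    return False
```

For example, `wait_until_ready("http://localhost:8600/heartbeat", retries=40, delay_s=15)` mirrors the LMS wait in step 6.
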
| 114 | + |
| 115 | +### Step 4: Register in ALB Target Group |
| 116 | + |
| 117 | +**What**: Deregisters any existing targets from the ALB target group, then registers the new EC2 instance. |
| 118 | + |
| 119 | +**Why deregister first**: Prevents split-brain routing where the ALB sends some requests to an old instance with stale OAuth credentials. |
| 120 | + |
| 121 | +**Health check**: ALB verifies the instance returns HTTP 200-399 on `/` before routing traffic. |
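
A deregister-then-register swap like the one described above could look like this with boto3's `elbv2` client. It is a sketch under assumptions: the function name `replace_targets` is hypothetical, and the pipeline may use the AWS CLI rather than Python for this step.

```python
from typing import Any


def replace_targets(elbv2: Any, target_group_arn: str, instance_id: str) -> list[str]:
    """Remove every other target from the group, then register `instance_id`.

    Returns the IDs of the targets that were deregistered.
    """
    health = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    # Reuse the Target dicts as returned so any port override is preserved.
    stale = [
        d["Target"]
        for d in health["TargetHealthDescriptions"]
        if d["Target"]["Id"] != instance_id
    ]
    if stale:
        elbv2.deregister_targets(TargetGroupArn=target_group_arn, Targets=stale)
    elbv2.register_targets(
        TargetGroupArn=target_group_arn, Targets=[{"Id": instance_id}]
    )
    return [t["Id"] for t in stale]
```

Deregistering before registering means there is never a moment when the ALB can route to both an old and a new instance.
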
| 122 | + |
| 123 | +### Step 5: Run Playwright Tests (OCI) |
| 124 | + |
| 125 | +**What**: Launches Docker containers on Oracle Cloud Infrastructure (OCI) Container Instances that run the Playwright test suite against the staging environment. |
| 126 | + |
| 127 | +**Test target**: `mentorai.stgX.iblai.org` (via ALB → EC2) |
| 128 | + |
| 129 | +**Configuration**: |
| 130 | +- Browsers: chrome, firefox, safari, edge (configurable, default: all 4 parallel) |
| 131 | +- Workers: 3 per browser |
| 132 | +- Max wait: 5400s (90 minutes) |
| 133 | +- Retries: 2 per test |
| 134 | + |
| 135 | +**Test users**: Each browser has its own dedicated test user to avoid conflicts: |
| 136 | +- Chrome: `iblaiuserchromenew` |
| 137 | +- Firefox: `iblaiuserfirefoxnew` |
| 138 | +- Safari: `iblaiusersafarinew` |
| 139 | +- Edge: `iblaiuseredgenew` |
| 140 | + |
| 141 | +**Results**: Uploaded to S3 for resumption on subsequent runs. |
| 142 | + |
| 143 | +### Step 6: Terminate EC2 |
| 144 | + |
| 145 | +**What**: `aws ec2 terminate-instances --instance-ids <id>` |
| 146 | + |
| 147 | +**When**: Always runs, even if tests fail. The `if: always()` condition ensures cleanup. |
| 148 | + |
| 149 | +**What persists**: VPC, ALB, Route53, S3 buckets — all reused on next launch. |
| 150 | + |
| 151 | +## Timing |
| 152 | + |
| 153 | +| Step | Duration | |
| 154 | +|------|----------| |
| 155 | +| Build Playwright image | 2-5 min (cached: instant) | |
| 156 | +| Launch EC2 | ~20s | |
| 157 | +| SSH ready | ~45s | |
| 158 | +| Service update (Ansible) | 20-40 min (DM collectstatic dominates) | |
| 159 | +| ALB health check | ~30s | |
| 160 | +| Playwright tests (4 browsers) | 15-90 min | |
| 161 | +| Terminate | instant | |
| 162 | +| **Total** | **~37-132 min** (sum of the rows from launch through tests; the image build runs in parallel) | |
| 163 | + |
| 164 | +## Repository Map |
| 165 | + |
| 166 | +| Repository | Role | |
| 167 | +|------------|------| |
| 168 | +| [iblai-infra-cli](https://github.com/iblai/iblai-infra-cli) | CLI tool with `service-update` command, Ansible playbooks, Terraform templates | |
| 169 | +| [iblai-web-ops](https://github.com/iblai/iblai-web-ops) | Reusable GitHub Actions workflows (OCI test runner, Docker builds, domain locking) | |
| 170 | +| [iblai-prod-images](https://github.com/iblai/iblai-prod-images) | Container image version pins (DM, edX, SPAs) | |
| 171 | +| [mentorai](https://github.com/iblai/mentorai) | SPA source code, Playwright tests, PR validation workflows | |
| 172 | + |
| 173 | +## Secrets & Variables |
| 174 | + |
| 175 | +### Variables (on mentorai repo) |
| 176 | + |
| 177 | +| Variable | Example | |
| 178 | +|----------|---------| |
| 179 | +| `STG1_AMI_ID` | `ami-02dff3992891505ba` | |
| 180 | +| `STG1_SUBNET_ID` | `subnet-022ff062fe90b23b1` | |
| 181 | +| `STG1_SG_ID` | `sg-0d56a7433d4b2a364` | |
| 182 | +| `STG1_TG_ARN` | `arn:aws:elasticloadbalancing:...` | |
| 183 | +| `STG1_KEY_PAIR` | `stg1-staging-key` | |
| 184 | + |
| 185 | +Define the same five variables for STG2, STG3, and STG4. |
| 186 | + |
| 187 | +### Secrets |
| 188 | + |
| 189 | +| Secret | Purpose | |
| 190 | +|--------|---------| |
| 191 | +| `SERVICE_UPDATE_ACCESS_KEY` | AWS IAM key for EC2 launch/terminate + SG rule management | |
| 192 | +| `SERVICE_UPDATE_SECRET_KEY` | AWS IAM secret | |
| 193 | +| `STG1_SSH_KEY` – `STG4_SSH_KEY` | SSH private keys for each stg environment | |
| 194 | +| `GIT_TOKEN` | GitHub PAT for private repo access | |
| 195 | +| `SSH_PRIVATE_DEPLOY_OPS` | SSH key for OCI/deployment operations | |
| 196 | +| OCI secrets | Oracle Cloud credentials for container instances | |
| 197 | +| S3 secrets | AWS credentials for test log storage | |
| 198 | + |
| 199 | +### IAM Policy (SERVICE_UPDATE keys) |
| 200 | + |
| 201 | +```json |
| 202 | +{ |
| 203 | + "Version": "2012-10-17", "Statement": [ |
| 204 | + { |
| 205 | + "Effect": "Allow", "Action": [ |
| 206 | + "ec2:RunInstances", "ec2:DescribeInstances", "ec2:DescribeImages", |
| 207 | + "ec2:CreateTags", "ec2:TerminateInstances", |
| 208 | + "ec2:AuthorizeSecurityGroupIngress", "ec2:RevokeSecurityGroupIngress" |
| 209 | + ], |
| 210 | + "Resource": "*" |
| 211 | + }, |
| 212 | + { |
| 213 | + "Effect": "Allow", "Action": [ |
| 214 | + "elasticloadbalancing:RegisterTargets", |
| 215 | + "elasticloadbalancing:DeregisterTargets", |
| 216 | + "elasticloadbalancing:DescribeTargetHealth" |
| 217 | + ], |
| 218 | + "Resource": "*" |
| 219 | + } |
| 220 | + ] |
| 221 | +} |
| 222 | +``` |
| 223 | + |
| 224 | +## Known Behaviors |
| 225 | + |
| 226 | +### DM collectstatic (15-20 min cold boot) |
| 227 | +The DM container entrypoint runs `collectstatic --noinput` before starting gunicorn. This takes 15-20 minutes on a fresh AMI boot at 100% CPU. The service-update flow uses `docker compose up -d` (idempotent, no recreate) to avoid triggering collectstatic unnecessarily. |
| 228 | + |
| 229 | +### Mentor SPA empty reply |
| 230 | +Mentor SPA occasionally returns empty HTTP replies for 60-90s after startup despite reporting "Ready". The service-update role detects this and auto-restarts the container, with `ignore_errors` so the pipeline continues. |
| 231 | + |
| 232 | +### ALB split-brain routing |
| 233 | +If old EC2 instances remain registered in the ALB target group, the ALB load-balances between old and new instances with different OAuth credentials — causing intermittent 409 auth errors. The pipeline deregisters all existing targets before registering the new instance. |
| 234 | + |
| 235 | +### OAuth credential sync |
| 236 | +`ibl config save` regenerates `auth.yml` but doesn't preserve SSO credentials. The pipeline reads `spa-sso` and `ibl_web` client credentials directly from the LMS database and writes them to config before restarting the Auth SPA. |
| 237 | + |
| 238 | +## Creating New AMIs |
| 239 | + |
| 240 | +When the platform or test data changes, create new AMIs: |
| 241 | + |
| 242 | +1. Launch a staging environment from an existing AMI |
| 243 | +2. Make the changes (add platforms, users, config) |
| 244 | +3. Verify that all services are healthy |
| 245 | +4. Create a new AMI from the EC2 instance |
| 246 | +5. Update the `STGx_AMI_ID` variables on mentorai (and skillsai) |
| 247 | + |
| 248 | +AMI requirements: |
| 249 | +- All containers must be in a startable state (they may not be running — the service-update handles startup) |
| 250 | +- S3 config must be baked in (`ENABLE_S3_BUCKET_STORAGE=True`, bucket names, region, credentials) |
| 251 | +- Test platforms and users must be pre-seeded |
| 252 | +- `iblai-cli-ops` virtualenv must exist with pyenv |