Deploy Rancher High Availability (HA) clusters on AWS using RKE2 with automated setup and secure configuration.
For the planned scheduled GitHub Actions alpha/webhook sign-off automation, see `docs/README.md`.
- No Cert Manager required — SSL is handled via AWS ACM
- Secure by default — HTTPS enabled from deployment
- Fully automated — Rancher installation happens automatically
- Simple workflow:
  - Configure your Helm commands in `tool-config.yml`
  - Run the test command
Rancher is installed with `--set tls=external` since ACM certificates handle TLS termination.
This repository provides:
- Deploy 3-node RKE2 HA clusters with Terraform
- Auto-configure each node with secure ALB integration
- Use AWS ACM for certificates (no cert-manager required)
- Generate and execute custom installation scripts
- Automatically inject correct URLs into Helm commands
- Single test command deployment
Place `tool-config.yml` at the project root:
```
.
├── README.md
├── tool-config.yml
├── go.mod
├── terratest/
│   └── test.go
├── modules/
│   └── aws/
```
Run the following command to deploy the infrastructure:
```shell
go test -v -run '^TestHaSetup$' -timeout 60m ./terratest
```

This command will:
- Launch EC2 instances, ALBs, and Route53 DNS records
- Configure TLS with AWS ACM certificates
- Bootstrap and join all 3 nodes into RKE2 cluster
- Generate and execute Rancher installation scripts
- Automatically inject correct URLs into Helm commands
Rancher is installed automatically during the setup process:
- Correct URLs are injected into each Helm command
- Install scripts are generated for each HA instance
- Scripts are executed to install Rancher
Installation uses ALB with ACM certificates for secure HTTPS access without requiring cert-manager.
Note: Install scripts remain available in each `high-availability-X/` directory for manual re-execution if needed.
To destroy all resources:
```shell
go test -v -run '^TestHACleanup$' -timeout 30m ./terratest
```

This will:
- Destroy all infrastructure via Terraform
- Clean up generated files and folders
- Remove all AWS resources
To open the optional local-only Rancher control panel:
```shell
go test -v -run '^TestHAControlPanel$' -timeout 0 -count=1 ./terratest
```

This starts a browser-based control panel bound to 127.0.0.1 only. It is separate from setup and cleanup, so you can open it any time after provisioning, close it when you're done, and start it again later to re-check cluster health.

`-count=1` is recommended here so `go test` does not reuse a cached prior success and immediately exit instead of starting a fresh panel.

If you prefer using the IDE run button, `TestHAControlPanel` is also available alongside `TestHaSetup` and `TestHACleanup` in `terratest/ha_test.go`.
Live infrastructure tests are guarded on purpose. They only run when the `-run` pattern is exactly the test name, or an anchored regex for only that test. This prevents a broad package run such as `go test ./terratest` or a generic IDE play button from accidentally creating or destroying cloud resources.
Use these commands for the normal local lifecycle:
```shell
# Create Rancher HA infrastructure
go test -v -run '^TestHaSetup$' -timeout 60m ./terratest

# Wait until Rancher and rancher-webhook are healthy
go test -v -run '^TestHAWaitReady$' -timeout 35m ./terratest

# Open the local control panel
go test -v -run '^TestHAControlPanel$' -timeout 0 -count=1 ./terratest

# Destroy AWS infrastructure
go test -v -run '^TestHACleanup$' -timeout 30m ./terratest
```

For GoLand, create or edit a Go Test run configuration and set:
- Test kind / Run kind: `Package`, `Directory`, or `Pattern` is fine if the Pattern is exact.
- Package path: `github.com/brudnak/ha-rancher-rke2/terratest`
- Pattern: one exact pattern, for example `^TestHACleanup$` or `^TestHAControlPanel$`
- Go tool arguments / Additional go test arguments: add `-timeout 30m` for cleanup, or `-timeout 0 -count=1` for the control panel
If GoLand shows "Test ignored" with a message like `uses live infrastructure; run it explicitly`, the run configuration is using a broader pattern. Change the pattern to the anchored form shown above.
The control panel currently provides:
- Per-HA Rancher cards with URL, kubeconfig path, and reachability
- `cattle-system` visibility focused on Rancher and rancher-webhook pods
- Recent pod logs and live log streaming
- Active Rancher leader detection with a badge and change highlighting
- A guarded cleanup button that requires typing `cleanup`
The cleanup button calls the existing canonical cleanup flow (`TestHACleanup`) rather than introducing a separate destroy path.
Use one of these checked-in examples as your starting point:
Then copy the one you want to `tool-config.yml` and adjust the non-secret values.
These four secrets are now read from environment variables only:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `DOCKERHUB_USERNAME`
- `DOCKERHUB_PASSWORD`

The cleanest setup on your machine is to put them in `~/.zprofile`:

```shell
export AWS_ACCESS_KEY_ID="your-aws-access-key"
export AWS_SECRET_ACCESS_KEY="your-aws-secret-key"
export DOCKERHUB_USERNAME="your-dockerhub-username"
export DOCKERHUB_PASSWORD="your-dockerhub-password"
```

Then reload your shell:

```shell
source ~/.zprofile
```

If you do not want Docker Hub authentication, leave both Docker Hub environment variables unset.
For available RKE2 Kubernetes versions, refer to: RKE2 v1.32.X Release Notes
`rancher.mode` supports:

- `manual` to provide full Helm commands yourself
- `auto` to provide one or more Rancher versions and let the tool resolve chart source, image source, RKE2 version, and installer checksum for you
- In `manual` mode, the number of Helm commands under `rancher.helm_commands` must match `total_has`
- In `auto` mode:
  - use `rancher.version` for a single HA
  - use `rancher.versions` for multiple HAs, with exactly one version per HA
- Each Helm command will be used for a specific HA instance (first command for first instance, etc.)
- You can customize each Helm command with different parameters (bootstrap password, version, etc.)
- The `hostname` parameter in each Helm command will be automatically replaced with the correct URL
  - You can leave it blank, use a placeholder, or include your own value (it will be overridden)
- The tool validates your config shape and fails early if the number of versions or Helm commands does not match `total_has`
- The install script is automatically executed for each HA instance during setup
- In `manual` mode:
  - use `k8s.version` for a single HA
  - use `k8s.versions` for multiple HAs, with exactly one RKE2 version per HA
  - use `rke2.install_script_sha256` for a single HA
  - use `rke2.install_script_sha256s` for multiple HAs, keyed by exact RKE2 version
- `rke2.preload_images: true` downloads the RKE2 image bundle before install to help avoid Docker Hub rate limits
- `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` must be set in your shell environment
- `DOCKERHUB_USERNAME` and `DOCKERHUB_PASSWORD` are optional environment variables
  - If you set them, the tool creates `/etc/rancher/rke2/registries.yaml` so RKE2 can authenticate to Docker Hub
  - If you leave them unset, the tool skips Docker Hub authentication
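For reference, a `registries.yaml` that authenticates RKE2 to Docker Hub generally looks like the sketch below (this follows RKE2's documented private-registry config shape; the exact file the tool writes may differ, and the credential values shown are placeholders filled from the two environment variables):

```yaml
# /etc/rancher/rke2/registries.yaml (illustrative shape)
configs:
  docker.io:
    auth:
      username: your-dockerhub-username
      password: your-dockerhub-password
```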
- In `auto` mode, the tool prints a resolved plan for each HA and asks you to continue before provisioning starts
- All `curl` downloads are checksum-validated before use — no `curl | bash` patterns exist in this project
Every file downloaded over the network is verified against a known-good SHA256 before it is used or executed. This protects against compromised upstream releases (e.g. a tampered GitHub release asset) reaching your nodes.
| Pattern | Location | Status |
|---|---|---|
| `curl \| bash` | — | Eliminated — no instances exist |
| RKE2 installer `curl` | `preflight.go:302` | Hardened — SHA256 validated twice (Go + bash) |
| RKE2 images `curl` | `preflight.go:274`, `cluster_setup.go:210`, `cluster_setup.go:437` | Hardened — SHA256 validated via official release checksum file |
RKE2 installer script (`preflight.go`, `cluster_setup.go`)

The installer is validated twice — once in Go before provisioning starts, and once in bash on the remote node:
- Before provisioning, the Go preflight downloads `install.sh` for the pinned RKE2 version and computes its SHA256 using `crypto/sha256`. If the hash does not match the expected value, provisioning is blocked entirely.
- On each node, the install command downloads the script to a temp file, runs `sha256sum -c` against the same hash, and refuses to execute if validation fails — with a clear `SECURITY ERROR` message in stderr.
Where the expected hash comes from depends on mode:
- `manual` mode — you supply `rke2.install_script_sha256` (single HA) or `rke2.install_script_sha256s` (multiple HAs) explicitly in your config. The tool checks your pinned value before any node is touched.
- `auto` mode — the tool fetches the versioned `install.sh` at plan time, computes its SHA256, and stores it in the resolved plan. That computed hash is then used for both the Go preflight check and the per-node bash validation, so the same two-step process applies in both modes.
RKE2 images tarball (`preflight.go`, `cluster_setup.go`)

When `rke2.preload_images: true` is set, the image bundle is also validated before it is moved into place. This applies equally in manual and auto modes — in both cases the RKE2 version is known before the download starts (from your config in manual mode, from the resolved plan in auto mode), so the same validation runs regardless:
- The tarball (`rke2-images.linux-amd64.tar.zst`) is downloaded to `/tmp`.
- The official `sha256sum-amd64.txt` for that exact RKE2 release is downloaded from the same GitHub release page.
- `sha256sum -c` is run against the matching entry in that checksum file.
- If validation fails, the tarball and checksum file are deleted and the script exits with a `SECURITY ERROR` — the corrupted file never reaches `/var/lib/rancher/rke2/agent/images/`.
Use `auto` mode when you want to provide a Rancher version and let the tool resolve the rest.
```yaml
rancher:
  mode: auto
  versions:
    - "2.13-head"
    - "2.14.0"
  distro: auto
  bootstrap_password: "your-password"
  auto_approve: false
rke2:
  preload_images: true
total_has: 2 # Number of HA clusters to create (must match number of rancher.versions in auto mode)
tf_vars:
  aws_region: "us-east-2"
  aws_prefix: "xyz" # your initials, keep it short!
  aws_vpc: ""
  aws_subnet_a: ""
  aws_subnet_b: ""
  aws_subnet_c: ""
  aws_ami: ""
  aws_subnet_id: ""
  aws_security_group_id: ""
  aws_pem_key_name: ""
  aws_route53_fqdn: ""
```

In auto mode, the tool will:
- Resolve the Rancher chart repo and chart version for each HA version you requested
- Resolve the Rancher image settings for each HA
- Look up a supported RKE2 minor from the Rancher support matrix
- Pick the latest patch release in that RKE2 line
- Resolve the installer SHA256 for that exact RKE2 version
- Generate one Helm command per HA and inject the correct URL later during setup
- Print the generated plan(s)
- Ask you to continue or cancel before provisioning
For a single HA, you can use this shorter config:
```yaml
rancher:
  mode: auto
  version: "2.13-head"
  distro: auto
  bootstrap_password: "your-password"
  auto_approve: false
total_has: 1
```

If you do not want Docker Hub authentication, leave both `DOCKERHUB_USERNAME` and `DOCKERHUB_PASSWORD` unset in your shell.
Use `manual` mode when you want full control over the Helm commands.
```yaml
rancher:
  mode: manual
  helm_commands:
    - |
      helm install rancher rancher-latest/rancher \
        --namespace cattle-system \
        --set hostname=placeholder \
        --set bootstrapPassword=your-password \
        --set tls=external \
        --set global.cattle.psp.enabled=false \
        --set rancherImageTag=v2.14.0 \
        --version 2.14.0 \
        --set agentTLSMode=system-store
    - |
      helm install rancher rancher-latest/rancher \
        --namespace cattle-system \
        --set hostname=placeholder \
        --set bootstrapPassword=your-password \
        --set tls=external \
        --set global.cattle.psp.enabled=false \
        --set rancherImageTag=v2.14.0 \
        --version 2.14.0 \
        --set agentTLSMode=system-store
total_has: 2
k8s:
  versions:
    - "v1.33.7+rke2r1"
    - "v1.34.6+rke2r1"
rke2:
  install_script_sha256s:
    v1.33.7+rke2r1: "bfbd978d603b7070f5748c934326db509bf1470c97d3f61a3aaa6e2eed6bd054"
    v1.34.6+rke2r1: "2d24db2184dd6b1a5e281fa45cc9a8234c889394721746f89b5fe953fdaaf40a"
  preload_images: true
```

For a single manual HA, the older shorter form still works:
```yaml
k8s:
  version: "v1.33.7+rke2r1"
rke2:
  install_script_sha256: "bfbd978d603b7070f5748c934326db509bf1470c97d3f61a3aaa6e2eed6bd054"
```

You only need to update the checksum values manually when you use manual mode and change the matching RKE2 version.
- Pick the RKE2 version you want.
- Download that exact installer script.
- Compute its SHA256.
- Paste the hash into `tool-config.yml`.
Run:
```shell
export RKE2_VERSION="v1.33.7+rke2r1"
curl -fsSL "https://raw.githubusercontent.com/rancher/rke2/${RKE2_VERSION}/install.sh" -o /tmp/rke2-install.sh
shasum -a 256 /tmp/rke2-install.sh
```

You will get output like:

```
bfbd978d603b7070f5748c934326db509bf1470c97d3f61a3aaa6e2eed6bd054  /tmp/rke2-install.sh
```
Copy only the hash on the left and put it into `tool-config.yml`:
```yaml
k8s:
  version: "v1.33.7+rke2r1"
rke2:
  install_script_sha256: "bfbd978d603b7070f5748c934326db509bf1470c97d3f61a3aaa6e2eed6bd054"
  preload_images: true
```

If the downloaded installer does not match the pinned hash, the setup stops immediately and refuses to run it.
`TestHACleanup` now prints a best-effort AWS cost estimate after destroy for:
- EC2 runtime
- EBS root volumes
This is only an estimate, not an AWS bill.
The estimate uses:
- live AWS pricing data for EC2 and EBS unit prices
- actual EC2 instance launch times from AWS to estimate runtime
- actual attached root EBS volumes from AWS to estimate storage cost
It does not include everything AWS might charge for, such as:
- ALB usage
- Route53 charges
- data transfer
- request-driven costs
So the number is meant to be helpful and roughly right for the main infrastructure cost drivers, not a final billing total.
Each HA setup creates a folder like:
```
high-availability-1/
├── install.sh        # Rancher installation script
└── kube_config.yaml  # RKE2 kubeconfig
```
Pull requests and questions are welcome.
Built with Go, Terraform, and Rancher.