# DAGs for user-level DR replication and failover in Airflow 3 using Starship
```mermaid
%%{init: {"architecture": {"randomize": true}}}%%
architecture-beta
    group primary(database)[Primary cluster]
    service active(server)[Active deployment] in primary
    group secondary(database)[DR cluster]
    service dr(server)[DR deployment] in secondary
    service standby(server)[Standby deployment] in secondary
    active:T <-- L:dr
    dr:R --> T:standby
```
- Primary: The cluster that hosts active deployments under normal circumstances.
- DR: The disaster recovery cluster, which hosts standby deployments and is expected to remain available if the primary cluster suffers a disaster.
- Active: The deployment that runs user workloads under normal circumstances.
- Standby: The deployment that does not run user workloads under normal circumstances and is available to take over upon a failover in case of disaster. It is recommended to run this deployment on a different cluster than the active deployment.
- DR: The deployment responsible for performing replication and failover between the active and standby deployments. It is recommended to run this deployment on a different cluster than the active deployment.
The active and standby deployments for which we perform replication and failover must fulfill the following requirements:
- Airflow 3.0 - 3.1 installed
- Astronomer Starship 2.8+ installed
> [!IMPORTANT]
> The code intended to run on the DR deployment is expected to run on Airflow 3.1.
1. **Install Starship on active/standby deployments**

   Install Astronomer Starship on the active and standby deployments by adding the following to `requirements.txt`:

   ```
   astronomer-starship~=2.8
   ```

2. **Deploy DR code to DR deployment**

   Deploy the following files from this repo as is to the DR deployment:

   Any configuration of the DR Dags is done via environment variables; it should not be necessary to make any changes to those files.

3. **Configure DR deployment**

   1. Set the environment variable `DR_API_KEY` to an Astro API key with owner permissions for the corresponding active/standby deployments.
   2. Set the environment variable `DR_DEPLOYMENTS` to a JSON mapping from active deployment ID to standby deployment ID: `{ "<active-deployment-id>": "<standby-deployment-id>" }`. Multiple active-standby pairs are possible.
   3. Set the environment variable `DR_ORGANIZATION_ID` to the organization ID of the Astro org in which the DR Dags operate.
   4. If you want to perform a regularly scheduled DR replication, set the environment variable `DR_SCHEDULE` to a schedule for the `dr_replication` Dag. Example values: `@daily` or `0 3 * * *`.
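As an illustration of the expected `DR_DEPLOYMENTS` format, the mapping can be parsed with a few lines of Python. The deployment IDs below are made-up placeholders, not real Astro IDs:

```python
import json
import os

# Hypothetical example value; real Astro deployment IDs look different.
os.environ["DR_DEPLOYMENTS"] = '{"active-deployment-id": "standby-deployment-id"}'

# The DR Dags read the mapping as active -> standby deployment ID.
deployments = json.loads(os.environ["DR_DEPLOYMENTS"])
for active_id, standby_id in deployments.items():
    print(f"replicate {active_id} -> {standby_id}")
```

Because the value is plain JSON, any number of active-standby pairs can be listed in the same object.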
The following environment variables can be used to configure the DR Dags.
| Name | Description | Default |
|---|---|---|
| `DR_API_KEY` | An API token with owner permissions for the active and standby deployments. | |
| `DR_DEPLOYMENTS` | A JSON mapping of deployment IDs from active to standby. | |
| `DR_ORGANIZATION_ID` | The ID of the Astronomer organization containing the deployments. | |
| `DR_SCHEDULE` | Cron schedule for the DR replication Dag. | None |
| `DR_WAKE_WAIT_PERIOD` | Time in seconds to wait after triggering a wake-up before checking deployment status. | 60 |
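To make the defaults in the table concrete, the snippet below sketches how these variables might be read, with `DR_SCHEDULE` falling back to `None` and `DR_WAKE_WAIT_PERIOD` to 60. The `dr_config` helper is illustrative, not the repo's actual code:

```python
import os

def dr_config(env=None):
    """Read the documented DR settings, applying the defaults from the table."""
    env = os.environ if env is None else env
    return {
        "api_key": env.get("DR_API_KEY"),
        "deployments": env.get("DR_DEPLOYMENTS"),
        "organization_id": env.get("DR_ORGANIZATION_ID"),
        "schedule": env.get("DR_SCHEDULE"),  # None means no scheduled replication
        "wake_wait_period": int(env.get("DR_WAKE_WAIT_PERIOD", "60")),
    }
```

For example, `dr_config({})["wake_wait_period"]` yields the default of 60 seconds when the variable is unset.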
At a high level, the DR replication performs the following steps:
- Wake up active and standby deployments that are currently hibernated.
- Disable scheduling on the standby deployments.
- Verify the Starship and Airflow versions on the active and standby deployments.
- Replicate Dag state and history from active to standby with Starship:
- Dags paused/unpaused
- Dagruns
- Task instances
- Task instance history
- Variables
- Pools
- Connections
- Hibernate active and standby deployments which were previously hibernated.
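The steps above can be sketched as a linear sequence. The class and function names below are illustrative stubs invented for this sketch, not the actual task names in the repo; only the replicated entity types come from the list above:

```python
from dataclasses import dataclass, field

@dataclass
class Deployment:
    """Minimal stand-in for an Astro deployment; for illustration only."""
    name: str
    hibernated: bool = False
    log: list = field(default_factory=list)

    def wake_up(self):
        self.log.append("wake")

    def hibernate(self):
        self.log.append("hibernate")

    def disable_scheduling(self):
        self.log.append("disable_scheduling")

def verify_versions(active, standby):
    # Placeholder: would check Starship 2.8+ and Airflow 3.0-3.1 on both sides.
    pass

def copy_with_starship(active, standby, entity):
    # Placeholder for the Starship-backed copy of one entity type.
    standby.log.append(f"copy:{entity}")

def replicate(active, standby):
    """Illustrative outline of one DR replication run."""
    woken = [d for d in (active, standby) if d.hibernated]
    for d in woken:
        d.wake_up()                  # then wait DR_WAKE_WAIT_PERIOD seconds
    standby.disable_scheduling()     # scheduler.use_job_schedule = False
    verify_versions(active, standby)
    for entity in ("dags", "dag_runs", "task_instances",
                   "task_instance_history", "variables", "pools", "connections"):
        copy_with_starship(active, standby, entity)
    for d in woken:
        d.hibernate()                # restore prior hibernation state
```

Note that only deployments that were hibernated before the run are re-hibernated afterwards, matching the first and last steps above.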
The replication and failover mechanisms take advantage of the Airflow setting `scheduler.use_job_schedule` in the tasks `disable_scheduling_standby` and `enable_scheduling_active`. If scheduling is disabled, the scheduler will not trigger any Dags with a schedule.
> [!CAUTION]
> This setting only disables the scheduling of Dags. Externally triggered Dags and Dags already in state `running` will still run on the standby deployment.
This setting is configured as an environment variable at the deployment level so that it remains in effect even when the deployment is hibernated or its Airflow API is unavailable.
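Airflow maps configuration options to environment variables using the pattern `AIRFLOW__{SECTION}__{KEY}`, so the deployment-level variable for this setting can be derived as follows. The helper function is a sketch for illustration; the DR code may set the variable via the Astro API rather than computing the name this way:

```python
def airflow_config_env_var(section: str, key: str) -> str:
    """Build the environment variable name Airflow uses for a config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"

# Disabling scheduling on the standby deployment corresponds to setting
# AIRFLOW__SCHEDULER__USE_JOB_SCHEDULE=False at the deployment level.
name = airflow_config_env_var("scheduler", "use_job_schedule")
```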
> [!NOTE]
> When scheduling is disabled for a deployment, a warning is shown in the Astro UI, which can be ignored in this case.
**Is deployment hibernation required for DR replication and failover?**

No. Although the ability to hibernate deployments can help save costs, it is not required for performing DR replication or failover.
**What happens to Dags that are in state running during replication?**

The replication generally copies all Dag run and task states as is. Thus, if some Dags are in state `running`, those Dags will also be in state `running` on the standby deployment after the replication. It is therefore generally recommended to perform the replication during quiet hours, when no tasks are running.
**Are incremental replications supported?**

No. Due to operational considerations, each replication always performs a full copy of the entire task history.
**Which entities are not replicated?**

The DR replication relies on the capabilities provided by Astronomer Starship, and therefore replication of the following entities is currently not supported:

- XComs
- Task logs
- Deployment environment variables
- Variables and connections defined outside the Airflow metadata DB (e.g. the Astro environment manager or secrets backends)

Moreover, replication of deployment configuration in Astro is not supported; this should be handled by the user as part of the setup.
**Is this a replacement for Astro's official disaster recovery?**

No. The DR implementation in this repo should only be viewed as a stop-gap solution for customers who currently don't have access to Astro's official disaster recovery implementation. The infrastructure-level implementation provided by Astro is superior in scope and reliability to the user-level implementation provided in this repo.
