The Tuva Project is a dbt package for transforming raw healthcare data into analytics-ready data. The package includes:
- Input Layer
- Claims Preprocessing
- Core Data Model
- Data Marts
- Terminology and Value Sets
The docs source for the getting-started runbook lives in docs/docs/getting-started.md.
Recommended local setup:
- Python 3.10 or later
- DuckDB
- `dbt-core` and `dbt-duckdb`
Use integration_tests as your development project. Configure a DuckDB connection in profiles.yml, then run from the repo root:
```shell
./scripts/dbt-local deps
./scripts/dbt-local build --full-refresh
```

`dbt seed` and `dbt build` load synthetic data into `raw_data` from versioned S3 artifacts. `dbt run` assumes those relations already exist, so on a fresh database you should run `seed` or `build` first.
Once the synthetic data is loaded, iterate with:
```shell
./scripts/dbt-local run
```

If you are using coding agents in this repo, the local workflow guidance lives in AGENTS.md.
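As a minimal sketch, the DuckDB connection for local development might look like this in `profiles.yml`. The profile name and database file path are illustrative, and must match the `profile:` configured in the `integration_tests` project:

```yaml
# Hypothetical profiles.yml entry for local development with dbt-duckdb.
# The top-level key must match the profile name in dbt_project.yml.
integration_tests:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: tuva.duckdb   # local DuckDB database file (illustrative path)
      threads: 4
```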
Set Tuva vars under the vars: key in your dbt_project.yml. Use dbt selectors to run individual marts; the vars below control broad data domains, shared runtime behavior, and the synthetic bootstrap flow used by integration_tests.
| Variable | Root Default | `integration_tests` Default | Description |
|---|---|---|---|
| `claims_enabled` | `false` | `true` | Enable claims-based models. |
| `clinical_enabled` | `false` | `true` | Enable clinical-based models. |
| `provider_attribution_enabled` | `false` | `true` | Enable provider attribution models. Claims input must also be enabled. |
| `semantic_layer_enabled` | `false` | `true` | Enable semantic-layer models. Claims-dependent semantic models also require `claims_enabled`. |
| `fhir_preprocessing_enabled` | `false` | `false` | Enable FHIR preprocessing models. |
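For example, a downstream project could enable the claims and clinical domains by setting the vars above in its `dbt_project.yml` (a sketch; adjust to the domains you actually need):

```yaml
# Illustrative domain toggles in a downstream dbt_project.yml.
vars:
  claims_enabled: true
  clinical_enabled: true
```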
| Variable | Default | Description |
|---|---|---|
| `tuva_last_run` | Current UTC timestamp | Populates the `tuva_last_run` column in output models. |
| `tuva_schema_prefix` | unset | Prefixes output schemas, for example `myprefix_core`. |
| `cms_hcc_payment_year` | Current year | CMS-HCC payment year used for risk scoring. |
| `quality_measures_period_end` | Current year-end | Optional reporting-period end date for quality measures. |
| `record_type` | `"ip"` | CCSR record type: `"ip"` for inpatient or `"op"` for outpatient. |
| `dxccsr_version` | `"2023.1"` | CCSR diagnosis mapping version. |
| `prccsr_version` | `"2023.1"` | CCSR procedure mapping version. |
| `provider_attribution_as_of_date` | unset | Optional `YYYY-MM-DD` override for provider attribution current-state calculations. |
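The runtime vars above go under the `vars:` key in `dbt_project.yml`; a sketch with illustrative values:

```yaml
# Illustrative runtime overrides in dbt_project.yml; values are examples only.
vars:
  tuva_schema_prefix: "myprefix"               # output schemas become e.g. myprefix_core
  cms_hcc_payment_year: 2024
  record_type: "op"                            # build outpatient CCSR records
  provider_attribution_as_of_date: "2024-06-30"
```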
| Variable | Default | Description |
|---|---|---|
| `custom_bucket_name` | `"tuva-public-resources"` | Default bucket for versioned Tuva seed artifacts. |
| `tuva_seed_version` | `"1.0.0"` | Default versioned seed folder used when no per-database override is provided. Leading `v` is optional. |
| `tuva_seed_versions` | `{concept_library: "1.0.1", reference_data: "1.0.0", terminology: "1.0.0", value_sets: "1.0.0", provider_data: "1.0.0", synthetic_data: "1.0.0"}` | Optional per-database version overrides keyed by `concept_library`, `reference_data`, `terminology`, `value_sets`, `provider_data`, or `synthetic_data`. |
| `tuva_seed_buckets` | `{}` | Optional per-database bucket overrides for `concept_library`, `reference_data`, `terminology`, `value_sets`, `provider_data`, or `synthetic_data`. |
| `synthetic_data_size` | `small` in `integration_tests` | Selects the `small` or `large` synthetic input payload when running `integration_tests`. |
| `enable_input_layer_testing` | `true` | Runs DQI checks on the input layer. |
| `enable_legacy_data_quality` | `false` | Builds the legacy pre-DQI data-quality models. |
| `enable_normalize_engine` | `false` | Set to `unmapped` to surface unmapped code models, or `true` to also use custom mappings. |
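As a sketch, the per-database seed overrides above could be combined like this in `dbt_project.yml`; the mirror bucket name is hypothetical:

```yaml
# Hypothetical seed bootstrap overrides; the bucket name is an example.
vars:
  tuva_seed_version: "1.0.0"
  tuva_seed_versions:
    concept_library: "1.0.1"        # pin one database to a newer release
  tuva_seed_buckets:
    terminology: "my-mirror-bucket" # pull terminology from a private mirror
  synthetic_data_size: small
```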
See the maintained docs reference at thetuvaproject.com/dbt-variables for examples and more detail.
Use scripts/publish-dolthub-seeds to publish the latest public DoltHub databases to versioned S3 folders.
Required inputs:
- `--version v1.0.0`
- AWS CLI credentials via `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`

Optional inputs:
- `--bucket reference_data=my-bucket`
- `--bucket value_sets=my-other-bucket/prefix`
- `--database terminology`
- `--download-only`
The script publishes to the normalized layout:
`s3://<bucket>/<database-folder>/<version>/<table>.csv.gz`
Use scripts/mirror-seed-release after an S3 publish to copy the same versioned release to GCS and Azure Blob Storage.
Required access:
- AWS CLI access to read `s3://tuva-public-resources`
- `gsutil` access to write `gs://tuva-public-resources`
- Azure `Storage Blob Data Contributor` or equivalent on storage account `tuvapublicresources`, container `tuva-public-resources`
Example:
```shell
scripts/mirror-seed-release --version v1.0.0
```

The script mirrors:
- `s3://tuva-public-resources/<database-folder>/<version>/...`
- `gs://tuva-public-resources/<database-folder>/<version>/...`
- `https://tuvapublicresources.blob.core.windows.net/tuva-public-resources/<database-folder>/<version>/...`
Current published defaults:
- `concept-library` uses `1.0.1`
- `reference-data`, `terminology`, `value-sets`, `provider-data`, and `synthetic-data` use `1.0.0`
