GitHub - tuva-health/tuva: Main repo including core data model, data marts, data quality tests, and terminology sets.

What is the Tuva Project?

The Tuva Project is a dbt package for transforming raw healthcare data into analytics-ready data. The package includes:

Input Layer
Claims Preprocessing
Core Data Model
Data Marts
Terminology and Value Sets

Docs

The docs source for the getting-started runbook lives in docs/docs/getting-started.md.

Local Development

Recommended local setup:

Python 3.10 or later
DuckDB
dbt-core and dbt-duckdb

Use integration_tests as your development project. Configure a DuckDB connection in profiles.yml, then run from the repo root:

./scripts/dbt-local deps
./scripts/dbt-local build --full-refresh

dbt seed and dbt build load synthetic data into raw_data from versioned S3 artifacts. dbt run assumes those relations already exist, so on a fresh database you should run seed or build first.

Once the synthetic data is loaded, iterate with:

./scripts/dbt-local run

Agentic Workflow

If you are using coding agents in this repo, the local workflow guidance lives in AGENTS.md.

dbt Variables

Set Tuva vars under the vars: key in your dbt_project.yml. Use dbt selectors to run individual marts; the vars below control broad data domains, shared runtime behavior, and the synthetic bootstrap flow used by integration_tests.

Broad Enablement

| Variable | Root Default | `integration_tests` Default | Description |
| --- | --- | --- | --- |
| `claims_enabled` | `false` | `true` | Enable claims-based models. |
| `clinical_enabled` | `false` | `true` | Enable clinical-based models. |
| `provider_attribution_enabled` | `false` | `true` | Enable provider attribution models. Claims input must also be enabled. |
| `semantic_layer_enabled` | `false` | `true` | Enable semantic-layer models. Claims-dependent semantic models also require `claims_enabled`. |
| `fhir_preprocessing_enabled` | `false` | `false` | Enable FHIR preprocessing models. |
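
The dependencies between these flags can be checked before a run. The sketch below is a hypothetical pre-flight helper, not part of the Tuva package: `check_enablement` and `ROOT_DEFAULTS` are illustrative names, and the defaults are copied from the Root Default column above.

```python
# Root-project defaults, copied from the Broad Enablement table.
ROOT_DEFAULTS = {
    "claims_enabled": False,
    "clinical_enabled": False,
    "provider_attribution_enabled": False,
    "semantic_layer_enabled": False,
    "fhir_preprocessing_enabled": False,
}


def check_enablement(overrides: dict[str, bool]) -> list[str]:
    """Hypothetical helper: list flag combinations the table warns about."""
    unknown = set(overrides) - set(ROOT_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown enablement vars: {sorted(unknown)}")

    # Overrides win over the root defaults.
    merged = {**ROOT_DEFAULTS, **overrides}
    problems = []
    if merged["provider_attribution_enabled"] and not merged["claims_enabled"]:
        problems.append("provider_attribution_enabled requires claims_enabled")
    if merged["semantic_layer_enabled"] and not merged["claims_enabled"]:
        problems.append("claims-dependent semantic models need claims_enabled")
    return problems


if __name__ == "__main__":
    for problem in check_enablement({"provider_attribution_enabled": True}):
        print(problem)
```

Passing the same dictionary you place under `vars:` in dbt_project.yml gives a quick sanity check; an empty list means none of the documented dependencies are violated.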

Shared Runtime Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `tuva_last_run` | Current UTC timestamp | Populates the `tuva_last_run` column in output models. |
| `tuva_schema_prefix` | unset | Prefixes output schemas, for example `myprefix_core`. |
| `cms_hcc_payment_year` | Current year | CMS-HCC payment year used for risk scoring. |
| `quality_measures_period_end` | Current year-end | Optional reporting-period end date for quality measures. |
| `record_type` | `"ip"` | CCSR record type: `"ip"` for inpatient or `"op"` for outpatient. |
| `dxccsr_version` | `"2023.1"` | CCSR diagnosis mapping version. |
| `prccsr_version` | `"2023.1"` | CCSR procedure mapping version. |
| `provider_attribution_as_of_date` | unset | Optional `YYYY-MM-DD` override for provider attribution current-state calculations. |

Seed And Feature Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `custom_bucket_name` | `"tuva-public-resources"` | Default bucket for versioned Tuva seed artifacts. |
| `tuva_seed_version` | `"1.0.0"` | Default versioned seed folder used when no per-database override is provided. Leading `v` is optional. |
| `tuva_seed_versions` | `{concept_library: "1.0.1", reference_data: "1.0.0", terminology: "1.0.0", value_sets: "1.0.0", provider_data: "1.0.0", synthetic_data: "1.0.0"}` | Optional per-database version overrides keyed by `concept_library`, `reference_data`, `terminology`, `value_sets`, `provider_data`, or `synthetic_data`. |
| `tuva_seed_buckets` | `{}` | Optional per-database bucket overrides for `concept_library`, `reference_data`, `terminology`, `value_sets`, `provider_data`, or `synthetic_data`. |
| `synthetic_data_size` | `small` in `integration_tests` | Selects the `small` or `large` synthetic input payload when running `integration_tests`. |
| `enable_input_layer_testing` | `true` | Runs DQI checks on the input layer. |
| `enable_legacy_data_quality` | `false` | Builds the legacy pre-DQI data-quality models. |
| `enable_normalize_engine` | `false` | Set to `unmapped` to surface unmapped code models, or `true` to also use custom mappings. |
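
The way the seed variables combine can be sketched in a few lines of Python. This is an illustration of the precedence described in the table, not the package's actual macro logic: `resolve_seed_source` is a hypothetical name, and stripping a leading `v` is an assumption about what "Leading `v` is optional" means in practice.

```python
def resolve_seed_source(database: str, project_vars: dict) -> tuple[str, str]:
    """Hypothetical helper: return (bucket, version) for one seed database."""
    # Package-wide fallbacks, using the defaults documented in the table.
    default_bucket = project_vars.get("custom_bucket_name", "tuva-public-resources")
    default_version = project_vars.get("tuva_seed_version", "1.0.0")

    # Per-database overrides take precedence over the package-wide values.
    bucket = project_vars.get("tuva_seed_buckets", {}).get(database, default_bucket)
    version = project_vars.get("tuva_seed_versions", {}).get(database, default_version)

    # Assumption: "v1.0.0" and "1.0.0" are treated as the same version.
    return bucket, version.removeprefix("v")


if __name__ == "__main__":
    example_vars = {
        "tuva_seed_version": "v1.0.0",
        "tuva_seed_versions": {"concept_library": "1.0.1"},
        "tuva_seed_buckets": {"value_sets": "my-other-bucket/prefix"},
    }
    for name in ("concept_library", "terminology", "value_sets"):
        print(name, resolve_seed_source(name, example_vars))
```

In this sketch only `concept_library` picks up a version override and only `value_sets` picks up a bucket override; every other database falls back to `tuva_seed_version` and `custom_bucket_name`.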

See the maintained docs reference at thetuvaproject.com/dbt-variables for examples and more detail.

Publishing Versioned Seed Artifacts

Use scripts/publish-dolthub-seeds to publish the latest public DoltHub databases to versioned S3 folders.

Required inputs:

--version v1.0.0
AWS CLI credentials via AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

Optional inputs:

--bucket reference_data=my-bucket
--bucket value_sets=my-other-bucket/prefix
--database terminology
--download-only

The script publishes to the normalized layout:

s3://<bucket>/<database-folder>/<version>/<table>.csv.gz
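
As a rough illustration of that layout, the following sketch assembles an object URI from its parts. `seed_object_uri` and the table name are hypothetical, and mapping a database key such as `reference_data` to the folder `reference-data` is an assumption based on the hyphenated folder names used elsewhere in this README. The version is passed through unchanged because the README does not say whether the folder name carries the leading `v`.

```python
def seed_object_uri(bucket: str, database: str, version: str, table: str) -> str:
    """Hypothetical helper: build an S3 URI following the normalized layout."""
    # Assumption: folder names are the database keys with hyphens
    # instead of underscores (reference_data -> reference-data).
    database_folder = database.replace("_", "-")
    return f"s3://{bucket}/{database_folder}/{version}/{table}.csv.gz"


if __name__ == "__main__":
    # "example_table" is a placeholder, not a real Tuva seed table.
    print(seed_object_uri("tuva-public-resources", "reference_data", "1.0.0", "example_table"))
```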

Mirroring Seed Releases To GCS And Azure

Use scripts/mirror-seed-release after an S3 publish to copy the same versioned release to GCS and Azure Blob Storage.

Required access:

AWS CLI access to read s3://tuva-public-resources
gsutil access to write gs://tuva-public-resources
Azure Storage Blob Data Contributor or equivalent on storage account tuvapublicresources, container tuva-public-resources

Example:

scripts/mirror-seed-release --version v1.0.0

The script mirrors:

s3://tuva-public-resources/<database-folder>/<version>/...
gs://tuva-public-resources/<database-folder>/<version>/...
https://tuvapublicresources.blob.core.windows.net/tuva-public-resources/<database-folder>/<version>/...

Current published defaults:

concept-library uses 1.0.1
reference-data, terminology, value-sets, provider-data, and synthetic-data use 1.0.0
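
To see where a given release should appear after mirroring, the three location patterns above can be generated for any folder and version. This is a sketch for checking paths by hand; `mirror_prefixes` is a hypothetical name, and it assumes the version folder is spelled exactly as passed in.

```python
BUCKET = "tuva-public-resources"       # S3 bucket, GCS bucket, and Azure container
AZURE_ACCOUNT = "tuvapublicresources"  # Azure storage account


def mirror_prefixes(database_folder: str, version: str) -> dict[str, str]:
    """Hypothetical helper: list the three mirrored prefixes for one release."""
    suffix = f"{database_folder}/{version}/"
    return {
        "s3": f"s3://{BUCKET}/{suffix}",
        "gcs": f"gs://{BUCKET}/{suffix}",
        "azure": f"https://{AZURE_ACCOUNT}.blob.core.windows.net/{BUCKET}/{suffix}",
    }


if __name__ == "__main__":
    for provider, prefix in mirror_prefixes("concept-library", "1.0.1").items():
        print(provider, prefix)
```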

Folders and files

.github
analyses
docs
integration_tests
license
macros
models
scripts
seeds
snapshots
tests
.gitignore
AGENTS.md
README.md
dbt_project.yml
packages.yml
