This repository supports communities in building voice datasets for language preservation and AI development. It is a digital companion to the chapter "AI Techniques for Indigenous Cultural Expression" in the book "Envisioning Indigenous Methods in Digital Media and Ecologies".
```mermaid
flowchart TD
accTitle: Community voice dataset creation workflow
accDescr: Decision flowchart beginning with community agreement and Indigenous datasheet, proceeding through the what-not-to-digitize and what-not-to-train-on ethical frameworks, branching into recording new audio or loading existing recordings, then a segmentation decision point for script-based silence detection or manual marking in Audacity, then SNR quality check, language-based transcription via Whisper, MMS, or manual, followed by transcript review, metadata entry and export, optional augmentation for small datasets, and finally LJSpeech export with a decision point to either train a TTS model or archive for preservation.
A([Start]) --> B[Community agreement]
B --> B2[Indigenous datasheet]
B2 --> C[What not to digitize]
C --> C2[What not to train on]
C2 --> D{Existing<br/>recordings?}
D -->|No| P1[Record new audio]
D -->|Yes| SEG
P1 --> SEG{Segmentation<br/>method?}
SEG -->|Script| SEG1[Segment on silence]
SEG -->|Manual| SEG2[Mark and export]
SEG1 --> SNR[SNR quality check]
SEG2 --> SNR
SNR --> LANG{Language?}
LANG -->|Whisper| W[03a Whisper]
LANG -->|MMS| M[03b MMS]
LANG -->|Manual| MAN[03c Manual]
W & M & MAN --> REV[Review transcripts]
REV --> META
META[Fill metadata] --> EMETA[Export metadata]
EMETA --> AUG{Dataset<br/>too small?}
AUG -->|Yes| P3[Augment]
AUG -->|No| EXP
P3 --> EXP
EXP[Export LJSpeech] --> TRAIN_Q{Train model?}
TRAIN_Q -->|Yes| DONE([Train TTS model])
TRAIN_Q -->|No| ARCHIVE([Archive for preservation])
```
| Component | Purpose |
|---|---|
| Community Agreement | Speaker consent template with ownership tiers and withdrawal rights |
| Indigenous Datasheet | OCAP®-based institutional governance for the dataset as a whole |
| What Not to Digitize | Decision framework for recordings that should not become training data |
| What Not to Train On | Extends the framework to images, text, and cultural artifacts |
| 01 Record & Segment | Record audio, split on silence |
| 02 SNR Analysis | Signal-to-noise quality gate |
| 03a Whisper | Transcribe English, Spanish, Māori, etc. |
| 03b MMS | Transcribe Cree syllabics, Ojibwe, O'odham, etc. (530+ languages) |
| 03c Manual | Manual transcription for unsupported languages |
| 04 Augmentation | Speed, pitch, noise variations of consented recordings |
| 05 Export | Filter by consent tier, output LJSpeech format |
| export_metadata.py | Merge markers with CARE metadata template |
| segment_on_silence.py | Split long recordings into utterances |
| batch_transcribe_whisper.py | CLI wrapper for Whisper batch processing |
Before beginning, read docs/what_not_to_digitize.md. Not all recordings should become training data. This is the most important decision in the workflow.
This repository is designed in alignment with the CARE Principles for Indigenous Data Governance (Collective Benefit, Authority to Control, Responsibility, Ethics). See CARE_PRINCIPLES.md for how each principle is enacted here, and CHANGELOG.md for the full history of changes from the 2020 original.
# Community Voice Dataset Creation
Requirements: Python 3.11+, uv, ffmpeg, sox
```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create environment and install exact locked dependencies (verifies hashes)
uv sync --extra dev
source .venv/bin/activate   # Windows: .venv\Scripts\activate
```

`uv sync` installs from `uv.lock`, which pins exact versions and verifies package hashes, protecting against supply chain attacks like the 2026 litellm incident. Never run `uv pip install` without the lockfile when working with sensitive community data.
No cloud credentials are required. All transcription runs locally.
Updating dependencies (maintainers): Edit pyproject.toml, then run uv lock to regenerate the lockfile before committing.
Notebook: notebooks/01_record_and_segment.ipynb
This pathway is for communities recording new speech with consenting speakers.
- Complete the Community Data Agreement with each speaker. For institutional-level governance documentation (dataset ownership, funding, access policies), fill out the Adapted Datasheet for Indigenous Datasets.
- Work through docs/what_not_to_digitize.md to identify any material that should not be recorded or should be restricted.
- Prepare `metadata/metadata_template.csv`: fill in speaker consent tiers and cultural protocol notes before the session.
- Omni-directional or cardioid head-mounted microphone
- Quiet, acoustically treated room
- Sample rate: 22050 Hz or higher, mono, 16-bit PCM
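The recording spec above can be checked mechanically before a session's files enter the pipeline. A minimal stdlib sketch (the function name `meets_recording_spec` is ours, not a script in this repo):

```python
import wave
import struct
import tempfile
import os

def meets_recording_spec(path, min_rate=22050):
    """Check a WAV file against the session spec: mono, 16-bit PCM,
    sample rate at least min_rate."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1
                and w.getsampwidth() == 2       # 16-bit = 2 bytes per sample
                and w.getframerate() >= min_rate)

# Demo: write a spec-compliant 1-second silent file and check it.
path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(22050)
    w.writeframes(struct.pack("<h", 0) * 22050)

print(meets_recording_spec(path))  # True
```

Running a check like this over `wavs_export/` catches stereo or 8-bit files before they reach the SNR gate.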
In Audacity:
- Open your recording and select `Analyze` → `Sound Finder`
- Adjust dB thresholds until clips are 3–10 seconds
- Export labels: `File` → `Export` → `Export Labels`, save as `Label Track.txt`
- Export WAVs: `File` → `Export` → `Export Multiple...`
  - Format: WAV, Signed 16-bit PCM
  - Split on Labels, name by Label/Track Name
  - Output folder: `wavs_export`
Or run the segmentation script directly:
```bash
python scripts/segment_on_silence.py --input recording.wav --output-dir test_data/wavs_export_audacity
```

To tabulate the resulting clip durations, use `scripts/wavdurations2csv.sh`.

Run notebooks/02_snr_quality_analysis.ipynb to check audio quality, then choose a transcription notebook from the Pathway 2 language table below.
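The core idea behind silence-based segmentation is small: find stretches of audio above an amplitude threshold, separated by at least a minimum run of quiet. A minimal numpy sketch (thresholds and parameter names here are illustrative, not the actual flags of `segment_on_silence.py`):

```python
import numpy as np

def split_on_silence(samples, rate, threshold=0.02, min_silence_s=0.2):
    """Return (start, end) sample indices of regions louder than
    `threshold`, separated by at least `min_silence_s` of quiet."""
    loud = np.flatnonzero(np.abs(samples) >= threshold)
    if loud.size == 0:
        return []
    gap = int(min_silence_s * rate)
    # Split wherever consecutive loud samples are farther apart than `gap`.
    breaks = np.flatnonzero(np.diff(loud) > gap)
    starts = np.r_[loud[0], loud[breaks + 1]]
    ends = np.r_[loud[breaks], loud[-1]] + 1
    return list(zip(starts.tolist(), ends.tolist()))

# Demo: two 0.5 s tones separated by 1 s of silence at 22050 Hz.
rate = 22050
t = np.arange(int(0.5 * rate)) / rate
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
audio = np.concatenate([tone, np.zeros(rate), tone])
segments = split_on_silence(audio, rate)
print(len(segments))  # 2
```

Real speech needs the dB thresholds tuned per recording, which is why the Audacity route above exposes them interactively.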
This pathway is for communities digitizing existing recordings (cassettes, reel-to-reel, field recordings). All transcription runs locally — no audio leaves community infrastructure.
Follow the same Audacity segmentation steps from Pathway 1, or use Adobe Audition:
- `Diagnostics` → `Mark Audio` → `Mark the Speech` preset
- Export markers to `Markers.csv` and WAVs to `wavs_export/`
Run notebooks/02_snr_quality_analysis.ipynb to identify and remove poor-quality recordings before transcribing.
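The quality gate rests on the standard signal-to-noise ratio in decibels. A sketch of the computation, assuming you have a speech clip and a noise-only clip (the notebook's actual noise-estimation method may differ):

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10 of the power ratio."""
    p_signal = np.mean(np.square(signal))
    p_noise = np.mean(np.square(noise))
    return 10 * np.log10(p_signal / p_noise)

# Demo: a 220 Hz tone against low-level white noise (seeded for repeatability).
rate = 22050
tone = np.sin(2 * np.pi * 220 * np.arange(rate) / rate)
noise = np.random.default_rng(0).normal(0.0, 0.01, rate)
print(round(snr_db(tone, noise), 1))  # roughly 37 dB
```

Clips scoring well below the rest of the session are the first candidates for re-recording or exclusion.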
Choose the notebook for your language:
| Language | Notebook |
|---|---|
| English, Spanish, Māori, or other Whisper-supported language | notebooks/03a_transcribe_whisper.ipynb |
| Plains Cree or other language in the MMS list | notebooks/03b_transcribe_mms.ipynb |
| Language not covered by either tool, or community prefers manual | notebooks/03c_transcribe_manual.ipynb |
```bash
python scripts/export_metadata.py audacity \
    --metadata-template metadata/metadata_template.csv
```

Notebook: notebooks/04_augmentation.ipynb
This pathway extends an existing authentic dataset without introducing synthetic voices. It is appropriate when a community has a small number of real recordings and needs more training data.
Note: The Te Hiku Media approach of collecting 310+ hours from real speakers is the gold standard. Augmentation is a practical compromise, not a substitute for real recordings.
Techniques available: speed perturbation, pitch shifting, additive noise (white, brown, room impulse response).
All augmented files are flagged in metadata with provenance_note: "augmented from {source_id}".
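Speed perturbation and additive noise are both short transformations. A numpy sketch of the two, with the provenance flag attached as described above (function names are illustrative; the notebook's implementation may use different tools):

```python
import numpy as np

def speed_perturb(samples, factor):
    """Resample by linear interpolation: factor 1.1 plays ~10% faster."""
    n_out = int(len(samples) / factor)
    idx = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(idx, np.arange(len(samples)), samples)

def add_noise(samples, target_snr_db, rng):
    """Add white noise scaled to hit a target SNR against the clip."""
    p_signal = np.mean(np.square(samples))
    p_noise = p_signal / (10 ** (target_snr_db / 10))
    return samples + rng.normal(0.0, np.sqrt(p_noise), len(samples))

rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 220 * np.arange(22050) / 22050)
fast = speed_perturb(clip, 1.1)
noisy = add_noise(clip, target_snr_db=20, rng=rng)

# Each output row carries the provenance flag from the convention above.
meta = {"provenance_note": "augmented from spk01_0001"}  # hypothetical source_id
```

Pitch shifting and room impulse responses need more machinery (phase vocoder, convolution) but follow the same pattern: transform the consented source, record the provenance.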
Every recording gets structured provenance. See docs/metadata_schema.md for full field documentation.
Template: metadata/metadata_template.csv
Key fields:
| Field | Purpose |
|---|---|
| `consent_tier` | `open` / `community` / `restricted` |
| `cultural_protocol` | Free text, e.g. "seasonal restriction: winter only" |
| `knowledge_keeper_reviewed` | Boolean |
| `exclude_from_training` | Boolean; keeps recording in archive but out of model |
| `exclude_reason` | Required when `exclude_from_training` is true |
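These rules are easy to enforce in code before export. An illustrative validator (the function is ours, not one of the repo's scripts; field names follow the table above):

```python
def validate_row(row):
    """Check two schema rules: consent_tier must be a known value, and
    exclude_reason is required whenever exclude_from_training is set."""
    errors = []
    if row.get("consent_tier") not in {"open", "community", "restricted"}:
        errors.append(f"unknown consent_tier: {row.get('consent_tier')!r}")
    if row.get("exclude_from_training") and not row.get("exclude_reason"):
        errors.append("exclude_from_training is set but exclude_reason is empty")
    return errors

# An excluded recording with no stated reason fails validation.
print(validate_row({"consent_tier": "community",
                    "exclude_from_training": True,
                    "exclude_reason": ""}))
```

Running this over every row of the metadata CSV turns the schema from documentation into a gate.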
Notebook: notebooks/05_export_ljspeech.ipynb
The final step packages reviewed, consented recordings into the LJSpeech format used by most TTS fine-tuning pipelines.
Only recordings where exclude_from_training == False and consent_tier is open or community are included. Restricted and sacred recordings are preserved in the archive but never exported.
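That export rule reduces to a DataFrame filter, which is also how the teaching notebooks demonstrate it. A pandas sketch with hypothetical rows (the LJSpeech `metadata.csv` format is pipe-delimited: id, transcript, normalized transcript):

```python
import pandas as pd

# Hypothetical metadata rows; field names follow metadata_template.csv.
df = pd.DataFrame([
    {"id": "spk01_0001", "transcript": "hello there",
     "consent_tier": "open", "exclude_from_training": False},
    {"id": "spk01_0002", "transcript": "a restricted story",
     "consent_tier": "restricted", "exclude_from_training": False},
    {"id": "spk01_0003", "transcript": "withdrawn by speaker",
     "consent_tier": "community", "exclude_from_training": True},
])

# The export rule: not excluded, and consent tier open or community.
exportable = df[~df["exclude_from_training"]
                & df["consent_tier"].isin(["open", "community"])]

# Emit LJSpeech-style metadata lines: id|transcript|normalized_transcript.
lines = [f"{r.id}|{r.transcript}|{r.transcript}" for r in exportable.itertuples()]
print(lines)  # ['spk01_0001|hello there|hello there']
```

Only the open-tier row survives: the restricted recording and the withdrawn one stay in the archive and never reach the model.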
Resampling is handled by `scripts/resamplewav.sh`. ffmpeg is used (not resampy) because it preserves more high-frequency content; see the spectrograms in assets/.
The teaching/ directory contains course materials developed for AI Ethics courses.
| Notebook | Purpose |
|---|---|
| Student Protocol Exercise | Design exercise: draft a community agreement, categorize hypothetical recordings, build a metadata schema (no audio tools required) |
| Consent-Tier Filtering Demo | Shows how governance decisions become a DataFrame filter at export time |
These notebooks reference the same docs and metadata used by the pipeline but can be run independently with only pandas.
- CARE Principles for Indigenous Data Governance: https://www.gida-global.org/care
- Te Hiku Media Kaitiakitanga License: https://github.com/TeHikuMedia/Kaitiakitanga-License
- Whisper (local speech recognition): https://github.com/openai/whisper
- LJSpeech Dataset format: https://keithito.com/LJ-Speech-Dataset/
- Datasheets for Datasets (Gebru et al.): https://arxiv.org/abs/1803.09010
- Adapted Datasheet for Indigenous Datasets (OCAP®-based): docs/adapted_datasheet_for_indigenous_datasets.md
- Mozilla TTS: https://github.com/mozilla/TTS