
Community Voice Dataset Creation

This repository supports communities in building voice datasets for language preservation and AI development. It is a digital companion to the chapter "AI Techniques for Indigenous Cultural Expression" in the book "Envisioning Indigenous Methods in Digital Media and Ecologies".

```mermaid
flowchart TD
    accTitle: Community voice dataset creation workflow
    accDescr: Decision flowchart beginning with community agreement and Indigenous datasheet, proceeding through the what-not-to-digitize and what-not-to-train-on ethical frameworks, branching into recording new audio or loading existing recordings, then a segmentation decision point for script-based silence detection or manual marking in Audacity, then SNR quality check, language-based transcription via Whisper, MMS, or manual, followed by transcript review, metadata entry and export, optional augmentation for small datasets, and finally LJSpeech export with a decision point to either train a TTS model or archive for preservation.
    A([Start]) --> B[Community agreement]
    B --> B2[Indigenous datasheet]
    B2 --> C[What not to digitize]
    C --> C2[What not to train on]
    C2 --> D{Existing\nrecordings?}

    D -->|No| P1[Record new audio]
    D -->|Yes| SEG

    P1 --> SEG{Segmentation\nmethod?}

    SEG -->|Script| SEG1[Segment on silence]
    SEG -->|Manual| SEG2[Mark and export]
    SEG1 --> SNR[SNR quality check]
    SEG2 --> SNR
    SNR --> LANG{Language?}
    LANG -->|Whisper| W[03a Whisper]
    LANG -->|MMS| M[03b MMS]
    LANG -->|Manual| MAN[03c Manual]
    W & M & MAN --> REV[Review transcripts]
    REV --> META

    META[Fill metadata] --> EMETA[Export metadata]
    EMETA --> AUG{Dataset\ntoo small?}

    AUG -->|Yes| P3[Augment]
    AUG -->|No| EXP
    P3 --> EXP

    EXP[Export LJSpeech] --> TRAIN_Q{Train model?}
    TRAIN_Q -->|Yes| DONE([Train TTS model])
    TRAIN_Q -->|No| ARCHIVE([Archive for preservation])
```

Components

| Component | Purpose |
| --- | --- |
| Community Agreement | Speaker consent template with ownership tiers and withdrawal rights |
| Indigenous Datasheet | OCAP®-based institutional governance for the dataset as a whole |
| What Not to Digitize | Decision framework for recordings that should not become training data |
| What Not to Train On | Extends the framework to images, text, and cultural artifacts |
| 01 Record & Segment | Record audio, split on silence |
| 02 SNR Analysis | Signal-to-noise quality gate |
| 03a Whisper | Transcribe English, Spanish, Māori, etc. |
| 03b MMS | Transcribe Cree syllabics, Ojibwe, O'odham, etc. (530+ languages) |
| 03c Manual | Manual transcription for unsupported languages |
| 04 Augmentation | Speed, pitch, noise variations of consented recordings |
| 05 Export | Filter by consent tier, output LJSpeech format |
| export_metadata.py | Merge markers with CARE metadata template |
| segment_on_silence.py | Split long recordings into utterances |
| batch_transcribe_whisper.py | CLI wrapper for Whisper batch processing |

Before beginning, read docs/what_not_to_digitize.md. Not all recordings should become training data. This is the most important decision in the workflow.


Guiding Principles

This repository is designed in alignment with the CARE Principles for Indigenous Data Governance (Collective Benefit, Authority to Control, Responsibility, Ethics). See CARE_PRINCIPLES.md for how each principle is enacted here, and CHANGELOG.md for the full history of changes from the 2020 original.




Setup

Requirements: Python 3.11+, uv, ffmpeg, sox

```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create environment and install exact locked dependencies (verifies hashes)
uv sync --extra dev
source .venv/bin/activate   # Windows: .venv\Scripts\activate
```

`uv sync` installs from `uv.lock`, which pins exact versions and verifies package hashes, protecting against supply-chain attacks like the 2026 litellm incident. Never run `uv pip install` without the lockfile when working with sensitive community data.

No cloud credentials are required. All transcription runs locally.

Updating dependencies (maintainers): Edit pyproject.toml, then run uv lock to regenerate the lockfile before committing.


Pathway 1: Record Your Own Voice

Notebook: notebooks/01_record_and_segment.ipynb

This pathway is for communities recording new speech with consenting speakers.

Before Recording

  1. Complete the Community Data Agreement with each speaker. For institutional-level governance documentation (dataset ownership, funding, access policies), fill out the Adapted Datasheet for Indigenous Datasets.
  2. Work through docs/what_not_to_digitize.md to identify any material that should not be recorded or should be restricted.
  3. Prepare metadata/metadata_template.csv — fill in speaker consent tiers and cultural protocol notes before the session.

Recording Requirements

  • Omni-directional or cardioid head-mounted microphone
  • Quiet, acoustically treated room
  • Sample rate: 22050 Hz or higher, mono, 16-bit PCM
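A quick programmatic check can catch format problems before a session's recordings pile up. This sketch (the function name is illustrative, not part of the repository) uses Python's standard `wave` module to verify the requirements above:

```python
import wave

def check_recording(path):
    """Verify a WAV file meets the recommended format:
    mono, 16-bit PCM, sample rate >= 22050 Hz."""
    with wave.open(path, "rb") as wav:
        problems = []
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit PCM
            problems.append(f"expected 16-bit, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() < 22050:
            problems.append(f"sample rate {wav.getframerate()} Hz is below 22050 Hz")
        return problems  # an empty list means the file passes
```

Run it over `wavs_export/` before transcription; any non-empty result means the clip should be re-recorded or resampled.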

Segment and Label

In Audacity:

  • Open your recording and select Analyze → Sound Finder
  • Adjust dB thresholds until clips are 3–10 seconds
  • Export labels: File → Export → Export Labels → save as Label Track.txt
  • Export WAVs: File → Export → Export Multiple...
    • Format: WAV, Signed 16-bit PCM
    • Split on Labels, name by Label/Track Name
    • Output folder: wavs_export

Or run the segmentation script directly:

```bash
python scripts/segment_on_silence.py --input recording.wav --output-dir test_data/wavs_export_audacity
```
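The script itself is not reproduced here, but the core idea of silence-based segmentation can be sketched with the standard library alone: compute per-frame RMS energy and treat runs of quiet frames as utterance boundaries. The thresholds and function name below are illustrative; the repository's script will differ in detail.

```python
import array
import wave

def find_speech_segments(path, frame_ms=50, silence_rms=500, min_silence_ms=300):
    """Return (start_sec, end_sec) spans louder than a silence threshold,
    bridging quiet gaps shorter than min_silence_ms. Tune per recording."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    frame_len = int(rate * frame_ms / 1000)
    # Classify each frame as loud (speech) or quiet (silence) by RMS energy
    loud = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(s * s for s in frame) / max(len(frame), 1)) ** 0.5
        loud.append(rms >= silence_rms)
    # Merge consecutive loud frames into segments, ending a segment only
    # after min_silence_ms of sustained quiet
    segments, start, quiet_run = [], None, 0
    max_quiet = max(1, min_silence_ms // frame_ms)
    for idx, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = idx
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= max_quiet:
                segments.append((start * frame_ms / 1000,
                                 (idx - quiet_run + 1) * frame_ms / 1000))
                start, quiet_run = None, 0
    if start is not None:
        segments.append((start * frame_ms / 1000, len(loud) * frame_ms / 1000))
    return segments
```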

Check Sentence Lengths

```bash
scripts/wavdurations2csv.sh
```
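An equivalent check in Python (hypothetical function, stdlib only) writes one duration per clip so outliers outside the 3–10 second target are easy to spot:

```python
import csv
import wave
from pathlib import Path

def durations_to_csv(wav_dir, out_csv):
    """Write filename,duration_seconds for every WAV in wav_dir.
    Clips far outside 3-10 seconds are candidates for re-segmentation."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "duration_seconds"])
        for path in sorted(Path(wav_dir).glob("*.wav")):
            with wave.open(str(path), "rb") as wav:
                seconds = wav.getnframes() / wav.getframerate()
            writer.writerow([path.name, f"{seconds:.2f}"])
```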

Analyze SNR and Transcribe

Run notebooks/02_snr_quality_analysis.ipynb to check audio quality, then choose a transcription notebook from the Pathway 2 language table below.


Pathway 2: Transcribe Existing Recordings

This pathway is for communities digitizing existing recordings (cassettes, reel-to-reel, field recordings). All transcription runs locally — no audio leaves community infrastructure.

Mark and Export

Follow the same Audacity segmentation steps from Pathway 1, or use Adobe Audition:

  • Diagnostics → Mark Audio → Mark the Speech preset
  • Export markers to Markers.csv and WAVs to wavs_export/

Analyze Signal-to-Noise Ratio

Run notebooks/02_snr_quality_analysis.ipynb to identify and remove poor-quality recordings before transcribing.
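The notebook's exact method is not shown here, but a common rough estimate treats the loudest frames as a speech proxy and the quietest as the noise floor. A stdlib sketch (illustrative name and fractions):

```python
import array
import math
import wave

def estimate_snr_db(path, frame_ms=50):
    """Rough SNR estimate: ratio of mean energy in the loudest 20% of
    frames (speech proxy) to the quietest 20% (noise-floor proxy)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    frame_len = int(rate * frame_ms / 1000)
    energies = sorted(
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len, frame_len)
    )
    n = max(1, len(energies) // 5)          # top/bottom 20% of frames
    noise = sum(energies[:n]) / n
    speech = sum(energies[-n:]) / n
    return 10 * math.log10(speech / max(noise, 1e-9))
```

Recordings scoring below a community-chosen cutoff (often around 20 dB for TTS work, though this varies) can be flagged for re-digitization.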

Transcribe

Choose the notebook for your language:

| Language | Notebook |
| --- | --- |
| English, Spanish, Māori, or other Whisper-supported language | notebooks/03a_transcribe_whisper.ipynb |
| Plains Cree or other language in the MMS list | notebooks/03b_transcribe_mms.ipynb |
| Language not covered by either tool, or community prefers manual transcription | notebooks/03c_transcribe_manual.ipynb |

Export to Metadata

```bash
python scripts/export_metadata.py audacity \
  --metadata-template metadata/metadata_template.csv
```
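Audacity label tracks are tab-separated `start<TAB>end<TAB>label` lines. A minimal sketch of the merge step (function and column names are assumptions; match them to your metadata template):

```python
import csv

def merge_labels_with_template(labels_txt, template_csv, out_csv):
    """Join Audacity label rows onto session-level metadata defaults,
    producing one metadata row per exported clip."""
    # Each label line: start seconds, end seconds, clip label
    with open(labels_txt) as f:
        labels = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    # One row of session-level defaults (speaker, consent tier, ...)
    with open(template_csv, newline="") as f:
        template = next(csv.DictReader(f))
    with open(out_csv, "w", newline="") as f:
        fields = ["clip_id", "start_sec", "end_sec"] + list(template)
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for start, end, label in labels:
            writer.writerow({"clip_id": label, "start_sec": start,
                             "end_sec": end, **template})
```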

Pathway 3: Augment a Small Dataset

Notebook: notebooks/04_augmentation.ipynb

This pathway extends an existing authentic dataset without introducing synthetic voices. It is appropriate when a community has a small number of real recordings and needs more training data.

Note: The Te Hiku Media approach of collecting 310+ hours from real speakers is the gold standard. Augmentation is a practical compromise, not a substitute for real recordings.

Techniques available: speed perturbation, pitch shifting, additive noise (white, brown, room impulse response).

All augmented files are flagged in metadata with provenance_note: "augmented from {source_id}".
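As one concrete example of the additive-noise technique, this stdlib sketch (hypothetical function; the notebook's implementation will differ) writes a noise-augmented copy of a consented recording. The fixed random seed keeps augmentation reproducible:

```python
import array
import random
import wave

def add_white_noise(src_path, dst_path, noise_amplitude=200):
    """Write a white-noise-augmented copy of a consented recording.
    noise_amplitude is in int16 sample units. The caller must still record
    provenance_note: "augmented from {source_id}" in the metadata."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        samples = array.array("h", src.readframes(src.getnframes()))
    rng = random.Random(0)  # fixed seed: same input always yields same output
    noisy = array.array("h", (
        # add uniform noise and clamp to the int16 range
        max(-32768, min(32767, s + rng.randint(-noise_amplitude, noise_amplitude)))
        for s in samples))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(noisy.tobytes())
```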


Metadata Schema

Every recording gets structured provenance. See docs/metadata_schema.md for full field documentation.

Template: metadata/metadata_template.csv

Key fields:

| Field | Purpose |
| --- | --- |
| consent_tier | open / community / restricted |
| cultural_protocol | Free text, e.g. "seasonal restriction: winter only" |
| knowledge_keeper_reviewed | Boolean |
| exclude_from_training | Boolean; keeps recording in archive but out of the model |
| exclude_reason | Required when exclude_from_training is true |

Export to LJSpeech Format

Notebook: notebooks/05_export_ljspeech.ipynb

The final step packages reviewed, consented recordings into the LJSpeech format used by most TTS fine-tuning pipelines.

Only recordings where exclude_from_training == False and consent_tier is open or community are included. Restricted and sacred recordings are preserved in the archive but never exported.
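That filter can be sketched in a few lines (column names follow the metadata schema above; the transcript column and function name are assumptions). An LJSpeech metadata file is pipe-delimited, one clip per line:

```python
import csv

def export_ljspeech(metadata_csv, out_path):
    """Write an LJSpeech-style metadata file (id|transcript) containing
    only rows that pass the consent filter."""
    kept = []
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Governance decisions become a simple row filter here
            if row["exclude_from_training"].lower() == "true":
                continue
            if row["consent_tier"] not in ("open", "community"):
                continue
            kept.append(row)
    with open(out_path, "w", newline="") as f:
        for row in kept:
            f.write(f'{row["clip_id"]}|{row["transcript"]}\n')
    return len(kept)
```

Restricted rows never reach the output file, but they remain untouched in the source metadata and archive.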


Utilities

Upsample WAV files (16 kHz → 22050 Hz)

```bash
scripts/resamplewav.sh
```

ffmpeg is used rather than resampy because it preserves more high-frequency content; see the spectrograms in assets/.


Teaching Materials

The teaching/ directory contains course materials developed for AI Ethics courses.

| Notebook | Purpose |
| --- | --- |
| Student Protocol Exercise | Design exercise: draft a community agreement, categorize hypothetical recordings, build a metadata schema (no audio tools required) |
| Consent-Tier Filtering Demo | Shows how governance decisions become a DataFrame filter at export time |

These notebooks reference the same docs and metadata used by the pipeline but can be run independently with only pandas.

