
Community Voice Dataset Creation

This repository supports communities in building voice datasets for language preservation and AI development. It is a digital companion to the chapter "AI Techniques for Indigenous Cultural Expression" in the book "Envisioning Indigenous Methods in Digital Media and Ecologies".

```mermaid
flowchart TD
    accTitle: Community voice dataset creation workflow
    accDescr: Decision flowchart beginning with community agreement and Indigenous datasheet, proceeding through the what-not-to-digitize and what-not-to-train-on ethical frameworks, branching into recording new audio or loading existing recordings, then a segmentation decision point for script-based silence detection or manual marking in Audacity, then SNR quality check, language-based transcription via Whisper, MMS, or manual, followed by transcript review, metadata entry and export, optional augmentation for small datasets, and finally LJSpeech export with a decision point to either train a TTS model or archive for preservation.
    A([Start]) --> B[Community agreement]
    B --> B2[Indigenous datasheet]
    B2 --> C[What not to digitize]
    C --> C2[What not to train on]
    C2 --> D{Existing\nrecordings?}

    D -->|No| P1[Record new audio]
    D -->|Yes| SEG

    P1 --> SEG{Segmentation\nmethod?}

    SEG -->|Script| SEG1[Segment on silence]
    SEG -->|Manual| SEG2[Mark and export]
    SEG1 --> SNR[SNR quality check]
    SEG2 --> SNR
    SNR --> LANG{Language?}
    LANG -->|Whisper| W[03a Whisper]
    LANG -->|MMS| M[03b MMS]
    LANG -->|Manual| MAN[03c Manual]
    W & M & MAN --> REV[Review transcripts]
    REV --> META

    META[Fill metadata] --> EMETA[Export metadata]
    EMETA --> AUG{Dataset\ntoo small?}

    AUG -->|Yes| P3[Augment]
    AUG -->|No| EXP
    P3 --> EXP

    EXP[Export LJSpeech] --> TRAIN_Q{Train model?}
    TRAIN_Q -->|Yes| DONE([Train TTS model])
    TRAIN_Q -->|No| ARCHIVE([Archive for preservation])
```

Components

| Component | Purpose |
| --- | --- |
| Community Agreement | Speaker consent template with ownership tiers and withdrawal rights |
| Indigenous Datasheet | OCAP®-based institutional governance for the dataset as a whole |
| What Not to Digitize | Decision framework for recordings that should not become training data |
| What Not to Train On | Extends the framework to images, text, and cultural artifacts |
| 01 Record & Segment | Record audio, split on silence |
| 02 SNR Analysis | Signal-to-noise quality gate |
| 03a Whisper | Transcribe English, Spanish, Māori, etc. |
| 03b MMS | Transcribe Cree syllabics, Ojibwe, O'odham, etc. (530+ languages) |
| 03c Manual | Manual transcription for unsupported languages |
| 04 Augmentation | Speed, pitch, noise variations of consented recordings |
| 05 Export | Filter by consent tier, output LJSpeech format |
| export_metadata.py | Merge markers with CARE metadata template |
| segment_on_silence.py | Split long recordings into utterances |
| batch_transcribe_whisper.py | CLI wrapper for Whisper batch processing |

Before beginning, read docs/what_not_to_digitize.md. Not all recordings should become training data. This is the most important decision in the workflow.


Guiding Principles

This repository is designed in alignment with the CARE Principles for Indigenous Data Governance (Collective Benefit, Authority to Control, Responsibility, Ethics). See CARE_PRINCIPLES.md for how each principle is enacted here, and CHANGELOG.md for the full history of changes from the 2020 original.




Setup

Requirements: Python 3.11+, uv, ffmpeg, sox

```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create environment and install exact locked dependencies (verifies hashes)
uv sync --extra dev
source .venv/bin/activate   # Windows: .venv\Scripts\activate
```

`uv sync` installs from `uv.lock`, which pins exact versions and verifies package hashes, protecting against supply-chain attacks like the 2026 litellm incident. Never run `uv pip install` without the lockfile when working with sensitive community data.

No cloud credentials are required. All transcription runs locally.

Updating dependencies (maintainers): Edit pyproject.toml, then run uv lock to regenerate the lockfile before committing.


Pathway 1: Record Your Own Voice

Notebook: notebooks/01_record_and_segment.ipynb

This pathway is for communities recording new speech with consenting speakers.

Before Recording

  1. Complete the Community Data Agreement with each speaker. For institutional-level governance documentation (dataset ownership, funding, access policies), fill out the Adapted Datasheet for Indigenous Datasets.
  2. Work through docs/what_not_to_digitize.md to identify any material that should not be recorded or should be restricted.
  3. Prepare metadata/metadata_template.csv — fill in speaker consent tiers and cultural protocol notes before the session.

Recording Requirements

  • Omni-directional or cardioid head-mounted microphone
  • Quiet, acoustically treated room
  • Sample rate: 22050 Hz or higher, mono, 16-bit PCM
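A quick programmatic check can catch format problems before a session's recordings pile up. This sketch (the function name is illustrative, not part of the repository) uses Python's standard `wave` module to verify the requirements above:

```python
import wave

def check_recording(path):
    """Verify a WAV file meets the recommended format:
    mono, 16-bit PCM, sample rate >= 22050 Hz."""
    with wave.open(path, "rb") as wav:
        problems = []
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit PCM
            problems.append(f"expected 16-bit, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() < 22050:
            problems.append(f"sample rate {wav.getframerate()} Hz is below 22050 Hz")
        return problems  # an empty list means the file passes
```

Run it over `wavs_export/` before transcription; any non-empty result means the clip should be re-recorded or resampled.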

Segment and Label

In Audacity:

  • Open your recording and select Analyze → Sound Finder
  • Adjust dB thresholds until clips are 3–10 seconds
  • Export labels: File → Export → Export Labels → save as Label Track.txt
  • Export WAVs: File → Export → Export Multiple...
    • Format: WAV, Signed 16-bit PCM
    • Split on Labels, name by Label/Track Name
    • Output folder: wavs_export

Or run the segmentation script directly:

```bash
python scripts/segment_on_silence.py --input recording.wav --output-dir test_data/wavs_export_audacity
```
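The script itself is not reproduced here, but the core idea of silence-based segmentation can be sketched with the standard library alone: compute per-frame RMS energy and treat runs of quiet frames as utterance boundaries. The thresholds and function name below are illustrative; the repository's script will differ in detail.

```python
import array
import wave

def find_speech_segments(path, frame_ms=50, silence_rms=500, min_silence_ms=300):
    """Return (start_sec, end_sec) spans louder than a silence threshold,
    bridging quiet gaps shorter than min_silence_ms. Tune per recording."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    frame_len = int(rate * frame_ms / 1000)
    # Classify each frame as loud (speech) or quiet (silence) by RMS energy
    loud = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(s * s for s in frame) / max(len(frame), 1)) ** 0.5
        loud.append(rms >= silence_rms)
    # Merge consecutive loud frames into segments, ending a segment only
    # after min_silence_ms of sustained quiet
    segments, start, quiet_run = [], None, 0
    max_quiet = max(1, min_silence_ms // frame_ms)
    for idx, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = idx
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= max_quiet:
                segments.append((start * frame_ms / 1000,
                                 (idx - quiet_run + 1) * frame_ms / 1000))
                start, quiet_run = None, 0
    if start is not None:
        segments.append((start * frame_ms / 1000, len(loud) * frame_ms / 1000))
    return segments
```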

Check Sentence Lengths

```bash
scripts/wavdurations2csv.sh
```
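An equivalent check in Python (hypothetical function, stdlib only) writes one duration per clip so outliers outside the 3–10 second target are easy to spot:

```python
import csv
import wave
from pathlib import Path

def durations_to_csv(wav_dir, out_csv):
    """Write filename,duration_seconds for every WAV in wav_dir.
    Clips far outside 3-10 seconds are candidates for re-segmentation."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "duration_seconds"])
        for path in sorted(Path(wav_dir).glob("*.wav")):
            with wave.open(str(path), "rb") as wav:
                seconds = wav.getnframes() / wav.getframerate()
            writer.writerow([path.name, f"{seconds:.2f}"])
```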

Analyze SNR and Transcribe

Run notebooks/02_snr_quality_analysis.ipynb to check audio quality, then choose a transcription notebook from the Pathway 2 language table below.


Pathway 2: Transcribe Existing Recordings

This pathway is for communities digitizing existing recordings (cassettes, reel-to-reel, field recordings). All transcription runs locally — no audio leaves community infrastructure.

Mark and Export

Follow the same Audacity segmentation steps from Pathway 1, or use Adobe Audition:

  • Diagnostics → Mark Audio → Mark the Speech preset
  • Export markers to Markers.csv and WAVs to wavs_export/

Analyze Signal-to-Noise Ratio

Run notebooks/02_snr_quality_analysis.ipynb to identify and remove poor-quality recordings before transcribing.
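The notebook's exact method is not shown here, but a common rough estimate treats the loudest frames as a speech proxy and the quietest as the noise floor. A stdlib sketch (illustrative name and fractions):

```python
import array
import math
import wave

def estimate_snr_db(path, frame_ms=50):
    """Rough SNR estimate: ratio of mean energy in the loudest 20% of
    frames (speech proxy) to the quietest 20% (noise-floor proxy)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    frame_len = int(rate * frame_ms / 1000)
    energies = sorted(
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len, frame_len)
    )
    n = max(1, len(energies) // 5)          # top/bottom 20% of frames
    noise = sum(energies[:n]) / n
    speech = sum(energies[-n:]) / n
    return 10 * math.log10(speech / max(noise, 1e-9))
```

Recordings scoring below a community-chosen cutoff (often around 20 dB for TTS work, though this varies) can be flagged for re-digitization.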

Transcribe

Choose the notebook for your language:

| Language | Notebook |
| --- | --- |
| English, Spanish, Māori, or other Whisper-supported language | notebooks/03a_transcribe_whisper.ipynb |
| Plains Cree or other language in the MMS list | notebooks/03b_transcribe_mms.ipynb |
| Language not covered by either tool, or community prefers manual transcription | notebooks/03c_transcribe_manual.ipynb |

Export to Metadata

```bash
python scripts/export_metadata.py audacity \
  --metadata-template metadata/metadata_template.csv
```
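Audacity label tracks are tab-separated `start<TAB>end<TAB>label` lines. A minimal sketch of the merge step (function and column names are assumptions; match them to your metadata template):

```python
import csv

def merge_labels_with_template(labels_txt, template_csv, out_csv):
    """Join Audacity label rows onto session-level metadata defaults,
    producing one metadata row per exported clip."""
    # Each label line: start seconds, end seconds, clip label
    with open(labels_txt) as f:
        labels = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    # One row of session-level defaults (speaker, consent tier, ...)
    with open(template_csv, newline="") as f:
        template = next(csv.DictReader(f))
    with open(out_csv, "w", newline="") as f:
        fields = ["clip_id", "start_sec", "end_sec"] + list(template)
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for start, end, label in labels:
            writer.writerow({"clip_id": label, "start_sec": start,
                             "end_sec": end, **template})
```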

Pathway 3: Augment a Small Dataset

Notebook: notebooks/04_augmentation.ipynb

This pathway extends an existing authentic dataset without introducing synthetic voices. It is appropriate when a community has a small number of real recordings and needs more training data.

Note: The Te Hiku Media approach of collecting 310+ hours from real speakers is the gold standard. Augmentation is a practical compromise, not a substitute for real recordings.

Techniques available: speed perturbation, pitch shifting, additive noise (white, brown, room impulse response).

All augmented files are flagged in metadata with provenance_note: "augmented from {source_id}".
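As one concrete example of the additive-noise technique, this stdlib sketch (hypothetical function; the notebook's implementation will differ) writes a noise-augmented copy of a consented recording. The fixed random seed keeps augmentation reproducible:

```python
import array
import random
import wave

def add_white_noise(src_path, dst_path, noise_amplitude=200):
    """Write a white-noise-augmented copy of a consented recording.
    noise_amplitude is in int16 sample units. The caller must still record
    provenance_note: "augmented from {source_id}" in the metadata."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        samples = array.array("h", src.readframes(src.getnframes()))
    rng = random.Random(0)  # fixed seed: same input always yields same output
    noisy = array.array("h", (
        # add uniform noise and clamp to the int16 range
        max(-32768, min(32767, s + rng.randint(-noise_amplitude, noise_amplitude)))
        for s in samples))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(noisy.tobytes())
```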


Metadata Schema

Every recording gets structured provenance. See docs/metadata_schema.md for full field documentation.

Template: metadata/metadata_template.csv

Key fields:

| Field | Purpose |
| --- | --- |
| consent_tier | open / community / restricted |
| cultural_protocol | Free text, e.g. "seasonal restriction: winter only" |
| knowledge_keeper_reviewed | Boolean |
| exclude_from_training | Boolean; keeps recording in archive but out of the model |
| exclude_reason | Required when exclude_from_training is true |

Export to LJSpeech Format

Notebook: notebooks/05_export_ljspeech.ipynb

The final step packages reviewed, consented recordings into the LJSpeech format used by most TTS fine-tuning pipelines.

Only recordings where exclude_from_training == False and consent_tier is open or community are included. Restricted and sacred recordings are preserved in the archive but never exported.
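That filter can be sketched in a few lines (column names follow the metadata schema above; the transcript column and function name are assumptions). An LJSpeech metadata file is pipe-delimited, one clip per line:

```python
import csv

def export_ljspeech(metadata_csv, out_path):
    """Write an LJSpeech-style metadata file (id|transcript) containing
    only rows that pass the consent filter."""
    kept = []
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Governance decisions become a simple row filter here
            if row["exclude_from_training"].lower() == "true":
                continue
            if row["consent_tier"] not in ("open", "community"):
                continue
            kept.append(row)
    with open(out_path, "w", newline="") as f:
        for row in kept:
            f.write(f'{row["clip_id"]}|{row["transcript"]}\n')
    return len(kept)
```

Restricted rows never reach the output file, but they remain untouched in the source metadata and archive.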


Utilities

Upsample WAV files (16 kHz → 22050 Hz)

```bash
scripts/resamplewav.sh
```

ffmpeg is used rather than resampy because it preserves more high-frequency content; see the spectrograms in assets/.


Teaching Materials

The teaching/ directory contains course materials developed for AI Ethics courses.

| Notebook | Purpose |
| --- | --- |
| Student Protocol Exercise | Design exercise: draft a community agreement, categorize hypothetical recordings, build a metadata schema (no audio tools required) |
| Consent-Tier Filtering Demo | Shows how governance decisions become a DataFrame filter at export time |

These notebooks reference the same docs and metadata used by the pipeline but can be run independently with only pandas.

