Simplified scripts for fine-tuning VibeVoice models with LoRA. The goal is to make the process as painless as possible with reasonable defaults and clear instructions.
🚀 New: Try the Google Colab notebook for zero-setup training in the cloud!
Choose your preferred method:
Click the badge above to open the notebook in Google Colab. You'll get:
- ✅ Free GPU access (T4)
- ✅ No local installation needed
- ✅ Direct upload of audio files
- ✅ Automatic saving to Google Drive
Recommended for: Beginners, quick experiments, or if you don't have a GPU.
```bash
bash 01-setup.sh
```

This script will:
- Install system dependencies (ffmpeg, git-lfs, etc.)
- Create a Python virtual environment
- Clone the VibeVoice-finetuning repository
- Install all Python dependencies
- Download the VibeVoice model (1.5B or 7B)
- Create helper scripts
```bash
source activate_env.sh
```

```bash
python 02-prepare_dataset.py --audio_dir /path/to/your/audio --auto_transcribe --output data/dataset.jsonl
```

Or with existing transcripts:

```bash
python 02-prepare_dataset.py --audio_dir /path/to/audio --transcript_dir /path/to/transcripts --output data/dataset.jsonl
```

```bash
python 03-train.py --dataset data/dataset.jsonl --model 1.5B
```

That's it! Your fine-tuned model will be saved in the `output/` directory.
The VibeVoice model uses a dual-loss training approach:
- Cross-Entropy Loss on text tokens for language modeling
- MSE Loss on acoustic latents for speech generation quality
LoRA adapters are applied to the Qwen2 LLM backbone while the diffusion head can be fully fine-tuned.
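As a rough illustration of that objective (a minimal sketch, not the repository's actual code; the tensor names, masking convention, and 1:1 weighting are assumptions), the combined training loss can look like this:

```python
import torch.nn.functional as F

def combined_loss(text_logits, text_labels, pred_latents, target_latents,
                  ce_weight=1.0, mse_weight=1.0):
    """Dual objective: language modeling on text tokens + regression on acoustic latents."""
    # Cross-entropy over the vocabulary; positions labeled -100 (padding / audio slots) are ignored.
    ce = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)),
        text_labels.view(-1),
        ignore_index=-100,
    )
    # MSE between the diffusion head's predicted latents and the target acoustic latents.
    mse = F.mse_loss(pred_latents, target_latents)
    return ce_weight * ce + mse_weight * mse
```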
The setup script handles the entire environment configuration automatically.
What it does:
- Checks for CUDA availability
- Installs system dependencies (Ubuntu/Debian, RHEL/CentOS, or macOS)
- Creates a Python 3.11 virtual environment
- Clones the VibeVoice-finetuning repository
- Installs PyTorch with CUDA support
- Installs the VibeVoice-finetuning package with compatible dependencies
- Downloads your chosen model (1.5B or 7B)
- Creates helper scripts (`activate_env.sh`)
After setup:

```bash
source activate_env.sh  # Activate the environment
```
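Optionally, you can sanity-check the activated environment before moving on (a small ad-hoc snippet, not one of the provided scripts):

```python
import shutil
import torch

# Confirm the GPU and ffmpeg are visible from inside the venv.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)
```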
This script prepares your audio data into the JSONL format required by VibeVoice.

Option 1: Auto-transcribe with Whisper
```bash
python 02-prepare_dataset.py \
    --audio_dir ./my_audio \
    --auto_transcribe \
    --output data/dataset.jsonl
```

Option 2: With transcript files
```bash
python 02-prepare_dataset.py \
    --audio_dir ./my_audio \
    --transcript_dir ./transcripts \
    --output data/dataset.jsonl
```

Option 3: From CSV
```bash
python 02-prepare_dataset.py \
    --csv metadata.csv \
    --csv_audio_col audio_path \
    --csv_text_col transcript \
    --output data/dataset.jsonl
```

| Option | Description | Default |
|---|---|---|
| `--audio_dir` | Directory containing audio files | Required (or `--csv`) |
| `--csv` | CSV file with metadata | Alternative to `--audio_dir` |
| `--transcript_dir` | Directory with transcript `.txt` files | Auto-detect |
| `--auto_transcribe` | Transcribe audio using Whisper | False |
| `--whisper_model` | Whisper model size (tiny/base/small/medium/large) | base |
| `--voice_prompts_dir` | Directory with voice prompt audio | None |
| `--speaker_prefix` | Speaker label prefix | "Speaker 0" |
| `--output` | Output JSONL file | data/dataset.jsonl |
| `--val_split` | Validation split ratio | 0.1 (10%) |
| `--no_val_split` | Don't create a validation set | False |
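For the CSV route (Option 3 above), the file just needs one row per clip, with whatever column names you pass to `--csv_audio_col` and `--csv_text_col`. A hypothetical `metadata.csv` could be written like this (paths and transcripts are placeholders):

```python
import csv

# Example rows -- replace with your own audio paths and transcripts.
rows = [
    {"audio_path": "data/audio/clip_001.wav", "transcript": "Hello, this is a sample."},
    {"audio_path": "data/audio/clip_002.wav", "transcript": "Another short utterance."},
]

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["audio_path", "transcript"])
    writer.writeheader()
    writer.writerows(rows)
```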
The output file contains one JSON object per line:
{"text": "Speaker 0: Hello, this is a sample transcription.", "audio": "/path/to/audio.wav"}
{"text": "Speaker 0: Another example with voice prompt.", "audio": "/path/to/audio2.wav", "voice_prompts": "/path/to/prompt.wav"}For multi-speaker training, include multiple voice prompts:
{"text": "Speaker 0: Hello there!\nSpeaker 1: Hi, how are you?", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_prompt.wav", "/path/to/speaker1_prompt.wav"]}A simplified training script with sensible defaults.
A simplified training script with sensible defaults.

Quick test (1 epoch, faster):

```bash
python 03-train.py --dataset data/dataset.jsonl --preset fast
```

Standard training (1.5B model):

```bash
python 03-train.py --dataset data/dataset.jsonl --model 1.5B
```

High quality (7B model):

```bash
python 03-train.py --dataset data/dataset.jsonl --model 7B --preset quality
```

Resume from checkpoint:

```bash
python 03-train.py --dataset data/dataset.jsonl --resume_from_checkpoint ./output/checkpoint-500
```

| Preset | Description | Epochs | Learning Rate | Use Case |
|---|---|---|---|---|
| `fast` | Quick training | 1 | 5e-5 | Testing the pipeline |
| `default` | Balanced | 5 | 2.5e-5 | General fine-tuning |
| `quality` | Best quality | 10 | 1e-5 | Production models |
| Model | VRAM Required | Batch Size | Description |
|---|---|---|---|
| `1.5B` | 16GB | 4 | Smaller, faster training |
| `7B` | 48GB | 2 | Higher quality output |
| Option | Description | Default |
|---|---|---|
| `--dataset` | Training dataset JSONL file | Required |
| `--val_dataset` | Validation dataset JSONL | None |
| `--model` | Model size (1.5B/7B) | 1.5B |
| `--preset` | Training preset (fast/default/quality) | default |
| `--output_dir` | Output directory | output |
| `--num_epochs` | Number of epochs | From preset |
| `--learning_rate` | Learning rate | From preset |
| `--batch_size` | Batch size per device | From preset |
| `--gradient_accumulation` | Gradient accumulation steps | From preset |
| `--lora_r` | LoRA rank | 8 |
| `--lora_alpha` | LoRA alpha | 32 |
| `--wandb` | Enable Weights & Biases logging | False |
| `--tensorboard` | Enable TensorBoard logging | False |
| `--gradient_checkpointing` | Save memory (slower) | False |
| `--train_diffusion_head` | Train diffusion head | True |
| `--resume_from_checkpoint` | Resume from checkpoint path | None |
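For reference, `--lora_r` and `--lora_alpha` correspond to a standard PEFT LoRA configuration, roughly like the sketch below. The target modules and dropout shown here are assumptions about a Qwen2-style backbone, not values taken from the fine-tuning repository:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                 # --lora_r
    lora_alpha=32,       # --lora_alpha
    lora_dropout=0.05,   # assumed value, not a documented default
    # Typical attention projections for a Qwen2-style LLM (assumption):
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```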
After setup, your directory will look like:
```
VibeVoice-finetune/
├── 01-setup.sh               # Setup script
├── 02-prepare_dataset.py     # Dataset preparation
├── 03-train.py               # Training script
├── activate_env.sh           # Environment activation
├── venv/                     # Python virtual environment
├── VibeVoice-finetuning/     # Original repository
├── models/                   # Downloaded models
│   ├── aoi-ot--VibeVoice-Base/
│   └── aoi-ot--VibeVoice-Large/
├── data/                     # Your data
│   ├── audio/                # Audio files
│   ├── dataset.jsonl         # Prepared dataset
│   ├── dataset.train.jsonl   # Training split
│   └── dataset.val.jsonl     # Validation split
├── output/                   # Training outputs
│   ├── checkpoint-XXX/       # Checkpoints
│   └── lora/                 # Final LoRA adapter
├── docs/
│   └── diagrams/             # Documentation diagrams
│       ├── workflow-overview.svg
│       ├── architecture.svg
│       └── dataset-pipeline.svg
└── hf_cache/                 # HuggingFace cache
```
```bash
# 1. Setup everything
bash 01-setup.sh

# 2. Activate environment
source activate_env.sh

# 3. Copy your audio files
cp -r /path/to/my/audio/* data/audio/

# 4. Prepare dataset (with Whisper transcription)
python 02-prepare_dataset.py \
    --audio_dir data/audio \
    --auto_transcribe \
    --whisper_model base \
    --output data/my_dataset.jsonl

# 5. Train the model
python 03-train.py \
    --dataset data/my_dataset.jsonl \
    --model 1.5B \
    --preset default \
    --output_dir output/my_model

# 6. Your model is ready!
# LoRA adapter saved at: output/my_model/lora/
```

- Use high-quality audio files (24kHz sample rate preferred; see the resampling sketch after this list)
- Clean audio with minimal background noise works best
- Consistent volume levels across samples
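A simple way to get there is to batch-convert your recordings with ffmpeg (installed by the setup script). This sketch converts everything in a hypothetical `raw_audio/` folder to 24 kHz WAV; downmixing to mono is an extra assumption, so drop `-ac 1` if you want to keep the original channels:

```python
import subprocess
from pathlib import Path

SRC = Path("raw_audio")    # your original recordings (hypothetical folder)
DST = Path("data/audio")   # where the converted files go
DST.mkdir(parents=True, exist_ok=True)

for src in sorted(SRC.iterdir()):
    if src.suffix.lower() not in {".wav", ".mp3", ".flac", ".m4a", ".ogg"}:
        continue
    dst = DST / (src.stem + ".wav")
    # -ar 24000: resample to 24 kHz, -ac 1: downmix to mono
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-ar", "24000", "-ac", "1", str(dst)],
        check=True,
    )
```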
- Minimum: 50-100 samples for basic fine-tuning
- Good: 500-1000 samples for decent quality
- Excellent: 5000+ samples for best results
| Model | Dataset Size | Preset | Estimated Time (A100) |
|---|---|---|---|
| 1.5B | 100 samples | fast | ~5 minutes |
| 1.5B | 1000 samples | default | ~2 hours |
| 7B | 1000 samples | default | ~6 hours |
| 7B | 5000 samples | quality | ~30 hours |
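Actual runtimes depend on your GPU and settings, but you can roughly estimate the amount of work yourself: the number of optimizer steps is samples × epochs divided by the effective batch size (batch size × gradient accumulation). The values below are placeholders, not the presets' exact defaults:

```python
import math

num_samples = 1000   # size of your dataset
epochs = 5           # e.g. the "default" preset
batch_size = 4       # per-device batch size (1.5B default above)
grad_accum = 8       # placeholder; the real value comes from the preset or --gradient_accumulation

steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
print(f"{steps_per_epoch} steps/epoch, {steps_per_epoch * epochs} optimizer steps total")
```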
If you run out of VRAM:

- Enable gradient checkpointing:

  ```bash
  python 03-train.py --dataset data/dataset.jsonl --gradient_checkpointing
  ```

- Reduce batch size and compensate with gradient accumulation (the effective batch size is batch_size × gradient_accumulation):

  ```bash
  python 03-train.py --dataset data/dataset.jsonl --batch_size 1 --gradient_accumulation 64
  ```

- Use a smaller model:

  ```bash
  python 03-train.py --dataset data/dataset.jsonl --model 1.5B
  ```
For multi-speaker datasets:

- Format your text with speaker labels:

  ```
  Speaker 0: First speaker's text
  Speaker 1: Second speaker's text
  ```

- Provide voice prompts for each speaker:

  ```bash
  python 02-prepare_dataset.py \
      --audio_dir data/multi_speaker \
      --voice_prompts_dir data/prompts \
      --output data/multi_speaker.jsonl
  ```
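If you assemble multi-speaker entries by hand rather than through the script, one line of the JSONL can be built like this (all paths here are hypothetical examples):

```python
import json

turns = [
    ("Speaker 0", "First speaker's text"),
    ("Speaker 1", "Second speaker's text"),
]
entry = {
    "text": "\n".join(f"{speaker}: {utterance}" for speaker, utterance in turns),
    "audio": "data/multi_speaker/conversation_001.wav",   # hypothetical path
    "voice_prompts": [
        "data/prompts/speaker0.wav",                      # hypothetical paths
        "data/prompts/speaker1.wav",
    ],
}
with open("data/multi_speaker.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")
```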
Make sure you've activated the environment:

```bash
source activate_env.sh
```

- Use `--gradient_checkpointing`
- Reduce `--batch_size`
- Use a smaller model (1.5B instead of 7B)
The setup script should download models automatically. If not:

```bash
mkdir -p models
cd models
git lfs install
git clone https://huggingface.co/aoi-ot/VibeVoice-Base
git clone https://huggingface.co/aoi-ot/VibeVoice-Large
```
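If `git lfs` is not an option, the same repositories can be fetched with `huggingface_hub` (assuming it is installed in the environment), matching the `models/` layout shown earlier:

```python
from huggingface_hub import snapshot_download

# Download either (or both) models into the local models/ directory.
snapshot_download(repo_id="aoi-ot/VibeVoice-Base", local_dir="models/aoi-ot--VibeVoice-Base")
snapshot_download(repo_id="aoi-ot/VibeVoice-Large", local_dir="models/aoi-ot--VibeVoice-Large")
```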
Make sure ffmpeg is installed:

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

For users who prefer a cloud-based, zero-setup approach, we provide a fully-featured Google Colab notebook.
- Interactive Setup: One-click environment setup
- Multiple Data Sources: Upload directly, use Google Drive, or HuggingFace datasets
- Auto-Transcription: Built-in Whisper integration
- Visual Configuration: Sliders and dropdowns for all training parameters
- Progress Monitoring: Real-time training progress
- Easy Export: Save to Google Drive or download directly
- 🆕 Beginners new to ML fine-tuning
- 💻 Users without local GPU access
- ⚡ Quick experiments and prototyping
- 📚 Educational purposes
| Model | Colab GPU | Status |
|---|---|---|
| VibeVoice 1.5B | T4 (Free tier) | ✅ Works well |
| VibeVoice 1.5B | A100 (Pro) | ✅ Fast training |
| VibeVoice 7B | T4 (Free tier) | |
| VibeVoice 7B | A100 (Pro) | ✅ Recommended |
- Original VibeVoice: Microsoft VibeVoice
- Fine-tuning repository: VibeVoice-finetuning
- LoRA implementation: PEFT
This wrapper script is provided as-is for easier use of VibeVoice fine-tuning. Please refer to the original repositories for their respective licenses.