Simplified scripts for fine-tuning VibeVoice models with LoRA. The goal is to make the process as painless as possible with reasonable defaults and clear instructions.
🚀 New: Try the Google Colab notebook for zero-setup training in the cloud!
Choose your preferred method:
Click the badge above to open the notebook in Google Colab. You'll get:
- ✅ Free GPU access (T4)
- ✅ No local installation needed
- ✅ Direct upload of audio files
- ✅ Automatic saving to Google Drive
Recommended for: Beginners, quick experiments, or if you don't have a GPU.
```bash
bash 01-setup.sh
```

This script will:
- Install system dependencies (ffmpeg, git-lfs, etc.)
- Create a Python virtual environment
- Clone the VibeVoice-finetuning repository
- Install all Python dependencies
- Download the VibeVoice model (1.5B or 7B)
- Create helper scripts
```bash
source activate_env.sh
```

```bash
python 02-prepare_dataset.py --audio_dir /path/to/your/audio --auto_transcribe --output data/dataset.jsonl
```

Or with existing transcripts:

```bash
python 02-prepare_dataset.py --audio_dir /path/to/audio --transcript_dir /path/to/transcripts --output data/dataset.jsonl
```

```bash
python 03-train.py --dataset data/dataset.jsonl --model 1.5B
```

That's it! Your fine-tuned model will be saved in the `output/` directory.
The VibeVoice model uses a dual-loss training approach:
- Cross-Entropy Loss on text tokens for language modeling
- MSE Loss on acoustic latents for speech generation quality
LoRA adapters are applied to the Qwen2 LLM backbone while the diffusion head can be fully fine-tuned.
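As a rough illustration of that objective (a minimal sketch, not the repository's actual code; the tensor names, masking convention, and 1:1 weighting are assumptions), the combined training loss can look like this:

```python
import torch.nn.functional as F

def combined_loss(text_logits, text_labels, pred_latents, target_latents,
                  ce_weight=1.0, mse_weight=1.0):
    """Dual objective: language modeling on text tokens + regression on acoustic latents."""
    # Cross-entropy over the vocabulary; positions labeled -100 (padding / audio slots) are ignored.
    ce = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)),
        text_labels.view(-1),
        ignore_index=-100,
    )
    # MSE between the diffusion head's predicted latents and the target acoustic latents.
    mse = F.mse_loss(pred_latents, target_latents)
    return ce_weight * ce + mse_weight * mse
```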
The setup script handles the entire environment configuration automatically.
What it does:
- Checks for CUDA availability
- Installs system dependencies (Ubuntu/Debian, RHEL/CentOS, or macOS)
- Creates a Python 3.11 virtual environment
- Clones the VibeVoice-finetuning repository
- Installs PyTorch with CUDA support
- Installs the VibeVoice-finetuning package with compatible dependencies
- Downloads your chosen model (1.5B or 7B)
- Creates helper scripts (`activate_env.sh`)
After setup:

```bash
source activate_env.sh  # Activate the environment
```
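Optionally, you can sanity-check the activated environment before moving on (a small ad-hoc snippet, not one of the provided scripts):

```python
import shutil
import torch

# Confirm the GPU and ffmpeg are visible from inside the venv.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)
```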
This script prepares your audio data into the JSONL format required by VibeVoice.

Option 1: Auto-transcribe with Whisper
```bash
python 02-prepare_dataset.py \
    --audio_dir ./my_audio \
    --auto_transcribe \
    --output data/dataset.jsonl
```

Option 2: With transcript files
```bash
python 02-prepare_dataset.py \
    --audio_dir ./my_audio \
    --transcript_dir ./transcripts \
    --output data/dataset.jsonl
```

Option 3: From CSV
```bash
python 02-prepare_dataset.py \
    --csv metadata.csv \
    --csv_audio_col audio_path \
    --csv_text_col transcript \
    --output data/dataset.jsonl
```

| Option | Description | Default |
|---|---|---|
| `--audio_dir` | Directory containing audio files | Required (or `--csv`) |
| `--csv` | CSV file with metadata | Alternative to `--audio_dir` |
| `--transcript_dir` | Directory with transcript `.txt` files | Auto-detect |
| `--auto_transcribe` | Transcribe audio using Whisper | False |
| `--whisper_model` | Whisper model size (tiny/base/small/medium/large) | base |
| `--voice_prompts_dir` | Directory with voice prompt audio | None |
| `--speaker_prefix` | Speaker label prefix | "Speaker 0" |
| `--output` | Output JSONL file | data/dataset.jsonl |
| `--val_split` | Validation split ratio | 0.1 (10%) |
| `--no_val_split` | Don't create a validation set | False |
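For the CSV route (Option 3 above), the file just needs one row per clip, with whatever column names you pass to `--csv_audio_col` and `--csv_text_col`. A hypothetical `metadata.csv` could be written like this (paths and transcripts are placeholders):

```python
import csv

# Example rows -- replace with your own audio paths and transcripts.
rows = [
    {"audio_path": "data/audio/clip_001.wav", "transcript": "Hello, this is a sample."},
    {"audio_path": "data/audio/clip_002.wav", "transcript": "Another short utterance."},
]

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["audio_path", "transcript"])
    writer.writeheader()
    writer.writerows(rows)
```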
The output file contains one JSON object per line:
{"text": "Speaker 0: Hello, this is a sample transcription.", "audio": "/path/to/audio.wav"}
{"text": "Speaker 0: Another example with voice prompt.", "audio": "/path/to/audio2.wav", "voice_prompts": "/path/to/prompt.wav"}For multi-speaker training, include multiple voice prompts:
{"text": "Speaker 0: Hello there!\nSpeaker 1: Hi, how are you?", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_prompt.wav", "/path/to/speaker1_prompt.wav"]}A simplified training script with sensible defaults.
A simplified training script with sensible defaults.

Quick test (1 epoch, faster):

```bash
python 03-train.py --dataset data/dataset.jsonl --preset fast
```

Standard training (1.5B model):

```bash
python 03-train.py --dataset data/dataset.jsonl --model 1.5B
```

High quality (7B model):

```bash
python 03-train.py --dataset data/dataset.jsonl --model 7B --preset quality
```

Resume from checkpoint:

```bash
python 03-train.py --dataset data/dataset.jsonl --resume_from_checkpoint ./output/checkpoint-500
```

| Preset | Description | Epochs | Learning Rate | Use Case |
|---|---|---|---|---|
| `fast` | Quick training | 1 | 5e-5 | Testing the pipeline |
| `default` | Balanced | 5 | 2.5e-5 | General fine-tuning |
| `quality` | Best quality | 10 | 1e-5 | Production models |
| Model | VRAM Required | Batch Size | Description |
|---|---|---|---|
| `1.5B` | 16GB | 4 | Smaller, faster training |
| `7B` | 48GB | 2 | Higher quality output |
| Option | Description | Default |
|---|---|---|
| `--dataset` | Training dataset JSONL file | Required |
| `--val_dataset` | Validation dataset JSONL | None |
| `--model` | Model size (1.5B/7B) | 1.5B |
| `--preset` | Training preset (fast/default/quality) | default |
| `--output_dir` | Output directory | output |
| `--num_epochs` | Number of epochs | From preset |
| `--learning_rate` | Learning rate | From preset |
| `--batch_size` | Batch size per device | From preset |
| `--gradient_accumulation` | Gradient accumulation steps | From preset |
| `--lora_r` | LoRA rank | 8 |
| `--lora_alpha` | LoRA alpha | 32 |
| `--wandb` | Enable Weights & Biases logging | False |
| `--tensorboard` | Enable TensorBoard logging | False |
| `--gradient_checkpointing` | Save memory (slower) | False |
| `--train_diffusion_head` | Train diffusion head | True |
| `--resume_from_checkpoint` | Resume from checkpoint path | None |
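For reference, `--lora_r` and `--lora_alpha` correspond to a standard PEFT LoRA configuration, roughly like the sketch below. The target modules and dropout shown here are assumptions about a Qwen2-style backbone, not values taken from the fine-tuning repository:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                 # --lora_r
    lora_alpha=32,       # --lora_alpha
    lora_dropout=0.05,   # assumed value, not a documented default
    # Typical attention projections for a Qwen2-style LLM (assumption):
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```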
After setup, your directory will look like:
```
VibeVoice-finetune/
├── 01-setup.sh               # Setup script
├── 02-prepare_dataset.py     # Dataset preparation
├── 03-train.py               # Training script
├── activate_env.sh           # Environment activation
├── venv/                     # Python virtual environment
├── VibeVoice-finetuning/     # Original repository
├── models/                   # Downloaded models
│   ├── aoi-ot--VibeVoice-Base/
│   └── aoi-ot--VibeVoice-Large/
├── data/                     # Your data
│   ├── audio/                # Audio files
│   ├── dataset.jsonl         # Prepared dataset
│   ├── dataset.train.jsonl   # Training split
│   └── dataset.val.jsonl     # Validation split
├── output/                   # Training outputs
│   ├── checkpoint-XXX/       # Checkpoints
│   └── lora/                 # Final LoRA adapter
├── docs/
│   └── diagrams/             # Documentation diagrams
│       ├── workflow-overview.svg
│       ├── architecture.svg
│       └── dataset-pipeline.svg
└── hf_cache/                 # HuggingFace cache
```
```bash
# 1. Setup everything
bash 01-setup.sh

# 2. Activate environment
source activate_env.sh

# 3. Copy your audio files
cp -r /path/to/my/audio/* data/audio/

# 4. Prepare dataset (with Whisper transcription)
python 02-prepare_dataset.py \
    --audio_dir data/audio \
    --auto_transcribe \
    --whisper_model base \
    --output data/my_dataset.jsonl

# 5. Train the model
python 03-train.py \
    --dataset data/my_dataset.jsonl \
    --model 1.5B \
    --preset default \
    --output_dir output/my_model

# 6. Your model is ready!
# LoRA adapter saved at: output/my_model/lora/
```

- Use high-quality audio files (24kHz sample rate preferred; see the resampling sketch after this list)
- Clean audio with minimal background noise works best
- Consistent volume levels across samples
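A simple way to get there is to batch-convert your recordings with ffmpeg (installed by the setup script). This sketch converts everything in a hypothetical `raw_audio/` folder to 24 kHz WAV; downmixing to mono is an extra assumption, so drop `-ac 1` if you want to keep the original channels:

```python
import subprocess
from pathlib import Path

SRC = Path("raw_audio")    # your original recordings (hypothetical folder)
DST = Path("data/audio")   # where the converted files go
DST.mkdir(parents=True, exist_ok=True)

for src in sorted(SRC.iterdir()):
    if src.suffix.lower() not in {".wav", ".mp3", ".flac", ".m4a", ".ogg"}:
        continue
    dst = DST / (src.stem + ".wav")
    # -ar 24000: resample to 24 kHz, -ac 1: downmix to mono
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-ar", "24000", "-ac", "1", str(dst)],
        check=True,
    )
```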
- Minimum: 50-100 samples for basic fine-tuning
- Good: 500-1000 samples for decent quality
- Excellent: 5000+ samples for best results
| Model | Dataset Size | Preset | Estimated Time (A100) |
|---|---|---|---|
| 1.5B | 100 samples | fast | ~5 minutes |
| 1.5B | 1000 samples | default | ~2 hours |
| 7B | 1000 samples | default | ~6 hours |
| 7B | 5000 samples | quality | ~30 hours |
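Actual runtimes depend on your GPU and settings, but you can roughly estimate the amount of work yourself: the number of optimizer steps is samples × epochs divided by the effective batch size (batch size × gradient accumulation). The values below are placeholders, not the presets' exact defaults:

```python
import math

num_samples = 1000   # size of your dataset
epochs = 5           # e.g. the "default" preset
batch_size = 4       # per-device batch size (1.5B default above)
grad_accum = 8       # placeholder; the real value comes from the preset or --gradient_accumulation

steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
print(f"{steps_per_epoch} steps/epoch, {steps_per_epoch * epochs} optimizer steps total")
```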
If you run out of VRAM:

- Enable gradient checkpointing:

  ```bash
  python 03-train.py --dataset data/dataset.jsonl --gradient_checkpointing
  ```

- Reduce batch size and compensate with gradient accumulation (the effective batch size is batch_size × gradient_accumulation):

  ```bash
  python 03-train.py --dataset data/dataset.jsonl --batch_size 1 --gradient_accumulation 64
  ```

- Use a smaller model:

  ```bash
  python 03-train.py --dataset data/dataset.jsonl --model 1.5B
  ```
For multi-speaker datasets:

- Format your text with speaker labels:

  ```
  Speaker 0: First speaker's text
  Speaker 1: Second speaker's text
  ```

- Provide voice prompts for each speaker:

  ```bash
  python 02-prepare_dataset.py \
      --audio_dir data/multi_speaker \
      --voice_prompts_dir data/prompts \
      --output data/multi_speaker.jsonl
  ```
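If you assemble multi-speaker entries by hand rather than through the script, one line of the JSONL can be built like this (all paths here are hypothetical examples):

```python
import json

turns = [
    ("Speaker 0", "First speaker's text"),
    ("Speaker 1", "Second speaker's text"),
]
entry = {
    "text": "\n".join(f"{speaker}: {utterance}" for speaker, utterance in turns),
    "audio": "data/multi_speaker/conversation_001.wav",   # hypothetical path
    "voice_prompts": [
        "data/prompts/speaker0.wav",                      # hypothetical paths
        "data/prompts/speaker1.wav",
    ],
}
with open("data/multi_speaker.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")
```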
Make sure you've activated the environment:

```bash
source activate_env.sh
```

- Use `--gradient_checkpointing`
- Reduce `--batch_size`
- Use a smaller model (1.5B instead of 7B)
The setup script should download models automatically. If not:

```bash
mkdir -p models
cd models
git lfs install
git clone https://huggingface.co/aoi-ot/VibeVoice-Base
git clone https://huggingface.co/aoi-ot/VibeVoice-Large
```
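If `git lfs` is not an option, the same repositories can be fetched with `huggingface_hub` (assuming it is installed in the environment), matching the `models/` layout shown earlier:

```python
from huggingface_hub import snapshot_download

# Download either (or both) models into the local models/ directory.
snapshot_download(repo_id="aoi-ot/VibeVoice-Base", local_dir="models/aoi-ot--VibeVoice-Base")
snapshot_download(repo_id="aoi-ot/VibeVoice-Large", local_dir="models/aoi-ot--VibeVoice-Large")
```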
Make sure ffmpeg is installed:

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

For users who prefer a cloud-based, zero-setup approach, we provide a fully-featured Google Colab notebook.
- Interactive Setup: One-click environment setup
- Multiple Data Sources: Upload directly, use Google Drive, or HuggingFace datasets
- Auto-Transcription: Built-in Whisper integration
- Visual Configuration: Sliders and dropdowns for all training parameters
- Progress Monitoring: Real-time training progress
- Easy Export: Save to Google Drive or download directly
- 🆕 Beginners new to ML fine-tuning
- 💻 Users without local GPU access
- ⚡ Quick experiments and prototyping
- 📚 Educational purposes
| Model | Colab GPU | Status |
|---|---|---|
| VibeVoice 1.5B | T4 (Free tier) | ✅ Works well |
| VibeVoice 1.5B | A100 (Pro) | ✅ Fast training |
| VibeVoice 7B | T4 (Free tier) | |
| VibeVoice 7B | A100 (Pro) | ✅ Recommended |
- Original VibeVoice: Microsoft VibeVoice
- Fine-tuning repository: VibeVoice-finetuning
- LoRA implementation: PEFT
This wrapper script is provided as-is for easier use of VibeVoice fine-tuning. Please refer to the original repositories for their respective licenses.