
Qwen3-TTS One-Command Fine-Tuning


End-to-end automation for fine-tuning Qwen3-TTS with your own voice samples.

An automated pipeline that handles everything from environment setup to model training. Provide a directory of WAV files and a reference audio clip, and the tool sets up the environment, transcribes your audio, prepares the dataset, and fine-tunes the Qwen3-TTS model to clone your voice.

Features

  • True One-Command – Single command handles setup, transcription, and training
  • Automatic Setup – Environment and dependencies installed automatically on first run
  • Automatic Transcription – Uses WhisperX to transcribe your audio files in any of 99 languages
  • Complete Pipeline – Handles transcription, JSONL creation, data preparation, and training
  • Multi-Language Support – Works with any language supported by Whisper
  • Flash Attention Fallback – Automatically uses flash_attention_2 if available, falls back to eager attention
  • Isolated Environment – HuggingFace models cached in venv/hf_cache/ for portability
  • GPU Optimized – CUDA support for faster training and inference

Architecture

Architecture Diagram (see docs/diagrams/architecture.svg)

The system has three main parts:

  1. Auto-Setup Phase: Detects whether the environment is ready, creates the virtual environment, installs PyTorch (with CUDA detection), and downloads all dependencies
  2. Training Pipeline: Validates the audio, transcribes it with WhisperX, creates the JSONL files, extracts audio codes, and fine-tunes the model
  3. External Models: Integrates WhisperX for transcription and HuggingFace for the Qwen3-TTS models

Data Flow

Data Flow Diagram (see docs/diagrams/data_flow.svg)

The pipeline processes your audio through 6 steps:

  1. Audio Validation – Verifies all WAV files are loadable
  2. WhisperX Transcription – Converts speech to text
  3. JSONL Creation – Creates train_raw.jsonl with audio paths and transcripts
  4. Audio Encoding – Extracts audio_codes using Qwen3 Tokenizer (16-layer codec)
  5. JSONL with Codes – Creates train_with_codes.jsonl for training (both record formats are sketched after this list)
  6. Model Fine-Tuning – Trains Qwen3-TTS on your voice
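
To make steps 3-5 concrete, here is a hypothetical sketch of the two record shapes; the actual field names used in train_raw.jsonl and train_with_codes.jsonl may differ.

# Hypothetical record shapes for the two JSONL files - illustration only;
# the pipeline's actual field names and code layout may differ.
import json

raw = {
    "audio": "my_audio_files/sample1.wav",       # path to the training clip
    "text": "Transcript produced by WhisperX.",  # output of step 2
}
with_codes = dict(raw, audio_codes=[[512, 7], [88, 1024]])  # codec tokens from step 4

for path, record in [("output/train_raw.jsonl", raw),
                     ("output/train_with_codes.jsonl", with_codes)]:
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")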

Quick Start

Prerequisites

  • Python 3.12
  • CUDA 12.x (for GPU support, ~16GB VRAM recommended)
  • SoX audio library (required by qwen-tts)
  • ~10GB disk space for models

Install SoX:

# Ubuntu/Debian
sudo apt install sox libsox-fmt-all

# RHEL/CentOS
sudo yum install sox

# macOS
brew install sox

Pinned Versions

The setup uses specific tested versions:

  • PyTorch: 2.8.0 (CUDA 12.8)
  • flash-attn: 2.8.1 (pre-built wheel for Python 3.12 + CUDA 12 + PyTorch 2.8)

One-Command Training

# Navigate to the project directory
cd /path/to/Qwen3-TTS-finetune

# Run training - setup happens automatically if needed!
./train.sh \
    --audio_dir ./my_audio_files \
    --ref_audio ./reference.wav \
    --speaker_name my_voice

That's it! The script automatically:

  1. Detects whether the environment is ready
  2. Runs setup if needed (non-interactive)
  3. Configures HuggingFace cache
  4. Runs the complete training pipeline

Manual Setup (Optional)

If you prefer to set up the environment separately:

# Interactive setup (prompts for model pre-download)
./setup.sh

# Non-interactive setup
./setup.sh --auto

What Gets Installed Automatically

The setup handles all dependencies:

Category       Packages
Core ML        torch 2.8.0, torchaudio 2.8.0, transformers, accelerate
TTS            qwen-tts (Qwen3-TTS models)
Audio          librosa, soundfile
Transcription  whisperx
Acceleration   flash-attn 2.8.1 (pre-built wheel, with fallback)
Utilities      tqdm, safetensors, datasets, huggingface-hub

Models are automatically downloaded from HuggingFace on first use (a manual pre-download sketch follows the list):

  • Qwen/Qwen3-TTS-Tokenizer-12Hz (~2GB)
  • Qwen/Qwen3-TTS-12Hz-1.7B-Base (~3.5GB)
  • WhisperX model (~3GB for large-v3)
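
To fetch the Qwen3-TTS models ahead of time instead, a minimal sketch using huggingface_hub (which the setup installs); setup.sh's optional pre-download may do this differently:

# Run inside the activated venv so the HF_HOME export from activate.sh
# routes the downloads into venv/hf_cache/.
from huggingface_hub import snapshot_download

for repo in ("Qwen/Qwen3-TTS-Tokenizer-12Hz", "Qwen/Qwen3-TTS-12Hz-1.7B-Base"):
    snapshot_download(repo_id=repo)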

Usage

Directory Structure

Your audio files should be organized as follows:

my_project/
├── my_audio_files/          # Directory containing training WAV files
│   ├── sample1.wav
│   ├── sample2.wav
│   └── ...
├── reference.wav            # Reference audio for speaker embedding
├── train_from_audio.py      # Main automation script
├── train.sh                 # Shell wrapper (entry point)
└── setup.sh                 # Setup script (called automatically)

Basic Usage

# One command does everything!
./train.sh \
    --audio_dir ./my_audio_files \
    --ref_audio ./reference.wav \
    --speaker_name my_voice

Advanced Options

./train.sh \
    --audio_dir ./my_audio_files \
    --ref_audio ./reference.wav \
    --speaker_name my_voice \
    --output_dir ./my_output \
    --batch_size 4 \
    --lr 1e-5 \
    --epochs 5 \
    --whisper_model large-v3 \
    --language en

Configuration Options

Option           Description                     Default
--audio_dir      Directory containing WAV files  required
--ref_audio      Path to reference audio file    required
--speaker_name   Name for the speaker            my_speaker
--output_dir     Output directory                ./output
--device         Device to use                   cuda:0
--batch_size     Training batch size             2
--lr             Learning rate                   2e-5
--epochs         Number of training epochs       3
--whisper_model  Whisper model size              large-v3
--language       Language code or auto           auto

Output

After completion, you'll find:

output/
├── checkpoint-epoch-0/     # Checkpoint after epoch 0
├── checkpoint-epoch-1/     # Checkpoint after epoch 1
├── checkpoint-epoch-2/     # Checkpoint after epoch 2 (use this)
├── train_raw.jsonl         # Raw training data
└── train_with_codes.jsonl  # Data with audio codes

Inference

After training, use your fine-tuned model:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

device = "cuda:0"
tts = Qwen3TTSModel.from_pretrained(
    "output/checkpoint-epoch-2",
    device_map=device,
    dtype=torch.bfloat16,
    # Uses flash_attention_2 if available, falls back to eager
)

wavs, sr = tts.generate_custom_voice(
    text="Hello, this is a test.",
    speaker="my_voice",
)
sf.write("output.wav", wavs[0], sr)

Tips for Best Results

  1. Audio Quality: Use clean, high-quality recordings (16kHz or higher)
  2. Reference Audio: Choose a clear, representative sample as your reference
  3. Dataset Size: 10-100 samples work well for single-speaker fine-tuning
  4. Audio Length: 5-30 second clips are optimal (a check script follows these tips)
  5. Consistency: All audio should be from the same speaker in similar conditions
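
A quick way to check tips 1 and 4 before training, using soundfile (installed by the setup); the thresholds simply mirror the guidelines above:

# Flag clips outside the suggested ranges (run inside the activated venv).
import glob
import soundfile as sf

for path in sorted(glob.glob("my_audio_files/*.wav")):
    info = sf.info(path)
    duration = info.frames / info.samplerate
    if not 5 <= duration <= 30 or info.samplerate < 16000:
        print(f"check {path}: {duration:.1f}s at {info.samplerate} Hz")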

Supported Languages

Whisper supports 99 languages. Common codes:

Code  Language
en    English
zh    Chinese
es    Spanish
fr    French
de    German
ja    Japanese
ko    Korean

Use auto for automatic language detection.
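
To spot-check transcription on one clip outside the pipeline, a minimal WhisperX sketch with an explicit language (the --language flag presumably feeds an equivalent setting internally):

# Standalone WhisperX transcription of a single clip.
import whisperx

model = whisperx.load_model("large-v3", "cuda", language="en")
audio = whisperx.load_audio("my_audio_files/sample1.wav")
result = model.transcribe(audio, batch_size=16)
print(" ".join(seg["text"].strip() for seg in result["segments"]))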

Troubleshooting

Out of Memory

Reduce batch size:

./train.sh --audio_dir ./audio --ref_audio ./ref.wav --batch_size 1

Slow Transcription

Use a smaller Whisper model:

./train.sh --audio_dir ./audio --ref_audio ./ref.wav --whisper_model base

Poor Transcription Quality

Specify the language explicitly:

./train.sh --audio_dir ./audio --ref_audio ./ref.wav --language en

Flash Attention Not Available

The system automatically falls back to eager attention if flash_attn cannot be installed. You'll see this message during setup:

flash-attn installation failed - will use eager attention (slower but compatible)

This is normal on some systems. Training will still work, just slightly slower.
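
For reference, a minimal sketch of how such a fallback can be chosen in Python; the project's actual detection logic may differ:

# Pick flash_attention_2 only when the flash_attn package is importable;
# otherwise fall back to eager attention.
import importlib.util

attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "eager"
print(f"Using attention implementation: {attn_impl}")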

Setup Issues

To run setup manually with verbose output:

bash -x setup.sh --auto

Complete One-Command Example

From a fresh clone to a trained model:

# Clone and enter the project
git clone <repo-url>
cd Qwen3-TTS-finetune

# Prepare your audio files in ./my_audio_files/
# Place reference.wav in the project root

# One command does everything!
./train.sh \
    --audio_dir ./my_audio_files \
    --ref_audio ./reference.wav \
    --speaker_name alice \
    --epochs 3

# Your trained model will be in output/checkpoint-epoch-2/

Virtual Environment

The setup creates a virtual environment in venv/ with an isolated HuggingFace cache:

venv/
├── bin/
│   └── activate          # Contains HF_HOME exports
├── lib/
├── hf_cache/             # HuggingFace models stored here
│   ├── transformers/
│   └── datasets/
└── ...

To manage it manually:

# Activate (only needed if running Python scripts directly)
source activate.sh
# or
source venv/bin/activate

# Deactivate
deactivate

# Delete and recreate (train.sh will auto-setup again)
rm -rf venv/
./train.sh --audio_dir ./audio --ref_audio ./ref.wav
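
To confirm the isolated cache is in effect after activation, a one-line check (assumes activate.sh exported HF_HOME as described above):

# Should print a path ending in venv/hf_cache when the venv is active.
import os
print(os.environ.get("HF_HOME", "HF_HOME not set"))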

Project Structure

Qwen3-TTS-finetune/
├── README.md                   # This file
├── CLAUDE.md                   # Claude Code instructions
├── setup.sh                    # Auto-setup script (--auto for non-interactive)
├── train.sh                    # Entry point - handles everything
├── activate.sh                 # Venv activation helper (created by setup)
├── train_from_audio.py         # Main end-to-end automation
├── dataset.py                  # Dataset class for training
├── prepare_data.py             # Data preparation script
├── sft_12hz.py                 # Training script
├── venv/                       # Virtual environment (created automatically)
│   └── hf_cache/               # HuggingFace cache (isolated)
├── docs/
│   └── diagrams/
│       ├── architecture.drawio # System architecture diagram (source)
│       ├── architecture.svg    # System architecture diagram (rendered)
│       ├── data_flow.drawio    # Data flow diagram (source)
│       └── data_flow.svg       # Data flow diagram (rendered)
└── output/                     # Training output (created during training)

License

This project includes code from the Qwen3-TTS project, which is licensed under the Apache License 2.0.
