A suckless, high-performance CLI tool for audio transcription.
Supports two engines:
- VibeVoice (`--engine vibevoice`, default) — Microsoft VibeVoice-ASR with timestamps, speaker diarization, and structured output.
- Omnilingual (`--engine omnilingual`) — Meta Omnilingual-ASR supporting 1600+ languages (plain text output).
It handles local files, direct URLs, and YouTube links with automatic VAD-based chunking for long audio.
- engines: VibeVoice-ASR (diarization, timestamps) and Omnilingual-ASR (1600+ languages).
- fast: Auto-detects Flash Attention 2 for maximum inference speed on CUDA.
- smart: Uses Silero VAD to intelligently split long audio (>30 min) at silence boundaries, preventing context loss and hallucinations.
- versatile:
  - Local files (`.wav`, `.mp3`, `.m4a`, etc.)
  - Direct URLs
  - YouTube support (via `yt-dlp`)
- formats: Exports to Text, JSON, or SRT (subtitles).
- diarization: Speaker-aware output with color-coded terminal display.
- context: Customizable hotwords/context for better accuracy on proper nouns.
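The VAD-based chunking described above can be sketched as a greedy pass over the speech segments a VAD reports, cutting only in the silence gaps between them so no chunk exceeds a duration budget. This is an illustrative sketch, not the tool's actual implementation; `chunk_at_silences` and its signature are invented for this example.

```python
def chunk_at_silences(speech_segments, max_chunk=30 * 60):
    """Group VAD speech segments into chunks cut only at silences.

    speech_segments: list of (start, end) times in seconds, as a VAD
    (e.g. Silero) would report them, sorted and non-overlapping.
    Returns a list of (chunk_start, chunk_end) spans, each at most
    max_chunk seconds long (unless a single segment exceeds it).
    """
    chunks = []
    chunk_start = None
    prev_end = None
    for start, end in speech_segments:
        if chunk_start is None:
            chunk_start = start
        elif end - chunk_start > max_chunk:
            # Adding this segment would blow the budget: cut in the
            # silence gap that precedes it, never mid-speech.
            chunks.append((chunk_start, prev_end))
            chunk_start = start
        prev_end = end
    if chunk_start is not None:
        chunks.append((chunk_start, prev_end))
    return chunks
```

For example, three segments spanning ~48 minutes with a 30-minute budget yield two chunks, split at the silence between the first and second segment.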
Requires Python 3.12+, FFmpeg, and uv.
- Clone:

  ```
  git clone https://github.com/federicotorrielli/asr.git
  cd asr
  ```

- Install:

  ```
  uv sync -U
  ```

Note: On CUDA systems, the tool will attempt to auto-install a pre-built `flash-attn` wheel specific to your torch/CUDA version on first run.

Note: The omnilingual engine requires `libsndfile` (`apt install libsndfile1` / `brew install libsndfile`).
```
uv run asr [INPUT] [OPTIONS]
```

Local file:

```
uv run asr interview.mp3
```

YouTube video (audio only):

```
uv run asr "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
```

Generate subtitles (SRT):

```
uv run asr movie.m4a -f srt -o movie.srt
```

With context (hotwords):

```
uv run asr meeting.wav -c "Project Gemini, DeepMind, API"
```

JSON output (machine readable):

```
uv run asr output.wav -f json | jq .
```

Omnilingual engine (1600+ languages):

```
uv run asr recording.wav -e omnilingual --lang eng_Latn
```

Omnilingual with Italian:

```
uv run asr audio_italiano.wav -e omnilingual --lang ita_Latn
```

| Option | Description |
|---|---|
| `INPUT` | Path to file, HTTP URL, or YouTube URL. |
| `-f, --format` | Output format: `text` (default), `json`, `srt`. |
| `-o, --output` | Save output to a specific file. |
| `-c, --context` | Hotwords/context string to guide the model (vibevoice only). |
| `-e, --engine` | Engine: `vibevoice` (default) or `omnilingual`. |
| `--lang` | Language code for omnilingual (e.g. `eng_Latn`, `ita_Latn`). |
| `--device` | Force device: `auto` (default), `cuda`, `mps`, `cpu`. |
| `--model` | Model name or path (default depends on engine). |
| `--no-timestamps` | Hide timestamps in text output (vibevoice only). |
| `--no-speakers` | Hide speaker IDs in text output (vibevoice only). |
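The `-f srt` output uses standard SubRip timestamps (`HH:MM:SS,mmm`). As a reference for post-processing your own transcripts, here is a minimal formatter for that timestamp shape; it is a standalone illustrative helper, not code taken from this tool.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)   # hours
    m, ms = divmod(ms, 60_000)      # minutes
    s, ms = divmod(ms, 1_000)       # seconds, leaving milliseconds
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```

For example, `srt_timestamp(3661.5)` produces `01:01:01,500`.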
`ffmpeg` must be installed and available in your `$PATH` (required for YouTube extraction and audio processing).
Running the full model in bfloat16 requires significant VRAM due to the 7B+ parameter backbone (Qwen2.5-7B) and long-context capabilities.
VRAM Calculation:
- Model Weights: ~15.6 GB (7.8B params @ 2 bytes/param)
- CUDA Runtime / Overhead: ~0.8 GB
- Activations (Flash Attn): ~1.2 GB
- KV Cache (30 min audio): ~1.0 GB (~15k tokens)
Total Required: ~18.6 GB
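The total is just the sum of these terms; reproduced as arithmetic for a quick sanity check (the per-component figures are the estimates above, not measurements):

```python
# Reproducing the VRAM estimate above (all values in GB).
params = 7.8e9                  # backbone parameter count
weights_gb = params * 2 / 1e9   # bfloat16 = 2 bytes/param -> 15.6
overhead_gb = 0.8               # CUDA runtime / overhead
activations_gb = 1.2            # activations with Flash Attention
kv_cache_gb = 1.0               # KV cache, ~15k tokens for 30 min audio
total_gb = weights_gb + overhead_gb + activations_gb + kv_cache_gb
print(f"{total_gb:.1f} GB")     # 18.6 GB
```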
| Component | Recommendation |
|---|---|
| VRAM | 24 GB (RTX 3090, 4090, or a comparable server GPU). 16 GB cards (4080, 4060 Ti) will OOM unless you use 8-bit/4-bit quantization. |
| System RAM | 32 GB+ (Only relevant if falling back to CPU, which is very slow). |