Commit b34d9ea

feat(voice): add whisper-compatible STT flow for voice/audio messages
1 parent d34aa9c commit b34d9ea

17 files changed

Lines changed: 1151 additions & 252 deletions

PRODUCT.md

Lines changed: 5 additions & 2 deletions
@@ -46,6 +46,7 @@ No public inbound ports are required for normal usage.
 ### Task handling
 
 - Send text prompts to OpenCode
+- Accept voice/audio messages, transcribe via Whisper-compatible STT API, and forward recognized text as prompts
 - Interrupt current task (ESC equivalent)
 - Handle OpenCode questions with inline options and custom text answers
 - Send selected/custom answers back to OpenCode (`question.reply`)
@@ -80,6 +81,7 @@ No public inbound ports are required for normal usage.
 - Configurable bot locale
 - Configurable visibility for service messages (thinking/tool calls)
 - Configurable max code file size in KB (default: 100)
+- Optional STT settings for voice transcription (`STT_API_URL`, `STT_API_KEY`, `STT_MODEL`, `STT_LANGUAGE`)
 
 ## Current Product Scope
 
@@ -99,7 +101,7 @@ Current command set:
 - [x] `/opencode_stop` - stop local OpenCode server
 - [x] `/help` - show command help
 
-Text messages (non-commands) are treated as prompts for OpenCode only when no blocking interaction is active.
+Text messages (non-commands) are treated as prompts for OpenCode only when no blocking interaction is active. Voice/audio messages are transcribed and then sent as prompts when STT is configured.
 
 Interaction routing rules:
 
@@ -123,6 +125,7 @@ Interaction routing rules:
 - [x] Sending code blocks as files when needed
 - [x] Configurable batching of service messages (thinking + tool updates): recommended `>=2` sec for Telegram rate limits; `0` = immediate
 - [x] Configurable service message visibility via env flags (`HIDE_THINKING_MESSAGES`, `HIDE_TOOL_CALL_MESSAGES`)
+- [x] Voice/audio transcription via Whisper-compatible APIs (OpenAI/Groq/Together and compatible providers)
 - [x] Single-user security model (allowed Telegram user ID)
 - [x] Persistent bot settings (`settings.json`) between restarts
 - [x] EN/RU localization structure via dedicated i18n files
@@ -138,7 +141,7 @@ Open tasks for upcoming iterations:
 - [ ] Improve Telegram-compatible message formatting for richer outputs
 - [ ] Support sending files from Telegram to OpenCode (screenshots, documents)
 - [ ] Provide a Docker image and basic container deployment guide
-- [ ] Add voice transcription
+- [x] Add voice transcription
 
 ## Possible Improvements
 
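Note on the optional STT settings: the PRODUCT.md changes above list four environment variables and say voice messages are forwarded "when STT is configured". A minimal sketch of how that check could look is given here, assuming the rule stated in this commit's README diff (both `STT_API_URL` and `STT_API_KEY` must be set) and the documented default model. The `SttConfig` type and `readSttConfig` function are hypothetical names for illustration, not identifiers taken from the commit.

```typescript
// Hypothetical config shape; field names are illustrative, not the bot's real types.
interface SttConfig {
  apiUrl: string;
  apiKey: string;
  model: string;
  language?: string; // omitted means the provider auto-detects the language
}

// Default documented in the README table of this commit.
const DEFAULT_STT_MODEL = "whisper-large-v3-turbo";

// Returns null when STT is not configured (URL or key missing or blank).
function readSttConfig(env: Record<string, string | undefined>): SttConfig | null {
  const apiUrl = (env.STT_API_URL ?? "").trim();
  const apiKey = (env.STT_API_KEY ?? "").trim();
  if (apiUrl === "" || apiKey === "") {
    return null;
  }

  const config: SttConfig = {
    // Drop trailing slashes so a path can be appended without doubling "/".
    apiUrl: apiUrl.replace(/\/+$/, ""),
    apiKey,
    model: (env.STT_MODEL ?? "").trim() || DEFAULT_STT_MODEL,
  };

  const language = (env.STT_LANGUAGE ?? "").trim();
  if (language !== "") {
    config.language = language;
  }
  return config;
}
```

In a Node.js process this would typically be called as `readSttConfig(process.env)`; a `null` result corresponds to the case where the bot asks the user to configure STT.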
README.md

Lines changed: 46 additions & 18 deletions
@@ -25,6 +25,7 @@ Quick start: `npx @grinev/opencode-telegram-bot`
 - **Model switching** — pick any model from your OpenCode favorites directly in the chat
 - **Agent modes** — switch between Plan and Build modes on the fly
 - **Interactive Q&A** — answer agent questions and approve permissions via inline buttons
+- **Voice prompts** — send voice/audio messages, transcribe them via a Whisper-compatible API, then forward recognized text to OpenCode
 - **Context control** — compact context when it gets too large, right from the chat
 - **Security** — strict user ID whitelist; no one else can access your bot, even if they find it
 - **Localization** — English and Russian UI (`BOT_LOCALE=en|ru`)
@@ -102,7 +103,7 @@ opencode-telegram config
 | `/opencode_stop` | Stop the OpenCode server remotely |
 | `/help` | Show available commands |
 
-Any regular text message is sent as a prompt to the coding agent only when no blocking interaction is active.
+Any regular text message is sent as a prompt to the coding agent only when no blocking interaction is active. Voice/audio messages are transcribed and then sent as prompts when STT is configured.
 
 ### Interaction Rules
 
@@ -124,26 +125,53 @@ When installed via npm, the configuration wizard handles the initial setup. The
 - **Windows:** `%APPDATA%\opencode-telegram-bot\.env`
 - **Linux:** `~/.config/opencode-telegram-bot/.env`
 
-| Variable | Description | Required | Default |
-| ------------------------------- | ------------------------------------------------------------------------------------------------------------ | :------: | ----------------------- |
-| `TELEGRAM_BOT_TOKEN` | Bot token from @BotFather | Yes | |
-| `TELEGRAM_ALLOWED_USER_ID` | Your numeric Telegram user ID | Yes | |
-| `TELEGRAM_PROXY_URL` | Proxy URL for Telegram API (SOCKS5/HTTP) | No | |
-| `OPENCODE_API_URL` | OpenCode server URL | No | `http://localhost:4096` |
-| `OPENCODE_SERVER_USERNAME` | Server auth username | No | `opencode` |
-| `OPENCODE_SERVER_PASSWORD` | Server auth password | No | |
-| `OPENCODE_MODEL_PROVIDER` | Default model provider | Yes | `opencode` |
-| `OPENCODE_MODEL_ID` | Default model ID | Yes | `big-pickle` |
-| `BOT_LOCALE` | Bot UI language (`en` or `ru`) | No | `en` |
-| `SESSIONS_LIST_LIMIT` | Max sessions shown in `/sessions` | No | `10` |
-| `SERVICE_MESSAGES_INTERVAL_SEC` | Service messages interval (thinking + tool calls); keep `>=2` to avoid Telegram rate limits, `0` = immediate | No | `5` |
-| `HIDE_THINKING_MESSAGES` | Hide `💭 Thinking...` service messages | No | `false` |
-| `HIDE_TOOL_CALL_MESSAGES` | Hide tool-call service messages (`💻 bash ...`, `📖 read ...`, etc.) | No | `false` |
-| `CODE_FILE_MAX_SIZE_KB` | Max file size (KB) to send as document | No | `100` |
-| `LOG_LEVEL` | Log level (`debug`, `info`, `warn`, `error`) | No | `info` |
+| Variable | Description | Required | Default |
+| ------------------------------- | ------------------------------------------------------------------------------------------------------------ | :------: | ------------------------ |
+| `TELEGRAM_BOT_TOKEN` | Bot token from @BotFather | Yes | |
+| `TELEGRAM_ALLOWED_USER_ID` | Your numeric Telegram user ID | Yes | |
+| `TELEGRAM_PROXY_URL` | Proxy URL for Telegram API (SOCKS5/HTTP) | No | |
+| `OPENCODE_API_URL` | OpenCode server URL | No | `http://localhost:4096` |
+| `OPENCODE_SERVER_USERNAME` | Server auth username | No | `opencode` |
+| `OPENCODE_SERVER_PASSWORD` | Server auth password | No | |
+| `OPENCODE_MODEL_PROVIDER` | Default model provider | Yes | `opencode` |
+| `OPENCODE_MODEL_ID` | Default model ID | Yes | `big-pickle` |
+| `BOT_LOCALE` | Bot UI language (`en` or `ru`) | No | `en` |
+| `SESSIONS_LIST_LIMIT` | Max sessions shown in `/sessions` | No | `10` |
+| `SERVICE_MESSAGES_INTERVAL_SEC` | Service messages interval (thinking + tool calls); keep `>=2` to avoid Telegram rate limits, `0` = immediate | No | `5` |
+| `HIDE_THINKING_MESSAGES` | Hide `💭 Thinking...` service messages | No | `false` |
+| `HIDE_TOOL_CALL_MESSAGES` | Hide tool-call service messages (`💻 bash ...`, `📖 read ...`, etc.) | No | `false` |
+| `CODE_FILE_MAX_SIZE_KB` | Max file size (KB) to send as document | No | `100` |
+| `STT_API_URL` | Whisper-compatible API base URL (enables voice/audio transcription) | No | |
+| `STT_API_KEY` | API key for your STT provider | No | |
+| `STT_MODEL` | STT model name passed to `/audio/transcriptions` | No | `whisper-large-v3-turbo` |
+| `STT_LANGUAGE` | Optional language hint (empty = provider auto-detect) | No | |
+| `LOG_LEVEL` | Log level (`debug`, `info`, `warn`, `error`) | No | `info` |
 
 > **Keep your `.env` file private.** It contains your bot token. Never commit it to version control.
 
+### Voice and Audio Transcription (Optional)
+
+If `STT_API_URL` and `STT_API_KEY` are set, the bot will:
+
+1. Accept `voice` and `audio` Telegram messages
+2. Transcribe them via `POST {STT_API_URL}/audio/transcriptions`
+3. Show recognized text in chat
+4. Send the recognized text to OpenCode as a normal prompt
+
+Supported provider examples (Whisper-compatible):
+
+- **OpenAI**
+  - `STT_API_URL=https://api.openai.com/v1`
+  - `STT_MODEL=whisper-1`
+- **Groq**
+  - `STT_API_URL=https://api.groq.com/openai/v1`
+  - `STT_MODEL=whisper-large-v3-turbo`
+- **Together**
+  - `STT_API_URL=https://api.together.xyz/v1`
+  - `STT_MODEL=openai/whisper-large-v3`
+
+If STT variables are not set, voice/audio transcription is disabled and the bot will ask you to configure STT.
+
 ### Model Configuration
 
 The bot picks up your **favorite models** from OpenCode. To add a model to favorites:
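
Note on step 2 of the new README section (`POST {STT_API_URL}/audio/transcriptions`): the sketch below shows what such a request could look like against an OpenAI-style transcription endpoint, which takes a multipart form with `file` and `model` fields, an optional `language` field, and a bearer token, and returns JSON with a `text` field by default. It is an illustration under those assumptions, not the code from this commit. `SttOptions`, `buildTranscriptionRequest`, `transcribe`, and the `voice.ogg` file name in the usage note are hypothetical, and the sketch relies on the global `fetch`, `FormData`, and `Blob` available in Node.js 18 or later.

```typescript
// Hypothetical option names; only the env var meanings come from the README.
interface SttOptions {
  apiUrl: string; // base URL, e.g. the value of STT_API_URL
  apiKey: string; // value of STT_API_KEY
  model: string; // value of STT_MODEL
  language?: string; // value of STT_LANGUAGE, if any
}

interface TranscriptionRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: FormData;
}

// Builds the multipart request without sending it, which keeps it easy to test.
function buildTranscriptionRequest(
  opts: SttOptions,
  audio: Blob,
  filename: string,
): TranscriptionRequest {
  const form = new FormData();
  form.append("file", audio, filename);
  form.append("model", opts.model);
  if (opts.language) {
    form.append("language", opts.language);
  }
  return {
    // Strip trailing slashes from the base URL before appending the path.
    url: `${opts.apiUrl.replace(/\/+$/, "")}/audio/transcriptions`,
    method: "POST",
    // No Content-Type header: fetch sets the multipart boundary itself.
    headers: { Authorization: `Bearer ${opts.apiKey}` },
    body: form,
  };
}

// Sends the request and returns the recognized text.
async function transcribe(opts: SttOptions, audio: Blob, filename: string): Promise<string> {
  const { url, method, headers, body } = buildTranscriptionRequest(opts, audio, filename);
  const res = await fetch(url, { method, headers, body });
  if (!res.ok) {
    throw new Error(`STT request failed with HTTP ${res.status}`);
  }
  const data = (await res.json()) as { text?: unknown };
  if (typeof data.text !== "string") {
    throw new Error("STT response did not contain a text field");
  }
  return data.text;
}
```

A caller would download the Telegram file first, wrap the bytes in a `Blob`, and then call something like `await transcribe(opts, audioBlob, "voice.ogg")` before forwarding the returned text as a prompt. Separating request construction from the network call is a design choice of this sketch so that the URL and form fields can be checked without a live provider.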
