Use of transcribe feature for audio - speaker detection? #1455
Replies: 2 comments
Hey @Stryfe-Delivery, short answer: not currently supported in markitdown. MarkItDown's audio pipeline uses speech_recognition for basic speech-to-text only; it outputs a flat transcript with no speaker labels, and there is no diarization layer in the pipeline.

Workaround for speaker detection: run a dedicated diarization tool before or after markitdown. Once you have a diarized transcript, you can feed the resulting text into markitdown or directly to your LLM.

This is a feature gap worth opening as a feature request on the markitdown repo: the speech_recognition backend would need to be replaced or augmented with a diarization-capable ASR pipeline.

👍 If this helped you, please mark it as the answer so others in the community who run into the same issue can find the solution faster!
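To illustrate the "feed the resulting text into markitdown or directly to your LLM" step: once an external diarization tool gives you labeled segments, stitching them into a markdown transcript is a few lines of Python. A minimal sketch, assuming a simple `(speaker, text)` segment format (the `to_markdown_transcript` helper and the segment format are hypothetical, not part of markitdown):

```python
# Sketch: merge diarized segments (from an external diarization tool)
# into a markdown dialogue transcript. The (speaker, text) tuple format
# is an assumption -- adapt it to whatever your diarization tool emits.

def to_markdown_transcript(segments):
    """Render [(speaker, text), ...] as a markdown dialogue, one
    bolded speaker label per utterance."""
    lines = []
    for speaker, text in segments:
        lines.append(f"**{speaker}:** {text.strip()}")
    return "\n\n".join(lines)

segments = [
    ("SPEAKER_00", "Welcome everyone, let's get started."),
    ("SPEAKER_01", "Thanks. First item is the release schedule."),
]
print(to_markdown_transcript(segments))
```

The resulting markdown can be handed straight to an LLM, or saved and run through markitdown alongside your other documents.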
I don't think so. From the code, MarkItDown does basic audio transcription, but I'm not seeing speaker diarization or speaker labels. So you can get a transcript, but not automatic speaker detection. At the moment: transcription yes, speaker detection no.
In some tools (a simple example is OneNote) you can transcribe audio and it will flag each dialog line as [Speaker 1], [Speaker 2], etc. based on the speaker's inflection and some other metadata. You can then find/replace with the speaker names and have a full transcript of a meeting.
Is there any way to do that with this tool that I've missed, or is that a capability not yet available? The audio samples I tried came back transcribed, but with no indicators that the speaker had changed during the discussion.
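For what it's worth, the find/replace step described above is easy to script once a transcript carries generic labels. A minimal sketch, assuming a `[Speaker N]` label format like OneNote's (the `rename_speakers` helper is hypothetical, not part of any tool mentioned here):

```python
# Sketch: replace generic [Speaker N] labels in a transcript with real
# names. The label format is an assumption -- adjust it to match
# whatever your transcription tool actually emits.

def rename_speakers(transcript, names):
    """names maps a generic label like '[Speaker 1]' to a real name."""
    for label, name in names.items():
        transcript = transcript.replace(label, name)
    return transcript

transcript = "[Speaker 1] Hello there.\n[Speaker 2] Hi, good to see you."
named = rename_speakers(
    transcript,
    {"[Speaker 1]": "Alice:", "[Speaker 2]": "Bob:"},
)
print(named)
```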