Markitdown produces nothing? #184

zhangguochun · 2024-12-21T09:24:52Z

zhangguochun
Dec 21, 2024

Hello, does anyone have the same issue or know why this happens? I can't use markitdown to convert a PDF. It doesn't throw any exceptions either; it just outputs a blank result. Thank you for any clues.

```python
from markitdown import MarkItDown
from langchain_ollama import ChatOllama

client = ChatOllama(model="llama3.2")
# md = MarkItDown(llm_client=client)
md = MarkItDown()
result = md.convert("test.pdf")
print(result.text_content)
```

I also tested it from the command line:
markitdown test.pdf

-----
markitdown = "^0.0.1a3"
WSL

zhangguochun · 2024-12-31T05:02:50Z

zhangguochun
Dec 31, 2024
Author

The script is correct. The file "test.pdf" is an image-only PDF (every page is a picture with no text layer). When I test a PDF that contains real text, it works fine.

2 replies

timkinnane Jan 19, 2025

Is it possible to force parsing a pure picture PDF as an image file, instead of it being read as PDF from the file type and not being able to produce anything? I would essentially like to use markitdown for OCR on a PDF as if it were an image.

mecqmx Feb 9, 2026

That would require integration with Azure Document Intelligence. See https://realpython.com/python-markitdown/

arunbugkiller · 2025-06-13T05:29:53Z

arunbugkiller
Jun 13, 2025

@zhangguochun were you able to resolve this issue? I am facing the same problem: the result is a blank file.

1 reply

mecqmx Feb 9, 2026

He said it was because the PDF was scanned, so maybe your PDF is scanned or protected. In both cases, the converted output would be empty.

txhno · 2026-04-13T12:47:07Z

txhno
Apr 13, 2026

Most likely the PDF has little or no extractable text.

Plain MarkItDown works best on text PDFs. If the file is basically scanned images, blank output is expected.

In that case you would need an OCR path, for example the markitdown-ocr plugin or Azure Document Intelligence.
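
A quick way to confirm this diagnosis is to count how many non-whitespace characters come back from the conversion. The sketch below is only a heuristic: the helper names (`count_visible_chars`, `looks_image_only`, `needs_ocr`) and the 20-character threshold are made up for illustration and are not part of markitdown. The only markitdown calls it relies on are `MarkItDown().convert(...)` and `result.text_content`, as in the original post.

```python
def count_visible_chars(text: str) -> int:
    """Count characters that are not whitespace (page breaks count as whitespace)."""
    return sum(1 for ch in text if not ch.isspace())


def looks_image_only(text: str, min_chars: int = 20) -> bool:
    """Heuristic: almost no visible characters suggests a scanned, image-only PDF.

    min_chars is an arbitrary illustrative threshold, not a markitdown setting.
    """
    return count_visible_chars(text) < min_chars


def needs_ocr(pdf_path: str, min_chars: int = 20) -> bool:
    """Convert the PDF with plain MarkItDown and check whether any text came out."""
    # Imported lazily so the helpers above can be used without markitdown installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(pdf_path)
    return looks_image_only(result.text_content or "", min_chars)
```

Calling `needs_ocr("test.pdf")` on a scanned file should return `True`, which is the signal to switch to one of the OCR options mentioned above. A PDF with only a few words of real text could also trip the threshold, so treat the result as a hint rather than proof.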

0 replies

VANDRANKI · 2026-04-13T19:30:53Z

VANDRANKI
Apr 13, 2026

If the PDF is scanned (just images of pages), markitdown will produce empty or minimal output because there is no embedded text to extract. pdfminer, which markitdown uses under the hood, only reads text that is actually encoded in the PDF.

To get text from a scanned PDF you need OCR. You can run the PDF through something like tesseract or a cloud OCR service first, then pass the resulting text or a searchable PDF back through markitdown. There is also ongoing discussion about adding an OCR path directly, but for now that pre-processing step is needed.
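
As a rough sketch of that pre-processing step, the example below assumes the OCRmyPDF command-line tool (which is built on Tesseract) is installed and on the PATH. That tool choice is an assumption for illustration; nobody in this thread has confirmed using it. The names `build_ocr_command` and `ocr_then_convert` and the output filename are hypothetical. OCRmyPDF writes a searchable copy of the PDF, which is then passed to markitdown as usual.

```python
import shutil
import subprocess


def build_ocr_command(src: str, dst: str, language: str = "eng") -> list[str]:
    """Build the OCRmyPDF invocation that writes a searchable copy of src to dst.

    --skip-text leaves pages that already have a text layer untouched.
    """
    return ["ocrmypdf", "--skip-text", "-l", language, src, dst]


def ocr_then_convert(src: str, dst: str = "test_ocr.pdf") -> str:
    """Run OCR on a scanned PDF, then convert the searchable result with markitdown."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH (assumed to be installed)")

    # check=True raises CalledProcessError if OCR fails instead of continuing silently.
    subprocess.run(build_ocr_command(src, dst), check=True)

    # Imported lazily so build_ocr_command can be used without markitdown installed.
    from markitdown import MarkItDown

    return MarkItDown().convert(dst).text_content
```

With this in place, `print(ocr_then_convert("test.pdf"))` would replace the last two lines of the original script. OCR quality depends heavily on scan resolution and language settings, so the extracted text may still need manual cleanup.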

0 replies
