Markitdown produces nothing? #184
Replies: 4 comments 3 replies
-
|
The script is correct. The PDF file "test.pdf" is an image-only PDF. When I test a text-based PDF, it works fine. |
-
|
@zhangguochun were you able to resolve this issue? I am facing the same problem: the result is a blank file. |
-
|
Most likely the PDF has little or no extractable text. Plain MarkItDown works best on text PDFs. If the file is basically scanned images, blank output is expected. In that case you would need an OCR path, for example the |
-
|
If the PDF is scanned (just images of pages), markitdown will produce empty or minimal output because there is no embedded text to extract. pdfminer, which markitdown uses under the hood, only reads text that is actually encoded in the PDF. To get text from a scanned PDF you need OCR. You can run the PDF through something like tesseract or a cloud OCR service first, then pass the resulting text or a searchable PDF back through markitdown. There is also ongoing discussion about adding an OCR path directly, but for now that pre-processing step is needed. |
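A quick way to tell "scanned PDF" apart from a real conversion failure is to check how much visible text came back. This is only a rough sketch: `looks_scanned`, `convert_or_flag`, and the 20-character threshold are hypothetical names and values chosen for illustration, not part of markitdown. The only markitdown calls used are `MarkItDown().convert(path)` and the `text_content` attribute of its result, as shown in the project README.

```python
def looks_scanned(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic: treat the output as 'no embedded text' if almost
    nothing but whitespace was extracted. The threshold is arbitrary."""
    visible = "".join(extracted_text.split())  # drop all whitespace
    return len(visible) < min_chars


def convert_or_flag(path: str) -> str:
    """Convert a file with markitdown and fail loudly on blank output."""
    # Imported lazily so the heuristic above works without markitdown installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(path)
    text = result.text_content or ""
    if looks_scanned(text):
        # Probably image-only pages: run OCR first, then convert again.
        raise ValueError(f"{path}: little or no extractable text; try OCR first")
    return text
```

Calling `convert_or_flag("test.pdf")` on an image-only file should then raise instead of silently writing a blank file, which makes the cause easier to spot.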
-
Hello, does anyone have the same issue or know why this happens? I can't use markitdown to convert a PDF. It doesn't throw any exceptions either; it just outputs a blank result. Thank you for any clues.