fix: add LibreOffice fallback for reliable .doc file conversion #2065

Open
octo-patch wants to merge 1 commit into bytedance:main from octo-patch:fix/issue-2002-doc-file-conversion

@octo-patch
Contributor

Fixes #2002

Problem

Legacy .doc files are listed in CONVERTIBLE_EXTENSIONS, but their conversion is unreliable: MarkItDown has no dependable path for legacy Word .doc files (only .docx works well). When conversion fails silently, the upload appears to succeed but no usable markdown is produced, so the agent cannot read the file content.

Solution

Add an explicit .doc handling path in _do_convert():

  1. Try LibreOffice/soffice first: convert .doc to .docx in a temp directory, then run MarkItDown on the resulting .docx. This is the reliable path when LibreOffice is installed.
  2. Fall back to MarkItDown directly if soffice is not on PATH, with a clear warning log telling the user to install LibreOffice.
  3. Raise RuntimeError explicitly if the MarkItDown fallback also produces empty output, so convert_file_to_markdown() returns None instead of writing an empty .md file. The file upload itself still succeeds; only the markdown conversion is skipped.

.docx and all other formats are unchanged.
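
The three steps above could be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the names `try_convert_doc_to_docx`, `convert_doc`, and the `markitdown_convert` callable are hypothetical, and the 60-second timeout and the choice to also fall back when soffice fails (rather than only when it is missing) are assumptions.

```python
import logging
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def try_convert_doc_to_docx(
    doc_path: Path, out_dir: Path, timeout: float = 60.0
) -> Optional[Path]:
    """Return the converted .docx path, or None if soffice is missing or fails."""
    soffice = shutil.which("soffice")
    if soffice is None:
        return None
    try:
        result = subprocess.run(
            [soffice, "--headless", "--convert-to", "docx",
             "--outdir", str(out_dir), str(doc_path)],
            capture_output=True,
            timeout=timeout,  # assumed value; the PR does not state one
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    docx_path = out_dir / (doc_path.stem + ".docx")
    if result.returncode != 0 or not docx_path.exists():
        return None
    return docx_path


def convert_doc(doc_path: Path, markitdown_convert: Callable[[Path], str]) -> str:
    """Route a .doc file: soffice first, then direct conversion, else raise."""
    # Step 1: convert to .docx in a temp directory and convert that instead.
    with tempfile.TemporaryDirectory() as tmp:
        docx_path = try_convert_doc_to_docx(doc_path, Path(tmp))
        if docx_path is not None:
            return markitdown_convert(docx_path)

    # Step 2: no usable soffice result, so try the .doc file directly.
    logger.warning(
        "LibreOffice (soffice) unavailable or failed for %s; install "
        "LibreOffice for reliable .doc conversion.", doc_path.name
    )
    text = markitdown_convert(doc_path)

    # Step 3: refuse to report success on empty output.
    if not text.strip():
        raise RuntimeError(f"No markdown produced for {doc_path.name}")
    return text
```

A caller such as convert_file_to_markdown() would catch the RuntimeError and return None, so no empty .md file is written.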

Testing

  • Added TestTryConvertDocToDocx with 4 unit tests covering: soffice not found, soffice success, soffice non-zero exit, soffice timeout.
  • Added 4 new tests in TestDoConvert covering .doc routing: soffice available, soffice unavailable with content, soffice unavailable with empty output (raises).
  • All existing tests continue to pass (35 total in test_file_conversion.py).
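
For readers unfamiliar with this style of test, the four soffice cases might be exercised with `unittest.mock` along these lines. This is a sketch only: the helper defined here is a hypothetical stand-in so the example runs on its own, and the real tests in test_file_conversion.py may be structured differently.

```python
import shutil
import subprocess
import tempfile
import unittest
from pathlib import Path
from typing import Optional
from unittest import mock


def try_convert_doc_to_docx(doc_path: Path, out_dir: Path) -> Optional[Path]:
    """Hypothetical stand-in for the helper under test."""
    soffice = shutil.which("soffice")
    if soffice is None:
        return None
    try:
        result = subprocess.run(
            [soffice, "--headless", "--convert-to", "docx",
             "--outdir", str(out_dir), str(doc_path)],
            capture_output=True,
            timeout=60,
        )
    except subprocess.TimeoutExpired:
        return None
    docx_path = out_dir / (doc_path.stem + ".docx")
    if result.returncode == 0 and docx_path.exists():
        return docx_path
    return None


class TestTryConvertDocToDocx(unittest.TestCase):
    def setUp(self):
        tmp = tempfile.TemporaryDirectory()
        self.addCleanup(tmp.cleanup)
        self.out_dir = Path(tmp.name)
        self.doc = self.out_dir / "report.doc"

    def test_soffice_not_found(self):
        with mock.patch("shutil.which", return_value=None):
            self.assertIsNone(try_convert_doc_to_docx(self.doc, self.out_dir))

    def test_soffice_success(self):
        def fake_run(cmd, **kwargs):
            # Simulate soffice writing the converted file.
            (self.out_dir / "report.docx").write_bytes(b"")
            return subprocess.CompletedProcess(cmd, 0)

        with mock.patch("shutil.which", return_value="/usr/bin/soffice"), \
                mock.patch("subprocess.run", side_effect=fake_run):
            result = try_convert_doc_to_docx(self.doc, self.out_dir)
        self.assertEqual(result, self.out_dir / "report.docx")

    def test_soffice_nonzero_exit(self):
        failed = subprocess.CompletedProcess([], 1)
        with mock.patch("shutil.which", return_value="/usr/bin/soffice"), \
                mock.patch("subprocess.run", return_value=failed):
            self.assertIsNone(try_convert_doc_to_docx(self.doc, self.out_dir))

    def test_soffice_timeout(self):
        timeout = subprocess.TimeoutExpired(cmd="soffice", timeout=60)
        with mock.patch("shutil.which", return_value="/usr/bin/soffice"), \
                mock.patch("subprocess.run", side_effect=timeout):
            self.assertIsNone(try_convert_doc_to_docx(self.doc, self.out_dir))
```

Because `shutil.which` and `subprocess.run` are both patched, none of these cases requires LibreOffice to be installed on the machine running the tests.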

Development

Successfully merging this pull request may close these issues.

Legacy .doc files are advertised as supported, but conversion is not reliable

1 participant