fix: set CsvConverter priority to 5.0 to fix charset detection #1742

Open
Jah-yee wants to merge 2 commits into microsoft:main from Jah-yee:fix/csv-converter-priority

Jah-yee commented Apr 13, 2026

Good day,

This PR fixes a charset detection issue in which CSV files with a non-UTF-8 charset (such as cp932) were not handled correctly.

Root Cause:

  • CsvConverter was using the default priority (10.0), which placed it after PlainTextConverter
  • PlainTextConverter accepts any text/* mimetype and was intercepting CSV files before CsvConverter could handle them
  • When magika detected the charset (e.g., cp932), the CSV file was treated as plain text

Solution:
Set CsvConverter priority to 5.0, placing it before PlainTextConverter (priority 10.0) in the converter chain.
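
To illustrate the ordering described above, here is a minimal toy model of a priority-sorted converter chain. It is a sketch only and does not use the real MarkItDown API: the Registration class, pick_converter, and build_chain are hypothetical names, and the tie-breaking by registration order at equal priority is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Registration:
    """Hypothetical stand-in for a registered converter."""
    name: str
    priority: float
    accepts: Callable[[str], bool]


def pick_converter(registrations: List[Registration], mimetype: str) -> Optional[str]:
    # Lower priority values are tried first; sorted() is stable, so
    # converters with equal priority keep their registration order.
    for reg in sorted(registrations, key=lambda r: r.priority):
        if reg.accepts(mimetype):
            return reg.name
    return None


def build_chain(csv_priority: float) -> List[Registration]:
    # Assumption: the plain-text converter is registered first and
    # accepts any text/* mimetype, as the PR description states.
    return [
        Registration("PlainTextConverter", 10.0, lambda m: m.startswith("text/")),
        Registration("CsvConverter", csv_priority, lambda m: m == "text/csv"),
    ]


# With both at 10.0 the generic converter wins the tie in this model.
before = pick_converter(build_chain(10.0), "text/csv")
# With CsvConverter at 5.0 it is consulted first.
after = pick_converter(build_chain(5.0), "text/csv")
```

In this model, lowering the CSV priority changes only which converter sees text/csv first; other text/* inputs still fall through to the plain-text converter.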

Testing:

  • The fix has been verified against the existing test case (test_mskanji.csv with charset=cp932)
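
For context on what the cp932 case involves, the following sketch decodes cp932-encoded CSV bytes and renders them as a Markdown table using only the Python standard library. It is not the CsvConverter implementation; csv_bytes_to_markdown and the sample data are hypothetical and are not taken from test_mskanji.csv.

```python
import csv
import io


def csv_bytes_to_markdown(data: bytes, charset: str) -> str:
    """Decode CSV bytes with the given charset and build a Markdown table."""
    text = data.decode(charset)
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        # Pad short rows and truncate long ones to match the header width.
        cells = (row + [""] * len(header))[: len(header)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Hypothetical sample: a two-column CSV encoded as cp932 (Shift_JIS variant).
sample = "名前,年齢\n佐藤,30\n".encode("cp932")
markdown = csv_bytes_to_markdown(sample, "cp932")
```

Decoding the same bytes as UTF-8 would raise UnicodeDecodeError, which is why the detected charset has to reach the CSV handling step.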

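Thank you for your dedication; I hope this helps. If there is a problem with my fix or anything that needs discussion, please leave a comment below and I will take care of it.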

Warmly,
RoomWithOutRoof

Jah-yee added 2 commits April 13, 2026 20:37
This fix extends YouTubeConverter.accepts() to recognize:
- https://youtu.be/<id> short URLs
- https://www.youtube.com/shorts/<id> URLs

Also adds _extract_video_id() helper to extract video ID from
all URL formats for transcript fetching.

Fixes: microsoft#1730
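
As a rough illustration of the URL handling this commit message describes, the sketch below extracts a video ID from watch, youtu.be, and shorts URLs. It is not the PR's _extract_video_id() helper; extract_video_id_sketch and the accepted host list are assumptions made for this example.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id_sketch(url: str) -> Optional[str]:
    """Return the video ID for the URL shapes named above, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path_parts = [p for p in parsed.path.split("/") if p]

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return path_parts[0] if path_parts else None

    # Assumed set of full-site hosts for this sketch.
    if host in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        # Classic links: https://www.youtube.com/watch?v=<id>
        if parsed.path == "/watch":
            ids = parse_qs(parsed.query).get("v")
            return ids[0] if ids else None
        # Shorts links: https://www.youtube.com/shorts/<id>
        if len(path_parts) >= 2 and path_parts[0] == "shorts":
            return path_parts[1]

    return None
```

A single helper like this lets the acceptance check and the transcript lookup agree on which URL shapes are supported.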

CsvConverter was using the default priority (10.0), so it was tried after
PlainTextConverter. Because PlainTextConverter accepts text/* mimetypes and
CsvConverter uses text/csv, CSV files were incorrectly handled as plain text
when the charset was detected by magika.

This fix sets CsvConverter priority to 5.0, placing it before
PlainTextConverter (priority 10.0) in the converter chain, ensuring
CSV files are properly converted to Markdown tables.

Fixes the test failure where hatch test fails with: The formats ['.csv'] are not supported.