Skip to content

fix(tokenizers): raise ValueError instead of TypeError when Content-Length is absent#764

Open
xodn348 wants to merge 1 commit into
cohere-ai:mainfrom
xodn348:fix/tokenizer-config-size-none-type
Open

fix(tokenizers): raise ValueError instead of TypeError when Content-Length is absent#764
xodn348 wants to merge 1 commit into
cohere-ai:mainfrom
xodn348:fix/tokenizer-config-size-none-type

Conversation

@xodn348
Copy link
Copy Markdown

@xodn348 xodn348 commented May 6, 2026

Summary

_get_tokenizer_config_size retrieves the size of a remote tokenizer config via an HTTP HEAD request and converts the Content-Length (or x-goog-stored-content-length) header value to an integer. When a server responds with chunked transfer encoding it omits both headers, so size remains None. The subsequent int(typing.cast(int, size)) call raises a cryptic TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType' instead of a clear, actionable error. Although the caller already wraps the call in try/except Exception, the TypeError leaks a confusing message ("Failed to get the size of the tokenizer config: int() argument must…") that sends users debugging in the wrong direction. This PR replaces the implicit TypeError with an explicit ValueError that clearly names the root cause, and also removes the misleading typing.cast(int, size) annotation (headers always return str | None, never int).

Issue

Fixes #762

Local verification

$ cd /tmp/cohere-python
$ ruff check src/cohere/manually_maintained/tokenizers.py tests/test_tokenizer_config_size.py
All checks passed!

$ python -m pytest tests/test_tokenizer_config_size.py -v
============================= test session starts ==============================
platform linux -- Python 3.11.15, pytest-9.0.3, pluggy-1.6.0 -- /usr/local/bin/python
cachedir: .pytest_cache
rootdir: /tmp/cohere-python
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.13.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 4 items

tests/test_tokenizer_config_size.py::TestGetTokenizerConfigSize::test_content_length_header PASSED [ 25%]
tests/test_tokenizer_config_size.py::TestGetTokenizerConfigSize::test_goog_header_takes_priority_over_content_length PASSED [ 50%]
tests/test_tokenizer_config_size.py::TestGetTokenizerConfigSize::test_goog_stored_content_length_header PASSED [ 75%]
tests/test_tokenizer_config_size.py::TestGetTokenizerConfigSize::test_raises_value_error_when_no_size_header PASSED [100%]

============================== 4 passed in 0.29s ===============================
=== LOCAL_TEST_PASSED ===

Risk

Only _get_tokenizer_config_size is changed. The function is always invoked inside a try/except Exception block in both get_hf_tokenizer and async_get_hf_tokenizer, so raising ValueError instead of letting TypeError propagate does not change observable behavior for callers — the tokenizer still downloads successfully when size headers are absent. The typing.cast removal is a no-op at runtime (cast is transparent) and makes the type annotation accurate.


Note

Low Risk
Low risk: small error-handling change in non-critical logging code plus targeted unit tests; behavior remains within existing try/except paths.

Overview
Improves _get_tokenizer_config_size to explicitly raise ValueError when neither x-goog-stored-content-length nor Content-Length is present, avoiding a confusing TypeError from int(None).

Adds tests/test_tokenizer_config_size.py to validate size parsing for both headers (including priority) and to assert the new exception behavior when size headers are missing (e.g., chunked transfer encoding).

Reviewed by Cursor Bugbot for commit 931e7e4. Bugbot is set up for automated code reviews on this repo. Configure here.

if size is None:
raise ValueError("Content-Length unavailable (server may use chunked transfer encoding)")

return round(int(size) / 1024 / 1024, 2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep the typing cast here unless you had a specific reason to remove it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

if size is None:
raise ValueError("Content-Length unavailable (server may use chunked transfer encoding)")

return round(int(size) / 1024 / 1024, 2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: uncaught type error in _get_tokenizer_config_size when server omits content length

2 participants