Description
The WHATWG URL parser in ext/uri calls lxb_encoding_decode_valid_utf_8_single() for all input bytes >= 0x80 in paths, queries, fragments, and hostnames. This is lexbor's non-validating UTF-8 decoder, written to assume the caller already verified the input. It skips continuation byte range checks, overlong sequence rejection, and surrogate rejection.
Lexbor ships a validating variant, lxb_encoding_decode_utf_8_single(), in the same file. It rejects all three classes correctly. The URL parser calls the non-validating one instead.
What gets accepted
Overlong sequences: \xC0\xAF (overlong /) decodes to U+002F, and \xC1\xA1 (overlong a) decodes to U+0061. The validating decoder rejects every 2-byte overlong with its ch < 0xC2 lead-byte check.
Invalid continuation bytes: \xC0\x41 (0x41 is A, not a continuation byte) decodes to codepoint 0x41. The literal A gets consumed into the multi-byte sequence and percent-encoded as %C0%41 instead of being parsed as its own character.
Surrogates: \xED\xA0\x80 decodes to U+D800. The validating decoder rejects this via boundary checks on the 0xED prefix.
Chrome, Firefox, and Safari reject all three at the UTF-8 decode step, producing U+FFFD (replacement character). Lexbor's URL parser diverges from browser behavior here.
Hostname normalization attack
After percent-decoding a hostname, bytes >= 0x80 trigger IDNA processing. The IDNA code calls the same non-validating decoder. Overlong ASCII characters pass through as their target codepoints, producing valid domain names from byte sequences that look nothing like the canonical form.
Paths, queries, and fragments
Overlong bytes go through the >= 0x80 branch and get percent-encoded as raw bytes. ASCII separators (/, ?, #) are matched at the byte level, so an overlong / (0xC0 0xAF) doesn't split a path; the stored URL keeps %C0%AF. There is no structural confusion at the URL parser level.
If a downstream caller percent-decodes %C0%AF and feeds the result to a path resolver, they get /.
No validation warnings
For overlong sequences that map to valid URL characters, lxb_url_is_url_codepoint() returns true and no InvalidUrlUnit warning fires. A developer checking the errors array sees nothing and assumes the URL is clean.
Reproduction
<?php
// Overlong 'e', 'v', 'i', 'l' in hostname
$url = Uri\WhatWg\Url::parse("http://%C1%A5%C1%B6%C1%A9%C1%AC.com/");
var_dump($url?->getAsciiHost());
// Expected: null (parse failure, matching browser behavior)
// Actual: string(8) "evil.com"
// Overlong '/' in path, no structural issue but no warning
$url2 = Uri\WhatWg\Url::parse("http://example.com/a%C0%AFb");
var_dump($url2?->getPath());
// Actual: "/a%C0%AFb" (overlong preserved, no warning)
// Browsers: "/a%EF%BF%BDb" (U+FFFD replacement)
Root cause
lxb_encoding_decode_valid_utf_8_single in ext/lexbor/lexbor/encoding/decode.c:2889 skips continuation byte range checks, overlong sequence rejection, and surrogate rejection. The URL parser calls it at 7 sites in ext/lexbor/lexbor/url/url.c (lines 2343, 2405, 2427, 2650, 3009, 3031, 3189). The validating decoder at decode.c:2780 handles all three correctly.