ext/uri: WHATWG URL parser accepts overlong UTF-8 and invalid continuation bytes in hostnames #21734

@iliaal

Description

The WHATWG URL parser in ext/uri calls lxb_encoding_decode_valid_utf_8_single() for all input bytes >= 0x80 in paths, queries, fragments, and hostnames. This is lexbor's non-validating UTF-8 decoder, written to assume the caller already verified the input. It skips continuation byte range checks, overlong sequence rejection, and surrogate rejection.

Lexbor ships a validating variant, lxb_encoding_decode_utf_8_single(), in the same file. It rejects all three classes correctly. The URL parser calls the non-validating one instead.

What gets accepted

Overlong sequences: \xC0\xAF (overlong /) decodes to U+002F. \xC1\xA1 (overlong a) decodes to U+0061. The validating decoder rejects every lead byte below 0xC2, which rules out all 2-byte overlongs.

Invalid continuation bytes: \xC0\x41 (0x41 is A, not a continuation byte) decodes to codepoint 0x41. The literal A gets consumed into the multi-byte sequence and percent-encoded as %C0%41 instead of being parsed as its own character.

Surrogates: \xED\xA0\x80 decodes to U+D800. The validating decoder rejects this via boundary checks on the 0xED prefix.

Chrome, Firefox, and Safari reject all three at the UTF-8 decode step, producing U+FFFD (replacement character). Lexbor's URL parser diverges from browser behavior here.
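
To make the three classes concrete, here is a minimal Python model of a non-validating decoder next to Python's strict UTF-8 decoder. It is a sketch, not lexbor's code: `decode_unchecked` and `strict_accepts` are hypothetical names, and the model assumes trailing bytes are taken with only the top bit cleared, which is what the \xC0\x41 result described above implies.

```python
def decode_unchecked(data: bytes) -> list[int]:
    """Model of a non-validating UTF-8 decoder: strips marker bits, checks nothing."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            size, cp = 1, lead
        elif (lead & 0xE0) == 0xC0:
            size, cp = 2, lead & 0x1F
        elif (lead & 0xF0) == 0xE0:
            size, cp = 3, lead & 0x0F
        elif (lead & 0xF8) == 0xF0:
            size, cp = 4, lead & 0x07
        else:
            # Stray byte that fits no lead pattern; placeholder choice for this sketch.
            size, cp = 1, 0xFFFD
        # Assumption: trailing bytes are never range-checked, only masked.
        for trail in data[i + 1:i + size]:
            cp = (cp << 6) | (trail & 0x7F)
        out.append(cp)
        i += size
    return out


def strict_accepts(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


samples = [b"\xC0\xAF", b"\xC1\xA1", b"\xC0\x41", b"\xED\xA0\x80"]
for s in samples:
    print(s, [hex(cp) for cp in decode_unchecked(s)], strict_accepts(s))
```

Each sample produces a code point in the unchecked model, while the strict decoder refuses all four.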

Hostname normalization attack

After percent-decoding a hostname, bytes >= 0x80 trigger IDNA processing. The IDNA code calls the same non-validating decoder. Overlong encodings of ASCII characters decode to their target codepoints, so a percent-encoded byte sequence that looks nothing like the canonical hostname still yields a valid domain name.
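
The effect on a hostname can be modeled in a few lines of Python. This is an illustrative sketch under stated assumptions, not the ext/uri or lexbor code path: `decode_host_unchecked` and `decode_host_strict` are hypothetical helpers, and the unchecked variant only handles ASCII and 2-byte sequences.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_host_unchecked(host: str) -> str:
    """Percent-decode, then decode 2-byte sequences without any validation."""
    raw = unquote_to_bytes(host)
    chars = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            chars.append(chr(b))
            i += 1
        elif (b & 0xE0) == 0xC0 and i + 1 < len(raw):
            # No overlong check: lead bytes 0xC0 and 0xC1 are accepted.
            chars.append(chr(((b & 0x1F) << 6) | (raw[i + 1] & 0x7F)))
            i += 2
        else:
            raise ValueError("sequence not covered by this sketch")
    return "".join(chars)


def decode_host_strict(host: str) -> Optional[str]:
    """Percent-decode, then decode strictly; None signals a rejected host."""
    try:
        return unquote_to_bytes(host).decode("utf-8")
    except UnicodeDecodeError:
        return None


host = "%C1%A5%C1%B6%C1%A9%C1%AC.com"
print(decode_host_unchecked(host))
print(decode_host_strict(host))
```

The unchecked path turns the overlong bytes into plain ASCII letters, while the strict path rejects the host outright.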

Paths, queries, and fragments

Overlong bytes go through the >= 0x80 branch and get percent-encoded as raw bytes. ASCII separators (/, ?, #) are matched at the byte level, so an overlong / (0xC0 0xAF) does not split a path. The stored URL contains %C0%AF, so there is no structural confusion at the URL parser level.

If a downstream caller percent-decodes %C0%AF, decodes the resulting bytes with a similarly lenient UTF-8 decoder, and feeds the result to a path resolver, it gets /.
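
A downstream caller can guard against this by decoding strictly before handing a component to anything path-like. The following Python sketch shows one way to do that; `percent_decode_checked` is a hypothetical helper, not part of any URL library.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def percent_decode_checked(component: str) -> Optional[str]:
    """Percent-decode a URL component, returning None if the bytes are not valid UTF-8."""
    raw = unquote_to_bytes(component)
    try:
        # Strict decoding rejects overlongs, bad continuation bytes, and surrogates.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


for path in ["/a%C0%AFb", "/a%2Fb", "/caf%C3%A9"]:
    print(path, "->", percent_decode_checked(path))
```

Returning None instead of a best-effort string forces the caller to treat the component as malformed rather than silently resolving an unexpected separator.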

No validation warnings

For overlong sequences that map to valid URL characters, lxb_url_is_url_codepoint() returns true and no InvalidUrlUnit warning fires. A developer checking the errors array sees nothing and assumes the URL is clean.

Reproduction

<?php

// Overlong 'e', 'v', 'i', 'l' in hostname
$url = Uri\WhatWg\Url::parse("http://%C1%A5%C1%B6%C1%A9%C1%AC.com/");
var_dump($url?->getAsciiHost());
// Expected: null (parse failure, matching browser behavior)
// Actual: string(8) "evil.com"

// Overlong '/' in path, no structural issue but no warning
$url2 = Uri\WhatWg\Url::parse("http://example.com/a%C0%AFb");
var_dump($url2?->getPath());
// Actual: "/a%C0%AFb" (overlong preserved, no warning)
// Browsers: "/a%EF%BF%BDb" (U+FFFD replacement)

Root cause

lxb_encoding_decode_valid_utf_8_single() in ext/lexbor/lexbor/encoding/decode.c:2889 skips continuation byte range checks, overlong sequence rejection, and surrogate rejection. The URL parser calls it at 7 sites in ext/lexbor/lexbor/url/url.c (lines 2343, 2405, 2427, 2650, 3009, 3031, 3189). The validating decoder, lxb_encoding_decode_utf_8_single() at decode.c:2780, handles all three correctly.
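
For reference, the checks a validating decoder has to perform fit in a short function. The Python sketch below follows the well-formed byte ranges from the Unicode Standard; it is not a transcription of lexbor's validating decoder, and `is_well_formed_utf8` is a hypothetical name.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Check lead-byte ranges, second-byte boundaries, and continuation bytes."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            # 0xC0 and 0xC1 are excluded: they can only start overlong forms.
            need, lo, hi = 1, 0x80, 0xBF
        elif 0xE0 <= lead <= 0xEF:
            need = 2
            lo = 0xA0 if lead == 0xE0 else 0x80  # blocks 3-byte overlongs
            hi = 0x9F if lead == 0xED else 0xBF  # blocks surrogates U+D800..U+DFFF
        elif 0xF0 <= lead <= 0xF4:
            need = 3
            lo = 0x90 if lead == 0xF0 else 0x80  # blocks 4-byte overlongs
            hi = 0x8F if lead == 0xF4 else 0xBF  # blocks values above U+10FFFF
        else:
            return False
        if i + need >= n:
            return False  # truncated sequence
        if not lo <= data[i + 1] <= hi:
            return False
        for trail in data[i + 2:i + 1 + need]:
            if not 0x80 <= trail <= 0xBF:
                return False
        i += need + 1
    return True


for sample in [b"\xC0\xAF", b"\xC0\x41", b"\xED\xA0\x80", "é".encode("utf-8")]:
    print(sample, is_well_formed_utf8(sample))
```

The boundary on the byte after the lead is what separates this from a mask-and-shift decoder: the lead byte alone decides which second bytes are legal.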
