ext/uri: WHATWG URL parser accepts overlong UTF-8 and invalid continuation bytes in hostnames #21734

## Description

The WHATWG URL parser in ext/uri calls `lxb_encoding_decode_valid_utf_8_single()` for all input bytes >= 0x80 in paths, queries, fragments, and hostnames. This is lexbor's non-validating UTF-8 decoder, written to assume the caller already verified the input. It skips continuation byte range checks, overlong sequence rejection, and surrogate rejection.

Lexbor ships a validating variant, `lxb_encoding_decode_utf_8_single()`, in the same file. It rejects all three classes correctly. The URL parser calls the non-validating one instead.
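To make "non-validating" concrete, here is a minimal sketch of a decoder that trusts the lead byte and never inspects continuation bytes. It is not lexbor's code: `sketch_decode_lenient` and `SKETCH_DECODE_ERROR` are hypothetical names, and the `0x7F` continuation mask is an assumption chosen only because it reproduces the results reported in this issue.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DECODE_ERROR 0xFFFFFFFFu

/*
 * Hypothetical lenient decoder (illustration only, not lexbor's code).
 * The lead byte alone decides the sequence length. Continuation bytes are
 * masked and merged without checking that they are in 0x80..0xBF, and the
 * result is never checked for overlong forms or surrogates.
 * Advances *pos past the bytes it consumed.
 */
uint32_t sketch_decode_lenient(const unsigned char *buf, size_t len, size_t *pos)
{
    const unsigned char *p = buf + *pos;
    size_t left = (*pos < len) ? len - *pos : 0;

    if (left == 0) {
        return SKETCH_DECODE_ERROR;
    }
    if (p[0] < 0x80) {                      /* 0xxxxxxx */
        *pos += 1;
        return p[0];
    }
    if ((p[0] & 0xE0) == 0xC0 && left >= 2) {   /* 110xxxxx + 1 byte */
        *pos += 2;
        /* Assumed mask: only the top bit of the second byte is cleared. */
        return ((uint32_t) (p[0] & 0x1F) << 6) | (uint32_t) (p[1] & 0x7F);
    }
    if ((p[0] & 0xF0) == 0xE0 && left >= 3) {   /* 1110xxxx + 2 bytes */
        *pos += 3;
        return ((uint32_t) (p[0] & 0x0F) << 12)
             | ((uint32_t) (p[1] & 0x7F) << 6)
             | (uint32_t) (p[2] & 0x7F);
    }
    if ((p[0] & 0xF8) == 0xF0 && left >= 4) {   /* 11110xxx + 3 bytes */
        *pos += 4;
        return ((uint32_t) (p[0] & 0x07) << 18)
             | ((uint32_t) (p[1] & 0x7F) << 12)
             | ((uint32_t) (p[2] & 0x7F) << 6)
             | (uint32_t) (p[3] & 0x7F);
    }

    /* Truncated sequence or a byte that cannot start a sequence. */
    *pos = len;
    return SKETCH_DECODE_ERROR;
}
```

Under these assumptions, `{0xC0, 0xAF}` comes back as 0x2F and `{0xC0, 0x41}` as 0x41, each consuming two bytes, which is the behavior this issue describes.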

### What gets accepted

**Overlong sequences**: `\xC0\xAF` (overlong `/`) decodes to U+002F. `\xC1\xA1` (overlong `a`) decodes to U+0061. The validating decoder rejects all 2-byte overlongs at `ch < 0xC2`.

**Invalid continuation bytes**: `\xC0\x41` (0x41 is `A`, not a continuation byte) decodes to U+0041. The literal `A` is consumed as part of the multi-byte sequence and percent-encoded as `%C0%41` instead of being parsed as its own character.

**Surrogates**: `\xED\xA0\x80` decodes to U+D800. The validating decoder rejects this via boundary checks on the 0xED prefix.

Chrome, Firefox, and Safari reject all three at the UTF-8 decode step, producing U+FFFD (replacement character). Lexbor's URL parser diverges from browser behavior here.
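For comparison, a validating decoder can be sketched along the lines of the WHATWG Encoding Standard's UTF-8 decoder, which tracks a lower and upper bound for the next continuation byte. This is an illustration under stated assumptions, not lexbor's `lxb_encoding_decode_utf_8_single()`; the names `sketch_decode_strict` and `SKETCH_REPLACEMENT` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu

/*
 * Hypothetical strict decoder (illustration only).
 * Returns the decoded codepoint, or U+FFFD for any malformed sequence.
 * A byte that fails the continuation check is NOT consumed, so the caller
 * decodes it again as the start of the next sequence.
 * The caller is expected to pass *pos < len.
 */
uint32_t sketch_decode_strict(const unsigned char *buf, size_t len, size_t *pos)
{
    size_t i = *pos;
    unsigned char lead, lower = 0x80, upper = 0xBF;
    uint32_t cp = 0;
    int needed = 0;

    if (i >= len) {
        return SKETCH_REPLACEMENT;
    }

    lead = buf[i++];
    if (lead < 0x80) {
        *pos = i;
        return lead;
    }

    if (lead >= 0xC2 && lead <= 0xDF) {         /* 0xC0 and 0xC1 are overlong */
        needed = 1;
        cp = lead & 0x1F;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        needed = 2;
        cp = lead & 0x0F;
        if (lead == 0xE0) lower = 0xA0;         /* rejects 3-byte overlongs */
        if (lead == 0xED) upper = 0x9F;         /* rejects surrogates */
    } else if (lead >= 0xF0 && lead <= 0xF4) {
        needed = 3;
        cp = lead & 0x07;
        if (lead == 0xF0) lower = 0x90;         /* rejects 4-byte overlongs */
        if (lead == 0xF4) upper = 0x8F;         /* rejects > U+10FFFF */
    } else {
        *pos = i;                               /* stray continuation or invalid lead */
        return SKETCH_REPLACEMENT;
    }

    while (needed > 0) {
        if (i >= len || buf[i] < lower || buf[i] > upper) {
            *pos = i;                           /* offending byte stays unread */
            return SKETCH_REPLACEMENT;
        }
        cp = (cp << 6) | (uint32_t) (buf[i] & 0x3F);
        i++;
        lower = 0x80;
        upper = 0xBF;
        needed--;
    }

    *pos = i;
    return cp;
}
```

With this sketch, `{0xC0, 0x41}` yields U+FFFD for the first byte and then `A` on the next call, so the ASCII character is no longer swallowed by the malformed sequence.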

### Hostname normalization attack

After percent-decoding a hostname, bytes >= 0x80 trigger IDNA processing. The IDNA code calls the same non-validating decoder. Overlong ASCII characters pass through as their target codepoints, producing valid domain names from byte sequences that look nothing like the canonical form.
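The two steps can be sketched in isolation: percent-decode the host, then run a lenient 2-byte decode over the result. Both functions below are hypothetical stand-ins, not the ext/uri or lexbor code paths, and the real IDNA processing does far more than this; the sketch only shows how the overlong bytes collapse to ASCII.

```c
#include <stddef.h>

/* Returns 0..15 for a hex digit, or -1 otherwise. */
int sketch_hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Percent-decodes a NUL-terminated string into raw bytes; returns the byte count. */
size_t sketch_percent_decode(const char *in, unsigned char *out, size_t cap)
{
    size_t n = 0;

    while (*in != '\0' && n < cap) {
        int hi, lo;

        if (in[0] == '%' && (hi = sketch_hex_value(in[1])) >= 0
            && (lo = sketch_hex_value(in[2])) >= 0)
        {
            out[n++] = (unsigned char) ((hi << 4) | lo);
            in += 3;
        } else {
            out[n++] = (unsigned char) *in++;
        }
    }
    return n;
}

/*
 * Lenient pass over the decoded bytes (illustration only). Two-byte
 * sequences are merged without any overlong check, so 0xC1 0xA5 becomes 'e'.
 * Anything that does not end up as ASCII is written as '?' because this
 * sketch does not implement Punycode.
 */
size_t sketch_lenient_to_ascii(const unsigned char *in, size_t len,
                               char *out, size_t cap)
{
    size_t i = 0, n = 0;

    if (cap == 0) return 0;

    while (i < len && n + 1 < cap) {
        if (in[i] < 0x80) {
            out[n++] = (char) in[i++];
        } else if ((in[i] & 0xE0) == 0xC0 && i + 1 < len) {
            unsigned cp = ((unsigned) (in[i] & 0x1F) << 6) | (unsigned) (in[i + 1] & 0x7F);
            out[n++] = (cp < 0x80) ? (char) cp : '?';
            i += 2;
        } else {
            out[n++] = '?';
            i++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding `%C1%A5%C1%B6%C1%A9%C1%AC.com` through `sketch_percent_decode()` and then `sketch_lenient_to_ascii()` produces `evil.com`, matching the reproduction below.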

### Paths, queries, and fragments

Overlong bytes go through the >= 0x80 branch and are percent-encoded as raw bytes. ASCII separators (`/`, `?`, `#`) are matched at the byte level, so an overlong `/` (0xC0 0xAF) does not split a path. The stored URL contains `%C0%AF`, so the URL parser itself shows no structural confusion.
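A small sketch of why the path structure survives: separators are found by comparing raw bytes, and bytes >= 0x80 are emitted as `%XX` untouched. The helper names are hypothetical and the encoding rule is simplified to "high bytes only", which is an assumption made for illustration rather than the full path percent-encode set.

```c
#include <stddef.h>

/* Counts '/' separators by comparing raw bytes, with no UTF-8 decoding. */
size_t sketch_count_slashes(const unsigned char *path, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        if (path[i] == '/') {
            count++;
        }
    }
    return count;
}

/*
 * Copies ASCII bytes as-is and writes every byte >= 0x80 as %XX.
 * Output is NUL-terminated; returns the number of characters written.
 */
size_t sketch_encode_high_bytes(const unsigned char *in, size_t len,
                                char *out, size_t cap)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    if (cap == 0) return 0;

    for (size_t i = 0; i < len; i++) {
        if (in[i] >= 0x80) {
            if (n + 4 > cap) break;     /* three characters plus the NUL */
            out[n++] = '%';
            out[n++] = hex[in[i] >> 4];
            out[n++] = hex[in[i] & 0x0F];
        } else {
            if (n + 2 > cap) break;     /* one character plus the NUL */
            out[n++] = (char) in[i];
        }
    }
    out[n] = '\0';
    return n;
}
```

For the raw bytes `/a\xC0\xAFb`, the slash count is 1 and the encoded form is `/a%C0%AFb`: the overlong pair never participates in segment splitting.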

If a downstream caller percent-decodes `%C0%AF` and passes the resulting bytes through an equally lenient UTF-8 decoder before path resolution, it gets `/`.
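One way a downstream caller could protect itself is to check that the percent-decoded bytes are well-formed UTF-8 before handing them to anything that interprets them. The validator below is a hedged sketch with a hypothetical name, `sketch_is_valid_utf8`; it is not an API offered by ext/uri or lexbor.

```c
#include <stddef.h>

/*
 * Returns 1 if the buffer is well-formed UTF-8, 0 otherwise.
 * Rejects overlong forms, surrogates, stray continuation bytes,
 * truncated sequences, and anything above U+10FFFF.
 */
int sketch_is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char lead = s[i++];
        unsigned char lower = 0x80, upper = 0xBF;
        int needed;

        if (lead < 0x80) {
            continue;                           /* plain ASCII */
        }
        if (lead >= 0xC2 && lead <= 0xDF) {
            needed = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            needed = 2;
            if (lead == 0xE0) lower = 0xA0;     /* 3-byte overlong */
            if (lead == 0xED) upper = 0x9F;     /* surrogate range */
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            needed = 3;
            if (lead == 0xF0) lower = 0x90;     /* 4-byte overlong */
            if (lead == 0xF4) upper = 0x8F;     /* above U+10FFFF */
        } else {
            return 0;                           /* 0x80..0xC1 or 0xF5..0xFF */
        }

        while (needed-- > 0) {
            if (i >= len || s[i] < lower || s[i] > upper) {
                return 0;
            }
            i++;
            lower = 0x80;                       /* only the first continuation */
            upper = 0xBF;                       /* byte has a narrowed range   */
        }
    }
    return 1;
}
```

Applied to the decoded bytes `a\xC0\xAFb`, the check fails, so the caller can reject the path instead of resolving it.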

### No validation warnings

For overlong sequences that map to valid URL characters, `lxb_url_is_url_codepoint()` returns true and no `InvalidUrlUnit` warning fires. A developer checking the errors array sees nothing and assumes the URL is clean.

## Reproduction

~~~php
<?php

// Overlong 'e', 'v', 'i', 'l' in hostname
$url = Uri\WhatWg\Url::parse("http://%C1%A5%C1%B6%C1%A9%C1%AC.com/");
var_dump($url?->getAsciiHost());
// Expected: null (parse failure, matching browser behavior)
// Actual: string(8) "evil.com"

// Overlong '/' in path, no structural issue but no warning
$url2 = Uri\WhatWg\Url::parse("http://example.com/a%C0%AFb");
var_dump($url2?->getPath());
// Actual: "/a%C0%AFb" (overlong preserved, no warning)
// Browsers: "/a%EF%BF%BDb" (U+FFFD replacement)
~~~

## Root cause

`lxb_encoding_decode_valid_utf_8_single` in `ext/lexbor/lexbor/encoding/decode.c:2889` skips continuation byte range checks, overlong sequence rejection, and surrogate rejection. The URL parser calls it at 7 sites in `ext/lexbor/lexbor/url/url.c` (lines 2343, 2405, 2427, 2650, 3009, 3031, 3189). The validating decoder at decode.c:2780 handles all three correctly.
