Commit cf00a48
optimize InnerProduct for SorfTransform and Dict-encoded constant queries
Pulls two reduction rules into `InnerProduct::execute` that together make
cosine-similarity queries against TurboQuant-compressed columns land on
direct codebook lookups instead of decoding the full column per row.
Case 1 (`try_execute_sorf_constant`): when one side of `InnerProduct` is
`ExactScalarFn<SorfTransform>` and the other is a constant-backed tensor
extension, rewrite to `InnerProduct(sorf_child, forward_rotate(zero_pad(const)))`
and recursively re-execute so case 2 can fire on the rewritten tree. This
works because SORF is orthogonal: the adjoint of truncation is zero-padding
and `(R^{-1})^T = R`, so `<T(R^{-1} x), c> = <R^{-1} x, zero_pad(c)> =
<x, R · zero_pad(c)>`, where `T` is the `padded_dim -> dim` truncation
applied inside `SorfTransform::execute`. The rewrite is gated on
`element_ptype == F32` because
SorfTransform's trailing `f32 -> element_ptype` cast breaks the rewrite's
semantics for F16/F64.
Case 2 (`try_execute_dict_constant`): when one side's storage is
`FSL(Dict(u8, f32))` with `values.len() <= 256` and the other side is a
constant-backed tensor extension with F32 element ptype, compute each row's
inner product via direct codebook lookup
`acc += q[j] * values[codes[j] as usize]`. An explicit product table
`P[j, k] = q[j] * values[k]` was prototyped and measured ~10% slower on
the `similarity_search` bench because the 1 KiB `values` table stays in L1
across all rows while a 1 MiB product table does not.
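A minimal sketch of the per-row lookup kernel, assuming a flat row-major
code buffer and illustrative names (the real kernel operates on the array's
buffers directly):

```rust
/// Per-row inner product of a constant query against FSL(Dict(u8, f32))
/// storage: each row is `dim` u8 codes into a shared `values` codebook.
fn dict_inner_products(q: &[f32], codes: &[u8], values: &[f32], dim: usize) -> Vec<f32> {
    assert!(dim > 0 && q.len() == dim);
    assert!(values.len() <= 256); // u8 codes index the codebook
    assert_eq!(codes.len() % dim, 0);
    codes
        .chunks_exact(dim)
        .map(|row| {
            let mut acc = 0.0f32;
            for (j, &code) in row.iter().enumerate() {
                // Direct codebook lookup: the <= 1 KiB `values` table
                // stays hot in L1 across every row.
                acc += q[j] * values[code as usize];
            }
            acc
        })
        .collect()
}
```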
End-to-end `similarity_search` bench on dim=768, 10k rows (median):
- TurboQuant: 8.84 ms -> 8.01 ms (-9%), now faster than the uncompressed
baseline's 10.5 ms median.
Ten new unit tests cover both fast paths, mirrored argument orders, empty
input (`len == 0`), both `padded_dim == dim` and `padded_dim > dim`, and
the fallback paths for plain (non-Dict) FSL storage and for dicts with
more than 256 values.
Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>