feat: add cosine_distance scalar function#21542
Conversation
Add cosine_distance (and list_cosine_distance alias) to compute cosine distance between two numeric arrays. Includes shared vector math primitives in vector_math.rs for reuse by follow-on functions. Part of apache#21536. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| let result = list_array1 | ||
| .iter() | ||
| .zip(list_array2.iter()) | ||
| .map(|(arr1, arr2)| compute_cosine_distance(arr1, arr2)) | ||
| .collect::<Result<Float64Array>>()?; | ||
|
|
There was a problem hiding this comment.
It's probably more efficient to iterate using offsets/values than needing to downcast each ArrayRef value
| let mut value1 = value1; | ||
| let mut value2 = value2; | ||
|
|
||
| loop { |
There was a problem hiding this comment.
What is this loop for? I don't believe this function is intended to support nested lists?
| let values1 = convert_to_f64_array(&value1)?; | ||
| let values2 = convert_to_f64_array(&value2)?; |
There was a problem hiding this comment.
Is this strictly necessary? I though signature coercion should already ensure it is a float array
| let values2 = convert_to_f64_array(&value2)?; | ||
|
|
||
| if values1.len() != values2.len() { | ||
| return exec_err!("Both arrays must have the same length"); |
There was a problem hiding this comment.
Probably need to be more descriptive for this error, especially as this can be confused as the actual input Arrow Array (when it's meant to refer to the list type inside the Arrow Array)
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { |
There was a problem hiding this comment.
These tests seem duplicated with SLTs, so should remove these Rust unit tests
| pub mod sort; | ||
| pub mod string; | ||
| pub mod utils; | ||
| pub mod vector_math; |
There was a problem hiding this comment.
Don't think vector_math needs to be public here
| /// Computes dot product: sum(a\[i\] * b\[i\]) | ||
| pub fn dot_product_f64(a: &Float64Array, b: &Float64Array) -> f64 { | ||
| a.iter() | ||
| .zip(b.iter()) | ||
| .map(|(v1, v2)| v1.unwrap_or(0.0) * v2.unwrap_or(0.0)) | ||
| .sum() | ||
| } | ||
|
|
||
| /// Computes sum of squares: sum(a\[i\]^2) | ||
| pub fn sum_of_squares_f64(a: &Float64Array) -> f64 { | ||
| a.iter() | ||
| .map(|v| { | ||
| let val = v.unwrap_or(0.0); | ||
| val * val | ||
| }) | ||
| .sum() | ||
| } | ||
|
|
||
| /// Computes magnitude (L2 norm): sqrt(sum(a\[i\]^2)) | ||
| pub fn magnitude_f64(a: &Float64Array) -> f64 { | ||
| sum_of_squares_f64(a).sqrt() |
There was a problem hiding this comment.
These are probably best off inlined for the moment? It might make it more efficient instead of needing to pass in an array slice for every array value
Addresses review comments on apache#21542: - Iterate list offsets/values directly instead of per-row ArrayRef downcast - Remove nested-list unwrap loop (function does not support nested lists) - Drop convert_to_f64_array wrapper (coerce_types already guarantees Float64) - Remove duplicate Rust unit tests now covered by SLT - More descriptive error message for mismatched list lengths - Delete now-unused vector_math module; inline math into sole caller Adds SLT coverage for NULL-element-in-list behavior previously tested only in Rust unit tests. Part of apache#21536. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| Ok(DataType::Float64) | ||
| } | ||
|
|
||
| fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> { |
There was a problem hiding this comment.
Something to keep in mind here is input lists of different types, take for example:
query R
select list_cosine_distance([1.0, 0.0], arrow_cast([0.0, 1.0], 'LargeList(Float16)'));
----
1This was highlighted by the recent PR for concat:
| pub fn new() -> Self { | ||
| Self { | ||
| signature: Signature::user_defined(Volatility::Immutable), | ||
| aliases: vec!["list_cosine_distance".to_string()], |
There was a problem hiding this comment.
list_cosine_distance seems a bit unwieldy, especially considering the regular name of the function cosine_distance doesn't mention array (usually we'd have something like array_remove then the alias just replaces array, e.g. list_remove)
| query error cosine_distance does not support type | ||
| select cosine_distance(NULL, [1.0, 2.0]); |
There was a problem hiding this comment.
We could probably still handle this to return null
| select cosine_distance(column1, column2) from (values | ||
| (make_array(1.0, 0.0), make_array(0.0, 1.0)), | ||
| (make_array(1.0, 1.0), make_array(1.0, 1.0)), | ||
| (make_array(1.0, 0.0), make_array(-1.0, 0.0)) |
There was a problem hiding this comment.
| (make_array(1.0, 0.0), make_array(-1.0, 0.0)) | |
| (make_array(1.0, 0.0), make_array(-1.0, 0.0)), | |
| (make_array(1.0, 0.0), NULL) |
Just for coverage
|
Thanks for the detailed review, @Jefffrey. Rework pushed in fc3ee90. Walking through each comment: 1. Iterate via offsets/values, not per-row ArrayRef downcast ( 2. Nested-list unwrap loop ( 3. Redundant null/Float64 check ( 4. Ambiguous length-mismatch error wording ( 5. Duplicate Rust unit tests ( 6. 7. Inline the math instead of a separate module ( Full validation matrix ( |
Summary
cosine_distance(array1, array2)/list_cosine_distance— computes cosine distance (1 - cosine similarity) between two numeric arraysvector_math.rsprimitives (dot_product_f64,magnitude_f64,convert_to_f64_array) for reuse by follow-on vector functionsPart of #21536 — first in a series of split PRs (replacing #21371).
Test plan
cosine_distance.sltcovering all edge cases including empty arrays, LargeList, integer coercion, alias, return typecargo clippy,cargo fmt,taplo,prettier,cargo machete— all clean🤖 Generated with Claude Code