perf: vectorize KV cache prefix matching with numpy #2179
Open

nausicaalii wants to merge 1 commit into abetlen:main from
Conversation
Replace the linear Python for-loop in KV cache prefix matching and `longest_token_prefix()` with a numpy vectorized comparison. The element-wise numpy comparison runs in optimized C/SIMD instead of Python's interpreter loop, which matters as conversation history grows (10K+ tokens). No change in behavior — both paths find the first position where the cached and new token sequences diverge.
Summary
- Replace the KV cache prefix matching in `generate()` and `longest_token_prefix()` with a numpy vectorized element-wise comparison
- Use `np.argmin` on a boolean equality array to find the first mismatch position in a single vectorized pass

Motivation
The current prefix matching iterates token-by-token in Python to find where the cached prompt diverges from the new prompt. This is fine for short prompts, but it becomes a bottleneck as conversation history grows — multi-turn chat sessions can accumulate 10K–100K+ tokens in `input_ids`, and the linear Python loop runs on every `generate()` call. Numpy's vectorized comparison runs in optimized C/SIMD, giving a significant speedup for large token sequences while preserving identical behavior.
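As a sketch of the technique (the function name and exact shape here are illustrative, not necessarily the PR's code): compare the overlapping regions element-wise, then use `np.argmin` on the boolean result to locate the first `False`, i.e. the first mismatch.

```python
import numpy as np

def longest_token_prefix_np(a, b):
    """Length of the common prefix of two token sequences (vectorized sketch)."""
    # Only the overlapping region can match.
    n = min(len(a), len(b))
    if n == 0:
        return 0
    # One C-level pass produces a boolean array of per-position equality.
    eq = np.asarray(a[:n]) == np.asarray(b[:n])
    # np.argmin returns the index of the first False; if every element is
    # True it returns 0, so the all-match case must be handled explicitly.
    if eq.all():
        return n
    return int(np.argmin(eq))
```

The `eq.all()` guard is the subtle part: without it, two fully matching sequences would report a prefix length of 0, since `argmin` of an all-`True` array is 0.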
Test plan
- `longest_token_prefix` correctness across edge cases: empty sequences, full match, partial match, single element, no match, different lengths, large sequences (10K tokens)
- `test_real_model` — passes (low-level batch decode)
- `test_real_llama` — passes (multiple sequential `create_completion` calls that exercise prefix matching)
- `test_real_llama_embeddings` — passes
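The edge-case checks above can be exercised with a differential test that compares the vectorized path against a pure-Python reference loop (both functions below are illustrative sketches, not the PR's test code):

```python
import numpy as np

def longest_token_prefix_py(a, b):
    # Pure-Python reference: walk until the first mismatch.
    n = 0
    for x, y in zip(a, b):  # zip stops at the shorter sequence
        if x != y:
            break
        n += 1
    return n

def longest_token_prefix_np(a, b):
    # Vectorized version under test.
    n = min(len(a), len(b))
    if n == 0:
        return 0
    eq = np.asarray(a[:n]) == np.asarray(b[:n])
    return n if eq.all() else int(np.argmin(eq))

# Edge cases from the test plan: empty, full match, partial match,
# single element, no match, different lengths, large sequences.
cases = [
    ([], []),
    ([1, 2, 3], [1, 2, 3]),
    ([1, 2, 3], [1, 2, 4]),
    ([7], [7]),
    ([1], [2]),
    ([1, 2, 3, 4], [1, 2]),
    (list(range(10_000)), list(range(10_000))),
]
for a, b in cases:
    assert longest_token_prefix_py(a, b) == longest_token_prefix_np(a, b)
```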