Implement Sequence ID pooling in BatchedExecutor to prevent native SeqMax overflow and crashes #1386
Open
zsogitbe wants to merge 1 commit into
Description
The Problem
Under sustained traffic or continuous batching scenarios, the `BatchedExecutor` eventually causes the native `llama.cpp` backend to deadlock, reject tokens, or throw a segmentation fault.

The Root Cause
Currently, `BatchedExecutor` assigns sequence IDs using a strictly incrementing counter (`_nextSequenceId++`). When a `Conversation` is disposed, its tokens are properly removed from the KV cache via `MemorySequenceRemove`, but its sequence ID is permanently abandoned.

In `llama.cpp`, the `SeqMax` (`n_seq_max`) parameter dictates the static allocation of arrays within the native KV cache. It expects sequence IDs to act as strictly bounded array indices (from `0` to `SeqMax - 1`). Once the C# `_nextSequenceId` counter exceeds `SeqMax`, passing that ID to the native backend results in an out-of-bounds memory access.

The Solution
This PR replaces the strictly incrementing counter with a Sequence ID pool.
- `BatchedExecutor`: Updated to track active IDs (e.g., using a thread-safe `HashSet` or similar structure) and assign the lowest available sequence ID. Added a `ReleaseSequenceId` method.
- `Conversation`: Updated the `Dispose()` method to call `Executor.ReleaseSequenceId(ConversationId)` after clearing the KV cache.

Benefits
- Sequence IDs are reused and never exceed the `SeqMax` limit, as long as the number of concurrent conversations stays within bounds.
- Allows the `BatchedExecutor` to run continuously under high-traffic workloads without requiring the host application to dangerously destroy and recreate the massive native context to reset the ID counter.
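The pooling behaviour described above can be sketched roughly as follows. Only the `ReleaseSequenceId` name and the "lowest available ID" policy come from this PR; the class name `SequenceIdPool` and its internals (a lock plus a `SortedSet<int>`) are illustrative assumptions, not the actual LLamaSharp implementation:

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of a sequence ID pool bounded by SeqMax.
public sealed class SequenceIdPool
{
    private readonly int _seqMax;
    private readonly SortedSet<int> _free = new();
    private readonly object _lock = new();
    private int _nextUnused;

    public SequenceIdPool(int seqMax) => _seqMax = seqMax;

    // Returns the lowest available sequence ID, always in [0, SeqMax).
    public int Acquire()
    {
        lock (_lock)
        {
            if (_free.Count > 0)
            {
                var id = _free.Min;   // reuse the lowest previously released ID
                _free.Remove(id);
                return id;
            }
            if (_nextUnused >= _seqMax)
                throw new InvalidOperationException("All sequence IDs are in use.");
            return _nextUnused++;     // hand out a never-used ID
        }
    }

    // Called from Conversation.Dispose() after the KV cache slots are cleared.
    public void Release(int id)
    {
        lock (_lock) _free.Add(id);
    }
}
```

With this scheme, disposing a conversation returns its ID to the pool, so the IDs passed to the native backend remain valid array indices no matter how many conversations are created and destroyed over the executor's lifetime.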