[opt](function) simplify FunctionStringPad fast path#62359
[opt](function) simplify FunctionStringPad fast path#62359Mryange merged 2 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Found 1 issue.
Critical checkpoint conclusions:
- Goal and correctness: The PR clearly aims to refactor
FunctionStringPadby splitting out a const-length/const-pad fast path and a general path. The overall structure is smaller and easier to follow, but the new fast path introduces a correctness regression in its upfront output allocation logic, so the current code does not fully preserve the previous behavior. - Minimality and focus: The change is focused on a single function implementation, but it also changes the memory growth strategy, which is where the regression appears.
- Concurrency: No new concurrency is involved here; execution remains row-local within the vectorized function.
- Lifecycle/static initialization: No special lifecycle or static initialization concerns found.
- Configuration: No configuration changes.
- Compatibility: No FE/BE protocol, symbol, or storage format compatibility issues found.
- Parallel code paths: The split between the const/const path and the general path is explicit; the issue I found is limited to the new const/const fast path.
- Special conditional logic: The new ASCII-vs-UTF8 dispatch is reasonable, but the const/const path now relies on an estimated byte count before any
check_chars_length()validation. - Test coverage: Existing BE/regression tests cover many
lpad/rpadvalue cases, but this PR changes memory-allocation behavior and was submitted without new tests. I did not see coverage that specifically exercises overflow/error-path behavior for the new fast path. - Observability: No additional observability needed for this change.
- Transaction/persistence: Not applicable.
- Data writes/atomicity: Not applicable beyond in-memory string-column writes.
- FE/BE variable passing: Not applicable.
- Performance: The refactor should help readability and may help the const/const path, but correctness needs to be fixed first.
- Other issues: None beyond the inline finding below.
Summary: The refactor is directionally good, but the new const-length/const-pad path should not pre-size ColumnString::Chars with an unchecked estimate. It needs the same overflow protection guarantees as the previous row-by-row checked growth path.
| } | ||
|
|
||
| const PadChars* padding_ptr = padding ? &*padding : nullptr; | ||
| res_chars.resize(estimate_const_output_bytes<is_utf8>(strcol.get_chars(), target_len, |
There was a problem hiding this comment.
This new fast path pre-sizes res_chars with an estimated total byte count before any ColumnString::check_chars_length() validation. That changes behavior compared with the old implementation, which validated each row's actual buffer_len + new_capacity before growing the destination.
A concrete failure case is a very large constant target length (or just a large row count) where estimate_const_output_bytes() exceeds ColumnString::MAX_STRING_SIZE: PaddedPODArray::resize() will try to allocate that size directly instead of raising the Doris STRING_OVERFLOW_IN_VEC_ENGINE exception from check_chars_length(). So this is not only a refactor; it weakens the existing overflow guard on the const/const path.
Please switch this to reserve() plus checked growth, or run ColumnString::check_chars_length() on the estimated size before the initial resize, and still keep the row-level checked growth for cases where the estimate is low.
|
/review |
|
No issues found in Critical checkpoint conclusions:
|
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
|
run cloud_p0 |
There was a problem hiding this comment.
I found one blocking correctness regression in the refactor.
Critical Checkpoints
- Goal of the task: The PR aims to simplify
FunctionStringPadwhile preserving existinglpad/rpadbehavior. Conclusion: not fully achieved because the new implementation changes results for non-UTF8 byte strings; no new test in this PR demonstrates compatibility. - Change size/focus: The refactor is focused on one function, but it rewrites the core execution path enough that behavior parity needs stronger validation.
- Concurrency: Not applicable here; the code is single-threaded vectorized function logic and does not introduce shared-state synchronization.
- Lifecycle/static initialization: No special lifecycle or static initialization concerns found.
- Configuration: No new configuration items.
- Compatibility/incompatible changes: No wire/storage compatibility impact, but there is a user-visible function behavior regression on arbitrary byte strings.
- Parallel code paths:
lpadandrpadboth share this implementation, so the regression affects both paths. - Special conditional checks: The new ASCII/UTF-8 split is understandable, but the new UTF-8 string-length helper is not behavior-equivalent to the prior truncation logic.
- Test coverage: Existing regression cases cover normal ASCII/UTF-8 usage, but I did not find coverage for malformed/non-UTF8 byte strings, which is exactly where the regression appears.
- Observability: No additional observability needed for this function change.
- Transaction/persistence: Not applicable.
- Data writes/modifications: Not applicable beyond function output correctness.
- FE/BE variable passing: Not applicable.
- Performance: The const/const specialization is reasonable, but correctness must be fixed first.
- Other issues: None beyond the blocking behavior change above were confirmed.
Please fix the string-length/truncation behavior so it remains compatible with the previous implementation for arbitrary byte strings, then re-run the relevant string-function coverage.
| } | ||
|
|
||
| const size_t str_char_len = get_char_length<is_utf8>(str_data, str_byte_len); | ||
| const size_t target_char_len = static_cast<size_t>(target_len); |
There was a problem hiding this comment.
This changes lpad/rpad semantics for non-UTF8 byte strings. simd::VStringFunctions::get_char_len() counts only bytes > 0xBF, while the previous implementation decided whether the source already reached len via iterate_utf8_with_limit_length() (vstring_function.h:236), which advances by UTF8_BYTE_LENGTH[*p] and therefore treats continuation bytes 0x80..0xBF as one character.
A concrete regression is a one-byte string like unhex('80') with target length 1: before this refactor the function kept the original byte, but now str_char_len becomes 0, so we pad and return two bytes. Doris string columns can contain arbitrary byte sequences, so this is a behavior break. Please keep the source-string length/truncation logic consistent with the old iterate_utf8_with_limit_length() behavior (or make an explicit validated-UTF8-only change with tests if that is intentional).
There was a problem hiding this comment.
我们默认,对于字符串函数,里面的字符串都是合法的。
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
我们默认,对于字符串函数,里面的字符串都是合法的utf8字符。
What problem does this PR solve?
This PR refactors BE
lpad/rpadimplementation inFunctionStringPad.The previous version mixed multiple optimization dimensions in one heavily-templated path, which made the code hard to read and maintain.
This change simplifies the execution model into:
length const && pad constThe const/const path keeps the key optimizations:
ColumnString::CharsRelease note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)