[opt](function) simplify FunctionStringPad fast path by Mryange · Pull Request #62359 · apache/doris

Mryange · 2026-04-10T11:44:41Z

What problem does this PR solve?

This PR refactors BE lpad / rpad implementation in FunctionStringPad.

The previous version mixed multiple optimization dimensions in one heavily-templated path, which made the code hard to read and maintain.

This change simplifies the execution model into:

a fast path for length const && pad const
a general path for all other cases

The const/const path keeps the key optimizations:

direct write to ColumnString::Chars
one-time pad preprocessing
output size estimation
optional ASCII fast path

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

hello-stephen · 2026-04-10T11:44:48Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

Mryange · 2026-04-10T11:46:05Z

run buildall

hello-stephen · 2026-04-10T13:05:26Z

BE UT Coverage Report

Increment line coverage 95.85% (208/217) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	53.04% (20152/37996)
Line Coverage	36.58% (189486/517979)
Region Coverage	32.85% (147115/447872)
Branch Coverage	33.97% (64402/189573)

hello-stephen · 2026-04-10T14:13:27Z

BE Regression && UT Coverage Report

Increment line coverage 98.16% (213/217) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.60% (27386/37210)
Line Coverage	57.24% (295600/516396)
Region Coverage	54.64% (246976/451996)
Branch Coverage	56.17% (106813/190149)

Mryange · 2026-04-13T06:27:49Z

/review

github-actions

Found 1 issue.

Critical checkpoint conclusions:

Goal and correctness: The PR clearly aims to refactor FunctionStringPad by splitting out a const-length/const-pad fast path and a general path. The overall structure is smaller and easier to follow, but the new fast path introduces a correctness regression in its upfront output allocation logic, so the current code does not fully preserve the previous behavior.
Minimality and focus: The change is focused on a single function implementation, but it also changes the memory growth strategy, which is where the regression appears.
Concurrency: No new concurrency is involved here; execution remains row-local within the vectorized function.
Lifecycle/static initialization: No special lifecycle or static initialization concerns found.
Configuration: No configuration changes.
Compatibility: No FE/BE protocol, symbol, or storage format compatibility issues found.
Parallel code paths: The split between the const/const path and the general path is explicit; the issue I found is limited to the new const/const fast path.
Special conditional logic: The new ASCII-vs-UTF8 dispatch is reasonable, but the const/const path now relies on an estimated byte count before any check_chars_length() validation.
Test coverage: Existing BE/regression tests cover many lpad/rpad value cases, but this PR changes memory-allocation behavior and was submitted without new tests. I did not see coverage that specifically exercises overflow/error-path behavior for the new fast path.
Observability: No additional observability needed for this change.
Transaction/persistence: Not applicable.
Data writes/atomicity: Not applicable beyond in-memory string-column writes.
FE/BE variable passing: Not applicable.
Performance: The refactor should help readability and may help the const/const path, but correctness needs to be fixed first.
Other issues: None beyond the inline finding below.

Summary: The refactor is directionally good, but the new const-length/const-pad path should not pre-size ColumnString::Chars with an unchecked estimate. It needs the same overflow protection guarantees as the previous row-by-row checked growth path.

github-actions · 2026-04-13T06:33:44Z

+        }
+
+        const PadChars* padding_ptr = padding ? &*padding : nullptr;
+        res_chars.resize(estimate_const_output_bytes<is_utf8>(strcol.get_chars(), target_len,


This new fast path pre-sizes res_chars with an estimated total byte count before any ColumnString::check_chars_length() validation. That changes behavior compared with the old implementation, which validated each row's actual buffer_len + new_capacity before growing the destination.

A concrete failure case is a very large constant target length (or just a large row count) where estimate_const_output_bytes() exceeds ColumnString::MAX_STRING_SIZE: PaddedPODArray::resize() will try to allocate that size directly instead of raising the Doris STRING_OVERFLOW_IN_VEC_ENGINE exception from check_chars_length(). So this is not only a refactor; it weakens the existing overflow guard on the const/const path.

Please switch this to reserve() plus checked growth, or run ColumnString::check_chars_length() on the estimated size before the initial resize, and still keep the row-level checked growth for cases where the estimate is low.

HappenLee · 2026-04-13T11:13:22Z

/review

github-actions · 2026-04-13T11:20:13Z

No issues found in be/src/exprs/function/function_string_concat.h after reviewing the refactor against the Doris review checklist.

Critical checkpoint conclusions:

Goal and correctness: The PR goal is to simplify FunctionStringPad by splitting a const-length/const-pad fast path from the general path, and the new implementation still preserves the previous lpad/rpad semantics in the code paths I traced.
Minimality and focus: The change is focused to one BE string-function implementation and keeps the optimization-specific logic localized.
Concurrency: No concurrency or lock-safety concerns are involved; execution remains row-local in the vectorized function path.
Lifecycle/static initialization: No special lifecycle or static initialization risks found.
Configuration: No configuration changes.
Compatibility: No FE/BE protocol, persisted metadata, or storage-format compatibility issues found.
Parallel code paths: The new fast path and general path are both covered by the refactor; I did not find a mismatch that would require mirroring changes elsewhere.
Special conditional logic: The ASCII fast-path dispatch is consistent with the underlying ColumnString::is_ascii() / UTF-8 helper behavior.
Test coverage: Existing regression and constant-fold cases already exercise lpad/rpad, including UTF-8 examples. No new tests were added, so the residual risk is mainly around untested performance/overflow corners rather than an evident functional regression.
Observability: No additional observability needed for this refactor.
Transaction/persistence: Not applicable.
Data writes/atomicity: Not applicable beyond in-memory column writes.
FE/BE variable passing: Not applicable.
Performance: The refactor should improve maintainability while keeping the intended fast path.
Other issues: None found.

HappenLee

LGTM

github-actions · 2026-04-13T11:49:58Z

PR approved by at least one committer and no changes requested.

github-actions · 2026-04-13T11:50:00Z

PR approved by anyone and no changes requested.

Mryange · 2026-04-14T02:00:31Z

run buildall

hello-stephen · 2026-04-14T05:58:00Z

BE Regression && UT Coverage Report

Increment line coverage 98.19% (217/221) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.68% (27401/37191)
Line Coverage	57.35% (296085/516236)
Region Coverage	54.62% (246810/451850)
Branch Coverage	56.28% (106967/190049)

Mryange · 2026-04-14T06:58:37Z

/review

Mryange · 2026-04-14T06:58:47Z

run cloud_p0

github-actions

I found one blocking correctness regression in the refactor.

Critical Checkpoints

Goal of the task: The PR aims to simplify FunctionStringPad while preserving existing lpad/rpad behavior. Conclusion: not fully achieved because the new implementation changes results for non-UTF8 byte strings; no new test in this PR demonstrates compatibility.
Change size/focus: The refactor is focused on one function, but it rewrites the core execution path enough that behavior parity needs stronger validation.
Concurrency: Not applicable here; the code is single-threaded vectorized function logic and does not introduce shared-state synchronization.
Lifecycle/static initialization: No special lifecycle or static initialization concerns found.
Configuration: No new configuration items.
Compatibility/incompatible changes: No wire/storage compatibility impact, but there is a user-visible function behavior regression on arbitrary byte strings.
Parallel code paths: lpad and rpad both share this implementation, so the regression affects both paths.
Special conditional checks: The new ASCII/UTF-8 split is understandable, but the new UTF-8 string-length helper is not behavior-equivalent to the prior truncation logic.
Test coverage: Existing regression cases cover normal ASCII/UTF-8 usage, but I did not find coverage for malformed/non-UTF8 byte strings, which is exactly where the regression appears.
Observability: No additional observability needed for this function change.
Transaction/persistence: Not applicable.
Data writes/modifications: Not applicable beyond function output correctness.
FE/BE variable passing: Not applicable.
Performance: The const/const specialization is reasonable, but correctness must be fixed first.
Other issues: None beyond the blocking behavior change above were confirmed.

Please fix the string-length/truncation behavior so it remains compatible with the previous implementation for arbitrary byte strings, then re-run the relevant string-function coverage.

github-actions · 2026-04-14T07:08:30Z

+        }
+
+        const size_t str_char_len = get_char_length<is_utf8>(str_data, str_byte_len);
+        const size_t target_char_len = static_cast<size_t>(target_len);


This changes lpad/rpad semantics for non-UTF8 byte strings. simd::VStringFunctions::get_char_len() counts only bytes > 0xBF, while the previous implementation decided whether the source already reached len via iterate_utf8_with_limit_length() (vstring_function.h:236), which advances by UTF8_BYTE_LENGTH[*p] and therefore treats continuation bytes 0x80..0xBF as one character.

A concrete regression is a one-byte string like unhex('80') with target length 1: before this refactor the function kept the original byte, but now str_char_len becomes 0, so we pad and return two bytes. Doris string columns can contain arbitrary byte sequences, so this is a behavior break. Please keep the source-string length/truncation logic consistent with the old iterate_utf8_with_limit_length() behavior (or make an explicit validated-UTF8-only change with tests if that is intentional).

我们默认，对于字符串函数，里面的字符串都是合法的。

hello-stephen · 2026-04-14T08:36:25Z

BE Regression && UT Coverage Report

Increment line coverage 98.19% (217/221) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.68% (27401/37191)
Line Coverage	57.35% (296085/516236)
Region Coverage	54.62% (246810/451850)
Branch Coverage	56.28% (106967/190049)

我们默认，对于字符串函数，里面的字符串都是合法的utf8字符。

opt

809ef10

github-actions bot reviewed Apr 13, 2026

View reviewed changes

review

8c320c8

HappenLee reviewed Apr 13, 2026

View reviewed changes

Comment thread be/src/exprs/function/function_string_concat.h

HappenLee approved these changes Apr 13, 2026

View reviewed changes

github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 13, 2026

github-actions bot added the reviewed label Apr 13, 2026

github-actions bot previously requested changes Apr 14, 2026

View reviewed changes

Mryange merged commit 5cbaa4f into apache:master Apr 15, 2026
31 of 34 checks passed

Conversation

Mryange commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Release note

Check List (For Author)

Check List (For Reviewer who merge this PR)

Uh oh!

hello-stephen commented Apr 10, 2026

Uh oh!

Mryange commented Apr 10, 2026

Uh oh!

hello-stephen commented Apr 10, 2026

BE UT Coverage Report

Uh oh!

hello-stephen commented Apr 10, 2026

BE Regression && UT Coverage Report

Uh oh!

Mryange commented Apr 13, 2026

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

HappenLee commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

Uh oh!

HappenLee left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

Mryange commented Apr 14, 2026

Uh oh!

hello-stephen commented Apr 14, 2026

BE Regression && UT Coverage Report

Uh oh!

Mryange commented Apr 14, 2026

Uh oh!

Mryange commented Apr 14, 2026

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Mryange Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

hello-stephen commented Apr 14, 2026

BE Regression && UT Coverage Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Mryange commented Apr 10, 2026 •

edited

Loading