
Backport of #100420/#100295 - Resolve problems with paths and compatibility problems with Spark in Azure (v2)#1785

Open
arthurpassos wants to merge 4 commits into antalya-26.3 from backport/antalya/iceberg_path_100420

Conversation

@arthurpassos
Collaborator

@arthurpassos arthurpassos commented May 12, 2026

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

This PR addresses several issues:

  • Fixes inconsistent path handling in Iceberg caused by mixed usage of storage paths and metadata paths.
  • Enforces that Iceberg tables write down a table location that is either a URL or an absolute path.
  • Adds a fallback for counting file sizes in Azure, because some ClickHouse readers do not support byte counting after traversal.
  • Handles version-hint.txt in a manner compatible with Spark.
  • Introduces type-level abstractions that make it harder to mix up path types in the future.
  • Adds tests for Azure and Local that verify cross-engine interoperability without intermediate uploading/downloading.
  • Fixes the usage of position deletes, which previously relied on path inference heuristics in a context where that approach is inappropriate.

Backport of ClickHouse#100420 and ClickHouse#100295
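
One of the points in the changelog entry above is the introduction of type-level abstractions for paths. A minimal C++ sketch of the idea follows; the names TypedPath, StoragePath, and MetadataPath are illustrative assumptions and are not taken from the backported patch:

```cpp
#include <string>
#include <utility>

/// Illustrative sketch only: the concrete class names in the backported patch may differ.
/// Wrapping raw strings in distinct types means a storage path and a metadata path
/// can no longer be passed interchangeably by accident.
template <typename Tag>
class TypedPath
{
public:
    explicit TypedPath(std::string value_) : value(std::move(value_)) {}
    const std::string & str() const { return value; }

private:
    std::string value;
};

struct StoragePathTag {};
struct MetadataPathTag {};

/// Path relative to the object storage root (e.g. the blob key).
using StoragePath = TypedPath<StoragePathTag>;
/// Path as written in Iceberg metadata: a URL or an absolute path.
using MetadataPath = TypedPath<MetadataPathTag>;

/// A reader that expects a storage path no longer compiles if given a metadata path.
void readDataFile(const StoragePath & path);
```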

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9762404674


per_file_record_counts.push_back(static_cast<Int64>(*current_file_num_rows));
per_file_byte_sizes.push_back(static_cast<Int64>(file_bytes));
/// todo arthur fix the wrong counter for file bytes, probably by backporting something else
per_file_byte_sizes.push_back(static_cast<Int64>(buffer_bytes));


P2: Store the Azure fallback size in per-file stats

When the write buffer does not report bytes (the Azure path handled just above), buffer_bytes remains 0, so per_file_byte_sizes records 0 even though total_bytes was corrected from object metadata. This value is returned by getDataFileEntries and written to the import/export sidecar, which later becomes the Iceberg manifest file_size_in_bytes, so exported/imported Azure data files can be committed with a zero file size. Use the fallback size for both counters instead of only total_bytes.
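
A minimal sketch of the fix suggested above, assuming a helper along these lines records the per-file stats; recordDataFileStats, size_from_object_metadata, and the surrounding variable names are hypothetical, since the PR only shows the hunk quoted earlier:

```cpp
#include <cstdint>
#include <vector>

using Int64 = int64_t;

/// Hypothetical helper: record per-file stats so that the Azure fallback size
/// (taken from object storage metadata when the write buffer reports 0 bytes)
/// ends up in per_file_byte_sizes as well, not only in total_bytes.
void recordDataFileStats(
    size_t num_rows,
    size_t buffer_bytes,              /// bytes reported by the write buffer; 0 on the Azure path
    size_t size_from_object_metadata, /// fallback size fetched from object storage
    size_t & total_bytes,
    std::vector<Int64> & per_file_record_counts,
    std::vector<Int64> & per_file_byte_sizes)
{
    const size_t file_bytes = buffer_bytes != 0 ? buffer_bytes : size_from_object_metadata;

    total_bytes += file_bytes;
    per_file_record_counts.push_back(static_cast<Int64>(num_rows));
    /// Previously buffer_bytes was pushed directly, which could commit a manifest
    /// entry with file_size_in_bytes = 0 for Azure data files.
    per_file_byte_sizes.push_back(static_cast<Int64>(file_bytes));
}
```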


@github-actions

github-actions Bot commented May 12, 2026

Workflow [PR], commit [e5ef89f]

…hunk`

In `addRequestedFileLikeStorageVirtualsToChunk`, the `_row_number` handling
block uses `return` instead of `continue` after adding the column to the
chunk. This causes the function to exit the loop early, skipping any
remaining virtual columns (e.g. `_data_lake_snapshot_version`). When a
query requests both `_row_number` and another virtual column after it, the
chunk has fewer columns than expected, resulting in:
"Invalid number of columns in chunk pushed to OutputPort."

The fix was originally in ClickHouse#100116 but was lost during merge because
ClickHouse#100208 (revert of ClickHouse#99163) had reintroduced the `return` on master
after the fix branch had already resolved it via a different code structure.

The regression test `04050_iceberg_virtual_columns_return_bug` is already
on master from ClickHouse#100116.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100177&sha=f10711f066fd101124e088ce33061de51ebae0e9&name_0=PR&name_1=Stateless%20tests%20%28amd_debug%2C%20parallel%29

ClickHouse#87890

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
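
A standalone sketch of the return-vs-continue failure mode described in the commit message above; addVirtualColumns and the string-based "chunk" are simplified stand-ins, not the actual addRequestedFileLikeStorageVirtualsToChunk implementation:

```cpp
#include <iostream>
#include <string>
#include <vector>

/// Simplified stand-in: each requested virtual column should add exactly one entry to the chunk.
void addVirtualColumns(const std::vector<std::string> & requested, std::vector<std::string> & chunk_columns)
{
    for (const auto & name : requested)
    {
        if (name == "_row_number")
        {
            chunk_columns.push_back(name);
            continue; /// the fix; a `return` here would skip every virtual column requested after _row_number
        }
        if (name == "_data_lake_snapshot_version")
        {
            chunk_columns.push_back(name);
            continue;
        }
        /// ... other virtual columns ...
    }
}

int main()
{
    std::vector<std::string> chunk;
    addVirtualColumns({"_row_number", "_data_lake_snapshot_version"}, chunk);
    /// With the buggy `return`, only one column would be added and the downstream port
    /// would fail with "Invalid number of columns in chunk pushed to OutputPort."
    std::cout << chunk.size() << " virtual columns added\n";
}
```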
@arthurpassos arthurpassos changed the title Backport of #100420 - Resolve problems with paths and compatibility problems with Spark in Azure (v2) Backport of #100420/#100295 - Resolve problems with paths and compatibility problems with Spark in Azure (v2) May 13, 2026
@ianton-ru

Looks like the test_named_collections_encrypted2 tests can't be executed multiple times in parallel.

