Backport of #100420/#100295 - Resolve problems with paths and compatibility problems with Spark in Azure (v2) #1785
Conversation
…_spark_azure_fixes Resolve problems with paths and compatibility problems with Spark in Azure (v2)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9762404674
```diff
 per_file_record_counts.push_back(static_cast<Int64>(*current_file_num_rows));
-per_file_byte_sizes.push_back(static_cast<Int64>(file_bytes));
+/// todo arthur fix the wrong counter for file bytes, probably by backporting something else
+per_file_byte_sizes.push_back(static_cast<Int64>(buffer_bytes));
```
Store the Azure fallback size in per-file stats
When the write buffer does not report bytes (the Azure path handled just above), buffer_bytes remains 0, so per_file_byte_sizes records 0 even though total_bytes was corrected from object metadata. This value is returned by getDataFileEntries and written to the import/export sidecar, which later becomes the Iceberg manifest file_size_in_bytes, so exported/imported Azure data files can be committed with a zero file size. Use the fallback size for both counters instead of only total_bytes.
…hunk`

In `addRequestedFileLikeStorageVirtualsToChunk`, the `_row_number` handling block uses `return` instead of `continue` after adding the column to the chunk. This causes the function to exit the loop early, skipping any remaining virtual columns (e.g. `_data_lake_snapshot_version`). When a query requests both `_row_number` and another virtual column after it, the chunk has fewer columns than expected, resulting in: "Invalid number of columns in chunk pushed to OutputPort."

The fix was originally in ClickHouse#100116 but was lost during merge, because ClickHouse#100208 (the revert of ClickHouse#99163) had reintroduced the `return` on master after the fix branch had already resolved it via a different code structure. The regression test `04050_iceberg_virtual_columns_return_bug` is already on master from ClickHouse#100116.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100177&sha=f10711f066fd101124e088ce33061de51ebae0e9&name_0=PR&name_1=Stateless%20tests%20%28amd_debug%2C%20parallel%29

ClickHouse#87890

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
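A condensed sketch of the control flow described above; the real loop in `addRequestedFileLikeStorageVirtualsToChunk` handles more cases, and the column-building helpers here are hypothetical stand-ins:

```cpp
/// Condensed sketch; makeRowNumberColumn / makeSnapshotVersionColumn are
/// hypothetical stand-ins for the real column-building code.
for (const auto & requested : requested_virtual_columns)
{
    if (requested.name == "_row_number")
    {
        chunk.addColumn(makeRowNumberColumn(chunk.getNumRows()));
        continue; /// was `return`, which exited the whole function here and
                  /// skipped any virtual columns requested after _row_number
    }

    if (requested.name == "_data_lake_snapshot_version")
    {
        chunk.addColumn(makeSnapshotVersionColumn(chunk.getNumRows()));
        continue;
    }
}
```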
Looks like tests
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
This PR addresses several issues:
- Fixes inconsistent path handling in Iceberg caused by mixed usage of storage paths and metadata paths.
- Enforces that Iceberg tables write a table location that is either a URL or an absolute path.
- Adds a fallback for counting file sizes in Azure, because some ClickHouse readers don't support byte counting after traversal.
- Handles `version-hint.txt` in a manner compatible with Spark.
- Introduces type-level abstractions that make it harder to mix up path types in the future (see the sketch below).
- Adds tests for Azure and Local that verify cross-engine interoperability without intermediate uploading/downloading.
- Fixes position-delete handling, which previously relied on path-inference heuristics in cases where inference is inappropriate.
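As an illustration of the path-type point in the list above, here is a hypothetical sketch of what such type-level separation can look like; the type and function names are illustrative, not the actual ones introduced by the PR:

```cpp
#include <string>
#include <utility>

/// Distinct wrapper types make it a compile-time error to pass a storage
/// path where a metadata path is expected (or vice versa), instead of a
/// silent runtime mix-up. Names here are illustrative only.
struct StoragePath
{
    explicit StoragePath(std::string p) : value(std::move(p)) {}
    std::string value;
};

struct MetadataPath
{
    explicit MetadataPath(std::string p) : value(std::move(p)) {}
    std::string value;
};

/// Accepts only a MetadataPath; a raw std::string or a StoragePath
/// no longer compiles at the call site.
void readVersionHint(const MetadataPath & path);
```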
Backport of ClickHouse#100420 and ClickHouse#100295
Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: