From c4fcfce741a784016ebdd462a1dd38b148a95345 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sun, 3 May 2026 10:58:19 -0400 Subject: [PATCH 1/9] docs: add upstream sync process documentation Document the three-PR workflow used to sync to a newer upstream apache/datafusion version: bump crate deps + fix breakage, consolidate transitive deps, then fill API and documentation gaps via /check-upstream. Cross-reference from dev/release/README.md. Co-Authored-By: Claude Opus 4.7 (1M context) --- dev/release/README.md | 6 ++ dev/release/upstream-sync.md | 134 +++++++++++++++++++++++++++++++++++ 2 files changed, 140 insertions(+) create mode 100644 dev/release/upstream-sync.md diff --git a/dev/release/README.md b/dev/release/README.md index 4833be55a..8518ece1f 100644 --- a/dev/release/README.md +++ b/dev/release/README.md @@ -33,6 +33,12 @@ release branch without blocking ongoing development in the `main` branch. We can cherry-pick commits from the `main` branch into `branch-53` as needed and then create new patch releases from that branch. +## Upstream Sync + +Between releases the `main` branch is periodically synced to a newer upstream `apache/datafusion` version. This is +broken into a three-PR workflow (bump + fix breakage, consolidate transitive deps, fill API and documentation gaps). +See [`upstream-sync.md`](upstream-sync.md) for the full process. + ## Detailed Guide ### Pre-requisites diff --git a/dev/release/upstream-sync.md b/dev/release/upstream-sync.md new file mode 100644 index 000000000..068e71450 --- /dev/null +++ b/dev/release/upstream-sync.md @@ -0,0 +1,134 @@ + + +# Upstream Sync Process + +This document describes how to sync `datafusion-python` to a new version of the +upstream `apache/datafusion` Rust crates. This is a recurring task: between +official releases the `main` branch tracks DataFusion via crates.io or GitHub +dependencies, and we periodically bump those dependencies to pick up new +features and bug fixes. 
+ +The work is broken into **three sequential PRs** rather than landing as one +large change. Splitting reviews along these lines keeps each PR focused, makes +breakage easier to bisect, and lets reviewers concentrate on one concern at a +time. + +## PR 1: Bump DataFusion crate dependencies and fix breakage + +**Goal:** update the upstream `datafusion` crate version and make the project +build, test, and lint cleanly against it. + +1. Update the `datafusion` dependency in the root `Cargo.toml` (workspace + section and dependencies). Any downstream `datafusion-*` crates pinned in + `crates/core/Cargo.toml` should move to the matching version. +2. Run `cargo update -p datafusion` (or `cargo update` for a broader refresh) + so `Cargo.lock` reflects the new pin. +3. Run the standard build and test commands and address compilation errors, + API renames, signature changes, and behavior changes: + - `cargo build` + - `cargo test` + - `pytest` + - `pre-commit run --all-files` +4. Fix only what's needed to restore green CI. Resist the urge to bundle + unrelated cleanups — those belong in their own PR. +5. If a breaking change in upstream requires a user-facing API change in + `datafusion-python`, add the `api change` label and document the change + in the PR description so it surfaces in the changelog. + +**Reference PRs:** [#1311](https://github.com/apache/datafusion-python/pull/1311) +(DF51), [#1337](https://github.com/apache/datafusion-python/pull/1337) (DF52). + +## PR 2: Consolidate transitive dependencies + +**Goal:** after the upstream bump, the dependency tree may have multiple +versions of the same transitive crate (for example, two `arrow` versions, two +`object_store` versions). Reconcile these so we ship a single coherent set. + +1. Inspect the lockfile for duplicates: + ```bash + cargo tree --duplicates + ``` +2. 
For each duplicate that matters (Arrow, `object_store`, `parquet`, + `tokio`, `arrow-flight`, etc.), update our direct dependency declarations + in `Cargo.toml` to versions compatible with what upstream DataFusion now + pulls in. The goal is one version of each ecosystem-critical crate. +3. Re-run `cargo update` and re-run the full test matrix. Some duplicates are + benign (small leaf crates with no FFI surface) and can be left alone if + reconciliation would force a much larger change. Use judgment. +4. If consolidating forces a behavioral change visible to users (for example, + a newer `pyarrow`-compatible Arrow version), call it out in the PR + description. + +Keeping this work separate from PR 1 means PR 1 stays a "make it compile" +review and PR 2 stays a "tidy the dependency graph" review. + +## PR 3: Fill API and documentation gaps + +**Goal:** with the upstream version locked in, identify new APIs that landed +upstream and decide whether to expose them, and update agent-facing +documentation so it still matches the surface we ship. + +1. Run the `check-upstream` skill (`.ai/skills/check-upstream/SKILL.md`) to + diff the upstream Rust API against what's exposed in + `python/datafusion/`. The skill covers scalar/aggregate/window/table + functions, `DataFrame` methods, `SessionContext` methods, and FFI types. + Invoke it from the assistant with `/check-upstream` (optionally scoped to + one area, e.g. `/check-upstream scalar functions`). +2. For each gap, decide whether to: + - Expose it now (small, obvious additions can land in this PR). + - File a tracking issue (anything non-trivial — separate PR per feature + keeps reviews focused). + - Skip it (internal-only or already covered by an existing API; record + the decision in the "Evaluated and not requiring exposure" sections of + the skill so future runs don't re-flag it). +3. 
Cross-reference the user-facing skill at + [`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md) + against the current public API. Look for stale function names, missing + newly exposed APIs, and examples that drifted from current behavior. + Update `SKILL.md` and the relevant RST pages under + `docs/source/user-guide/common-operations/` accordingly. (An + `audit-skill-md` skill is planned to automate this step — once it lands + under `.ai/skills/audit-skill-md/`, invoke it via `/audit-skill-md`.) +4. If new aggregate or window functions were exposed in step 2, also update: + - `docs/source/user-guide/common-operations/aggregations.rst` + - `docs/source/user-guide/common-operations/windows.rst` + +PR 3 is the natural place to land small Pythonic-interface improvements +discovered during the audit. Larger reshapes should still get their own PR. + +## Why three PRs + +- **Bisectable.** If a regression appears, `git bisect` lands on the + responsible PR (compile fix, dependency consolidation, or API addition) + rather than a single mega-commit. +- **Reviewable.** Each PR has a single concern. Reviewers reading PR 1 don't + need to also reason about whether new APIs are well-named. +- **Skippable.** Some upstream syncs are pure version bumps with no new APIs + worth exposing. PR 3 can be empty or merged as a no-op if the audit comes + back clean. + +## Related documents + +- [`README.md`](README.md) — the broader release process (this sync work + feeds into the next official release). +- [`.ai/skills/check-upstream/SKILL.md`](../../.ai/skills/check-upstream/SKILL.md) + — API coverage audit. +- [`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md) + — user-facing agent guide kept in sync via PR 3. 
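PR 2's consolidation step centers on reading `cargo tree --duplicates` output. As a rough sketch of how that triage can be mechanized (the crate names and versions below are hypothetical, and the parser assumes the default tree-drawing output where each duplicated package prints a `name vX.Y.Z` header at column 0 followed by indented dependents):

```python
import re
from collections import defaultdict

def find_duplicates(tree_output: str) -> dict:
    """Group the column-0 `name vX.Y.Z` header lines of
    `cargo tree --duplicates` output by crate name, keeping only
    crates that appear with more than one version."""
    versions = defaultdict(list)
    for line in tree_output.splitlines():
        # Dependent lines are indented or start with tree-drawing
        # characters, so only package headers match at column 0.
        m = re.match(r"^([A-Za-z0-9_-]+) v([0-9]\S*)", line)
        if m:
            versions[m.group(1)].append(m.group(2))
    return {name: vs for name, vs in versions.items() if len(vs) > 1}

# Hypothetical excerpt of `cargo tree --duplicates` output:
sample = """\
arrow v53.0.0
├── datafusion v53.0.0
arrow v54.0.0
└── parquet v54.0.0
object_store v0.11.0
object_store v0.12.0
"""
print(find_duplicates(sample))
# -> {'arrow': ['53.0.0', '54.0.0'], 'object_store': ['0.11.0', '0.12.0']}
```

Anything this reports with more than one version is a candidate for reconciliation in PR 2; crates that appear only as a dependent (like `datafusion` above) drop out.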
From 1c4cfe9b108497961d5a439ad852a00a4e94b1e3 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sun, 3 May 2026 11:18:12 -0400 Subject: [PATCH 2/9] docs: add audit-skill-md skill New AI agent skill at .ai/skills/audit-skill-md/SKILL.md to keep the user-facing skills/datafusion_python/SKILL.md in sync with the public Python API. Audits SessionContext, DataFrame, Expr, and functions surfaces for new APIs not covered, stale mentions, examples that drifted from idiomatic style, and missing version notes. Wired into PR 3 of the upstream sync workflow documented in dev/release/upstream-sync.md. Co-Authored-By: Claude Opus 4.7 (1M context) --- .ai/skills/audit-skill-md/SKILL.md | 283 +++++++++++++++++++++++++++++ dev/release/upstream-sync.md | 15 +- 2 files changed, 291 insertions(+), 7 deletions(-) create mode 100644 .ai/skills/audit-skill-md/SKILL.md diff --git a/.ai/skills/audit-skill-md/SKILL.md b/.ai/skills/audit-skill-md/SKILL.md new file mode 100644 index 000000000..4d25f4e72 --- /dev/null +++ b/.ai/skills/audit-skill-md/SKILL.md @@ -0,0 +1,283 @@ + + +--- +name: audit-skill-md +description: Audit the user-facing skill at skills/datafusion_python/SKILL.md against the current public Python API. Find new APIs that should be documented, stale mentions of removed/renamed APIs, examples that drifted from current idiomatic style, and places that need a "requires upstream DataFusion vX" note. Run after upstream syncs and before each release. +argument-hint: [scope] (e.g., "session-context", "dataframe", "expr", "functions", "patterns", "pitfalls", "version-notes", "all") +--- + +# Audit `skills/datafusion_python/SKILL.md` + +You are auditing the user-facing skill at +[`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md) +against the current state of the Python API. The skill is the source of truth +for how AI coding assistants are taught to write `datafusion-python` code, so +it must match what the project actually ships. 
This skill identifies gaps +caused by upstream syncs, refactors, or renames, and (if asked) applies the +edits directly to `SKILL.md`. + +The skill is most usefully run **after** the `check-upstream` step of an +upstream sync (see `dev/release/upstream-sync.md`) — once any new APIs are +exposed, this skill makes sure they get documented. + +## What the skill covers + +The user-facing `SKILL.md` documents these public surfaces. This list is not +exhaustive — if a new top-level area is added (e.g., a new `Catalog` API +exposed at the package root), include it. + +| Surface | Module | Sections in SKILL.md | +|---|---|---| +| `SessionContext` | `python/datafusion/context.py` | "Data Loading" | +| `DataFrame` | `python/datafusion/dataframe.py` | "DataFrame Operations Quick Reference", "Executing and Collecting Results", "Idiomatic Patterns" | +| `Expr` | `python/datafusion/expr.py` | "Expression Building", "Common Pitfalls" | +| `functions` | `python/datafusion/functions.py` | "Available Functions (Categorized)", scattered uses throughout | +| Top-level helpers (`col`, `lit`, `WindowFrame`, ...) | `python/datafusion/__init__.py` | "Import Conventions", "Core Abstractions" | + +## Scope argument + +The user may specify a scope via `$ARGUMENTS` to limit the audit. If no scope +is given or `all` is specified, audit every area. 
+ +| Scope | Audit target | +|---|---| +| `session-context` | `SessionContext` methods and the "Data Loading" section | +| `dataframe` | `DataFrame` methods and the operations / executing / patterns sections | +| `expr` | `Expr` methods/operators and the "Expression Building" section | +| `functions` | `functions.py` `__all__` and the "Available Functions (Categorized)" section | +| `patterns` | "Idiomatic Patterns" section — confirm patterns still match recommended style | +| `pitfalls` | "Common Pitfalls" — confirm each pitfall still reproduces, drop ones fixed upstream | +| `version-notes` | Cross-check version annotations (see below) | +| `all` | Everything above | + +## Inputs to read + +Before producing the report: + +1. `skills/datafusion_python/SKILL.md` — the document being audited. +2. The relevant Python module(s) for the chosen scope. Public surface is the + `__all__` list (where defined) plus `class` and `def` symbols not prefixed + with `_`. +3. `Cargo.toml` (root) for the current `datafusion-python` version (e.g., + `version = "53.0.0"`). The major version always matches the upstream + `datafusion` crate, so a single `datafusion-python` version expresses + both. `python/datafusion/__init__.py`'s `__version__` is the same value + exposed at runtime. +4. Recent commits touching the relevant module(s) for context on what + changed since the last sync: + ```bash + git log --oneline -- python/datafusion/dataframe.py | head -20 + ``` + +## What to look for + +Walk through each scoped area and flag four kinds of issues. + +### 1. New APIs not mentioned + +For each public symbol in the module's `__all__` (or each public class +method), check whether it appears anywhere in `SKILL.md`. A symbol is +"covered" if it shows up in: + +- A code block (the strongest signal — it's demonstrated). +- The "Available Functions (Categorized)" list. +- The SQL-to-DataFrame Reference table. 
+ +**Decide whether each missing symbol deserves an entry.** Not every public +symbol belongs in `SKILL.md` — the skill is curated for the patterns users +hit daily, not exhaustive API reference. Use these heuristics: + +- **Add it** if it replaces or supersedes something already in the skill + (e.g., a new operation that is the idiomatic alternative to a documented + workaround). +- **Add it** if it fits a category already present (a new aggregate function + goes in the aggregate list; a new join type goes in the joining section). +- **Add it** if it changes how a documented pattern should be written. +- **Skip it** if it is genuinely niche / advanced / experimental. +- **Skip it** if it is internal plumbing exposed for FFI but not user-facing. + +When you flag a missing symbol, include a one-line proposed insertion point +(which section / which table row) so a reviewer can decide quickly. + +### 2. Stale mentions + +For each function name, method name, or import shown in `SKILL.md`, verify it +still exists in the current API: + +- Function names mentioned in prose or in the categorized list should appear + in `python/datafusion/functions.py`'s `__all__`. +- Method calls in code blocks should resolve against the current class. +- Imports (`from datafusion import ...`) should succeed against the current + `__init__.py`. + +A quick way to check imports without running them: + +```bash +python -c "from datafusion import SessionContext, col, lit; from datafusion import functions as F; print('ok')" +``` + +For each stale mention, propose either: +- a rename to the current name, or +- removal if the API is gone with no replacement. + +### 3. Examples that drifted from idiomatic style + +The skill teaches a Pythonic style: prefer plain strings to `col(...)` when a +column reference is all you need; prefer raw Python values to `lit(...)` +where auto-wrapping applies. 
Recent refactors (see the `make-pythonic` +skill) keep moving more functions toward accepting native types. + +For each code example in `SKILL.md`, check: + +- Does it use `lit(value)` where a raw value would work? Comparison RHS, + arithmetic with a column, etc. all auto-wrap. (Reserve `lit()` for the + cases listed in pitfall #2.) +- Does it use `col("name")` where a plain string would work? `select(...)`, + `aggregate([keys], ...)`, `sort(...)`, `sort_by(...)` all accept plain + name strings. +- Do `functions.py` calls match the current pythonic signature for that + function? If `make-pythonic` recently changed a signature (e.g., + `repeat(string, n: Expr | int)`), the example should pass `3` rather than + `lit(3)`. +- Does any example use a deprecated or removed parameter name? + +For drift, propose the updated snippet. If the change is purely stylistic +and the older form still works, mark the suggestion as **non-blocking**. + +### 4. Missing or stale version notes + +When an API depends on a specific version, the skill should say so — +otherwise an agent referencing the skill in an older project will write +code that fails at import or at runtime. + +`datafusion-python` shares its major version number with the upstream +`datafusion` crate (e.g., `datafusion-python 53.x` tracks upstream +`datafusion 53`). Always express version requirements in terms of +`datafusion-python` only — there is no need to call out upstream and +package versions separately. + +Add a version note when: + +- A method or function shown in the skill was added in a specific release + (e.g., a new `DataFrame` method that didn't exist before 53). +- A breaking change altered behavior in a specific release (signature + change, default-value change, new required argument). +- A pitfall was fixed in a specific release. Either annotate the pitfall + block with "fixed in datafusion-python NN, kept here for users on older + versions" or remove it once the supported floor moves past that version. 
Format for version notes (inline, italicized):

```markdown
*Requires datafusion-python 53 or newer.*
```

For each missing/stale version note, propose the exact line and where it
belongs.

## How to discover changes since the last audit

If the user supplies a previous version or commit SHA where the audit was
last run, diff against it:

```bash
# Public-API-relevant changes since SHA
git log --oneline <SHA>..HEAD -- python/datafusion/

# Whose signatures actually moved
git diff <SHA>..HEAD -- python/datafusion/functions.py | grep '^[+-]def '
```

If no prior audit point is given, fall back to "since the last upstream
sync" by inspecting commits that touch `Cargo.toml`'s `datafusion` pin:

```bash
git log --oneline -- Cargo.toml | grep -i datafusion | head -5
```

## Output Format

Produce a report grouped by scope. Each finding is one bullet with a
proposed action, so a maintainer can review the list quickly and apply
edits in order.

```
## SKILL.md Audit (scope: <scope>)

Audited against:
- skills/datafusion_python/SKILL.md @ <commit>
- datafusion-python <version>

### New APIs to cover
- `DataFrame.foo()` — added in datafusion-python 53. Insert in "DataFrame Operations Quick Reference" under <subsection>.
  Proposed snippet:
  ```python
  df.foo(...)
  ```

### Stale mentions
- "old_function_name" referenced in the categorized list (line N) — renamed to "new_function_name". Replace.

### Drifted examples
- "Filtering" section, `df.filter(col("a") > lit(10))` — drop `lit(10)`, auto-wrap applies. (non-blocking)
- "Aggregation" section, `df.aggregate([col("region")], ...)` — pass `"region"` as a plain string per "Projection" guidance.

### Version notes
- `DataFrame.foo()` block needs *Requires datafusion-python 53 or newer.*
- "Common Pitfalls" #N — fixed in datafusion-python 53; remove the pitfall and update the SQL-to-DataFrame row to no longer flag the workaround.
+ +### No-change confirmed +- `SessionContext` data-loading section — all entries match current API. +``` + +If asked to apply the changes, edit `skills/datafusion_python/SKILL.md` +directly with `Edit` tool calls, one finding at a time, and re-run the +relevant doctest sanity check at the end: + +```bash +pytest --doctest-modules python/datafusion -q +``` + +## What NOT to flag + +- **Internal helpers / underscored names.** Private symbols are not part of + the user-facing surface. +- **Functions intentionally omitted.** Niche / advanced APIs (custom + catalogs, raw FFI plumbing, low-level execution plan accessors) live in + the API reference, not the skill. If an omission was deliberate and a + comment / commit explains why, leave it out. +- **Style nits inside explanatory prose.** The skill mixes example code and + prose; only enforce the pythonic style on actual code blocks. +- **Function-by-function coverage of every `functions.py` symbol.** The + "Available Functions (Categorized)" list is curated by category, not + exhaustive. Adding a single new aggregate to the aggregate list is + enough — the user follows the pointer to the API reference for the rest. + +## Coordination with other skills + +- Run `/check-upstream` first to expose any missing upstream APIs into the + Python layer. Without that, this skill cannot recommend documenting + something that is not yet exposed. +- Run `/make-pythonic` before this skill if a Pythonic-signature pass is + planned for a release — that way this skill can update examples to the + final signature in one shot rather than churning them twice. +- The order during an upstream sync (PR 3 of `dev/release/upstream-sync.md`) + is therefore: `/check-upstream` → `/make-pythonic` (optional) → + `/audit-skill-md`. 
diff --git a/dev/release/upstream-sync.md b/dev/release/upstream-sync.md index 068e71450..ee7c4d707 100644 --- a/dev/release/upstream-sync.md +++ b/dev/release/upstream-sync.md @@ -98,14 +98,15 @@ documentation so it still matches the surface we ship. - Skip it (internal-only or already covered by an existing API; record the decision in the "Evaluated and not requiring exposure" sections of the skill so future runs don't re-flag it). -3. Cross-reference the user-facing skill at +3. Run the `audit-skill-md` skill (`.ai/skills/audit-skill-md/SKILL.md`) to + cross-reference the user-facing skill at [`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md) - against the current public API. Look for stale function names, missing - newly exposed APIs, and examples that drifted from current behavior. - Update `SKILL.md` and the relevant RST pages under - `docs/source/user-guide/common-operations/` accordingly. (An - `audit-skill-md` skill is planned to automate this step — once it lands - under `.ai/skills/audit-skill-md/`, invoke it via `/audit-skill-md`.) + against the current public API. The skill flags stale function names, + missing newly exposed APIs, examples that drifted from idiomatic style, + and missing version notes. Invoke it from the assistant with + `/audit-skill-md` (optionally scoped, e.g. `/audit-skill-md dataframe`). + Apply the resulting edits to `SKILL.md` and to the relevant RST pages + under `docs/source/user-guide/common-operations/`. 4. 
If new aggregate or window functions were exposed in step 2, also update: - `docs/source/user-guide/common-operations/aggregations.rst` - `docs/source/user-guide/common-operations/windows.rst` From c88a88ab51681ab3c8af1b776ee7fe92b46192de Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sun, 3 May 2026 11:19:12 -0400 Subject: [PATCH 3/9] docs: verify upstream sync completed before release Add a checklist item to "Preparing the main Branch" pointing release managers at dev/release/upstream-sync.md so the crate bump, dependency consolidation, and /check-upstream and /audit-skill-md passes are confirmed done before the release branch is cut. Co-Authored-By: Claude Opus 4.7 (1M context) --- dev/release/README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/dev/release/README.md b/dev/release/README.md index 8518ece1f..1b02ca576 100644 --- a/dev/release/README.md +++ b/dev/release/README.md @@ -59,6 +59,9 @@ You will also need access to the [datafusion](https://test.pypi.org/project/data Before creating a new release: - We need to ensure that the main branch does not have any GitHub dependencies +- Confirm the upstream sync workflow in [`upstream-sync.md`](upstream-sync.md) has been completed for this release cycle + (crate bump + breakage fixes, transitive dependency consolidation, and the `/check-upstream` and `/audit-skill-md` + passes). Any gaps surfaced by those skills should land before the release branch is cut. - a PR should be created and merged to update the major version number of the project - A new release branch should be created, such as `branch-53` - It is best to push this branch to the apache repository rather than a personal fork in case patch releases are required. 
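The checklist item about GitHub dependencies lends itself to a mechanical spot-check. A minimal sketch (the manifest excerpt is hypothetical, and it only handles single-line inline dependency tables, not multi-line `[dependencies.name]` sections — a real check would use a TOML parser):

```python
import re

def find_git_deps(cargo_toml_text: str) -> list:
    """Return names of dependencies declared with an inline
    `git = ...` source; a release candidate should report none."""
    hits = []
    for line in cargo_toml_text.splitlines():
        # Match `name = { ... git = ... }` on a single line.
        m = re.match(r'\s*([A-Za-z0-9_-]+)\s*=\s*\{[^}]*\bgit\s*=', line)
        if m:
            hits.append(m.group(1))
    return hits

# Hypothetical manifest excerpt:
sample = """
[workspace.dependencies]
datafusion = { version = "54.0.0" }
datafusion-ffi = { git = "https://github.com/apache/datafusion", branch = "main" }
"""
print(find_git_deps(sample))  # -> ['datafusion-ffi']
```

A non-empty result means the upstream sync left a GitHub pin in place, which must be moved back to a crates.io release before the release branch is cut.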
From d841206b8538364ae5680f65f9b6c4ac60f94399 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sun, 3 May 2026 12:26:26 -0400 Subject: [PATCH 4/9] docs: scope upstream sync cargo update to datafusion family Replace `cargo update -p datafusion` with an explicit multi-`-p` invocation listing every `datafusion-*` workspace dependency, so PR 1 of the upstream-sync workflow refreshes only the datafusion family and leaves other transitives for PR 2 to consolidate. Co-Authored-By: Claude Opus 4.7 (1M context) --- dev/release/upstream-sync.md | 28 ++++++++++++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-) diff --git a/dev/release/upstream-sync.md b/dev/release/upstream-sync.md index ee7c4d707..ba8ebdc69 100644 --- a/dev/release/upstream-sync.md +++ b/dev/release/upstream-sync.md @@ -38,8 +38,32 @@ build, test, and lint cleanly against it. 1. Update the `datafusion` dependency in the root `Cargo.toml` (workspace section and dependencies). Any downstream `datafusion-*` crates pinned in `crates/core/Cargo.toml` should move to the matching version. -2. Run `cargo update -p datafusion` (or `cargo update` for a broader refresh) - so `Cargo.lock` reflects the new pin. +2. Update `Cargo.lock` for the datafusion family only — leave unrelated + transitives at their current pins so PR 2 can address them deliberately. + List every `datafusion-*` workspace dependency with `-p`: + + ```bash + cargo update \ + -p datafusion \ + -p datafusion-substrait \ + -p datafusion-proto \ + -p datafusion-ffi \ + -p datafusion-catalog \ + -p datafusion-common \ + -p datafusion-functions-aggregate \ + -p datafusion-functions-window \ + -p datafusion-expr + ``` + + Or pin exact versions with `--precise`, one crate at a time: + + ```bash + cargo update -p datafusion --precise 54.0.0 + # repeat for each datafusion-* sibling + ``` + + A bare `cargo update` would refresh every transitive crate and blur the + diff between PR 1 and PR 2. 3. 
Run the standard build and test commands and address compilation errors, API renames, signature changes, and behavior changes: - `cargo build` From 109928d71bf2f70a79182046ae75423c7d4b08b8 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sun, 3 May 2026 12:27:04 -0400 Subject: [PATCH 5/9] docs: correct datafusion-* pin location in upstream sync PR 1 step 1 incorrectly stated downstream `datafusion-*` crates are pinned in `crates/core/Cargo.toml`. Pins live in the root `[workspace.dependencies]`; per-crate manifests inherit via `workspace = true`. Reword step 1 to point at the right file. Co-Authored-By: Claude Opus 4.7 (1M context) --- dev/release/upstream-sync.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/dev/release/upstream-sync.md b/dev/release/upstream-sync.md index ba8ebdc69..49d23d616 100644 --- a/dev/release/upstream-sync.md +++ b/dev/release/upstream-sync.md @@ -35,9 +35,9 @@ time. **Goal:** update the upstream `datafusion` crate version and make the project build, test, and lint cleanly against it. -1. Update the `datafusion` dependency in the root `Cargo.toml` (workspace - section and dependencies). Any downstream `datafusion-*` crates pinned in - `crates/core/Cargo.toml` should move to the matching version. +1. Update every `datafusion-*` entry in `[workspace.dependencies]` of the + root `Cargo.toml` to the new major version. The per-crate manifests under + `crates/` inherit these pins via `workspace = true` and need no edit. 2. Update `Cargo.lock` for the datafusion family only — leave unrelated transitives at their current pins so PR 2 can address them deliberately. 
List every `datafusion-*` workspace dependency with `-p`: From 40da653d1787f7525c815c1eedb98b6be2698ef8 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sun, 3 May 2026 12:28:02 -0400 Subject: [PATCH 6/9] docs: restore workspace.package version bump in upstream sync PR 1 step 1 must also bump `[workspace.package].version` because the `datafusion-python` major version tracks the upstream `datafusion` major. The previous reword dropped that instruction. Reinstate it alongside the `[workspace.dependencies]` updates. Co-Authored-By: Claude Opus 4.7 (1M context) --- dev/release/upstream-sync.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/dev/release/upstream-sync.md b/dev/release/upstream-sync.md index 49d23d616..116b3b1b9 100644 --- a/dev/release/upstream-sync.md +++ b/dev/release/upstream-sync.md @@ -35,9 +35,15 @@ time. **Goal:** update the upstream `datafusion` crate version and make the project build, test, and lint cleanly against it. -1. Update every `datafusion-*` entry in `[workspace.dependencies]` of the - root `Cargo.toml` to the new major version. The per-crate manifests under - `crates/` inherit these pins via `workspace = true` and need no edit. +1. In the root `Cargo.toml`, update: + - `[workspace.package].version` to the new major (the `datafusion-python` + major tracks the upstream `datafusion` major, so a 53→54 bump moves + this from `53.0.0` to `54.0.0`), and + - every `datafusion` / `datafusion-*` entry in `[workspace.dependencies]` + to the same new major. + + Per-crate manifests under `crates/` inherit these pins via + `workspace = true` and need no edit. 2. Update `Cargo.lock` for the datafusion family only — leave unrelated transitives at their current pins so PR 2 can address them deliberately. 
List every `datafusion-*` workspace dependency with `-p`: From e827547d0415a4332e5b92f86b8e27a2f0b9707c Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sun, 3 May 2026 12:28:38 -0400 Subject: [PATCH 7/9] docs: align audit-skill-md description with body version phrasing Frontmatter description referenced "requires upstream DataFusion vX", but the body of the skill settles on the `datafusion-python NN` form (consistent with the package/upstream-major equivalence). Switch the description to match so the skill speaks one language end to end. Co-Authored-By: Claude Opus 4.7 (1M context) --- .ai/skills/audit-skill-md/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.ai/skills/audit-skill-md/SKILL.md b/.ai/skills/audit-skill-md/SKILL.md index 4d25f4e72..4bd0009e5 100644 --- a/.ai/skills/audit-skill-md/SKILL.md +++ b/.ai/skills/audit-skill-md/SKILL.md @@ -19,7 +19,7 @@ --- name: audit-skill-md -description: Audit the user-facing skill at skills/datafusion_python/SKILL.md against the current public Python API. Find new APIs that should be documented, stale mentions of removed/renamed APIs, examples that drifted from current idiomatic style, and places that need a "requires upstream DataFusion vX" note. Run after upstream syncs and before each release. +description: Audit the user-facing skill at skills/datafusion_python/SKILL.md against the current public Python API. Find new APIs that should be documented, stale mentions of removed/renamed APIs, examples that drifted from current idiomatic style, and places that need a "requires datafusion-python NN or newer" note. Run after upstream syncs and before each release. 
argument-hint: [scope] (e.g., "session-context", "dataframe", "expr", "functions", "patterns", "pitfalls", "version-notes", "all") --- From 06f85307349fd7cfcb92e55e002cefac0bb40437 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sun, 3 May 2026 12:29:56 -0400 Subject: [PATCH 8/9] docs: fold make-pythonic step into PR 3 of upstream sync Audit-skill-md documents the order `/check-upstream` -> `/make-pythonic` (optional) -> `/audit-skill-md`, but PR 3 of the upstream-sync workflow only listed the first and last. Insert the make-pythonic pass as step 3 so signatures get aligned before the SKILL.md audit, avoiding example churn. Drops the orphan trailing paragraph in favor of inline guidance on when to defer larger reshapes to their own PR. Co-Authored-By: Claude Opus 4.7 (1M context) --- dev/release/upstream-sync.md | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/dev/release/upstream-sync.md b/dev/release/upstream-sync.md index 116b3b1b9..dc05b4195 100644 --- a/dev/release/upstream-sync.md +++ b/dev/release/upstream-sync.md @@ -128,7 +128,15 @@ documentation so it still matches the surface we ship. - Skip it (internal-only or already covered by an existing API; record the decision in the "Evaluated and not requiring exposure" sections of the skill so future runs don't re-flag it). -3. Run the `audit-skill-md` skill (`.ai/skills/audit-skill-md/SKILL.md`) to +3. (Optional) Run the `make-pythonic` skill + (`.ai/skills/make-pythonic/SKILL.md`) over any newly exposed APIs to + align signatures with the project's Pythonic style (accepting plain + strings for column names, raw Python values where auto-wrapping + applies, etc.). Invoke it from the assistant with `/make-pythonic`. + Running this *before* the audit step means examples in `SKILL.md` get + updated to the final signature in one pass instead of churning twice. + Larger reshapes still belong in their own PR. +4. 
Run the `audit-skill-md` skill (`.ai/skills/audit-skill-md/SKILL.md`) to cross-reference the user-facing skill at [`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md) against the current public API. The skill flags stale function names, @@ -137,13 +145,10 @@ documentation so it still matches the surface we ship. `/audit-skill-md` (optionally scoped, e.g. `/audit-skill-md dataframe`). Apply the resulting edits to `SKILL.md` and to the relevant RST pages under `docs/source/user-guide/common-operations/`. -4. If new aggregate or window functions were exposed in step 2, also update: +5. If new aggregate or window functions were exposed in step 2, also update: - `docs/source/user-guide/common-operations/aggregations.rst` - `docs/source/user-guide/common-operations/windows.rst` -PR 3 is the natural place to land small Pythonic-interface improvements -discovered during the audit. Larger reshapes should still get their own PR. - ## Why three PRs - **Bisectable.** If a regression appears, `git bisect` lands on the From 8dd36c130a1ad61a7914215d622be81871f7dec6 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sun, 3 May 2026 12:37:46 -0400 Subject: [PATCH 9/9] docs: drop literal Cargo.toml version from audit-skill-md inputs Replace literal `version = "53.0.0"` example with a pointer to the `[workspace.package]` field plus an `NN.0.0` placeholder so the skill prose does not drift each major bump. Co-Authored-By: Claude Opus 4.7 (1M context) --- .ai/skills/audit-skill-md/SKILL.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/.ai/skills/audit-skill-md/SKILL.md b/.ai/skills/audit-skill-md/SKILL.md index 4bd0009e5..30e1a90fd 100644 --- a/.ai/skills/audit-skill-md/SKILL.md +++ b/.ai/skills/audit-skill-md/SKILL.md @@ -75,10 +75,11 @@ Before producing the report: 2. The relevant Python module(s) for the chosen scope. Public surface is the `__all__` list (where defined) plus `class` and `def` symbols not prefixed with `_`. -3. 
`Cargo.toml` (root) for the current `datafusion-python` version (e.g., - `version = "53.0.0"`). The major version always matches the upstream - `datafusion` crate, so a single `datafusion-python` version expresses - both. `python/datafusion/__init__.py`'s `__version__` is the same value +3. `Cargo.toml` (root) for the current `datafusion-python` version — read + the `version` field under `[workspace.package]` (format `NN.0.0`). The + major version always matches the upstream `datafusion` crate, so a + single `datafusion-python` version expresses both. + `python/datafusion/__init__.py`'s `__version__` is the same value exposed at runtime. 4. Recent commits touching the relevant module(s) for context on what changed since the last sync:
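The `[workspace.package]` version lookup described in that step can be sketched as follows (the manifest excerpt is hypothetical, and line matching stands in for a proper TOML parser):

```python
import re

def workspace_version(cargo_toml_text: str) -> str:
    """Read the `version` field from the [workspace.package] table
    of a root Cargo.toml (expected format NN.0.0)."""
    in_table = False
    for raw in cargo_toml_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Track which TOML table we are inside.
            in_table = line == "[workspace.package]"
            continue
        if in_table:
            m = re.match(r'version\s*=\s*"([^"]+)"', line)
            if m:
                return m.group(1)
    raise ValueError("no [workspace.package] version found")

# Hypothetical excerpt of the root Cargo.toml:
sample = """
[workspace.package]
version = "54.0.0"

[workspace.dependencies]
datafusion = { version = "54.0.0" }
"""
v = workspace_version(sample)
print(v, v.split(".")[0])  # -> 54.0.0 54
```

The major component is all the audit needs: it matches the upstream `datafusion` major, so version notes in `SKILL.md` can be phrased against `datafusion-python` alone.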