
feat: route use_sea=True through ADBC-Rust kernel via PyO3#782

Draft
vikrantpuppala wants to merge 1 commit into main from adbc-rust-backend

Conversation

@vikrantpuppala
Contributor

Summary

Draft / RFC. Companion to the PyO3 satellite binding being prototyped in adbc-drivers/databricks#423.

Adds a new backend, `AdbcDatabricksClient`, that delegates query execution to the `databricks_adbc_pyo3` extension module (PyO3 bindings over the Databricks ADBC Rust kernel). When `use_sea=True` is passed to `sql.connect`, requests now flow through the Rust kernel instead of the existing Python-SEA backend.

What this proves out

The kernel-strategy design (docs/kernel-strategy-final-recommendation.md in the kernel repo) calls for use_sea=True to be powered by a single Rust SEA implementation shared across all Databricks language drivers. This PR is the Python-side wiring to validate that path end-to-end.

Performance vs. the existing Thrift backend (dogfood warehouse, randomized interleaved benchmark, median wall time, `fetchall_arrow` path):

| result size | ADBC-Rust | Thrift | ratio |
|---|---|---|---|
| SELECT 1 | 394 ms | 387 ms | 1.02× |
| 10K | 893 ms | 1014 ms | 0.88× |
| 100K | 1148 ms | 1145 ms | 1.00× |
| 500K | 2178 ms | 3305 ms | 0.66× |
| 1M | 3579 ms | 3814 ms | 0.94× |
| 10M | 8677 ms | 8802 ms | 0.99× |
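The methodology behind these numbers (randomized interleaving of the two backends, median wall time) can be sketched as follows. This is an illustrative harness, not the actual benchmark script; the placeholder workloads stand in for `fetchall_arrow()` calls against each backend:

```python
import random
import statistics
import time

def bench_interleaved(workloads, runs=5, seed=0):
    """Run each workload `runs` times in a randomized interleaved order
    and return the median wall time per workload, in milliseconds."""
    rng = random.Random(seed)
    # One schedule entry per (workload, run); shuffling interleaves the
    # backends so warm-up and warehouse noise don't favor either side.
    schedule = [name for name in workloads for _ in range(runs)]
    rng.shuffle(schedule)

    timings = {name: [] for name in workloads}
    for name in schedule:
        start = time.perf_counter()
        workloads[name]()
        timings[name].append((time.perf_counter() - start) * 1000.0)

    return {name: statistics.median(ts) for name, ts in timings.items()}

# Placeholder workloads; the real benchmark calls fetchall_arrow()
# through the ADBC-Rust and Thrift backends respectively.
medians = bench_interleaved({
    "adbc-rust": lambda: sum(range(10_000)),
    "thrift": lambda: sum(range(20_000)),
})
print({k: round(v, 3) for k, v in medians.items()})
```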

What's wired through the public API

- `sql.connect(..., use_sea=True)` opens a Rust-kernel-backed session
- `cursor.execute(sql)` runs queries (sync, PAT-only)
- `cursor.fetchone()` / `fetchmany(n)` / `fetchall()` return `Row` namedtuples
- `cursor.fetchall_arrow()` / `fetchmany_arrow(n)` return `pyarrow.Table` (zero-copy from Rust via the Arrow C Data Interface)
- `cursor.description` returns PEP-249 7-tuples derived from the Arrow schema
- iteration (`for row in cursor`) and context managers
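As a sketch of the row-fetch path above: `fetchall()` pivots the columnar Arrow result into `Row` namedtuples. A simplified, pyarrow-free illustration of that column-to-row pivot (the helper and field names are illustrative, not the backend's actual code):

```python
from collections import namedtuple

def rows_from_columns(names, columns):
    """Pivot columnar data (as it arrives from an Arrow result) into
    PEP-249-style Row namedtuples, one per result row."""
    Row = namedtuple("Row", names, rename=True)  # rename=True guards odd column names
    return [Row(*values) for values in zip(*columns)]

names = ["id", "name"]
columns = [[1, 2, 3], ["a", "b", "c"]]
rows = rows_from_columns(names, columns)
print(rows[0])       # Row(id=1, name='a')
print(rows[2].name)  # 'c'
```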

What is NOT yet wired (raises NotImplementedError)

- Parameterized queries (`parameters=[...]`)
- Async execution (`async_op=True`) and `cancel()`
- Metadata methods (`cursor.catalogs()` / `schemas()` / `tables()` / `columns()`)
- Auth: PAT only; no OAuth M2M, U2M, Azure SP, or external credential providers
- Staging operations
- Ctrl-C signal handling and the logging bridge into Python `logging`
- A native exception hierarchy; for now all kernel errors map to `DatabaseError` / `OperationalError` / `ProgrammingError`
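Since there is no native exception hierarchy yet, kernel errors are folded into the three PEP-249 classes named above. A hedged sketch of what such a mapping could look like — the `kind` strings and the helper name are assumptions for illustration, not the kernel's actual error taxonomy:

```python
# Stand-ins for the driver's PEP-249 exception classes.
class DatabaseError(Exception): ...
class OperationalError(DatabaseError): ...
class ProgrammingError(DatabaseError): ...

# Illustrative mapping; the real backend inspects errors raised by
# the databricks_adbc_pyo3 extension module.
_KIND_TO_EXC = {
    "io": OperationalError,      # network failures, warehouse unavailable
    "syntax": ProgrammingError,  # bad SQL, unknown column, etc.
}

def map_kernel_error(kind, message):
    """Translate a kernel error into a PEP-249 exception instance,
    falling back to the generic DatabaseError."""
    return _KIND_TO_EXC.get(kind, DatabaseError)(message)

err = map_kernel_error("syntax", "unexpected token 'SELEC'")
print(type(err).__name__)  # ProgrammingError
```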

Code layout

```
src/databricks/sql/backend/adbc/
├── __init__.py      # re-exports AdbcDatabricksClient
├── client.py        # DatabricksClient impl, delegates to PyO3
└── result_set.py    # ResultSet impl over the streaming PyO3 ResultSet,
                     # with batch buffering for fetchone / fetchmany
```
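The batch buffering in `result_set.py` works roughly like this: the PyO3 `ResultSet` streams Arrow batches, and `fetchone` / `fetchmany` slice rows out of a buffered batch, pulling the next batch only when the buffer runs dry. A simplified sketch over plain Python lists (class and method bodies are illustrative, not the actual implementation):

```python
class BufferedResultSet:
    """Minimal sketch: buffer one streamed batch at a time and serve
    fetchone/fetchmany from it, spilling across batch boundaries."""

    def __init__(self, batches):
        self._batches = iter(batches)  # stands in for the streaming PyO3 ResultSet
        self._buffer = []              # rows left over from the current batch

    def _fill(self):
        # Pull batches until we have at least one row or the stream ends.
        while not self._buffer:
            try:
                self._buffer = list(next(self._batches))
            except StopIteration:
                return False
        return True

    def fetchone(self):
        return self._buffer.pop(0) if self._fill() else None

    def fetchmany(self, n):
        rows = []
        while len(rows) < n and self._fill():
            take = n - len(rows)
            rows.extend(self._buffer[:take])
            del self._buffer[:take]
        return rows

rs = BufferedResultSet([[1, 2, 3], [4, 5]])
print(rs.fetchmany(4))  # [1, 2, 3, 4] -- spans the batch boundary
print(rs.fetchone())    # 5
print(rs.fetchone())    # None (stream exhausted)
```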

The old backend/sea/ tree is left in place and unreachable from sql.connect; deletion is a separate cleanup once this backend reaches parity with the rest of the design doc.

Why draft?

`databricks_adbc_pyo3` is not yet on PyPI. CI here will fail to import the new backend until the satellite is published. To run locally:

```shell
git clone https://github.com/adbc-drivers/databricks
cd databricks/rust-pyo3
python3 -m venv .venv && source .venv/bin/activate
pip install 'maturin>=1.5,<2.0' pyarrow
maturin develop --release
```

```shell
cd /path/to/databricks-sql-python
pip install -e .
DATABRICKS_HOST=... DATABRICKS_HTTP_PATH=... DATABRICKS_TOKEN=... python -c "
from databricks import sql
with sql.connect(server_hostname=..., http_path=..., access_token=..., use_sea=True) as c:
    print(c.cursor().execute('SELECT 1').fetchall())
"
```

Open questions

1. Should this PR ship together with the kernel + satellite PRs, or be sequenced (kernel first, satellite second, this third)?
2. The original Python-SEA design doc (python-driver-rust-adbc-sea-design.md in the kernel repo) plans for deletion of the existing `backend/sea/` tree. Is keeping it in place for one release an acceptable migration window?
3. Authentication: do we want to bring up OAuth M2M before merging, or is PAT-only acceptable as a v0?

Test plan

- `import databricks.sql` works
- `sql.connect(use_sea=True)` succeeds against a dogfood warehouse with a PAT
- Small inline result via `fetchone()` / `fetchall_arrow()`
- 1M-row CloudFetch result via `fetchall_arrow()`
- `fetchmany(n)` slices correctly across batch boundaries
- `cursor.description` returns sensible types
- OAuth, async, metadata, parameterized queries: all explicitly out of scope
- CI integration depends on the PyO3 binding being published

This pull request and its description were written by Isaac.

