BUG: DataFrame.rank() loses ExtensionArray dtype (GH#52829) #65120

Open
maheshmakvana wants to merge 1 commit into pandas-dev:main from
maheshmakvana:fix-52829-dataframe-rank-preserve-ea-dtype

Problem

DataFrame.rank() silently converts ExtensionArray-backed columns (e.g. ArrowDtype, nullable Int64) to float64, while Series.rank() correctly preserves the original dtype.

import pandas as pd
import pyarrow as pa

s = pd.Series([1, 2], dtype=pd.ArrowDtype(pa.int32()))
df = s.to_frame(name="a")

s.rank(method="min").dtype        # uint64[pyarrow] (dtype preserved)
df.rank(method="min")["a"].dtype  # float64 before the fix,
                                  # uint64[pyarrow] after the fix

Root cause

In the ranker() helper inside NDFrame.rank(), the DataFrame (2-D) branch read data.values, which consolidates all blocks into a single NumPy array and discards the ExtensionArray dtype information.
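
The loss of type information can be observed from the public API alone. The snippet below is a minimal sketch, not code from this PR: it uses a nullable Int64 column (so pyarrow is not required) and the public .array accessor as a stand-in for the per-column values the Series path works with.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# One nullable-integer column, backed by an ExtensionArray.
df = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64")})

# Per-column access keeps the ExtensionArray (and therefore its dtype).
column_values = df["a"].array

# Whole-frame access returns a plain NumPy ndarray, so the EA dtype is gone
# before any ranking code gets to see it.
frame_values = df.values

print(type(column_values).__name__, isinstance(column_values, ExtensionArray))
print(type(frame_values).__name__, isinstance(frame_values, np.ndarray))
```

Any ranking code that starts from the second form has no way to dispatch to the array's own _rank().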

Fix

For axis=0 (the default), use _mgr.apply() to process each block independently:

  • 1-D EAs (ArrowExtensionArray, IntegerArray, etc.) call their own _rank(), preserving dtype.
  • NumPy blocks and 2-D EAs (DatetimeArray, TimedeltaArray) are transposed, ranked with algos.rank(axis=0), and transposed back.

For axis=1, the old data.values path is kept, because cross-column ranking must see all columns at once.
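
The intended axis=0 behaviour can be approximated with public API only, which is also a convenient way to state what the fix should produce. This is a rough sketch under stated assumptions, not the implementation in this PR: rank_columns_independently is a hypothetical helper name, it goes through Series.rank() rather than _mgr.apply(), and it assumes a DataFrame with at least one column.

```python
import pandas as pd


def rank_columns_independently(df: pd.DataFrame, **rank_kwargs) -> pd.DataFrame:
    """Hypothetical helper: rank every column on its own (axis=0 semantics).

    Each column goes through Series.rank(), so whatever dtype Series.rank()
    returns for that column is what ends up in the result.
    """
    ranked = [col.rank(**rank_kwargs) for _, col in df.items()]
    # concat along axis=1 reassembles the frame without consolidating
    # the columns into one NumPy array.
    return pd.concat(ranked, axis=1)


df = pd.DataFrame(
    {
        "a": pd.array([3, 1, 2], dtype="Int64"),  # ExtensionArray-backed
        "b": [0.5, 0.1, 0.9],                     # plain NumPy float64
    }
)
out = rank_columns_independently(df, method="min")

# Column by column, the result dtype matches what Series.rank() gives.
for name in df.columns:
    assert out[name].dtype == df[name].rank(method="min").dtype
```

This only works for axis=0, where columns are independent. With axis=1 every row spans all columns, so no per-column (or per-block) decomposition is possible.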

Closes #52829

When calling .rank() on a DataFrame whose columns use ExtensionArray dtypes
(e.g. ArrowDtype, nullable Int64), the result dtype was silently converted to
float64 because the 2-D code path used data.values, which consolidates all
blocks into a single numpy array.

Fix: for axis=0 (the default), use _mgr.apply() to process each block
independently. 1-D EAs (ArrowExtensionArray, IntegerArray, etc.) are handled
by their own _rank() method, preserving dtype. numpy blocks and 2-D EAs
(DatetimeArray, TimedeltaArray) are transposed before calling algos.rank so
that the ranking axis is correct, then transposed back. For axis=1 the old
data.values consolidation path is kept because cross-column ranking cannot be
done block-by-block.

Closes: pandas-dev#52829

Successfully merging this pull request may close these issues.

BUG: DataFrame.rank does not return EA types when original type was an EADtype
