BUG: DataFrame.rank() loses ExtensionArray dtype (GH#52829) #65120
Open
maheshmakvana wants to merge 1 commit into pandas-dev:main
When calling .rank() on a DataFrame whose columns use ExtensionArray dtypes (e.g. ArrowDtype, nullable Int64), the result dtype was silently cast to float64 because the 2-D code path used data.values, which consolidates all blocks into a single numpy array.

Fix: for axis=0 (the default), use _mgr.apply() to process each block independently. 1-D EAs (ArrowExtensionArray, IntegerArray, etc.) are handled by their own _rank() method, preserving dtype. numpy blocks and 2-D EAs (DatetimeArray, TimedeltaArray) are transposed before calling algos.rank so that the ranking axis is correct, then transposed back.

For axis=1 the old data.values consolidation path is kept because cross-column ranking cannot be done block-by-block.

Closes: pandas-dev#52829
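The dtype difference described in the commit message can be checked with a short script along these lines. This is a minimal sketch using only public pandas API; the column name and sample values are made up for illustration, and the dtypes printed depend on the pandas version in use, so no expected output is shown.

```python
import pandas as pd

# One nullable-integer column with a missing value (illustrative data).
df = pd.DataFrame({"a": pd.array([3, 1, None], dtype="Int64")})

series_ranks = df["a"].rank()  # 1-D path: ranks the column's own array
frame_ranks = df.rank()        # 2-D path: axis=0 is the default

# Compare the result dtypes of the two code paths.
print("input dtype:           ", df["a"].dtype)
print("Series.rank() dtype:   ", series_ranks.dtype)
print("DataFrame.rank() dtype:", frame_ranks["a"].dtype)
```

If the two printed result dtypes differ, the installed pandas still takes the consolidating path for DataFrames.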
Problem
DataFrame.rank() silently converts ExtensionArray-backed columns (e.g. ArrowDtype, nullable Int64) to float64, while Series.rank() correctly preserves the original dtype.

Root cause
In NDFrame.ranker(), the DataFrame branch used data.values, which consolidates all blocks into a single numpy array, stripping EA type info.

Fix
For axis=0 (default), use _mgr.apply() to process each block independently:
- 1-D EAs (ArrowExtensionArray, IntegerArray, etc.) are handled by their own _rank(), preserving dtype.
- numpy blocks and 2-D EAs (DatetimeArray, TimedeltaArray) are transposed, ranked with algos.rank(axis=0), and transposed back.

axis=1 keeps the old data.values path (cross-column ranking must see all columns at once).

Closes #52829
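The idea behind the fix can be illustrated from user code without touching pandas internals. The sketch below is not the implementation in this PR (which goes through the internal _mgr.apply()); rank_columnwise is a hypothetical helper that ranks each column separately through the public Series.rank(), and it assumes unique column names and axis=0 ranking only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": pd.array([3, 1, 2], dtype="Int64"),
        "b": pd.array([0.5, 0.1, 0.9], dtype="Float64"),
    }
)

# Consolidation: .values always returns one 2-D numpy array for the whole
# frame, so the per-column ExtensionArray dtypes are no longer available.
consolidated = df.values
print(type(consolidated).__name__, consolidated.dtype)


def rank_columnwise(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rank every column on its own (axis=0 only).

    Each column goes through Series.rank(), so whatever dtype the 1-D
    path produces for that column is carried into the result.
    """
    ranked = {name: col.rank() for name, col in frame.items()}
    return pd.DataFrame(ranked, index=frame.index)


per_column = rank_columnwise(df)
print(per_column.dtypes)
```

Ranking along axis=1 cannot be split up this way, because each row's ranks depend on values from every column at once.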