FIX: silent data loss in MultiIndex __setitem__ with object-dtype level #65119
Open
Bahtya wants to merge 2 commits into pandas-dev:main
…type level

When setting a top-level column on a DataFrame with a MultiIndex where level 1 has object dtype (mixed types), assignments to single-subcolumn groups are silently dropped.

Root cause: In _set_item_frame_value, the guard for GH#62518/GH#61841 (avoiding reindex into empty-string columns) used is_string_dtype(cols_droplevel.dtype) and not cols_droplevel.any(). For object-dtype Index containing integer 0, is_string_dtype returns True and any() returns False (0 is falsy), causing the early return to trigger incorrectly.

Fix: Replace not cols_droplevel.any() with (cols_droplevel == "").all() to explicitly check for empty strings instead of relying on truthiness.

Fixes pandas-dev#65118

Signed-off-by: bahtya <bahtyar153@qq.com>
Member
Pls add test

Author
Thanks for the review! I'll add a test for the silent data loss case in MultiIndex

Author
Hi @jbrockmendel, thanks for the review! I'll add tests for this fix. Working on it now.

Author
Hi @jbrockmendel, I've added three test cases for this fix:
All tests pass locally. Ready for re-review!
Problem
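`df[key] = df[key] / x` silently drops the assignment (no error, no warning) when the DataFrame's columns are a MultiIndex whose level 1 has `object` dtype due to mixed types (e.g. string + int) and `key` selects a group with a single sub-column.

This is a regression from 2.3.x → 3.0.x and causes silent data loss.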
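A minimal reproduction sketch of the conditions described above. The frame, its labels, and the divisor are hypothetical and are not taken from the bug report. Whether the assignment is lost depends on the installed pandas version, so the snippet reports the outcome instead of assuming it.

```python
import pandas as pd

# Hypothetical frame: level 1 mixes an int (0) with strings,
# so that level ends up with object dtype.
columns = pd.MultiIndex.from_tuples([("a", 0), ("b", "x"), ("b", "y")])
df = pd.DataFrame([[2.0, 10.0, 20.0], [4.0, 30.0, 40.0]], columns=columns)

# "a" is a top-level key with exactly one sub-column (label 0).
sub = df["a"]
expected = sub / 2

# The assignment pattern from the description: df[key] = df[key] / x
df["a"] = expected

# On an affected version, df["a"] would still hold the original values.
assignment_applied = df["a"].equals(expected)
print("assignment applied:", assignment_applied)
```

If `assignment applied: False` is printed, the installed version shows the silent drop described here.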
Root Cause
In `_set_item_frame_value`, the guard added for GH#62518/GH#61841 checks `is_string_dtype(cols_droplevel.dtype) and not cols_droplevel.any()`.

When `maybe_droplevels` produces `Index([0], dtype="object")`:

- `is_string_dtype(object)` → `True`
- `Index([0]).any()` → `False` (because `0` is falsy)

This causes the early return to trigger incorrectly, silently discarding the assignment.
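The two conditions can be checked in isolation with public pandas APIs. This is a sketch of the guard's logic only, not the code inside `_set_item_frame_value`; the variable name `cols_droplevel` is borrowed from the description, and the other names are illustrative.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Stand-in for the Index left after dropping the top level.
cols_droplevel = pd.Index([0], dtype="object")

# First half of the guard: object dtype is treated as string-like.
dtype_check = is_string_dtype(cols_droplevel.dtype)

# Second half: truthiness of the labels. The only label is 0, which is falsy.
any_check = bool(cols_droplevel.any())

# The old guard fires when the dtype looks string-like and no label is truthy.
guard_fires = dtype_check and not any_check
print(dtype_check, any_check, guard_fires)
```

The guard fires even though no label is an empty string, which is the mismatch the description points at.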
Fix
Replace `not cols_droplevel.any()` with `(cols_droplevel == "").all()` to explicitly check for empty strings. This correctly identifies actual empty-string columns without being fooled by falsy integer values in object-dtype Indexes.
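To compare the two conditions side by side, the sketch below wraps each in a small helper. `old_guard` and `new_guard` are hypothetical names introduced for illustration; they assume the `is_string_dtype` half of the guard is unchanged and only the second half is swapped, as the description states.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def old_guard(cols_droplevel: pd.Index) -> bool:
    # Truthiness-based check: any falsy label (0, "", ...) satisfies it.
    return bool(is_string_dtype(cols_droplevel.dtype) and not cols_droplevel.any())


def new_guard(cols_droplevel: pd.Index) -> bool:
    # Explicit check: every label must be the empty string.
    return bool(is_string_dtype(cols_droplevel.dtype) and (cols_droplevel == "").all())


falsy_int = pd.Index([0], dtype="object")           # the case from this PR
empty_strings = pd.Index(["", ""], dtype="object")  # the case the guard was written for

for name, cols in [("falsy int", falsy_int), ("empty strings", empty_strings)]:
    print(name, old_guard(cols), new_guard(cols))
```

Only the empty-string Index should satisfy the new condition, while the old condition accepts both.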
Testing
Verified locally that the fix resolves the issue from the bug report:
Also confirmed the original GH#62518/GH#61841 cases are still protected since their columns genuinely contain only empty strings.
Fixes #65118