Spark Execution Engine: unexpected_index_column_names does not work with Nested Columns #11381

@parkseulkee

Describe the bug
With the Spark execution engine, unexpected_index_column_names works as expected for flat (non-nested) columns.
When it names a nested column such as Data.evt.id, validation fails because the column can no longer be found in the DataFrame.
More generally, when either unexpected_index_column_names or domain_column_name_list refers to a nested column, unexpected_index_list is difficult to use.

To Reproduce
Example configuration and schema:

  • Checkpoint configuration:
    • Index column: Data.evt.id
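import great_expectations as gx

# `context`, `checkpoint_name`, and `validation_definition` are assumed to be defined earlier.
checkpoint = context.checkpoints.add(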
    gx.Checkpoint(
        name=checkpoint_name,
        validation_definitions=[validation_definition],
        actions=[gx.checkpoint.actions.UpdateDataDocsAction(name=checkpoint_name)],
        result_format={
            "result_format": "COMPLETE",
            "unexpected_index_column_names": [
                "Data.evt.id"
            ],
            "partial_unexpected_count": 0,
            "exclude_unexpected_values": False,
            "include_unexpected_rows": True,
            "return_unexpected_index_query": True,
        },
    )
)
  • Suite configuration:
    • Validation column: Data.evt.retry
{
  "id": "9f5cdaeb-0959-4b8b-af8b-52934a9234db",
  "type": "expect_column_values_to_be_in_set",
  "kwargs": {
    "column": "Data.evt.retry",
    "value_set": ["0", "1"]
  },
  "meta": {}
}
  • Schema before select():

    root
     |-- Data: struct (nullable = true)
     |    |-- evt: struct (nullable = true)
     |    |    |-- id: string (nullable = true)
     |    |    |-- retry: string (nullable = true)
    
  • Schema after select(columns_to_keep):

    root
     |-- id: string (nullable = true)
     |-- retry: integer (nullable = true)
    

At this point, Great Expectations no longer recognizes the original nested path (Data.evt.id), making it impossible to return unexpected_index_list.
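The name mismatch described above can be illustrated without a Spark session. The sketch below is only a pure-Python model of how Spark names the output of select() for dotted struct paths (for example, df.select("Data.evt.id").columns yields ["id"]). The helper names are hypothetical and are not part of Great Expectations or PySpark, and the model assumes plain dotted paths with no backtick quoting.

```python
def names_after_select(columns_to_keep):
    """Model the column names Spark returns for select() on dotted struct paths.

    Assumption: each entry is a plain dotted path with no backtick quoting.
    Spark names the selected field after its leaf, so "Data.evt.id" becomes "id".
    """
    return [path.split(".")[-1] for path in columns_to_keep]


def unresolved_index_columns(index_columns, columns_to_keep):
    """Return the index columns that cannot be found by name after pruning."""
    available = set(names_after_select(columns_to_keep))
    return [name for name in index_columns if name not in available]


columns_to_keep = ["Data.evt.id", "Data.evt.retry"]
print(names_after_select(columns_to_keep))                          # ['id', 'retry']
print(unresolved_index_columns(["Data.evt.id"], columns_to_keep))   # ['Data.evt.id']
```

A flat index column such as "id" survives this pruning under its own name, which matches the observation that only nested paths are affected.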

Stack trace excerpt (simplified):

raise gx_exceptions.InvalidMetricAccessorDomainKwargsKeyError(
    f"Error: The unexpected_index_column 'Data.evt.id' does not exist in Spark DataFrame."
)

Expected behavior

  • Nested column paths should be preserved during select(columns_to_keep).
  • unexpected_index_column_names and domain_column_name_list should work properly with nested columns so that unexpected_index_list can be utilized.

Environment

  • Operating System: macOS
  • Great Expectations Version: 1.5.10
  • Data Source: Spark

Additional context
The issue seems to be caused by schema pruning in map_condition_auxilliary_methods.py, where select(columns_to_keep) flattens nested structures and loses the original nested path.

# Prune the dataframe down only the columns we care about
filtered = filtered.select(columns_to_keep)

Suggested Fix

Preserve nested column paths during pruning by aliasing columns explicitly:

aliased_cols = [F.col(c).alias(c) for c in columns_to_keep]
filtered = filtered.select(*aliased_cols)
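One consequence of the aliasing approach is worth spelling out: the pruned DataFrame then has columns whose names literally contain dots, so later lookups likely need backtick quoting, otherwise Spark parses the name as a struct path again. The following is a minimal sketch under that assumption. prune_preserving_paths and quote_column are hypothetical helpers, not Great Expectations functions, and the sketch has not been validated against the Great Expectations code base.

```python
def quote_column(name):
    """Backtick-quote a column name so Spark treats dots as literal characters.

    Embedded backticks are doubled, which is how Spark escapes them.
    """
    return "`" + name.replace("`", "``") + "`"


def prune_preserving_paths(df, columns_to_keep):
    """Select columns_to_keep, aliasing each one to its full dotted path.

    Hypothetical helper. Assumes every entry is either a flat column name
    or a dotted path into a struct (no names that already contain dots).
    """
    # Imported lazily so quote_column stays usable without pyspark installed.
    from pyspark.sql import functions as F

    return df.select(*[F.col(c).alias(c) for c in columns_to_keep])
```

With the schema from this report, prune_preserving_paths(df, ["Data.evt.id", "Data.evt.retry"]) should return a DataFrame whose columns are named "Data.evt.id" and "Data.evt.retry". Reading one of them back would then look like pruned.select(quote_column("Data.evt.id")), so any downstream code that looks up unexpected_index_column_names by name may need the same quoting.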
