Describe the bug
For flat (non-nested) columns, unexpected_index_column_names works as expected.
However, with nested columns (for example, a Spark struct field such as Data.evt.id), it does not work: the index column is reported as missing from the DataFrame.
In addition, if either unexpected_index_column_names or domain_column_name_list is set to a nested column, unexpected_index_list cannot be returned.
To Reproduce
Example configuration and schema:
- Checkpoint configuration:
  - Index column: Data.evt.id
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name=checkpoint_name,
        validation_definitions=[validation_definition],
        actions=[gx.checkpoint.actions.UpdateDataDocsAction(name=checkpoint_name)],
        result_format={
            "result_format": "COMPLETE",
            "unexpected_index_column_names": [
                "Data.evt.id"
            ],
            "partial_unexpected_count": 0,
            "exclude_unexpected_values": False,
            "include_unexpected_rows": True,
            "return_unexpected_index_query": True,
        },
    )
)
- Suite configuration:
  - Validation column: Data.evt.retry
{
    "id": "9f5cdaeb-0959-4b8b-af8b-52934a9234db",
    "type": "expect_column_values_to_be_in_set",
    "kwargs": {
        "column": "Data.evt.retry",
        "value_set": ["0", "1"]
    },
    "meta": {}
}
After select(columns_to_keep) runs, Great Expectations no longer recognizes the original nested path (Data.evt.id), making it impossible to return unexpected_index_list.
Stack trace excerpt (simplified):
raise gx_exceptions.InvalidMetricAccessorDomainKwargsKeyError(
    f"Error: The unexpected_index_column 'Data.evt.id' does not exist in Spark DataFrame."
)
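The renaming behind this error can be reproduced outside Great Expectations. The sketch below is not taken from the GX code base: it assumes a local PySpark installation and a made-up one-row DataFrame that mirrors the Data.evt.id / Data.evt.retry nesting. The helper leaf_name is hypothetical and only models the name Spark gives a nested field that is selected by its dotted path without an alias.

```python
def leaf_name(path: str) -> str:
    """Model the output column name of an unaliased nested-field select.

    Spark keeps only the last path segment. Assumes the path has no
    backtick-quoted segments.
    """
    return path.split(".")[-1]


def demo_select_drops_nested_path() -> list:
    """Build a toy nested DataFrame and return its column names after select().

    Requires PySpark; the import is lazy so leaf_name() works without it.
    """
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # Toy data (an assumption): one row with a Data.evt struct holding id and retry.
    df = spark.createDataFrame([Row(Data=Row(evt=Row(id="a1", retry="0")))])
    columns_to_keep = ["Data.evt.id", "Data.evt.retry"]
    # The selected columns come back under their leaf names, so a later
    # lookup of "Data.evt.id" no longer resolves.
    return df.select(columns_to_keep).columns
```

Comparing demo_select_drops_nested_path() with [leaf_name(c) for c in columns_to_keep] shows the same names, which is why the lookup of the full path Data.evt.id fails after pruning.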
Expected behavior
- Nested column paths should be preserved during select(columns_to_keep).
- unexpected_index_column_names and domain_column_name_list should work properly with nested columns so that unexpected_index_list can be utilized.
Environment (please complete the following information):
- Operating System: macOS
- Great Expectations Version: 1.5.10
- Data Source: Spark
Additional context
The issue seems to be caused by schema pruning in map_condition_auxilliary_methods.py, where select(columns_to_keep) flattens nested structures and loses the original nested path.
great_expectations/great_expectations/expectations/metrics/map_metric_provider/map_condition_auxilliary_methods.py, line 740 in 264b4e0:
# Prune the dataframe down only the columns we care about
filtered = filtered.select(columns_to_keep)
Suggested Fix
Preserve nested column paths during pruning by aliasing columns explicitly:
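from pyspark.sql import functions as F

# Alias each selected column to its full dotted path so the name survives select()
aliased_cols = [F.col(c).alias(c) for c in columns_to_keep]
filtered = filtered.select(*aliased_cols)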
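One consequence of aliasing to the full dotted path is worth checking before adopting this fix: the resulting flat column has literal dots in its name, so later references to it likely need backtick quoting, otherwise Spark parses the dots as struct field access again. The sketch below illustrates this under that assumption. quote_flat_name and select_with_full_paths are hypothetical helpers, not GX functions.

```python
def quote_flat_name(name: str) -> str:
    """Wrap a flat column name in backticks so Spark reads it as one identifier.

    Embedded backticks are doubled, following Spark SQL identifier quoting.
    """
    return "`" + name.replace("`", "``") + "`"


def select_with_full_paths(df, columns_to_keep):
    """Select nested fields, keeping each full dotted path as the output name.

    Requires PySpark; the import is lazy so quote_flat_name() works without it.
    """
    from pyspark.sql import functions as F

    return df.select(*[F.col(c).alias(c) for c in columns_to_keep])
```

With a nested DataFrame df, usage might look like pruned = select_with_full_paths(df, ["Data.evt.id"]) followed by pruned.select(quote_flat_name("Data.evt.id")). Whether downstream GX code that reads the index columns already quotes names this way is not something this issue has verified.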