Describe the bug
The ColumnValuesValueLength metric provider behaves inconsistently across execution engines for the strict_min and strict_max parameters. The Pandas execution engine correctly implements strict (exclusive) bounds, but the Spark execution engine ignores these parameters entirely, treating both comparisons as inclusive.
To Reproduce
# This should exclude values with length exactly 2 and 4 (strict bounds)
expectation_config = {
    "expectation_type": "expect_column_value_lengths_to_be_between",
    "kwargs": {
        "column": "text_col",
        "min_value": 2,
        "max_value": 4,
        "strict_min": True,
        "strict_max": True,
    },
}
# With Pandas: correctly excludes length 2 and 4, only length 3 passes
# With Spark (before fix): incorrectly includes length 2 and 4 (ignores strict parameters)
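To make the discrepancy concrete, here is a pure-Python stand-in for the two comparisons (illustrative only; the real providers operate on Pandas Series and Spark Columns):

```python
# Sample strings whose lengths span the interesting boundary values.
values = ["a", "ab", "abc", "abcd", "abcde"]
lengths = [len(v) for v in values]

# Inclusive bounds: what the buggy Spark implementation always computes.
inclusive = [n for n in lengths if 2 <= n <= 4]
# Strict bounds: what strict_min=True / strict_max=True should compute.
strict = [n for n in lengths if 2 < n < 4]

print(inclusive)  # [2, 3, 4]
print(strict)     # [3]
```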
Original buggy Spark implementation (great_expectations/great_expectations/expectations/metrics/column_map_metrics/column_value_lengths.py, lines 241 to 248 in b1c632f):

if min_value is not None and max_value is not None:
    return (column_lengths >= min_value) & (column_lengths <= max_value)
elif min_value is None and max_value is not None:
    return column_lengths <= max_value
elif min_value is not None and max_value is None:
    return column_lengths >= min_value
Expected behavior
All execution engines (Pandas, Spark, SQLAlchemy) should handle strict_min and strict_max parameters consistently:
- When strict_min=True: use > instead of >= for the minimum bound
- When strict_max=True: use < instead of <= for the maximum bound
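As a reference point, the Pandas engine's strict-bounds handling can be approximated with a boolean-Series sketch (an analogue for illustration, not the provider's actual code; the helper name is made up here):

```python
import pandas as pd

def lengths_between(lengths, min_value, max_value, strict_min=False, strict_max=False):
    # Pick the exclusive operator when the corresponding strict flag is set.
    lower = lengths > min_value if strict_min else lengths >= min_value
    upper = lengths < max_value if strict_max else lengths <= max_value
    return lower & upper

column_lengths = pd.Series([2, 3, 4])
print(lengths_between(column_lengths, 2, 4).tolist())  # [True, True, True]
print(lengths_between(column_lengths, 2, 4,
                      strict_min=True, strict_max=True).tolist())  # [False, True, False]
```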
Proposed Fix:
I have implemented a fix that adds proper strict bounds logic to the Spark engine, matching the behavior of the Pandas engine:
# Fixed implementation
if min_value is None:
    if strict_max:
        return column_lengths < max_value
    else:
        return column_lengths <= max_value
elif max_value is None:
    if strict_min:
        return column_lengths > min_value
    else:
        return column_lengths >= min_value
else:
    if strict_min and strict_max:
        return (column_lengths > min_value) & (column_lengths < max_value)
    elif strict_min:
        return (column_lengths > min_value) & (column_lengths <= max_value)
    elif strict_max:
        return (column_lengths >= min_value) & (column_lengths < max_value)
    else:
        return (column_lengths >= min_value) & (column_lengths <= max_value)
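To sanity-check the branching, a plain-Python sweep over all four strict-flag combinations (ints standing in for Spark columns; the helper name is made up for illustration) gives the expected filters for min_value=2, max_value=4:

```python
import itertools

def fixed_filter(lengths, min_value, max_value, strict_min=False, strict_max=False):
    # Element-wise analogue of the fixed Spark branching above.
    def keep(n):
        if min_value is None:
            return n < max_value if strict_max else n <= max_value
        if max_value is None:
            return n > min_value if strict_min else n >= min_value
        lower = n > min_value if strict_min else n >= min_value
        upper = n < max_value if strict_max else n <= max_value
        return lower and upper
    return [n for n in lengths if keep(n)]

for strict_min, strict_max in itertools.product([False, True], repeat=2):
    print(strict_min, strict_max, fixed_filter([1, 2, 3, 4, 5], 2, 4, strict_min, strict_max))
# False False [2, 3, 4]
# False True  [2, 3]
# True  False [3, 4]
# True  True  [3]
```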
Environment (please complete the following information):
- Operating System: macOS
- Great Expectations Version: 1.5.10
- Data Source: Spark
- Cloud environment: Local development
Additional context