Thanks for the proposal @framunoz, we will discuss it at the next Backlog Grooming. What do you think @ElenaKhaustova?
1. Problem Statement
Currently, Kedro manages versioning at the individual dataset level. When a pipeline produces multiple related artifacts (e.g., a Model, its Evaluation Metrics, and its Input Parameters), they are stored in separate directory structures. This leads to fragmentation: related artifacts receive different version timestamps and live in different locations, making it hard to tell which metrics and parameters belong to which model.
2. Proposed Solution: GroupedDataset
The GroupedDataset acts as a container dataset. It allows multiple files to be bundled under a single versioned directory, ensuring that all related outputs share the same timestamp and physical location.
Key Features:
- GroupedDataset builds the filepath for each sub-dataset based on the group's versioned path.
3. Implementation Blueprint
Catalog Configuration
The user defines the group and its constituents. The extension key is used to build the final filename within the versioned folder.
Technical Logic (Python)
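For illustration, a hypothetical catalog entry for such a group might look like the following. The type path, dataset names, and exact keys are assumptions for the sketch, not the proposal's final syntax:

```yaml
# Hypothetical catalog.yml entry -- syntax illustrative, not final
model_artifacts:
  type: my_project.datasets.GroupedDataset  # assumed import path
  path: data/07_model_output/model_artifacts
  versioned: true
  datasets:
    model:
      extension: pkl   # extension key builds the final filename
    metrics:
      extension: json
```

Under this scheme, a single versioned folder would hold the whole group, e.g. data/07_model_output/model_artifacts/<timestamp>/model.pkl and .../<timestamp>/metrics.json.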
The implementation leverages AbstractVersionedDataset to handle Kedro's versioning protocol while delegating the actual I/O to sub-datasets (grouped_dataset.py).
4. Impact & Benefits
Deleting a single versioned folder under data/07_model_output/ removes the complete set of related files.