Thanks for the proposal @framunoz, we will discuss it at the next Backlog Grooming. What do you think @ElenaKhaustova?
1. Problem Statement
Currently, Kedro manages versioning at the individual dataset level. When a pipeline produces multiple related artifacts (e.g., a Model, its Evaluation Metrics, and its Input Parameters), they are stored in separate directory structures. This leads to fragmentation: related artifacts receive different version timestamps and live in different locations, making it hard to tell which metrics and parameters belong to which model.
2. Proposed Solution: GroupedDataset
The GroupedDataset acts as a container dataset. It allows multiple files to be bundled under a single versioned directory, ensuring that all related outputs share the same timestamp and physical location.
Key Features:
- GroupedDataset builds the filepath for each sub-dataset based on the group's versioned path.
3. Implementation Blueprint
Catalog Configuration
The user defines the group and its constituents. The extension key is used to build the final filename within the versioned folder.
Technical Logic (Python)
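For illustration, a hypothetical catalog entry for such a group might look like the following. The type path, dataset names, and exact keys are assumptions for the sketch, not the proposal's final syntax:

```yaml
# Hypothetical catalog.yml entry -- syntax illustrative, not final
model_artifacts:
  type: my_project.datasets.GroupedDataset  # assumed import path
  path: data/07_model_output/model_artifacts
  versioned: true
  datasets:
    model:
      extension: pkl   # extension key builds the final filename
    metrics:
      extension: json
```

Under this scheme, a single versioned folder would hold the whole group, e.g. data/07_model_output/model_artifacts/<timestamp>/model.pkl and .../<timestamp>/metrics.json.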
The implementation leverages AbstractVersionedDataset to handle Kedro's versioning protocol while delegating the actual I/O to sub-datasets (grouped_dataset.py).
4. Impact & Benefits
Deleting a single versioned folder under data/07_model_output/ removes the complete set of related files.