I may have some idea about this, but my idea does not touch the native Kedro API.
This is a pattern that I run into quite often, and I always use the same hack to fix it.
Excuse me if I mix up my 1.0.0 syntax, I haven't gotten used to it yet 😁

## Setup
I have an evaluation pipeline for my experiments that can be namespaced. Let's say I want to evaluate model training for `model_a`, `model_b`, and `model_c`. My `pipeline_registry.py` would have code that looks like:

Each `evaluation` pipeline outputs a namespaced results file, so my `catalog.yaml` might have an entry like:

Finally, imagine I have a final pipeline in which I want to do some overall evaluation of all models.
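The code snippets referenced above did not survive in this copy. As a rough, non-authoritative reconstruction — the `base_eval` name, dataset type, and file paths are my guesses, and the Kedro calls are kept in comments so the sketch stays self-contained:

```python
# pipeline_registry.py, roughly. With Kedro's modular-pipeline API this
# would be something like (guessed, not the original author's code):
#
#   from kedro.pipeline import pipeline
#   eval_pipelines = [
#       pipeline(base_eval, namespace=m)
#       for m in ["model_a", "model_b", "model_c"]
#   ]
#
# The part that matters for the discussion is the naming convention:
# each namespaced evaluation pipeline writes "<namespace>.results".

MODELS = ["model_a", "model_b", "model_c"]  # hard-coded, as in the original setup

def evaluation_outputs(models):
    """Dataset names produced by the namespaced evaluation pipelines."""
    return [f"{model}.results" for model in models]

# catalog.yaml would then carry one matching entry per model, e.g.:
#
#   model_a.results:
#     type: pandas.CSVDataset                      # dataset type is a guess
#     filepath: data/08_reporting/model_a_results.csv
```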
Say our `combined_eval/pipeline.py` looks like:

## Problem

Of course, this works only as long as I have exactly these 3 models, never more or fewer. If I want to add a model, I have to update both the pipeline registry and the evaluation pipeline with the new model. This is a small example, but in a larger project the overhead adds up.
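To make the problem concrete, the hand-maintained `combined_eval/pipeline.py` presumably enumerates every results dataset by name. A sketch (the Kedro `node(...)` wiring is in comments; `combine_results` and its toy body are illustrative assumptions):

```python
# combined_eval/pipeline.py, roughly. In Kedro terms:
#
#   from kedro.pipeline import Pipeline, node
#   def create_pipeline(**kwargs):
#       return Pipeline([
#           node(
#               combine_results,
#               inputs=["model_a.results", "model_b.results", "model_c.results"],
#               outputs="combined_results",
#           ),
#       ])

def combine_results(*results):
    """Toy combiner: merge the per-model result mappings into one dict."""
    combined = {}
    for result in results:
        combined.update(result)
    return combined
```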
## Hacky solution
I always end up adding a list to my `settings.py` with:

Then my `pipeline_registry.py` becomes:

Then the pipeline definition is:
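The three snippets above were lost in this copy; filling the gap with one plausible sketch (the `MODELS` name and the Kedro calls in comments are my assumptions, not the original code):

```python
# settings.py: the single source of truth for the model list.
MODELS = ["model_a", "model_b", "model_c"]

# pipeline_registry.py then loops over it instead of hard-coding namespaces:
#
#   from kedro.pipeline import pipeline
#   from my_project.settings import MODELS
#   eval_pipelines = [pipeline(base_eval, namespace=m) for m in MODELS]

def combined_inputs(models):
    """Build combine_results' input list from the configured model names."""
    return [f"{model}.results" for model in models]

# ...and combined_eval/pipeline.py builds its node inputs dynamically:
#
#   node(combine_results, inputs=combined_inputs(MODELS), outputs="combined_results")
```

Adding a model is then a one-line change in `settings.py`.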
Not perfect, but it does what I want.
## Discussion
Is there a right way to do this? Or could there be in the future?
Something like defining the `combine_results` node like:

## Alternate scenarios

Instead of a node's inputs, this could also be a node's outputs.
## Alternate solutions
A lifetime ago, I want to say I've seen @deepyaman do a fancy wildcard search on the catalog itself to populate a node's inputs for a join / concatenate operation. However, this required constructing the catalog and passing it to the `create_pipeline` function.
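That catalog-driven trick might look roughly like the following — a sketch from memory, not @deepyaman's actual code; only the wildcard matching itself is shown, and the `node(...)` call that would use the result is left in a comment:

```python
from fnmatch import fnmatch

def resolve_wildcard_inputs(pattern, dataset_names):
    """Expand a wildcard like '*.results' against known catalog dataset names."""
    return sorted(name for name in dataset_names if fnmatch(name, pattern))

# With a constructed catalog in hand, create_pipeline could then do:
#
#   inputs = resolve_wildcard_inputs("*.results", catalog.list())
#   node(combine_results, inputs=inputs, outputs="combined_results")
```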