Replies: 17 comments 10 replies
-
Some folks on my team have been doing this for a while and it's pretty ergonomic. A basic implementation is trivial:

```python
from kedro.pipeline import node

def make_node(**kwargs):
    return lambda func: node(func=func, **kwargs)
```
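For a feel of the mechanics without a Kedro installation, here is the same factory pattern against a stand-in `FakeNode` class (both `FakeNode` and `add` are illustrative names, not Kedro APIs):

```python
# Stand-in for kedro.pipeline.node, just to show the factory mechanics.
class FakeNode:
    def __init__(self, func, inputs, outputs):
        self.func, self.inputs, self.outputs = func, inputs, outputs

def make_node(**kwargs):
    # Returns a decorator that wraps a function into a node with the given config.
    return lambda func: FakeNode(func=func, **kwargs)

@make_node(inputs=["a", "b"], outputs="sum")
def add(a, b):
    return a + b

print(add.outputs)     # sum - the decorated name is now a node, not a function
print(add.func(2, 3))  # 5 - the original function is still reachable via .func
```

Note that after decoration the module-level name is bound to the node object, not the function, which is exactly the reuse concern raised later in the thread.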
-
Copying here my comments from #2408. As noted already, there's a problem here if you want to reuse the same function in multiple nodes, and when you start using modular pipelines. These are not insurmountable with the decorator approach, but it does get cumbersome. That said, I actually think there's potentially a lot of value in the decorator method for simple projects. Just like auto-registration of pipelines, anything we can do to make it easier for people to make a simple project is good in my book. Certainly just slapping a decorator on a function is very low friction. In fact, a while ago there was a QB tool that did something similar and took the simplification even further. I actually think there's a lot to be said for at least some aspects of this approach, although it won't ever replace the current system.
-
This is a reasonable default, but I suspect it is less scalable, as you are likely to use generic variable names.

This is my favourite part; it's sometimes annoying to search across 10 `pipeline.py` files.
-
Definitely agree with @antonymilne: this functionality would not replace the existing scheme, just act as sugar for what I suspect is a decent chunk of use cases. The possibility of automatically inferring names and arguments would be slick.
-
Just coming back to it again: it may be quite useful for smoothing the Kedro learning curve. I have been teaching some people about Kedro lately, and sometimes they get stuck on how to write a node. Then I ask, "forget about Kedro, how would you write it in a normal Python script?" Suddenly they can write the code properly and then just map it to Kedro. I suspect this happens more for beginners when they try to jump too fast and think about nodes. @astrojuanlu, have you ever encountered this? Is it worth checking with other people who teach Kedro?
-
Originally posted by @mle-els in #2726 (comment)
-
I think it's safe to say that the current approach will not go away for the time being, so the question would be whether to maintain this as an alternative, recommended approach. And from that perspective, it looks like a lot of documentation and maintainability effort, plus the potential confusion of users having to pick between the two approaches, all for little gain. There have been some very valid concerns about this approach, most importantly that it blends functions with nodes, which hampers reusability of said functions by coupling them to Kedro. I'm closing this issue; if you disagree and think this deserves further discussion, feel free to leave a comment.
-
I don't think this approach blends functions with nodes: if a function is not decorated, it's a simple function and can be used with whatever inputs and outputs. I think this actually helps create separation of concerns between functions and nodes. Let me try to illustrate this:

```python
# This is a regular, reusable function - can be used anywhere, with any inputs/outputs
def reusable_fn(x, y):
    return x + y

# This is explicitly a pipeline node
@node(inputs=["a", "b"], outputs="sum")
def pipeline_step(a, b):
    return reusable_fn(a, b)
```

I agree that at that point, creating a function that returns another decorated function is similarly convoluted to what it would simply be replacing:

```python
# Without decorator
node(
    func=pipeline_step,
    inputs=["a", "b"],
    outputs="sum",
)
```

but it streamlines the cases where a function serves a single node, and hence can be defined standalone. I only suggested this because I've seen it as a syntax in ZenML, Airflow tasks, etc., but I understand the concern about having two separate ways of doing things. If the decorator replaced the existing syntax entirely, that would be a different discussion. As @noklam said in Slack (https://kedro-org.slack.com/archives/C03RKP2LW64/p1737118683984249):
-
Decided to take a stab at what it looks like in a "real" (but very simple) pipeline. I used https://github.com/catherinenelson1/from_notebooks_to_scalable/blob/main/penguins_refactored.py as the example (the repo corresponding to Catherine's talk at PyCon US on going from notebooks to scalable systems), because I think that there's a case to be made for Kedro being the logical progression in the learning curve.

**Complete implementation**

```python
import joblib
import pandas as pd
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, Node
from kedro.runner import SequentialRunner
from kedro_datasets.pandas import CSVDataset
from kedro_datasets.pickle import PickleDataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

_SENTINEL = object()


def node(
    func=None,
    inputs=_SENTINEL,
    outputs=_SENTINEL,
    *,
    name=None,
    tags=None,
    confirms=None,
    namespace=None,
) -> Node:
    if func is None:
        assert inputs is not _SENTINEL, "Inputs must be provided when using the decorator syntax."
        assert outputs is not _SENTINEL, "Outputs must be provided when using the decorator syntax."

        def wrapper(func):
            return Node(
                func,
                inputs,
                outputs,
                name=name,
                tags=tags,
                confirms=confirms,
                namespace=namespace,
            )

        return wrapper
    else:
        assert inputs is not _SENTINEL, "Inputs must be provided when using the function syntax."
        assert outputs is not _SENTINEL, "Outputs must be provided when using the function syntax."
        return Node(
            func,
            inputs,
            outputs,
            name=name,
            tags=tags,
            confirms=confirms,
            namespace=namespace,
        )


catalog = DataCatalog(
    {
        "penguins": CSVDataset(
            filepath="https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/penguins.csv"
        ),
        "encoder": PickleDataset(
            filepath="penguins_label_encoder.joblib",
            backend="joblib",
        ),
        "model": PickleDataset(
            filepath="penguins_model.joblib",
            backend="joblib",
        ),
    }
)


@node(inputs="penguins", outputs=["feature", "labels"])
def clean_data(penguins):
    # drop rows with missing values
    # return numpy arrays for features and labels
    df = penguins.dropna(subset=['species', 'bill_length_mm', 'bill_depth_mm',
                                 'flipper_length_mm', 'body_mass_g'])
    features = df[['bill_length_mm', 'bill_depth_mm',
                   'flipper_length_mm', 'body_mass_g']].to_numpy()
    labels = df['species'].to_numpy()
    return features, labels


@node(inputs=["feature", "labels"], outputs=["encoder", "X_train", "X_test", "y_train", "y_test"])
def preprocess_data(features, labels):
    # encode categorical variables
    # scale numerical features
    # split the data into train/test features and labels
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features)
    encoder = LabelEncoder()
    labels_encoded = encoder.fit_transform(labels)
    X_train, X_test, y_train, y_test = train_test_split(features_scaled, labels_encoded, test_size=0.2, random_state=42)
    return encoder, X_train, X_test, y_train, y_test


@node(inputs=["X_train", "y_train", "X_test", "y_test"], outputs="model")
def train_model(X_train, y_train, X_test, y_test):
    # train a model and save it
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    # print the model's accuracy
    accuracy = clf.score(X_test, y_test)
    print(f'Model accuracy: {accuracy:.2f}')
    return clf


@node(inputs=["X_new", "model", "encoder"], outputs="predicted_class")
def predict_new_data(X_new, clf, encoder):
    # load the model and make predictions
    # return a string of the predicted class
    prediction = clf.predict(X_new)
    # decode the prediction
    predicted_class = encoder.inverse_transform(prediction)
    return predicted_class[0]


training_pipeline = Pipeline([clean_data, preprocess_data, train_model])

if __name__ == "__main__":
    runner = SequentialRunner()
    output_datasets = runner.run(pipeline=training_pipeline, catalog=catalog)
```

I think it works quite well (at least, as @antonymilne said long ago, for the simple projects without function reuse).
I didn't do this here (and I did use a different name). From an implementation perspective, if we wanted to do something like this, it seems fairly straightforward to retain the existing API. Putting my new user hat on, the part that still feels like a jump is the fact that these "disconnected" nodes get turned into a pipeline. This would still be much more explicit:

```python
@node
def clean_data(penguins):
    ...

@node
def preprocess_data(features, labels):
    ...

@node
def train_model(X_train, y_train, X_test, y_test):
    ...

@pipeline
def training_pipeline(penguins):
    features, labels = clean_data(penguins)
    encoder, X_train, X_test, y_train, y_test = preprocess_data(features, labels)
    model = train_model(X_train, y_train, X_test, y_test)
    return model  # Not sure this is necessary
```

Maybe somebody knows or can look it up—isn't this more like what the QB-internal Brix post @antonymilne was referring to looked like, too? I can't remember. Under the hood, here's what the AST of the pipeline function looks like:

**Example AST parsing**

```python
>>> import ast
>>> tree = ast.parse("""\
... def training_pipeline(penguins):
...     features, labels = clean_data(penguins)
...     encoder, X_train, X_test, y_train, y_test = preprocess_data(features, labels)
...     model = train_model(X_train, y_train, X_test, y_test)
...     return model  # Not sure this is necessary
... """)
>>> print(ast.dump(tree, indent=4))
Module(
    body=[
        FunctionDef(
            name='training_pipeline',
            args=arguments(
                args=[
                    arg(arg='penguins')]),
            body=[
                Assign(
                    targets=[
                        Tuple(
                            elts=[
                                Name(id='features', ctx=Store()),
                                Name(id='labels', ctx=Store())],
                            ctx=Store())],
                    value=Call(
                        func=Name(id='clean_data', ctx=Load()),
                        args=[
                            Name(id='penguins', ctx=Load())])),
                Assign(
                    targets=[
                        Tuple(
                            elts=[
                                Name(id='encoder', ctx=Store()),
                                Name(id='X_train', ctx=Store()),
                                Name(id='X_test', ctx=Store()),
                                Name(id='y_train', ctx=Store()),
                                Name(id='y_test', ctx=Store())],
                            ctx=Store())],
                    value=Call(
                        func=Name(id='preprocess_data', ctx=Load()),
                        args=[
                            Name(id='features', ctx=Load()),
                            Name(id='labels', ctx=Load())])),
                Assign(
                    targets=[
                        Name(id='model', ctx=Store())],
                    value=Call(
                        func=Name(id='train_model', ctx=Load()),
                        args=[
                            Name(id='X_train', ctx=Load()),
                            Name(id='y_train', ctx=Load()),
                            Name(id='X_test', ctx=Load()),
                            Name(id='y_test', ctx=Load())])),
                Return(
                    value=Name(id='model', ctx=Load()))])])
```

Because of the late binding, we also enable easy reuse. I'm happy to flesh this second approach out into a full PoC, but would love to get some initial reactions first!

Update: Here's a bit hacky (but functional) PoC with room for improvement:

**End-to-end with AST parsing**

```python
import ast
import inspect
from functools import partial

import joblib
import pandas as pd
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, Node
from kedro.runner import SequentialRunner
from kedro_datasets.pandas import CSVDataset
from kedro_datasets.pickle import PickleDataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

_SENTINEL = object()

# FIXME(deepyaman): Populating a node registry is ugly and likely unnecessary.
# A better approach would be to rewrite the AST of the pipeline function to produce a
# `Pipeline` object directly.
_NODE_REGISTRY = {}


def node(
    func=None,
    inputs=_SENTINEL,
    outputs=_SENTINEL,
    *,
    name=None,
    tags=None,
    confirms=None,
    namespace=None,
) -> Node:
    if inputs is _SENTINEL and outputs is _SENTINEL:
        unbound_node = partial(node, func=func)
        _NODE_REGISTRY[func.__name__] = unbound_node
        return unbound_node
    else:
        assert func is not None, "Function must be provided when using the function syntax."
        assert inputs is not _SENTINEL, "Inputs must be provided when using the function syntax."
        assert outputs is not _SENTINEL, "Outputs must be provided when using the function syntax."
        return Node(
            func,
            inputs,
            outputs,
            name=name,
            tags=tags,
            confirms=confirms,
            namespace=namespace,
        )


class NodeVisitor(ast.NodeVisitor):
    def __init__(self):
        self.nodes = []

    def visit_Assign(self, node):
        unbound_node = _NODE_REGISTRY[node.value.func.id]
        inputs = [arg.id for arg in node.value.args]
        outputs = (
            [elt.id for elt in node.targets[0].elts]
            if hasattr(node.targets[0], "elts")
            else node.targets[0].id
        )
        self.nodes.append(unbound_node(inputs=inputs, outputs=outputs))
        self.generic_visit(node)


def pipeline(pipe):
    source = inspect.getsource(pipe)
    tree = ast.parse(source)
    visitor = NodeVisitor()
    visitor.visit(tree)
    return Pipeline(visitor.nodes)


catalog = DataCatalog(
    {
        "penguins": CSVDataset(
            filepath="https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/penguins.csv"
        ),
        "encoder": PickleDataset(
            filepath="penguins_label_encoder.joblib",
            backend="joblib",
        ),
        "model": PickleDataset(
            filepath="penguins_model.joblib",
            backend="joblib",
        ),
    }
)


@node
def clean_data(penguins):
    # drop rows with missing values
    # return numpy arrays for features and labels
    df = penguins.dropna(subset=['species', 'bill_length_mm', 'bill_depth_mm',
                                 'flipper_length_mm', 'body_mass_g'])
    features = df[['bill_length_mm', 'bill_depth_mm',
                   'flipper_length_mm', 'body_mass_g']].to_numpy()
    labels = df['species'].to_numpy()
    return features, labels


@node
def preprocess_data(features, labels):
    # encode categorical variables
    # scale numerical features
    # split the data into train/test features and labels
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features)
    encoder = LabelEncoder()
    labels_encoded = encoder.fit_transform(labels)
    X_train, X_test, y_train, y_test = train_test_split(features_scaled, labels_encoded, test_size=0.2, random_state=42)
    return encoder, X_train, X_test, y_train, y_test


@node
def train_model(X_train, y_train, X_test, y_test):
    # train a model and save it
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    # print the model's accuracy
    accuracy = clf.score(X_test, y_test)
    print(f'Model accuracy: {accuracy:.2f}')
    return clf


@node
def predict_new_data(X_new, clf, encoder):
    # load the model and make predictions
    # return a string of the predicted class
    prediction = clf.predict(X_new)
    # decode the prediction
    predicted_class = encoder.inverse_transform(prediction)
    return predicted_class[0]


@pipeline
def training_pipeline():
    features, labels = clean_data(penguins)
    encoder, X_train, X_test, y_train, y_test = preprocess_data(features, labels)
    model = train_model(X_train, y_train, X_test, y_test)


if __name__ == "__main__":
    runner = SequentialRunner()
    output_datasets = runner.run(pipeline=training_pipeline, catalog=catalog)
```
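The core of the AST trick above can be exercised in isolation: walk the `Assign` statements of a pipeline-style function and recover (function name, inputs, outputs) triples. A dependency-free sketch (plain tuples instead of Kedro `Node` objects; `extract_nodes` is a hypothetical helper name):

```python
import ast
import textwrap

def extract_nodes(source):
    """Recover (function name, inputs, outputs) triples from assignments."""
    tree = ast.parse(textwrap.dedent(source))
    nodes = []
    for stmt in ast.walk(tree):
        if isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call):
            inputs = [a.id for a in stmt.value.args]
            target = stmt.targets[0]
            outputs = (
                [elt.id for elt in target.elts]
                if isinstance(target, ast.Tuple)
                else target.id
            )
            nodes.append((stmt.value.func.id, inputs, outputs))
    return nodes

src = """
def training_pipeline():
    features, labels = clean_data(penguins)
    model = train_model(features, labels)
"""
print(extract_nodes(src))
# [('clean_data', ['penguins'], ['features', 'labels']), ('train_model', ['features', 'labels'], 'model')]
```

Since the function body is only parsed, never executed, the free variables (`penguins` etc.) don't need to exist at definition time, which is what enables the late binding mentioned above.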
-
Overall I think I like the 2nd approach better; the first one is mostly syntactic sugar and I don't think it provides much value other than saving a few keystrokes. The 2nd is better because the pipeline structure is visible at a glance:

```python
@pipeline
def training_pipeline():
    features, labels = clean_data(penguins)
    encoder, X_train, X_test, y_train, y_test = preprocess_data(features, labels)
    model = train_model(X_train, y_train, X_test, y_test)
```

What happens if you now have a namespaced dataset like `namespace.penguins`?
-
I haven't really thought about it, but any reason it wouldn't be workable?
-
I was thinking it's the opposite: you cannot have `namespace.penguin` in the pipeline code, because it will be interpreted as a missing attribute.
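Concretely, the AST-based extraction in the PoC reads `arg.id`, which only exists for bare names; a dotted dataset reference parses as an `Attribute` node instead, so the visitor would break on it (a small standalone check):

```python
import ast

# A dotted dataset reference inside a pipeline body parses as an
# Attribute node, not a Name node, so it has no `.id` field.
tree = ast.parse("features = clean_data(namespace.penguins)")
call = tree.body[0].value
arg = call.args[0]
print(type(arg).__name__)   # Attribute
print(hasattr(arg, "id"))   # False - so `arg.id` would raise AttributeError
```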
-
Sorry, I wasn't clear; that's what I meant by "Maybe you would just need to define your namespace; could be either as a variable or using a context manager or something":

```python
namespace = Namespace(name="namespace")
namespace.penguins
```

Alternatively, if you just want to make it a valid name, you can require something like `namespace__penguins`. For now, I haven't thought too much about this side of the spectrum, because my focus has been much more on the Jupyter notebook-using data scientist persona getting started, rather than the person getting into more complex use cases (e.g. I haven't even bothered thinking about dynamic pipelines). I think a lot of validation will need to be done regardless. Just to be clear, I'm also not proposing getting rid of the other syntax or anything; I think this "simplified" syntax should at least ease the journey and should be able to go quite far, but the more explicit syntax would still be there. I'll try to better illustrate the adoption journey as I've currently thought about it ahead of tomorrow.
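One hypothetical way the variable-based option could work: a tiny `Namespace` class whose attribute access mints qualified dataset names (illustrative only, not an existing Kedro API):

```python
class Namespace:
    """Hypothetical helper: attribute access yields a qualified dataset name."""

    def __init__(self, name):
        self._name = name

    def __getattr__(self, dataset):
        # Called only for attributes not found normally, i.e. dataset names.
        return f"{self._name}.{dataset}"

namespace = Namespace(name="namespace")
print(namespace.penguins)  # namespace.penguins
```

The AST-based pipeline parser would then see `namespace.penguins` as an attribute access on a known variable, rather than an undefined bare name.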
-
This issue was discussed in Tech Design earlier today. Here is the content of my (Marimo) slides: "Flattening the learning curve".
-
Previous comment related to issue #4892
-
This feels like an idea that would benefit from community feedback. What's the plan to gather that input to see if it actually resolves #4892? I'd also be curious to hear from @Galileo-Galilei on whether this would help with adoption in his organisation.
-
Hi, interesting discussion! I have used this kind of alternative pipeline definition, and everything that optionally reduces boilerplate is good IMO. Apart from the meta-programming fun, I have done it for reasons which are a bit different from the beginners-in-notebooks case.

Currently, I am mostly concerned with (2), but I believe a well-designed decorator feature can be helpful both for the beginners-in-notebooks case, as well as in cases where the use of Kedro can be in conflict with other constraints or developer standards. When I played around with implementing (2), I used very thin wrappers. (Note on (2): handling limited iteration and branching is challenging, but possible.) (Also +1 for parsing a script file to a pipeline.)
-
Introduction
ChatGPT once suggested this syntax to a user
Is this a syntax we'd like to add to Kedro?