[KEP 5] Kedro Inspection API #5405
Replies: 8 comments 5 replies
-
+1
-
Love the idea of having a core interface defined formally. I am less sure about the name "inspection" and about the Pydantic models (and using Pydantic for the interface). I have not spent much time thinking about it.
-
+1, based on Nok's comment... maybe we call it
-
On the same page as @noklam.
The fact that all of the individual plugins (both core and community-maintained) redefine the same things, and that these depend on internal attributes and need to be maintained across Kedro version changes, is a huge selling point; I have definitely been burned trying to maintain compatibility across versions in the past. It is slightly less clear to me how much this actually helps standardize. E.g. Kedro-Airflow constructs the dependency graph, but #5266 seems to indicate this should be part of a standalone package. I assume Kedro-Viz benefits most, and I am very unfamiliar with
Not sure what Nok had in mind, but at least I intuitively preferred
Biggest concern is Pydantic as a core dependency. Is it absolutely necessary?
@ravi-kumar-pilla Given you've specified that snapshots are read-only, and so you can't reconstruct Kedro objects (and therefore don't need to worry about users creating their own snapshots somehow and building Kedro objects from them), I would think Pydantic/snapshot validation isn't that necessary.
-
Hi @deepyaman,
Yes, v1 will not immediately eliminate everything in the individual plugins. Thin adapters are still needed on the plugin side to adapt the snapshots to each plugin's needs. These documents give more detail on how much it helps plugins: https://github.com/kedro-org/kedro/blob/spike/inspect-api/docs/inspect-docs/kedro-inspect-consumer-summary.md, https://github.com/kedro-org/kedro/blob/spike/inspect-api/docs/inspect-docs/kedro-inspect-consumer-analysis.md
Yes, @rashidakanchwala and I discussed this. I agree about the future capabilities we might support with this layer. Validation might be included in the API contract of the inspection layer, but the core focus is on inspection (snapshots, read-only for now). I would actually prefer "inspection", but I can ask the TSC on Slack about naming the group. Some options to consider -
Pydantic will be optional, present within the
You're right that validation isn't the reason. The reason is serialization and schema generation. Pydantic is being used here as a serialization layer, not a validation layer. The value is:
If the team pushes back, we can do
I will post this on Slack as well so everyone can vote on these 2 topics and we can take this KEP to implementation. Thank you.
-
+1. Re: the Pydantic concern, I suggest we open a separate KEP (or simply a discussion) about whether we use Pydantic or dataclasses as a standard throughout the framework.
-
The core purpose of this module needs to be clearly defined; right now it's mixing concerns. Is it a ser/deser layer, or a formal interface to Kedro internals? These are fundamentally different goals and shouldn't be conflated.
If the intent is to provide an interface for external libraries, the ser/deser layer is unnecessary overhead. When the interface is the JSON/OpenAPI spec, a separate package isn't really a problem: since the interface is the spec, downstream consumers should not require any dependencies. A concrete example would make this much clearer. The key question is: what is the actual contract these snapshots define? Take

```python
from pathlib import Path
from typing import NamedTuple


class ProjectMetadata(NamedTuple):
    """Structure holding project metadata derived from `pyproject.toml`"""

    config_file: Path
    package_name: str
    project_name: str
    project_path: Path
    source_dir: Path
    kedro_init_version: str
    tools: list
    example_pipeline: str
```
-
Since there are no downvotes, the Inspection API will be implemented as a follow-up. Regarding the inspection models, we will go with dataclasses rather than Pydantic, to avoid making Pydantic a Kedro core dependency. Thank you everyone for your time.
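For illustration, a dataclass-based model can cover the serialisation use case discussed above without Pydantic, roughly as sketched here. This is only a sketch under assumptions: the `ProjectMetadata` class is a local stand-in copied from the comment above so the example runs without Kedro, and the field selection of `ProjectMetadataSnapshot`, the body of `build_metadata_snapshot`, and all sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import NamedTuple


class ProjectMetadata(NamedTuple):
    """Local stand-in for Kedro's ProjectMetadata (assumption: same fields)."""

    config_file: Path
    package_name: str
    project_name: str
    project_path: Path
    source_dir: Path
    kedro_init_version: str
    tools: list
    example_pipeline: str


@dataclass(frozen=True)
class ProjectMetadataSnapshot:
    """Hypothetical read-only snapshot; which fields to keep is an assumption."""

    package_name: str
    project_name: str
    project_path: str  # paths stored as strings so the snapshot is JSON-friendly
    source_dir: str
    kedro_init_version: str


def build_metadata_snapshot(metadata: ProjectMetadata) -> ProjectMetadataSnapshot:
    """Copy plain values out of the namedtuple; no Kedro objects are retained."""
    return ProjectMetadataSnapshot(
        package_name=metadata.package_name,
        project_name=metadata.project_name,
        project_path=metadata.project_path.as_posix(),
        source_dir=metadata.source_dir.as_posix(),
        kedro_init_version=metadata.kedro_init_version,
    )


# Made-up sample values, for demonstration only.
metadata = ProjectMetadata(
    config_file=Path("demo/pyproject.toml"),
    package_name="demo_pkg",
    project_name="Demo",
    project_path=Path("demo"),
    source_dir=Path("demo/src"),
    kedro_init_version="1.0.0",
    tools=[],
    example_pipeline="False",
)
snapshot = build_metadata_snapshot(metadata)
print(json.dumps(asdict(snapshot), indent=2))
```

`dataclasses.asdict` plus `json.dumps` gives serialisation from the standard library alone; schema generation, which Pydantic provides out of the box, would need separate handling in this approach.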
-
Related issue: #5266
Author: @ravi-kumar-pilla
Shepherds: @rashidakanchwala, @jitu5
Context
Kedro's ecosystem has grown to include several plugins that need to understand a project's structure, its pipelines, nodes, datasets, and parameters, without executing any pipeline runs. Today, every plugin (KedroViz, vscode-kedro, kedro-airflow) independently reimplements this logic. Each one loads the project differently, makes different assumptions about the catalog, and has no guaranteed contract with Kedro's internals.
This has led to:
A standardised inspection API in Kedro core establishes a single contract for this purpose.
What are we trying to do?
Provide a first-class Python API in `kedro.inspection` that returns a structured, serialisable snapshot of a Kedro project: its metadata, registered pipelines, nodes with their inputs/outputs, catalog dataset types, and parameter keys, using only public Kedro APIs and without running any pipeline.
The API should be usable by any tool that needs to understand a Kedro project programmatically.
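As a rough illustration of what "structured, serialisable snapshot" could mean in practice, the nesting might be sketched with standard-library dataclasses as below. The names `ProjectSnapshot`, `PipelineSnapshot`, and `DatasetSnapshot` come from this proposal, but `NodeSnapshot`, every field name, and all sample values are assumptions made for the example, not the proposed schema.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class NodeSnapshot:  # hypothetical model name and fields
    name: str
    inputs: tuple[str, ...] = ()
    outputs: tuple[str, ...] = ()


@dataclass(frozen=True)
class PipelineSnapshot:  # fields are assumptions
    name: str
    nodes: tuple[NodeSnapshot, ...] = ()


@dataclass(frozen=True)
class DatasetSnapshot:  # fields are assumptions
    name: str
    type: str  # dataset type as a string; no dataset class is instantiated


@dataclass(frozen=True)
class ProjectSnapshot:  # fields are assumptions
    pipelines: tuple[PipelineSnapshot, ...] = ()
    datasets: dict[str, DatasetSnapshot] = field(default_factory=dict)
    parameter_keys: tuple[str, ...] = ()


# Made-up project content, for demonstration only.
snapshot = ProjectSnapshot(
    pipelines=(
        PipelineSnapshot(
            name="__default__",
            nodes=(
                NodeSnapshot(
                    name="clean",
                    inputs=("raw_data", "params:threshold"),
                    outputs=("clean_data",),
                ),
            ),
        ),
    ),
    datasets={"raw_data": DatasetSnapshot(name="raw_data", type="pandas.CSVDataset")},
    parameter_keys=("threshold",),
)

# Plain data all the way down, so any consumer can work from the JSON alone.
payload = json.dumps(asdict(snapshot), sort_keys=True)
print(payload)
```

Because the models hold only strings, tuples, and dicts, a consuming tool can read the JSON without importing Kedro, and `frozen=True` keeps the in-memory objects read-only.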
What this is NOT
`get_snapshot()` never loads datasets, runs nodes, or modifies project state.
`kedro run`. Solving dependency-free inspection is a significant architectural problem that affects `Node`, `Pipeline`, `_ProjectPipelines`, and `find_pipelines()`. This is tracked as a separate spike.
`kedro inspect`) and REST endpoints are out of scope for the initial release. The v1 target integration point is the programmatic Python API only.
Proposed Approach
Prerequisites
All project dependencies must be installed. `get_snapshot()` calls `bootstrap_project()`, which triggers `register_pipelines()`, which imports all user pipeline modules. This is the same requirement as `kedro run`.
No KedroSession, no KedroContext
The inspection API does not create a `KedroSession` or `KedroContext`. These were designed for pipeline execution, not inspection, and bring overhead (hook registration, parameter materialisation, dataset instantiation) that is unnecessary for read-only structural queries. The API calls `bootstrap_project()` directly, then accesses `_ProjectPipelines` and `OmegaConfigLoader` independently.
Pydantic Models
Public API Surface
`get_snapshot(project_path, env) → ProjectSnapshot`
Primary entry point. Builds and returns a full `ProjectSnapshot`.
`build_metadata_snapshot(metadata) → ProjectMetadataSnapshot`
Converts the `ProjectMetadata` namedtuple returned by `bootstrap_project()`.
`build_catalog_snapshot(project_path, env) → tuple[dict[str, DatasetSnapshot], list[str]]`
Reads `catalog.yml` and `parameters.yml` via `OmegaConfigLoader` + `CatalogConfigResolver`. No dataset classes are instantiated.
`build_pipeline_snapshots() → list[PipelineSnapshot]`
Triggers `register_pipelines()` via `_ProjectPipelines`. Requires `configure_project()` to have been called (via `bootstrap_project()`).
`build_project_snapshot(project_path, env) → ProjectSnapshot`
Full orchestrator. Calls `bootstrap_project()`, then assembles all snapshot components.
Potential Drawbacks
Timeline
For implementation: ~1-1.5 weeks
Completion with reviews: ~2-2.5 weeks
Appendix A: Module Structure
Dependency graph (no cycles):
Appendix B: Rejected Designs
B1. Inspection via KedroSession/KedroContext
Loading a full `KedroSession` for inspection runs hook registration, instantiates datasets, and materialises parameters, all unnecessary for a read-only structural snapshot. Rejected in favour of direct `bootstrap_project()` + individual loader access.
B2. Separate `kedro-inspect` package
Keeps core lean but fragments the API across packages. Every plugin would need an additional dependency. Since the inspection models and builders are tightly coupled to Kedro internals, maintaining them in core is more reliable and gives better guarantees across Kedro releases. Rejected.
B3. Dependency-free inspection via AST scanning and module mocking
Prototyped in this spike: AST-scan pipeline files for imports, mock missing modules with `MagicMock`, then load `pipeline_registry.py` safely. Technically viable, but more importantly the root problem is architectural: why does any Kedro command other than `kedro run` need all pipeline dependencies installed? This is tracked as a separate spike (see dependency spike ticket) and is explicitly out of scope for v1.
Are you fine with the feature implementation and overall architecture?
Please review the proposed approach and flag any concerns.
Please vote +1/−1 in comments
Thank you