
notebookx Project Roadmap

Overview

notebookx is a Rust-based alternative to Python's nbconvert, providing fast, lightweight notebook conversion. It is usable as a Rust library, as a CLI (nbx), and through Python bindings.


Architecture Decisions

Core Data Model

The Notebook struct is the central representation, closely mirroring the Jupyter .ipynb format (nbformat v4):

Notebook
├── cells: Vec<Cell>
├── metadata: NotebookMetadata
├── nbformat: u8 (always 4)
└── nbformat_minor: u8

Cell (enum)
├── Code
│   ├── source: String
│   ├── execution_count: Option<u32>
│   ├── outputs: Vec<Output>
│   └── metadata: CellMetadata
├── Markdown
│   ├── source: String
│   └── metadata: CellMetadata
└── Raw
    ├── source: String
    └── metadata: CellMetadata

Output (enum)
├── ExecuteResult { execution_count, data, metadata }
├── DisplayData { data, metadata }
├── Stream { name: stdout|stderr, text }
└── Error { ename, evalue, traceback }

MimeBundle: HashMap<String, MimeData>
MimeData: String | Vec<String> (for multi-line) | Base64 bytes
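The tree above can be sketched as plain Rust types. This is a trimmed, illustrative version: the real crate would add serde derives for (de)serialization, and `NotebookMetadata`/`CellMetadata` are stand-ins for the full metadata structs.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
pub enum MimeData {
    Text(String),
    Lines(Vec<String>), // Jupyter's multi-line convention
    Binary(Vec<u8>),    // decoded from base64
}

pub type MimeBundle = HashMap<String, MimeData>;

#[derive(Debug, Clone)]
pub enum Output {
    ExecuteResult { execution_count: u32, data: MimeBundle },
    DisplayData { data: MimeBundle },
    Stream { name: String, text: String }, // name: "stdout" | "stderr"
    Error { ename: String, evalue: String, traceback: Vec<String> },
}

#[derive(Debug, Clone, Default)]
pub struct CellMetadata;     // stand-in; real struct holds tags, etc.
#[derive(Debug, Clone, Default)]
pub struct NotebookMetadata; // stand-in

#[derive(Debug, Clone)]
pub enum Cell {
    Code {
        source: String,
        execution_count: Option<u32>,
        outputs: Vec<Output>,
        metadata: CellMetadata,
    },
    Markdown { source: String, metadata: CellMetadata },
    Raw { source: String, metadata: CellMetadata },
}

#[derive(Debug, Clone)]
pub struct Notebook {
    pub cells: Vec<Cell>,
    pub metadata: NotebookMetadata,
    pub nbformat: u8, // always 4
    pub nbformat_minor: u8,
}
```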

Format Abstraction

All format-specific logic lives in separate modules:

  • formats/ipynb.rs - JSON serialization/deserialization
  • formats/percent.rs - Percent format parsing/generation

The core Notebook struct has no knowledge of specific formats.

Conversion Strategy

Input File → Parser (format-specific) → Notebook → Serializer (format-specific) → Output File

Parsing and serialization are symmetric operations. Each format implements:

  • parse(input: &str) -> Result<Notebook, ParseError>
  • serialize(notebook: &Notebook, options: FormatOptions) -> Result<String, SerializeError>
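The two signatures suggest a per-format trait. A sketch of that interface follows, with stand-in types for `Notebook`, `FormatOptions`, and the error types, plus a deliberately trivial implementation to show the shape (none of these names are confirmed API):

```rust
#[derive(Debug, Default, PartialEq)]
pub struct Notebook { pub cells: Vec<String> } // stand-in for the real struct
#[derive(Debug, Default)]
pub struct FormatOptions; // stand-in
#[derive(Debug)]
pub struct ParseError(pub String);
#[derive(Debug)]
pub struct SerializeError(pub String);

/// Interface each format module (ipynb, percent, ...) would implement.
pub trait NotebookFormat {
    fn parse(input: &str) -> Result<Notebook, ParseError>;
    fn serialize(notebook: &Notebook, options: &FormatOptions) -> Result<String, SerializeError>;
}

/// Toy format: one cell per line. Only here to illustrate symmetry.
pub struct LinesFormat;

impl NotebookFormat for LinesFormat {
    fn parse(input: &str) -> Result<Notebook, ParseError> {
        Ok(Notebook { cells: input.lines().map(String::from).collect() })
    }
    fn serialize(nb: &Notebook, _options: &FormatOptions) -> Result<String, SerializeError> {
        Ok(nb.cells.join("\n"))
    }
}
```

Because parsing and serialization are symmetric, a round-trip through any single format should be the identity (modulo normalization), which is exactly what the round-trip tests below check.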

Milestones

Milestone 1: Core Data Model & ipynb Support ✓

Implementation:

  • Define Notebook, Cell, Output, Metadata structs
  • Implement serde serialization for ipynb JSON format
  • Handle all output types: execute_result, display_data, stream, error
  • Support MIME bundles with text, JSON, and base64 binary data

Testing:

  • Unit tests for each struct's serialization/deserialization
  • Test parsing of minimal valid notebook (empty cells array)
  • Test parsing of notebook with all cell types (code, markdown, raw)
  • Test parsing of all output types individually
  • Test MIME bundle handling (text/plain, text/html, image/png base64)
  • Test edge cases: empty cells, cells with only whitespace, unicode content
  • Test error handling for malformed JSON
  • Test error handling for invalid notebook structure
  • Round-trip test: parse ipynb → serialize → parse → compare
  • Integration test with nb_format_examples/World population.ipynb

Milestone 2: Percent Format Support ✓

Implementation:

  • Implement percent format parser
    • YAML header extraction (optional, with defaults)
    • Cell delimiter parsing (# %%, # %% [markdown], # %% [raw])
    • Cell metadata parsing (# %% tags=["hide"])
    • Markdown cell content (comment-prefixed lines)
  • Implement percent format serializer
    • YAML header generation (configurable)
    • Cell delimiter generation
    • Proper comment wrapping for markdown

Testing:

  • Unit tests for YAML header parsing (present, absent, malformed)
  • Unit tests for cell delimiter parsing (all cell types)
  • Unit tests for cell metadata extraction from delimiter line
  • Unit tests for markdown comment prefix stripping/adding
  • Test empty percent file (no cells)
  • Test percent file with only code cells
  • Test percent file with mixed cell types
  • Test edge cases: empty lines between cells, trailing whitespace
  • Test serialization options (header styles: full, minimal, none)
  • Round-trip test: percent → Notebook → percent → compare
  • Cross-format round-trip: ipynb → percent → ipynb (content preservation)
  • Integration test with nb_format_examples/World population.pct.py

Milestone 3: Clean Functionality ✓

Implementation:

  • Define CleanOptions struct with granular controls:
    • remove_outputs: bool
    • remove_execution_counts: bool
    • remove_cell_metadata: bool
    • remove_notebook_metadata: bool
    • remove_kernel_info: bool
    • preserve_cell_ids: bool
    • allowed_metadata_keys: Option<Vec<String>>
  • Implement Notebook::clean(options: CleanOptions) -> Notebook
  • Ensure clean creates a new copy, not mutation
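A sketch of `CleanOptions` with the fields listed above. Deriving `Default` gives "clean nothing" semantics (all flags false, no whitelist); whether `preserve_cell_ids` should instead default to true is an open design choice.

```rust
#[derive(Debug, Clone, Default)]
pub struct CleanOptions {
    pub remove_outputs: bool,
    pub remove_execution_counts: bool,
    pub remove_cell_metadata: bool,
    pub remove_notebook_metadata: bool,
    pub remove_kernel_info: bool,
    pub preserve_cell_ids: bool,
    pub allowed_metadata_keys: Option<Vec<String>>,
}

// The entry point takes &self and returns a fresh Notebook, so the
// original is never mutated:
//   impl Notebook {
//       pub fn clean(&self, options: &CleanOptions) -> Notebook { /* ... */ }
//   }
```

Struct-update syntax keeps call sites terse: `CleanOptions { remove_outputs: true, ..Default::default() }`.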

Testing:

  • Test each clean option individually (outputs, exec counts, cell meta, etc.)
  • Test combinations of clean options
  • Test that original notebook is unchanged after clean (immutability)
  • Test clean with empty options (should return equivalent notebook)
  • Test allowed_metadata_keys whitelist behavior
  • Test idempotency: clean(clean(nb, opts), opts) == clean(nb, opts)
  • Test clean on notebook with no outputs (no-op for remove_outputs)
  • Test clean preserves cell content integrity
  • Integration test: clean real notebook, verify outputs removed

Milestone 4: CLI (nbx) ✓

Implementation:

  • Set up clap-based CLI structure
  • Implement format inference from file extensions
  • Commands:
    • nbx <input> --to <output> (convert with format inference)
    • nbx <input> --from-fmt <fmt> --to <output> --to-fmt <fmt> (explicit formats)
    • nbx clean <input> [--output <output>] [--remove-outputs] [--remove-metadata] ...
  • Stdin/stdout support (nbx - --from-fmt ipynb --to - --to-fmt percent)
  • Error handling with helpful messages
  • Exit codes for scripting

Testing:

  • Test format inference from file extensions (.ipynb, .pct.py)
  • Test explicit format flags override inference
  • Test conversion: ipynb → percent (file to file)
  • Test conversion: percent → ipynb (file to file)
  • Test stdin/stdout conversion
  • Test clean command with each flag
  • Test clean --in-place modifies file correctly
  • Test error exit codes for: missing file, parse error, invalid args
  • Test helpful error messages for common mistakes
  • Test --help output for all commands
  • End-to-end test: convert real notebook via CLI, verify output

Milestone 5: Python Bindings ✓

Project Setup:

  • Create pyproject.toml at repository root (maturin config)
  • Create python/notebookx/ directory structure
  • Create crates/notebookx-py/ PyO3 crate
  • Add notebookx-py to workspace members
  • Set up python/notebookx/__init__.py with re-exports
  • Create python/notebookx/py.typed marker file

Implementation:

  • Implement PyO3 bindings in crates/notebookx-py/src/lib.rs
  • Expose Notebook class:
    • Notebook.from_file(path, format=None)
    • Notebook.from_string(content, format)
    • Notebook.to_file(path, format=None)
    • Notebook.to_string(format)
    • Notebook.clean(options=None)
  • Expose CleanOptions as Python dataclass/dict
  • Expose format enum: Format.IPYNB, Format.PERCENT
  • Convenience functions:
    • convert(input_path, output_path, from_fmt=None, to_fmt=None)
    • clean_notebook(path, output_path=None, **options)
  • Python type stubs (python/notebookx/__init__.pyi)
  • PyPI packaging via maturin

Testing (in tests/python/):

  • Set up pytest configuration
  • Test Notebook.from_file() with valid ipynb
  • Test Notebook.from_file() with valid percent file
  • Test Notebook.from_string() with both formats
  • Test Notebook.to_file() writes correct content
  • Test Notebook.to_string() returns correct string
  • Test format inference in Python API
  • Test Notebook.clean() with various options
  • Test CleanOptions construction from kwargs
  • Test error handling: FileNotFoundError, ValueError for parse errors
  • Test convert() convenience function
  • Test clean_notebook() convenience function
  • Test type stubs are correct (mypy/pyright check)
  • Integration test: round-trip through Python API

Milestone 6: CI/CD & Wheel Building ✓

GitHub Actions CI:

  • Create .github/workflows/ci.yml for continuous integration
    • Run Rust tests on push/PR
    • Run Python tests on push/PR
    • Test on ubuntu-latest, macos-latest, windows-latest

Release Workflow & Wheel Building:

  • Create .github/workflows/release.yml triggered on git tags
  • Build wheels for all major platforms using maturin:
    • Linux x86_64 (manylinux)
    • Linux ARM64 (manylinux)
    • macOS x86_64 (Intel)
    • macOS ARM64 (Apple Silicon)
    • Windows x86_64
  • Build source distribution (sdist)
  • Publish to PyPI on release
  • Publish to crates.io on release

Notes:

  • Using abi3-py38 means one wheel per platform works for Python 3.8+
  • Use maturin build --release for optimized builds
  • Use maturin publish for PyPI upload (requires PYPI_API_TOKEN secret)

Milestone 7: Documentation ✓

Documentation:

  • README with installation and usage examples
  • API documentation (rustdoc)
  • Python docstrings and API docs
  • Examples directory with common use cases

Format Parsing Strategy

ipynb Format

Parsing:

  1. Deserialize JSON using serde_json
  2. Map to internal structures with validation
  3. Handle both string and array source formats (normalize to String internally)
  4. Preserve unknown metadata fields as serde_json::Value
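Step 3 deserves a note: in the real crate the string/array duality would typically be a `#[serde(untagged)]` enum, but the normalization itself is simple. Jupyter's multi-line form stores each line *with* its trailing newline, so joining is plain concatenation (sketch; `RawSource` is an illustrative name):

```rust
/// The two shapes `source` can take in an .ipynb file.
pub enum RawSource {
    Single(String),
    Lines(Vec<String>),
}

impl RawSource {
    /// Normalize either shape to the internal String representation.
    pub fn normalize(self) -> String {
        match self {
            RawSource::Single(s) => s,
            // Each line already carries its '\n', so concat reconstructs
            // the original text exactly.
            RawSource::Lines(lines) => lines.concat(),
        }
    }
}
```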

Serialization:

  1. Serialize cells with source as array of lines (Jupyter convention)
  2. Pretty-print JSON with 1-space indentation (matching Jupyter default)
  3. Ensure trailing newline
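Step 1 is the inverse of the parsing normalization: split the internal `String` back into Jupyter-style lines, each keeping its trailing newline (only a final unterminated fragment lacks one). A self-contained sketch:

```rust
/// Split source text into Jupyter-convention lines: every line keeps its
/// trailing '\n'; a final fragment without a newline is kept as-is.
fn source_to_lines(src: &str) -> Vec<String> {
    let mut lines = Vec::new();
    let mut start = 0;
    for (i, b) in src.bytes().enumerate() {
        if b == b'\n' {
            // '\n' is ASCII, so i + 1 is always a valid char boundary.
            lines.push(src[start..=i].to_string());
            start = i + 1;
        }
    }
    if start < src.len() {
        lines.push(src[start..].to_string());
    }
    lines
}
```

Concatenating the result reproduces the input exactly, which is what makes the ipynb round-trip tests lossless.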

Edge Cases:

  • Empty cells
  • Cells with only whitespace
  • Binary outputs (base64 encoded)
  • Very large outputs
  • Malformed JSON (helpful error messages)

Percent Format

Parsing:

  1. Check for optional YAML header (# --- ... # ---)
  2. Split on cell delimiters (# %%)
  3. For each cell:
    • Parse cell type from delimiter ([markdown], [raw], or code)
    • Extract optional metadata from delimiter line
    • For markdown/raw: strip # prefix from each line
    • For code: keep as-is
  4. Infer kernel from YAML header or default to Python 3

Serialization:

  1. Generate YAML header (configurable):
    • Full header with all metadata
    • Minimal header (kernelspec only)
    • No header
  2. For each cell:
    • Write delimiter with type marker if needed
    • For markdown: prefix each line with #
    • For code: write source directly
  3. Ensure single trailing newline
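Putting these rules together, a minimal percent file might look like the following (the kernelspec values and the tag are illustrative; note that the whole file remains valid Python):

```python
# ---
# jupyter:
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
# ---

# %% [markdown]
# # Example analysis
# Prose lines in a markdown cell are comment-prefixed.

# %% tags=["parameters"]
import math

# %%
print(math.pi)
```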

Configuration Options:

  • include_yaml_header: bool
  • yaml_header_style: Full | Minimal | None
  • preserve_outputs_as_comments: bool (future)

CLI Design

Command Structure

nbx [OPTIONS] <INPUT> --to <OUTPUT>
nbx [OPTIONS] <INPUT> --from-fmt <FORMAT> --to <OUTPUT> --to-fmt <FORMAT>
nbx clean [OPTIONS] <INPUT> [--output <OUTPUT>]

Format Inference

Extension          Format
.ipynb             ipynb
.pct.py            percent
.py (with # %%)    percent (content detection)
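The inference rules above can be sketched as a small function. Check order matters, since `.pct.py` also ends in `.py`; the function name and `Format` enum are illustrative:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Format {
    Ipynb,
    Percent,
}

/// Infer the notebook format from the file name, falling back to content
/// sniffing (a "# %%" cell delimiter) for plain .py files.
fn infer_format(path: &str, content: &str) -> Option<Format> {
    if path.ends_with(".ipynb") {
        Some(Format::Ipynb)
    } else if path.ends_with(".pct.py") {
        Some(Format::Percent)
    } else if path.ends_with(".py")
        && content.lines().any(|l| l.trim_start().starts_with("# %%"))
    {
        Some(Format::Percent)
    } else {
        None // caller requires --from-fmt / --to-fmt
    }
}
```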

Options

Conversion:

  • --from-fmt <FORMAT> - Explicit input format
  • --to-fmt <FORMAT> - Explicit output format
  • --strip-outputs - Remove outputs during conversion
  • --strip-metadata - Remove metadata during conversion

Cleaning:

  • --remove-outputs / -o
  • --remove-execution-counts / -e
  • --remove-cell-metadata
  • --remove-notebook-metadata
  • --remove-kernel-info
  • --keep-only <keys> - Whitelist specific metadata keys
  • --in-place / -i - Modify file in place

Exit Codes

  • 0 - Success
  • 1 - Parse error
  • 2 - Serialization error
  • 3 - I/O error
  • 4 - Invalid arguments
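One way to keep these codes consistent is a single error enum at the CLI boundary that maps onto them (a sketch; the enum and function names are assumptions, not the real API):

```rust
#[derive(Debug)]
pub enum CliError {
    Parse(String),
    Serialize(String),
    Io(String),
    InvalidArgs(String),
}

/// Map a top-level error to the documented process exit code.
pub fn exit_code(err: &CliError) -> i32 {
    match err {
        CliError::Parse(_) => 1,
        CliError::Serialize(_) => 2,
        CliError::Io(_) => 3,
        CliError::InvalidArgs(_) => 4,
    }
}
```

In `main`, the error is printed to stderr and the process exits with `std::process::exit(exit_code(&err))`, which is what the scripting-oriented tests in Milestone 4 rely on.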

Python Bindings Plan

Repository Structure for Python

notebookx/                      # Repository root
├── Cargo.toml                  # Workspace configuration
├── pyproject.toml              # Python package config (maturin)
├── python/
│   └── notebookx/
│       ├── __init__.py         # Re-exports from Rust extension
│       ├── __init__.pyi        # Type stubs
│       └── py.typed            # PEP 561 marker
├── crates/
│   ├── notebookx/              # Core Rust library + CLI (with "cli" feature)
│   └── notebookx-py/           # PyO3 bindings crate
│       ├── Cargo.toml
│       └── src/
│           └── lib.rs          # PyO3 module definition
└── tests/
    └── python/                 # Python test suite (pytest)
        └── test_notebookx.py

Maturin Configuration

The pyproject.toml at root configures maturin to:

  • Build the notebookx-py crate as a native extension
  • Include the python/notebookx/ package
  • Generate wheels for multiple platforms
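A minimal pyproject.toml for this layout might look like the following (the `module-name`, version bounds, and exact maturin keys are assumptions to be checked against the maturin docs):

```toml
[build-system]
requires = ["maturin>=1.0,<2.0"]
build-backend = "maturin"

[project]
name = "notebookx"
requires-python = ">=3.8"

[tool.maturin]
# Pure-Python package lives alongside the native extension.
python-source = "python"
# Build the PyO3 crate, not the workspace root.
manifest-path = "crates/notebookx-py/Cargo.toml"
# Install the extension as a private submodule re-exported by __init__.py.
module-name = "notebookx._notebookx"
```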

API Design

from notebookx import Notebook, Format, CleanOptions

# Load from file (format inferred)
nb = Notebook.from_file("example.ipynb")

# Load from string (format required)
nb = Notebook.from_string(content, Format.IPYNB)

# Convert to different format
percent_str = nb.to_string(Format.PERCENT)
nb.to_file("example.pct.py")

# Clean notebook
options = CleanOptions(
    remove_outputs=True,
    remove_execution_counts=True,
)
clean_nb = nb.clean(options)

# Convenience functions
from notebookx import convert, clean_notebook

convert("input.ipynb", "output.pct.py")
clean_notebook("notebook.ipynb", output_path="clean.ipynb", remove_outputs=True)

PyO3 Implementation Notes

  • Use #[pyclass] for Notebook, CleanOptions, Format
  • Use #[pymethods] for instance methods
  • Use #[pyfunction] for module-level convenience functions
  • Return Python exceptions for Rust errors
  • Support both Path objects and strings for file paths

Testing Strategy

Unit Tests

  • Each struct/enum has serialization/deserialization tests
  • Each format parser has dedicated test module
  • Edge case coverage for malformed inputs
  • Clean options tested individually and in combination

Integration Tests

  • Real notebook files from nb_format_examples/
  • Round-trip conversion tests (A → B → A)
  • Cross-format conversion tests (ipynb → percent → ipynb)
  • CLI end-to-end tests

Property-Based Tests

Using proptest or quickcheck:

  • Arbitrary valid notebooks round-trip correctly
  • Clean is idempotent: clean(clean(nb)) == clean(nb)
  • Format conversions preserve cell content integrity

Benchmarks

Using criterion:

  • Parse time for various notebook sizes
  • Serialize time for various notebook sizes
  • Compare with Python nbconvert/jupytext (external benchmark)

Incremental Priorities

Phase 1: MVP (ipynb ↔ pct.py)

  1. Core data model with unit tests
  2. ipynb parsing/serialization with round-trip tests
  3. Percent format parsing/serialization with round-trip tests
  4. Cross-format integration tests
  5. Basic CLI with format conversion and CLI tests
  6. Basic cleaning (outputs, metadata) with cleaning tests

Phase 2: Production Ready

  1. Comprehensive error handling with error case tests
  2. Full CLI feature set with end-to-end tests
  3. Python bindings with Python test suite
  4. Documentation
  5. Benchmarks comparing with nbconvert/jupytext
  6. PyPI/crates.io publishing

Phase 3: Extended Formats (Future)

  1. Light format (.lgt.py)
  2. MyST Markdown (.myst.md)
  3. Quarto (.qmd)
  4. R Markdown (.Rmd)

Open Questions / Future Considerations

  1. Cell IDs: nbformat 4.5+ supports cell IDs. Should we generate them if missing?
  2. Validation: Should we validate notebook structure strictly or be lenient?
  3. Streaming: For very large notebooks, should we support streaming parse/serialize?
  4. Diff-friendly output: Option to sort metadata keys for deterministic output?
  5. Widget state: How to handle Jupyter widget state in metadata?