Release v0.2.1 · pulp-platform/Deeploy

This release includes improvements to the tiling and DMA code generation, new networks and operators, improved CI workflows, migration to PyTest, and support for PyPI package releases.

Note: Since the release tag references the Docker container tagged with the release tag (ghcr.io/pulp-platform/deeploy:v0.2.1), the CI will initially fail. The Deeploy Docker image must be built after the release PR is merged, and the CI must then be restarted.

List of Pull Requests

PyPi Package Deployment + Remove Banshee Dept #154
PyTest Migration #144
Update submodule pulp-nn-mixed #145
Improve Profiling #138
FP32 ReduceMean operator improvement #137
Support for RMSNorm (Pow and Sqrt operators) #136
Demo TinyViT compatibility with tiled Siracusa #124
TinyViT on non-tiled Siracusa #117
Support Fully Asynchronous DMAs #114
Disallow shape inference #128
Remove memory-aware node bindings #123
Fix missing const's layout transformation and refactor NCHWtoNHWC passes #122
Fix aliasing #125
Support for 1D Autoencoder #98
Refactor Logging for Improved Debugging #115
Add reuse-tool as an SPDX license header linter #113
Bug fixes, API Cleanup and Reduce Compiler Warning on PULP #112
Fix PULP GEMM batch serialization #109
Split CI Workflows by Platform and Task, Improve Formatting and Linting Reliability #108
Refactor tiling code generation #105
Change order of typeMatching entries #68
Node Mangling to avoid duplication #93
Prepare Post v0.2.0 Release #104
Use Docker digests instead of arch-specific tags #106
Fix Unsqueeze Op. when using ONNX opset 13 or higher (from attribute to input) #119
Fix bias hoisting in generic GEMM with no bias #126

Added

The publish.yml action to build a branch and push it to PyPI. The action is triggered automatically when a tag matching the "v*" format is created.
I created a release of Banshee so we don't need to rebuild it over and over. The Makefile now pulls that release depending on the platform.
I bumped the onnx-graphsurgeon version so that we no longer need to use NVIDIA's PyPI index.
_export_graph assigns their export type to the tensors before export.
pytest and pytest-xdist as dependencies of Deeploy.
A pytest.ini for the global configuration of PyTest for the project.
conftest.py to define CLI args for PyTest for the whole project; it also defines a set of global fixtures and markers.
pytestRunner.py contains helper functions and fixtures for the whole project.
test_platforms.py lists the E2E tests and sorts them into marked categories (per platform and per kernel/model).
Each platform has a test config file where a list or a dict describes the tests.
Support for unknown number of data dimensions in the tiler
Parallelization support for the FP32 ReduceMean operator on PULPOpen
Extensive testing for the ReduceMean operator
Pass to remove ReduceMean operators that don't change data content, but only its shape
Support for RMSNorm operation via operator decomposition.
Added Pow (Power) and Sqrt (Square Root) operation support (Parsers, Layers, Bindings, Templates, and FP32 Kernels) for the Generic platform.
Support for input tiling for PULP FP regular and DW conv 2D.
CI tests for tiled Siracusa FP regular and DW conv 2D, with and without bias, for skip connections, and for the demo version of TinyViT.
Documentation for PULP FP regular and DW conv 2D and MatMul tile constraints.
PULP ReduceMean and Slice tile constraints.
PULP 2D FP DW conv Im2Col template and kernel, with bias support.
Bias support for PULP 2D FP regular conv Im2Col in template & kernel.
PULP FP DW conv 2D parser.
FP conv 2D (simple & DW), reshape & skip connection, and TinyViT demo tests to the non-tiled Siracusa CI pipeline.
FP bindings and mappings for PULP slice, DW conv 2D, and reduce mean operations.
FP PULP DW conv lowering optimization pass similar to the existing one for the integer version.
RemoveEmptyConvBiasPass to the PULP optimizer.
Add manual type inference feature (CLI: --input-type-map/--input-offset-map) to resolve ambiguities when test inputs are not representative enough
Added a testTypeInferenceDifferentTypes test case to validate type inference for different input types
Added _mangleNodeNames function to avoid duplicate node mappings
Output Docker image digests per platform (amd64, arm64) after build, which are used to construct the multi-arch Docker manifest. This prevents registry clutter caused by unnecessary per-architecture Docker tags.
AsyncDma abstraction of DMAs
test runner per DMA and a script that tests all the DMAs
generic Single/DoubleBufferingTilingCodeGeneration classes
TilingVariableReplacementUpdate class that updates the variable replacement refs
TilingHoistingMixIn class that encapsulates all the hoisting helper functions of tiling
sorting of input memory allocations to allow references that live in the same memory level as the memory they are referencing
a function that tests the tiling solution for correctness, which currently only checks buffer allocation for byte alignment
IntrospectiveCodeTransformation: _indexPointer(), indexVars(), dereferenceVars(). The *Vars functions index/dereference a list of variables (useful for tiling)
NetworkContext: unravelReference() that unravels a _ReferenceBuffer until the base buffer
NetworkContext: is_object() - helper function that determines whether the string represents a name of a local or global object
NetworkContext: is_buffer() - helper function that determines whether the string represents a name of a buffer
missing checks for environment variables
_permuteHyperRectangle helper function
Added CI badges to the README
Added YAML linting to CI
Added missing license headers and C header include guards
Extended the pre-commit hooks to remove trailing whitespace, check licenses, format and lint files
Reshape operator support for PULP (ReshapeTemplate in bindings)
Missing class attributes in Closure.py
reuse_skip_wrapper.py to manually skip files
Centralized logging with DEFAULT_LOGGER, replacing print statements
Debug logs for type checking/parsing; __repr__ for core classes
Buffer utilities: checkNumLevels validation and sizeInBytes method
Per–memory-level usage tracking and worst-case reporting in NetworkContext
Memory/I/O summaries and input/output logging in deployers
RequantHelpers.py for Neureka's TileConstraints
Added assertion that all the graph tensors after lowering have a shape annotated
Added testFloatGEMMnobias
Profiling support and optional comments in generated DMA code for better traceability
Added new waiting-strategy logic with fine-grained PerTensorWaitingStrategy
PULPClusterEngine now accepts a n_cores parameter to set the number of cores used
annotateNCores method to PULPDeployer that adds an n_cores key to all PULPClusterEngine templates' operatorRepresentations
Calculate non-kernel overhead and show total time spent during profiling
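
The RMSNorm entry above mentions operator decomposition into Pow and Sqrt. As a rough illustration only, the sketch below uses the standard RMSNorm definition, y = weight * x / sqrt(mean(x^2) + eps), split into Pow, ReduceMean, Sqrt, and elementwise steps. The helper names (pow_op, reduce_mean, sqrt_op, rms_norm) are hypothetical and are not Deeploy APIs; the exact node sequence Deeploy emits is not stated in these notes.

```python
import math


def pow_op(xs, exponent):
    # Elementwise power, standing in for a Pow operator.
    return [x ** exponent for x in xs]


def reduce_mean(xs):
    # Mean over the whole vector, standing in for a ReduceMean operator.
    return sum(xs) / len(xs)


def sqrt_op(x):
    # Scalar square root, standing in for a Sqrt operator.
    return math.sqrt(x)


def rms_norm(xs, weight, eps=1e-6):
    # Assumed decomposition: Pow -> ReduceMean -> Add(eps) -> Sqrt -> Div -> Mul.
    squared = pow_op(xs, 2.0)
    mean_sq = reduce_mean(squared)
    rms = sqrt_op(mean_sq + eps)
    return [w * x / rms for x, w in zip(xs, weight)]


normalized = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
```

With unit weights and eps set to zero, the mean of the squared outputs is 1, which is a quick sanity check for any such decomposition.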

Changed

Rename package name from PULP-Deeploy to deeploy-pulp.
Each CI workflow has been simplified to call the pytest suite with certain markers.
Structure of Tests subdir for improved ordering
Structure of .gitignore file for improved ordering
Decreased L1 maximal memory limit for CI pipeline tests where compatible thanks to the implementation of Conv2D input tiling support.
Reduced size of reshape & skip connection test, for non-tiled Siracusa memory compatibility.
Replaced platform-specific tags (*-amd64, *-arm64) with direct digest references in Noelware/docker-manifest-action.
mchan HAL is now reduced to bare-bones
refactor of the IntrospectiveCodeTransformation to work on the Mako template
refactor of memory allocation code transformation passes
_ReferenceBuffer accepts an optional offset argument to offset the reference
NetworkContext: hoistReference - accepts the actual buffer as reference instead of name, accepts shape, offset, and override_type arguments, and returns the actual buffer, not its name
_mangleNodeRep -> _mangleOpRepr - the canonical name we use is OperatorRepresentation. NodeRep and ParseDict are old iterations of the name.
rename of permutation functions to follow this convention: permute is an action that permutes something, permutation is a function that generates a permutation
_permuteList to just _permute
removed manual buffer name mangling since we do it in the ExecutionBlock generate() function, simplifies templates
we now check that buffer shapes/hyperrectangles/tiling ranks match, which required changing a few serializeTilingSolution functions to preserve the same shape rank
big refactor of the code generation part of the TilingExtension and needed changes to PULPOpen and Snitch due to it
PULPClusterTilingSB and PULPClusterTilingDB now allow for transfers of any rank (dimensionality)
PULP's final output diff is now calculated as absolute error, instead of just subtraction
common code generation code between testMVP/generateNetwork/... was extracted into a single generateTestNetwork function
in some functions, instead of passing the name of a buffer, the actual buffer is just passed
tile function allows overriding the optimizer with external tilingSolution and memoryMap
refactor of the permutation functions for clarity
Split CI into multiple workflow files: one per platform, one for lint & license, one for general Deeploy tests, one for infrastructure, and two for Docker flows, improving maintainability and status reporting
Extended CI to check license in CMake and YAML files
Removed all trailing whitespace
Removed unnecessary includes from the PULP platform header list, such as DeeployBasicMath.h, for cleaner code generation
Changed types and added correct casts to fix many compiler warnings in the PULP target library
Use reuse-tool in pre-commit, CI, and Makefile for SPDX license header linting
Deployer workflow now uses prepare(...) instead of generateFunction(...).
Removed fromVariableBuffer
Refactored hoistConstant
Refactored TransientBuffer's __init__
Refactor of the NCHWtoNHWC passes
Removed NodeMemoryLevelChecker, MemoryAwareNodeBinding
Removed _parseNode from MemoryNetworkDeployer since we don't need the annotations before typeChecking anymore
Removed Wmem variants of bindings and tile constraints from Neureka
Disabled ICCT_ITA_8 MemPool test because it was using a lowering that created shapeless tensors
Added missing shape annotation to the testTypeInferenceDifferentTypes
Refactored DMA code generation (SnitchDma, Mchan) to correctly overlap transfers and compute in double-buffering mode
changed _mapNode to _selectEngine which reduces the responsibility of that function to, as the name states, just engine selection
Print kernel profiling information for all memory levels
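
The change to PULP's output diff (absolute error instead of plain subtraction) can be illustrated with a small sketch. This is not the actual test harness code, and the function names are hypothetical. It only shows one plausible motivation, assumed here rather than stated in the notes: signed differences can cancel each other when accumulated, while absolute errors cannot.

```python
def signed_diff(expected, actual):
    # Plain subtraction: positive and negative errors keep their sign.
    return [a - e for e, a in zip(expected, actual)]


def absolute_error(expected, actual):
    # Absolute error: every mismatch contributes a non-negative amount.
    return [abs(a - e) for e, a in zip(expected, actual)]


expected = [10, 20, 30]
actual = [12, 18, 30]

# The signed differences are +2, -2, 0, so their sum hides the mismatch.
signed_total = sum(signed_diff(expected, actual))
# The absolute errors are 2, 2, 0, so the mismatch stays visible.
abs_total = sum(absolute_error(expected, actual))
```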

Fixed

Update install.md to remove rust mention and fix test command.
Update README.md to remove reference to NVIDIA's PyPI index.
nvidia-pyindex was broken as it now tries to build the wheel to respect the new policy on packages using pyproject. Instead of installing this package, we just add the https://pypi.ngc.nvidia.com channel to the pip config file.
Pin versions of broken dependencies of Banshee.
Fixed ReduceMean parallelization and tiling issues described in Issue #134.
Fixed PULP FP32 regular and DW Conv2D, and MatMul tile constraints.
Fixed type casting for tiling code generation.
Fixed bug in buffer name identification in code generation for tests with L3 default memory level.
PULP GELU kernel to use tanh approximation.
Fixed bug for non-batched elements in the PULPOpen FP GEMM and matmul templates.
Added underscore to the beginning of closure names to avoid naming issues when they start with unsupported first characters (like numbers).
Data types in the PULPOpen FP add and mul templates.
Prevent node duplication for graphs generated via GraphSurgeon
Resolved issue with missing id in the Build Cache for Docker step, used in the Inject build-cache step.
Fix license CI check and prevent potential issues with jq installation
PULP Gemm batch variable serialization
Fixed multiple typos in variable and method names, such as changing includeGobalReferences to includeGlobalReferences and dicardedMappers to discardedMappers
Corrected method usage in importDeeployState to call NetworkContext.importNetworkContext instead of the incorrect method name
Correctly return signProp from setupDeployer instead of hardcoding the value to False in testMVP.py
Fixed Unsqueeze Op. when using ONNX opset 13 or higher (from attribute to input)
Fixed aliasing
Missing layout transformation of the const's (bias, mul, add, shift in Conv/RequantizedConv)
Keep mul/add rank of requantized Neureka tile constraints
Fix bias hoisting in generic GEMM with no bias
DMA synchronization bug causing reduced DB performance on memory-bound kernels.
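
The closure-name fix above relies on the fact that a C identifier may not start with a digit. A minimal sketch of the idea follows, assuming closure names are derived from node names; closure_name and is_valid_c_identifier are hypothetical helpers for illustration, not Deeploy functions.

```python
import re

# A C identifier starts with a letter or underscore, followed by
# letters, digits, or underscores.
_C_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_c_identifier(name: str) -> bool:
    return _C_IDENTIFIER.fullmatch(name) is not None


def closure_name(node_name: str) -> str:
    # Unconditionally prefix an underscore so that names beginning with
    # a digit become valid identifiers.
    return "_" + node_name


raw = "0_conv_closure"          # hypothetical node-derived name
fixed = closure_name(raw)       # "_0_conv_closure"
```

Prefixing unconditionally keeps the mapping simple and predictable, at the cost of changing names that were already valid.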

Removed

Delete outdated and unused .gitlab-ci.yml file
dory_dma.c and dory_dma.h
