Performance & Scaling
- ~40% reduction in peak GPU memory via in-place momentum updates and constructing `Lambda` directly in `sink_dtype`
- Max scale on L40S 48GB: 35k x 40k (was 25k x 30k)
- Max scale on H100 80GB: 45k x 45k / 40k x 50k, 17-18s, Spearman >= 0.999
- Fix `torch.dot` int32 overflow for transport plans with N*K > 2^31
- Graceful float64 output fallback when GPU memory is exhausted
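The two memory/overflow items above follow standard patterns. Below is a minimal NumPy sketch (illustrative only; the library itself operates on PyTorch tensors, and the function and parameter names such as `momentum_update_inplace` and `beta` are assumptions, not the actual API): updating the momentum buffer in place avoids allocating a fresh array for the result, and widening to int64 before a dot product avoids int32 wraparound once the sum exceeds 2^31 - 1.

```python
import numpy as np

def momentum_update_inplace(m, g, beta=0.9):
    """Update the momentum buffer in place: m <- beta*m + (1-beta)*g.

    Reuses m's storage for the result instead of allocating a new
    array, which is what lowers peak memory versus the out-of-place
    form `m = beta*m + (1-beta)*g`.
    """
    m *= beta               # scale existing buffer in place
    m += (1.0 - beta) * g   # accumulate the gradient term in place
    return m

# The in-place update returns the same buffer, not a copy.
m = np.ones(4)
out = momentum_update_inplace(m, np.full(4, 2.0))
assert out is m  # result written into the existing allocation

# int32 dot products wrap around once the accumulated sum exceeds
# 2^31 - 1; casting to int64 before reducing gives the exact value.
a = np.full(10, 2**28, dtype=np.int32)
ones = np.ones(10, dtype=np.int64)
safe = np.dot(a.astype(np.int64), ones)
print(int(safe))  # 10 * 2**28 = 2684354560, which exceeds 2**31 - 1
```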
Documentation
- Rewritten "How It Works" with Mermaid flowchart, algorithm intuition, and speedup table
- H100 + L40S benchmark tables in README and Sphinx docs
- Updated quickstart with `pip install torchgw` and `grad_mode` examples
- Reproducible benchmark script: python examples/benchmark_scale.py
Install
pip install torchgw==0.4.2

Full changelog: CHANGELOG.md