
v3.10.0

Change Log

Feature

  • Add support for foreach. (#287, #476, #477)
    • More than 10 optimizers (e.g. AdaFactor, StableAdamW, Lion, AdaBelief, Amos, ...) now support foreach.
    • In most cases, foreach improves training speed by 1.1x to 1.5x, with a moderate increase in memory usage.
    • Like official PyTorch optimizers, the default value of foreach is None. When foreach=None, CUDA paths prefer the foreach implementation over the for-loop implementation.
  • If you need the previous for-loop behavior, set foreach=False explicitly (see the sketch after this list).
  • Update the Emo-series optimizers. (#472, #478)
    • Update EmoNavi, EmoFact, and EmoLynx.
    • Begin deprecating the EmoNeco and EmoZeal optimizers.
  • Implement SpectralSphere optimizer. (#483, #485)
  • Support various coefficients for zero_power_via_newton_schulz_5. (#487)
    • Add coefficient presets: original, quintic, polar_express, and polar_express_safer.
    • Support custom coefficient schedules and expose ns_coeffs in Muon, DistributedMuon, AdaMuon, and AdaGO.
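
A minimal sketch of the foreach switch, assuming the Lion optimizer shown here accepts the keyword as described above (any of the listed optimizers should behave the same way):

```python
import torch

from pytorch_optimizer import Lion

model = torch.nn.Linear(10, 2)

# foreach=None (the default) lets CUDA paths prefer the fused foreach implementation;
# pass foreach=False explicitly to keep the previous for-loop behavior.
optimizer = Lion(model.parameters(), lr=1e-4, foreach=False)
```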

Refactor

  • Rename and organize type aliases. (#488)

Fix

  • Fix misbehavior in AdaFactor optimizer. (#477)
  • Fix a potential NaN issue in AdamP optimizer. (#480, #481)
  • Fix Lookahead wrapper compatibility with accelerate by normalizing lookahead_state serialization. (#484, #489)

Docs

  • Convert the previous docstring style to Google style. (#487)
  • Add py.typed to mark that the package distributes typing information. (#487)
  • Rework the README.md. (#491, #492)

CI/CD

  • Introduce uv. (#473)

v3.9.0

Change Log

Feature

Update

  • Enable more Pyright rules. (#457)
  • Reduce the visualization image size so the pages load faster. (#457)
  • Update the EmoNavi, EmoLynx, EmoFact optimizers to the latest version. (#454, #461)
  • Add a CLAUDE.md. (#467)

Fix

  • Fix LoRA training failing when the Dropout module is enabled in Kohya_ss. (#462, #463)
  • Fix potential LookSAM optimizer bug when there are multiple parameter groups. (#442, #465)

v3.8.2

Change Log

Feature

  • Speed up zeropower_via_newtonschulz by up to 20% by utilizing the torch.baddbmm and torch.addmm ops. (#448)

Update

  • Refactor the type hints. (#448)

Fix

  • Resolve a compatibility issue with older PyTorch versions where torch.optim.optimizer.ParamT could not be imported. (#448)

Docs

  • Convert the docstring style from reST to Google style. (#449)

v3.8.1

Change Log

Feature

Update

  • Accept the GaloreProjector parameters in the init params of the Conda optimizer. (#443, #444)

Bug

  • Fix NaN problem when grad norm is zero in StableSPAM optimizer. (#431)

Docs

  • Update the documentation page. (#428)

Contribution

thanks to @liveck, @AhmedMostafa16

v3.8.0

Change Log

Feature

  • Implement EmoNeco and EmoZeal optimizers. (#407)
  • Implement Refined Schedule-Free AdamW optimizer. (#409, #414)
  • Add more built-in optimizers: NAdam, RMSProp, and LBFGS. (#415)
  • Support cautious variant for Muon optimizer. (#417)
  • Separate the distributed functionality from Muon into the DistributedMuon optimizer. (#418)
  • Implement StochasticAccumulator, which is a gradient hook. (#418)

Update

  • Re-implement Muon and AdaMuon optimizers based on the recent official implementation. (#408, #410)
    • Their definitions have changed from the previous version, so please check out the documentation!
  • Update the missing optimizers from __init__.py. (#415)
  • Add the HuggingFace Trainer example. (#415)
  • Optimize the visualization outputs and change the visualization document to a table layout. (#416)

Dependency

  • Update mkdocs dependencies. (#417)

CI

  • Add GitHub Actions workflows to automate several processes. (#411, #412, #413)

Contributions

thanks to @AidinHamedi

v3.7.0

Change Log

Feature

CI

  • Enable CI for Python 3.8 ~ 3.13. (#402, #404)

Fix

  • Adjust the value of eps to the fixed value 1e-15 when adding to exp_avg_sq. (#397, #398)
  • Fix the built-in type hint in the Kron optimizer. (#404)

v3.6.1

Change Log

Feature

Update

  • Change the default range of the beta parameter from [0, 1] to [0, 1). (#392)

Fix

  • Fix the LMO calculation to use the momentum buffer instead of the gradient. (#385)

v3.6.0

Change Log

Feature

Update

  • Support tensors with more than two dimensions in the RACS and Alice optimizers. (#380)
  • Remove the auxiliary variants from the optimizers' default parameters and rename the affected state and parameter names. (#380)
    • use_gc, adanorm, cautious, stable_adamw, and adam_debias are affected.
    • You can still use these variants by passing the parameters through **kwargs.
    • Notably, for the adanorm variant you need to pass the adanorm parameter (and adanorm_r for the r option), and the state name changes from exp_avg_norm to exp_avg_adanorm; see the sketch after this list.
  • Refactor reset() to init_group() method in the BaseOptimizer class. (#380)
  • Refactor SAM optimizer family. (#380)
  • Gather the AdamP and SGDP utilities into pytorch_optimizer.optimizer.adamp.*. (#381)
    • pytorch_optimizer.optimizer.sgdp.SGDP to pytorch_optimizer.optimizer.adamp.SGDP
    • pytorch_optimizer.optimizer.util.projection to pytorch_optimizer.optimizer.adamp.projection
    • pytorch_optimizer.optimizer.util.cosine_similarity_by_view to pytorch_optimizer.optimizer.adamp.cosine_similarity_by_view
  • Remove channel_view() and layer_view() from pytorch_optimizer.optimizer.util. (#381)
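
A minimal sketch of passing the now-optional variant flags through **kwargs; AdaBelief is used only as an example and is assumed to accept the adanorm flags:

```python
import torch

from pytorch_optimizer import AdaBelief

model = torch.nn.Linear(10, 2)

# adanorm is no longer a default parameter, so pass it (and adanorm_r for the r option)
# explicitly; the corresponding state is now named exp_avg_adanorm instead of exp_avg_norm.
optimizer = AdaBelief(model.parameters(), lr=1e-3, adanorm=True, adanorm_r=0.95)
```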

Fix

  • Fix shape mismatch issues in the Galore projection for reverse_std, right, and full projection types. (#376)

v3.5.1

Change Log

Feature

  • Implement ScionLight optimizer. (#369)

Update

  • Update SCION optimizer based on the official implementation. (#369)

Fix

  • Correct the learning rate ratio in the Muon optimizer. (#371, #372, #373)

v3.5.0

Change Log

Feature

Update

  • Update Muon optimizer. (#355, #356)
    • Support decoupled weight decay.
    • Adjust the default hyperparameters to match the original implementation.
    • Support the adjusted lr from Moonlight; enable it by setting use_adjusted_lr=True (see the sketch after this list).
  • Tune the coupled Newton iteration method for a 5% performance increase. (#360)
  • Update the SCION optimizer. (#361)
    • Add a scale parameter.
    • Update get_lmo_direction.
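
A minimal sketch of the adjusted learning rate option on Muon; the remaining constructor arguments are assumptions and may differ from the actual signature:

```python
import torch

from pytorch_optimizer import Muon

model = torch.nn.Linear(10, 2)

# use_adjusted_lr=True enables the Moonlight-style adjusted learning rate.
optimizer = Muon(model.parameters(), lr=2e-2, use_adjusted_lr=True)
```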

Fix

  • Fix bias_correction2 in the ScheduleFreeRAdam optimizer. (#354)
  • Fix a potential bug in the SPAM optimizer. (#365)
  • Initialize the z state within step() of the ScheduleFreeWrapper. (#363, #366)

v3.4.2

Change Log

Feature

Update

  • Update the ScheduleFree SGD, AdamW, and RAdam optimizers to the latest versions. (#351, #353)
  • Remove the use_palm variant from the ScheduleFree optimizer due to instability. (#353)
  • Update the Ranger25 optimizer. (#353)

Fix

  • Remove the weight decouple parameter from the ScheduleFree optimizers. (#351, #353)

Docs

  • Fix AliG optimizer visualization. (#350)

Contributions

thanks to @AidinHamedi, @hatonosuke

v3.4.1

Change Log

Feature

Update

  • Support alternative precision training for Shampoo optimizer. (#339)
  • Add more features to and tune Ranger25 optimizer. (#340)
    • AGC + Lookahead variants
    • Change the default beta1 and beta2 to 0.95 and 0.98, respectively.
  • Skip adding the Lookahead wrapper for Ranger* optimizers in create_optimizer(), since they already include it. (#340)
  • Improved optimizer visualization. (#345)
  • Rename pytorch_optimizer.optimizer.gc to pytorch_optimizer.optimizer.gradient_centralization to avoid a possible conflict with the Python built-in module gc. (#349)

Bug

  • Fix the ADOPT optimizer to update exp_avg_sq after calculating the denominator. (#346, #347)

Docs

  • Update the visualizations. (#340)

Contributions

thanks to @AidinHamedi

v3.4.0

Change Log

Feature

Update

  • Support the OrthoGrad variant for Ranger25. (#332)
    • Ranger25 is my experimental, hand-crafted optimizer that mixes many optimizer variants, such as ADOPT + AdEMAMix + Cautious + StableAdamW + Adam-Atan2 + OrthoGrad.

Fix

  • Add the missing state property to the OrthoGrad optimizer. (#326, #327)
  • Add the missing state_dict and load_state_dict methods to the TRAC and OrthoGrad optimizers. (#332)
  • Skip sparse gradients in the OrthoGrad optimizer. (#332)
  • Support alternative precision training in SOAP optimizer. (#333)
  • Store SOAP condition matrices as the dtype of their parameters. (#335)

Contributions

thanks to @Vectorrent, @kylevedder

v3.3.4

Change Log

Feature

  • Support the OrthoGrad feature in create_optimizer(). (#324)
  • Enhance the flexibility of the optimizer parameter in the Lookahead, TRAC, and OrthoGrad optimizers. (#324)
    • Now supports both torch.optim.Optimizer instances and classes.
    • You can now use the Lookahead optimizer in two ways (see the sketch after this list).
      • Lookahead(AdamW(model.parameters(), lr=1e-3), k=5, alpha=0.5)
      • Lookahead(AdamW, k=5, alpha=0.5, params=model.parameters())
  • Implement SPAM optimizer. (#324)
  • Implement the TAM and AdaTAM optimizers. (#325)
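
A small, self-contained sketch of the two usage patterns listed above; torch.optim.AdamW is used here, but any optimizer instance or class should work:

```python
import torch
from torch.optim import AdamW

from pytorch_optimizer import Lookahead

model = torch.nn.Linear(10, 2)

# 1) wrap an already-built optimizer instance
optimizer = Lookahead(AdamW(model.parameters(), lr=1e-3), k=5, alpha=0.5)

# 2) or pass the optimizer class and let Lookahead construct it from `params`
optimizer = Lookahead(AdamW, k=5, alpha=0.5, params=model.parameters())
```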

v3.3.3

Change Log

Feature

v3.3.2

Change Log

Feature

Bug

  • Clone exp_avg before calling apply_cautious so that exp_avg is not masked. (#316)

v3.3.1

Change Log

Feature

  • Support the Cautious variant for the AdaShift optimizer. (#310)
  • Save the state of the Lookahead optimizer too. (#310)
  • Implement the APOLLO optimizer. (#311, #312)
  • Rename the Apollo optimizer (An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization) to ApolloDQN so it does not overlap with the new APOLLO optimizer. (#312)
  • Implement the MARS optimizer. (#313, #314)
  • Support the Cautious variant for the MARS optimizer. (#314)

Bug

  • Fix bias_correction in AdamG optimizer. (#305, #308)
  • Fix a potential bug when loading the state for Lookahead optimizer. (#306, #310)

Docs

  • Add more visualizations. (#310, #314)

Contributions

thanks to @Vectorrent

v3.3.0

Change Log

Feature

Refactor

  • Big refactoring: remove direct imports from pytorch_optimizer.*.
    • Some utilities can no longer be imported directly from pytorch_optimizer.* because they are rarely used and are not optimizers themselves, only helpers for specific optimizers (see the import sketch after this list).
    • pytorch_optimizer.[Shampoo stuff] -> pytorch_optimizer.optimizers.shampoo_utils.[Shampoo stuff].
      • shampoo_utils covers Graft, BlockPartitioner, PreConditioner, etc.
    • pytorch_optimizer.GaLoreProjector -> pytorch_optimizer.optimizers.galore.GaLoreProjector.
    • pytorch_optimizer.gradfilter_ema -> pytorch_optimizer.optimizers.grokfast.gradfilter_ema.
    • pytorch_optimizer.gradfilter_ma -> pytorch_optimizer.optimizers.grokfast.gradfilter_ma.
    • pytorch_optimizer.l2_projection -> pytorch_optimizer.optimizers.alig.l2_projection.
    • pytorch_optimizer.flatten_grad -> pytorch_optimizer.optimizers.pcgrad.flatten_grad.
    • pytorch_optimizer.un_flatten_grad -> pytorch_optimizer.optimizers.pcgrad.un_flatten_grad.
    • pytorch_optimizer.reduce_max_except_dim -> pytorch_optimizer.optimizers.sm3.reduce_max_except_dim.
    • pytorch_optimizer.neuron_norm -> pytorch_optimizer.optimizers.nero.neuron_norm.
    • pytorch_optimizer.neuron_mean -> pytorch_optimizer.optimizers.nero.neuron_mean.
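
A before/after sketch of the new import paths, using two of the mappings listed above:

```python
# before (<= v3.2.x): imported directly from the package root
# from pytorch_optimizer import GaLoreProjector, gradfilter_ema

# after (v3.3.0): import from the specific optimizer modules
from pytorch_optimizer.optimizers.galore import GaLoreProjector
from pytorch_optimizer.optimizers.grokfast import gradfilter_ema
```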

Docs

  • Add more visualizations. (#297)

Bug

  • Add optimizer parameter to PolyScheduler constructor. (#295)

Contributions

thanks to @tanganke

v3.2.0

Change Log

Feature

  • Implement SOAP optimizer. (#275)
  • Support AdEMAMix variants. (#276)
    • bnb_ademamix8bit, bnb_ademamix32bit, bnb_paged_ademamix8bit, bnb_paged_ademamix32bit
  • Support 8/4bit, fp8 optimizers. (#208, #281)
    • torchao_adamw8bit, torchao_adamw4bit, torchao_adamwfp8.
  • Support a module-name-level (e.g. LayerNorm) weight decay exclusion for get_optimizer_parameters (see the sketch after this list). (#282, #283)
  • Implement CPUOffloadOptimizer, which offloads the optimizer to the CPU for single-GPU training. (#284)
  • Support a regex-based filter for searching the names of optimizers, lr schedulers, and loss functions.
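
A minimal sketch of the two utility features above; the exact parameter names (weight_decay, wd_ban_list) and the wildcard pattern form are assumptions:

```python
import torch

from pytorch_optimizer import get_optimizer_parameters, get_supported_optimizers

model = torch.nn.TransformerEncoderLayer(d_model=32, nhead=4)

# module-name-level exclusion: parameters under a module whose name matches 'LayerNorm'
# go into the no-weight-decay group.
param_groups = get_optimizer_parameters(model, weight_decay=1e-2, wd_ban_list=['bias', 'LayerNorm'])

# filter the list of supported optimizers by a name pattern
adam_like = get_supported_optimizers('adam*')
```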

Bug

  • Fix the should_grokfast condition during initialization. (#279, #280)

Contributions

thanks to @Vectorrent

v3.1.2

Change Log

Feature

Bug

  • Add **kwargs to the parameters as a dummy placeholder. (#270, #271)

v3.1.1

Change Log

Feature

Bug

  • Handle optimizers that take the model instead of the parameters in create_optimizer(). (#263)
  • Move the variable to the same device as the parameter. (#266, #267)

v3.1.0

Change Log

Feature

Refactor

  • Refactor AdamMini optimizer. (#258)
  • Deprecate optional dependency, bitsandbytes. (#258)
  • Move get_rms, approximate_sq_grad functions to BaseOptimizer for reusability. (#258)
  • Refactor shampoo_utils.py. (#259)
  • Add the debias and debias_adam methods to BaseOptimizer. (#261)
  • Refactor to use BaseOptimizer only, rather than inheriting from multiple classes. (#261)

Bug

  • Fix several bugs in AdamMini optimizer. (#257)

Contributions

thanks to @sdbds

v3.0.2

Change Log

Feature

Refactor

  • Refactor the Chebyshev lr scheduler modules. (#248)
    • Rename get_chebyshev_lr to get_chebyshev_lr_lambda.
    • Rename get_chebyshev_schedule to get_chebyshev_perm_steps.
    • Call the get_chebyshev_schedule function to get a LambdaLR scheduler object (see the sketch after this list).
  • Refactor with ScheduleType. (#248)
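
A minimal sketch of the new call pattern; the second argument (the total number of steps) is an assumption for illustration:

```python
import torch

from pytorch_optimizer import get_chebyshev_schedule

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# get_chebyshev_schedule now returns a LambdaLR scheduler object directly.
scheduler = get_chebyshev_schedule(optimizer, 100)
```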

v3.0.1

Change Log

Feature

Bug

  • Fix the wrong typing of reg_noise. (#239, #240)
  • Fix Lookahead's param_groups attribute not being loaded from a checkpoint. (#237, #238)

Contributions

thanks to @michaldyczko

v3.0.0

The major version is updated! (v2.12.0 -> v3.0.0) (#164)

pytorch-optimizer bundles many optimizers, learning rate schedulers, and objective functions. Currently, it supports 67 optimizers (+ bitsandbytes), 11 lr schedulers, and 13 loss functions, and has reached roughly 4 ~ 50K downloads per month (peak: 75K downloads per month)!

The reason for updating the major version from v2 to v3 is that I think it's a good time to ship the recent implementations (the last update was about 7 months ago) and to pivot to new concepts like training utilities while maintaining the original features (e.g. optimizers). Richer test cases, benchmarks, and examples are also on the list!

Finally, thanks for using the pytorch-optimizer, and feel free to make any requests :)

Change Log

Feature

Fix

  • Fix SRMM to allow operation beyond memory_length. (#227)

Dependency

  • Drop Python 3.7 support officially. (#221)
  • Update bitsandbytes to 0.43.0. (#228)

Docs

  • Add missing parameters in Ranger21 optimizer document. (#214, #215)
  • Fix WSAM optimizer paper link. (#219)

Diff

Contributions

thanks to @sdbds, @i404788

v2.12.0

Change Log

Feature

  • Support bitsandbytes optimizer. (#211)
    • Now you can install it with pip3 install pytorch-optimizer[bitsandbytes].
    • Supports 8 bnb optimizers:
      • bnb_adagrad8bit, bnb_adam8bit, bnb_adamw8bit, bnb_lion8bit, bnb_lamb8bit, bnb_lars8bit, bnb_rmsprop8bit, bnb_sgd8bit.
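
A minimal sketch of pulling one of the bnb optimizers by name, assuming the load_optimizer helper, a CUDA device, and the bitsandbytes extra installed:

```python
import torch

from pytorch_optimizer import load_optimizer

model = torch.nn.Linear(10, 2).cuda()

# look up the bitsandbytes-backed 8-bit AdamW by name and build it like any other optimizer
optimizer_class = load_optimizer('bnb_adamw8bit')
optimizer = optimizer_class(model.parameters(), lr=1e-3)
```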

Docs

Diff

2.11.2...2.12.0

v2.11.2

Change Log

Feature

Fix

  • Fix Lookahead optimizer (#200, #201, #202)
    • Needed when using PyTorch Lightning, which expects your optimizer to be a subclass of Optimizer.
  • Fix default rectify to False in AdaBelief optimizer (#203)

Test

  • Add DynamicLossScaler test case

Docs

  • Highlight the code blocks
  • Fix pepy badges

Diff

2.11.1...2.11.2

Contributions

thanks to @georg-wolflein

v2.11.1

Change Log

Feature

Diff

2.11.0...2.11.1

v2.11.0

Change Log

Feature

Diff

2.10.1...2.11.0

v2.10.1

Change Log

Feature

Fix

  • Fix perturb not being multiplied by -step_size in the SWATS optimizer. (#179)
  • Fix the Chebyshev step having size T while the permutation has size 2^T. (#168, #181)

Diff

2.10.0...2.10.1

v2.10.0

Change Log

Feature

Diff

2.9.1...2.10.0

v2.9.1

Change Log

Fix

  • Fix weight decay in Ranger21 (#170)

Diff

2.9.0...2.9.1

v2.9.0

Change Log

Feature

Docs

  • Fix readthedocs build issue (#156)
  • Move citations into table (#156)

Refactor

  • Refactor validation logic (#149, #150)
  • Rename the amsbound and amsgrad terms to ams_bound (#149)
  • Return the gradient instead of the parameter in AGC (#149)
  • Refactor duplicated logic (e.g. rectified step size, AMSBound, AdamD, AdaNorm, weight decay) into re-usable functions (#150)
  • Move pytorch_optimizer.experimental under pytorch_optimizer.*.experimental

Diff

2.8.0...2.9.0

v2.8.0

Change Log

Feature

Bug

  • Fix update in Lion optimizer (#135)
  • Fix momentum_buffer in SGDP optimizer (#139)

Diff

2.7.0...2.8.0

v2.7.0

Change Log

Feature

Refactor

  • Rename adamd_debias_term to adam_debias (#133)
  • Merge the rectified version with the original (#133)
    • diffRGrad + diffGrad -> diffGrad
    • RaLamb + Lamb -> Lamb
    • Now you can simply enable the rectified variant with rectify=True (see the sketch below)
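
A minimal sketch of the merged rectified variant, assuming Lamb exposes the rectify flag as described above:

```python
import torch

from pytorch_optimizer import Lamb

model = torch.nn.Linear(10, 2)

# what used to be RaLamb is now the base Lamb with rectification turned on
optimizer = Lamb(model.parameters(), lr=1e-3, rectify=True)
```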

Bug

  • Fix previous_grad deepcopy issue in Adan optimizer (#134)

Diff

2.6.1...2.7.0