- Add support for `foreach`. (#287, #476, #477)
  - More than 10 optimizers (e.g. AdaFactor, StableAdamW, Lion, AdaBelief, Amos, ...) now support `foreach`.
  - In most cases, `foreach` improves training speed by 1.1x to 1.5x, with a moderate increase in memory usage.
  - Like official PyTorch optimizers, the default value of `foreach` is `None`. When `foreach=None`, CUDA paths prefer the `foreach` implementation over the for-loop implementation.
  - If you need the previous for-loop behavior, set `foreach=False` explicitly (see the sketch below).
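A minimal sketch of the new knob, using `AdaBelief` (one of the optimizers listed above; any supported optimizer works the same way):

```python
import torch

from pytorch_optimizer import AdaBelief

model = torch.nn.Linear(16, 2)

# foreach=None (the default) lets CUDA paths prefer the fused foreach implementation.
optimizer = AdaBelief(model.parameters(), lr=1e-3)

# opt out explicitly to keep the previous for-loop behavior.
optimizer = AdaBelief(model.parameters(), lr=1e-3, foreach=False)
```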
- Update the Emo-series optimizers. (#472, #478)
  - Update `EmoNavi`, `EmoFact`, and `EmoLynx`.
  - Begin deprecating `EmoNeco` and `EmoZeal` (they are being phased out).
- Implement `SpectralSphere` optimizer. (#483, #485)
- Support various coefficients for `zero_power_via_newton_schulz_5`. (#487)
  - Add coefficient presets: `original`, `quintic`, `polar_express`, and `polar_express_safer`.
  - Support custom coefficient schedules and expose `ns_coeffs` in `Muon`, `DistributedMuon`, `AdaMuon`, and `AdaGO` (sketched below).
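A hypothetical sketch of selecting a preset; the exact form `ns_coeffs` accepts (preset name, coefficient tuples, or a schedule) is whatever #487 settled on, so treat the value below as illustrative:

```python
import torch

from pytorch_optimizer import Muon

# Muon updates matrix-shaped parameters; a toy 2D weight is used here.
params = [torch.nn.Parameter(torch.randn(32, 32))]

# hypothetical: pick the 'polar_express' Newton-Schulz coefficient preset by name.
optimizer = Muon(params, lr=1e-3, ns_coeffs='polar_express')
```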
- Rename and organize type aliases. (#488)
- Fix misbehavior in the `AdaFactor` optimizer. (#477)
- Fix a potential `NaN` issue in the `AdamP` optimizer. (#480, #481)
- Fix `Lookahead` wrapper compatibility with `accelerate` by normalizing `lookahead_state` serialization. (#484, #489)
- Convert the previous docstring style to Google style. (#487)
- Add `py.typed` to mark the distributed package as typed. (#487)
- Rework the README.md. (#491, #492)
- Introduce `uv`. (#473)
- Implement BCOS optimizer. (#455, #458)
- Implement CWD (Cautious Weight Decay) feature. (#453, #460)
- Implement Ano optimizer. (#432, #466)
- Enable more Pyright rules. (#457)
- Reduce the visualization image size for faster loading. (#457)
- Update the EmoNavi, EmoLynx, EmoFact optimizers to the latest version. (#454, #461)
- Add a `CLAUDE.md`. (#467)
- Fix LoRA training failure when the Dropout module is enabled in `Kohya_ss`. (#462, #463)
- Fix a potential LookSAM optimizer bug when there are multiple parameter groups. (#442, #465)
- Speed up `zeropower_via_newtonschulz` by up to 20% by utilizing the `torch.baddmm` and `torch.addmm` ops. (#448)
- Refactor the type hints. (#448)
- Resolve a compatibility issue with lower PyTorch versions where `torch.optim.optimizer.ParamT` could not be imported. (#448)
- Convert the docstring style from reST to Google style. (#449)
- Implement `FriendlySAM` optimizer. (#424, #434)
- Implement `AdaGO` optimizer. (#436, #437)
- Update `EXAdam` optimizer to the latest version. (#438)
- Update `EmoNavi` optimizer to the latest version. (#433, #439)
- Implement `Conda` optimizer. (#440, #441)
- Accept the `GaloreProjector` parameters in the init params of the `Conda` optimizer. (#443, #444)
- Fix a NaN problem when the grad norm is zero in the StableSPAM optimizer. (#431)
- Update the documentation page. (#428)
thanks to @liveck, @AhmedMostafa16
- Implement `EmoNeco` and `EmoZeal` optimizers. (#407)
- Implement `Refined Schedule-Free AdamW` optimizer. (#409, #414)
  - Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training
  - You can use this variant by setting the `decoupling_c` parameter in the `ScheduleFreeAdamW` optimizer (see the sketch below).
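A sketch of opting into the refined variant; the value of `decoupling_c` below is illustrative, not a recommended default:

```python
import torch

from pytorch_optimizer import ScheduleFreeAdamW

model = torch.nn.Linear(16, 2)

# assumption: a non-zero decoupling_c enables the refined (decoupled) behavior.
optimizer = ScheduleFreeAdamW(model.parameters(), lr=1e-3, decoupling_c=10.0)
```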
- Add more built-in optimizers: `NAdam`, `RMSProp`, and `LBFGS`. (#415)
- Support the `cautious` variant for the `Muon` optimizer. (#417)
- Separate the distributed functionality from `Muon` into the `DistributedMuon` optimizer. (#418)
- Implement `StochasticAccumulator`, which is a gradient hook. (#418)
- Re-implement `Muon` and `AdaMuon` optimizers based on the recent official implementation. (#408, #410)
  - Their definitions have changed from the previous version, so please check out the documentation!
- Add the missing optimizers to `__init__.py`. (#415)
- Add the HuggingFace Trainer example. (#415)
- Optimize the visualization outputs and change the visualization document to a table layout. (#416)
- Update `mkdocs` dependencies. (#417)
- Add some GitHub actions to automate some processes. (#411, #412, #413)
thanks to @AidinHamedi
- Implement `AdaMuon` optimizer. (#394, #395)
- Implement `SPlus` optimizer. (#396, #399)
- Implement `EmoNavi`, `EmoFact`, and `EmoLynx` optimizers. (#393, #400)
- Enable CI for Python 3.8 ~ 3.13. (#402, #404)
- Adjust the value of `eps` to the fixed value `1e-15` when adding to `exp_avg_sq`. (#397, #398)
- Use built-in type hints in the `Kron` optimizer. (#404)
- Implement more cooldown types for WSD learning rate scheduler. (#382, #386)
- Implement `AdamWSN` optimizer. (#387, #389)
- Implement `AdamC` optimizer. (#388, #390)
- Change the default range of the `beta` parameter from `[0, 1]` to `[0, 1)`. (#392)
- Fix to use the `momentum buffer` instead of the gradient to calculate the LMO. (#385)
- Implement `Fira` optimizer. (#376)
- Implement `RACS` and `Alice` optimizers. (#376)
- Implement `VSGD` optimizer. (#377, #378)
- Enable training with complex parameters. (#370, #380)
  - Raises `NoComplexParameterError` for unsupported optimizers, either by design or because support is not yet implemented.
- Support the `maximize` parameter (example below). (#370, #380)
  - `maximize`: maximize the objective with respect to the params, instead of minimizing.
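A toy sketch of `maximize`, assuming `AdamP` is among the optimizers that accept it:

```python
import torch

from pytorch_optimizer import AdamP

param = torch.nn.Parameter(torch.zeros(4))

# maximize=True flips the update direction: step() ascends the objective.
optimizer = AdamP([param], lr=1e-1, maximize=True)

objective = param.sum()  # toy objective to maximize
objective.backward()
optimizer.step()
```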
- Implement the `copy_stochastic()` method. (#381)
- Support tensors with more than two dimensions for the `RACS` and `Alice` optimizers. (#380)
- Remove the auxiliary variants from the default parameters of the optimizers and change the name of the state and parameter. (#380)
  - `use_gc`, `adanorm`, `cautious`, `stable_adamw`, and `adam_debias` will be affected.
  - You can still use these variants by passing the parameters to `**kwargs` (see the sketch below).
  - Notably, in the case of the `adanorm` variant, you need to pass the `adanorm` parameter (and `adanorm_r` for the `r` option) to use it, and the name of the state will change from `exp_avg_norm` to `exp_avg_adanorm`.
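A sketch of re-enabling an auxiliary variant after this change, using `AdaBelief` (which supports the `AdaNorm` variant) as an example; the flag types below are assumed to match the pre-#380 signatures:

```python
import torch

from pytorch_optimizer import AdaBelief

model = torch.nn.Linear(16, 2)

# variant flags now travel through **kwargs; adanorm_r backs the `r` option.
optimizer = AdaBelief(model.parameters(), lr=1e-3, adanorm=True, adanorm_r=0.95)
```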
- Refactor `reset()` into the `init_group()` method in the `BaseOptimizer` class. (#380)
- Refactor the `SAM` optimizer family. (#380)
- Gather the `AdamP` and `SGDP` things into `pytorch_optimizer.optimizer.adamp.*`. (#381)
  - `pytorch_optimizer.optimizer.sgdp.SGDP` to `pytorch_optimizer.optimizer.adamp.SGDP`
  - `pytorch_optimizer.optimizer.util.projection` to `pytorch_optimizer.optimizer.adamp.projection`
  - `pytorch_optimizer.optimizer.util.cosine_similarity_by_view` to `pytorch_optimizer.optimizer.adamp.cosine_similarity_by_view`
- Remove `channel_view()` and `layer_view()` from `pytorch_optimizer.optimizer.util`. (#381)
- Fix shape mismatch issues in the GaLore projection for the `reverse_std`, `right`, and `full` projection types. (#376)
- Implement `ScionLight` optimizer. (#369)
- Update `SCION` optimizer based on the official implementation. (#369)
- Correct the learning rate ratio in the `Muon` optimizer. (#371, #372, #373)
- Support `StableSPAM` optimizer. (#358, #359)
- Support `ScheduleFreeWrapper`. (#334, #360)
- Implement `AdaGC` optimizer. (#364, #366)
- Implement `Simplified-Ademamix` optimizer. (#364, #366)
- Support the `Ackley` function for testing optimization algorithms.
- Update Muon optimizer. (#355, #356)
  - Support decoupled weight decay.
  - Adjust the default hyperparameters to match the original implementation.
  - Support the adjusted lr from Moonlight. You can use it by setting `use_adjusted_lr=True` (see the sketch below).
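A sketch of the Moonlight-style adjusted lr:

```python
import torch

from pytorch_optimizer import Muon

# Muon updates matrix-shaped parameters; a toy 2D weight is used here.
params = [torch.nn.Parameter(torch.randn(32, 32))]

# use_adjusted_lr=True enables the adjusted lr from the Moonlight work.
optimizer = Muon(params, lr=1e-3, use_adjusted_lr=True)
```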
- Tune the coupled Newton iteration method for a 5% performance increase. (#360)
- Update `SCION` optimizer. (#361)
  - Add the `scale` parameter.
  - Update `get_lmo_direction`.
- Fix `bias_correction2` in the `ScheduleFreeRAdam` optimizer. (#354)
- Fix a potential bug in the `SPAM` optimizer. (#365)
- Initialize the `z` state within the `step()` of the `ScheduleFreeWrapper`. (#363, #366)
- Implement `SCION` optimizer. (#348, #352)
- Update the `ScheduleFreeSGD`, `ScheduleFreeAdamW`, and `ScheduleFreeRAdam` optimizers to the latest version. (#351, #353)
- Remove the `use_palm` variant in the ScheduleFree optimizers due to instability. (#353)
- Update the `Ranger25` optimizer. (#353)
- Remove the `weight_decouple` parameter in the ScheduleFree optimizers. (#351, #353)
- Fix the `AliG` optimizer visualization. (#350)
thanks to @AidinHamedi, @hatonosuke
- Support `GCSAM` optimizer. (#343, #344)
  - Gradient Centralized Sharpness Aware Minimization
  - You can use it from the `SAM` optimizer by setting `use_gc=True` (see the sketch below).
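A sketch of GCSAM via the `SAM` wrapper, assuming the usual two-step SAM API (`first_step`/`second_step`):

```python
import torch

from pytorch_optimizer import SAM

model = torch.nn.Linear(16, 2)
criterion = torch.nn.MSELoss()

# use_gc=True applies gradient centralization on top of SAM (GCSAM).
optimizer = SAM(model.parameters(), base_optimizer=torch.optim.AdamW, use_gc=True)

x, y = torch.randn(8, 16), torch.randn(8, 2)

criterion(model(x), y).backward()
optimizer.first_step(zero_grad=True)   # ascend to the local worst case

criterion(model(x), y).backward()      # re-evaluate at the perturbed weights
optimizer.second_step(zero_grad=True)  # descend with the base optimizer
```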
- Support `LookSAM` optimizer. (#343, #344)
- Support alternative precision training for the `Shampoo` optimizer. (#339)
- Add more features to and tune the `Ranger25` optimizer. (#340)
  - Add the `AGC` + `Lookahead` variants.
  - Change the default beta1 and beta2 to 0.95 and 0.98, respectively.
- Skip adding the `Lookahead` wrapper for `Ranger*` optimizers, which already have it, in `create_optimizer()`. (#340)
- Improve the optimizer visualization. (#345)
- Rename `pytorch_optimizer.optimizer.gc` to `pytorch_optimizer.optimizer.gradient_centralization` to avoid a possible conflict with the Python built-in module `gc`. (#349)
- Fix to update `exp_avg_sq` after calculating the denominator in the `ADOPT` optimizer. (#346, #347)
- Update the visualizations. (#340)
thanks to @AidinHamedi
- Implement `FOCUS` optimizer. (#330, #331)
- Implement `PSGD Kron` optimizer. (#336, #337)
- Implement `EXAdam` optimizer. (#338, #339)
- Support the `OrthoGrad` variant for `Ranger25`. (#332)
  - The `Ranger25` optimizer is my experimentally crafted optimizer, which mixes lots of optimizer variants such as `ADOPT` + `AdEMAMix` + `Cautious` + `StableAdamW` + `Adam-Atan2` + `OrthoGrad`.
- Add the missing `state` property to the `OrthoGrad` optimizer. (#326, #327)
- Add the missing `state_dict` and `load_state_dict` methods to the `TRAC` and `OrthoGrad` optimizers. (#332)
- Skip sparse gradients in the `OrthoGrad` optimizer. (#332)
- Support alternative precision training in the `SOAP` optimizer. (#333)
- Store the SOAP condition matrices as the dtype of their parameters. (#335)
thanks to @Vectorrent, @kylevedder
- Support the `OrthoGrad` feature for `create_optimizer()`. (#324)
- Enhance flexibility for the `optimizer` parameter in the `Lookahead`, `TRAC`, and `OrthoGrad` optimizers. (#324)
  - Now supports both `torch.optim.Optimizer` instances and classes.
  - You can now use the `Lookahead` optimizer in two ways (runnable sketch below).
    - `Lookahead(AdamW(model.parameters(), lr=1e-3), k=5, alpha=0.5)`
    - `Lookahead(AdamW, k=5, alpha=0.5, params=model.parameters())`
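Both construction styles from the notes above, as a runnable sketch:

```python
import torch
from torch.optim import AdamW

from pytorch_optimizer import Lookahead

model = torch.nn.Linear(16, 2)

# 1) wrap an already-constructed optimizer instance.
optimizer = Lookahead(AdamW(model.parameters(), lr=1e-3), k=5, alpha=0.5)

# 2) pass the optimizer class and let Lookahead construct it.
optimizer = Lookahead(AdamW, k=5, alpha=0.5, params=model.parameters())
```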
- Implement `SPAM` optimizer. (#324)
- Implement `TAM` and `AdaTAM` optimizers. (#325)
- Implement `Grams` optimizer. (#317, #318)
- Support the `stable_adamw` variant for the `ADOPT` and `AdEMAMix` optimizers. (#321)
  - `optimizer = ADOPT(model.parameters(), ..., stable_adamw=True)`
- Implement an experimental optimizer `Ranger25` (not tested). (#321)
  - Mixes the `ADOPT` + `AdEMAMix` + `StableAdamW` + `Cautious` + `RAdam` optimizers.
- Implement `OrthoGrad` optimizer. (#321)
- Support the `Adam-Atan2` feature for the `Prodigy` optimizer when `eps` is None. (#321)
- Implement `SGDSaI` optimizer. (#315, #316)
- Clone `exp_avg` before calling `apply_cautious` to avoid masking `exp_avg`. (#316)
- Support the `Cautious` variant for the `AdaShift` optimizer. (#310)
- Save the state of the `Lookahead` optimizer too. (#310)
- Implement `APOLLO` optimizer. (#311, #312)
- Rename the `Apollo` (An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization) optimizer to `ApolloDQN` so it does not overlap with the new optimizer name `APOLLO`. (#312)
- Implement `MARS` optimizer. (#313, #314)
- Support the `Cautious` variant for the `MARS` optimizer. (#314)
- Fix `bias_correction` in the `AdamG` optimizer. (#305, #308)
- Fix a potential bug when loading the state for the `Lookahead` optimizer. (#306, #310)
- Add more visualizations. (#310, #314)
thanks to @Vectorrent
- Support the `PaLM` variant for the `ScheduleFreeAdamW` optimizer. (#286, #288)
  - You can use this feature by setting `use_palm` to `True`.
- Implement `ADOPT` optimizer. (#289, #290)
- Implement `FTRL` optimizer. (#291)
- Implement the `Cautious optimizer` feature. (#294)
  - Improving Training with One Line of Code
  - You can use it by setting `cautious=True` for the `Lion`, `AdaFactor`, and `AdEMAMix` optimizers (see the sketch below).
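The one-line opt-in, sketched with `Lion` (one of the optimizers named above):

```python
import torch

from pytorch_optimizer import Lion

model = torch.nn.Linear(16, 2)

# cautious=True enables the Cautious variant of the update rule.
optimizer = Lion(model.parameters(), lr=1e-4, cautious=True)
```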
- Improve the stability of the `ADOPT` optimizer. (#294)
- Support a new projection type, `random`, for `GaLoreProjector`. (#294)
- Implement `DeMo` optimizer. (#300, #301)
- Implement `Muon` optimizer. (#302)
- Implement `ScheduleFreeRAdam` optimizer. (#304)
- Implement `LaProp` optimizer. (#304)
- Support the `Cautious` variant for the `LaProp`, `AdamP`, and `ADOPT` optimizers. (#304)
- Big refactoring, removing direct imports from `pytorch_optimizer.*`.
  - Some methods can no longer be imported directly from `pytorch_optimizer.*` because they're probably not used frequently and are actually not optimizers but rather utils only used by specific optimizers.
  - `pytorch_optimizer.[Shampoo stuff]` -> `pytorch_optimizer.optimizers.shampoo_utils.[Shampoo stuff]`
    - `shampoo_utils` like `Graft`, `BlockPartitioner`, `PreConditioner`, etc. You can check the details here.
  - `pytorch_optimizer.GaLoreProjector` -> `pytorch_optimizer.optimizers.galore.GaLoreProjector`
  - `pytorch_optimizer.gradfilter_ema` -> `pytorch_optimizer.optimizers.grokfast.gradfilter_ema`
  - `pytorch_optimizer.gradfilter_ma` -> `pytorch_optimizer.optimizers.grokfast.gradfilter_ma`
  - `pytorch_optimizer.l2_projection` -> `pytorch_optimizer.optimizers.alig.l2_projection`
  - `pytorch_optimizer.flatten_grad` -> `pytorch_optimizer.optimizers.pcgrad.flatten_grad`
  - `pytorch_optimizer.un_flatten_grad` -> `pytorch_optimizer.optimizers.pcgrad.un_flatten_grad`
  - `pytorch_optimizer.reduce_max_except_dim` -> `pytorch_optimizer.optimizers.sm3.reduce_max_except_dim`
  - `pytorch_optimizer.neuron_norm` -> `pytorch_optimizer.optimizers.nero.neuron_norm`
  - `pytorch_optimizer.neuron_mean` -> `pytorch_optimizer.optimizers.nero.neuron_mean`
- Add more visualizations. (#297)
- Add an `optimizer` parameter to the `PolyScheduler` constructor. (#295)
thanks to @tanganke
- Implement `SOAP` optimizer. (#275)
- Support `AdEMAMix` variants. (#276)
  - `bnb_ademamix8bit`, `bnb_ademamix32bit`, `bnb_paged_ademamix8bit`, `bnb_paged_ademamix32bit`
- Support 8/4-bit and fp8 optimizers. (#208, #281)
  - `torchao_adamw8bit`, `torchao_adamw4bit`, `torchao_adamwfp8`
- Support a module-name-level (e.g. `LayerNorm`) weight decay exclusion for `get_optimizer_parameters` (see the sketch after this list). (#282, #283)
- Implement `CPUOffloadOptimizer`, which offloads the optimizer to the CPU for single-GPU training. (#284)
- Support a regex-based filter for searching the names of optimizers, lr schedulers, and loss functions.
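A sketch of the module-name-level exclusion; the `wd_ban_list` argument name is an assumption here, so check the `get_optimizer_parameters` signature:

```python
import torch

from pytorch_optimizer import get_optimizer_parameters

model = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.LayerNorm(16),
    torch.nn.Linear(16, 2),
)

# parameters matched by the ban list get weight_decay=0.0; the rest keep 1e-2.
param_groups = get_optimizer_parameters(model, weight_decay=1e-2, wd_ban_list=['bias', 'LayerNorm'])
optimizer = torch.optim.AdamW(param_groups)
```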
- Fix the `should_grokfast` condition at initialization. (#279, #280)
thanks to @Vectorrent
- Implement `AdEMAMix` optimizer. (#272)
- Add `**kwargs` to the parameters as a dummy placeholder. (#270, #271)
- Implement `TRAC` optimizer. (#263)
- Support the `AdamW` optimizer via `create_optimizer()`. (#263)
- Implement `AdamG` optimizer. (#264, #265)
- Handle the optimizers that only take the `model` instead of the parameters in `create_optimizer()`. (#263)
- Move the variable to the same device as the parameter. (#266, #267)
- Implement `AdaLomo` optimizer. (#258)
- Support `Q-GaLore` optimizer. (#258)
  - Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
  - You can use it via `optimizer = load_optimizer('q_galore_adamw8bit')`.
- Support more bnb optimizers. (#258)
  - `bnb_paged_adam8bit`, `bnb_paged_adamw8bit`, `bnb_*_*32bit`.
- Improve `power_iteration()` speed by up to 40%. (#259)
- Improve `reg_noise()` (E-MCMC) speed by up to 120%. (#260)
- Support the `disable_lr_scheduler` parameter for the `Ranger21` optimizer to disable the built-in learning rate scheduler. (#261)
- Refactor `AdamMini` optimizer. (#258)
- Deprecate the optional dependency `bitsandbytes`. (#258)
- Move the `get_rms` and `approximate_sq_grad` functions to `BaseOptimizer` for reusability. (#258)
- Refactor `shampoo_utils.py`. (#259)
- Add the `debias` and `debias_adam` methods to `BaseOptimizer`. (#261)
- Refactor to use `BaseOptimizer` only, instead of inheriting from multiple classes. (#261)
- Fix several bugs in the `AdamMini` optimizer. (#257)
thanks to @sdbds
- Implement `WSD` LR scheduler. (#247, #248)
- Add more PyTorch built-in lr schedulers. (#248)
- Implement `Kate` optimizer. (#249, #251)
- Implement `StableAdamW` optimizer. (#250, #252)
- Implement `AdamMini` optimizer. (#246, #253)
- Refactor the `Chebyshev` lr scheduler modules. (#248)
  - Rename `get_chebyshev_lr` to `get_chebyshev_lr_lambda`.
  - Rename `get_chebyshev_schedule` to `get_chebyshev_perm_steps`.
  - Call the `get_chebyshev_schedule` function to get a `LambdaLR` scheduler object.
- Refactor with `ScheduleType`. (#248)
- Implement `FAdam` optimizer. (#241, #242)
- Tweak `AdaFactor` optimizer. (#236, #243)
  - Support not using the first momentum when beta1 is not given.
  - Default the dtype of the first momentum to `bfloat16`.
  - Clip the second momentum to 0.999.
- Implement `GrokFast` optimizer. (#244, #245)
- Fix the wrong typing of `reg_noise`. (#239, #240)
- Fix `Lookahead`'s `param_groups` attribute not being loaded from the checkpoint. (#237, #238)
thanks to @michaldyczko
The major version is updated! (v2.12.0 -> v3.0.0) (#164)
Many optimizers, learning rate schedulers, and objective functions are in pytorch-optimizer.
Currently, pytorch-optimizer supports 67 optimizers (+ bitsandbytes), 11 lr schedulers, and 13 loss functions, and reached about 4 ~ 50K downloads / month (peak is 75K downloads / month)!
The reason for updating the major version from v2 to v3 is that I think it's a good time to ship the recent implementations (the last update was about 7 months ago) and plan to pivot to new concepts like training utilities while maintaining the original features (e.g. optimizers).
Also, rich test cases, benchmarks, and examples are on the list!
Finally, thanks for using the pytorch-optimizer, and feel free to make any requests :)
- Implement `REX` lr scheduler. (#217, #222)
- Implement `Aida` optimizer. (#220, #221)
- Implement `WSAM` optimizer. (#213, #216)
- Implement `GaLore` optimizer. (#224, #228)
- Implement `Adalite` optimizer. (#225, #229)
- Implement `bSAM` optimizer. (#212, #233)
- Implement `Schedule-Free` optimizer. (#230, #233)
- Implement `EMCMC`. (#231, #233)
- Fix SRMM to allow operation beyond `memory_length`. (#227)
- Drop `Python 3.7` support officially. (#221)
  - Please check the README.
- Update `bitsandbytes` to `0.43.0`. (#228)
- Add the missing parameters in the `Ranger21` optimizer document. (#214, #215)
- Fix the `WSAM` optimizer paper link. (#219)
- From the previous major version: 2.0.0...3.0.0
- From the previous version: 2.12.0...3.0.0
thanks to @sdbds, @i404788
- Support `bitsandbytes` optimizer. (#211)
  - Now, you can install it with `pip3 install pytorch-optimizer[bitsandbytes]`.
  - Supports 8 bnb optimizers: `bnb_adagrad8bit`, `bnb_adam8bit`, `bnb_adamw8bit`, `bnb_lion8bit`, `bnb_lamb8bit`, `bnb_lars8bit`, `bnb_rmsprop8bit`, `bnb_sgd8bit` (see the sketch below).
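A sketch of picking one of the bnb optimizers by name, assuming `load_optimizer` returns the optimizer class (as used elsewhere in these notes); an 8-bit optimizer also needs a CUDA-capable setup:

```python
import torch

from pytorch_optimizer import load_optimizer

model = torch.nn.Linear(16, 2)

# requires: pip3 install pytorch-optimizer[bitsandbytes]
optimizer_class = load_optimizer('bnb_adamw8bit')
optimizer = optimizer_class(model.parameters(), lr=1e-3)
```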
- Introduce `mkdocs` with the `material` theme. (#204, #206)
  - Documentation: https://pytorch-optimizers.readthedocs.io/en/latest/
- Implement DAdaptLion optimizer (#203)
- Fix Lookahead optimizer (#200, #201, #202)
  - When using PyTorch Lightning, which expects your optimizer to be a subclass of `Optimizer`.
- Fix default `rectify` to `False` in `AdaBelief` optimizer (#203)
- Add `DynamicLossScaler` test case
- Highlight the code blocks
- Fix pepy badges
thanks to @georg-wolflein
- Implement Tiger optimizer (#192)
- Implement CAME optimizer (#196)
- Implement loss functions (#198)
- Tversky Loss : Tversky loss function for image segmentation using 3D fully convolutional deep networks
- Focal Tversky Loss
- Lovasz Hinge Loss : The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks
- Implement PAdam optimizer (#186)
- Implement LOMO optimizer (#188)
- Implement loss functions (#189)
- BCELoss
- BCEFocalLoss
- FocalLoss : Focal Loss for Dense Object Detection
- FocalCosineLoss : Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble
- DiceLoss : Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations
- LDAMLoss : Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss
- JaccardLoss
- BiTemperedLogisticLoss : Robust Bi-Tempered Logistic Loss Based on Bregman Divergences
- Implement Prodigy optimizer (#183)
- Fix `perturb` not being multiplied by `-step_size` in the SWATS optimizer. (#179)
- Fix `chebyshev step` having a size of `T` while the permutation is `2^T`. (#168, #181)
- Implement Amos optimizer (#174)
- Implement SignSGD optimizer (#176) (thanks to @i404788)
- Implement AdaHessian optimizer (#176) (thanks to @i404788)
- Implement SophiaH optimizer (#173, #176) (thanks to @i404788)
- Implement re-usable functions to compute the hessian in `BaseOptimizer` (#176, #177) (thanks to @i404788)
  - Two types of distribution are supported (`gaussian`, `rademacher`).
- Support `AdamD` variant for AdaHessian optimizer (#177)
- Fix weight decay in Ranger21 (#170)
- Implement AdaMax optimizer (#148)
- A variant of Adam based on the infinity norm
- Implement Gravity optimizer (#151)
- Implement AdaSmooth optimizer (#153)
- Implement SRMM optimizer (#154)
- Implement AvaGrad optimizer (#155)
- Implement AdaShift optimizer (#157)
- Upgrade to D-Adaptation v3 (#158, #159)
- Implement AdaDelta optimizer (#160)
- Fix readthedocs build issue (#156)
- Move citations into table (#156)
- Refactor validation logic (#149, #150)
- Rename the `amsbound` and `amsgrad` terms to `ams_bound` (#149)
- Return the gradient instead of the parameter in AGC. (#149)
- Refactor duplicates (e.g. rectified step size, AMSBound, AdamD, AdaNorm, weight decay) into re-usable functions (#150)
- Move `pytorch_optimizer.experimental` under `pytorch_optimizer.*.experimental`
- Implement A2Grad optimizer (#136)
- Implement Accelerated SGD optimizer (#137)
- Implement Adaptive SGD optimizer (#139)
- Implement SGDW optimizer (#139)
- Implement Yogi optimizer (#140)
- Implement SWATS optimizer (#141)
- Implement Fromage optimizer (#142)
- Implement MSVAG optimizer (#143)
- Implement AdaMod optimizer (#144)
- Implement AggMo optimizer (#145)
- Implement QHAdam, QHM optimizers (#146)
- Implement PID optimizer (#147)
- Fix `update` in Lion optimizer (#135)
- Fix `momentum_buffer` in SGDP optimizer (#139)
- Implement `AdaNorm` optimizer (#133)
- Implement `RotoGrad` optimizer (#124, #134)
- Implement `D-Adapt Adan` optimizer (#134)
- Support `AdaNorm` variant (#133, #134)
  - AdaBelief
  - AdamP
  - AdamS
  - AdaPNM
  - diffGrad
  - Lamb
  - RAdam
  - Ranger
  - Adan
- Support `AMSGrad` variant (#133, #134)
  - diffGrad
  - AdaFactor
- Support `degenerated_to_sgd` (#133)
  - Ranger
  - Lamb
- Rename `adamd_debias_term` to `adam_debias` (#133)
- Merge the rectified version with the original (#133)
  - diffRGrad + diffGrad -> diffGrad
  - RaLamb + Lamb -> Lamb
  - Now you can simply use it with `rectify=True`.
- Fix `previous_grad` deepcopy issue in Adan optimizer (#134)