Novograd(params, lr=0.001, betas=(0.9, 0.98), eps=1e-08, weight_decay=0, grad_averaging=False, amsgrad=False)
Implements the Novograd optimizer, based on Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks. The code is adapted from the implementations in Jasper for PyTorch and OpenSeq2Seq.
Parameters:
    params (Iterable) – iterable of parameters to optimize or dicts defining parameter groups.
    lr (float) – learning rate. Defaults to 1e-3.
    betas (Tuple[float, float]) – coefficients used for computing running averages of the gradient and its square. Defaults to (0.9, 0.98).
    eps (float) – term added to the denominator to improve numerical stability. Defaults to 1e-8.
    weight_decay (float) – weight decay (L2 penalty). Defaults to 0.
    grad_averaging (bool) – whether to apply gradient averaging. Defaults to False.
    amsgrad (bool) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. Defaults to False.
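To make the roles of these parameters concrete, the following is a minimal NumPy sketch of a Novograd-style update for a single layer: the second moment is a scalar tracking the layer-wise squared gradient norm (rather than an elementwise statistic as in Adam), the gradient is normalized by it, weight decay is added after normalization, and a first-moment (momentum) buffer accumulates the result. This is an illustrative sketch of the update rule, not the library's implementation; the function name and `state` dict are assumptions.

```python
import numpy as np

def novograd_step(p, grad, state, lr=0.001, betas=(0.9, 0.98), eps=1e-8,
                  weight_decay=0.0, grad_averaging=False, amsgrad=False):
    """Illustrative single-layer Novograd-style update (not the library code)."""
    beta1, beta2 = betas
    g_sq = float(np.sum(grad * grad))        # layer-wise squared gradient norm

    if state.get("v") is None:               # first step: initialize second moment
        state["v"] = g_sq
    else:
        state["v"] = beta2 * state["v"] + (1 - beta2) * g_sq

    v = state["v"]
    if amsgrad:                              # AMSGrad: use the running maximum of v
        state["v_max"] = max(state.get("v_max", 0.0), v)
        v = state["v_max"]

    g = grad / (np.sqrt(v) + eps)            # normalize by the layer-wise norm
    if weight_decay != 0:
        g = g + weight_decay * p             # L2 term added after normalization
    if grad_averaging:
        g = g * (1 - beta1)                  # scale the increment when averaging

    m = beta1 * state.get("m", np.zeros_like(p)) + g   # momentum update
    state["m"] = m
    return p - lr * m

# Toy usage: one weight vector, one gradient, one step.
p = np.array([1.0, -2.0, 3.0])
state = {}
p = novograd_step(p, grad=np.array([0.1, -0.2, 0.3]), state=state)
```

Note that because the second moment is a single scalar per layer, Novograd's optimizer state is much smaller than Adam's, which keeps elementwise first and second moments.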
step(closure=None)
Performs a single optimization step.

Parameters:
    closure (Optional[Callable]) – A closure that reevaluates the model and returns the loss. Defaults to None.
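The closure argument follows the standard PyTorch optimizer contract: if provided, step() calls it to re-evaluate the model and returns the resulting loss. A minimal stand-in sketching that contract (the class here is a hypothetical stub, not the real optimizer):

```python
class _OptimizerSketch:
    """Stub showing only the step(closure) calling convention."""

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()  # re-evaluate the model and obtain the loss
        # ... the parameter update would happen here ...
        return loss

def closure():
    # In real training code this would zero gradients, run the forward
    # pass, and call loss.backward() before returning the loss tensor.
    return 0.5  # stand-in loss value

opt = _OptimizerSketch()
loss = opt.step(closure)      # returns whatever the closure returned
```

In a typical training loop the closure is optional; calling `opt.step()` with no arguments simply applies the update using gradients already populated by `backward()`.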