Optimizers#

LearningRateFinder#

class monai.optimizers.LearningRateFinder(model, optimizer, criterion, device=None, memory_cache=True, cache_dir=None, amp=False, pickle_module=pickle, pickle_protocol=2, verbose=True)[source]#

Learning rate range test.

The learning rate range test increases the learning rate in a pre-training run between two boundaries in a linear or exponential manner. It provides valuable information on how well the network can be trained over a range of learning rates and what the optimal learning rate is.

Example (fastai approach):

>>> lr_finder = LearningRateFinder(net, optimizer, criterion)
>>> lr_finder.range_test(data_loader, end_lr=100, num_iter=100)
>>> lr_finder.get_steepest_gradient()
>>> lr_finder.plot()  # to inspect the loss-learning rate graph

Example (Leslie Smith's approach):

>>> lr_finder = LearningRateFinder(net, optimizer, criterion)
>>> lr_finder.range_test(train_loader, val_loader=val_loader, end_lr=1, num_iter=100, step_mode="linear")

Gradient accumulation is supported; example:

>>> train_data = ...  # prepared dataset
>>> desired_bs, real_bs = 32, 4  # batch size
>>> accumulation_steps = desired_bs // real_bs  # required steps for accumulation
>>> data_loader = torch.utils.data.DataLoader(train_data, batch_size=real_bs, shuffle=True)
>>> acc_lr_finder = LearningRateFinder(net, optimizer, criterion)
>>> acc_lr_finder.range_test(data_loader, end_lr=10, num_iter=100, accumulation_steps=accumulation_steps)

By default, the image will be extracted from the data loader with x["image"] if the batch data is a dictionary and with x[0] otherwise (with similar behaviour for extracting the label). If your data loader returns something other than this, pass callable functions to extract them, e.g.:

>>> image_extractor = lambda x: x["input"]
>>> label_extractor = lambda x: x[100]
>>> lr_finder = LearningRateFinder(net, optimizer, criterion)
>>> lr_finder.range_test(train_loader, val_loader, image_extractor, label_extractor)

References:

Modified from: davidtvs/pytorch-lr-finder.

Cyclical Learning Rates for Training Neural Networks: https://arxiv.org/abs/1506.01186

__init__(model, optimizer, criterion, device=None, memory_cache=True, cache_dir=None, amp=False, pickle_module=pickle, pickle_protocol=2, verbose=True)[source]#

Constructor.

Parameters:
  • model – wrapped model.

  • optimizer – wrapped optimizer.

  • criterion – wrapped loss function.

  • device – device on which to run the test. Can be a string ("cpu" or "cuda") with an optional ordinal for the device type (e.g. "cuda:X", where X is the ordinal), or an object representing the device on which the computation will take place. Default: None, uses the same device as model.

  • memory_cache – if this flag is set to True, state_dict of model and optimizer will be cached in memory. Otherwise, they will be saved to files under the cache_dir.

  • cache_dir – path for storing temporary files. If no path is specified, the system-wide temporary directory is used. Note that this parameter is ignored if memory_cache is True.

  • amp – whether to use Automatic Mixed Precision.

  • pickle_module – module used for pickling metadata and objects, defaults to pickle. This argument is passed to torch.save; for more details, see: https://pytorch.org/docs/stable/generated/torch.save.html#torch.save.

  • pickle_protocol – can be specified to override the default protocol, defaults to 2. This argument is passed to torch.save; for more details, see: https://pytorch.org/docs/stable/generated/torch.save.html#torch.save.

  • verbose – whether to print verbose output.

Returns:

None
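For instance, a finder can be constructed from typical MONAI components; the following is a minimal sketch in which the network, loss function, and optimizer are illustrative choices rather than requirements:

import torch
from monai.losses import DiceLoss
from monai.networks.nets import UNet
from monai.optimizers import LearningRateFinder

# illustrative model/loss/optimizer; any torch-compatible triplet works
model = UNet(spatial_dims=3, in_channels=1, out_channels=2, channels=(16, 32, 64), strides=(2, 2))
criterion = DiceLoss(to_onehot_y=True, softmax=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)  # range_test's start_lr defaults to this value

lr_finder = LearningRateFinder(model, optimizer, criterion, device="cuda" if torch.cuda.is_available() else "cpu")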

get_lrs_and_losses(skip_start=0, skip_end=0)[source]#

Get learning rates and their corresponding losses

Parameters:
  • skip_start (int) – number of batches to trim from the start.

  • skip_end (int) – number of batches to trim from the end.

Return type:

tuple[list, list]
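A short sketch of consuming the raw values once a range test has completed (assuming lr_finder has already run range_test):

# trim a few noisy batches from each end before inspecting the curve
lrs, losses = lr_finder.get_lrs_and_losses(skip_start=5, skip_end=5)
best_idx = min(range(len(losses)), key=losses.__getitem__)  # index of the lowest recorded loss
print(f"lowest loss {losses[best_idx]:.4f} at lr {lrs[best_idx]:.2e}")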

get_steepest_gradient(skip_start=0, skip_end=0)[source]#

Get the learning rate with the steepest gradient and its corresponding loss.

Parameters:
  • skip_start – number of batches to trim from the start.

  • skip_end – number of batches to trim from the end.

Returns:

The learning rate with the steepest gradient and its corresponding loss.
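A common heuristic is to adopt the steepest-gradient learning rate for subsequent training. A sketch, assuming the return value unpacks as a (learning rate, loss) pair and optimizer is the optimizer passed to the finder:

steepest_lr, loss_at_steepest = lr_finder.get_steepest_gradient(skip_start=5, skip_end=5)
if steepest_lr is not None:
    for group in optimizer.param_groups:
        group["lr"] = steepest_lr  # adopt the suggested learning rate for training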

plot(skip_start=0, skip_end=0, log_lr=True, ax=None, steepest_lr=True)[source]#

Plots the learning rate range test.

Parameters:
  • skip_start – number of batches to trim from the start.

  • skip_end – number of batches to trim from the end.

  • log_lr – True to plot the learning rate in a logarithmic scale; otherwise, plotted in a linear scale.

  • ax – the plot is created in the specified matplotlib axes object and the figure is not shown. If None, the figure and axes object are created in this method and the figure is shown.

  • steepest_lr – plot the learning rate which had the steepest gradient.

Returns:

The matplotlib.axes.Axes object that contains the plot. Returns None if matplotlib is not installed.
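For example, the curve can be drawn into an existing figure by passing a matplotlib axes object; a sketch, assuming matplotlib is installed and a range test has already run:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
# draw into the provided axes; the figure is not shown automatically in this case
lr_finder.plot(skip_start=5, skip_end=5, log_lr=True, ax=ax, steepest_lr=True)
ax.set_title("LR range test")
fig.savefig("lr_range_test.png")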

range_test(train_loader, val_loader=None, image_extractor=<function default_image_extractor>, label_extractor=<function default_label_extractor>, start_lr=None, end_lr=10.0, num_iter=100, step_mode='exp', smooth_f=0.05, diverge_th=5, accumulation_steps=1, non_blocking_transfer=True, auto_reset=True)[source]#

Performs the learning rate range test.

Parameters:
  • train_loader – training set data loader.

  • val_loader – validation data loader (if desired).

  • image_extractor – callable function to get the image from a batch of data. Default: x["image"] if isinstance(x, dict) else x[0].

  • label_extractor – callable function to get the label from a batch of data. Default: x["label"] if isinstance(x, dict) else x[1].

  • start_lr – the starting learning rate for the range test. The default is the optimizer’s learning rate.

  • end_lr – the maximum learning rate to test. The test may stop earlier than this if the result starts diverging.

  • num_iter – the maximum number of iterations for the test.

  • step_mode – schedule for increasing the learning rate: "linear" or "exp".

  • smooth_f – the loss smoothing factor, within the [0, 1) interval. Disabled if set to 0; otherwise the loss is smoothed using exponential smoothing.

  • diverge_th – the test is stopped when the loss surpasses the threshold diverge_th * best_loss.

  • accumulation_steps – steps for gradient accumulation. If set to 1, gradients are not accumulated.

  • non_blocking_transfer – when True, moves data to device asynchronously if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

  • auto_reset – if True, restores the model and optimizer to their original states at the end of the test.

Returns:

None
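Putting the pieces together, a minimal end-to-end range test might look like the following sketch; it assumes the lr_finder constructed in the earlier sketch, uses a stand-in random dataset, and relies on the default dictionary-based extractors:

import torch
from torch.utils.data import DataLoader, Dataset

class RandomDictDataset(Dataset):
    """Stand-in dataset yielding {"image": ..., "label": ...} items, for illustration only."""
    def __len__(self):
        return 400

    def __getitem__(self, index):
        return {
            "image": torch.rand(1, 32, 32, 32),
            "label": torch.randint(0, 2, (1, 32, 32, 32)).float(),
        }

train_loader = DataLoader(RandomDictDataset(), batch_size=4, shuffle=True)
lr_finder.range_test(train_loader, end_lr=10, num_iter=100, step_mode="exp")
suggested_lr, _ = lr_finder.get_steepest_gradient()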

reset()[source]#

Restores the model and optimizer to their initial states.

Return type:

None
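If range_test was run with auto_reset=False, the finder can be restored manually once the results have been read; a sketch, reusing the train_loader assumed above:

lr_finder.range_test(train_loader, end_lr=10, num_iter=100, auto_reset=False)
lrs, losses = lr_finder.get_lrs_and_losses()
lr_finder.reset()  # put the model and optimizer back into their pre-test states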

Novograd#

class monai.optimizers.Novograd(params, lr=0.001, betas=(0.9, 0.98), eps=1e-08, weight_decay=0, grad_averaging=False, amsgrad=False)[source]#

Novograd optimizer, based on Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks. The code is adapted from the implementations in Jasper for PyTorch and OpenSeq2Seq.

Parameters:
  • params (Iterable) – iterable of parameters to optimize or dicts defining parameter groups.

  • lr (float) – learning rate. Defaults to 1e-3.

  • betas (tuple[float, float]) – coefficients used for computing running averages of gradient and its square. Defaults to (0.9, 0.98).

  • eps (float) – term added to the denominator to improve numerical stability. Defaults to 1e-8.

  • weight_decay (float) – weight decay (L2 penalty). Defaults to 0.

  • grad_averaging (bool) – gradient averaging. Defaults to False.

  • amsgrad (bool) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. Defaults to False.

step(closure=None)[source]#

Performs a single optimization step.

Parameters:

closure – A closure that reevaluates the model and returns the loss. Defaults to None.
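A minimal training-step sketch with Novograd; the model, loss, and batch below are placeholders:

import torch
from monai.optimizers import Novograd

model = torch.nn.Linear(10, 1)          # placeholder model
criterion = torch.nn.MSELoss()          # placeholder loss
optimizer = Novograd(model.parameters(), lr=1e-3, weight_decay=1e-4)

inputs, targets = torch.rand(8, 10), torch.rand(8, 1)  # placeholder batch
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()  # closure is optional and omitted here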

Generate parameter groups#

monai.optimizers.generate_param_groups(network, layer_matches, match_types, lr_values, include_others=True)[source]#

Utility function to generate parameter groups with different LR values for an optimizer. The output parameter groups have the same order as the layer_matches functions.

Parameters:
  • network (Module) – source network to generate parameter groups from.

  • layer_matches (Sequence[Callable]) – a list of callable functions to select or filter out network layer groups. For the "select" type, the input is the network; for the "filter" type, the input is every item of network.named_parameters(). For "select", the parameters will be select_func(network).parameters(); for "filter", the parameters will be (x[1] for x in filter(f, network.named_parameters())).

  • match_types (Sequence[str]) – a list of tags to identify the matching type corresponding to the layer_matches functions, can be “select” or “filter”.

  • lr_values (Sequence[float]) – a list of LR values corresponding to the layer_matches functions.

  • include_others (bool) – whether to include the remaining layers as the last group, defaults to True.

It’s mainly used to set different LR values for different network elements, for example:

net = Unet(spatial_dims=3, in_channels=1, out_channels=3, channels=[2, 2, 2], strides=[1, 1, 1])
print(net)  # print out network components to select expected items
print(net.named_parameters())  # print out all the named parameters to filter out expected items
params = generate_param_groups(
    network=net,
    layer_matches=[lambda x: x.model[0], lambda x: "2.0.conv" in x[0]],
    match_types=["select", "filter"],
    lr_values=[1e-2, 1e-3],
)
# the groups will be a list of dictionaries:
# [{'params': <generator object Module.parameters at 0x7f9090a70bf8>, 'lr': 0.01},
#  {'params': <filter object at 0x7f9088fd0dd8>, 'lr': 0.001},
#  {'params': <filter object at 0x7f9088fd0da0>}]
optimizer = torch.optim.Adam(params, 1e-4)
Return type:

list[dict]

ExponentialLR#

class monai.optimizers.ExponentialLR(optimizer, end_lr, num_iter, last_epoch=-1)[source]#

Exponentially increases the learning rate between two boundaries over a number of iterations.

LinearLR#

class monai.optimizers.LinearLR(optimizer, end_lr, num_iter, last_epoch=-1)[source]#

Linearly increases the learning rate between two boundaries over a number of iterations.
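These two schedulers are used internally by LearningRateFinder, but they can also be stepped per iteration like any torch scheduler. A sketch with LinearLR (the same pattern applies to ExponentialLR); the model and loop body are placeholders:

import torch
from monai.optimizers import LinearLR

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
scheduler = LinearLR(optimizer, end_lr=1.0, num_iter=100)

for _ in range(100):
    # ... forward/backward/optimizer.step() would go here ...
    scheduler.step()  # learning rate rises linearly from 1e-4 towards end_lr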

WarmupCosineSchedule#

class monai.optimizers.WarmupCosineSchedule(optimizer, warmup_steps, t_total, end_lr=0.0, cycles=0.5, last_epoch=-1, warmup_multiplier=0)[source]#

Linear warmup and then cosine decay. Based on https://huggingface.co/ implementation.

__init__(optimizer, warmup_steps, t_total, end_lr=0.0, cycles=0.5, last_epoch=-1, warmup_multiplier=0)[source]#
Parameters:
  • optimizer (Optimizer) – wrapped optimizer.

  • warmup_steps (int) – number of warmup iterations.

  • t_total (int) – total number of training iterations.

  • end_lr (float) – the final learning rate. Defaults to 0.0.

  • cycles (float) – cosine cycles parameter.

  • last_epoch (int) – the index of last epoch.

  • warmup_multiplier (float) – if provided, starts the linear warmup from this fraction of the initial lr. Must be in the [0, 1] interval. Defaults to 0.

Returns:

None
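A sketch of per-iteration stepping over a training run; the model and the warmup_steps/t_total values are illustrative:

import torch
from monai.optimizers import WarmupCosineSchedule

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = WarmupCosineSchedule(optimizer, warmup_steps=500, t_total=10000)

for step in range(10000):
    # ... forward/backward/optimizer.step() would go here ...
    scheduler.step()  # linear warmup for the first 500 steps, cosine decay afterwards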