Mila 0.13.48
Deep Neural Network Library
Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision > Class Template Reference (abstract, export)

Abstract base class for parameter optimizers.

Public Member Functions

virtual ~Optimizer ()=default
virtual void addParameter (ITensor *param, ITensor *grad)=0
 Register a parameter tensor for optimization.
virtual float getLearningRate () const =0
 Get the current learning rate.
virtual void setLearningRate (float learning_rate)=0
 Set the learning rate for future updates.
virtual void step ()=0
 Perform one optimization step.

Detailed Description

template<DeviceType TDeviceType, TensorDataType TPrecision>
class Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >

Abstract base class for parameter optimizers.

Optimizers update model parameters using computed gradients according to specific update rules (SGD, Adam, AdamW, etc.). The optimizer:

  • Maintains internal state per parameter (momentum, velocity, etc.)
  • Performs parameter updates via step()

Template Parameters
TDeviceType    Device where optimization occurs (DeviceType::Cpu or DeviceType::Cuda)
TPrecision     Abstract tensor precision (TensorDataType::FP32, FP16, BF16)

Typical usage pattern:

// Create optimizer
auto optimizer = std::make_shared<AdamWOptimizer<DeviceType::Cuda, TensorDataType::FP32>>(
    learning_rate, beta1, beta2, epsilon, weight_decay);

// Register parameters
auto params = model->getParameters();
auto grads = model->getGradients();
for (size_t i = 0; i < params.size(); ++i) {
    optimizer->addParameter(params[i], grads[i]);
}

// Training loop
for (size_t epoch = 0; epoch < num_epochs; ++epoch) {
    model->zeroGradients();         // Clear model-owned gradients (activation + parameter grads)
    model->forward(input, output);  // Forward pass
    model->backward(input, grad);   // Compute gradients
    optimizer->step();              // Update parameters
}

Implementation Requirements:

  • Derived classes must handle device-specific memory and execution
  • State tensors (momentum, variance) must reside on the same device as the parameters they track
  • Thread safety is not guaranteed; synchronize externally if needed
  • Parameters must remain valid for the optimizer's lifetime
Note
Parameters and gradients are held as non-owning references (raw ITensor pointers); the Module retains ownership and is responsible for parameter lifetime.
Implementations should support asynchronous execution where possible (e.g., CUDA streams) without requiring explicit synchronization in step().

Constructor & Destructor Documentation

◆ ~Optimizer()

template<DeviceType TDeviceType, TensorDataType TPrecision>
virtual Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >::~Optimizer ( )
virtual, default

Member Function Documentation

◆ addParameter()

template<DeviceType TDeviceType, TensorDataType TPrecision>
virtual void Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >::addParameter ( ITensor * param,
ITensor * grad )
pure virtual

Register a parameter tensor for optimization.

Adds a parameter-gradient pair to the optimizer's update list. The optimizer will allocate internal state tensors (momentum, variance, etc.) matching the parameter shape and device placement.

Parameters
param    Non-owning pointer to the parameter tensor to be optimized
grad     Non-owning pointer to the gradient tensor (must match the shape of param)
Exceptions
std::invalid_argument    if param or grad is nullptr
std::invalid_argument    if param and grad shapes don't match
std::invalid_argument    if param and grad are on different devices
std::runtime_error       if state allocation fails
Note
Must be called after model->build(), when parameter shapes are known
Parameter and gradient must persist for the optimizer's lifetime
Calling multiple times with the same parameter updates the gradient reference
State tensors are initialized to zero on first registration
See also
step()

Example:

auto params = model->getParameters();
auto grads = model->getGradients();
for (size_t i = 0; i < params.size(); ++i) {
    optimizer->addParameter(params[i], grads[i]);
}
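
The registration rules documented for addParameter() (argument validation, zero-initialized state, and rebinding the gradient on repeated registration) could be implemented along the following lines. This is only a sketch under assumptions: RegTensor, RegDevice, RegEntry, and ParameterRegistry are hypothetical names, and Mila's actual ITensor interface and internal bookkeeping may differ.

```cpp
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins; not Mila's real tensor or device types.
enum class RegDevice { Cpu, Cuda };

struct RegTensor {
    std::vector<std::size_t> shape;
    RegDevice device;
};

struct RegEntry {
    RegTensor* grad;             // non-owning, replaced on re-registration
    std::vector<float> moment1;  // zero-initialized state sized to the parameter
    std::vector<float> moment2;
};

class ParameterRegistry {
public:
    void addParameter(RegTensor* param, RegTensor* grad) {
        if (param == nullptr || grad == nullptr)
            throw std::invalid_argument("param and grad must be non-null");
        if (param->shape != grad->shape)
            throw std::invalid_argument("param and grad shapes must match");
        if (param->device != grad->device)
            throw std::invalid_argument("param and grad must share a device");

        auto it = entries_.find(param);
        if (it != entries_.end()) {
            it->second.grad = grad;  // known parameter: rebind the gradient, keep state
            return;
        }
        std::size_t count = 1;
        for (std::size_t dim : param->shape) count *= dim;
        entries_.emplace(param, RegEntry{grad, std::vector<float>(count, 0.0f),
                                         std::vector<float>(count, 0.0f)});
    }

    std::size_t size() const { return entries_.size(); }
    const RegEntry& entry(RegTensor* param) const { return entries_.at(param); }

private:
    std::unordered_map<RegTensor*, RegEntry> entries_;  // keyed by parameter address
};
```

Keying the map by parameter address is one way to make a second call with the same parameter update the gradient reference without reallocating or resetting the moment buffers.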

Implemented in Mila::Dnn::Compute::CpuAdamWOptimizer< TPrecision >, Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >, and Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >.

◆ getLearningRate()

template<DeviceType TDeviceType, TensorDataType TPrecision>
virtual float Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >::getLearningRate ( ) const
pure virtual

Get the current learning rate.

Returns the base learning rate used for parameter updates. Some optimizers may apply adaptive per-parameter learning rates internally (Adam, AdamW), but this method returns the global scaling factor.

Returns
Current learning rate as a float
Note
For adaptive optimizers, the effective learning rate per parameter may differ from this value because of momentum and variance scaling
See also
setLearningRate()

Implemented in Mila::Dnn::Compute::CpuAdamWOptimizer< TPrecision >, Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >, and Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >.

◆ setLearningRate()

template<DeviceType TDeviceType, TensorDataType TPrecision>
virtual void Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >::setLearningRate ( float learning_rate)
pure virtual

Set the learning rate for future updates.

Updates the base learning rate used by the optimizer. Typically used for learning rate schedules (decay, warmup, cyclic, etc.).

Parameters
learning_rate    New learning rate (must be positive)
Exceptions
std::invalid_argument    if learning_rate <= 0
Note
Takes effect immediately for the next step() call
Does not affect optimizer state (momentum, variance)
For learning rate schedules, call this at epoch or iteration boundaries
See also
getLearningRate()

Example with learning rate decay:

float initial_lr = 0.001f;
optimizer->setLearningRate(initial_lr);
for (size_t epoch = 0; epoch < num_epochs; ++epoch) {
    // Training loop...
    // Decay learning rate every 10 epochs
    if (epoch > 0 && epoch % 10 == 0) {
        float new_lr = optimizer->getLearningRate() * 0.5f;
        optimizer->setLearningRate(new_lr);
        std::cout << "Learning rate: " << new_lr << std::endl;
    }
}

Implemented in Mila::Dnn::Compute::CpuAdamWOptimizer< TPrecision >, Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >, and Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >.

◆ step()

template<DeviceType TDeviceType, TensorDataType TPrecision>
virtual void Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >::step ( )
pure virtual

Perform one optimization step.

Updates all registered parameters using their accumulated gradients according to the optimizer's update rule (SGD, Adam, AdamW, etc.). This method is on the hot path: it is called once per training iteration, so implementations should avoid per-call allocation and unnecessary synchronization.

For algorithms with state (Adam, AdamW):

  • Updates first and second moment estimates
  • Applies bias correction if needed
  • Computes parameter update
  • Writes updated parameters back to tensors
Exceptions
std::runtime_error    if no parameters have been registered
std::runtime_error    if gradient data is invalid or null
Note
Gradients should be computed via backward() before calling step()
CUDA implementations may run asynchronously on the device stream
Increments the internal step counter for algorithms that require it (Adam, AdamW)
See also
addParameter()
backward()

Typical sequence:

model->zeroGradients();              // Clear previous gradients (model-managed)
model->forward(input, output);       // Forward pass
auto loss = computeLoss(output, target);
model->backward(input, loss_grad);   // Compute gradients (loss_grad: gradient of the loss w.r.t. output)
optimizer->step();                   // Update parameters
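
The four stages listed for stateful algorithms (moment updates, bias correction, update computation, write-back) can be sketched for AdamW over a flat float buffer as follows. This follows the commonly published AdamW update rule with decoupled weight decay. It is not taken from Mila's CpuAdamWOptimizer or CudaAdamWOptimizer, and the names AdamWConfig, AdamWState, and adamwStep, as well as the default hyperparameter values, are assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hyperparameters; field names follow the usage example, defaults are illustrative.
struct AdamWConfig {
    float learning_rate = 1e-3f;
    float beta1 = 0.9f;
    float beta2 = 0.999f;
    float epsilon = 1e-8f;
    float weight_decay = 0.01f;
};

// Per-parameter state, zero-initialized as on first registration.
struct AdamWState {
    std::vector<float> m;  // first moment estimate
    std::vector<float> v;  // second moment estimate
    explicit AdamWState(std::size_t n) : m(n, 0.0f), v(n, 0.0f) {}
};

// One AdamW update. `t` is the 1-based step counter incremented by each step().
void adamwStep(std::vector<float>& param, const std::vector<float>& grad,
               AdamWState& state, const AdamWConfig& cfg, std::size_t t) {
    const float bias1 = 1.0f - std::pow(cfg.beta1, static_cast<float>(t));
    const float bias2 = 1.0f - std::pow(cfg.beta2, static_cast<float>(t));
    for (std::size_t i = 0; i < param.size(); ++i) {
        // 1. Update first and second moment estimates.
        state.m[i] = cfg.beta1 * state.m[i] + (1.0f - cfg.beta1) * grad[i];
        state.v[i] = cfg.beta2 * state.v[i] + (1.0f - cfg.beta2) * grad[i] * grad[i];
        // 2. Bias-correct the estimates, which start at zero.
        const float m_hat = state.m[i] / bias1;
        const float v_hat = state.v[i] / bias2;
        // 3 and 4. Compute the update and write it back. Weight decay acts on the
        // parameter directly rather than being folded into the gradient.
        param[i] -= cfg.learning_rate *
                    (m_hat / (std::sqrt(v_hat) + cfg.epsilon) + cfg.weight_decay * param[i]);
    }
}
```

On the first step with weight decay disabled, the update magnitude is approximately learning_rate regardless of the gradient's scale, which is why the effective per-parameter rate of an adaptive optimizer can differ from the value returned by getLearningRate().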

Implemented in Mila::Dnn::Compute::CpuAdamWOptimizer< TPrecision >, Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >, and Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >.


The documentation for this class was generated from the following file: