Abstract base class for parameter optimizers.

Public Member Functions
virtual	~Optimizer ()=default
virtual void	addParameter (ITensor *param, ITensor *grad)=0
	Register a parameter tensor for optimization.
virtual float	getLearningRate () const =0
	Get the current learning rate.
virtual void	setLearningRate (float learning_rate)=0
	Set the learning rate for future updates.
virtual void	step ()=0
	Perform one optimization step.

Detailed Description

template<DeviceType TDeviceType, TensorDataType TPrecision>
class Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >

Abstract base class for parameter optimizers.

Optimizers update model parameters using computed gradients according to specific update rules (SGD, Adam, AdamW, etc.). The optimizer:

Maintains internal state per parameter (momentum, velocity, etc.)
Performs parameter updates via step()

Template Parameters

TDeviceType	Device where optimization occurs (DeviceType::Cpu or DeviceType::Cuda)
TPrecision	Abstract tensor precision (TensorDataType::FP32, FP16, BF16)

Typical usage pattern:

// Create optimizer
auto optimizer = std::make_shared<AdamWOptimizer<DeviceType::Cuda, TensorDataType::FP32>>(
    learning_rate, beta1, beta2, epsilon, weight_decay);
 
// Register parameters
auto params = model->getParameters();
auto grads = model->getGradients();
for (size_t i = 0; i < params.size(); ++i) {
    optimizer->addParameter(params[i], grads[i]);
}
 
// Training loop
for (size_t epoch = 0; epoch < num_epochs; ++epoch) {
    model->zeroGradients();           // Clear model-owned gradients (activation + parameter grads)
    model->forward(input, output);    // Forward pass
    model->backward(input, grad);     // Compute gradients
    optimizer->step();                // Update parameters
}

Implementation Requirements:

Derived classes must handle device-specific memory and execution
State tensors (momentum, variance) must reside on the same device as the parameters
Thread-safety is not guaranteed; synchronize externally if needed
Parameters must remain valid for optimizer lifetime

Note: Parameters and gradients are stored as weak references; the Module retains ownership and is responsible for parameter lifetime. Implementations should support asynchronous execution where possible (e.g., CUDA streams) without requiring explicit synchronization in step().
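
To make the contract above concrete, here is a minimal sketch that mirrors the four pure virtual methods using simplified stand-in types. ToyTensor, ToyOptimizer, and ToySgd are hypothetical names introduced only for illustration; they are not Mila types. The device and precision template parameters are omitted, tensors are plain host buffers, and plain SGD is used only because it needs no per-parameter state.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for ITensor: a flat host buffer of floats.
struct ToyTensor {
    std::vector<float> data;
};

// Simplified mirror of the Optimizer interface (assumption: no device or
// precision template parameters, host memory only).
class ToyOptimizer {
public:
    virtual ~ToyOptimizer() = default;
    virtual void addParameter(ToyTensor* param, ToyTensor* grad) = 0;
    virtual float getLearningRate() const = 0;
    virtual void setLearningRate(float learning_rate) = 0;
    virtual void step() = 0;
};

// Plain SGD: param -= lr * grad. Holds non-owning pointers only, so the
// caller must keep the tensors alive for the optimizer's lifetime.
class ToySgd : public ToyOptimizer {
public:
    explicit ToySgd(float learning_rate) : lr_(checked(learning_rate)) {}

    void addParameter(ToyTensor* param, ToyTensor* grad) override {
        if (param == nullptr || grad == nullptr)
            throw std::invalid_argument("param and grad must not be null");
        if (param->data.size() != grad->data.size())
            throw std::invalid_argument("param and grad shapes differ");
        pairs_.emplace_back(param, grad);
    }

    float getLearningRate() const override { return lr_; }

    void setLearningRate(float learning_rate) override { lr_ = checked(learning_rate); }

    void step() override {
        if (pairs_.empty())
            throw std::runtime_error("no parameters registered");
        for (auto& pair : pairs_) {
            ToyTensor* param = pair.first;
            const ToyTensor* grad = pair.second;
            // Tensors must not have been resized since registration.
            assert(param->data.size() == grad->data.size());
            for (std::size_t i = 0; i < param->data.size(); ++i)
                param->data[i] -= lr_ * grad->data[i];
        }
    }

private:
    // Rejects zero, negative, and NaN learning rates.
    static float checked(float learning_rate) {
        if (!(learning_rate > 0.0f))
            throw std::invalid_argument("learning_rate must be positive");
        return learning_rate;
    }

    float lr_;
    std::vector<std::pair<ToyTensor*, ToyTensor*>> pairs_;
};
```

A real implementation would additionally check device placement and, for CUDA, launch the update on a device stream rather than looping on the host.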

Constructor & Destructor Documentation

◆ ~Optimizer()

template<DeviceType TDeviceType, TensorDataType TPrecision>

virtual Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >::~Optimizer ( )

virtual default

Member Function Documentation

◆ addParameter()

template<DeviceType TDeviceType, TensorDataType TPrecision>

virtual void Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >::addParameter	(	ITensor *	param,
		ITensor *	grad )

pure virtual

Register a parameter tensor for optimization.

Adds a parameter-gradient pair to the optimizer's update list. The optimizer will allocate internal state tensors (momentum, variance, etc.) matching the parameter shape and device placement.

Parameters

param	Pointer to the parameter tensor to be optimized
grad	Pointer to the gradient tensor (must match param shape)

Exceptions

std::invalid_argument	if param or grad is nullptr
std::invalid_argument	if param and grad shapes don't match
std::invalid_argument	if param and grad are on different devices
std::runtime_error	if state allocation fails

Note:

Must be called after model->build(), when parameter shapes are known
Parameter and gradient must persist for the optimizer's lifetime
Calling multiple times with the same parameter updates the gradient reference
State tensors are initialized to zero on first registration
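
The registration rules above can be sketched as follows. This is an illustration under stated assumptions, not the Mila implementation: ToyTensor, Slot, and ToyRegistry are hypothetical names, tensors are host buffers (so the device check is omitted), and keeping the existing state on re-registration is one reading of "updates the gradient reference".

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for ITensor: a flat host buffer of floats.
struct ToyTensor {
    std::vector<float> data;
};

// One registered parameter with its per-parameter optimizer state.
struct Slot {
    ToyTensor* param;       // non-owning
    ToyTensor* grad;        // non-owning
    std::vector<float> m;   // first moment, zero-initialized
    std::vector<float> v;   // second moment, zero-initialized
};

class ToyRegistry {
public:
    void addParameter(ToyTensor* param, ToyTensor* grad) {
        if (param == nullptr || grad == nullptr)
            throw std::invalid_argument("param and grad must not be null");
        if (param->data.size() != grad->data.size())
            throw std::invalid_argument("param and grad shapes differ");

        // Re-registration: only the gradient reference changes (assumed
        // behavior: the existing moment estimates are kept).
        for (Slot& slot : slots_) {
            if (slot.param == param) {
                slot.grad = grad;
                return;
            }
        }

        // First registration: allocate state matching the parameter shape.
        const std::size_t n = param->data.size();
        slots_.push_back(Slot{param, grad,
                              std::vector<float>(n, 0.0f),
                              std::vector<float>(n, 0.0f)});
    }

    std::size_t size() const { return slots_.size(); }

    const Slot& slot(std::size_t i) const {
        assert(i < slots_.size());
        return slots_[i];
    }

private:
    std::vector<Slot> slots_;
};
```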

See also: step()

Example:

auto params = model->getParameters();
auto grads = model->getGradients();
 
for (size_t i = 0; i < params.size(); ++i) {
    optimizer->addParameter(params[i], grads[i]);
}

Implemented in Mila::Dnn::Compute::CpuAdamWOptimizer< TPrecision >, Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >, and Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >.

◆ getLearningRate()

template<DeviceType TDeviceType, TensorDataType TPrecision>

virtual float Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >::getLearningRate ( ) const

pure virtual

Get the current learning rate.

Returns the base learning rate used for parameter updates. Some optimizers may apply adaptive per-parameter learning rates internally (Adam, AdamW), but this method returns the global scaling factor.

Returns: Current learning rate as a float

Note: For adaptive optimizers, the effective learning rate per parameter may differ from this value because of momentum and variance scaling.
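
As a worked illustration of that difference, the sketch below computes the size of the very first Adam update for a single element, starting from zero state. It uses the standard Adam formulas with the usual default hyperparameters; adamFirstStepDelta is a hypothetical helper, not part of the Optimizer interface, and weight decay is left out.

```cpp
#include <cassert>
#include <cmath>

// Magnitude and sign of the first bias-corrected Adam update for one element,
// starting from m = v = 0. Hypothetical helper for illustration only.
float adamFirstStepDelta(float lr, float grad,
                         float beta1 = 0.9f, float beta2 = 0.999f,
                         float eps = 1e-8f) {
    assert(lr > 0.0f);
    const float m = (1.0f - beta1) * grad;          // first moment after one step
    const float v = (1.0f - beta2) * grad * grad;   // second moment after one step
    const float m_hat = m / (1.0f - beta1);         // bias correction for t = 1
    const float v_hat = v / (1.0f - beta2);
    return lr * m_hat / (std::sqrt(v_hat) + eps);
}
```

With lr = 0.001, a gradient of 10 and a gradient of 0.01 both produce an update of roughly 0.001, because the gradient magnitude cancels in m_hat / sqrt(v_hat). Plain SGD with the same learning rate would move those two elements by 0.01 and 0.00001 respectively, which is why getLearningRate() reports only the global scaling factor.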

See also: setLearningRate()

Implemented in Mila::Dnn::Compute::CpuAdamWOptimizer< TPrecision >, Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >, and Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >.

◆ setLearningRate()

template<DeviceType TDeviceType, TensorDataType TPrecision>

virtual void Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >::setLearningRate ( float learning_rate )

pure virtual

Set the learning rate for future updates.

Updates the base learning rate used by the optimizer. Typically used for learning rate schedules (decay, warmup, cyclic, etc.).

Parameters

learning_rate New learning rate (must be positive)

Exceptions

std::invalid_argument if learning_rate <= 0

Note:

Takes effect immediately for the next step() call
Does not affect optimizer state (momentum, variance)
For learning rate schedules, call this at epoch or iteration boundaries

See also: getLearningRate()

Example with learning rate decay:

float initial_lr = 0.001f;
optimizer->setLearningRate(initial_lr);
 
for (size_t epoch = 0; epoch < num_epochs; ++epoch) {
    // Training loop...
 
    // Decay learning rate every 10 epochs
    if (epoch > 0 && epoch % 10 == 0) {
        float new_lr = optimizer->getLearningRate() * 0.5f;
        optimizer->setLearningRate(new_lr);
        std::cout << "Learning rate: " << new_lr << std::endl;
    }
}
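
Warmup, which the description also mentions, can be handled by computing the rate in a small pure function and passing the result to setLearningRate(). The helper below is a hypothetical sketch (scheduledLearningRate and its parameters are not part of Mila): it ramps linearly to base_lr over warmup_steps iterations, then halves the rate every decay_every iterations.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical schedule helper: linear warmup followed by step decay.
// `step` is the 0-based iteration index.
float scheduledLearningRate(float base_lr, std::size_t step,
                            std::size_t warmup_steps, std::size_t decay_every) {
    assert(base_lr > 0.0f);
    assert(decay_every > 0);

    if (step < warmup_steps) {
        // Using step + 1 keeps the result strictly positive, which
        // setLearningRate() requires.
        return base_lr * static_cast<float>(step + 1)
                       / static_cast<float>(warmup_steps);
    }

    // After warmup: halve once for every completed decay interval.
    float lr = base_lr;
    for (std::size_t k = (step - warmup_steps) / decay_every; k > 0; --k) {
        lr *= 0.5f;
    }
    return lr;
}
```

Inside a training loop this would be used as optimizer->setLearningRate(scheduledLearningRate(base_lr, iteration, warmup_steps, decay_every)) before each step(). After a very large number of halvings the value underflows to zero, so a long run should clamp it to a small positive floor.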

Implemented in Mila::Dnn::Compute::CpuAdamWOptimizer< TPrecision >, Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >, and Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >.

◆ step()

template<DeviceType TDeviceType, TensorDataType TPrecision>

virtual void Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >::step ( )

pure virtual

Perform one optimization step.

Updates all registered parameters using their accumulated gradients according to the optimizer's update rule (SGD, Adam, AdamW, etc.). This method is on the hot path: it is called once per training iteration.

For algorithms with state (Adam, AdamW):

Updates first and second moment estimates
Applies bias correction if needed
Computes parameter update
Writes updated parameters back to tensors
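
The four stages above can be sketched for AdamW on host buffers as follows. This is a minimal sketch assuming the standard AdamW update with decoupled weight decay; AdamWConfig and adamwStep are hypothetical names, and the actual CPU and CUDA implementations in Mila may organize the computation differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical hyperparameter bundle using the usual AdamW notation.
struct AdamWConfig {
    float lr = 1e-3f;
    float beta1 = 0.9f;
    float beta2 = 0.999f;
    float eps = 1e-8f;
    float weight_decay = 0.01f;
};

// Applies one AdamW update in place. `t` is the 1-based step counter,
// already incremented for this call.
void adamwStep(std::vector<float>& param, const std::vector<float>& grad,
               std::vector<float>& m, std::vector<float>& v,
               int t, const AdamWConfig& cfg) {
    assert(t >= 1);
    assert(param.size() == grad.size());
    assert(m.size() == grad.size() && v.size() == grad.size());

    const float bias1 = 1.0f - std::pow(cfg.beta1, static_cast<float>(t));
    const float bias2 = 1.0f - std::pow(cfg.beta2, static_cast<float>(t));

    for (std::size_t i = 0; i < param.size(); ++i) {
        // 1. Update first and second moment estimates.
        m[i] = cfg.beta1 * m[i] + (1.0f - cfg.beta1) * grad[i];
        v[i] = cfg.beta2 * v[i] + (1.0f - cfg.beta2) * grad[i] * grad[i];

        // 2. Apply bias correction.
        const float m_hat = m[i] / bias1;
        const float v_hat = v[i] / bias2;

        // 3. Compute the update: adaptive term plus decoupled weight decay.
        const float update = m_hat / (std::sqrt(v_hat) + cfg.eps)
                           + cfg.weight_decay * param[i];

        // 4. Write the updated parameter back.
        param[i] -= cfg.lr * update;
    }
}
```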

Exceptions

std::runtime_error	if no parameters have been registered
std::runtime_error	if gradient data is invalid or null

Note:

Gradients should be computed via backward() before calling step()
For CUDA implementations, may be asynchronous (uses device stream)
Increments internal step counter for algorithms requiring it (Adam, AdamW)

See also: addParameter(), backward()

Typical sequence:

model->zeroGradients();                    // Clear previous gradients (model-managed)
model->forward(input, output);             // Forward pass
auto loss = computeLoss(output, target);   // Loss value
// loss_grad: gradient of the loss with respect to output (computation not shown)
model->backward(input, loss_grad);         // Compute gradients
optimizer->step();                         // Update parameters

Implemented in Mila::Dnn::Compute::CpuAdamWOptimizer< TPrecision >, Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >, and Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >.

The documentation for this class was generated from the following file:

/__w/Mila/Mila/Mila/Src/Dnn/Compute/Optimizers/OptimizerBase.ixx
