Device-agnostic AdamW optimizer.

Public Types
using	ExecutionContextType = ExecutionContext<TDeviceType>
using	OptimizerType = CpuAdamWOptimizer<TPrecision>

Public Member Functions
	AdamWOptimizer (IExecutionContext *exec_context, const AdamWConfig &config)
	Construct AdamW optimizer from fluent AdamWConfig.
	~AdamWOptimizer () override=default
void	addParameter (ITensor *param, ITensor *grad) override
	Register a parameter tensor for optimization.
float	getBeta1 () const noexcept
float	getBeta2 () const noexcept
float	getEpsilon () const noexcept
float	getLearningRate () const override
	Get the current learning rate.
size_t	getParameterCount () const noexcept
size_t	getStepCount () const noexcept
float	getWeightDecay () const noexcept
void	setLearningRate (float learning_rate) override
	Set the learning rate for future updates.
void	setWeightDecay (float weight_decay)
void	step () override
	Perform one optimization step.
Public Member Functions inherited from Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >
virtual	~Optimizer ()=default

Private Attributes
AdamWConfig	config_
IExecutionContext *	context_
std::shared_ptr< OptimizerType >	impl_

Detailed Description

template<DeviceType TDeviceType, TensorDataType TPrecision>
requires PrecisionSupportedOnDevice<TPrecision, TDeviceType>
class Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >

Device-agnostic AdamW optimizer.

Dispatches to the appropriate device-specific implementation (CPU or CUDA) based on the TDeviceType template parameter. Uses AdamWConfig for fluent configuration of hyperparameters.

Template Parameters

TDeviceType	Device type (DeviceType::Cpu or DeviceType::Cuda)
TPrecision	Tensor precision (TensorDataType::FP32, FP16, BF16)
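
The compile-time dispatch described above can be pictured with a minimal sketch. It makes no claim about Mila's internals: `DeviceKind`, `CpuBackendSketch`, `CudaBackendSketch`, `BackendFor`, and `FacadeSketch` are hypothetical stand-ins, and the real class may select its backend differently.

```cpp
#include <string_view>
#include <type_traits>

// Hypothetical stand-in for a device tag; not Mila's DeviceType.
enum class DeviceKind { Cpu, Cuda };

// Hypothetical backends; each reports which device it targets.
struct CpuBackendSketch {
    static constexpr std::string_view name() noexcept { return "cpu"; }
};
struct CudaBackendSketch {
    static constexpr std::string_view name() noexcept { return "cuda"; }
};

// Pick the backend type at compile time from the device tag.
template <DeviceKind TDevice>
using BackendFor = std::conditional_t<TDevice == DeviceKind::Cpu,
                                      CpuBackendSketch,
                                      CudaBackendSketch>;

// Thin facade: the public type stays the same, the backend varies by tag.
template <DeviceKind TDevice>
struct FacadeSketch {
    using OptimizerType = BackendFor<TDevice>;
    static constexpr std::string_view backendName() noexcept {
        return OptimizerType::name();
    }
};

// The selection is resolved entirely at compile time.
static_assert(std::is_same_v<FacadeSketch<DeviceKind::Cpu>::OptimizerType, CpuBackendSketch>);
static_assert(std::is_same_v<FacadeSketch<DeviceKind::Cuda>::OptimizerType, CudaBackendSketch>);
```

Because the backend is chosen by a type alias, no virtual call or runtime branch is needed to reach the device-specific code in this sketch.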

Member Typedef Documentation

◆ ExecutionContextType

template<DeviceType TDeviceType, TensorDataType TPrecision>

using Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::ExecutionContextType = ExecutionContext<TDeviceType>

◆ OptimizerType

template<DeviceType TDeviceType, TensorDataType TPrecision>

using Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::OptimizerType = CpuAdamWOptimizer<TPrecision>

Constructor & Destructor Documentation

◆ AdamWOptimizer()

template<DeviceType TDeviceType, TensorDataType TPrecision>

Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::AdamWOptimizer	(	IExecutionContext *	exec_context,
		const AdamWConfig &	config )

inline explicit

Construct AdamW optimizer from fluent AdamWConfig.

Parameters

exec_context	Execution context for device resources
config	Fluent AdamWConfig describing hyperparameters

Exceptions

std::invalid_argument	if exec_context is null
std::invalid_argument	if config.validate() fails
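
As a rough illustration of a fluent configuration object with a `validate()` step, the sketch below chains setters and rejects out-of-range values. Everything in it is an assumption made for illustration: `AdamWConfigSketch`, the `with...` setter names, the accepted ranges, and the default values (commonly used AdamW defaults) are not taken from Mila's `AdamWConfig`.

```cpp
#include <stdexcept>

// Hypothetical fluent config; setter names, ranges, and defaults are assumptions.
class AdamWConfigSketch {
public:
    // Each setter returns *this so calls can be chained.
    AdamWConfigSketch& withLearningRate(float v) noexcept { lr_ = v; return *this; }
    AdamWConfigSketch& withBeta1(float v) noexcept { beta1_ = v; return *this; }
    AdamWConfigSketch& withBeta2(float v) noexcept { beta2_ = v; return *this; }
    AdamWConfigSketch& withEpsilon(float v) noexcept { eps_ = v; return *this; }
    AdamWConfigSketch& withWeightDecay(float v) noexcept { weight_decay_ = v; return *this; }

    // Throws std::invalid_argument on the first out-of-range value.
    void validate() const {
        if (!(lr_ > 0.0f)) throw std::invalid_argument("learning rate must be positive");
        if (!(beta1_ >= 0.0f && beta1_ < 1.0f)) throw std::invalid_argument("beta1 must be in [0, 1)");
        if (!(beta2_ >= 0.0f && beta2_ < 1.0f)) throw std::invalid_argument("beta2 must be in [0, 1)");
        if (!(eps_ > 0.0f)) throw std::invalid_argument("epsilon must be positive");
        if (!(weight_decay_ >= 0.0f)) throw std::invalid_argument("weight decay must be non-negative");
    }

private:
    float lr_ = 1e-3f;
    float beta1_ = 0.9f;
    float beta2_ = 0.999f;
    float eps_ = 1e-8f;
    float weight_decay_ = 0.01f;
};

// Convenience wrapper: true if validate() does not throw.
inline bool isValidSketch(const AdamWConfigSketch& cfg) noexcept {
    try {
        cfg.validate();
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

A constructor that calls `validate()` before building its backend would surface a bad hyperparameter as `std::invalid_argument` at construction time rather than during training.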

◆ ~AdamWOptimizer()

template<DeviceType TDeviceType, TensorDataType TPrecision>

Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::~AdamWOptimizer ( )

override default

Member Function Documentation

◆ addParameter()

template<DeviceType TDeviceType, TensorDataType TPrecision>

void Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::addParameter	(	ITensor *	param,
		ITensor *	grad )

inline override virtual

Register a parameter tensor for optimization.

Adds a parameter-gradient pair to the optimizer's update list. The optimizer will allocate internal state tensors (momentum, variance, etc.) matching the parameter shape and device placement.

Parameters

param	Pointer to the parameter tensor to be optimized
grad	Pointer to the gradient tensor (must match param shape)

Exceptions

std::invalid_argument	if param or grad is nullptr
std::invalid_argument	if param and grad shapes don't match
std::invalid_argument	if param and grad are on different devices
std::runtime_error	if state allocation fails

Note:

Must be called after model->build(), when parameter shapes are known.
The parameter and gradient must persist for the optimizer's lifetime.
Calling multiple times with the same parameter updates the gradient reference.
State tensors are initialized to zero on first registration.

See also: step()

Example:

auto params = model->getParameters();
auto grads = model->getGradients();
 
for (size_t i = 0; i < params.size(); ++i) {
    optimizer->addParameter(params[i], grads[i]);
}
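
The registration rules documented for addParameter() (null checks, shape and device matching, and re-registration replacing the gradient reference) can be sketched as follows. `TensorSketch` and `ParamRegistrySketch` are hypothetical types invented for this illustration; they do not mirror ITensor or the optimizer's actual bookkeeping, and state allocation is omitted.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical tensor descriptor: only shape and device are modeled.
struct TensorSketch {
    std::vector<std::size_t> shape;
    int device_id = 0;
};

// Hypothetical registry of (parameter, gradient) pairs.
class ParamRegistrySketch {
public:
    void add(TensorSketch* param, TensorSketch* grad) {
        if (param == nullptr || grad == nullptr)
            throw std::invalid_argument("param and grad must be non-null");
        if (param->shape != grad->shape)
            throw std::invalid_argument("param and grad shapes must match");
        if (param->device_id != grad->device_id)
            throw std::invalid_argument("param and grad must be on the same device");

        // Re-registering a known parameter only swaps its gradient reference.
        for (Entry& e : entries_) {
            if (e.param == param) {
                e.grad = grad;
                return;
            }
        }
        entries_.push_back({param, grad});
    }

    std::size_t count() const noexcept { return entries_.size(); }

    // Returns the registered gradient for param, or nullptr if unknown.
    const TensorSketch* gradFor(const TensorSketch* param) const noexcept {
        for (const Entry& e : entries_) {
            if (e.param == param) return e.grad;
        }
        return nullptr;
    }

private:
    struct Entry {
        TensorSketch* param;
        TensorSketch* grad;
    };
    std::vector<Entry> entries_;
};
```

The registry stores raw pointers without owning them, which is why the tensors must outlive it, matching the lifetime requirement stated in the note.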

Implements Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >.

◆ getBeta1()

template<DeviceType TDeviceType, TensorDataType TPrecision>

float Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::getBeta1 ( ) const

inline noexcept

◆ getBeta2()

template<DeviceType TDeviceType, TensorDataType TPrecision>

float Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::getBeta2 ( ) const

inline noexcept

◆ getEpsilon()

template<DeviceType TDeviceType, TensorDataType TPrecision>

float Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::getEpsilon ( ) const

inline noexcept

◆ getLearningRate()

template<DeviceType TDeviceType, TensorDataType TPrecision>

float Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::getLearningRate ( ) const

inline override virtual

Get the current learning rate.

Returns the base learning rate used for parameter updates. Some optimizers may apply adaptive per-parameter learning rates internally (Adam, AdamW), but this method returns the global scaling factor.

Returns: Current learning rate as a float

Note: For adaptive optimizers, actual effective learning rate per parameter may differ due to momentum and variance scaling
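
To see why the effective per-parameter step can differ from what the raw gradient suggests, consider the very first update of a textbook Adam-style rule. After bias correction at step 1 the first moment equals the gradient and the second moment equals its square, so the step size is close to the learning rate regardless of gradient magnitude. The helper below is a sketch of that arithmetic only; `firstStepUpdateSketch` is a hypothetical name, weight decay is ignored, and nothing here is taken from Mila's implementation.

```cpp
#include <cmath>

// Size of the first Adam-style update for one scalar parameter,
// ignoring weight decay. At step 1, bias correction gives
// m_hat = g and v_hat = g * g, so the update is lr * g / (|g| + eps).
inline float firstStepUpdateSketch(float grad, float lr, float eps = 1e-8f) {
    const float m_hat = grad;
    const float v_hat = grad * grad;
    return lr * m_hat / (std::sqrt(v_hat) + eps);
}
```

With a learning rate of 1e-3, gradients of 0.01 and 100 both yield a first step of roughly 1e-3 in magnitude, which is the sense in which the value returned by getLearningRate() is a global scale rather than the exact per-parameter step.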

See also: setLearningRate()

Implements Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >.

◆ getParameterCount()

template<DeviceType TDeviceType, TensorDataType TPrecision>

size_t Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::getParameterCount ( ) const

inline noexcept

◆ getStepCount()

template<DeviceType TDeviceType, TensorDataType TPrecision>

size_t Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::getStepCount ( ) const

inline noexcept

◆ getWeightDecay()

template<DeviceType TDeviceType, TensorDataType TPrecision>

float Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::getWeightDecay ( ) const

inline noexcept

◆ setLearningRate()

template<DeviceType TDeviceType, TensorDataType TPrecision>

void Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::setLearningRate ( float learning_rate )

inline override virtual

Set the learning rate for future updates.

Updates the base learning rate used by the optimizer. Typically used for learning rate schedules (decay, warmup, cyclic, etc.).

Parameters

learning_rate	New learning rate (must be positive)

Exceptions

std::invalid_argument	if learning_rate <= 0

Note:

Takes effect immediately for the next step() call.
Does not affect optimizer state (momentum, variance).
For learning rate schedules, call this at epoch or iteration boundaries.

See also: getLearningRate()

Example with learning rate decay:

float initial_lr = 0.001f;
optimizer->setLearningRate(initial_lr);
 
for (size_t epoch = 0; epoch < num_epochs; ++epoch) {
    // Training loop...
 
    // Decay learning rate every 10 epochs
    if (epoch > 0 && epoch % 10 == 0) {
        float new_lr = optimizer->getLearningRate() * 0.5f;
        optimizer->setLearningRate(new_lr);
        std::cout << "Learning rate: " << new_lr << std::endl;
    }
}

Implements Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >.

◆ setWeightDecay()

template<DeviceType TDeviceType, TensorDataType TPrecision>

void Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::setWeightDecay ( float weight_decay )

inline

◆ step()

template<DeviceType TDeviceType, TensorDataType TPrecision>

void Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::step ( )

inline override virtual

Perform one optimization step.

Updates all registered parameters using their accumulated gradients according to the optimizer's update rule (SGD, Adam, AdamW, etc.). This method is on the hot path: it is called on every training iteration.

For algorithms with state (Adam, AdamW):

Updates first and second moment estimates
Applies bias correction if needed
Computes parameter update
Writes updated parameters back to tensors
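
The four stages listed above can be sketched on plain `std::vector<float>` buffers. This follows the commonly published AdamW rule with decoupled weight decay; it is an assumption that Mila's CPU and CUDA backends compute the same quantities, and `AdamWHyperSketch`, `AdamWStateSketch`, and `adamwStepSketch` are hypothetical names with commonly used default values.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical hyperparameter bundle; defaults are common choices, not Mila's.
struct AdamWHyperSketch {
    float lr = 1e-3f;
    float beta1 = 0.9f;
    float beta2 = 0.999f;
    float eps = 1e-8f;
    float weight_decay = 0.01f;
};

// Hypothetical per-parameter state; moments are zero-initialized on first use.
struct AdamWStateSketch {
    std::vector<float> m;  // first moment estimate
    std::vector<float> v;  // second moment estimate
    std::size_t step = 0;  // number of updates applied so far
};

inline void adamwStepSketch(std::vector<float>& param,
                            const std::vector<float>& grad,
                            AdamWStateSketch& state,
                            const AdamWHyperSketch& h) {
    if (param.size() != grad.size())
        throw std::invalid_argument("param and grad sizes must match");
    if (state.m.empty()) {
        state.m.assign(param.size(), 0.0f);
        state.v.assign(param.size(), 0.0f);
    }

    ++state.step;
    const float t = static_cast<float>(state.step);
    const float bias1 = 1.0f - std::pow(h.beta1, t);
    const float bias2 = 1.0f - std::pow(h.beta2, t);

    for (std::size_t i = 0; i < param.size(); ++i) {
        const float g = grad[i];
        // 1. Update first and second moment estimates.
        state.m[i] = h.beta1 * state.m[i] + (1.0f - h.beta1) * g;
        state.v[i] = h.beta2 * state.v[i] + (1.0f - h.beta2) * g * g;
        // 2. Apply bias correction.
        const float m_hat = state.m[i] / bias1;
        const float v_hat = state.v[i] / bias2;
        // 3. Compute the update; weight decay acts on the parameter directly.
        const float update = m_hat / (std::sqrt(v_hat) + h.eps)
                           + h.weight_decay * param[i];
        // 4. Write the updated parameter back.
        param[i] -= h.lr * update;
    }
}
```

In this sketch the decay term is applied to the parameter itself rather than added to the gradient, which is what distinguishes AdamW from Adam with L2 regularization.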

Exceptions

std::runtime_error	if no parameters have been registered
std::runtime_error	if gradient data is invalid or null

Note:

Gradients should be computed via backward() before calling step().
For CUDA implementations, the update may be asynchronous (it uses the device stream).
Increments the internal step counter for algorithms that require it (Adam, AdamW).

See also: addParameter(); backward()

Typical sequence:

model->zeroGradients();              // Clear previous gradients (model-managed)
model->forward(input, output);       // Forward pass
loss = computeLoss(output, target);  // Evaluate the loss
// loss_grad is assumed to hold the gradient of the loss with respect to output
model->backward(input, loss_grad);   // Compute gradients
optimizer->step();                   // Update parameters

Implements Mila::Dnn::Compute::Optimizer< TDeviceType, TPrecision >.

Member Data Documentation

◆ config_

template<DeviceType TDeviceType, TensorDataType TPrecision>

AdamWConfig Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::config_

private

◆ context_

template<DeviceType TDeviceType, TensorDataType TPrecision>

IExecutionContext* Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::context_

private

◆ impl_

template<DeviceType TDeviceType, TensorDataType TPrecision>

std::shared_ptr<OptimizerType> Mila::Dnn::Optimizers::AdamWOptimizer< TDeviceType, TPrecision >::impl_

private

The documentation for this class was generated from the following file:

/__w/Mila/Mila/Mila/Src/Dnn/Optimizers/AdamW.ixx
