Mila 0.13.48
Deep Neural Network Library
Loading...
Searching...
No Matches
Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision > Class Template Referenceexport

CUDA-specific AdamW optimizer implementation. More...

Inheritance diagram for Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >:
Collaboration diagram for Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >:

Public Types

using CudaExecutionContext = ExecutionContext<DeviceType::Cuda>
using MR = CudaDeviceMemoryResource
using NativeType = typename Mila::Dnn::Compute::Cuda::TensorDataTypeMap<TPrecision>::native_type
using TensorType = Tensor<TPrecision, MR>

Public Member Functions

 CudaAdamWOptimizer (IExecutionContext *context, const OptimizerConfig &config)
 Construct CUDA AdamW optimizer.
 ~CudaAdamWOptimizer () override=default
void addParameter (ITensor *param, ITensor *grad) override
 Register a parameter-gradient pair for optimization.
float getBeta1 () const noexcept
 Get beta1 parameter.
float getBeta2 () const noexcept
 Get beta2 parameter.
float getEpsilon () const noexcept
 Get epsilon parameter.
float getLearningRate () const override
 Zero all gradient tensors.
size_t getParameterCount () const noexcept
 Get number of registered parameter groups.
size_t getStepCount () const noexcept
 Get current step count.
float getWeightDecay () const noexcept
 Get weight decay parameter.
void setLearningRate (float learning_rate) override
 Set learning rate for future steps.
void setWeightDecay (float weight_decay)
 Set weight decay coefficient.
void step () override
 Perform one AdamW optimization step.
Public Member Functions inherited from Mila::Dnn::Compute::Optimizer< DeviceType::Cuda, TPrecision >
virtual ~Optimizer ()=default

Private Member Functions

void validateHyperparameters () const
 Validate optimizer hyperparameters.

Static Private Member Functions

static std::string shapeToString (const shape_t &shape)
 Convert shape to string for error messages.

Private Attributes

OptimizerConfig config_
CudaExecutionContextexec_context_
std::vector< NativeType * > grad_data_
float grad_scale_ { 1.0f }
std::vector< ITensor * > grads_
std::vector< float * > m_data_
std::vector< std::shared_ptr< Tensor< TensorDataType::FP32, MR > > > m_states_
std::vector< float * > master_param_data_
std::vector< std::shared_ptr< Tensor< TensorDataType::FP32, MR > > > master_params_
std::vector< NativeType * > param_data_
std::vector< ITensor * > params_
size_t step_count_
std::vector< float * > v_data_
std::vector< std::shared_ptr< Tensor< TensorDataType::FP32, MR > > > v_states_

Detailed Description

template<TensorDataType TPrecision>
requires PrecisionSupportedOnDevice<TPrecision, DeviceType::Cuda>
class Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >

CUDA-specific AdamW optimizer implementation.

Implements the AdamW algorithm using optimized CUDA kernels from adamw.cuh. Maintains per-parameter state tensors (first moment, second moment) on the GPU and performs asynchronous parameter updates via CUDA streams.

AdamW algorithm:

  • m_t = beta1 * m_{t-1} + (1 - beta1) * g_t (first moment)
  • v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 (second moment)
  • m_hat = m_t / (1 - beta1^t) (bias correction)
  • v_hat = v_t / (1 - beta2^t) (bias correction)
  • theta_t = theta_{t-1} - lr * (m_hat / (sqrt(v_hat) + eps) + wd * theta_{t-1})

Features:

  • Decoupled weight decay (AdamW variant)
  • Bias correction for moments
  • Stochastic rounding for mixed precision
  • Optional master parameters for FP16/BF16 training
  • Asynchronous execution via CUDA streams
Template Parameters
TPrecisionAbstract tensor precision (TensorDataType)

Member Typedef Documentation

◆ CudaExecutionContext

template<TensorDataType TPrecision>
using Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::CudaExecutionContext = ExecutionContext<DeviceType::Cuda>

◆ MR

template<TensorDataType TPrecision>
using Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::MR = CudaDeviceMemoryResource

◆ NativeType

template<TensorDataType TPrecision>
using Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::NativeType = typename Mila::Dnn::Compute::Cuda::TensorDataTypeMap<TPrecision>::native_type

◆ TensorType

template<TensorDataType TPrecision>
using Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::TensorType = Tensor<TPrecision, MR>

Constructor & Destructor Documentation

◆ CudaAdamWOptimizer()

template<TensorDataType TPrecision>
Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::CudaAdamWOptimizer ( IExecutionContext * context,
const OptimizerConfig & config )
inlineexplicit

Construct CUDA AdamW optimizer.

Parameters
exec_contextCUDA execution context for stream and device management
configAdamW optimizer configuration
Exceptions
std::invalid_argumentif exec_context is null
Here is the call graph for this function:

◆ ~CudaAdamWOptimizer()

template<TensorDataType TPrecision>
Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::~CudaAdamWOptimizer ( )
overridedefault

Member Function Documentation

◆ addParameter()

template<TensorDataType TPrecision>
void Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::addParameter ( ITensor * param,
ITensor * grad )
inlineoverridevirtual

Register a parameter-gradient pair for optimization.

The optimizer does not take ownership of the parameter/gradient tensors. The caller (typically a Module) must ensure the tensors remain valid for the lifetime of the optimizer.

Allocates momentum and variance state tensors on the GPU matching the parameter shape. State tensors are zero-initialized.

Parameters
paramParameter tensor to optimize (non-owning, must be on CUDA device)
gradGradient tensor (non-owning, must match param shape and device)
Exceptions
std::invalid_argumentif param or grad is null
std::invalid_argumentif param and grad shapes don't match
std::invalid_argumentif param or grad is not a CUDA tensor
std::invalid_argumentif param or grad data type doesn't match optimizer precision
std::runtime_errorif state allocation fails

Implements Mila::Dnn::Compute::Optimizer< DeviceType::Cuda, TPrecision >.

Here is the call graph for this function:

◆ getBeta1()

template<TensorDataType TPrecision>
float Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::getBeta1 ( ) const
inlinenoexcept

Get beta1 parameter.

◆ getBeta2()

template<TensorDataType TPrecision>
float Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::getBeta2 ( ) const
inlinenoexcept

Get beta2 parameter.

◆ getEpsilon()

template<TensorDataType TPrecision>
float Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::getEpsilon ( ) const
inlinenoexcept

Get epsilon parameter.

◆ getLearningRate()

template<TensorDataType TPrecision>
float Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::getLearningRate ( ) const
inlineoverridevirtual

Zero all gradient tensors.

Asynchronously clears all registered gradient tensors on the GPU.

Exceptions
std::runtime_errorif no parameters have been registered

Get current learning rate.

Implements Mila::Dnn::Compute::Optimizer< DeviceType::Cuda, TPrecision >.

◆ getParameterCount()

template<TensorDataType TPrecision>
size_t Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::getParameterCount ( ) const
inlinenoexcept

Get number of registered parameter groups.

◆ getStepCount()

template<TensorDataType TPrecision>
size_t Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::getStepCount ( ) const
inlinenoexcept

Get current step count.

◆ getWeightDecay()

template<TensorDataType TPrecision>
float Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::getWeightDecay ( ) const
inlinenoexcept

Get weight decay parameter.

◆ setLearningRate()

template<TensorDataType TPrecision>
void Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::setLearningRate ( float learning_rate)
inlineoverridevirtual

Set learning rate for future steps.

Parameters
learning_rateNew learning rate (must be positive)
Exceptions
std::invalid_argumentif learning_rate <= 0

Implements Mila::Dnn::Compute::Optimizer< DeviceType::Cuda, TPrecision >.

◆ setWeightDecay()

template<TensorDataType TPrecision>
void Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::setWeightDecay ( float weight_decay)
inline

Set weight decay coefficient.

Parameters
weight_decayNew weight decay (must be non-negative)
Exceptions
std::invalid_argumentif weight_decay < 0

◆ shapeToString()

template<TensorDataType TPrecision>
std::string Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::shapeToString ( const shape_t & shape)
inlinestaticprivate

Convert shape to string for error messages.

Here is the call graph for this function:
Here is the caller graph for this function:

◆ step()

template<TensorDataType TPrecision>
void Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::step ( )
inlineoverridevirtual

Perform one AdamW optimization step.

Updates all registered parameters asynchronously on the GPU using the AdamW CUDA kernel. Execution happens on the execution context's CUDA stream.

Exceptions
std::runtime_errorif no parameters have been registered
std::runtime_errorif CUDA kernel launch fails
Note
Asynchronous - returns immediately without GPU synchronization
Increments internal step counter for bias correction

Implements Mila::Dnn::Compute::Optimizer< DeviceType::Cuda, TPrecision >.

Here is the call graph for this function:

◆ validateHyperparameters()

template<TensorDataType TPrecision>
void Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::validateHyperparameters ( ) const
inlineprivate

Validate optimizer hyperparameters.

Here is the caller graph for this function:

Member Data Documentation

◆ config_

template<TensorDataType TPrecision>
OptimizerConfig Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::config_
private

◆ exec_context_

template<TensorDataType TPrecision>
CudaExecutionContext* Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::exec_context_
private

◆ grad_data_

template<TensorDataType TPrecision>
std::vector<NativeType*> Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::grad_data_
private

◆ grad_scale_

template<TensorDataType TPrecision>
float Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::grad_scale_ { 1.0f }
private

◆ grads_

template<TensorDataType TPrecision>
std::vector<ITensor*> Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::grads_
private

◆ m_data_

template<TensorDataType TPrecision>
std::vector<float*> Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::m_data_
private

◆ m_states_

template<TensorDataType TPrecision>
std::vector<std::shared_ptr<Tensor<TensorDataType::FP32, MR> > > Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::m_states_
private

◆ master_param_data_

template<TensorDataType TPrecision>
std::vector<float*> Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::master_param_data_
private

◆ master_params_

template<TensorDataType TPrecision>
std::vector<std::shared_ptr<Tensor<TensorDataType::FP32, MR> > > Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::master_params_
private

◆ param_data_

template<TensorDataType TPrecision>
std::vector<NativeType*> Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::param_data_
private

◆ params_

template<TensorDataType TPrecision>
std::vector<ITensor*> Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::params_
private

◆ step_count_

template<TensorDataType TPrecision>
size_t Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::step_count_
private

◆ v_data_

template<TensorDataType TPrecision>
std::vector<float*> Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::v_data_
private

◆ v_states_

template<TensorDataType TPrecision>
std::vector<std::shared_ptr<Tensor<TensorDataType::FP32, MR> > > Mila::Dnn::Compute::CudaAdamWOptimizer< TPrecision >::v_states_
private

The documentation for this class was generated from the following file: