Mila
Deep Neural Network Library
Namespaces | |
namespace | Detail |
Namespace for CUDA layer normalization implementation details. | |
Classes | |
class | AMPConfig |
class | BinaryOperation |
Abstract class for binary operations in the neural network framework. More... | |
class | ComputeDevice |
Abstract interface for compute devices (CPU, CUDA, etc.). More... | |
class | ComputePrecision |
Controls automatic mixed precision behavior for neural network operations. More... | |
class | ComputeResource |
Abstract base class for compute resources. More... | |
class | CpuCrossEntropyOp |
CPU implementation of the cross entropy loss operation for neural networks. More... | |
class | CpuCrossEntropyOpRegistrar |
Class responsible for registering the CpuCrossEntropyOp operation. More... | |
class | CpuDevice |
Class representing a CPU compute device. More... | |
class | CpuEncoderOp |
CPU implementation of the encoder operation for neural networks. More... | |
class | CpuEncoderOpRegistrar |
Class responsible for registering the CpuEncoderOp operation. More... | |
class | CpuGeluOp |
class | CpuGeluOpRegistrar |
Class responsible for registering the CpuGeluOp operation. More... | |
class | CpuLayerNormOp |
CPU implementation of the Layer Normalization operation for neural networks. More... | |
class | CpuLayerNormOpRegistrar |
Class responsible for registering the CpuLayerNormOp operation. More... | |
class | CpuLinearOp |
CPU implementation of the Fully Connected operation for neural networks. More... | |
class | CpuLinearOpRegistrar |
Class responsible for registering the CpuLinearOp operation. More... | |
class | CpuMemoryResource |
A memory resource for CPU memory allocation. More... | |
class | CpuMultiHeadAttentionOp |
CPU implementation of the Multi-Head Attention operation for neural networks. More... | |
class | CpuMultiHeadAttentionOpRegistrar |
Class responsible for registering the CpuMultiHeadAttentionOp operation. More... | |
class | CpuResidualOp |
CPU implementation of the residual operation for neural networks. More... | |
class | CpuResidualOpRegistrar |
Class responsible for registering the CpuResidualOp operation. More... | |
class | CpuSoftmaxOp |
CPU implementation of the softmax operation for neural networks. More... | |
class | CpuSoftmaxOpRegistrar |
Class responsible for registering the CpuSoftmaxOp operation. More... | |
class | CublasLtError |
class | CudaBadAlloc |
class | CudaComputeResource |
struct | CudaDataTypeMap |
Helper struct to map C++ types to CUDA data types for cuBLASLt. More... | |
struct | CudaDataTypeMap< __nv_bfloat16 > |
struct | CudaDataTypeMap< float > |
struct | CudaDataTypeMap< half > |
class | CudaDevice |
Class representing a CUDA compute device. More... | |
class | CudaEncoderOp |
CUDA implementation of the Encoder operation for transformer models. More... | |
class | CudaEncoderOpRegistrar |
Class responsible for registering the CudaEncoderOp operation. More... | |
class | CudaError |
Exception class for CUDA runtime errors. More... | |
class | CudaGeluOp |
CUDA implementation of the GELU activation function for neural networks. More... | |
class | CudaGeluOpRegistrar |
Class responsible for registering the CudaGeluOp operation. More... | |
class | CudaLayerNormOp |
CUDA implementation of the Layer Normalization operation for neural networks. More... | |
class | CudaLayerNormOpRegistrar |
Class responsible for registering the CudaLayerNormOp operation. More... | |
class | CudaLinearOp |
CUDA implementation of the Fully Connected operation for neural networks. More... | |
class | CudaLinearOpRegistrar |
Class responsible for registering the CudaLinearOp operation. More... | |
class | CudaManagedMemoryResource |
A memory resource that uses CUDA managed memory. More... | |
class | CudaMatMulBiasGeluOp |
CUDA implementation of the fused MatMul-Bias-GELU operation. More... | |
class | CudaMatMulBiasGeluOpRegistrar |
Class responsible for registering the CudaMatMulBiasGeluOp operation. More... | |
class | CudaMemoryResource |
A memory resource that allocates memory on a CUDA device. More... | |
class | CudaMultiHeadAttentionOp |
CUDA implementation of the Multi-Head Attention operation for transformer models. More... | |
class | CudaMultiHeadAttentionOpRegistrar |
Class responsible for registering the CudaMultiHeadAttentionOp operation. More... | |
class | CudaPinnedMemoryResource |
A memory resource that allocates pinned (page-locked) memory using CUDA. More... | |
class | CudaResidualOp |
CUDA implementation of the residual operation for neural networks. More... | |
class | CudaResidualOpRegistrar |
Class responsible for registering the CudaResidualOp operation. More... | |
class | CudaSoftmaxOp |
CUDA implementation of the softmax operation for neural networks. More... | |
class | CudaSoftmaxOpRegistrar |
Class responsible for registering the CudaSoftmaxOp operation. More... | |
struct | DeviceAccessible |
class | DeviceContext |
The DeviceContext class manages device contexts for module and tensor computations. More... | |
class | DeviceProps |
class | DeviceRegistrar |
Class to manage compute device initialization. More... | |
class | DeviceRegistry |
Registry for compute device creation and management. More... | |
class | DynamicMemoryResource |
A class that represents a dynamically-determined memory resource. More... | |
struct | FusedOpMeta |
Metadata for fused operations in the neural network. More... | |
class | FusedSoftmaxCrossEntropyOp |
CUDA implementation of the fused softmax and cross entropy operation for neural networks. More... | |
class | FusedSoftmaxCrossEntropyOpRegistrar |
Class responsible for registering the FusedSoftmaxCrossEntropyOp operation. More... | |
struct | HostAccessible |
class | HostComputeResource |
struct | MemoryStats |
Global memory statistics for all TrackedMemoryResource instances. More... | |
struct | OperationAttributes |
Common attributes for neural network operations. More... | |
class | OperationBase |
Base class for all compute operations in the Mila neural network framework. More... | |
class | OperationRegistry |
A registry for operations that can be created based on operation names, type information, and device type. More... | |
class | OperationsRegistrar |
Class to manage compute operations initialization. More... | |
class | TrackedMemoryResource |
A memory resource wrapper that tracks allocation and deallocation statistics. More... | |
class | UnaryOperation |
Abstract base class for unary operations in the compute framework. More... | |
Concepts | |
concept | IsCpuComputeResource |
concept | IsCudaComputeResource |
Typedefs | |
using | Mila::Dnn::Compute::DeviceMemoryResource = CudaMemoryResource |
Alias for CudaMemoryResource that represents device-accessible memory. | |
using | Mila::Dnn::Compute::HostMemoryResource = CpuMemoryResource |
Alias for CpuMemoryResource that represents host-accessible memory. | |
using | Mila::Dnn::Compute::MemoryResource = std::pmr::memory_resource |
An alias for the standard polymorphic memory resource. | |
Enumerations | |
enum class | Mila::Dnn::Compute::DeviceType { Cpu , Cuda } |
Enumeration of supported compute device types. More... | |
enum class | Mila::Dnn::Compute::OperationType { CrossEntropyOp , EncoderOp , FusedOp , LinearOp , GeluOp , LayerNormOp , MultiHeadAttentionOp , ResidualOp , SoftmaxOp } |
Enumeration of all supported neural network operation types. More... | |
Functions | |
constexpr int | Mila::Dnn::Compute::ceil_div (int M, int N) |
Calculates ceiling division for kernel grid/block dimensions. | |
int | Mila::Dnn::Compute::checkDevice (int deviceId) |
Validates that a device ID is valid and available. | |
template<DeviceType TDeviceType> | |
std::shared_ptr< DeviceContext > | Mila::Dnn::Compute::CreateCompatibleContext () |
Creates a device context compatible with the specified device type. | |
template<typename TDataType , typename TCompute = float> requires std::is_same_v<TDataType, float> || std::is_same_v<TDataType, half> || std::is_same_v<TDataType, __nv_bfloat16> || std::is_same_v<TDataType, __nv_fp8_e4m3> | |
void | Mila::Dnn::Compute::cublaslt_matmul_forward (TDataType *Y, const TDataType *X, const TDataType *weight, const TDataType *bias, int outer_size, int C, int OC, cudaStream_t stream, cublasLtHandle_t cublasLtHandle) |
cuBLASLt implementation of matrix multiplication with bias addition. | |
void | Mila::Dnn::Compute::cublasLtCheckStatus (cublasStatus_t status, const std::source_location &location=std::source_location::current()) |
Checks the status of a cuBLASLt operation and throws if an error occurred. | |
void | cuda_encoder_forward_fp16 (half *Y, const int *X, const half *wte, const half *wpe, int B, int T, int C, cudaStream_t stream) |
void | cuda_encoder_forward_fp32 (float *Y, const int *X, const float *wte, const float *wpe, int B, int T, int C, cudaStream_t stream) |
void | cuda_gelu_backward_fp16 (half *dX, const half *X, const half *dY, const int N, cudaStream_t stream) |
void | cuda_gelu_backward_fp32 (float *dX, const float *X, const float *dY, const int N, cudaStream_t stream) |
void | cuda_gelu_forward_fp16 (half *Y, const half *X, int N, cudaStream_t stream) |
void | cuda_gelu_forward_fp32 (float *Y, const float *X, int N, cudaStream_t stream) |
void | cuda_layernorm_forward_fp16 (half *Y, half *mean, half *rstd, const half *X, const half *weight, const half *bias, int B, int T, int C, float epsilon, cudaStream_t stream) |
void | cuda_layernorm_forward_fp32 (float *Y, float *mean, float *rstd, const float *X, const float *weight, const float *bias, int B, int T, int C, float epsilon, cudaStream_t stream) |
void | cuda_matmul_forward_fp16 (half *Y, const half *X, const half *weight, const half *bias, int outer_size, int C, int OC, cudaStream_t stream) |
void | cuda_matmul_forward_fp32 (float *Y, const float *X, const float *weight, const float *bias, int outer_size, int C, int OC, cudaStream_t stream) |
void | cuda_mha_forward_fp16 (half *Y, half *qkvr, half *att, const half *X, int B, int T, int C, int NH, cudaStream_t stream) |
void | cuda_mha_forward_fp32 (float *Y, float *qkvr, float *att, const float *X, int B, int T, int C, int NH, cudaStream_t stream) |
void | cuda_residual_forward_fp16 (half *Y, const half *X1, const half *X2, int N, cudaStream_t stream) |
void | cuda_residual_forward_fp32 (float *Y, const float *X1, const float *X2, int N, cudaStream_t stream) |
template<typename TPrecision > | |
void | cuda_softmax_crossentropy_backward (TPrecision *dlogits, const TPrecision *dlosses, const TPrecision *probs, const int *targets, int batch_size, int seq_len, int vocab_size, cudaStream_t stream) |
template<typename TPrecision > | |
void | cuda_softmax_crossentropy_forward (TPrecision *losses, TPrecision *probs, const TPrecision *logits, const int *targets, int batch_size, int seq_len, int vocab_size, cudaStream_t stream) |
template<typename TPrecision > | |
void | cuda_softmax_forward (TPrecision *Y, const TPrecision *X, int N, int C, cudaStream_t stream) |
template<typename TPrecision > | |
void | cuda_softmax_forward_general (TPrecision *Y, const TPrecision *X, int outer_size, int dim_size, int inner_size, cudaStream_t stream) |
void | Mila::Dnn::Compute::cudaCheckLastError (const std::source_location &location=std::source_location::current()) |
Checks the last CUDA error and throws if an error occurred. | |
void | Mila::Dnn::Compute::cudaCheckStatus (cudaError_t status, const std::source_location &location=std::source_location::current()) |
Checks the status of a CUDA operation and throws if an error occurred. | |
std::string | Mila::Dnn::Compute::deviceToString (DeviceType device_type) |
Converts a DeviceType to its string representation. | |
int | Mila::Dnn::Compute::findCudaDevice (int deviceId=-1, bool preferMemory=false) |
Finds the most appropriate CUDA device for computation. | |
std::string | getBestDevice (DeviceType type, bool preferMemory=false) |
Gets the best device of a specific type based on performance characteristics. | |
int | Mila::Dnn::Compute::getBestDeviceId (bool preferMemory=false) |
Identifies the best CUDA device based on performance characteristics. | |
int | Mila::Dnn::Compute::getDeviceCount () |
Gets the number of available CUDA devices. | |
int | Mila::Dnn::Compute::getDriverVersion () |
Gets the installed CUDA driver version. | |
int | Mila::Dnn::Compute::getRuntimeVersion () |
Gets the installed CUDA runtime version. | |
bool | Mila::Dnn::Compute::isDeviceAvailable (const std::string &device_name) |
Checks if a specific device is available. | |
std::vector< std::string > | Mila::Dnn::Compute::listDevices () |
Lists all available compute devices. | |
std::vector< std::string > | Mila::Dnn::Compute::listDevicesByType (DeviceType type) |
Lists compute devices of a specific type. | |
std::string | Mila::Dnn::Compute::operationTypeToString (OperationType op) |
Converts an operation type to its string representation. | |
DeviceType | Mila::Dnn::Compute::toDeviceType (std::string device_type) |
Converts a string to the corresponding DeviceType. | |
template<DeviceType TDeviceType> | |
std::shared_ptr< DeviceContext > | Mila::Dnn::Compute::ValidateContext (std::shared_ptr< DeviceContext > context) |
Validates that the provided context is compatible with the specified device type. | |
Variables | |
template<typename T > | |
constexpr bool | always_false = false |
const float | GELU_SCALING_FACTOR = sqrtf( 2.0f / M_PI ) |
using Mila::Dnn::Compute::DeviceMemoryResource = CudaMemoryResource |
export |
Alias for CudaMemoryResource that represents device-accessible memory.
This alias provides a semantic name that describes the memory's accessibility characteristics rather than its implementation details. Use DeviceMemoryResource when you need memory that can be accessed by CUDA device code and operations.
This naming follows CUDA conventions where "device" refers to GPU memory, while maintaining consistency with the architecture's naming pattern.
using Mila::Dnn::Compute::HostMemoryResource = CpuMemoryResource |
export |
Alias for CpuMemoryResource that represents host-accessible memory.
This alias provides a semantic name that describes the memory's accessibility characteristics rather than its implementation details. Use HostMemoryResource when you need memory that can be directly accessed from host (CPU) code.
using Mila::Dnn::Compute::MemoryResource = std::pmr::memory_resource |
export |
An alias for the standard polymorphic memory resource.
This provides a common abstraction for memory allocation and management across different compute devices and memory types. The memory_resource is the foundation for all memory allocations within the compute framework and can be extended for specific devices (CPU, CUDA, etc.).
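Because MemoryResource is the standard `std::pmr::memory_resource`, a custom resource only has to override its three private virtual functions. The following is a minimal sketch under that assumption; `CountingResource` and `demo_allocated_bytes` are hypothetical names used for illustration and are not part of Mila (the library's own TrackedMemoryResource may work differently).

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Stand-in for the alias documented above.
using MemoryResource = std::pmr::memory_resource;

// Hypothetical wrapper (not Mila's TrackedMemoryResource): forwards every
// request to an upstream resource and counts the bytes that pass through.
class CountingResource : public MemoryResource {
public:
    explicit CountingResource(MemoryResource* upstream = std::pmr::new_delete_resource())
        : upstream_(upstream) {}

    std::size_t bytes_allocated() const { return allocated_; }
    std::size_t bytes_freed() const { return freed_; }

private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        void* p = upstream_->allocate(bytes, alignment);
        allocated_ += bytes;
        return p;
    }

    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        upstream_->deallocate(p, bytes, alignment);
        freed_ += bytes;
    }

    bool do_is_equal(const MemoryResource& other) const noexcept override {
        return this == &other;
    }

    MemoryResource* upstream_;
    std::size_t allocated_ = 0;
    std::size_t freed_ = 0;
};

// Usage: any pmr-aware container can draw its storage from the resource.
inline std::size_t demo_allocated_bytes() {
    CountingResource resource;
    {
        std::pmr::vector<float> values(&resource);
        values.resize(256);  // requests at least 256 * sizeof(float) bytes
    }
    // The vector has been destroyed, so everything it took was returned.
    assert(resource.bytes_allocated() == resource.bytes_freed());
    return resource.bytes_allocated();
}
```

Wrapping an upstream resource rather than allocating directly keeps the counting logic independent of where the memory actually comes from.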
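enum class Mila::Dnn::Compute::DeviceType |
export strong |
Enumeration of supported compute device types.
enum class Mila::Dnn::Compute::OperationType |
export strong |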
Enumeration of all supported neural network operation types.
This enumeration defines the different types of operations that can be executed by the compute framework. Each operation type corresponds to a specific neural network function or layer.
Enumerator | |
---|---|
CrossEntropyOp | Cross entropy loss operation. |
EncoderOp | Encoder operation for transformer architecture. |
FusedOp | Fused operation combining multiple operations for performance optimization. |
LinearOp | Linear (fully connected/dense) layer operation. |
GeluOp | Gaussian Error Linear Unit activation function. |
LayerNormOp | Layer normalization operation. |
MultiHeadAttentionOp | Multi-head attention operation for transformers. |
ResidualOp | Residual connection operation. |
SoftmaxOp | Softmax activation function. |
constexpr int Mila::Dnn::Compute::ceil_div (int M, int N) |
constexpr export |
Calculates ceiling division for kernel grid/block dimensions.
Parameters | |
M | Dividend value |
N | Divisor value |
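As an illustration of how ceiling division sizes a kernel launch, here is a minimal sketch. The function body is an assumption (the page documents only the signature and purpose), and `grid_size` is a hypothetical helper.

```cpp
#include <cassert>

// Assumed implementation: only the signature and purpose are documented.
constexpr int ceil_div(int M, int N) {
    return (M + N - 1) / N;  // valid for M >= 0 and N > 0
}

// Hypothetical helper: number of thread blocks needed to cover n elements.
constexpr int grid_size(int n, int block_size) {
    return ceil_div(n, block_size);
}

static_assert(ceil_div(1000, 256) == 4, "1000 elements need 4 blocks of 256");
static_assert(ceil_div(1024, 256) == 4, "exact multiples do not round up");
```

With a block size of 256, a launch over 1025 elements would use `grid_size(1025, 256)` blocks, and the kernel would bounds-check the final partial block.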
int Mila::Dnn::Compute::checkDevice (int deviceId) |
export |
Validates that a device ID is valid and available.
Parameters | |
deviceId | CUDA device ID to check |
Exceptions | |
std::invalid_argument | If device ID is negative |
std::runtime_error | If no CUDA devices are available |
std::out_of_range | If device ID exceeds available device count |
std::runtime_error | If device is in prohibited compute mode |
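The documented exceptions suggest a sequence of checks. The sketch below mirrors that sequence without the CUDA runtime: `check_device_id` and `classify` are hypothetical names, the device count and compute-mode flag are passed in as plain arguments, and returning the validated ID is an assumption because the return value is not documented here.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for checkDevice. The device count and the
// prohibited-mode flag are parameters so the sketch runs without CUDA.
// Returning the validated ID is an assumption.
inline int check_device_id(int deviceId, int deviceCount, bool prohibited = false) {
    if (deviceId < 0)
        throw std::invalid_argument("device ID must be non-negative");
    if (deviceCount == 0)
        throw std::runtime_error("no CUDA devices available");
    if (deviceId >= deviceCount)
        throw std::out_of_range("device ID exceeds available device count");
    if (prohibited)
        throw std::runtime_error("device is in prohibited compute mode");
    return deviceId;
}

// Helper for the usage example: reports which check rejected the ID.
inline std::string classify(int deviceId, int deviceCount) {
    try {
        check_device_id(deviceId, deviceCount);
        return "ok";
    } catch (const std::invalid_argument&) {
        return "invalid_argument";
    } catch (const std::out_of_range&) {
        return "out_of_range";
    } catch (const std::runtime_error&) {
        return "runtime_error";
    }
}
```

Callers that want to fall back to the CPU can catch `std::runtime_error` separately from the two argument errors, which indicate a bad ID rather than a missing device.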
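template<DeviceType TDeviceType> std::shared_ptr< DeviceContext > Mila::Dnn::Compute::CreateCompatibleContext () |
export |
Creates a device context compatible with the specified device type.
Template Parameters | |
TDeviceType | The device type to create a context for. |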
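template<typename TDataType , typename TCompute = float> requires std::is_same_v<TDataType, float> || std::is_same_v<TDataType, half> || std::is_same_v<TDataType, __nv_bfloat16> || std::is_same_v<TDataType, __nv_fp8_e4m3> void Mila::Dnn::Compute::cublaslt_matmul_forward (TDataType *Y, const TDataType *X, const TDataType *weight, const TDataType *bias, int outer_size, int C, int OC, cudaStream_t stream, cublasLtHandle_t cublasLtHandle) |
export |
cuBLASLt implementation of matrix multiplication with bias addition.
Template Parameters | |
TDataType | Data type of the tensor elements (float, half, etc.) |
TCompute | Data type for computation (defaults to float) |
Parameters | |
Y | Output tensor data pointer |
X | Input tensor data pointer |
weight | Weight tensor data pointer |
bias | Bias tensor data pointer (can be nullptr) |
outer_size | Combined outer dimension (batch size × sequence length) |
C | Input channels |
OC | Output channels |
stream | CUDA stream |
cublasLtHandle | cuBLASLt handle |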
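To make the shapes concrete, here is a CPU reference sketch of a matrix multiplication with bias addition. It is not the cuBLASLt code path. The layouts are assumptions that this page does not confirm: the input is (rows, C), the weight is (OC, C), the output is (rows, OC), all row-major. `matmul_forward_reference` and `demo_matmul` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// CPU reference for Y = X * W^T + bias.
// Assumed layouts (not confirmed by the source): X is (rows, C),
// weight is (OC, C), Y is (rows, OC), all row-major.
inline void matmul_forward_reference(float* Y, const float* X, const float* weight,
                                     const float* bias, int rows, int C, int OC) {
    for (int r = 0; r < rows; ++r) {
        for (int o = 0; o < OC; ++o) {
            // The bias pointer may be null, as documented for the GPU version.
            float acc = (bias != nullptr) ? bias[o] : 0.0f;
            for (int c = 0; c < C; ++c) {
                acc += X[r * C + c] * weight[o * C + c];
            }
            Y[r * OC + o] = acc;
        }
    }
}

// Usage: two input rows, three input channels, two output channels.
inline std::vector<float> demo_matmul() {
    const std::vector<float> X = {1, 2, 3, 4, 5, 6};  // 2 x 3
    const std::vector<float> W = {1, 0, 0, 0, 1, 1};  // 2 x 3
    const std::vector<float> b = {0.5f, -1.0f};
    std::vector<float> Y(4);
    matmul_forward_reference(Y.data(), X.data(), W.data(), b.data(), 2, 3, 2);
    return Y;
}
```

A reference like this is mainly useful for checking GPU results on small inputs, where exact agreement or a small tolerance can be asserted.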
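void Mila::Dnn::Compute::cublasLtCheckStatus (cublasStatus_t status, const std::source_location &location=std::source_location::current()) |
inline export |
Checks the status of a cuBLASLt operation and throws if an error occurred.
Parameters | |
status | The cuBLASLt error status code to check. |
location | Source location information (automatically populated by default). |
Exceptions | |
CublasLtError | If the status is not CUBLAS_STATUS_SUCCESS. |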
void Mila::Dnn::Compute::cuda_encoder_forward_fp16 (half *Y, const int *X, const half *wte, const half *wpe, int B, int T, int C, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_encoder_forward_fp32 (float *Y, const int *X, const float *wte, const float *wpe, int B, int T, int C, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_gelu_backward_fp16 (half *dX, const half *X, const half *dY, const int N, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_gelu_backward_fp32 (float *dX, const float *X, const float *dY, const int N, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_gelu_forward_fp16 (half *Y, const half *X, int N, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_gelu_forward_fp32 (float *Y, const float *X, int N, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_layernorm_forward_fp16 (half *Y, half *mean, half *rstd, const half *X, const half *weight, const half *bias, int B, int T, int C, float epsilon, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_layernorm_forward_fp32 (float *Y, float *mean, float *rstd, const float *X, const float *weight, const float *bias, int B, int T, int C, float epsilon, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_matmul_forward_fp16 (half *Y, const half *X, const half *weight, const half *bias, int outer_size, int C, int OC, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_matmul_forward_fp32 (float *Y, const float *X, const float *weight, const float *bias, int outer_size, int C, int OC, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_mha_forward_fp16 (half *Y, half *qkvr, half *att, const half *X, int B, int T, int C, int NH, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_mha_forward_fp32 (float *Y, float *qkvr, float *att, const float *X, int B, int T, int C, int NH, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_residual_forward_fp16 (half *Y, const half *X1, const half *X2, int N, cudaStream_t stream) |
void Mila::Dnn::Compute::cuda_residual_forward_fp32 (float *Y, const float *X1, const float *X2, int N, cudaStream_t stream) |
template<typename TPrecision >
void Mila::Dnn::Compute::cuda_softmax_crossentropy_backward (TPrecision *dlogits, const TPrecision *dlosses, const TPrecision *probs, const int *targets, int batch_size, int seq_len, int vocab_size, cudaStream_t stream) |
template<typename TPrecision >
void Mila::Dnn::Compute::cuda_softmax_crossentropy_forward (TPrecision *losses, TPrecision *probs, const TPrecision *logits, const int *targets, int batch_size, int seq_len, int vocab_size, cudaStream_t stream) |
template<typename TPrecision >
void Mila::Dnn::Compute::cuda_softmax_forward (TPrecision *Y, const TPrecision *X, int N, int C, cudaStream_t stream) |
template<typename TPrecision >
void Mila::Dnn::Compute::cuda_softmax_forward_general (TPrecision *Y, const TPrecision *X, int outer_size, int dim_size, int inner_size, cudaStream_t stream) |
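The `outer_size`, `dim_size` and `inner_size` parameters of cuda_softmax_forward_general suggest a softmax over the middle axis of a three-way view of the tensor. The CPU sketch below illustrates that indexing. The memory layout is an assumption (element (o, d, i) stored at `(o * dim_size + d) * inner_size + i`), and `softmax_forward_general_reference` is a hypothetical name, not the CUDA kernel.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU reference; assumes element (o, d, i) lives at index
// (o * dim_size + d) * inner_size + i. Softmax is taken over d.
inline void softmax_forward_general_reference(float* Y, const float* X,
                                              int outer_size, int dim_size,
                                              int inner_size) {
    for (int o = 0; o < outer_size; ++o) {
        for (int i = 0; i < inner_size; ++i) {
            const int base = o * dim_size * inner_size + i;

            // Subtract the maximum before exponentiating for numerical stability.
            float max_val = X[base];
            for (int d = 1; d < dim_size; ++d) {
                max_val = std::fmax(max_val, X[base + d * inner_size]);
            }

            float sum = 0.0f;
            for (int d = 0; d < dim_size; ++d) {
                const float e = std::exp(X[base + d * inner_size] - max_val);
                Y[base + d * inner_size] = e;
                sum += e;
            }

            for (int d = 0; d < dim_size; ++d) {
                Y[base + d * inner_size] /= sum;
            }
        }
    }
}

// Usage: one outer slice, softmax over 3 entries, 2 independent inner columns.
inline std::vector<float> demo_softmax() {
    const std::vector<float> X = {1, 10, 2, 20, 3, 30};
    std::vector<float> Y(X.size());
    softmax_forward_general_reference(Y.data(), X.data(), 1, 3, 2);
    return Y;
}
```

With `inner_size` equal to 1 this reduces to a softmax over the last axis, which would correspond to the (N, C) form taken by cuda_softmax_forward.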
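void Mila::Dnn::Compute::cudaCheckLastError (const std::source_location &location=std::source_location::current()) |
inline export |
Checks the last CUDA error and throws if an error occurred.
Parameters | |
location | Source location information (automatically populated by default). |
Exceptions | |
CudaError | If the last error is not cudaSuccess. |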
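void Mila::Dnn::Compute::cudaCheckStatus (cudaError_t status, const std::source_location &location=std::source_location::current()) |
inline export |
Checks the status of a CUDA operation and throws if an error occurred.
Parameters | |
status | The CUDA error status code to check. |
location | Source location information (automatically populated by default). |
Exceptions | |
CudaError | If the status is not cudaSuccess. |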
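std::string Mila::Dnn::Compute::deviceToString (DeviceType device_type) |
export |
Converts a DeviceType to its string representation.
Parameters | |
device_type | The device type to convert. |
Exceptions | |
std::runtime_error | If the device type is invalid. |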
int Mila::Dnn::Compute::findCudaDevice (int deviceId=-1, bool preferMemory=false) |
inline export |
Finds the most appropriate CUDA device for computation.
Validates the requested device ID if one is provided; otherwise finds the best available device.
Parameters | |
deviceId | Preferred device ID, or -1 to select the best device |
Exceptions | |
std::runtime_error | If no CUDA devices are found |
std::string Mila::Dnn::Compute::getBestDevice (DeviceType type, bool preferMemory=false) |
Gets the best device of a specific type based on performance characteristics.
Parameters | |
type | The device type to filter by (e.g., Cuda) |
preferMemory | When true, prioritizes memory bandwidth over compute capability |
int Mila::Dnn::Compute::getBestDeviceId (bool preferMemory=false) |
inline export |
Identifies the best CUDA device based on performance characteristics.
Evaluates available CUDA devices and selects the one with the highest performance potential. Selection criteria vary based on the intended workload type.
Parameters | |
preferMemory | When true, prioritizes memory bandwidth over compute |
Exceptions | |
CudaError | If device properties cannot be accessed |
int Mila::Dnn::Compute::getDeviceCount () |
inline export |
Gets the number of available CUDA devices.
Exceptions | |
CudaError | If device enumeration fails |
int Mila::Dnn::Compute::getDriverVersion () |
export |
Gets the installed CUDA driver version.
Exceptions | |
CudaError | If driver version cannot be determined |
int Mila::Dnn::Compute::getRuntimeVersion () |
export |
Gets the installed CUDA runtime version.
Exceptions | |
CudaError | If runtime version cannot be determined |
bool Mila::Dnn::Compute::isDeviceAvailable (const std::string &device_name) |
export |
Checks if a specific device is available.
Parameters | |
device_name | The name of the device to check (e.g., "CPU", "CUDA:0"). |
std::vector< std::string > Mila::Dnn::Compute::listDevices () |
export |
Lists all available compute devices.
This function returns a list of all available compute devices that can be used with DeviceContext.
std::vector< std::string > Mila::Dnn::Compute::listDevicesByType (DeviceType type) |
export |
Lists compute devices of a specific type.
Filters the available devices by type and returns only those that match, so clients can discover devices with specific capabilities.
Parameters | |
type | The device type to filter by |
std::string Mila::Dnn::Compute::operationTypeToString (OperationType op) |
export |
Converts an operation type to its string representation.
This utility function converts an OperationType enum value to a human-readable string, which can be used for logging, debugging, or serialization.
Parameters | |
op | The operation type to convert to string |
Exceptions | |
std::runtime_error | If the operation type is invalid or not recognized |
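An enum-to-string conversion of this kind is commonly written as a `switch` with a throwing fallthrough. The sketch below assumes that approach. The enumeration is a local copy of the one documented on this page, `operation_type_to_string` is a hypothetical name, and the returned strings (which simply mirror the enumerator names) are an assumption because the actual strings are not shown here.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Local copy of the enumeration documented on this page.
enum class OperationType {
    CrossEntropyOp, EncoderOp, FusedOp, LinearOp, GeluOp,
    LayerNormOp, MultiHeadAttentionOp, ResidualOp, SoftmaxOp
};

// Assumed mapping: the strings mirror the enumerator names. The strings the
// library actually returns are not shown in the source.
inline std::string operation_type_to_string(OperationType op) {
    switch (op) {
        case OperationType::CrossEntropyOp:       return "CrossEntropyOp";
        case OperationType::EncoderOp:            return "EncoderOp";
        case OperationType::FusedOp:              return "FusedOp";
        case OperationType::LinearOp:             return "LinearOp";
        case OperationType::GeluOp:               return "GeluOp";
        case OperationType::LayerNormOp:          return "LayerNormOp";
        case OperationType::MultiHeadAttentionOp: return "MultiHeadAttentionOp";
        case OperationType::ResidualOp:           return "ResidualOp";
        case OperationType::SoftmaxOp:            return "SoftmaxOp";
    }
    // Reached only for values outside the enumerator list.
    throw std::runtime_error("invalid or unrecognized OperationType");
}
```

Omitting a `default` label lets the compiler warn when a new enumerator is added without a matching case, while the trailing `throw` still covers out-of-range values.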
DeviceType Mila::Dnn::Compute::toDeviceType (std::string device_type) |
export |
Converts a string to the corresponding DeviceType.
Performs case-insensitive matching to convert device type strings to the corresponding enum value.
Parameters | |
device_type | The string representation of the device type. |
Exceptions | |
std::runtime_error | If the string does not represent a valid device type. Valid options are: "CPU", "CUDA", "AUTO". |
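Case-insensitive matching can be done by normalizing the input before comparing. The following sketch assumes that strategy. `to_device_type` is a hypothetical name and the enumeration is a local copy. The "AUTO" option is left out because this page does not say which enumerator it resolves to.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

// Local copy of the enumeration documented on this page.
enum class DeviceType { Cpu, Cuda };

// Hypothetical sketch of case-insensitive parsing. "AUTO" is intentionally
// not handled: its mapping is not documented in the source.
inline DeviceType to_device_type(std::string device_type) {
    // Normalize to upper case; the unsigned char cast keeps toupper well defined.
    std::transform(device_type.begin(), device_type.end(), device_type.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::toupper(ch)); });

    if (device_type == "CPU") return DeviceType::Cpu;
    if (device_type == "CUDA") return DeviceType::Cuda;
    throw std::runtime_error("invalid device type: " + device_type);
}
```

Taking the string by value, as the documented signature does, gives the function its own copy to normalize in place.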
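template<DeviceType TDeviceType> std::shared_ptr< DeviceContext > Mila::Dnn::Compute::ValidateContext (std::shared_ptr< DeviceContext > context) |
export |
Validates that the provided context is compatible with the specified device type.
Template Parameters | |
TDeviceType | The device type to validate against. |
Parameters | |
context | The context to validate. |
Exceptions | |
std::invalid_argument | If the context is null. |
std::runtime_error | If the context is incompatible with TDeviceType. |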
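The two documented failure modes can be illustrated with a stand-in context type. In this sketch `StubContext` and `validate_context` are hypothetical: the real DeviceContext interface is not shown on this page, so the stub carries only a device type, and returning the validated pointer is assumed from the documented return type.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Local copy of the enumeration documented on this page.
enum class DeviceType { Cpu, Cuda };

// Hypothetical minimal context; the real DeviceContext API is not shown here.
struct StubContext {
    DeviceType type;
};

// Sketch of compile-time device selection with run-time validation.
template <DeviceType TDeviceType>
std::shared_ptr<StubContext> validate_context(std::shared_ptr<StubContext> context) {
    if (!context) {
        throw std::invalid_argument("context must not be null");
    }
    if (context->type != TDeviceType) {
        throw std::runtime_error("context is incompatible with the requested device type");
    }
    return context;  // unchanged, so calls can be chained
}
```

Returning the same pointer lets a constructor validate and store a context in a single expression.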
template<typename T > constexpr bool Mila::Dnn::Compute::always_false = false |
constexpr |
const float Mila::Dnn::Compute::GELU_SCALING_FACTOR = sqrtf( 2.0f / M_PI ) |
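The value sqrt(2/π) is the scaling factor in the widely used tanh approximation of GELU, GELU(x) ≈ 0.5·x·(1 + tanh(sqrt(2/π)·(x + 0.044715·x³))). The sketch below evaluates that formula on the CPU. That Mila's kernels use this particular approximation is an assumption, and `gelu_tanh` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// sqrt(2 / pi), with pi written out so the sketch does not depend on M_PI.
const float GELU_SCALING_FACTOR = std::sqrt(2.0f / 3.14159265358979323846f);

// Tanh approximation of GELU. Whether the library's kernels use exactly this
// variant is an assumption.
inline float gelu_tanh(float x) {
    const float cube = 0.044715f * x * x * x;
    return 0.5f * x * (1.0f + std::tanh(GELU_SCALING_FACTOR * (x + cube)));
}
```

For large positive inputs the function approaches the identity, and for large negative inputs it approaches zero, which gives two quick sanity checks for any implementation.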