Namespaces
namespace	Detail
	Namespace for CUDA layer normalization implementation details.

Classes
class	AMPConfig

class	BinaryOperation
	Abstract class for binary operations in the neural network framework.

class	ComputeDevice
	Abstract interface for compute devices (CPU, CUDA, etc.).

class	ComputePrecision
	Controls automatic mixed precision behavior for neural network operations.

class	ComputeResource
	Abstract base class for compute resources.

class	CpuCrossEntropyOp
	CPU implementation of the cross entropy loss operation for neural networks.

class	CpuCrossEntropyOpRegistrar
	Class responsible for registering the CpuCrossEntropyOp operation.

class	CpuDevice
	Class representing a CPU compute device.

class	CpuEncoderOp
	CPU implementation of the encoder operation for neural networks.

class	CpuEncoderOpRegistrar
	Class responsible for registering the CpuEncoderOp operation.

class	CpuGeluOp

class	CpuGeluOpRegistrar
	Class responsible for registering the CpuGeluOp operation.

class	CpuLayerNormOp
	CPU implementation of the Layer Normalization operation for neural networks.

class	CpuLayerNormOpRegistrar
	Class responsible for registering the CpuLayerNormOp operation.

class	CpuLinearOp
	CPU implementation of the Fully Connected operation for neural networks.

class	CpuLinearOpRegistrar
	Class responsible for registering the CpuLinearOp operation.

class	CpuMemoryResource
	A memory resource for CPU memory allocation.

class	CpuMultiHeadAttentionOp
	CPU implementation of the Multi-Head Attention operation for neural networks.

class	CpuMultiHeadAttentionOpRegistrar
	Class responsible for registering the CpuMultiHeadAttentionOp operation.

class	CpuResidualOp
	CPU implementation of the residual operation for neural networks.

class	CpuResidualOpRegistrar
	Class responsible for registering the CpuResidualOp operation.

class	CpuSoftmaxOp
	CPU implementation of the softmax operation for neural networks.

class	CpuSoftmaxOpRegistrar
	Class responsible for registering the CpuSoftmaxOp operation.

class	CublasLtError

class	CudaBadAlloc

class	CudaComputeResource

struct	CudaDataTypeMap
	Helper struct to map C++ types to CUDA data types for cuBLASLt.

struct	CudaDataTypeMap< __nv_bfloat16 >

struct	CudaDataTypeMap< float >

struct	CudaDataTypeMap< half >

class	CudaDevice
	Class representing a CUDA compute device.

class	CudaEncoderOp
	CUDA implementation of the Encoder operation for transformer models.

class	CudaEncoderOpRegistrar
	Class responsible for registering the CudaEncoderOp operation.

class	CudaError
	Exception class for CUDA runtime errors.

class	CudaGeluOp
	CUDA implementation of the GELU activation function for neural networks.

class	CudaGeluOpRegistrar
	Class responsible for registering the CudaGeluOp operation.

class	CudaLayerNormOp
	CUDA implementation of the Layer Normalization operation for neural networks.

class	CudaLayerNormOpRegistrar
	Class responsible for registering the CudaLayerNormOp operation.

class	CudaLinearOp
	CUDA implementation of the Fully Connected operation for neural networks.

class	CudaLinearOpRegistrar
	Class responsible for registering the CudaLinearOp operation.

class	CudaManagedMemoryResource
	A memory resource that uses CUDA managed memory.

class	CudaMatMulBiasGeluOp
	CUDA implementation of the fused MatMul-Bias-GELU operation.

class	CudaMatMulBiasGeluOpRegistrar
	Class responsible for registering the CudaMatMulBiasGeluOp operation.

class	CudaMemoryResource
	A memory resource that allocates memory on a CUDA device.

class	CudaMultiHeadAttentionOp
	CUDA implementation of the Multi-Head Attention operation for transformer models.

class	CudaMultiHeadAttentionOpRegistrar
	Class responsible for registering the CudaMultiHeadAttentionOp operation.

class	CudaPinnedMemoryResource
	A memory resource that allocates pinned (page-locked) memory using CUDA.

class	CudaResidualOp
	CUDA implementation of the residual operation for neural networks.

class	CudaResidualOpRegistrar
	Class responsible for registering the CudaResidualOp operation.

class	CudaSoftmaxOp
	CUDA implementation of the softmax operation for neural networks.

class	CudaSoftmaxOpRegistrar
	Class responsible for registering the CudaSoftmaxOp operation.

struct	DeviceAccessible

class	DeviceContext
	The DeviceContext class manages device contexts for module and tensor computations.

class	DeviceProps

class	DeviceRegistrar
	Class to manage compute device initialization.

class	DeviceRegistry
	Registry for compute device creation and management.

class	DynamicMemoryResource
	A class that represents a dynamically-determined memory resource.

struct	FusedOpMeta
	Metadata for fused operations in the neural network.

class	FusedSoftmaxCrossEntropyOp
	CUDA implementation of the fused softmax and cross entropy operation for neural networks.

class	FusedSoftmaxCrossEntropyOpRegistrar
	Class responsible for registering the FusedSoftmaxCrossEntropyOp operation.

struct	HostAccessible

class	HostComputeResource

struct	MemoryStats
	Global memory statistics for all TrackedMemoryResource instances.

struct	OperationAttributes
	Common attributes for neural network operations.

class	OperationBase
	Base class for all compute operations in the Mila neural network framework.

class	OperationRegistry
	A registry for operations that can be created based on operation names, type information, and device type.

class	OperationsRegistrar
	Class to manage compute operations initialization.

class	TrackedMemoryResource
	A memory resource wrapper that tracks allocation and deallocation statistics.

class	UnaryOperation
	Abstract base class for unary operations in the compute framework.

Concepts
concept	IsCpuComputeResource

concept	IsCudaComputeResource

Typedefs
using	Mila::Dnn::Compute::DeviceMemoryResource = CudaMemoryResource
	Alias for CudaMemoryResource that represents device-accessible memory.

using	Mila::Dnn::Compute::HostMemoryResource = CpuMemoryResource
	Alias for CpuMemoryResource that represents host-accessible memory.

using	Mila::Dnn::Compute::MemoryResource = std::pmr::memory_resource
	An alias for the standard polymorphic memory resource.

Enumerations
enum class	Mila::Dnn::Compute::DeviceType { Cpu , Cuda }
	Enumeration of supported compute device types.

enum class	Mila::Dnn::Compute::OperationType { CrossEntropyOp , EncoderOp , FusedOp , LinearOp , GeluOp , LayerNormOp , MultiHeadAttentionOp , ResidualOp , SoftmaxOp }
	Enumeration of all supported neural network operation types.

Functions
constexpr int	Mila::Dnn::Compute::ceil_div (int M, int N)
	Calculates ceiling division for kernel grid/block dimensions.

int	Mila::Dnn::Compute::checkDevice (int deviceId)
	Validates that a device ID is valid and available.

template<DeviceType TDeviceType>
std::shared_ptr< DeviceContext >	Mila::Dnn::Compute::CreateCompatibleContext ()
	Creates a device context compatible with the specified device type.

template<typename TDataType , typename TCompute = float> requires std::is_same_v<TDataType, float> || std::is_same_v<TDataType, half> || std::is_same_v<TDataType, __nv_bfloat16> || std::is_same_v<TDataType, __nv_fp8_e4m3>
void	Mila::Dnn::Compute::cublaslt_matmul_forward (TDataType *Y, const TDataType *X, const TDataType *weight, const TDataType *bias, int outer_size, int C, int OC, cudaStream_t stream, cublasLtHandle_t cublasLtHandle)
	cuBLASLt implementation of matrix multiplication with bias addition

void	Mila::Dnn::Compute::cublasLtCheckStatus (cublasStatus_t status, const std::source_location &location=std::source_location::current())
	Checks the status of a cuBLASLt operation and throws if an error occurred.

void	cuda_encoder_forward_fp16 (half *Y, const int *X, const half *wte, const half *wpe, int B, int T, int C, cudaStream_t stream)

void	cuda_encoder_forward_fp32 (float *Y, const int *X, const float *wte, const float *wpe, int B, int T, int C, cudaStream_t stream)

void	cuda_gelu_backward_fp16 (half *dX, const half *X, const half *dY, const int N, cudaStream_t stream)

void	cuda_gelu_backward_fp32 (float *dX, const float *X, const float *dY, const int N, cudaStream_t stream)

void	cuda_gelu_forward_fp16 (half *Y, const half *X, int N, cudaStream_t stream)

void	cuda_gelu_forward_fp32 (float *Y, const float *X, int N, cudaStream_t stream)

void	cuda_layernorm_forward_fp16 (half *Y, half *mean, half *rstd, const half *X, const half *weight, const half *bias, int B, int T, int C, float epsilon, cudaStream_t stream)

void	cuda_layernorm_forward_fp32 (float *Y, float *mean, float *rstd, const float *X, const float *weight, const float *bias, int B, int T, int C, float epsilon, cudaStream_t stream)

void	cuda_matmul_forward_fp16 (half *Y, const half *X, const half *weight, const half *bias, int outer_size, int C, int OC, cudaStream_t stream)

void	cuda_matmul_forward_fp32 (float *Y, const float *X, const float *weight, const float *bias, int outer_size, int C, int OC, cudaStream_t stream)

void	cuda_mha_forward_fp16 (half *Y, half *qkvr, half *att, const half *X, int B, int T, int C, int NH, cudaStream_t stream)

void	cuda_mha_forward_fp32 (float *Y, float *qkvr, float *att, const float *X, int B, int T, int C, int NH, cudaStream_t stream)

void	cuda_residual_forward_fp16 (half *Y, const half *X1, const half *X2, int N, cudaStream_t stream)

void	cuda_residual_forward_fp32 (float *Y, const float *X1, const float *X2, int N, cudaStream_t stream)

template<typename TPrecision >
void	cuda_softmax_crossentropy_backward (TPrecision *dlogits, const TPrecision *dlosses, const TPrecision *probs, const int *targets, int batch_size, int seq_len, int vocab_size, cudaStream_t stream)

template<typename TPrecision >
void	cuda_softmax_crossentropy_forward (TPrecision *losses, TPrecision *probs, const TPrecision *logits, const int *targets, int batch_size, int seq_len, int vocab_size, cudaStream_t stream)

template<typename TPrecision >
void	cuda_softmax_forward (TPrecision *Y, const TPrecision *X, int N, int C, cudaStream_t stream)

template<typename TPrecision >
void	cuda_softmax_forward_general (TPrecision *Y, const TPrecision *X, int outer_size, int dim_size, int inner_size, cudaStream_t stream)

void	Mila::Dnn::Compute::cudaCheckLastError (const std::source_location &location=std::source_location::current())
	Checks the last CUDA error and throws if an error occurred.

void	Mila::Dnn::Compute::cudaCheckStatus (cudaError_t status, const std::source_location &location=std::source_location::current())
	Checks the status of a CUDA operation and throws if an error occurred.

std::string	Mila::Dnn::Compute::deviceToString (DeviceType device_type)
	Converts a DeviceType to its string representation.

int	Mila::Dnn::Compute::findCudaDevice (int deviceId=-1, bool preferMemory=false)
	Finds the most appropriate CUDA device for computation.

std::string	getBestDevice (DeviceType type, bool preferMemory=false)
	Gets the best device of a specific type based on performance characteristics.

int	Mila::Dnn::Compute::getBestDeviceId (bool preferMemory=false)
	Identifies the best CUDA device based on performance characteristics.

int	Mila::Dnn::Compute::getDeviceCount ()
	Gets the number of available CUDA devices.

int	Mila::Dnn::Compute::getDriverVersion ()
	Gets the installed CUDA driver version.

int	Mila::Dnn::Compute::getRuntimeVersion ()
	Gets the installed CUDA runtime version.

bool	Mila::Dnn::Compute::isDeviceAvailable (const std::string &device_name)
	Checks if a specific device is available.

std::vector< std::string >	Mila::Dnn::Compute::listDevices ()
	Lists all available compute devices.

std::vector< std::string >	Mila::Dnn::Compute::listDevicesByType (DeviceType type)
	Lists compute devices of a specific type.

std::string	Mila::Dnn::Compute::operationTypeToString (OperationType op)
	Converts an operation type to its string representation.

DeviceType	Mila::Dnn::Compute::toDeviceType (std::string device_type)
	Converts a string to the corresponding DeviceType.

template<DeviceType TDeviceType>
std::shared_ptr< DeviceContext >	Mila::Dnn::Compute::ValidateContext (std::shared_ptr< DeviceContext > context)
	Validates that the provided context is compatible with the specified device type.

Variables
template<typename T >
constexpr bool	always_false = false

const float	GELU_SCALING_FACTOR = sqrtf( 2.0f / M_PI )

Typedef Documentation

◆ DeviceMemoryResource

using Mila::Dnn::Compute::DeviceMemoryResource = CudaMemoryResource

export

Alias for CudaMemoryResource that represents device-accessible memory.

This alias provides a semantic name that describes the memory's accessibility characteristics rather than its implementation details. Use DeviceMemoryResource when you need memory that can be accessed by CUDA device code and operations.

This naming follows CUDA conventions where "device" refers to GPU memory, while maintaining consistency with the architecture's naming pattern.

See also: CudaMemoryResource

◆ HostMemoryResource

using Mila::Dnn::Compute::HostMemoryResource = CpuMemoryResource

export

Alias for CpuMemoryResource that represents host-accessible memory.

This alias provides a semantic name that describes the memory's accessibility characteristics rather than its implementation details. Use HostMemoryResource when you need memory that can be directly accessed from host (CPU) code.

See also: CpuMemoryResource

◆ MemoryResource

using Mila::Dnn::Compute::MemoryResource = std::pmr::memory_resource

export

An alias for the standard polymorphic memory resource.

This provides a common abstraction for memory allocation and management across different compute devices and memory types. The memory_resource is the foundation for all memory allocations within the compute framework and can be extended for specific devices (CPU, CUDA, etc.).

See also: std::pmr::memory_resource; HostMemoryResource; CudaMemoryResource
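
Because MemoryResource is std::pmr::memory_resource, a device-specific or tracking resource is written by overriding the three protected virtual functions of that standard class. The sketch below shows the general pattern only; CountingResource and bytes_in_use are hypothetical names for illustration and are not taken from the framework.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical example resource: forwards to an upstream resource and counts
// the bytes currently allocated. Not part of the documented API.
class CountingResource : public std::pmr::memory_resource {
public:
    explicit CountingResource(
        std::pmr::memory_resource* upstream = std::pmr::get_default_resource())
        : upstream_(upstream) {}

    std::size_t bytes_in_use() const { return bytes_in_use_; }

private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        void* p = upstream_->allocate(bytes, alignment);
        bytes_in_use_ += bytes;
        return p;
    }

    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        upstream_->deallocate(p, bytes, alignment);
        bytes_in_use_ -= bytes;
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        // Two resources are interchangeable only if they are the same object.
        return this == &other;
    }

    std::pmr::memory_resource* upstream_;
    std::size_t bytes_in_use_ = 0;
};
```

Any std::pmr container can then draw from such a resource, for example `std::pmr::vector<float> v(&resource);`.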

Enumeration Type Documentation

◆ DeviceType

enum class Mila::Dnn::Compute::DeviceType

export strong

Enumeration of supported compute device types.

Defines the types of compute devices that can be used for tensor operations and neural network computations.

Enumerator
Cpu	CPU device type.
Cuda	CUDA GPU device type.

◆ OperationType

enum class Mila::Dnn::Compute::OperationType

export strong

Enumeration of all supported neural network operation types.

This enumeration defines the different types of operations that can be executed by the compute framework. Each operation type corresponds to a specific neural network function or layer.

Enumerator
CrossEntropyOp	Cross entropy loss operation.
EncoderOp	Encoder operation for transformer architecture.
FusedOp	Fused operation combining multiple operations for performance optimization.
LinearOp	Linear (fully connected/dense) layer operation.
GeluOp	Gaussian Error Linear Unit activation function.
LayerNormOp	Layer normalization operation.
MultiHeadAttentionOp	Multi-head attention operation for transformers.
ResidualOp	Residual connection operation.
SoftmaxOp	Softmax activation function.

Function Documentation

◆ ceil_div()

constexpr int Mila::Dnn::Compute::ceil_div	(	int	M,
		int	N
	)

constexpr export

Calculates ceiling division for kernel grid/block dimensions.

Parameters

M	Dividend value
N	Divisor value

Returns: Ceiling of M/N as an integer
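
The reference does not show the function body. A common way to implement ceiling division for non-negative M and positive N is the round-up idiom below, which is an assumption about the implementation rather than a copy of it. The block_size and grid_size names are illustrative.

```cpp
// Assumed implementation: standard round-up integer division for
// non-negative M and positive N.
constexpr int ceil_div(int M, int N) {
    return (M + N - 1) / N;
}

// Example: number of 256-thread blocks needed to cover 1000 elements.
constexpr int block_size = 256;
constexpr int grid_size = ceil_div(1000, block_size);
static_assert(grid_size == 4, "1000 elements need four blocks of 256 threads");
```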

◆ checkDevice()

int Mila::Dnn::Compute::checkDevice ( int deviceId )

export

Validates that a device ID is valid and available.

Parameters

deviceId CUDA device ID to check

Returns: The same device ID if valid

Exceptions

std::invalid_argument	If device ID is negative
std::runtime_error	If no CUDA devices are available
std::out_of_range	If device ID exceeds available device count
std::runtime_error	If device is in prohibited compute mode
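
The exception list above suggests a sequence of checks. The following is a minimal sketch of that sequence, assuming the checks run in the listed order. check_device_sketch is a hypothetical stand-in: it takes the device count as a parameter so it runs without the CUDA runtime, and it omits the compute-mode check.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for checkDevice. The real function queries the CUDA
// runtime for the device count and compute mode; here the count is passed in.
int check_device_sketch(int deviceId, int deviceCount) {
    if (deviceId < 0) {
        throw std::invalid_argument("Device ID must be non-negative");
    }
    if (deviceCount == 0) {
        throw std::runtime_error("No CUDA devices available");
    }
    if (deviceId >= deviceCount) {
        throw std::out_of_range("Device ID " + std::to_string(deviceId) +
                                " exceeds the available device count");
    }
    return deviceId;  // Valid: hand the same ID back to the caller.
}
```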

◆ CreateCompatibleContext()

template<DeviceType TDeviceType>

std::shared_ptr< DeviceContext > Mila::Dnn::Compute::CreateCompatibleContext ( )

export

Creates a device context compatible with the specified device type.

Template Parameters

TDeviceType The device type to create a context for.

Returns: std::shared_ptr<DeviceContext> A new context of the appropriate type.

◆ cublaslt_matmul_forward()

template<typename TDataType , typename TCompute = float>
requires std::is_same_v<TDataType, float> || std::is_same_v<TDataType, half> || std::is_same_v<TDataType, __nv_bfloat16> || std::is_same_v<TDataType, __nv_fp8_e4m3>

void Mila::Dnn::Compute::cublaslt_matmul_forward	(	TDataType *	Y,
		const TDataType *	X,
		const TDataType *	weight,
		const TDataType *	bias,
		int	outer_size,
		int	C,
		int	OC,
		cudaStream_t	stream,
		cublasLtHandle_t	cublasLtHandle
	)

export

cuBLASLt implementation of matrix multiplication with bias addition

Template Parameters

TDataType	Data type of the tensors (float, half, __nv_bfloat16, or __nv_fp8_e4m3)
TCompute	Compute type (defaults to float)

Parameters

Y	Output tensor data pointer
X	Input tensor data pointer
weight	Weight tensor data pointer
bias	Bias tensor data pointer (can be nullptr)
outer_size	Outer dimension (batch size × sequence length)
C	Input channels
OC	Output channels
stream	CUDA stream
cublasLtHandle	cuBLASLt handle
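
For orientation, here is a plain CPU reference of a matrix multiplication with optional bias that uses the same parameter names as the signature. The memory layout is an assumption that the reference does not state: X is [outer_size, C], weight is [OC, C], bias is [OC], and Y is [outer_size, OC], all row-major. matmul_bias_reference is a hypothetical name.

```cpp
// CPU reference sketch, not the cuBLASLt implementation.
// Assumed row-major layout: X[outer_size][C], weight[OC][C], bias[OC],
// Y[outer_size][OC].
void matmul_bias_reference(float* Y, const float* X, const float* weight,
                           const float* bias, int outer_size, int C, int OC) {
    for (int row = 0; row < outer_size; ++row) {
        for (int o = 0; o < OC; ++o) {
            // bias may be nullptr, in which case no bias is added.
            float acc = (bias != nullptr) ? bias[o] : 0.0f;
            for (int c = 0; c < C; ++c) {
                acc += X[row * C + c] * weight[o * C + c];
            }
            Y[row * OC + o] = acc;
        }
    }
}
```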

◆ cublasLtCheckStatus()

void Mila::Dnn::Compute::cublasLtCheckStatus	(	cublasStatus_t	status,
		const std::source_location &	location = `std::source_location::current()`
	)

inline export

Checks the status of a cuBLASLt operation and throws if an error occurred.

Parameters

status	The cuBLASLt error status code to check.
location	Source location information (automatically populated by default).

Exceptions

CublasLtError if the status is not CUBLAS_STATUS_SUCCESS.

◆ cuda_encoder_forward_fp16()

void Mila::Dnn::Compute::cuda_encoder_forward_fp16	(	half *	Y,
		const int *	X,
		const half *	wte,
		const half *	wpe,
		int	B,
		int	T,
		int	C,
		cudaStream_t	stream
	)

◆ cuda_encoder_forward_fp32()

void Mila::Dnn::Compute::cuda_encoder_forward_fp32	(	float *	Y,
		const int *	X,
		const float *	wte,
		const float *	wpe,
		int	B,
		int	T,
		int	C,
		cudaStream_t	stream
	)

◆ cuda_gelu_backward_fp16()

void Mila::Dnn::Compute::cuda_gelu_backward_fp16	(	half *	dX,
		const half *	X,
		const half *	dY,
		const int	N,
		cudaStream_t	stream
	)

◆ cuda_gelu_backward_fp32()

void Mila::Dnn::Compute::cuda_gelu_backward_fp32	(	float *	dX,
		const float *	X,
		const float *	dY,
		const int	N,
		cudaStream_t	stream
	)

◆ cuda_gelu_forward_fp16()

void Mila::Dnn::Compute::cuda_gelu_forward_fp16	(	half *	Y,
		const half *	X,
		int	N,
		cudaStream_t	stream
	)

◆ cuda_gelu_forward_fp32()

void Mila::Dnn::Compute::cuda_gelu_forward_fp32	(	float *	Y,
		const float *	X,
		int	N,
		cudaStream_t	stream
	)

◆ cuda_layernorm_forward_fp16()

void Mila::Dnn::Compute::cuda_layernorm_forward_fp16	(	half *	Y,
		half *	mean,
		half *	rstd,
		const half *	X,
		const half *	weight,
		const half *	bias,
		int	B,
		int	T,
		int	C,
		float	epsilon,
		cudaStream_t	stream
	)

◆ cuda_layernorm_forward_fp32()

void Mila::Dnn::Compute::cuda_layernorm_forward_fp32	(	float *	Y,
		float *	mean,
		float *	rstd,
		const float *	X,
		const float *	weight,
		const float *	bias,
		int	B,
		int	T,
		int	C,
		float	epsilon,
		cudaStream_t	stream
	)

◆ cuda_matmul_forward_fp16()

void Mila::Dnn::Compute::cuda_matmul_forward_fp16	(	half *	Y,
		const half *	X,
		const half *	weight,
		const half *	bias,
		int	outer_size,
		int	C,
		int	OC,
		cudaStream_t	stream
	)

◆ cuda_matmul_forward_fp32()

void Mila::Dnn::Compute::cuda_matmul_forward_fp32	(	float *	Y,
		const float *	X,
		const float *	weight,
		const float *	bias,
		int	outer_size,
		int	C,
		int	OC,
		cudaStream_t	stream
	)

◆ cuda_mha_forward_fp16()

void Mila::Dnn::Compute::cuda_mha_forward_fp16	(	half *	Y,
		half *	qkvr,
		half *	att,
		const half *	X,
		int	B,
		int	T,
		int	C,
		int	NH,
		cudaStream_t	stream
	)

◆ cuda_mha_forward_fp32()

void Mila::Dnn::Compute::cuda_mha_forward_fp32	(	float *	Y,
		float *	qkvr,
		float *	att,
		const float *	X,
		int	B,
		int	T,
		int	C,
		int	NH,
		cudaStream_t	stream
	)

◆ cuda_residual_forward_fp16()

void Mila::Dnn::Compute::cuda_residual_forward_fp16	(	half *	Y,
		const half *	X1,
		const half *	X2,
		int	N,
		cudaStream_t	stream
	)

◆ cuda_residual_forward_fp32()

void Mila::Dnn::Compute::cuda_residual_forward_fp32	(	float *	Y,
		const float *	X1,
		const float *	X2,
		int	N,
		cudaStream_t	stream
	)

◆ cuda_softmax_crossentropy_backward()

template<typename TPrecision >

void Mila::Dnn::Compute::cuda_softmax_crossentropy_backward	(	TPrecision *	dlogits,
		const TPrecision *	dlosses,
		const TPrecision *	probs,
		const int *	targets,
		int	batch_size,
		int	seq_len,
		int	vocab_size,
		cudaStream_t	stream
	)

◆ cuda_softmax_crossentropy_forward()

template<typename TPrecision >

void Mila::Dnn::Compute::cuda_softmax_crossentropy_forward	(	TPrecision *	losses,
		TPrecision *	probs,
		const TPrecision *	logits,
		const int *	targets,
		int	batch_size,
		int	seq_len,
		int	vocab_size,
		cudaStream_t	stream
	)

◆ cuda_softmax_forward()

template<typename TPrecision >

void Mila::Dnn::Compute::cuda_softmax_forward	(	TPrecision *	Y,
		const TPrecision *	X,
		int	N,
		int	C,
		cudaStream_t	stream
	)

◆ cuda_softmax_forward_general()

template<typename TPrecision >

void Mila::Dnn::Compute::cuda_softmax_forward_general	(	TPrecision *	Y,
		const TPrecision *	X,
		int	outer_size,
		int	dim_size,
		int	inner_size,
		cudaStream_t	stream
	)

◆ cudaCheckLastError()

void Mila::Dnn::Compute::cudaCheckLastError ( const std::source_location & location = std::source_location::current() )

inline export

Checks the last CUDA error and throws if an error occurred.

Parameters

location Source location information (automatically populated by default).

Exceptions

CudaError if the last error is not cudaSuccess.

◆ cudaCheckStatus()

void Mila::Dnn::Compute::cudaCheckStatus	(	cudaError_t	status,
		const std::source_location &	location = `std::source_location::current()`
	)

inline export

Checks the status of a CUDA operation and throws if an error occurred.

Parameters

status	The CUDA error status code to check.
location	Source location information (automatically populated by default).

Exceptions

CudaError if the status is not cudaSuccess.

◆ deviceToString()

std::string Mila::Dnn::Compute::deviceToString ( DeviceType device_type )

export

Converts a DeviceType to its string representation.

Parameters

device_type The device type to convert.

Returns: std::string The string representation of the device type ("CPU" or "CUDA").

Exceptions

std::runtime_error If the device type is invalid.

◆ findCudaDevice()

int Mila::Dnn::Compute::findCudaDevice	(	int	deviceId = `-1`,
		bool	preferMemory = `false`
	)

inline export

Finds the most appropriate CUDA device for computation.

Either validates a specific device ID if provided or finds the best available device when no preference is specified.

Parameters

deviceId	Preferred device ID, or -1 to select the best device
preferMemory	When true, prioritizes memory bandwidth over compute capability

Returns: Valid CUDA device ID

Exceptions

std::runtime_error If no CUDA devices are found

◆ getBestDevice()

std::string Mila::Dnn::Compute::getBestDevice	(	DeviceType	type,
		bool	preferMemory = `false`
	)

Gets the best device of a specific type based on performance characteristics.

Parameters

type	The device type to filter by (e.g., Cuda)
preferMemory	When true, prioritizes memory bandwidth over compute capability

Returns: std::string Identifier of the best available device

◆ getBestDeviceId()

int Mila::Dnn::Compute::getBestDeviceId ( bool preferMemory = false )

inline export

Identifies the best CUDA device based on performance characteristics.

Evaluates available CUDA devices and selects the one with highest performance potential. Selection criteria vary based on the intended workload type.

Parameters

preferMemory When true, prioritizes memory bandwidth over compute

Returns: Device ID of the best available CUDA device

Exceptions

CudaError If device properties cannot be accessed

◆ getDeviceCount()

int Mila::Dnn::Compute::getDeviceCount ( )

inline export

Gets the number of available CUDA devices.

Returns: Number of CUDA devices available to the application

Exceptions

CudaError If device enumeration fails

◆ getDriverVersion()

int Mila::Dnn::Compute::getDriverVersion ( )

export

Gets the installed CUDA driver version.

Returns: Integer representation of the CUDA driver version

Exceptions

CudaError If driver version cannot be determined

◆ getRuntimeVersion()

int Mila::Dnn::Compute::getRuntimeVersion ( )

export

Gets the installed CUDA runtime version.

Returns: Integer representation of the CUDA runtime version

Exceptions

CudaError If runtime version cannot be determined

◆ isDeviceAvailable()

bool Mila::Dnn::Compute::isDeviceAvailable ( const std::string & device_name )

export

Checks if a specific device is available.

Parameters

device_name The name of the device to check (e.g., "CPU", "CUDA:0").

Returns: bool True if the device is available, false otherwise.

◆ listDevices()

std::vector< std::string > Mila::Dnn::Compute::listDevices ( )

export

Lists all available compute devices.

This function returns a list of all available compute devices that can be used with DeviceContext.

Returns: std::vector<std::string> A list of device identifiers (e.g., "CPU", "CUDA:0", "CUDA:1").

◆ listDevicesByType()

std::vector< std::string > Mila::Dnn::Compute::listDevicesByType ( DeviceType type )

export

Lists compute devices of a specific type.

Filters the available devices by their type, returning only devices that match the specified type. This allows clients to efficiently discover devices with specific capabilities.

Parameters

type	The device type to filter by

Returns: std::vector<std::string> List of matching device identifiers

◆ operationTypeToString()

std::string Mila::Dnn::Compute::operationTypeToString ( OperationType op )

export

Converts an operation type to its string representation.

This utility function converts an OperationType enum value to a human-readable string representation, which can be used for logging, debugging, or serialization.

Parameters

op	The operation type to convert to string

Returns: std::string The string representation of the operation type

Exceptions

std::runtime_error If the operation type is invalid or not recognized

◆ toDeviceType()

DeviceType Mila::Dnn::Compute::toDeviceType ( std::string device_type )

export

Converts a string to the corresponding DeviceType.

Performs case-insensitive matching to convert device type strings to the corresponding enum value.

Parameters

device_type The string representation of the device type.

Returns: DeviceType The corresponding device type enum value.

Exceptions

std::runtime_error If the string does not represent a valid device type. Valid options are: "CPU", "CUDA", "AUTO".

Note: "AUTO" option is currently commented out in implementation.
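
The two conversions, deviceToString and toDeviceType, can be sketched as a round trip. This is a minimal illustration under stated assumptions, not the framework's code: the snake_case function names are hypothetical, case-insensitive matching is done by upper-casing the input, and "AUTO" is rejected because the note above says that option is commented out.

```cpp
#include <algorithm>
#include <cctype>
#include <stdexcept>
#include <string>

enum class DeviceType { Cpu, Cuda };

// Sketch of deviceToString: returns "CPU" or "CUDA" as documented.
std::string device_to_string(DeviceType device_type) {
    switch (device_type) {
        case DeviceType::Cpu:  return "CPU";
        case DeviceType::Cuda: return "CUDA";
    }
    throw std::runtime_error("Invalid DeviceType");
}

// Sketch of toDeviceType: upper-cases a copy of the input, then matches it.
DeviceType to_device_type(std::string device_type) {
    std::transform(device_type.begin(), device_type.end(), device_type.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::toupper(ch)); });
    if (device_type == "CPU")  return DeviceType::Cpu;
    if (device_type == "CUDA") return DeviceType::Cuda;
    throw std::runtime_error("Invalid device type: " + device_type);
}
```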

◆ ValidateContext()

template<DeviceType TDeviceType>

std::shared_ptr< DeviceContext > Mila::Dnn::Compute::ValidateContext ( std::shared_ptr< DeviceContext > context )

export

Validates that the provided context is compatible with the specified device type.

Template Parameters

TDeviceType The device type to validate against.

Parameters

context The context to validate.

Returns: std::shared_ptr<DeviceContext> The validated context.

Exceptions

std::invalid_argument	If the context is null.
std::runtime_error	If the context is incompatible with TDeviceType.
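
The validation contract above can be sketched with a non-type template parameter. DeviceContextSketch is a hypothetical stand-in that only stores a device type, since the real DeviceContext interface is not shown in this reference. validate_context is likewise an illustrative name.

```cpp
#include <memory>
#include <stdexcept>

enum class DeviceType { Cpu, Cuda };

// Hypothetical minimal context used only for this sketch.
struct DeviceContextSketch {
    DeviceType type;
};

template <DeviceType TDeviceType>
std::shared_ptr<DeviceContextSketch> validate_context(
    std::shared_ptr<DeviceContextSketch> context) {
    if (!context) {
        throw std::invalid_argument("DeviceContext cannot be null");
    }
    if (context->type != TDeviceType) {
        throw std::runtime_error(
            "DeviceContext is incompatible with the requested device type");
    }
    return context;  // Returned unchanged so the call can be used inline.
}
```

Returning the validated pointer allows the check to sit in a constructor's member initializer list.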

Variable Documentation

◆ always_false

template<typename T >

constexpr bool Mila::Dnn::Compute::always_false = false

constexpr
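
This variable is undocumented. Its shape matches the common "dependent false" idiom, in which a static_assert in the final branch of an if constexpr chain fires only when that branch is instantiated. The sketch below assumes that is the intended use. bytes_per_element is a hypothetical helper.

```cpp
#include <type_traits>

template <typename T>
constexpr bool always_false = false;

// Hypothetical helper, not part of the documented API.
template <typename T>
constexpr int bytes_per_element() {
    if constexpr (std::is_same_v<T, float>) {
        return 4;
    } else if constexpr (std::is_same_v<T, double>) {
        return 8;
    } else {
        // The condition depends on T, so the assertion is evaluated only
        // when this branch is instantiated for an unsupported type.
        static_assert(always_false<T>, "Unsupported element type");
    }
}
```

Writing `static_assert(false, ...)` directly in that branch is rejected by many compilers even when the branch is never taken, which is why the condition is made dependent on T.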

◆ GELU_SCALING_FACTOR

const float Mila::Dnn::Compute::GELU_SCALING_FACTOR = sqrtf( 2.0f / M_PI )
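
The value sqrt(2/π) is the scaling constant of the widely used tanh approximation of GELU, so the constant is presumably consumed by the GELU kernels in that form. This is an assumption, as the kernels themselves are not shown here. The sketch uses a literal for π so that it does not depend on M_PI being defined, and gelu_tanh is a hypothetical name.

```cpp
#include <cmath>

// Same value as the documented constant, written with a literal for pi.
const float GELU_SCALING_FACTOR = std::sqrt(2.0f / 3.14159265358979323846f);

// tanh approximation of GELU:
//   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
float gelu_tanh(float x) {
    const float cube = 0.044715f * x * x * x;
    return 0.5f * x * (1.0f + std::tanh(GELU_SCALING_FACTOR * (x + cube)));
}
```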
