CUDA specialization of TensorOps for mathematical operations.

Static Public Member Functions
template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource>
static void	add (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr)
	Element-wise addition of two tensors.
template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource>
static void	divide (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr)
	Element-wise division of two tensors.
template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource>
static void	multiply (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr)
	Element-wise multiplication of two tensors.
template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource>
static void	subtract (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr)
	Element-wise subtraction of two tensors.
template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource>
static float	sum (const Tensor< TDataType, TMemoryResource > &tensor, IExecutionContext *exec_context=nullptr)
	Computes sum of all tensor elements.

Static Private Member Functions
template<TensorDataType TDataType>
static void	addImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id)
template<TensorDataType TDataType>
static void	divideImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id)
template<TensorDataType TDataType>
static void	multiplyImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id)
template<TensorDataType TDataType>
static void	subtractImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id)
template<TensorDataType TDataType>
static float	sumImpl (const void *tensor_data, size_t count, cudaStream_t stream, int device_id)

Detailed Description

CUDA specialization of TensorOps for mathematical operations.

Provides CUDA-specific implementations of tensor mathematical operations using optimized device kernels for parallel execution on NVIDIA GPUs. Supports all CUDA-compatible tensor data types with automatic type handling.

Key features:

Element-wise binary operations (add, subtract, multiply, divide)
Element-wise unary operations (negate, abs, sqrt)
Scalar operations (add scalar, multiply scalar)
Activation functions (ReLU, Sigmoid, Tanh)
Reduction operations (sum, mean, max, min)
Stream-based asynchronous execution
Zero-overhead ExecutionContext borrowing (raw pointer)
Automatic fallback to default stream
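
The last three features describe a calling convention rather than a computation, so a small host-only sketch may help. It assumes nothing about the real IExecutionContext interface: MockContext, LaunchRecord, and launch_with_optional_context are hypothetical names invented for illustration, and the synchronization policy is taken from the note on add().

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an execution context; not the Mila interface.
struct MockContext {
    std::string stream_name;
    int sync_count = 0;
    void synchronize() { ++sync_count; }
};

// Records what a call would have done, so the policy can be inspected.
struct LaunchRecord {
    std::string stream_used;
    bool synchronized_before_return;
};

// The context is borrowed through a raw pointer: no ownership transfer and no
// reference counting. A null pointer selects the default-stream path.
LaunchRecord launch_with_optional_context(MockContext* exec_context = nullptr) {
    if (exec_context != nullptr) {
        // Caller-provided stream: enqueue and return; the caller synchronizes.
        return {exec_context->stream_name, false};
    }
    // No context: fall back to the default stream and block until done.
    return {"default", true};
}
```

Because the pointer is only borrowed, the caller has to keep the context alive for the duration of the call.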

Member Function Documentation

◆ add()

template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>

void Mila::Dnn::Compute::Cuda::MathOps::add	(	const Tensor< TDataType, TMemoryResource > &	a,
		const Tensor< TDataType, TMemoryResource > &	b,
		Tensor< TDataType, TMemoryResource > &	result,
		IExecutionContext *	exec_context = nullptr )

inline static

Element-wise addition of two tensors.

Computes result[i] = a[i] + b[i] for all elements using CUDA kernels. Tensors must have identical shapes. Borrows execution context for stream control with zero overhead.

Template Parameters

TDataType	Abstract tensor data type
TMemoryResource	Memory resource type

Parameters

a	First input tensor
b	Second input tensor
result	Output tensor (must be pre-allocated with matching shape)
exec_context	Optional execution context for stream control (borrowed, not owned)

Exceptions

std::invalid_argument	If tensor shapes don't match
std::runtime_error	If CUDA operations fail

Note: exec_context must outlive this function call. When exec_context is provided, the caller controls synchronization. When exec_context is null, the default stream is used and the call synchronizes before returning.

Example:

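// Create the execution context; the caller owns it.
auto ctx = std::make_unique<CudaExecutionContext>(0);
// Enqueue the addition on the context's stream; add() borrows ctx and does not synchronize.
add(tensor_a, tensor_b, result, ctx.get());
// Wait for completion before reading result.
ctx->synchronize();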
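
The contract described for add() (matching shapes, pre-allocated result, std::invalid_argument on mismatch) can be illustrated with a host-side reference. This is a minimal sketch, not the CUDA implementation: HostTensor and reference_add are hypothetical names, and checking the result shape alongside the inputs is an assumption based on the parameter description.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical host-side tensor used only for this sketch.
struct HostTensor {
    std::vector<std::size_t> shape;
    std::vector<float> data;
};

// Reference semantics: result[i] = a[i] + b[i]; all three shapes must match.
void reference_add(const HostTensor& a, const HostTensor& b, HostTensor& result) {
    if (a.shape != b.shape || a.shape != result.shape) {
        throw std::invalid_argument("reference_add: tensor shapes do not match");
    }
    // The result must already be allocated; this sketch never resizes it.
    assert(a.data.size() == b.data.size() && a.data.size() == result.data.size());
    for (std::size_t i = 0; i < a.data.size(); ++i) {
        result.data[i] = a.data[i] + b.data[i];
    }
}
```

A reference like this is mainly useful for comparing device results against host results in tests.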

◆ addImpl()

template<TensorDataType TDataType>

void Mila::Dnn::Compute::Cuda::MathOps::addImpl	(	const void *	a_data,
		const void *	b_data,
		void *	result_data,
		size_t	count,
		cudaStream_t	stream,
		int	device_id )

inline static private
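
The *Impl functions take type-erased void pointers plus a data-type template argument. One plausible way such a signature is used, sketched on the host, is to map the template argument to a native element type and cast the pointers back before the element loop. Everything here is an assumption made for illustration: DataTypeTag, NativeType, and add_impl_host are hypothetical, and the real TensorDataType and its type mapping may look different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical enumeration and trait; the real TensorDataType may differ.
enum class DataTypeTag { FP32, INT32 };

template <DataTypeTag Tag> struct NativeType;
template <> struct NativeType<DataTypeTag::FP32>  { using type = float; };
template <> struct NativeType<DataTypeTag::INT32> { using type = std::int32_t; };

// Host stand-in for an *Impl function: buffers arrive as void pointers and
// are cast back to the element type selected by the template argument.
template <DataTypeTag Tag>
void add_impl_host(const void* a_data, const void* b_data,
                   void* result_data, std::size_t count) {
    using T = typename NativeType<Tag>::type;
    const T* a = static_cast<const T*>(a_data);
    const T* b = static_cast<const T*>(b_data);
    T* out = static_cast<T*>(result_data);
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The stream and device_id parameters of the real functions have no host equivalent and are omitted from the sketch.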

◆ divide()

template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>

void Mila::Dnn::Compute::Cuda::MathOps::divide	(	const Tensor< TDataType, TMemoryResource > &	a,
		const Tensor< TDataType, TMemoryResource > &	b,
		Tensor< TDataType, TMemoryResource > &	result,
		IExecutionContext *	exec_context = nullptr )

inline static

Element-wise division of two tensors.

Computes result[i] = a[i] / b[i] for all elements using CUDA kernels. Floating-point division by zero follows IEEE 754.

Template Parameters

TDataType	Abstract tensor data type
TMemoryResource	Memory resource type

Parameters

a	First input tensor (dividend)
b	Second input tensor (divisor)
result	Output tensor (must be pre-allocated with matching shape)
exec_context	Optional execution context for stream control (borrowed, not owned)

Exceptions

std::invalid_argument	If tensor shapes don't match
std::runtime_error	If CUDA operations fail

Note: For floating-point types, division by zero follows IEEE 754: a nonzero value divided by zero yields positive or negative infinity, and 0/0 yields NaN. For integer types, division-by-zero behavior depends on the kernel implementation.
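
The IEEE 754 behavior mentioned in the note can be checked on the host. This is a minimal sketch that assumes float is an IEEE 754 type on the build platform (enforced by the static_assert); ieee_divide is a hypothetical helper, not part of the library.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// The sketch is only meaningful where float follows IEC 559 (IEEE 754).
static_assert(std::numeric_limits<float>::is_iec559,
              "sketch assumes IEEE 754 floats");

// Plain floating-point division; no check for a zero divisor is made.
float ieee_divide(float dividend, float divisor) {
    // volatile keeps the compiler from folding the division at compile time.
    volatile float d = divisor;
    return dividend / d;
}
```

With these semantics, ieee_divide(1.0f, 0.0f) is positive infinity, ieee_divide(-1.0f, 0.0f) is negative infinity, and ieee_divide(0.0f, 0.0f) is NaN, so callers that need to detect such results can test with std::isinf and std::isnan.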

◆ divideImpl()

template<TensorDataType TDataType>

void Mila::Dnn::Compute::Cuda::MathOps::divideImpl	(	const void *	a_data,
		const void *	b_data,
		void *	result_data,
		size_t	count,
		cudaStream_t	stream,
		int	device_id )

inline static private

◆ multiply()

template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>

void Mila::Dnn::Compute::Cuda::MathOps::multiply	(	const Tensor< TDataType, TMemoryResource > &	a,
		const Tensor< TDataType, TMemoryResource > &	b,
		Tensor< TDataType, TMemoryResource > &	result,
		IExecutionContext *	exec_context = nullptr )

inline static

Element-wise multiplication of two tensors.

Computes result[i] = a[i] * b[i] for all elements using CUDA kernels.

Template Parameters

TDataType	Abstract tensor data type
TMemoryResource	Memory resource type

Parameters

a	First input tensor
b	Second input tensor
result	Output tensor (must be pre-allocated with matching shape)
exec_context	Optional execution context for stream control (borrowed, not owned)

◆ multiplyImpl()

template<TensorDataType TDataType>

void Mila::Dnn::Compute::Cuda::MathOps::multiplyImpl	(	const void *	a_data,
		const void *	b_data,
		void *	result_data,
		size_t	count,
		cudaStream_t	stream,
		int	device_id )

inline static private

◆ subtract()

template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>

void Mila::Dnn::Compute::Cuda::MathOps::subtract	(	const Tensor< TDataType, TMemoryResource > &	a,
		const Tensor< TDataType, TMemoryResource > &	b,
		Tensor< TDataType, TMemoryResource > &	result,
		IExecutionContext *	exec_context = nullptr )

inline static

Element-wise subtraction of two tensors.

Computes result[i] = a[i] - b[i] for all elements using CUDA kernels.

Template Parameters

TDataType	Abstract tensor data type
TMemoryResource	Memory resource type

Parameters

a	First input tensor (minuend)
b	Second input tensor (subtrahend)
result	Output tensor (must be pre-allocated with matching shape)
exec_context	Optional execution context for stream control (borrowed, not owned)

◆ subtractImpl()

template<TensorDataType TDataType>

void Mila::Dnn::Compute::Cuda::MathOps::subtractImpl	(	const void *	a_data,
		const void *	b_data,
		void *	result_data,
		size_t	count,
		cudaStream_t	stream,
		int	device_id )

inline static private

◆ sum()

template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>

float Mila::Dnn::Compute::Cuda::MathOps::sum	(	const Tensor< TDataType, TMemoryResource > &	tensor,
		IExecutionContext *	exec_context = nullptr )

inline static

Computes sum of all tensor elements.

Reduces tensor to a single scalar value representing the sum of all elements. Uses optimized CUDA reduction with shared memory and warp primitives.
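
The reduction pattern named above can be emulated on the host to show the data flow. This is a sketch of the general technique (a shuffle-down style warp reduction followed by a combine step across warps), not the kernel used here; warp_reduce_sum, tree_sum, and the fixed warp size of 32 are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kWarpSize = 32;

// Host emulation of a warp-level tree reduction: at each step, lane i adds
// the value held by lane i + offset, and the offset halves until it is 1.
// After the loop, lane 0 holds the sum of all 32 lanes.
float warp_reduce_sum(std::array<float, kWarpSize> lanes) {
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
        for (std::size_t lane = 0; lane < offset; ++lane) {
            lanes[lane] += lanes[lane + offset];
        }
    }
    return lanes[0];
}

// Two-level sum: each 32-element chunk is reduced like a warp, then the
// per-warp partial sums are accumulated, mirroring the role shared memory
// plays when the warps of a block combine their results.
float tree_sum(const std::vector<float>& values) {
    float total = 0.0f;
    for (std::size_t base = 0; base < values.size(); base += kWarpSize) {
        std::array<float, kWarpSize> lanes{};  // missing lanes stay at 0
        for (std::size_t i = 0; i < kWarpSize && base + i < values.size(); ++i) {
            lanes[i] = values[base + i];
        }
        total += warp_reduce_sum(lanes);
    }
    return total;
}
```

A tree reduction adds elements in a different order than a sequential loop, so floating-point results can differ slightly from a naive host sum.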

Template Parameters

TDataType	Abstract tensor data type
TMemoryResource	Memory resource type

Parameters

tensor	Input tensor
exec_context	Optional execution context for stream control (borrowed, not owned)

Returns: Sum of all elements as float

Note: This function always synchronizes before returning, even when exec_context is provided. The result is returned as float for consistency across data types.
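
Returning float has a precision consequence for integer tensors: a 32-bit float has a 24-bit significand, so not every integer above 2^24 is representable. The sketch below shows only the conversion step. How the library accumulates internally is not stated here, and to_float_result and survives_float_round_trip are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>

// Converts an exact integer sum to float, rounding to the nearest
// representable value when the sum does not fit in 24 significant bits.
float to_float_result(std::int64_t exact_sum) {
    return static_cast<float>(exact_sum);
}

// True when the float result converts back to the same integer.
bool survives_float_round_trip(std::int64_t exact_sum) {
    return static_cast<std::int64_t>(to_float_result(exact_sum)) == exact_sum;
}
```

For example, 16777216 (2^24) survives the round trip, while 16777217 is rounded to 16777216.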

◆ sumImpl()

template<TensorDataType TDataType>

float Mila::Dnn::Compute::Cuda::MathOps::sumImpl	(	const void *	tensor_data,
		size_t	count,
		cudaStream_t	stream,
		int	device_id )

inline static private

The documentation for this struct was generated from the following file:

/__w/Mila/Mila/Mila/Src/Dnn/Compute/Devices/Cuda/Tensors/Operations/CudaTensorOps.Math.ixx
