CUDA specialization of TensorOps for mathematical operations.
| template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource> |
| static void | add (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr) |
| | Element-wise addition of two tensors. |
| template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource> |
| static void | divide (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr) |
| | Element-wise division of two tensors. |
| template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource> |
| static void | multiply (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr) |
| | Element-wise multiplication of two tensors. |
| template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource> |
| static void | subtract (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr) |
| | Element-wise subtraction of two tensors. |
| template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource> |
| static float | sum (const Tensor< TDataType, TMemoryResource > &tensor, IExecutionContext *exec_context=nullptr) |
| | Computes sum of all tensor elements. |

| template<TensorDataType TDataType> |
| static void | addImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id) |
| template<TensorDataType TDataType> |
| static void | divideImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id) |
| template<TensorDataType TDataType> |
| static void | multiplyImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id) |
| template<TensorDataType TDataType> |
| static void | subtractImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id) |
| template<TensorDataType TDataType> |
| static float | sumImpl (const void *tensor_data, size_t count, cudaStream_t stream, int device_id) |
Provides CUDA-specific implementations of tensor mathematical operations using optimized device kernels for parallel execution on NVIDIA GPUs. Supports all CUDA-compatible tensor data types with automatic type handling.
Key features:
- Element-wise binary operations (add, subtract, multiply, divide)
- Element-wise unary operations (negate, abs, sqrt)
- Scalar operations (add scalar, multiply scalar)
- Activation functions (ReLU, Sigmoid, Tanh)
- Reduction operations (sum, mean, max, min)
- Stream-based asynchronous execution
- Zero-overhead ExecutionContext borrowing (raw pointer)
- Automatic fallback to the default stream when exec_context is null
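The borrowing and default-stream items above can be pictured with a small host-only mock. This is a sketch under assumptions: `MockContext` and `run_op` are hypothetical stand-ins, not Mila types, and no GPU work is performed. Only the rule itself (a null context means the operation synchronizes before returning, a non-null context leaves synchronization to the caller) is taken from the notes on this page.

```cpp
#include <cassert>

// Hypothetical stand-in for an execution context; it only counts synchronizations.
struct MockContext {
    int sync_count = 0;
    void synchronize() { ++sync_count; }
};

// Hypothetical operation following the documented rule:
//  - non-null context: work is queued on the caller's context, no sync here
//  - null context: a default context is used and synchronized before returning
// Returns true when the operation synchronized internally.
inline bool run_op(MockContext* exec_context = nullptr) {
    static MockContext default_context;  // plays the role of the default stream
    MockContext* ctx = exec_context ? exec_context : &default_context;
    // ... a kernel launch on ctx would happen here ...
    if (exec_context == nullptr) {
        ctx->synchronize();
        return true;
    }
    return false;
}
```

Because the context is passed as a raw pointer, the function never takes ownership; the caller keeps the object alive and decides when to call `synchronize()`.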
◆ add()
template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>
void Mila::Dnn::Compute::Cuda::MathOps::add (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr)
inline static
Element-wise addition of two tensors.
Computes result[i] = a[i] + b[i] for all elements using CUDA kernels. Tensors must have identical shapes. Borrows execution context for stream control with zero overhead.
Template Parameters:
| TDataType | Abstract tensor data type |
| TMemoryResource | Memory resource type |
Parameters:
| a | First input tensor |
| b | Second input tensor |
| result | Output tensor (must be pre-allocated with matching shape) |
| exec_context | Optional execution context for stream control (borrowed, not owned) |
Exceptions:
| std::invalid_argument | If tensor shapes don't match |
| std::runtime_error | If CUDA operations fail |
Note:
- exec_context must outlive this function call.
- When exec_context is provided, the caller controls synchronization.
- When exec_context is null, the default stream is used and the call synchronizes before returning.
Example:
auto ctx = std::make_unique<CudaExecutionContext>(0);
add(tensor_a, tensor_b, result, ctx.get());
ctx->synchronize();
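For checking results, the documented contract of add() can be mirrored on the host. The following is a minimal sketch, not the library's implementation: `HostTensor` and `add_reference` are hypothetical names, the storage is assumed to be a flat float array whose length matches the shape, and the result shape is checked as well because the output is documented as pre-allocated with a matching shape.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical host-side tensor: a shape plus flat storage.
// Assumption: data.size() equals the product of the shape dimensions.
struct HostTensor {
    std::vector<std::size_t> shape;
    std::vector<float> data;
};

// Reference for the documented contract: shapes must be identical,
// result is pre-allocated, and result[i] = a[i] + b[i] for every element.
inline void add_reference(const HostTensor& a, const HostTensor& b, HostTensor& result) {
    if (a.shape != b.shape || a.shape != result.shape) {
        throw std::invalid_argument("add_reference: tensor shapes don't match");
    }
    for (std::size_t i = 0; i < a.data.size(); ++i) {
        result.data[i] = a.data[i] + b.data[i];
    }
}
```

A reference like this is useful in unit tests: run the GPU operation, copy the result back, and compare element by element.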
◆ addImpl()
void Mila::Dnn::Compute::Cuda::MathOps::addImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id)
inline static private
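The private *Impl helpers take untyped pointers plus an element count, with the data type supplied as a template argument. One plausible way such a signature recovers the concrete element type is sketched below. Everything here is an assumption for illustration: `DataTypeTag`, `NativeType`, and `add_impl_sketch` are hypothetical names, the loop runs on the host, and the `stream` and `device_id` parameters are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical enum standing in for an abstract tensor data type.
enum class DataTypeTag { FP32, INT32 };

// Maps each tag to a concrete element type at compile time.
template <DataTypeTag Tag> struct NativeType;
template <> struct NativeType<DataTypeTag::FP32> { using type = float; };
template <> struct NativeType<DataTypeTag::INT32> { using type = std::int32_t; };

// Type-erased entry point in the style of the *Impl signatures:
// the pointers are untyped, and the tag decides how they are interpreted.
template <DataTypeTag Tag>
void add_impl_sketch(const void* a_data, const void* b_data,
                     void* result_data, std::size_t count) {
    using T = typename NativeType<Tag>::type;
    const T* a = static_cast<const T*>(a_data);
    const T* b = static_cast<const T*>(b_data);
    T* result = static_cast<T*>(result_data);
    for (std::size_t i = 0; i < count; ++i) {
        result[i] = a[i] + b[i];
    }
}
```

Keeping the typed code behind a `void*` boundary lets the public template accept any tensor type while the per-type work is instantiated once per data type.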
◆ divide()
template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>
void Mila::Dnn::Compute::Cuda::MathOps::divide (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr)
inline static
Element-wise division of two tensors.
Computes result[i] = a[i] / b[i] for all elements using CUDA kernels. Follows IEEE 754 standards for floating-point division by zero.
Template Parameters:
| TDataType | Abstract tensor data type |
| TMemoryResource | Memory resource type |
Parameters:
| a | First input tensor (dividend) |
| b | Second input tensor (divisor) |
| result | Output tensor (must be pre-allocated with matching shape) |
| exec_context | Optional execution context for stream control (borrowed, not owned) |
Exceptions:
| std::invalid_argument | If tensor shapes don't match |
| std::runtime_error | If CUDA operations fail |
Note:
- For floating-point types, division by zero produces infinity or NaN per IEEE 754.
- For integer types, division by zero behavior depends on the kernel implementation.
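The floating-point note can be made concrete with a host-side check. This is a sketch only: `ieee_divide` is a hypothetical helper, and it assumes the host compiler uses IEEE 754 floats (checked by the `static_assert`) and is not built with value-changing flags such as fast-math. Integer division by zero is deliberately not shown, since the note leaves that behavior to the kernel.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 single-precision floats");

// Plain float division. Under IEEE 754 arithmetic a zero divisor does not
// trap: a nonzero dividend yields a signed infinity, and 0 / 0 yields NaN.
inline float ieee_divide(float dividend, float divisor) {
    return dividend / divisor;
}
```

Under these assumptions, `ieee_divide(1.0f, 0.0f)` is positive infinity, `ieee_divide(-1.0f, 0.0f)` is negative infinity, and `ieee_divide(0.0f, 0.0f)` is NaN, so callers that need finite output should screen the divisor tensor first.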
◆ divideImpl()
void Mila::Dnn::Compute::Cuda::MathOps::divideImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id)
inline static private
◆ multiply()
template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>
void Mila::Dnn::Compute::Cuda::MathOps::multiply (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr)
inline static
Element-wise multiplication of two tensors.
Computes result[i] = a[i] * b[i] for all elements using CUDA kernels.
Template Parameters:
| TDataType | Abstract tensor data type |
| TMemoryResource | Memory resource type |
Parameters:
| a | First input tensor |
| b | Second input tensor |
| result | Output tensor (must be pre-allocated with matching shape) |
| exec_context | Optional execution context for stream control (borrowed, not owned) |
◆ multiplyImpl()
void Mila::Dnn::Compute::Cuda::MathOps::multiplyImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id)
inline static private
◆ subtract()
template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>
void Mila::Dnn::Compute::Cuda::MathOps::subtract (const Tensor< TDataType, TMemoryResource > &a, const Tensor< TDataType, TMemoryResource > &b, Tensor< TDataType, TMemoryResource > &result, IExecutionContext *exec_context=nullptr)
inline static
Element-wise subtraction of two tensors.
Computes result[i] = a[i] - b[i] for all elements using CUDA kernels.
Template Parameters:
| TDataType | Abstract tensor data type |
| TMemoryResource | Memory resource type |
Parameters:
| a | First input tensor (minuend) |
| b | Second input tensor (subtrahend) |
| result | Output tensor (must be pre-allocated with matching shape) |
| exec_context | Optional execution context for stream control (borrowed, not owned) |
◆ subtractImpl()
void Mila::Dnn::Compute::Cuda::MathOps::subtractImpl (const void *a_data, const void *b_data, void *result_data, size_t count, cudaStream_t stream, int device_id)
inline static private
◆ sum()
template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>
float Mila::Dnn::Compute::Cuda::MathOps::sum (const Tensor< TDataType, TMemoryResource > &tensor, IExecutionContext *exec_context=nullptr)
inline static
Computes sum of all tensor elements.
Reduces tensor to a single scalar value representing the sum of all elements. Uses optimized CUDA reduction with shared memory and warp primitives.
Template Parameters:
| TDataType | Abstract tensor data type |
| TMemoryResource | Memory resource type |
Parameters:
| tensor | Input tensor |
| exec_context | Optional execution context for stream control (borrowed, not owned) |
Returns:
Sum of all elements as float
Note:
- Always synchronizes before returning, even when exec_context is provided, because the scalar result must be available on the host.
- The result is returned as float for consistency across data types.
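The reduction pattern mentioned above (partial sums combined pairwise, as threads would do in shared memory) can be illustrated with a host-side analogue. This is a sketch under assumptions, not the library's kernel: `tree_sum_as_float` is a hypothetical name, it runs sequentially on the CPU, converts every element to float up front because the documented return type is float, and returns 0 for an empty input, which this page does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side analogue of a block-level tree reduction: each pass folds the
// upper half of the active partial sums onto the lower half, so the number
// of active elements is roughly halved until one value remains.
template <typename T>
float tree_sum_as_float(const std::vector<T>& values) {
    if (values.empty()) {
        return 0.0f;  // assumption: empty input sums to zero
    }
    std::vector<float> partial(values.begin(), values.end());
    for (std::size_t active = partial.size(); active > 1; active = (active + 1) / 2) {
        const std::size_t half = (active + 1) / 2;
        for (std::size_t i = 0; i + half < active; ++i) {
            partial[i] += partial[i + half];
        }
    }
    return partial[0];
}
```

Since accumulation happens in float, integer inputs whose sum exceeds 2^24 may lose low-order digits; callers that need exact integer totals should keep that in mind.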
◆ sumImpl()
float Mila::Dnn::Compute::Cuda::MathOps::sumImpl (const void *tensor_data, size_t count, cudaStream_t stream, int device_id)
inline static private