CUDA specialization of TensorOps for initialization operations.

Public Types
template<TensorDataType TDataType>
using	host_value_t

Static Public Member Functions
template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource>
static void	fill (Tensor< TDataType, TMemoryResource > &tensor, host_value_t< TDataType > host_value, IExecutionContext *exec_context=nullptr)
	Fill tensor with scalar host value using CUDA kernels.
template<TensorDataType TDataType, typename TMemoryResource> requires isValidTensor<TDataType, TMemoryResource>
static void	fill (Tensor< TDataType, TMemoryResource > &tensor, std::span< const host_value_t< TDataType > > host_values, IExecutionContext *exec_context=nullptr)
	Fill tensor with array of host values using CUDA kernels.

Detailed Description

CUDA specialization of TensorOps for initialization operations.

Provides CUDA-specific implementations of tensor fill operations using optimized device kernels for parallel execution on NVIDIA GPUs. Supports all CUDA-compatible tensor data types with automatic type conversion and quantization from host representations.

Key features:

Asynchronous kernel execution using CUDA streams
Zero-overhead borrowing of ExecutionContext (raw pointer semantics)
Automatic fallback to default stream when no context provided
Memory-efficient chunked processing for large arrays
Automatic host-to-device type conversion in kernels
Compile-time type dispatch for zero runtime overhead
Support for FP32, FP16, BF16, FP8, and integer types

Member Typedef Documentation

◆ host_value_t

template<TensorDataType TDataType>

using Mila::Dnn::Compute::Cuda::FillOps::host_value_t

Initial value:

std::conditional_t<TensorDataTypeTraits<TDataType>::is_integer_type, int32_t, float>

Mila::Dnn::TensorDataTypeTraits: compile-time traits for TensorDataType enumeration values (defined at TensorDataTypeTraits.ixx:46).
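
The alias selects the host-side type at compile time. A minimal sketch of the same pattern follows; DemoDataType and DemoTraits are hypothetical stand-ins for TensorDataType and TensorDataTypeTraits, whose real definitions are not shown on this page and may differ.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the library's TensorDataType enumeration.
enum class DemoDataType { FP32, FP16, INT8, INT32 };

// Hypothetical stand-in for TensorDataTypeTraits: only the one member
// the alias needs is modelled here.
template <DemoDataType T>
struct DemoTraits {
    static constexpr bool is_integer_type =
        (T == DemoDataType::INT8 || T == DemoDataType::INT32);
};

// Same shape as the documented alias: integer tensor types take an int32_t
// host value, every other type takes a float host value.
template <DemoDataType T>
using demo_host_value_t =
    std::conditional_t<DemoTraits<T>::is_integer_type, std::int32_t, float>;

// The selection is resolved entirely at compile time.
static_assert(std::is_same_v<demo_host_value_t<DemoDataType::FP16>, float>);
static_assert(std::is_same_v<demo_host_value_t<DemoDataType::INT8>, std::int32_t>);
```

Under this sketch, a caller filling an FP16 tensor passes a plain float, and a caller filling an INT8 tensor passes an int32_t; narrowing to the device type is left to the kernel.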

Member Function Documentation

◆ fill() [1/2]

template<TensorDataType TDataType, typename TMemoryResource>
requires isValidTensor<TDataType, TMemoryResource>

void Mila::Dnn::Compute::Cuda::FillOps::fill	(	Tensor< TDataType, TMemoryResource > &	tensor,
		host_value_t< TDataType >	host_value,
		IExecutionContext *	exec_context = nullptr )

inline static

Fill tensor with scalar host value using CUDA kernels.

Broadcasts a single host scalar value to all elements of a CUDA device tensor using constant fill kernels. No temporary device memory is required because conversion to the device type happens directly in the kernel. The execution context is borrowed through a raw pointer for stream control; the function never takes ownership of it.

Implementation:

Integer types: Use int32_t host representation with kernel conversion
Float types: Use float host representation with kernel conversion
Grid-stride loop kernels for scalability across tensor sizes
Asynchronous execution via provided or default CUDA stream
Compile-time type dispatch based on tensor data type
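
The grid-stride pattern named in the list above can be illustrated on the CPU. The sketch below is an emulation, not the library's kernel: emulated_fill and its parameters are hypothetical, and the outer loop plays the role of the GPU threads that would run concurrently on a device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU emulation of a grid-stride constant fill (hypothetical helper).
// On a device, `tid` would be blockIdx.x * blockDim.x + threadIdx.x and
// `stride` would be gridDim.x * blockDim.x.
// Returns the number of element writes performed.
template <typename TDevice, typename THost>
std::size_t emulated_fill(std::vector<TDevice>& data, THost host_value,
                          std::size_t grid_dim, std::size_t block_dim) {
    const std::size_t stride = grid_dim * block_dim;  // total simulated threads
    std::size_t writes = 0;
    for (std::size_t tid = 0; tid < stride; ++tid) {  // one pass per "thread"
        // Each thread starts at its own index and jumps by the full grid
        // size, so a fixed launch size covers a tensor of any length.
        for (std::size_t i = tid; i < data.size(); i += stride) {
            // Host-to-device conversion happens per element, inside the loop,
            // so no converted staging buffer is needed.
            data[i] = static_cast<TDevice>(host_value);
            ++writes;
        }
    }
    return writes;
}
```

For example, emulated_fill(values, std::int32_t{7}, 2, 32) on a std::vector<std::int8_t> of 1000 elements uses 64 simulated threads and writes each element exactly once.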

Template Parameters

TDataType	Abstract tensor data type
TMemoryResource	Memory resource type

Parameters

tensor	Destination CUDA device tensor to fill
host_value	Scalar value in canonical host representation
exec_context	Optional execution context for stream control (borrowed, not owned)

Note:

Host value is automatically converted to the device-native type
Uses the CUDA stream from exec_context if provided, the default stream otherwise
When using the default stream, synchronizes before returning
When exec_context is provided, the caller controls synchronization
Optimized for constant broadcasts: no temporary memory allocation
exec_context must outlive this function call

Example:

// With explicit context (caller manages sync)
auto ctx = std::make_unique<CudaExecutionContext>(0);
fill(tensor1, 0.0f, ctx.get());
fill(tensor2, 1.0f, ctx.get());
ctx->synchronize();
 
// Without context (automatic sync)
fill(tensor, 3.14f);  // Uses default stream, returns after sync
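
The two synchronization paths described in the notes can be modelled without a GPU. The following is a sketch under stated assumptions: MockStream, MockContext, default_stream, and mock_fill are hypothetical names, and the real IExecutionContext interface may look quite different.

```cpp
#include <cassert>

// Hypothetical mock of a CUDA stream: `pending` counts launched work that
// has not yet been synchronized.
struct MockStream {
    int pending = 0;
    void synchronize() { pending = 0; }
};

// Hypothetical mock of an execution context that owns one stream.
struct MockContext {
    MockStream stream;
    void synchronize() { stream.synchronize(); }
};

// Stand-in for the process-wide default stream.
inline MockStream& default_stream() {
    static MockStream s;
    return s;
}

// Borrows `ctx`: the pointer is used for the duration of the call only and
// is never stored or deleted. Returns true if the call synchronized itself.
inline bool mock_fill(MockContext* ctx = nullptr) {
    MockStream& stream = ctx ? ctx->stream : default_stream();
    ++stream.pending;          // stands in for an asynchronous kernel launch
    if (ctx == nullptr) {
        stream.synchronize();  // default-stream path: block before returning
        return true;
    }
    return false;              // context path: the caller decides when to sync
}
```

In this model, two calls with the same context leave two operations pending until the caller invokes ctx.synchronize(), while a call without a context returns with nothing pending.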


◆ fill() [2/2]