Mila 0.13.48
Deep Neural Network Library
CUDA implementation of the Lpe (token + positional embedding) operation.


Public Types | |
| using | ConfigType = LpeConfig |
| using | CudaExecutionContext = ExecutionContext<DeviceType::Cuda> |
| using | MR = CudaDeviceMemoryResource |
| using | NativeType = typename Mila::Dnn::Compute::Cuda::TensorDataTypeMap<TPrecision>::device_type |
| using | TensorType = Tensor<TPrecision, MR> |
| using | UnaryOperationBase = UnaryOperation<DeviceType::Cuda, TInput, TPrecision> |
| Public Types inherited from Mila::Dnn::Compute::UnaryOperation< DeviceType::Cuda, TInput, TInput > | |
| using | MR |
| using | TensorInputType |
| using | TensorOutputType |
| Public Types inherited from Mila::Dnn::Compute::Operation< TDeviceType, TPrecision > | |
| using | DataTypeTraits |
Public Member Functions | |
| CudaLpeOp (IExecutionContext *context, const LpeConfig &config) | |
| void | backward (const ITensor &input, const ITensor &output_grad, ITensor &input_grad) const override |
| Backward pass accumulating gradients into wte and wpe (hot path). | |
| void | build (const BuildContext &config) override |
| Prepare the operation for a concrete input shape (cold path). | |
| void | decode (const ITensor &input, ITensor &output, int position) override |
| Chunked prefill with explicit position offset. | |
| void | forward (const ITensor &input, ITensor &output) const override |
| Full-sequence forward pass (hot path). | |
| std::string | getName () const override |
| Human-readable operation name. | |
| OperationType | getOperationType () const override |
| Operation type identifier. | |
| void | setGradients (ITensor *wte_grad, ITensor *wpe_grad) override |
| Bind wte and wpe gradient tensors for training (module retains ownership). | |
| void | setParameters (ITensor *wte, ITensor *wpe) override |
| Bind wte and wpe parameter tensors (module retains ownership). | |
| Public Member Functions inherited from Mila::Dnn::Compute::UnaryOperation< DeviceType::Cuda, TInput, TInput > | |
| virtual | ~UnaryOperation ()=default |
| Public Member Functions inherited from Mila::Dnn::Compute::Operation< TDeviceType, TPrecision > | |
| virtual | ~Operation ()=default |
| virtual void | clearGradients () noexcept |
| Clear any cached gradient pointers held by the operation. | |
| virtual TensorDataType | getDataType () const |
| Tensor data type for this operation. | |
| virtual DeviceType | getDeviceType () const |
| Device type for this operation. | |
| virtual std::size_t | getStateMemorySize () const |
| Returns the number of bytes of state memory allocated by this operation. | |
| virtual bool | isBuilt () const |
| Whether build() completed successfully for a concrete input shape. | |
| virtual bool | isEvalMode () const |
| Query whether the operation is in evaluation (non-training) mode. | |
| virtual void | setTrainingMode (TrainingMode training_mode) |
| Configure operation training-mode behavior. | |
| Public Member Functions inherited from Mila::Dnn::Compute::IPositionalDecode | |
| virtual | ~IPositionalDecode ()=default |
Private Member Functions | |
| void | validateInputShape (const shape_t &input_shape) const |
Private Attributes | |
| int | batch_size_ { 0 } |
| LpeConfig | config_ |
| CudaExecutionContext * | context_ |
| int | embedding_dim_ { 0 } |
| int | seq_length_ { 0 } |
| NativeType * | wpe_ { nullptr } |
| int | wpe_embedding_dim_ { 0 } |
| NativeType * | wpe_grad_ { nullptr } |
| int | wpe_max_seq_len_ { 0 } |
| NativeType * | wte_ { nullptr } |
| int | wte_embedding_dim_ { 0 } |
| NativeType * | wte_grad_ { nullptr } |
| int | wte_vocab_size_ { 0 } |
Additional Inherited Members | |
| Static Public Attributes inherited from Mila::Dnn::Compute::Operation< TDeviceType, TPrecision > | |
| static constexpr TensorDataType | data_type |
| static constexpr DeviceType | device_type |
| Static Protected Member Functions inherited from Mila::Dnn::Compute::UnaryOperation< DeviceType::Cuda, TInput, TInput > | |
| static const TensorInputType & | asInputTensor (const ITensor &t) |
| static TensorOutputType & | asOutputTensor (ITensor &t) |
| Protected Attributes inherited from Mila::Dnn::Compute::Operation< TDeviceType, TPrecision > | |
| bool | is_built_ |
| TrainingMode | training_mode_ |
CUDA implementation of the Lpe (token + positional embedding) operation.
Combines a token embedding lookup (wte) with a positional embedding lookup (wpe) on CUDA devices, supporting FP32 and FP16 precision.
Design:
Template Parameters
| TInput | Data type of token index input (typically INT32). |
| TPrecision | Precision of embedding output (FP32 or FP16). |
| using Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::ConfigType = LpeConfig |
| using Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::CudaExecutionContext = ExecutionContext<DeviceType::Cuda> |
| using Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::MR = CudaDeviceMemoryResource |
| using Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::NativeType = typename Mila::Dnn::Compute::Cuda::TensorDataTypeMap<TPrecision>::device_type |
| using Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::TensorType = Tensor<TPrecision, MR> |
| using Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::UnaryOperationBase = UnaryOperation<DeviceType::Cuda, TInput, TPrecision> |
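| Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::CudaLpeOp (IExecutionContext *context, const LpeConfig &config) | inline |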
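| void Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::backward (const ITensor &input, const ITensor &output_grad, ITensor &input_grad) const | inline override virtual |
Backward pass accumulating gradients into wte and wpe (hot path).
Token indices are non-differentiable; input_grad is unused.
Parameters
| input | Token indices used in forward [B, T] (INT32). |
| output_grad | Upstream embedding gradient [B, T, C]. |
| input_grad | Unused (non-differentiable input). |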
Implements Mila::Dnn::Compute::UnaryOperation< DeviceType::Cuda, TInput, TInput >.
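The accumulation follows from differentiating the forward formula output[b,t,:] = wte[X[b,t],:] + wpe[t,:]: each upstream gradient row is added to one wte row and one wpe row. The sketch below shows that on the CPU. It is not the library's CUDA kernel; the function name `lpe_backward_reference` and the flat row-major `std::vector<float>` buffers are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not part of Mila).
// X is [B, T], dout is [B, T, C], wte_grad is [V, C], wpe_grad is [maxT, C],
// all flat and row-major. Gradients are accumulated (+=), never overwritten.
void lpe_backward_reference(const std::vector<int>& X,
                            const std::vector<float>& dout,
                            std::vector<float>& wte_grad,
                            std::vector<float>& wpe_grad,
                            std::size_t B, std::size_t T, std::size_t C) {
    assert(X.size() == B * T);
    assert(dout.size() == B * T * C);
    for (std::size_t b = 0; b < B; ++b) {
        for (std::size_t t = 0; t < T; ++t) {
            const std::size_t token = static_cast<std::size_t>(X[b * T + t]);
            for (std::size_t c = 0; c < C; ++c) {
                const float g = dout[(b * T + t) * C + c];
                wte_grad[token * C + c] += g;  // row chosen by the token id
                wpe_grad[t * C + c] += g;      // row chosen by the position
            }
        }
    }
}
```

Because the same token or position can appear in several (b, t) slots, a parallel implementation has to resolve concurrent writes to the same row; how the CUDA kernel does this is not stated in this page.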
| void Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::build (const BuildContext &config) | inline override virtual |
Prepare the operation for a concrete input shape (cold path).
Validates parameters, caches B, T, and C for hot-path dispatch, and verifies that the sequence length fits within the positional embedding table. Must be called after setParameters() and before forward(), backward(), or decode().
Parameters
| input_shape | Token index input shape [B, T]. |
Exceptions
| std::runtime_error | if parameters are not bound. |
| std::invalid_argument | if input shape is invalid or sequence length exceeds the positional embedding capacity. |
Reimplemented from Mila::Dnn::Compute::Operation< TDeviceType, TPrecision >.
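The checks described above can be sketched as a small standalone function. Everything here is hypothetical: the names `LpeDimsSketch` and `validate_build_reference`, the `parameters_bound` flag, and the reading of "invalid shape" as "not rank 2" are assumptions, not the library's code. Only the exception types and their triggering conditions come from the documentation above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical holder for the dimensions cached for hot-path dispatch.
struct LpeDimsSketch {
    std::size_t batch_size;
    std::size_t seq_length;
    std::size_t embedding_dim;
};

// Hypothetical sketch of the build-time validation order.
LpeDimsSketch validate_build_reference(const std::vector<std::size_t>& input_shape,
                                       bool parameters_bound,
                                       std::size_t max_seq_len,
                                       std::size_t embedding_dim) {
    if (!parameters_bound) {
        throw std::runtime_error("build: parameters must be bound first");
    }
    assert(embedding_dim > 0);  // assumed to be checked when parameters were bound
    if (input_shape.size() != 2) {
        throw std::invalid_argument("build: expected input shape [B, T]");
    }
    if (input_shape[1] > max_seq_len) {
        throw std::invalid_argument("build: T exceeds positional embedding capacity");
    }
    return LpeDimsSketch{input_shape[0], input_shape[1], embedding_dim};
}
```

Doing these checks once in the cold path is what lets the hot-path calls skip them.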
| void Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::decode (const ITensor &input, ITensor &output, int position) | inline override virtual |
Chunked prefill with explicit position offset.
Computes output[b,t,:] = wte[X[b,t],:] + wpe[position_offset + t,:] by shifting the wpe base pointer before calling the standard forward kernel. No dedicated prefill kernel is needed.
Parameters
| input | Token indices [B, T] (INT32). |
| output | Pre-allocated embeddings [B, T, C]. |
| position_offset | Absolute position of the first token in this chunk. |
Single-token decode with an explicit sequence position (hot path).
Computes output[b,:] = wte[X[b,0],:] + wpe[position,:] for each batch element. The dispatch implementation shifts the wpe pointer to the row at index position and calls the forward kernel with T=1, so no dedicated decode kernel is required.
Parameters
| input | Single-token indices [B, 1] (INT32). |
| output | Pre-allocated output buffer [B, 1, C]. |
| position | Zero-based absolute sequence position for the wpe lookup. |
Implements Mila::Dnn::Compute::IPositionalDecode.
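The pointer-shifting idea can be illustrated on the CPU: advancing the wpe base pointer by `position_offset * C` elements makes row 0 of the shifted table equal to row `position_offset` of the original, so an unmodified forward routine produces the offset result. The function names, raw-pointer signatures, and the bounds assertion below are assumptions for illustration, not the library's dispatch code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the forward kernel: out[b,t,:] = wte[X[b,t],:] + wpe[t,:].
void forward_kernel_reference(float* out, const int* X,
                              const float* wte, const float* wpe,
                              std::size_t B, std::size_t T, std::size_t C) {
    for (std::size_t b = 0; b < B; ++b)
        for (std::size_t t = 0; t < T; ++t) {
            const std::size_t token = static_cast<std::size_t>(X[b * T + t]);
            for (std::size_t c = 0; c < C; ++c)
                out[(b * T + t) * C + c] = wte[token * C + c] + wpe[t * C + c];
        }
}

// Hypothetical wrapper: shift the wpe base pointer, then reuse the forward routine.
// T == 1 corresponds to single-token decode; T > 1 to a prefill chunk.
void forward_at_offset_reference(float* out, const int* X,
                                 const float* wte, const float* wpe,
                                 std::size_t B, std::size_t T, std::size_t C,
                                 std::size_t position_offset, std::size_t max_seq_len) {
    assert(position_offset + T <= max_seq_len);  // stay inside the wpe table
    forward_kernel_reference(out, X, wte, wpe + position_offset * C, B, T, C);
}
```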
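| void Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::forward (const ITensor &input, ITensor &output) const | inline override virtual |
Full-sequence forward pass (hot path).
For each (b, t): output[b,t,:] = wte[X[b,t],:] + wpe[t,:].
Parameters
| input | Token indices [B, T] (INT32). |
| output | Pre-allocated embeddings [B, T, C]. |
Exceptions
| std::runtime_error | if the input shape exceeds the built maximum. |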
Implements Mila::Dnn::Compute::UnaryOperation< DeviceType::Cuda, TInput, TInput >.
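As a point of reference for the formula above, the following CPU sketch computes the same result on host memory. It is not the library's CUDA kernel; the function name `lpe_forward_reference` and the flat row-major `std::vector<float>` layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for output[b,t,:] = wte[X[b,t],:] + wpe[t,:].
// X is [B, T], wte is [V, C], wpe is [maxT, C], all flat and row-major.
std::vector<float> lpe_forward_reference(const std::vector<int>& X,
                                         const std::vector<float>& wte,
                                         const std::vector<float>& wpe,
                                         std::size_t B, std::size_t T, std::size_t C) {
    assert(X.size() == B * T);
    std::vector<float> out(B * T * C);
    for (std::size_t b = 0; b < B; ++b) {
        for (std::size_t t = 0; t < T; ++t) {
            const std::size_t token = static_cast<std::size_t>(X[b * T + t]);
            for (std::size_t c = 0; c < C; ++c) {
                // Token row plus position row, element by element.
                out[(b * T + t) * C + c] = wte[token * C + c] + wpe[t * C + c];
            }
        }
    }
    return out;
}
```

A host-side reference like this can be useful for checking a device kernel's output on small inputs.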
| std::string Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::getName () const | inline override virtual |
Human-readable operation name.
Implements Mila::Dnn::Compute::Operation< TDeviceType, TPrecision >.
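| OperationType Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::getOperationType () const | inline override virtual |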
Operation type identifier.
Implements Mila::Dnn::Compute::Operation< TDeviceType, TPrecision >.
| void Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::setGradients (ITensor *wte_grad, ITensor *wpe_grad) | inline override virtual |
Bind wte and wpe gradient tensors for training (module retains ownership).
Parameters
| wte_grad | Gradient buffer for wte: CUDA tensor of shape [vocab_size, C]. |
| wpe_grad | Gradient buffer for wpe: CUDA tensor of shape [max_seq_len, C]. |
Exceptions
| std::invalid_argument | on null or non-CUDA tensors. |
Reimplemented from Mila::Dnn::Compute::Operation< TDeviceType, TPrecision >.
| void Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::setParameters (ITensor *wte, ITensor *wpe) | inline override virtual |
Bind wte and wpe parameter tensors (module retains ownership).
Caches native device pointers and validates tensor shapes against the configuration. Must be called before build().
Parameters
| wte | Token embedding table: CUDA tensor of shape [vocab_size, C]. |
| wpe | Positional embedding table: CUDA tensor of shape [max_seq_len, C]. |
Exceptions
| std::invalid_argument | on null, non-CUDA, or shape-mismatched tensors. |
Reimplemented from Mila::Dnn::Compute::Operation< TDeviceType, TPrecision >.
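| void Mila::Dnn::Compute::Cuda::Lpe::CudaLpeOp< TInput, TPrecision >::validateInputShape (const shape_t &input_shape) const | inline private |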
