|
Mila 0.13.48
Deep Neural Network Library
|
KV-cache inference interface for packed-QKV MHA backends. More...


Public Member Functions | |
| ~IPackedKvInference () override=default | |
| virtual void | decode (const ITensor &input, ITensor &output, int position)=0 |
| Process a single autoregressive token against the KV cache. | |
| virtual void | prefill (const ITensor &qkv, ITensor &output)=0 |
| Populate the KV cache from a packed QKV sequence and compute output. | |
| Public Member Functions inherited from Mila::Dnn::Compute::IKvCacheLifecycle | |
| virtual | ~IKvCacheLifecycle ()=default |
| virtual void | initializeKvCache (int batch_size, int max_sequence_length)=0 |
| Allocate the KV cache for a given batch size and maximum sequence length. | |
| virtual void | resetKvCache ()=0 |
| Reset the KV cache to an empty state, preserving the allocation. | |
KV-cache inference interface for packed-QKV MHA backends.
Implemented by GPT-style MHA backends (e.g. CudaMultiHeadAttentionOp). Uses fused QKV input throughout — Q, K, and V are concatenated along the feature axis and split internally by the backend kernel.
Position is implicit: GPT-style MHA always begins prefill at position 0. Absolute positional encoding is handled upstream by Lpe, not inside attention.
Two-phase inference protocol: prefill — populate the KV cache from the full prompt sequence. decode — process one autoregressive token against the accumulated cache.
|
overridedefault |
|
pure virtual |
Process a single autoregressive token against the KV cache.
| input | Packed QKV single-token input [B, 1, 3 * embedding_dim]. |
| output | Pre-allocated output [B, 1, embedding_dim]. |
| position | Zero-based absolute sequence position into the KV cache. |
Implemented in Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TPrecision >, Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TensorDataType::BF16 >, and Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TensorDataType::FP32 >.
|
pure virtual |
Populate the KV cache from a packed QKV sequence and compute output.
| qkv | Packed QKV input [B, T, 3 * embedding_dim]. |
| output | Pre-allocated attention output [B, T, embedding_dim]. |
Implemented in Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TPrecision >, Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TensorDataType::BF16 >, and Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TensorDataType::FP32 >.