Mila 0.13.48
Deep Neural Network Library
Capability interface for KV-cache state management.

Public Member Functions
| virtual | ~IKvCacheLifecycle () = default | |
| virtual void | initializeKvCache (int batch_size, int max_sequence_length) = 0 | Allocate the KV cache for a given batch size and maximum sequence length. |
| virtual void | resetKvCache () = 0 | Reset the KV cache to an empty state, preserving the allocation. |
Implemented by attention operations (GQA, MHA) that allocate and maintain key/value caches across autoregressive decode steps. This concern is orthogonal to positional dispatch: an operation may implement both IPositionalUnaryOp and IKvCacheLifecycle.
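The following is a minimal sketch of how such a capability interface might be declared and discovered at run time. The `IKvCacheLifecycle` declaration mirrors the three members listed on this page. Everything else is an assumption made for illustration: the `OperationBase` root class, the `ToyAttentionOp` and `ToyActivationOp` types, the empty stand-in for `IPositionalUnaryOp` (its real members are not shown here), and the `tryInitializeKvCache` helper.

```cpp
#include <cassert>

namespace capability_sketch {

// Mirrors the members documented on this page.
class IKvCacheLifecycle {
public:
    virtual ~IKvCacheLifecycle() = default;
    virtual void initializeKvCache(int batch_size, int max_sequence_length) = 0;
    virtual void resetKvCache() = 0;
};

// Placeholder only: the real IPositionalUnaryOp members are not shown here.
class IPositionalUnaryOp {
public:
    virtual ~IPositionalUnaryOp() = default;
};

// Hypothetical polymorphic root that a caller holds operations through.
class OperationBase {
public:
    virtual ~OperationBase() = default;
};

// Hypothetical attention op that opts in to both capabilities.
class ToyAttentionOp : public OperationBase,
                       public IPositionalUnaryOp,
                       public IKvCacheLifecycle {
public:
    void initializeKvCache(int batch_size, int max_sequence_length) override {
        assert(batch_size > 0 && max_sequence_length > 0);
        initialized_ = true;
    }
    void resetKvCache() override {}
    bool initialized() const { return initialized_; }

private:
    bool initialized_ = false;
};

// Hypothetical op with no KV cache; it simply does not inherit the interface.
class ToyActivationOp : public OperationBase {};

// Initializes the cache only when the operation exposes the capability.
// Returns true if the capability was found.
inline bool tryInitializeKvCache(OperationBase& op, int batch_size,
                                 int max_sequence_length) {
    if (auto* kv = dynamic_cast<IKvCacheLifecycle*>(&op)) {
        kv->initializeKvCache(batch_size, max_sequence_length);
        return true;
    }
    return false;
}

}  // namespace capability_sketch
```

Because the cache lifecycle lives in its own interface, code that drives decoding can query for it without knowing the concrete attention type, and operations without a cache are skipped rather than forced to provide empty overrides.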
~IKvCacheLifecycle () | virtual, default |
void initializeKvCache (int batch_size, int max_sequence_length) | pure virtual |
Allocate the KV cache for a given batch size and maximum sequence length.
Parameters
| batch_size | Number of sequences in the batch. |
| max_sequence_length | Maximum number of tokens the cache must hold. |
Implemented in Mila::Dnn::Compute::Cuda::Gqa::CudaGqaOp< TPrecision >, Mila::Dnn::Compute::Cuda::Gqa::CudaGqaOp< TensorDataType::BF16 >, Mila::Dnn::Compute::Cuda::Gqa::CudaGqaOp< TensorDataType::FP32 >, Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TPrecision >, Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TensorDataType::BF16 >, and Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TensorDataType::FP32 >.
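To give a feel for what the two parameters imply for memory, here is a rough sizing sketch. It assumes a common layout in which keys and values are stored as two separate tensors of shape `[batch_size, num_kv_heads, max_sequence_length, head_dim]`. That layout, the `num_kv_heads` and `head_dim` parameters, and both helper functions are hypothetical; this page does not describe how the implementing classes lay out their caches.

```cpp
#include <cassert>
#include <cstddef>

namespace kv_sizing_sketch {

// Total element count for K and V together, under the assumed layout
// [batch_size, num_kv_heads, max_sequence_length, head_dim] per tensor.
inline std::size_t kvCacheElementCount(int batch_size, int max_sequence_length,
                                       int num_kv_heads, int head_dim) {
    assert(batch_size > 0 && max_sequence_length > 0);
    assert(num_kv_heads > 0 && head_dim > 0);
    const std::size_t per_tensor = static_cast<std::size_t>(batch_size)
                                 * static_cast<std::size_t>(num_kv_heads)
                                 * static_cast<std::size_t>(max_sequence_length)
                                 * static_cast<std::size_t>(head_dim);
    return 2 * per_tensor;  // one tensor for keys, one for values
}

// Byte size for a given element width (2 for BF16, 4 for FP32).
inline std::size_t kvCacheBytes(std::size_t element_count,
                                std::size_t bytes_per_element) {
    return element_count * bytes_per_element;
}

}  // namespace kv_sizing_sketch
```

Under these assumptions, a batch of 2 sequences with a maximum length of 1024 tokens, 8 KV heads, and a head dimension of 64 needs 2,097,152 elements per attention operation: 4 MiB in BF16 or 8 MiB in FP32.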
void resetKvCache () | pure virtual |
Reset the KV cache to an empty state, preserving the allocation.
Implemented in Mila::Dnn::Compute::Cuda::Gqa::CudaGqaOp< TPrecision >, Mila::Dnn::Compute::Cuda::Gqa::CudaGqaOp< TensorDataType::BF16 >, Mila::Dnn::Compute::Cuda::Gqa::CudaGqaOp< TensorDataType::FP32 >, Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TPrecision >, Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TensorDataType::BF16 >, and Mila::Dnn::Compute::Cuda::MultiHeadAttention::CudaMultiHeadAttentionOp< TensorDataType::FP32 >.
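The sketch below illustrates one way "reset to empty, preserving the allocation" could behave, using a host-side toy rather than the CUDA implementations listed above. The `HostKvCache` class, its constructor parameters, and the `advance`, `cachedLength`, `capacity`, and `keyData` helpers are all hypothetical. Only the two overridden method signatures come from this page.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace kv_reset_sketch {

// Local copy of the interface so the example is self-contained.
class IKvCacheLifecycle {
public:
    virtual ~IKvCacheLifecycle() = default;
    virtual void initializeKvCache(int batch_size, int max_sequence_length) = 0;
    virtual void resetKvCache() = 0;
};

// Hypothetical host-side cache; real implementations would use device memory.
class HostKvCache : public IKvCacheLifecycle {
public:
    HostKvCache(int num_kv_heads, int head_dim)
        : num_kv_heads_(num_kv_heads), head_dim_(head_dim) {}

    void initializeKvCache(int batch_size, int max_sequence_length) override {
        assert(batch_size > 0 && max_sequence_length > 0);
        const std::size_t n = static_cast<std::size_t>(batch_size)
                            * static_cast<std::size_t>(max_sequence_length)
                            * static_cast<std::size_t>(num_kv_heads_)
                            * static_cast<std::size_t>(head_dim_);
        keys_.assign(n, 0.0f);    // allocate (or reallocate) storage
        values_.assign(n, 0.0f);
        max_sequence_length_ = max_sequence_length;
        cached_length_ = 0;
    }

    // Only the logical length is cleared; the buffers stay allocated.
    void resetKvCache() override { cached_length_ = 0; }

    // Records that one decode step appended a token; false when the cache is full.
    bool advance() {
        if (cached_length_ >= max_sequence_length_) return false;
        ++cached_length_;
        return true;
    }

    int cachedLength() const { return cached_length_; }
    std::size_t capacity() const { return keys_.size(); }
    const float* keyData() const { return keys_.data(); }

private:
    int num_kv_heads_;
    int head_dim_;
    int max_sequence_length_ = 0;
    int cached_length_ = 0;
    std::vector<float> keys_;
    std::vector<float> values_;
};

}  // namespace kv_reset_sketch
```

In this toy, a caller would invoke `initializeKvCache` once, call `advance` per decode step, and call `resetKvCache` before starting a new sequence. The buffer address and size stay the same across the reset, so no reallocation cost is paid between sequences.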