Mila 0.13.48
Deep Neural Network Library
Mila::Dnn::Quant::KvCache::PerChannelKvFp8< TStorage > Struct Template Reference [export]

Symmetric per-head per-token FP8 KV cache compression policy.

Static Public Attributes

static constexpr bool kIsActive = true
 Activates KV cache compression in CudaGqaOp.
static constexpr bool kPerHeadPerToken = true
 One scale per KV head per token position.
static constexpr TensorDataType kScaleDtype = TensorDataType::FP32
 Dtype for per-head per-token scale factors.
static constexpr TensorDataType kStorageDtype = TStorage
 Storage dtype for compressed K and V values.
static constexpr bool kSymmetric = true
 K and V share the same compression policy.

Detailed Description

template<TensorDataType TStorage = TensorDataType::FP8_E4M3>
struct Mila::Dnn::Quant::KvCache::PerChannelKvFp8< TStorage >

Symmetric per-head per-token FP8 KV cache compression policy.

Compresses K and V cache tensors from BF16 to FP8_E4M3 on every prefill chunk write and every decode append. Only the FP8 representation is stored in the cache.

Scale granularity is per-head per-token: one float32 scale per KV head per cached token position, computed as max(abs(x[head,token,:])) / 448.0f, where 448 is the largest finite magnitude representable in FP8_E4M3. The scale tensor has shape [num_kv_heads, max_seq_len].
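The scale rule above can be sketched on the host as follows. This is a minimal illustration, not Mila's kernel: `compute_kv_scales` and `kFp8E4m3Max` are hypothetical names, `float` stands in for BF16, and the input is assumed to be a row-major [num_kv_heads, seq_len, head_dim] buffer.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest finite magnitude representable in FP8 E4M3.
constexpr float kFp8E4m3Max = 448.0f;

// Hypothetical helper (not part of Mila): returns one FP32 scale per
// (head, token) pair, laid out row-major as [num_kv_heads, seq_len].
// Input x is assumed row-major [num_kv_heads, seq_len, head_dim].
std::vector<float> compute_kv_scales(const std::vector<float>& x,
                                     std::size_t num_kv_heads,
                                     std::size_t seq_len,
                                     std::size_t head_dim)
{
    std::vector<float> scales(num_kv_heads * seq_len, 0.0f);
    for (std::size_t h = 0; h < num_kv_heads; ++h) {
        for (std::size_t t = 0; t < seq_len; ++t) {
            const float* row = x.data() + (h * seq_len + t) * head_dim;
            // Absolute maximum over the head dimension for this token.
            float amax = 0.0f;
            for (std::size_t d = 0; d < head_dim; ++d) {
                amax = std::max(amax, std::fabs(row[d]));
            }
            // scale = max(abs(x[head, token, :])) / 448
            scales[h * seq_len + t] = amax / kFp8E4m3Max;
        }
    }
    return scales;
}
```

Dividing a value by its row's scale maps the row's largest magnitude onto 448, so the quantized row uses the full FP8_E4M3 range. An all-zero row yields a zero scale in this sketch; how Mila guards that case is not stated here.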

K and V use this policy symmetrically (kSymmetric = true). Asymmetric K/V compression is not a current Mila target.

On the read path, FP8 K and V are dequantized to transient BF16 buffers immediately before attention score and weighted-sum computation. Dequantized values are never written back to the cache.
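The read path can be illustrated with a host-side sketch that decodes FP8 bytes and applies the per-head per-token scale. It assumes the common E4M3 encoding (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities), which is consistent with the 448 divisor used for the scales. `decode_fp8_e4m3` and `dequantize_row` are hypothetical helpers, and `float` stands in for the transient BF16 buffer.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Decodes one FP8 E4M3 byte: 1 sign bit, 4 exponent bits (bias 7),
// 3 mantissa bits. Assumed encoding; the only NaN pattern is S.1111.111.
float decode_fp8_e4m3(std::uint8_t bits)
{
    const float sign = (bits & 0x80u) ? -1.0f : 1.0f;
    const int exponent = (bits >> 3) & 0x0F;
    const int mantissa = bits & 0x07;
    if (exponent == 0x0F && mantissa == 0x07) {
        return std::numeric_limits<float>::quiet_NaN();
    }
    if (exponent == 0) {
        // Subnormal: no implicit leading one, fixed exponent of -6.
        return sign * std::ldexp(static_cast<float>(mantissa) / 8.0f, -6);
    }
    return sign * std::ldexp(1.0f + static_cast<float>(mantissa) / 8.0f,
                             exponent - 7);
}

// Hypothetical read-path helper: expands one cached [head_dim] row into a
// transient buffer. The cached bytes are read only and never modified.
std::vector<float> dequantize_row(const std::uint8_t* row,
                                  std::size_t head_dim,
                                  float scale)
{
    std::vector<float> out(head_dim);
    for (std::size_t d = 0; d < head_dim; ++d) {
        out[d] = decode_fp8_e4m3(row[d]) * scale;
    }
    return out;
}
```

The returned buffer is a temporary that the caller consumes for the attention computation and then discards, mirroring the rule that dequantized values are never written back to the cache.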

Template Parameters
TStorage    Storage dtype for compressed cache values. Defaults to FP8_E4M3 (4 exponent bits, 3 mantissa bits), consistent with weight quantization. FP8_E5M2 is reserved for gradients and is not a current Mila target.
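As a rough illustration of how a policy of this shape might be consumed at compile time, the sketch below mirrors the documented constants in a stand-in struct. Everything inside namespace `sketch` is an assumption made for the example: the `TensorDataType` enumerators are a reduced stand-in for Mila's enum, and `scale_count` is a hypothetical consumer, not a Mila API.

```cpp
#include <cstddef>

namespace sketch {

// Reduced stand-in for Mila's TensorDataType (assumption for this example).
enum class TensorDataType { FP32, BF16, FP8_E4M3, FP8_E5M2 };

// Stand-in that mirrors the constants documented on this page.
template <TensorDataType TStorage = TensorDataType::FP8_E4M3>
struct PerChannelKvFp8 {
    static constexpr bool kIsActive = true;
    static constexpr bool kPerHeadPerToken = true;
    static constexpr TensorDataType kScaleDtype = TensorDataType::FP32;
    static constexpr TensorDataType kStorageDtype = TStorage;
    static constexpr bool kSymmetric = true;
};

// Hypothetical consumer: number of scale factors needed for one of K or V.
// With per-head per-token granularity this is num_kv_heads * max_seq_len.
template <typename TPolicy>
constexpr std::size_t scale_count(std::size_t num_kv_heads,
                                  std::size_t max_seq_len)
{
    if constexpr (TPolicy::kIsActive && TPolicy::kPerHeadPerToken) {
        return num_kv_heads * max_seq_len;
    } else {
        return 0;
    }
}

// The default template argument selects FP8_E4M3 storage.
static_assert(PerChannelKvFp8<>::kStorageDtype == TensorDataType::FP8_E4M3);

}  // namespace sketch
```

Because every member is `static constexpr`, an operator templated on the policy can branch with `if constexpr` and pay no runtime cost for the selection. The sketch requires C++17.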

Member Data Documentation

◆ kIsActive

template<TensorDataType TStorage = TensorDataType::FP8_E4M3>
bool Mila::Dnn::Quant::KvCache::PerChannelKvFp8< TStorage >::kIsActive = true
static constexpr

Activates KV cache compression in CudaGqaOp.

◆ kPerHeadPerToken

template<TensorDataType TStorage = TensorDataType::FP8_E4M3>
bool Mila::Dnn::Quant::KvCache::PerChannelKvFp8< TStorage >::kPerHeadPerToken = true
static constexpr

One scale per KV head per token position.

◆ kScaleDtype

template<TensorDataType TStorage = TensorDataType::FP8_E4M3>
TensorDataType Mila::Dnn::Quant::KvCache::PerChannelKvFp8< TStorage >::kScaleDtype = TensorDataType::FP32
static constexpr

Dtype for per-head per-token scale factors.

◆ kStorageDtype

template<TensorDataType TStorage = TensorDataType::FP8_E4M3>
TensorDataType Mila::Dnn::Quant::KvCache::PerChannelKvFp8< TStorage >::kStorageDtype = TStorage
static constexpr

Storage dtype for compressed K and V values.

◆ kSymmetric

template<TensorDataType TStorage = TensorDataType::FP8_E4M3>
bool Mila::Dnn::Quant::KvCache::PerChannelKvFp8< TStorage >::kSymmetric = true
static constexpr

K and V share the same compression policy.


The documentation for this struct was generated from the following file: