Symmetric per-head per-token FP8 KV cache compression policy.

Static Public Attributes
static constexpr bool	kIsActive = true
	Activates KV cache compression in `CudaGqaOp`.
static constexpr bool	kPerHeadPerToken = true
	One scale per KV head per token position.
static constexpr TensorDataType	kScaleDtype = TensorDataType::FP32
	Dtype for per-head per-token scale factors.
static constexpr TensorDataType	kStorageDtype = TStorage
	Storage dtype for compressed K and V values.
static constexpr bool	kSymmetric = true
	K and V share the same compression policy.

Detailed Description

template<TensorDataType TStorage = TensorDataType::FP8_E4M3>
struct Mila::Dnn::Quant::KvCache::PerChannelKvFp8< TStorage >

Symmetric per-head per-token FP8 KV cache compression policy.

Compresses K and V cache tensors from BF16 to the FP8 storage dtype (FP8_E4M3 by default) on every prefill chunk write and every decode append. Only the FP8 representation is stored in the cache.

Scale granularity is per-head per-token: one float32 scale per KV head per cached token position, computed as max(abs(x[head, token, :])) / 448.0f, where 448 is the largest finite value representable in FP8_E4M3. The scale tensor has shape [num_kv_heads, max_seq_len].
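The scale formula above can be sketched on the host as follows. This is a minimal illustration, not Mila's implementation: the helper names (`compute_kv_scale`, `scale_index`, `scale_to_fp8_range`) are hypothetical, inputs are plain `float` rather than BF16, the scale tensor is assumed to be row-major, and the zero-scale guard is an assumption about how an all-zero row might be handled. The final rounding to the FP8 grid is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Largest finite FP8_E4M3 magnitude, as used in the scale formula.
constexpr float kFp8E4m3Max = 448.0f;

// Scale for one (head, token) row: max(abs(x[head, token, :])) / 448.
// `x` points at the head_dim values of that row.
inline float compute_kv_scale(const float* x, std::size_t head_dim) {
    assert(x != nullptr && head_dim > 0);
    float amax = 0.0f;
    for (std::size_t i = 0; i < head_dim; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    return amax / kFp8E4m3Max;
}

// Flat offset into a scale tensor of shape [num_kv_heads, max_seq_len].
// Row-major layout is an assumption of this sketch.
inline std::size_t scale_index(std::size_t head, std::size_t token,
                               std::size_t max_seq_len) {
    assert(token < max_seq_len);
    return head * max_seq_len + token;
}

// Value handed to the FP8 conversion: x / scale, clamped to the
// representable range. Rounding to the FP8 grid is not modeled here.
inline float scale_to_fp8_range(float x, float scale) {
    if (scale == 0.0f) {
        return 0.0f;  // all-zero row; this guard is an assumption
    }
    return std::clamp(x / scale, -kFp8E4m3Max, kFp8E4m3Max);
}
```

With this formulation the largest-magnitude element of each row maps to exactly ±448, so the full FP8_E4M3 range is used for every head and token.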

The policy is symmetric in the sense that K and V are compressed identically (kSymmetric = true). Asymmetric K/V compression is not a current Mila target.

On the read path, FP8 K and V are dequantized to transient BF16 buffers immediately before attention score and weighted-sum computation. Dequantized values are never written back to the cache.
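The read path can be illustrated with a small host-side decoder. The sketch below assumes the common E4M3 encoding with exponent bias 7, no infinities, and a single NaN pattern per sign, which is consistent with the 448 maximum used for the scale. The function names are hypothetical, and the transient buffer is `float` rather than BF16 to keep the example self-contained. A real kernel would typically use the hardware conversion instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>

// Decode one FP8_E4M3 byte: 1 sign bit, 4 exponent bits (bias 7),
// 3 mantissa bits. Assumes the variant without infinities.
inline float decode_fp8_e4m3(std::uint8_t bits) {
    const float sign = (bits & 0x80u) ? -1.0f : 1.0f;
    const int exponent = (bits >> 3) & 0x0F;
    const int mantissa = bits & 0x07;
    if (exponent == 0x0F && mantissa == 0x07) {
        return std::numeric_limits<float>::quiet_NaN();
    }
    if (exponent == 0) {
        // Subnormal: (mantissa / 8) * 2^-6 == mantissa * 2^-9.
        return sign * std::ldexp(static_cast<float>(mantissa), -9);
    }
    // Normal: (1 + mantissa / 8) * 2^(exponent - 7).
    return sign * std::ldexp(1.0f + static_cast<float>(mantissa) / 8.0f,
                             exponent - 7);
}

// Dequantize one (head, token) row into a transient buffer.
// The cached bytes in `q` are read only and never modified.
inline void dequantize_row(const std::uint8_t* q, float scale,
                           float* out, std::size_t head_dim) {
    assert(q != nullptr && out != nullptr);
    for (std::size_t i = 0; i < head_dim; ++i) {
        out[i] = decode_fp8_e4m3(q[i]) * scale;
    }
}
```

Because `dequantize_row` writes only to `out`, the cache keeps its FP8 contents, matching the rule that dequantized values are never written back.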

Template Parameters

TStorage	Storage dtype for compressed cache values. Defaults to FP8_E4M3 (4 exponent bits, 3 mantissa bits), consistent with weight quantization. FP8_E5M2 is reserved for gradients and is not a current Mila target.
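As a rough illustration of how a policy of this shape might be consumed at compile time, the sketch below mirrors the static attributes in a stand-in struct and uses them to compute the cache footprint of one K or V row. Everything here is hypothetical: `TensorDataType` is a local stand-in enum, `KvFp8PolicySketch`, `NoCompressionSketch`, `element_bytes`, and `kv_bytes_per_head_token` are not Mila names, and how `CudaGqaOp` actually consumes the policy is not shown in this page.

```cpp
#include <cstddef>

// Local stand-in for Mila's TensorDataType; the real enum is not shown here.
enum class TensorDataType { FP32, BF16, FP8_E4M3, FP8_E5M2 };

// Stand-in mirroring the documented static attributes.
template <TensorDataType TStorage = TensorDataType::FP8_E4M3>
struct KvFp8PolicySketch {
    static constexpr bool kIsActive = true;
    static constexpr bool kPerHeadPerToken = true;
    static constexpr bool kSymmetric = true;
    static constexpr TensorDataType kScaleDtype = TensorDataType::FP32;
    static constexpr TensorDataType kStorageDtype = TStorage;
};

// Hypothetical "no compression" policy used for comparison.
struct NoCompressionSketch {
    static constexpr bool kIsActive = false;
};

constexpr std::size_t element_bytes(TensorDataType t) {
    switch (t) {
        case TensorDataType::FP32: return 4;
        case TensorDataType::BF16: return 2;
        default: return 1;  // both FP8 formats occupy one byte
    }
}

// Bytes stored per KV head per token for one tensor (K or V).
// Compressed: head_dim storage elements plus one scale.
// Uncompressed: head_dim BF16 elements.
template <typename Policy>
constexpr std::size_t kv_bytes_per_head_token(std::size_t head_dim) {
    if constexpr (Policy::kIsActive) {
        return head_dim * element_bytes(Policy::kStorageDtype)
             + element_bytes(Policy::kScaleDtype);
    } else {
        return head_dim * element_bytes(TensorDataType::BF16);
    }
}

static_assert(kv_bytes_per_head_token<KvFp8PolicySketch<>>(128) == 132);
static_assert(kv_bytes_per_head_token<NoCompressionSketch>(128) == 256);
```

For a head dimension of 128, the per-token scale adds 4 bytes to 128 bytes of FP8 data, so the compressed row is 132 bytes against 256 bytes in BF16.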