Mila 0.13.48
Deep Neural Network Library
Symmetric per-head per-token FP8 KV cache compression policy.
Static Public Attributes
| Type | Member | Description |
| --- | --- | --- |
| static constexpr bool | kIsActive = true | Activates KV cache compression in CudaGqaOp. |
| static constexpr bool | kPerHeadPerToken = true | One scale per KV head per token position. |
| static constexpr TensorDataType | kScaleDtype = TensorDataType::FP32 | Dtype for per-head per-token scale factors. |
| static constexpr TensorDataType | kStorageDtype = TStorage | Storage dtype for compressed K and V values. |
| static constexpr bool | kSymmetric = true | K and V share the same compression policy. |
Compresses K and V cache tensors from BF16 to FP8_E4M3 on every prefill chunk write and every decode append. Only the FP8 representation is stored in the cache.
Scale granularity is per-head per-token: one float32 scale per KV head per cached token position, computed as max(abs(x[head,token,:])) / 448.0f. Scale tensor shape is [num_kv_heads, max_seq_len].
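The scale formula above can be sketched on the host as follows. This is a minimal illustration, not Mila's kernel: the function name `compute_kv_scales`, the dense `[num_kv_heads, seq_len, head_dim]` input layout, and the use of `float` in place of BF16 are all assumptions made for the example. Only the formula and the `[num_kv_heads, max_seq_len]` scale shape come from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest finite magnitude in FP8 E4M3, as used in the scale formula.
constexpr float kFp8E4m3Max = 448.0f;

// Hypothetical helper: computes one FP32 scale per (head, token).
// Input x is assumed dense with layout [num_kv_heads, seq_len, head_dim].
// Output has shape [num_kv_heads, max_seq_len]; positions at or beyond
// seq_len are left at 0.
std::vector<float> compute_kv_scales(const std::vector<float>& x,
                                     std::size_t num_kv_heads,
                                     std::size_t seq_len,
                                     std::size_t max_seq_len,
                                     std::size_t head_dim) {
    assert(seq_len <= max_seq_len);
    assert(x.size() == num_kv_heads * seq_len * head_dim);

    std::vector<float> scales(num_kv_heads * max_seq_len, 0.0f);
    for (std::size_t h = 0; h < num_kv_heads; ++h) {
        for (std::size_t t = 0; t < seq_len; ++t) {
            const float* row = x.data() + (h * seq_len + t) * head_dim;
            float amax = 0.0f;
            for (std::size_t d = 0; d < head_dim; ++d) {
                amax = std::max(amax, std::fabs(row[d]));
            }
            // scale = max(abs(x[head, token, :])) / 448
            scales[h * max_seq_len + t] = amax / kFp8E4m3Max;
        }
    }
    return scales;
}
```

An all-zero head vector yields a scale of 0 in this sketch, so any quantization step that divides by the scale needs a guard. How Mila handles that case is not stated here.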
K and V use this policy symmetrically (kSymmetric = true). Asymmetric K/V compression is not a current Mila target.
On the read path, FP8 K and V are dequantized to transient BF16 buffers immediately before attention score and weighted-sum computation. Dequantized values are never written back to the cache.
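The write and read paths can be illustrated with a host-side round trip for a single head vector. This is a sketch under stated assumptions: `round_to_e4m3`, `quantize_row`, and `dequantize_row` are hypothetical names, FP8 storage is emulated by rounding a `float` onto the E4M3 value grid, the transient buffer is `float` rather than BF16, and the zero-scale guard is an assumption rather than documented Mila behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr float kFp8E4m3Max = 448.0f;

// Emulates FP8 E4M3 rounding: 3 mantissa bits, minimum normal exponent -6,
// saturating at +/-448. The result is kept in a float as a stand-in for a
// real 8-bit storage type.
float round_to_e4m3(float v) {
    if (v == 0.0f || std::isnan(v)) return v;
    const float a = std::min(std::fabs(v), kFp8E4m3Max);
    int exp = 0;
    std::frexp(a, &exp);                          // a = m * 2^exp, m in [0.5, 1)
    const int e = std::max(exp - 1, -6);          // clamp for the subnormal range
    const float step = std::ldexp(1.0f, e - 3);   // grid spacing at this exponent
    return std::copysign(std::nearbyint(a / step) * step, v);
}

// Write path: scale one head vector into the E4M3 range and round.
void quantize_row(const float* x, std::size_t head_dim, float scale, float* q) {
    const float inv = scale > 0.0f ? 1.0f / scale : 0.0f;  // guard all-zero rows
    for (std::size_t d = 0; d < head_dim; ++d) {
        q[d] = round_to_e4m3(x[d] * inv);
    }
}

// Read path: expand cached codes into a transient buffer for attention.
// The cache itself is not modified.
void dequantize_row(const float* q, std::size_t head_dim, float scale, float* out) {
    for (std::size_t d = 0; d < head_dim; ++d) {
        out[d] = q[d] * scale;
    }
}
```

Because the scale maps the largest magnitude in the vector to 448, that element uses the full E4M3 range, while smaller elements are rounded on a grid with 3 mantissa bits of precision.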
Template Parameters
| TStorage | Storage dtype for compressed cache values. Defaults to FP8_E4M3 (4 exponent bits, 3 mantissa bits), consistent with weight quantization. FP8_E5M2 is reserved for gradients and is not a current Mila target. |
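For orientation, a policy with these attributes might be shaped like the sketch below. The struct name `Fp8KvCachePolicySketch`, the stand-in `TensorDataType` enum, and the helpers `element_size` and `cache_bytes_per_head_token` are hypothetical and exist only for this example. Only the five attribute names, their values, and the FP8_E4M3 default are taken from this page.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for Mila's TensorDataType; the real enum is not shown on this page.
enum class TensorDataType { FP32, BF16, FP8_E4M3, FP8_E5M2 };

// Hypothetical policy type mirroring the documented static attributes.
template <TensorDataType TStorage = TensorDataType::FP8_E4M3>
struct Fp8KvCachePolicySketch {
    static constexpr bool kIsActive = true;
    static constexpr bool kPerHeadPerToken = true;
    static constexpr TensorDataType kScaleDtype = TensorDataType::FP32;
    static constexpr TensorDataType kStorageDtype = TStorage;
    static constexpr bool kSymmetric = true;
};

// Bytes per element for the stand-in enum (helper assumed for this sketch).
constexpr std::size_t element_size(TensorDataType t) {
    return t == TensorDataType::FP32 ? 4
         : t == TensorDataType::BF16 ? 2
         : 1;  // both FP8 formats occupy one byte
}

// Cache bytes for one KV head at one token position, for K or V alone:
// head_dim compressed values plus one scale factor.
template <typename Policy>
constexpr std::size_t cache_bytes_per_head_token(std::size_t head_dim) {
    return head_dim * element_size(Policy::kStorageDtype)
         + element_size(Policy::kScaleDtype);
}
```

With an illustrative `head_dim` of 128, the sketch gives 128 + 4 = 132 bytes per head per token, compared with 256 bytes for uncompressed BF16, so the per-token FP32 scale adds only a small overhead.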
static constexpr bool kIsActive = true
Activates KV cache compression in CudaGqaOp.

static constexpr bool kPerHeadPerToken = true
One scale per KV head per token position.

static constexpr TensorDataType kScaleDtype = TensorDataType::FP32
Dtype for per-head per-token scale factors.

static constexpr TensorDataType kStorageDtype = TStorage
Storage dtype for compressed K and V values.

static constexpr bool kSymmetric = true
K and V share the same compression policy.