Mila 0.13.48
Deep Neural Network Library
Concept for quantization-based KV cache compression policies.
Refinement of KvCachePolicy requiring the additional dtype and granularity fields consumed by the FP8 cache kernels in CudaGqaOp. Used where the operation needs to inspect storage and scale dtypes, for example during kernel selection and scale tensor allocation.
NoKvCompression satisfies KvCachePolicy but does not satisfy QuantKvPolicy. CudaGqaOp guards all QuantKvPolicy-specific paths with if constexpr (kKvCompressed) before accessing these fields.

Template Parameters
| T | Candidate policy type. |