Concept for quantization-based KV cache compression policies.

Concept definition

template<typename T>
concept Mila::Dnn::Quant::KvCache::QuantKvPolicy = KvCachePolicy<T> && requires
    {
        { T::kStorageDtype } -> std::convertible_to<TensorDataType>;
        { T::kScaleDtype } -> std::convertible_to<TensorDataType>;
        { T::kPerHeadPerToken } -> std::convertible_to<bool>;
        { T::kSymmetric } -> std::convertible_to<bool>;
    }

Detailed Description

Refines KvCachePolicy with the additional dtype and granularity fields consumed by the FP8 cache kernels in CudaGqaOp. Use it where an operation needs to inspect the storage and scale dtypes, for example during kernel selection and scale tensor allocation.

Note: NoKvCompression satisfies KvCachePolicy but does not satisfy QuantKvPolicy. CudaGqaOp guards all QuantKvPolicy-specific paths with if constexpr (kKvCompressed) before accessing these fields.

Template Parameters

T	Candidate policy type.