Configuration interface for the Grouped-Query Attention component.

#include <stdexcept>
#include <string>
#include <utility>
#include <sstream>
import Serialization.Metadata;
import Dnn.ComponentConfig;
import Dnn.TensorTypes;
import Dnn.Component;

Classes
class	Mila::Dnn::GqaConfig
	Configuration class for the Grouped-Query Attention module.

Namespaces
namespace	Mila
	Mila main API namespace.
namespace	Mila::Dnn

Detailed Description

Configuration interface for the Grouped-Query Attention component.

Grouped-Query Attention (GQA) extends Multi-Head Attention by decoupling the number of query (Q) heads from the number of key/value (K/V) heads. Each K/V head is shared by a contiguous group of Q heads, which reduces KV cache size and K/V memory bandwidth during inference by a factor of num_heads / num_kv_heads.

Special cases:
	num_kv_heads == num_heads → standard Multi-Head Attention
	num_kv_heads == 1 → Multi-Query Attention (MQA)
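
The grouping and its special cases can be illustrated with a small sketch. The helper names below (is_valid_gqa_layout, gqa_group_size, kv_head_for_query_head) are hypothetical and are not part of Mila::Dnn::GqaConfig. The sketch also assumes that num_heads must be an exact multiple of num_kv_heads, which is the usual GQA constraint but is not stated on this page.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; not part of Mila::Dnn::GqaConfig.

// Assumption: a usable layout has at least one K/V head and a Q-head count
// that is an exact multiple of the K/V-head count.
constexpr bool is_valid_gqa_layout(std::size_t num_heads, std::size_t num_kv_heads) {
    return num_kv_heads > 0
        && num_heads >= num_kv_heads
        && num_heads % num_kv_heads == 0;
}

// Number of Q heads that share one K/V head. This is also the factor by which
// the KV cache shrinks relative to standard Multi-Head Attention.
// Precondition: is_valid_gqa_layout(num_heads, num_kv_heads).
constexpr std::size_t gqa_group_size(std::size_t num_heads, std::size_t num_kv_heads) {
    return num_heads / num_kv_heads;
}

// Index of the K/V head that serves Q head `q_head` when groups are contiguous.
// Precondition: q_head < num_heads and the layout is valid.
constexpr std::size_t kv_head_for_query_head(std::size_t q_head,
                                             std::size_t num_heads,
                                             std::size_t num_kv_heads) {
    return q_head / gqa_group_size(num_heads, num_kv_heads);
}
```

With num_heads = 8 and num_kv_heads = 2, the group size is 4: Q heads 0 to 3 read K/V head 0 and Q heads 4 to 7 read K/V head 1. With num_kv_heads = 8 every Q head maps to its own K/V head (Multi-Head Attention), and with num_kv_heads = 1 every Q head maps to K/V head 0 (MQA).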
