Deployment configuration for Llama language models.

#include <stdexcept>
#include <string>
#include <type_traits>
#include <concepts>
import Dnn.TensorTypes;
import Dnn.LanguageModelConfig;

Classes
struct	Mila::Dnn::LlamaModelConfig
	Deployment configuration for Llama language models.

Namespaces
namespace	Mila
	Mila main API namespace.
namespace	Mila::Dnn

Detailed Description

Deployment configuration for Llama language models.

LlamaModelConfig is the concrete configuration type passed to LlamaModel::fromPretrained(). It inherits the deployment settings common to all language models from its base, LanguageModelConfig<LlamaModelConfig>:

context_length — maximum sequence length
weight_quantization — Linear weight storage strategy
kv_cache_compression — GroupedQueryAttention cache strategy

All Llama architectural parameters (num_layers, num_heads, hidden_dim, rope_theta, vocab_size, etc.) are read from checkpoint metadata at load time and are not deployment concerns. LlamaModelConfig carries no architecture-specific fields beyond what the base provides.
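
The inheritance from LanguageModelConfig<LlamaModelConfig> follows the curiously recurring template pattern, which is what lets a fluent setter defined in the base return the concrete config type. The sketch below illustrates that idea only; it is not the Mila implementation. The `sketch` namespace, the enum members, the KvCacheCompression type name, the field types, and the zero-length check are all assumptions made for illustration. The only things taken from this page are the three field names and the two `with*` method names used in the Usage section.

```cpp
#include <cstddef>
#include <stdexcept>

namespace sketch {

// Hypothetical enums: the real Mila definitions are not shown on this page.
enum class WeightQuantization { None, FP8 };
enum class KvCacheCompression { None, FP8 };

// CRTP base: Derived is the concrete config type, so every fluent setter can
// return Derived rather than the base type.
template <typename Derived>
struct LanguageModelConfig {
    std::size_t context_length;
    WeightQuantization weight_quantization = WeightQuantization::None;
    KvCacheCompression kv_cache_compression = KvCacheCompression::None;

    // Assumed validation: reject a zero-length context.
    constexpr explicit LanguageModelConfig(std::size_t length)
        : context_length(length) {
        if (length == 0) {
            throw std::invalid_argument("context_length must be positive");
        }
    }

    // Returns a modified copy with only the weight storage changed.
    constexpr Derived withWeightQuantization(WeightQuantization q) const {
        Derived copy = static_cast<const Derived&>(*this);
        copy.weight_quantization = q;
        return copy;
    }

    // Returns a modified copy with FP8 weights and an FP8 KV cache.
    constexpr Derived withFP8Quantization() const {
        Derived copy = static_cast<const Derived&>(*this);
        copy.weight_quantization = WeightQuantization::FP8;
        copy.kv_cache_compression = KvCacheCompression::FP8;
        return copy;
    }
};

// Concrete type: adds no fields of its own and reuses the base constructor.
struct LlamaModelConfig : LanguageModelConfig<LlamaModelConfig> {
    using LanguageModelConfig<LlamaModelConfig>::LanguageModelConfig;
};

}  // namespace sketch
```

In this sketch the setters return a modified copy, so they can be chained on a temporary as in the Usage section. Whether the real methods copy or mutate in place is not stated on this page.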

Usage

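// The three alternatives below use distinct variable names so the snippet
// compiles as a whole; context_length is supplied by the caller.

// Standard BF16 inference
auto bf16_config = LlamaModelConfig( context_length );

// FP8 weights + FP8 KV cache
auto fp8_config = LlamaModelConfig( context_length )
    .withFP8Quantization();

// FP8 weights only, no KV cache compression
auto fp8_weights_config = LlamaModelConfig( context_length )
    .withWeightQuantization( WeightQuantization::FP8 );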
