Mila 0.13.48
Deep Neural Network Library
Mila::Dnn::LanguageModelConfig< TDerived > Struct Template Reference
export

CRTP base configuration for all deployable Mila language models.

Public Member Functions

 LanguageModelConfig ()=default
 LanguageModelConfig (dim_t context_length)
 Construct with a required context length.
std::string baseToString () const
 Produce the base fields portion of a toString() summary.
dim_t getContextLength () const noexcept
KvCacheCompression getKvCacheCompression () const noexcept
WeightQuantization getWeightQuantization () const noexcept
TDerived & withContextLength (dim_t context_length)
 Set the maximum sequence length.
TDerived & withFP4Quantization ()
 FP4 quantization — FP4 weights, FP8 KV cache.
TDerived & withFP8Quantization ()
 FP8 quantization — FP8 weights, FP8 KV cache.
TDerived & withFullPrecision ()
 Full precision — BF16 weights, BF16 KV cache.
TDerived & withKvCacheCompression (KvCacheCompression kv)
 Set the KV cache compression mode independently.
TDerived & withWeightQuantization (WeightQuantization wq)
 Set the weight quantization mode independently.

Protected Attributes

dim_t context_length_ { 0 }
KvCacheCompression kv_cache_compression_ { KvCacheCompression::None }
WeightQuantization weight_quantization_ { WeightQuantization::None }

Detailed Description

template<typename TDerived>
struct Mila::Dnn::LanguageModelConfig< TDerived >

CRTP base configuration for all deployable Mila language models.

Template Parameters
TDerived: Concrete config type (e.g. LlamaModelConfig). All fluent setters return TDerived& to support unbroken chain syntax across base and derived methods.
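The chaining behaviour described above can be illustrated with a reduced sketch. It is not Mila's implementation: LanguageModelConfigSketch, ExampleModelConfig, withNumLayers() and the dim_t alias are all hypothetical stand-ins chosen for this example. The point it shows is that a base setter returning TDerived& lets a derived setter follow in the same expression.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Assumed stand-in for Mila's dim_t; the real alias is not shown on this page.
using dim_t = std::int64_t;

// Reduced stand-in for LanguageModelConfig with a single fluent setter.
template <typename TDerived>
struct LanguageModelConfigSketch {
    TDerived& withContextLength(dim_t context_length) {
        if (context_length <= 0) {
            throw std::invalid_argument("context_length must be > 0");
        }
        context_length_ = context_length;
        // CRTP downcast: hand back the concrete type so that derived
        // setters can follow in the same chain.
        return static_cast<TDerived&>(*this);
    }

    dim_t getContextLength() const noexcept { return context_length_; }

protected:
    dim_t context_length_{0};
};

// Hypothetical concrete config; withNumLayers() is invented for illustration.
struct ExampleModelConfig : LanguageModelConfigSketch<ExampleModelConfig> {
    ExampleModelConfig& withNumLayers(dim_t num_layers) {
        num_layers_ = num_layers;
        return *this;
    }

    dim_t getNumLayers() const noexcept { return num_layers_; }

private:
    dim_t num_layers_{0};
};

// A base setter followed by a derived setter in one expression.
inline ExampleModelConfig makeExampleConfig() {
    ExampleModelConfig config;
    config.withContextLength(2048).withNumLayers(16);
    // Both fields were set through the single chain above.
    assert(config.getContextLength() == 2048);
    return config;
}
```

If withContextLength() returned a reference to the base type instead, the call to withNumLayers() would not compile, which is the problem the CRTP parameter avoids.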

Constructor & Destructor Documentation

◆ LanguageModelConfig() [1/2]

template<typename TDerived>
Mila::Dnn::LanguageModelConfig< TDerived >::LanguageModelConfig ( )
default

◆ LanguageModelConfig() [2/2]

template<typename TDerived>
Mila::Dnn::LanguageModelConfig< TDerived >::LanguageModelConfig ( dim_t context_length)
inline explicit

Construct with a required context length.

Parameters
context_length: Maximum sequence length in tokens. Must be > 0.
Exceptions
std::invalid_argument if context_length is zero.
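As a rough illustration of the two constructors, the sketch below pairs a defaulted constructor with an explicit one that validates its argument. The type names, the dim_t alias and the validated() helper are assumptions made for this example. Mila's actual check may differ in detail.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Assumed stand-in for Mila's dim_t; the real alias is not shown on this page.
using dim_t = std::int64_t;

// Reduced stand-in for LanguageModelConfig showing only the constructors.
template <typename TDerived>
struct LanguageModelConfigSketch {
    LanguageModelConfigSketch() = default;

    // explicit: a bare integer never converts to a config implicitly.
    explicit LanguageModelConfigSketch(dim_t context_length)
        : context_length_(validated(context_length)) {}

    dim_t getContextLength() const noexcept { return context_length_; }

protected:
    // Hypothetical helper: reject non-positive lengths before storing them.
    static dim_t validated(dim_t value) {
        if (value <= 0) {
            throw std::invalid_argument("context_length must be > 0");
        }
        return value;
    }

    dim_t context_length_{0};
};

// Hypothetical concrete config that inherits the base constructors.
struct ExampleModelConfig : LanguageModelConfigSketch<ExampleModelConfig> {
    using LanguageModelConfigSketch<ExampleModelConfig>::LanguageModelConfigSketch;
};

// Returns true when constructing with a zero length throws.
inline bool rejectsZeroContextLength() {
    try {
        ExampleModelConfig bad(0);
        (void)bad;
    } catch (const std::invalid_argument&) {
        return true;
    }
    return false;
}
```

Note that a default-constructed config keeps context_length_ at 0, so in this sketch the length still has to be supplied later through a setter.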

Member Function Documentation

◆ baseToString()

template<typename TDerived>
std::string Mila::Dnn::LanguageModelConfig< TDerived >::baseToString ( ) const
inline

Produce the base fields portion of a toString() summary.

Concrete model configs call this from their own toString() implementation and append architecture-specific fields.
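A sketch of how a concrete config might compose its summary from the base portion follows. The output format, the num_layers_ field and both type names are invented for this example. The page does not specify what baseToString() actually emits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed stand-in for Mila's dim_t; the real alias is not shown on this page.
using dim_t = std::int64_t;

// Reduced stand-in for LanguageModelConfig showing only baseToString().
template <typename TDerived>
struct LanguageModelConfigSketch {
    std::string baseToString() const {
        // Invented format: the real method's output is not documented here.
        return "context_length=" + std::to_string(context_length_);
    }

protected:
    dim_t context_length_{0};
};

// Hypothetical concrete config with one architecture-specific field.
struct ExampleModelConfig : LanguageModelConfigSketch<ExampleModelConfig> {
    std::string toString() const {
        // Base fields first, architecture-specific fields appended after.
        return "ExampleModelConfig{" + baseToString() +
               ", num_layers=" + std::to_string(num_layers_) + "}";
    }

private:
    dim_t num_layers_{16};
};
```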

◆ getContextLength()

template<typename TDerived>
dim_t Mila::Dnn::LanguageModelConfig< TDerived >::getContextLength ( ) const
inline noexcept

◆ getKvCacheCompression()

template<typename TDerived>
KvCacheCompression Mila::Dnn::LanguageModelConfig< TDerived >::getKvCacheCompression ( ) const
inline noexcept

◆ getWeightQuantization()

template<typename TDerived>
WeightQuantization Mila::Dnn::LanguageModelConfig< TDerived >::getWeightQuantization ( ) const
inline noexcept

◆ withContextLength()

template<typename TDerived>
TDerived & Mila::Dnn::LanguageModelConfig< TDerived >::withContextLength ( dim_t context_length)
inline

Set the maximum sequence length.

Required before passing the config to fromPretrained(). RoPE embeddings and KV cache buffers are sized to this value at build time.

Parameters
context_length: Maximum sequence length in tokens. Must be > 0.
Exceptions
std::invalid_argument if context_length is zero.

◆ withFP4Quantization()

template<typename TDerived>
TDerived & Mila::Dnn::LanguageModelConfig< TDerived >::withFP4Quantization ( )
inline

FP4 quantization — FP4 weights, FP8 KV cache.

Maps to PerGroupFp4<> on Linear (future) and PerChannelKvFp8<> on GroupedQueryAttention. This is the most aggressive of the presets, with some quality loss relative to FP8. The KV cache stays at FP8 because an FP4 KV cache is not a Mila target.

◆ withFP8Quantization()

template<typename TDerived>
TDerived & Mila::Dnn::LanguageModelConfig< TDerived >::withFP8Quantization ( )
inline

FP8 quantization — FP8 weights, FP8 KV cache.

Maps to PerChannelFp8<> on Linear and PerChannelKvFp8<> on GroupedQueryAttention. Good quality/compression tradeoff for standard inference on Ada Lovelace and later NVIDIA GPU architectures.

◆ withFullPrecision()

template<typename TDerived>
TDerived & Mila::Dnn::LanguageModelConfig< TDerived >::withFullPrecision ( )
inline

Full precision — BF16 weights, BF16 KV cache.

Resets both quantization axes to their defaults (WeightQuantization::None and KvCacheCompression::None). Useful for explicitly documenting intent or overriding a previously set preset.
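The three presets can be read as fixed pairs of values on the two quantization axes. The sketch below assumes enumerator names Fp8 and Fp4, since only None appears on this page. The type names and the set() helper are likewise hypothetical.

```cpp
#include <cassert>

// Only the None enumerators appear on this page; Fp8 and Fp4 are assumed names.
enum class WeightQuantization { None, Fp8, Fp4 };
enum class KvCacheCompression { None, Fp8 };

// Reduced stand-in for LanguageModelConfig showing only the presets.
template <typename TDerived>
struct LanguageModelConfigSketch {
    // BF16 weights, BF16 KV cache: both axes back at their defaults.
    TDerived& withFullPrecision() {
        return set(WeightQuantization::None, KvCacheCompression::None);
    }

    // FP8 weights, FP8 KV cache.
    TDerived& withFP8Quantization() {
        return set(WeightQuantization::Fp8, KvCacheCompression::Fp8);
    }

    // FP4 weights, but the KV cache stays at FP8.
    TDerived& withFP4Quantization() {
        return set(WeightQuantization::Fp4, KvCacheCompression::Fp8);
    }

    WeightQuantization getWeightQuantization() const noexcept {
        return weight_quantization_;
    }
    KvCacheCompression getKvCacheCompression() const noexcept {
        return kv_cache_compression_;
    }

protected:
    // Hypothetical helper: every preset writes both axes at once.
    TDerived& set(WeightQuantization wq, KvCacheCompression kv) {
        weight_quantization_ = wq;
        kv_cache_compression_ = kv;
        return static_cast<TDerived&>(*this);
    }

    WeightQuantization weight_quantization_{WeightQuantization::None};
    KvCacheCompression kv_cache_compression_{KvCacheCompression::None};
};

// Hypothetical concrete config with no extra fields.
struct ExampleModelConfig : LanguageModelConfigSketch<ExampleModelConfig> {};
```

Because each preset in this sketch assigns both axes, calling withFullPrecision() after another preset leaves no trace of the earlier one.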

◆ withKvCacheCompression()

template<typename TDerived>
TDerived & Mila::Dnn::LanguageModelConfig< TDerived >::withKvCacheCompression ( KvCacheCompression kv)
inline

Set the KV cache compression mode independently.

Use when a preset pairs the desired KV cache compression with a weight quantization you do not want, or when no preset exists for the desired combination.

Parameters
kv: KV cache compression mode to apply.

◆ withWeightQuantization()

template<typename TDerived>
TDerived & Mila::Dnn::LanguageModelConfig< TDerived >::withWeightQuantization ( WeightQuantization wq)
inline

Set the weight quantization mode independently.

Use when a preset pairs the desired weight quantization with a KV cache compression you do not want, or when no preset exists for the desired combination.

Parameters
wq: Weight quantization mode to apply.
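The two independent setters can be combined with a preset to reach a pairing that no preset offers. The following is a sketch under assumptions: the Fp8 and Fp4 enumerators, the type names and the self() helper are hypothetical, and the real setters may do more than assign a field.

```cpp
#include <cassert>

// Only the None enumerators appear on this page; Fp8 and Fp4 are assumed names.
enum class WeightQuantization { None, Fp8, Fp4 };
enum class KvCacheCompression { None, Fp8 };

// Reduced stand-in for LanguageModelConfig with one preset and both setters.
template <typename TDerived>
struct LanguageModelConfigSketch {
    TDerived& withFP8Quantization() {
        weight_quantization_ = WeightQuantization::Fp8;
        kv_cache_compression_ = KvCacheCompression::Fp8;
        return self();
    }

    // Changes only the weight axis; the KV cache axis is left alone.
    TDerived& withWeightQuantization(WeightQuantization wq) {
        weight_quantization_ = wq;
        return self();
    }

    // Changes only the KV cache axis; the weight axis is left alone.
    TDerived& withKvCacheCompression(KvCacheCompression kv) {
        kv_cache_compression_ = kv;
        return self();
    }

    WeightQuantization getWeightQuantization() const noexcept {
        return weight_quantization_;
    }
    KvCacheCompression getKvCacheCompression() const noexcept {
        return kv_cache_compression_;
    }

protected:
    // Hypothetical helper for the CRTP downcast.
    TDerived& self() noexcept { return static_cast<TDerived&>(*this); }

    WeightQuantization weight_quantization_{WeightQuantization::None};
    KvCacheCompression kv_cache_compression_{KvCacheCompression::None};
};

// Hypothetical concrete config with no extra fields.
struct ExampleModelConfig : LanguageModelConfigSketch<ExampleModelConfig> {};

// FP8 weights with an uncompressed KV cache: start from the FP8 preset,
// then override the KV cache axis only.
inline ExampleModelConfig makeFp8WeightsOnly() {
    ExampleModelConfig config;
    config.withFP8Quantization()
          .withKvCacheCompression(KvCacheCompression::None);
    assert(config.getWeightQuantization() == WeightQuantization::Fp8);
    return config;
}
```

In this sketch the order of calls matters: a preset applied after an independent setter would overwrite that setter's value, so overrides belong after the preset in the chain.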

Member Data Documentation

◆ context_length_

template<typename TDerived>
dim_t Mila::Dnn::LanguageModelConfig< TDerived >::context_length_ { 0 }
protected

◆ kv_cache_compression_

template<typename TDerived>
KvCacheCompression Mila::Dnn::LanguageModelConfig< TDerived >::kv_cache_compression_ { KvCacheCompression::None }
protected

◆ weight_quantization_

template<typename TDerived>
WeightQuantization Mila::Dnn::LanguageModelConfig< TDerived >::weight_quantization_ { WeightQuantization::None }
protected

The documentation for this struct was generated from the following file: