Mila 0.13.48
Deep Neural Network Library
CRTP base configuration for all deployable Mila language models.
Public Member Functions
| Return type | Member | Description |
| --- | --- | --- |
| | LanguageModelConfig ()=default | |
| | LanguageModelConfig (dim_t context_length) | Construct with a required context length. |
| std::string | baseToString () const | Produce the base fields portion of a toString() summary. |
| dim_t | getContextLength () const noexcept | |
| KvCacheCompression | getKvCacheCompression () const noexcept | |
| WeightQuantization | getWeightQuantization () const noexcept | |
| TDerived & | withContextLength (dim_t context_length) | Set the maximum sequence length. |
| TDerived & | withFP4Quantization () | FP4 quantization: FP4 weights, FP8 KV cache. |
| TDerived & | withFP8Quantization () | FP8 quantization: FP8 weights, FP8 KV cache. |
| TDerived & | withFullPrecision () | Full precision: BF16 weights, BF16 KV cache. |
| TDerived & | withKvCacheCompression (KvCacheCompression kv) | Set the KV cache compression mode independently. |
| TDerived & | withWeightQuantization (WeightQuantization wq) | Set the weight quantization mode independently. |
Protected Attributes
| dim_t | context_length_ { 0 } |
| KvCacheCompression | kv_cache_compression_ { KvCacheCompression::None } |
| WeightQuantization | weight_quantization_ { WeightQuantization::None } |
CRTP base configuration for all deployable Mila language models.
| TDerived | Concrete config type (e.g. LlamaModelConfig). All fluent setters return TDerived& to support unbroken chain syntax across base and derived methods. |
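The chaining behaviour described for TDerived can be illustrated with a reduced stand-in. The sketch below is not Mila's implementation: the base is cut down to a single setter, dim_t is assumed to be an unsigned integer alias, and ToyModelConfig with its withNumLayers() setter is hypothetical. It only shows why returning TDerived& from a base setter lets a derived setter follow in the same chain.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: dim_t is an unsigned integer alias; its real definition is not shown on this page.
using dim_t = std::size_t;

// Reduced stand-in for the documented base: one fluent setter and one getter.
template <typename TDerived>
class LanguageModelConfig {
public:
    TDerived& withContextLength(dim_t context_length) {
        context_length_ = context_length;
        // CRTP downcast: this object is always the base subobject of a TDerived.
        return static_cast<TDerived&>(*this);
    }

    dim_t getContextLength() const noexcept { return context_length_; }

protected:
    dim_t context_length_{0};
};

// Hypothetical concrete config with one architecture-specific setter.
class ToyModelConfig : public LanguageModelConfig<ToyModelConfig> {
public:
    ToyModelConfig& withNumLayers(dim_t num_layers) {
        num_layers_ = num_layers;
        return *this;
    }

    dim_t getNumLayers() const noexcept { return num_layers_; }

private:
    dim_t num_layers_{0};
};

// A base setter followed by a derived setter in one chain. This compiles only
// because withContextLength() returns ToyModelConfig& rather than a base reference.
inline ToyModelConfig makeToyConfig() {
    ToyModelConfig cfg;
    cfg.withContextLength(2048).withNumLayers(12);
    assert(cfg.getNumLayers() == 12);
    return cfg;
}
```

If the base setter returned LanguageModelConfig& instead, the call to withNumLayers() in the chain would fail to compile, which is the problem the CRTP parameter avoids.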
| LanguageModelConfig ()=default | default |
| LanguageModelConfig (dim_t context_length) | inline explicit |
Construct with a required context length.
| context_length | Maximum sequence length in tokens. Must be > 0. |
| std::invalid_argument | if context_length is zero. |
| std::string baseToString () const | inline |
Produce the base fields portion of a toString() summary.
Concrete model configs call this from their own toString() implementation and append architecture-specific fields.
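A derived toString() might compose its output along the following lines. This is a sketch under assumptions: the "context_length=" text produced by the mock baseToString() is invented for illustration (the real output format is not documented here), dim_t is assumed to be an unsigned integer alias, and ToyModelConfig is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Assumption: dim_t is an unsigned integer alias.
using dim_t = std::size_t;

// Reduced stand-in for the documented base.
template <typename TDerived>
class LanguageModelConfig {
public:
    // Hypothetical format; the real baseToString() layout is not specified on this page.
    std::string baseToString() const {
        std::ostringstream oss;
        oss << "context_length=" << context_length_;
        return oss.str();
    }

protected:
    dim_t context_length_{0};
};

// Hypothetical concrete config that appends its own field to the base summary.
class ToyModelConfig : public LanguageModelConfig<ToyModelConfig> {
public:
    ToyModelConfig(dim_t context_length, dim_t num_layers) : num_layers_(num_layers) {
        context_length_ = context_length;
    }

    std::string toString() const {
        std::ostringstream oss;
        // Base fields first, architecture-specific fields after.
        oss << "ToyModelConfig{" << baseToString() << ", num_layers=" << num_layers_ << "}";
        return oss.str();
    }

private:
    dim_t num_layers_{0};
};

// Builds a config and returns its summary string.
inline std::string describeToy() {
    const ToyModelConfig cfg(1024, 4);
    const std::string text = cfg.toString();
    assert(text.find("context_length=1024") != std::string::npos);
    return text;
}
```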
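| dim_t getContextLength () const noexcept | inline noexcept |
| KvCacheCompression getKvCacheCompression () const noexcept | inline noexcept |
| WeightQuantization getWeightQuantization () const noexcept | inline noexcept |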
| TDerived & withContextLength (dim_t context_length) | inline |
Set the maximum sequence length.
Required before passing the config to fromPretrained(). RoPE embeddings and KV cache buffers are sized to this value at build time.
| context_length | Maximum sequence length in tokens. Must be > 0. |
| std::invalid_argument | if context_length is zero. |
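The validation contract above can be exercised as follows. The mock setter is an assumption about how the check might look, not Mila's source; dim_t is assumed to be unsigned (so "zero" is the only invalid value), and ToyModelConfig and trySetContextLength() are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Assumption: dim_t is an unsigned integer alias.
using dim_t = std::size_t;

// Reduced stand-in for the documented base, with the documented zero check.
template <typename TDerived>
class LanguageModelConfig {
public:
    TDerived& withContextLength(dim_t context_length) {
        if (context_length == 0) {
            throw std::invalid_argument("context_length must be > 0");
        }
        context_length_ = context_length;
        return static_cast<TDerived&>(*this);
    }

    dim_t getContextLength() const noexcept { return context_length_; }

protected:
    dim_t context_length_{0};
};

// Hypothetical concrete config with no extra fields.
class ToyModelConfig : public LanguageModelConfig<ToyModelConfig> {};

// Returns true if the setter accepted the value, false if it threw std::invalid_argument.
inline bool trySetContextLength(ToyModelConfig& cfg, dim_t context_length) {
    try {
        cfg.withContextLength(context_length);
        assert(cfg.getContextLength() == context_length);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

In this sketch the check runs before the assignment, so a rejected call leaves the previously stored value untouched.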
| TDerived & withFP4Quantization () | inline |
FP4 quantization: FP4 weights, FP8 KV cache.
Maps to PerGroupFp4<> on Linear (future) and PerChannelKvFp8<> on GroupedQueryAttention. Aggressive compression; some quality loss relative to FP8. FP4 KV cache is not a Mila target.
| TDerived & withFP8Quantization () | inline |
FP8 quantization: FP8 weights, FP8 KV cache.
Maps to PerChannelFp8<> on Linear and PerChannelKvFp8<> on GroupedQueryAttention. Good quality/compression tradeoff for standard inference on Ada Lovelace and later.
| TDerived & withFullPrecision () | inline |
Full precision: BF16 weights, BF16 KV cache.
Resets both quantization axes to their defaults. Useful for explicitly documenting intent or overriding a previously set preset.
| TDerived & withKvCacheCompression (KvCacheCompression kv) | inline |
Set the KV cache compression mode independently.
Use when the desired KV cache compression does not pair with the default weight quantization of a preset, or when a preset does not exist for the desired combination.
| kv | KV cache compression mode to apply. |
| TDerived & withWeightQuantization (WeightQuantization wq) | inline |
Set the weight quantization mode independently.
Use when the desired weight quantization does not pair with the default KV cache compression of a preset, or when a preset does not exist for the desired combination.
| wq | Weight quantization mode to apply. |
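The relationship between the presets and the two independent setters can be sketched with a mock. Several things here are assumptions: only the None enumerators appear on this page, so FP8 and FP4 are hypothetical enumerator names; the setter bodies are a guess at the simplest behaviour consistent with the descriptions (each preset overwrites both axes, and the last call wins); ToyModelConfig is hypothetical.

```cpp
#include <cassert>

// Assumption: only ::None is confirmed by this page; FP8 and FP4 are hypothetical names.
enum class WeightQuantization { None, FP8, FP4 };
enum class KvCacheCompression { None, FP8 };

// Reduced stand-in for the documented base, limited to the quantization axes.
template <typename TDerived>
class LanguageModelConfig {
public:
    // Presets set both axes at once.
    TDerived& withFullPrecision() {
        weight_quantization_ = WeightQuantization::None;
        kv_cache_compression_ = KvCacheCompression::None;
        return self();
    }
    TDerived& withFP8Quantization() {
        weight_quantization_ = WeightQuantization::FP8;
        kv_cache_compression_ = KvCacheCompression::FP8;
        return self();
    }
    TDerived& withFP4Quantization() {
        weight_quantization_ = WeightQuantization::FP4;
        kv_cache_compression_ = KvCacheCompression::FP8;  // FP4 preset pairs with an FP8 KV cache
        return self();
    }

    // Independent setters touch one axis and leave the other unchanged.
    TDerived& withWeightQuantization(WeightQuantization wq) {
        weight_quantization_ = wq;
        return self();
    }
    TDerived& withKvCacheCompression(KvCacheCompression kv) {
        kv_cache_compression_ = kv;
        return self();
    }

    WeightQuantization getWeightQuantization() const noexcept { return weight_quantization_; }
    KvCacheCompression getKvCacheCompression() const noexcept { return kv_cache_compression_; }

protected:
    KvCacheCompression kv_cache_compression_{KvCacheCompression::None};
    WeightQuantization weight_quantization_{WeightQuantization::None};

private:
    TDerived& self() { return static_cast<TDerived&>(*this); }
};

// Hypothetical concrete config with no extra fields.
class ToyModelConfig : public LanguageModelConfig<ToyModelConfig> {};

// A combination with no preset: FP4 weights with an uncompressed KV cache.
// Start from the nearest preset, then override one axis.
inline ToyModelConfig makeFp4WeightsUncompressedCache() {
    ToyModelConfig cfg;
    cfg.withFP4Quantization().withKvCacheCompression(KvCacheCompression::None);
    assert(cfg.getWeightQuantization() == WeightQuantization::FP4);
    return cfg;
}
```

Under these assumptions, call order matters: applying a preset after an independent setter would overwrite the independently chosen value.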
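| dim_t context_length_ { 0 } | protected |
| KvCacheCompression kv_cache_compression_ { KvCacheCompression::None } | protected |
| WeightQuantization weight_quantization_ { WeightQuantization::None } | protected |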