Configuration for the BPE vocabulary.
More...
Configuration for the BPE vocabulary.
Describes both training hyperparameters and the runtime properties (byte-level encoding, pre-tokenization pattern, special token set) that apply to all BPE families. Serialized with vocabulary files to provide full provenance and to enable validation of vocabulary compatibility.
Typical usage for pretrained models (no training validation needed):
.withVocabSize( 128256 )
.withByteLevel( true )
BpeVocabularyConfig()=default
constexpr const char * LLAMA3_PRETOKENIZATION_PATTERN
Definition BpePreTokenizationMode.ixx:55
@ Llama3Regex
Definition BpePreTokenizationMode.ixx:24
static SpecialTokens llamaStyle()
Llama 3.x configuration.
Definition SpecialTokens.ixx:261
◆ BpeVocabularyConfig()
| Mila::Data::BpeVocabularyConfig::BpeVocabularyConfig |
( |
| ) |
|
|
default |
◆ fromMetadata()
Restore configuration from metadata.
All fields use tryGet* so that files produced by older builds without a given field fall back silently to the in-class defaults.
◆ getMaxMerges()
| size_t Mila::Data::BpeVocabularyConfig::getMaxMerges |
( |
| ) |
const |
|
inline |
◆ getMinFrequency()
| size_t Mila::Data::BpeVocabularyConfig::getMinFrequency |
( |
| ) |
const |
|
inline |
◆ getPreTokenizationMode()
◆ getPreTokenizationPattern()
| const std::string & Mila::Data::BpeVocabularyConfig::getPreTokenizationPattern |
( |
| ) |
const |
|
inline |
◆ getSpecialTokens()
| const SpecialTokens & Mila::Data::BpeVocabularyConfig::getSpecialTokens |
( |
| ) |
const |
|
inline |
◆ getVocabSize()
| size_t Mila::Data::BpeVocabularyConfig::getVocabSize |
( |
| ) |
const |
|
inline |
◆ isByteLevel()
| bool Mila::Data::BpeVocabularyConfig::isByteLevel |
( |
| ) |
const |
|
inline |
◆ isMergeCachingEnabled()
| bool Mila::Data::BpeVocabularyConfig::isMergeCachingEnabled |
( |
| ) |
const |
|
inline |
◆ toMetadata()
Serialize configuration to metadata.
Persists all fields including token strings so that round-tripped vocabularies reproduce the correct special token set on load.
◆ toString()
| std::string Mila::Data::BpeVocabularyConfig::toString |
( |
| ) |
const |
|
inline |
◆ validate()
| void Mila::Data::BpeVocabularyConfig::validate |
( |
| ) |
const |
|
inline |
Validate configuration for training.
Called by BpeTrainer before training begins. Must not be called for pretrained vocabularies loaded via factory methods.
- Exceptions
-
| std::invalid_argument | on invalid training configuration. |
◆ withByteLevel()
◆ withMaxMerges()
◆ withMergeCaching()
◆ withMinFrequency()
◆ withPreTokenization()
◆ withPreTokenizationPattern()
| BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withPreTokenizationPattern |
( |
const std::string & | pattern | ) |
|
|
inline |
◆ withSpecialTokens()
◆ withVocabSize()
◆ byte_level_
| bool Mila::Data::BpeVocabularyConfig::byte_level_ = true |
|
private |
◆ enable_merge_caching_
| bool Mila::Data::BpeVocabularyConfig::enable_merge_caching_ = true |
|
private |
◆ max_merges_
| size_t Mila::Data::BpeVocabularyConfig::max_merges_ = 0 |
|
private |
◆ min_frequency_
| size_t Mila::Data::BpeVocabularyConfig::min_frequency_ = 2 |
|
private |
◆ pre_tokenization_mode_
◆ pre_tokenization_pattern_
| std::string Mila::Data::BpeVocabularyConfig::pre_tokenization_pattern_ = "" |
|
private |
◆ special_tokens_
◆ vocab_size_
| size_t Mila::Data::BpeVocabularyConfig::vocab_size_ = 32000 |
|
private |
The documentation for this class was generated from the following file: