Mila 0.13.48
Deep Neural Network Library
Mila::Data::BpeVocabularyConfig Class Reference (export)

Configuration for the BPE vocabulary.

Public Member Functions

 BpeVocabularyConfig ()=default
void fromMetadata (const SerializationMetadata &meta)
 Restore configuration from metadata.
size_t getMaxMerges () const
size_t getMinFrequency () const
PreTokenizationMode getPreTokenizationMode () const
const std::string & getPreTokenizationPattern () const
const SpecialTokens & getSpecialTokens () const
size_t getVocabSize () const
bool isByteLevel () const
bool isMergeCachingEnabled () const
SerializationMetadata toMetadata () const
 Serialize configuration to metadata.
std::string toString () const
void validate () const
 Validate configuration for training.
BpeVocabularyConfig & withByteLevel (bool byte_level)
BpeVocabularyConfig & withMaxMerges (size_t max_merges)
BpeVocabularyConfig & withMergeCaching (bool enable)
BpeVocabularyConfig & withMinFrequency (size_t frequency)
BpeVocabularyConfig & withPreTokenization (PreTokenizationMode mode)
BpeVocabularyConfig & withPreTokenizationPattern (const std::string &pattern)
BpeVocabularyConfig & withSpecialTokens (const SpecialTokens &tokens)
BpeVocabularyConfig & withVocabSize (size_t size)

Private Attributes

bool byte_level_ = true
bool enable_merge_caching_ = true
size_t max_merges_ = 0
size_t min_frequency_ = 2
PreTokenizationMode pre_tokenization_mode_ = PreTokenizationMode::None
std::string pre_tokenization_pattern_ = ""
SpecialTokens special_tokens_ = SpecialTokens::standard()
size_t vocab_size_ = 32000

Detailed Description

Configuration for the BPE vocabulary.

Describes both training hyperparameters and the runtime properties (byte-level encoding, pre-tokenization pattern, special token set) that apply to all BPE families. Serialized with vocabulary files to provide full provenance and to enable validation of vocabulary compatibility.

Typical usage for pretrained models (no training validation needed):

auto config = BpeVocabularyConfig()
    .withVocabSize( 128256 )
    .withByteLevel( true )
    .withPreTokenization( PreTokenizationMode::Llama3Regex )
    .withPreTokenizationPattern( LLAMA3_PRETOKENIZATION_PATTERN )
    .withSpecialTokens( SpecialTokens::llamaStyle() );
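
The chained calls above work because each with* setter returns a reference to the same object. The following is a minimal sketch of that fluent pattern, not the Mila implementation: FluentSketchConfig and makeFluentSketchConfig are hypothetical stand-ins that model only two of the fields, with the default value copied from the member list on this page.

```cpp
#include <cstddef>

// Hypothetical stand-in, NOT the Mila class: only two fields are modelled.
class FluentSketchConfig {
public:
    // Each setter returns *this by reference so that calls can be chained.
    FluentSketchConfig& withVocabSize(std::size_t size) {
        vocab_size_ = size;
        return *this;
    }

    FluentSketchConfig& withByteLevel(bool byte_level) {
        byte_level_ = byte_level;
        return *this;
    }

    std::size_t getVocabSize() const { return vocab_size_; }
    bool isByteLevel() const { return byte_level_; }

private:
    std::size_t vocab_size_ = 32000;  // default taken from the member list
    bool byte_level_ = true;          // default taken from the member list
};

// Builds a config in a single expression. The temporary lives until the end
// of the full expression, so copying from the returned reference is safe.
inline FluentSketchConfig makeFluentSketchConfig() {
    auto config = FluentSketchConfig()
        .withVocabSize(128256)
        .withByteLevel(false);
    return config;
}
```

Note that `auto` deduces a value type here, so `config` is a copy of the temporary rather than a dangling reference to it. Binding the chain to `auto&` or `const auto&` instead would leave a reference to a destroyed temporary.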

Constructor & Destructor Documentation

◆ BpeVocabularyConfig()

Mila::Data::BpeVocabularyConfig::BpeVocabularyConfig ( )
default

Member Function Documentation

◆ fromMetadata()

void Mila::Data::BpeVocabularyConfig::fromMetadata ( const SerializationMetadata & meta)
inline

Restore configuration from metadata.

All fields use tryGet* so that files produced by older builds without a given field fall back silently to the in-class defaults.
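
The fallback behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the Mila code: SketchMetadata, its tryGetSize method, the key names, and FallbackSketchConfig are all hypothetical, since the actual SerializationMetadata interface is not shown on this page. Only the default values (32000 and 2) come from the member list.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for SerializationMetadata; only the lookup shape matters.
class SketchMetadata {
public:
    void setSize(const std::string& key, std::size_t value) { sizes_[key] = value; }

    // Returns std::nullopt when the key was never written.
    std::optional<std::size_t> tryGetSize(const std::string& key) const {
        auto it = sizes_.find(key);
        if (it == sizes_.end()) return std::nullopt;
        return it->second;
    }

private:
    std::unordered_map<std::string, std::size_t> sizes_;
};

// Hypothetical config with two of the documented fields and their defaults.
struct FallbackSketchConfig {
    std::size_t vocab_size = 32000;
    std::size_t min_frequency = 2;

    // Overwrite a field only when its key is present; otherwise keep the default.
    void fromMetadata(const SketchMetadata& meta) {
        if (auto v = meta.tryGetSize("vocab_size")) vocab_size = *v;
        if (auto v = meta.tryGetSize("min_frequency")) min_frequency = *v;
    }
};

// Simulates loading a file written by an older build that lacks "min_frequency".
inline FallbackSketchConfig loadFromOlderFile() {
    SketchMetadata meta;
    meta.setSize("vocab_size", 50000);
    FallbackSketchConfig config;
    config.fromMetadata(meta);
    return config;
}
```

In this sketch a missing key is indistinguishable from a key that was written with the default value, which is the trade-off of a silent fallback.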


◆ getMaxMerges()

size_t Mila::Data::BpeVocabularyConfig::getMaxMerges ( ) const
inline

◆ getMinFrequency()

size_t Mila::Data::BpeVocabularyConfig::getMinFrequency ( ) const
inline

◆ getPreTokenizationMode()

PreTokenizationMode Mila::Data::BpeVocabularyConfig::getPreTokenizationMode ( ) const
inline

◆ getPreTokenizationPattern()

const std::string & Mila::Data::BpeVocabularyConfig::getPreTokenizationPattern ( ) const
inline

◆ getSpecialTokens()

const SpecialTokens & Mila::Data::BpeVocabularyConfig::getSpecialTokens ( ) const
inline

◆ getVocabSize()

size_t Mila::Data::BpeVocabularyConfig::getVocabSize ( ) const
inline

◆ isByteLevel()

bool Mila::Data::BpeVocabularyConfig::isByteLevel ( ) const
inline

◆ isMergeCachingEnabled()

bool Mila::Data::BpeVocabularyConfig::isMergeCachingEnabled ( ) const
inline

◆ toMetadata()

SerializationMetadata Mila::Data::BpeVocabularyConfig::toMetadata ( ) const
inline

Serialize configuration to metadata.

Persists all fields including token strings so that round-tripped vocabularies reproduce the correct special token set on load.

◆ toString()

std::string Mila::Data::BpeVocabularyConfig::toString ( ) const
inline

◆ validate()

void Mila::Data::BpeVocabularyConfig::validate ( ) const
inline

Validate configuration for training.

Called by BpeTrainer before training begins. Must not be called for pretrained vocabularies loaded via factory methods.

Exceptions
std::invalid_argument on invalid training configuration.
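
A caller that wants to handle an invalid training configuration can catch the exception. The sketch below shows one possible shape. TrainingSketchConfig, isTrainable, and the two checks inside validate() are assumptions made for illustration; this page does not document which conditions the real validate() enforces.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in with two of the documented fields and their defaults.
struct TrainingSketchConfig {
    std::size_t vocab_size = 32000;
    std::size_t min_frequency = 2;

    // Assumed checks only; the real validation rules are not documented here.
    void validate() const {
        if (vocab_size == 0) {
            throw std::invalid_argument("vocab_size must be greater than zero");
        }
        if (min_frequency == 0) {
            throw std::invalid_argument("min_frequency must be at least 1");
        }
    }
};

// Returns true when the configuration passes validation, false otherwise.
inline bool isTrainable(const TrainingSketchConfig& config) {
    try {
        config.validate();
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```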

◆ withByteLevel()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withByteLevel ( bool byte_level)
inline

◆ withMaxMerges()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withMaxMerges ( size_t max_merges)
inline

◆ withMergeCaching()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withMergeCaching ( bool enable)
inline

◆ withMinFrequency()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withMinFrequency ( size_t frequency)
inline

◆ withPreTokenization()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withPreTokenization ( PreTokenizationMode mode)
inline

◆ withPreTokenizationPattern()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withPreTokenizationPattern ( const std::string & pattern)
inline

◆ withSpecialTokens()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withSpecialTokens ( const SpecialTokens & tokens)
inline

◆ withVocabSize()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withVocabSize ( size_t size)
inline

Member Data Documentation

◆ byte_level_

bool Mila::Data::BpeVocabularyConfig::byte_level_ = true
private

◆ enable_merge_caching_

bool Mila::Data::BpeVocabularyConfig::enable_merge_caching_ = true
private

◆ max_merges_

size_t Mila::Data::BpeVocabularyConfig::max_merges_ = 0
private

◆ min_frequency_

size_t Mila::Data::BpeVocabularyConfig::min_frequency_ = 2
private

◆ pre_tokenization_mode_

PreTokenizationMode Mila::Data::BpeVocabularyConfig::pre_tokenization_mode_ = PreTokenizationMode::None
private

◆ pre_tokenization_pattern_

std::string Mila::Data::BpeVocabularyConfig::pre_tokenization_pattern_ = ""
private

◆ special_tokens_

SpecialTokens Mila::Data::BpeVocabularyConfig::special_tokens_ = SpecialTokens::standard()
private

◆ vocab_size_

size_t Mila::Data::BpeVocabularyConfig::vocab_size_ = 32000
private

The documentation for this class was generated from the following file: