Configuration for the BPE vocabulary.

Public Member Functions
	BpeVocabularyConfig ()=default
void	fromMetadata (const SerializationMetadata &meta)
	Restore configuration from metadata.
size_t	getMaxMerges () const
size_t	getMinFrequency () const
PreTokenizationMode	getPreTokenizationMode () const
const std::string &	getPreTokenizationPattern () const
const SpecialTokens &	getSpecialTokens () const
size_t	getVocabSize () const
bool	isByteLevel () const
bool	isMergeCachingEnabled () const
SerializationMetadata	toMetadata () const
	Serialize configuration to metadata.
std::string	toString () const
void	validate () const
	Validate configuration for training.
BpeVocabularyConfig &	withByteLevel (bool byte_level)
BpeVocabularyConfig &	withMaxMerges (size_t max_merges)
BpeVocabularyConfig &	withMergeCaching (bool enable)
BpeVocabularyConfig &	withMinFrequency (size_t frequency)
BpeVocabularyConfig &	withPreTokenization (PreTokenizationMode mode)
BpeVocabularyConfig &	withPreTokenizationPattern (const std::string &pattern)
BpeVocabularyConfig &	withSpecialTokens (const SpecialTokens &tokens)
BpeVocabularyConfig &	withVocabSize (size_t size)

Private Attributes
bool	byte_level_ = true
bool	enable_merge_caching_ = true
size_t	max_merges_ = 0
size_t	min_frequency_ = 2
PreTokenizationMode	pre_tokenization_mode_ = PreTokenizationMode::None
std::string	pre_tokenization_pattern_ = ""
SpecialTokens	special_tokens_ = SpecialTokens::standard()
size_t	vocab_size_ = 32000

Detailed Description

Configuration for the BPE vocabulary.

Describes both training hyperparameters and the runtime properties (byte-level encoding, pre-tokenization pattern, special token set) that apply to all BPE families. Serialized with vocabulary files to provide full provenance and to enable validation of vocabulary compatibility.

Typical usage for pretrained models (no training validation needed):

auto config = BpeVocabularyConfig()
    .withVocabSize( 128256 )
    .withByteLevel( true )
    .withPreTokenization( PreTokenizationMode::Llama3Regex )
    .withPreTokenizationPattern( LLAMA3_PRETOKENIZATION_PATTERN )
    .withSpecialTokens( SpecialTokens::llamaStyle() );
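
The chained calls above work because each with* setter returns a reference to the object it just modified. The following is a minimal sketch of that pattern, not the Mila implementation: FluentConfigSketch and makeSketchConfig are hypothetical names, and only two of the documented fields are mirrored.

```cpp
#include <cstddef>

// Hypothetical stand-in for illustration only; not Mila::Data::BpeVocabularyConfig.
class FluentConfigSketch {
public:
    // Each setter stores the value and returns *this so calls can be chained.
    FluentConfigSketch& withVocabSize(std::size_t size) {
        vocab_size_ = size;
        return *this;
    }

    FluentConfigSketch& withByteLevel(bool byte_level) {
        byte_level_ = byte_level;
        return *this;
    }

    std::size_t getVocabSize() const { return vocab_size_; }
    bool isByteLevel() const { return byte_level_; }

private:
    std::size_t vocab_size_ = 32000;  // same default as documented for vocab_size_
    bool byte_level_ = true;          // same default as documented for byte_level_
};

// Builds a config in a single expression. The temporary lives until the end of
// the full expression, and the by-value return copies it before it is destroyed.
inline FluentConfigSketch makeSketchConfig() {
    return FluentConfigSketch().withVocabSize(128256).withByteLevel(false);
}
```

Returning a reference rather than a new object keeps each call cheap; the only copy happens when the finished configuration is stored or returned by value.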

Constructor & Destructor Documentation

◆ BpeVocabularyConfig()

Mila::Data::BpeVocabularyConfig::BpeVocabularyConfig ( )

default

Member Function Documentation

◆ fromMetadata()

void Mila::Data::BpeVocabularyConfig::fromMetadata ( const SerializationMetadata & meta )

inline

Restore configuration from metadata.

Every field is read with a tryGet* accessor, so a file written by an older build that lacks a given field silently falls back to that field's in-class default.
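
The fallback behavior can be sketched as follows. This is an illustration under stated assumptions, not the actual code: MetadataSketch, its setInt and tryGetInt members, the key names, and the two-field RestoredConfigSketch are all hypothetical, and the real SerializationMetadata interface may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical metadata container; the real SerializationMetadata API may differ.
class MetadataSketch {
public:
    void setInt(const std::string& key, std::int64_t value) { ints_[key] = value; }

    // Returns std::nullopt when the key is absent instead of throwing.
    std::optional<std::int64_t> tryGetInt(const std::string& key) const {
        auto it = ints_.find(key);
        if (it == ints_.end()) return std::nullopt;
        return it->second;
    }

private:
    std::map<std::string, std::int64_t> ints_;
};

// Hypothetical two-field config using the defaults documented for the real class.
struct RestoredConfigSketch {
    std::size_t vocab_size = 32000;
    std::size_t min_frequency = 2;

    // Overwrite a field only when the metadata carries it; otherwise keep the default.
    void fromMetadata(const MetadataSketch& meta) {
        if (auto v = meta.tryGetInt("vocab_size"))
            vocab_size = static_cast<std::size_t>(*v);
        if (auto v = meta.tryGetInt("min_frequency"))
            min_frequency = static_cast<std::size_t>(*v);
    }
};
```

With this shape, metadata that only stores vocab_size restores that value and leaves min_frequency at 2, which is the silent fallback described above.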

◆ getMaxMerges()

size_t Mila::Data::BpeVocabularyConfig::getMaxMerges ( ) const

inline

◆ getMinFrequency()

size_t Mila::Data::BpeVocabularyConfig::getMinFrequency ( ) const

inline

◆ getPreTokenizationMode()

PreTokenizationMode Mila::Data::BpeVocabularyConfig::getPreTokenizationMode ( ) const

inline

◆ getPreTokenizationPattern()

const std::string & Mila::Data::BpeVocabularyConfig::getPreTokenizationPattern ( ) const

inline

◆ getSpecialTokens()

const SpecialTokens & Mila::Data::BpeVocabularyConfig::getSpecialTokens ( ) const

inline

◆ getVocabSize()

size_t Mila::Data::BpeVocabularyConfig::getVocabSize ( ) const

inline

◆ isByteLevel()

bool Mila::Data::BpeVocabularyConfig::isByteLevel ( ) const

inline

◆ isMergeCachingEnabled()

bool Mila::Data::BpeVocabularyConfig::isMergeCachingEnabled ( ) const

inline

◆ toMetadata()

SerializationMetadata Mila::Data::BpeVocabularyConfig::toMetadata ( ) const

inline

Serialize configuration to metadata.

Persists all fields including token strings so that round-tripped vocabularies reproduce the correct special token set on load.
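
A round trip of this kind might look like the sketch below. It assumes a flat string-to-string map as the metadata container and a reduced config with one numeric field and one token string; MetadataMapSketch, RoundTripConfigSketch, roundTrip, the key names, and the token text are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical flat key/value metadata; the real SerializationMetadata type may differ.
using MetadataMapSketch = std::map<std::string, std::string>;

// Hypothetical reduced config: one numeric field and one special-token string.
struct RoundTripConfigSketch {
    std::size_t vocab_size = 32000;
    std::string eos_token = "</s>";  // placeholder token text chosen for this sketch

    // Write every field, including the token string itself.
    MetadataMapSketch toMetadata() const {
        MetadataMapSketch meta;
        meta["vocab_size"] = std::to_string(vocab_size);
        meta["eos_token"] = eos_token;
        return meta;
    }

    // Read back whatever is present; absent keys keep their defaults.
    void fromMetadata(const MetadataMapSketch& meta) {
        if (auto it = meta.find("vocab_size"); it != meta.end())
            vocab_size = static_cast<std::size_t>(std::stoull(it->second));
        if (auto it = meta.find("eos_token"); it != meta.end())
            eos_token = it->second;
    }

    bool operator==(const RoundTripConfigSketch& other) const {
        return vocab_size == other.vocab_size && eos_token == other.eos_token;
    }
};

// Serializes a config and restores it into a fresh, default-constructed object.
inline RoundTripConfigSketch roundTrip(const RoundTripConfigSketch& original) {
    RoundTripConfigSketch restored;
    restored.fromMetadata(original.toMetadata());
    return restored;
}
```

Storing the token string itself, rather than only an ID, is what lets the restored object compare equal to the original even when the loader starts from different defaults.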

◆ toString()

std::string Mila::Data::BpeVocabularyConfig::toString ( ) const

inline

◆ validate()

void Mila::Data::BpeVocabularyConfig::validate ( ) const

inline

Validate configuration for training.

Called by BpeTrainer before training begins. Must not be called for pretrained vocabularies loaded via factory methods.

Exceptions

std::invalid_argument on invalid training configuration.
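
The validate-then-train contract can be sketched as below. The specific checks are assumptions made for illustration, since this page does not list the rules validate() enforces; TrainingConfigSketch and isUsableForTraining are hypothetical names.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical reduced config; the checks below are assumed, not Mila's actual rules.
struct TrainingConfigSketch {
    std::size_t vocab_size = 32000;
    std::size_t min_frequency = 2;

    // Throws std::invalid_argument when a training hyperparameter is unusable.
    void validate() const {
        if (vocab_size == 0)
            throw std::invalid_argument("vocab_size must be greater than zero");
        if (min_frequency == 0)
            throw std::invalid_argument("min_frequency must be at least 1");
    }
};

// Caller-side pattern: run validate() before training and report failure as a bool.
inline bool isUsableForTraining(const TrainingConfigSketch& config) {
    try {
        config.validate();
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

A trainer would typically let the exception propagate instead of converting it to a bool; the wrapper here only makes the two outcomes easy to observe.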

◆ withByteLevel()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withByteLevel ( bool byte_level )

inline

◆ withMaxMerges()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withMaxMerges ( size_t max_merges )

inline

◆ withMergeCaching()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withMergeCaching ( bool enable )

inline

◆ withMinFrequency()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withMinFrequency ( size_t frequency )

inline

◆ withPreTokenization()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withPreTokenization ( PreTokenizationMode mode )

inline

◆ withPreTokenizationPattern()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withPreTokenizationPattern ( const std::string & pattern )

inline

◆ withSpecialTokens()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withSpecialTokens ( const SpecialTokens & tokens )

inline

◆ withVocabSize()

BpeVocabularyConfig & Mila::Data::BpeVocabularyConfig::withVocabSize ( size_t size )

inline

Member Data Documentation

◆ byte_level_

bool Mila::Data::BpeVocabularyConfig::byte_level_ = true

private

◆ enable_merge_caching_

bool Mila::Data::BpeVocabularyConfig::enable_merge_caching_ = true

private

◆ max_merges_

size_t Mila::Data::BpeVocabularyConfig::max_merges_ = 0

private

◆ min_frequency_

size_t Mila::Data::BpeVocabularyConfig::min_frequency_ = 2

private

◆ pre_tokenization_mode_

PreTokenizationMode Mila::Data::BpeVocabularyConfig::pre_tokenization_mode_ = PreTokenizationMode::None

private

◆ pre_tokenization_pattern_

std::string Mila::Data::BpeVocabularyConfig::pre_tokenization_pattern_ = ""

private

◆ special_tokens_

SpecialTokens Mila::Data::BpeVocabularyConfig::special_tokens_ = SpecialTokens::standard()

private

◆ vocab_size_

size_t Mila::Data::BpeVocabularyConfig::vocab_size_ = 32000

private

The documentation for this class was generated from the following file:

/__w/Mila/Mila/Mila/Src/Data/Tokenizers/Bpe/BpeVocabularyConfig.ixx
