Configuration for Character-level tokenizer training.

Public Member Functions
	CharVocabularyConfig ()=default
void	fromMetadata (const SerializationMetadata &meta)
	Populate configuration from metadata.
const SpecialTokens &	getSpecialTokens () const
bool	isByteLevel () const
bool	isCaseSensitive () const
bool	shouldNormalizeUnicode () const
SerializationMetadata	toMetadata () const
	Convert configuration to metadata for serialization.
std::string	toString () const
	Produce human-readable summary of configuration.
void	validate () const
	Validate configuration parameters.
CharVocabularyConfig &	withByteLevel (bool byte_level)
	Set whether to use byte-level encoding.
CharVocabularyConfig &	withCaseSensitive (bool sensitive)
	Set whether tokenization is case-sensitive.
CharVocabularyConfig &	withNormalizeUnicode (bool normalize)
	Set whether to normalize Unicode characters.
CharVocabularyConfig &	withSpecialTokens (const SpecialTokens &tokens)
	Configure special tokens.

Private Attributes
bool	byte_level_ = false
bool	case_sensitive_ = true
bool	normalize_unicode_ = false
SpecialTokens	special_tokens_ {}

Detailed Description

Configuration for Character-level tokenizer training.

Character tokenizers split text into individual characters (or bytes). This is the simplest tokenization approach with minimal configuration needs.

Fluent interface allows chaining:

auto config = CharVocabularyConfig()
    .withSpecialTokens(my_tokens)
    .withCaseSensitive(false)
    .withNormalizeUnicode(true);
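
Chaining works because each setter returns a reference to the object it was called on, as the `CharVocabularyConfig &` return types in the member list indicate. The following is a minimal sketch of that pattern only. `FluentSketch` and `makeLowercaseNormalizedConfig` are hypothetical names introduced for illustration and are not part of Mila.

```cpp
// Hypothetical stand-in, not the Mila class. It mirrors only the
// setter signatures and default values listed in this reference.
struct FluentSketch {
    bool byte_level = false;
    bool case_sensitive = true;
    bool normalize_unicode = false;

    // Each setter stores the value and returns *this so calls can be chained.
    FluentSketch& withByteLevel(bool v) { byte_level = v; return *this; }
    FluentSketch& withCaseSensitive(bool v) { case_sensitive = v; return *this; }
    FluentSketch& withNormalizeUnicode(bool v) { normalize_unicode = v; return *this; }
};

// Builds a configuration in a single expression. The chain runs on a
// temporary, and `auto` copies its final state into `config`.
inline FluentSketch makeLowercaseNormalizedConfig() {
    auto config = FluentSketch()
        .withCaseSensitive(false)
        .withNormalizeUnicode(true);
    return config;
}
```

Setters that are not called leave their defaults untouched, so `byte_level` remains `false` in the sketch above.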

Constructor & Destructor Documentation

◆ CharVocabularyConfig()

Mila::Data::CharVocabularyConfig::CharVocabularyConfig ( )

default

Member Function Documentation

◆ fromMetadata()

void Mila::Data::CharVocabularyConfig::fromMetadata ( const SerializationMetadata & meta )

inline

Populate configuration from metadata.

Missing keys are ignored, leaving the corresponding defaults intact.

Parameters

meta	Metadata to read configuration from.
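
The "missing keys are ignored" behavior can be sketched as follows. This reference does not show the `SerializationMetadata` API or the key names Mila uses, so the sketch assumes a plain `std::map` as the metadata container. `SketchMetadata`, `MetadataSketch`, and the keys `"byte_level"` and `"case_sensitive"` are all hypothetical.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for SerializationMetadata: a flat key/value map.
using SketchMetadata = std::map<std::string, bool>;

// Hypothetical stand-in for the config, with two of its documented defaults.
struct MetadataSketch {
    bool byte_level = false;
    bool case_sensitive = true;

    // Overwrite a field only when its key is present in the metadata.
    // Absent keys leave the current (default) value untouched.
    void fromMetadata(const SketchMetadata& meta) {
        SketchMetadata::const_iterator it = meta.find("byte_level");
        if (it != meta.end()) {
            byte_level = it->second;
        }
        it = meta.find("case_sensitive");
        if (it != meta.end()) {
            case_sensitive = it->second;
        }
    }
};
```

With metadata that contains only `"byte_level"`, the sketch updates `byte_level` and keeps `case_sensitive` at its default of `true`.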

◆ getSpecialTokens()

const SpecialTokens & Mila::Data::CharVocabularyConfig::getSpecialTokens ( ) const

inline

◆ isByteLevel()

bool Mila::Data::CharVocabularyConfig::isByteLevel ( ) const

inline

◆ isCaseSensitive()

bool Mila::Data::CharVocabularyConfig::isCaseSensitive ( ) const

inline

◆ shouldNormalizeUnicode()

bool Mila::Data::CharVocabularyConfig::shouldNormalizeUnicode ( ) const

inline

◆ toMetadata()

SerializationMetadata Mila::Data::CharVocabularyConfig::toMetadata ( ) const

inline

Convert configuration to metadata for serialization.

Returns: SerializationMetadata containing all configuration parameters.

◆ toString()

std::string Mila::Data::CharVocabularyConfig::toString ( ) const

inline

Produce human-readable summary of configuration.

Suitable for logging and debugging.

Returns: Configuration summary as a std::string.

◆ validate()

void Mila::Data::CharVocabularyConfig::validate ( ) const

inline

Validate configuration parameters.

Exceptions

std::invalid_argument if configuration is invalid.

◆ withByteLevel()

CharVocabularyConfig & Mila::Data::CharVocabularyConfig::withByteLevel ( bool byte_level )

inline

Set whether to use byte-level encoding.

When true, operates on raw UTF-8 bytes instead of Unicode characters. This guarantees any text can be represented but increases sequence length.

Parameters

byte_level True for byte-level, false for character-level (default).

Returns: Reference to this config for method chaining.
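
The sequence-length cost of byte-level mode can be illustrated by counting both units for the same string. This is a sketch that assumes valid UTF-8 input. The helpers `countBytes` and `countCodePoints` are hypothetical and do not reflect how Mila tokenizes internally.

```cpp
#include <cstddef>
#include <string>

// Byte-level view: one token per UTF-8 byte.
inline std::size_t countBytes(const std::string& utf8) {
    return utf8.size();
}

// Character-level view: one token per Unicode code point.
// UTF-8 continuation bytes have the bit pattern 10xxxxxx, so counting
// every byte that is NOT a continuation byte counts code points.
inline std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the UTF-8 text "héllo", `countCodePoints` gives 5 while `countBytes` gives 6, because "é" (U+00E9) occupies two bytes. In exchange, a byte-level vocabulary never needs more than 256 base entries.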

◆ withCaseSensitive()

CharVocabularyConfig & Mila::Data::CharVocabularyConfig::withCaseSensitive ( bool sensitive )

inline

Set whether tokenization is case-sensitive.

When false, text is converted to lowercase before tokenization. This reduces vocabulary size but loses case information.

Parameters

sensitive True for case-sensitive (default), false for case-insensitive.

Returns: Reference to this config for method chaining.
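
The vocabulary-size effect of lowercasing can be seen by counting distinct characters before and after folding. This sketch handles ASCII only, using `std::tolower` in the default "C" locale. How Mila lowercases non-ASCII text is not stated in this reference. `toLowerAscii` and `distinctChars` are hypothetical helpers.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Lowercase ASCII letters. The cast to unsigned char avoids undefined
// behavior when std::tolower receives a negative char value.
inline std::string toLowerAscii(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// Size of the character vocabulary implied by a text: its distinct chars.
inline std::size_t distinctChars(const std::string& text) {
    return std::set<char>(text.begin(), text.end()).size();
}
```

For "Hello HELLO", `distinctChars` reports 8 distinct characters on the raw text and 5 after `toLowerAscii`, at the cost of no longer distinguishing "Hello" from "HELLO".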

◆ withNormalizeUnicode()

CharVocabularyConfig & Mila::Data::CharVocabularyConfig::withNormalizeUnicode ( bool normalize )

inline

Set whether to normalize Unicode characters.

When true, applies Unicode normalization (e.g., NFC) to ensure consistent representation of characters with multiple encodings.

Parameters

normalize True to normalize Unicode, false otherwise (default).

Returns: Reference to this config for method chaining.
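
The problem that normalization addresses can be shown with the two standard UTF-8 encodings of "é". The C++ standard library has no normalization routine, so this sketch only demonstrates that the two forms differ as byte strings. It does not perform normalization, and the function names are hypothetical.

```cpp
#include <string>

// "é" as one precomposed code point, U+00E9 (the form NFC produces).
inline std::string precomposedEAcute() {
    return "\xC3\xA9";
}

// "é" as "e" followed by the combining acute accent U+0301
// (the form NFD produces).
inline std::string decomposedEAcute() {
    return "e" "\xCC\x81";
}
```

Both strings render identically, yet they compare unequal and have different lengths (2 bytes versus 3). Without normalization, a character-level tokenizer would see them as different token sequences.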

◆ withSpecialTokens()

CharVocabularyConfig & Mila::Data::CharVocabularyConfig::withSpecialTokens ( const SpecialTokens & tokens )

inline

Configure special tokens.

Parameters

tokens SpecialTokens configuration.

Returns: Reference to this config for method chaining.

Member Data Documentation

◆ byte_level_

bool Mila::Data::CharVocabularyConfig::byte_level_ = false

private

◆ case_sensitive_

bool Mila::Data::CharVocabularyConfig::case_sensitive_ = true

private

◆ normalize_unicode_

bool Mila::Data::CharVocabularyConfig::normalize_unicode_ = false

private

◆ special_tokens_

SpecialTokens Mila::Data::CharVocabularyConfig::special_tokens_ {}

private

The documentation for this class was generated from the following file:

/__w/Mila/Mila/Mila/Src/Data/Tokenizers/Char/CharVocabularyConfig.ixx
