Mila 0.13.48
Deep Neural Network Library
Mila::Data::CharVocabularyConfig Class Reference (export)

Configuration for character-level tokenizer training.

Public Member Functions

 CharVocabularyConfig ()=default
void fromMetadata (const SerializationMetadata &meta)
 Populate configuration from metadata.
const SpecialTokens & getSpecialTokens () const
bool isByteLevel () const
bool isCaseSensitive () const
bool shouldNormalizeUnicode () const
SerializationMetadata toMetadata () const
 Convert configuration to metadata for serialization.
std::string toString () const
 Produce human-readable summary of configuration.
void validate () const
 Validate configuration parameters.
CharVocabularyConfig & withByteLevel (bool byte_level)
 Set whether to use byte-level encoding.
CharVocabularyConfig & withCaseSensitive (bool sensitive)
 Set whether tokenization is case-sensitive.
CharVocabularyConfig & withNormalizeUnicode (bool normalize)
 Set whether to normalize Unicode characters.
CharVocabularyConfig & withSpecialTokens (const SpecialTokens &tokens)
 Configure special tokens.

Private Attributes

bool byte_level_ = false
bool case_sensitive_ = true
bool normalize_unicode_ = false
SpecialTokens special_tokens_ {}

Detailed Description

Configuration for character-level tokenizer training.

Character tokenizers split text into individual characters (or bytes). This is the simplest tokenization approach with minimal configuration needs.

Fluent interface allows chaining:

auto config = CharVocabularyConfig()
    .withSpecialTokens(my_tokens)
    .withCaseSensitive(false)
    .withNormalizeUnicode(true);
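
The chaining above works because each setter returns a reference to the object it was called on. The following is a minimal, self-contained sketch of that pattern, not the library's implementation: `FluentConfigSketch` and `SpecialTokensSketch` are hypothetical stand-ins, and the `pad` and `unk` fields are assumptions, since the real `SpecialTokens` layout is not shown on this page.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for Mila::Data::SpecialTokens (fields are assumed).
struct SpecialTokensSketch {
    std::string pad = "<pad>";
    std::string unk = "<unk>";
};

// Hypothetical stand-in that mirrors the documented setters and getters.
class FluentConfigSketch {
public:
    FluentConfigSketch& withSpecialTokens(const SpecialTokensSketch& tokens) {
        special_tokens_ = tokens;
        return *this;  // returning *this is what allows chaining
    }
    FluentConfigSketch& withCaseSensitive(bool sensitive) {
        case_sensitive_ = sensitive;
        return *this;
    }
    FluentConfigSketch& withNormalizeUnicode(bool normalize) {
        normalize_unicode_ = normalize;
        return *this;
    }
    FluentConfigSketch& withByteLevel(bool byte_level) {
        byte_level_ = byte_level;
        return *this;
    }

    const SpecialTokensSketch& getSpecialTokens() const { return special_tokens_; }
    bool isByteLevel() const { return byte_level_; }
    bool isCaseSensitive() const { return case_sensitive_; }
    bool shouldNormalizeUnicode() const { return normalize_unicode_; }

private:
    // Defaults match the documented private attributes.
    bool byte_level_ = false;
    bool case_sensitive_ = true;
    bool normalize_unicode_ = false;
    SpecialTokensSketch special_tokens_{};
};

// Builds a config with the same chain of calls as the documented example.
inline FluentConfigSketch makeLowercaseNormalizedConfig() {
    SpecialTokensSketch my_tokens;
    return FluentConfigSketch()
        .withSpecialTokens(my_tokens)
        .withCaseSensitive(false)
        .withNormalizeUnicode(true);
}

// Example usage: settings not named in the chain keep their defaults.
inline void exampleUsage() {
    const FluentConfigSketch config = makeLowercaseNormalizedConfig();
    assert(!config.isCaseSensitive());
    assert(config.shouldNormalizeUnicode());
    assert(!config.isByteLevel());
}
```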

Constructor & Destructor Documentation

◆ CharVocabularyConfig()

Mila::Data::CharVocabularyConfig::CharVocabularyConfig ( )
default

Member Function Documentation

◆ fromMetadata()

void Mila::Data::CharVocabularyConfig::fromMetadata ( const SerializationMetadata & meta)
inline

Populate configuration from metadata.

Missing keys are ignored, leaving defaults intact.

Parameters
meta    Metadata to read configuration from.
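
The "missing keys are ignored" behavior can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `MetadataSketch` (a plain `std::map`) stands in for `SerializationMetadata`, whose real interface is not shown here, and the key names `byte_level`, `case_sensitive`, and `normalize_unicode` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for SerializationMetadata: a flat map of boolean settings.
using MetadataSketch = std::map<std::string, bool>;

struct MetadataConfigSketch {
    bool byte_level = false;
    bool case_sensitive = true;
    bool normalize_unicode = false;

    // Overwrites a field only when its key is present; absent keys keep the current value.
    void fromMetadata(const MetadataSketch& meta) {
        assignIfPresent(meta, "byte_level", byte_level);
        assignIfPresent(meta, "case_sensitive", case_sensitive);
        assignIfPresent(meta, "normalize_unicode", normalize_unicode);
    }

    // Writes every setting, so a round trip through fromMetadata() restores the config.
    MetadataSketch toMetadata() const {
        MetadataSketch meta;
        meta["byte_level"] = byte_level;
        meta["case_sensitive"] = case_sensitive;
        meta["normalize_unicode"] = normalize_unicode;
        return meta;
    }

private:
    static void assignIfPresent(const MetadataSketch& meta, const std::string& key, bool& field) {
        const auto it = meta.find(key);
        if (it != meta.end()) {
            field = it->second;
        }
    }
};

// Example usage: only "case_sensitive" is supplied, so the other fields stay at their defaults.
inline void exampleUsage() {
    MetadataSketch partial;
    partial["case_sensitive"] = false;

    MetadataConfigSketch config;
    config.fromMetadata(partial);
    assert(!config.case_sensitive);
    assert(!config.byte_level);
    assert(!config.normalize_unicode);
}
```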

◆ getSpecialTokens()

const SpecialTokens & Mila::Data::CharVocabularyConfig::getSpecialTokens ( ) const
inline

◆ isByteLevel()

bool Mila::Data::CharVocabularyConfig::isByteLevel ( ) const
inline

◆ isCaseSensitive()

bool Mila::Data::CharVocabularyConfig::isCaseSensitive ( ) const
inline

◆ shouldNormalizeUnicode()

bool Mila::Data::CharVocabularyConfig::shouldNormalizeUnicode ( ) const
inline

◆ toMetadata()

SerializationMetadata Mila::Data::CharVocabularyConfig::toMetadata ( ) const
inline

Convert configuration to metadata for serialization.

Returns
SerializationMetadata containing all configuration parameters.

◆ toString()

std::string Mila::Data::CharVocabularyConfig::toString ( ) const
inline

Produce human-readable summary of configuration.

Suitable for logging and debugging.

Returns
std::string Configuration summary.

◆ validate()

void Mila::Data::CharVocabularyConfig::validate ( ) const
inline

Validate configuration parameters.

Exceptions
std::invalid_argument    if configuration is invalid.
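
A caller would typically wrap `validate()` in a `try`/`catch` for `std::invalid_argument`. The sketch below shows that calling pattern only. The rules it checks (special tokens must be non-empty and distinct) are assumptions made for illustration, because this page does not state what the real `validate()` rejects; `ValidatedConfigSketch` and `SpecialTokensSketch` are hypothetical stand-ins.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for Mila::Data::SpecialTokens (fields are assumed).
struct SpecialTokensSketch {
    std::string pad = "<pad>";
    std::string unk = "<unk>";
};

struct ValidatedConfigSketch {
    SpecialTokensSketch special_tokens;

    // Assumed rules, for illustration only: tokens must be non-empty and distinct.
    void validate() const {
        if (special_tokens.pad.empty() || special_tokens.unk.empty()) {
            throw std::invalid_argument("special tokens must be non-empty");
        }
        if (special_tokens.pad == special_tokens.unk) {
            throw std::invalid_argument("special tokens must be distinct");
        }
    }
};

// Returns true when validate() accepts the config and false when it throws.
inline bool isValid(const ValidatedConfigSketch& config) {
    try {
        config.validate();
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}

// Example usage: the default config passes, and an empty pad token is rejected.
inline void exampleUsage() {
    ValidatedConfigSketch config;
    assert(isValid(config));
    config.special_tokens.pad.clear();
    assert(!isValid(config));
}
```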

◆ withByteLevel()

CharVocabularyConfig & Mila::Data::CharVocabularyConfig::withByteLevel ( bool byte_level)
inline

Set whether to use byte-level encoding.

When true, operates on raw UTF-8 bytes instead of Unicode characters. This guarantees any text can be represented but increases sequence length.

Parameters
byte_level    True for byte-level, false for character-level (default).
Returns
Reference to this config for method chaining.
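
To make the sequence-length trade-off concrete, the sketch below compares the number of UTF-8 bytes in a string with the number of Unicode code points. It is independent of the library: `countCodePoints` is a hypothetical helper, and it assumes well-formed UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-8 by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
inline std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Example usage: "héllo" with é written as its two UTF-8 bytes (0xC3 0xA9).
inline void exampleUsage() {
    const std::string text = "h\xC3\xA9llo";
    assert(text.size() == 6);            // byte-level sequence length
    assert(countCodePoints(text) == 5);  // character-level sequence length
}
```

For pure ASCII text the two counts are equal; the gap grows with the share of multi-byte characters.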

◆ withCaseSensitive()

CharVocabularyConfig & Mila::Data::CharVocabularyConfig::withCaseSensitive ( bool sensitive)
inline

Set whether tokenization is case-sensitive.

When false, text is converted to lowercase before tokenization. This reduces vocabulary size but loses case information.

Parameters
sensitive    True for case-sensitive (default), false for case-insensitive.
Returns
Reference to this config for method chaining.
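
The effect of lowercasing on vocabulary size can be illustrated by counting distinct characters before and after folding. This sketch is independent of the library: `toLowerAscii` and `vocabularySize` are hypothetical helpers, and the folding is ASCII-only, which may differ from how the library lowercases non-ASCII text.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// ASCII-only lowercasing; non-ASCII bytes are left unchanged in the default "C" locale.
inline std::string toLowerAscii(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// Number of distinct single-byte characters that occur in the text.
inline std::size_t vocabularySize(const std::string& text) {
    return std::set<char>(text.begin(), text.end()).size();
}

// Example usage: "Aa Bb" has 5 distinct characters, but only 3 after lowercasing.
inline void exampleUsage() {
    const std::string text = "Aa Bb";
    assert(vocabularySize(text) == 5);
    assert(vocabularySize(toLowerAscii(text)) == 3);
}
```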

◆ withNormalizeUnicode()

CharVocabularyConfig & Mila::Data::CharVocabularyConfig::withNormalizeUnicode ( bool normalize)
inline

Set whether to normalize Unicode characters.

When true, applies Unicode normalization (e.g., NFC) so that characters with more than one possible code point sequence have a single, consistent representation.

Parameters
normalize    True to normalize Unicode, false otherwise (default).
Returns
Reference to this config for method chaining.
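
The sketch below shows why normalization matters for a character tokenizer: the same visible character "é" can arrive as one code point or as two, and the raw strings do not compare equal. It only demonstrates the problem. The C++ standard library has no Unicode normalization routine, so the normalization step itself would need an external library such as ICU, and which one Mila uses is not stated on this page. The function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE), the precomposed form, as UTF-8.
inline std::string precomposedEAcute() {
    return "\xC3\xA9";
}

// U+0065 (e) followed by U+0301 (COMBINING ACUTE ACCENT), the decomposed form, as UTF-8.
inline std::string decomposedEAcute() {
    return "e\xCC\x81";
}

// Example usage: both render as "é", yet they differ byte for byte.
// NFC would map the decomposed form to the precomposed one.
inline void exampleUsage() {
    assert(precomposedEAcute() != decomposedEAcute());
    assert(precomposedEAcute().size() == 2);
    assert(decomposedEAcute().size() == 3);
}
```

Without normalization, a character-level vocabulary built from mixed input would treat these as different token sequences.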

◆ withSpecialTokens()

CharVocabularyConfig & Mila::Data::CharVocabularyConfig::withSpecialTokens ( const SpecialTokens & tokens)
inline

Configure special tokens.

Parameters
tokens    SpecialTokens configuration.
Returns
Reference to this config for method chaining.

Member Data Documentation

◆ byte_level_

bool Mila::Data::CharVocabularyConfig::byte_level_ = false
private

◆ case_sensitive_

bool Mila::Data::CharVocabularyConfig::case_sensitive_ = true
private

◆ normalize_unicode_

bool Mila::Data::CharVocabularyConfig::normalize_unicode_ = false
private

◆ special_tokens_

SpecialTokens Mila::Data::CharVocabularyConfig::special_tokens_ {}
private

The documentation for this class was generated from the following file: