Mila 0.13.48
Deep Neural Network Library
Configuration for character-level tokenizer training.

Public Member Functions | |
| CharVocabularyConfig ()=default | |
| void | fromMetadata (const SerializationMetadata &meta) |
| Populate configuration from metadata. | |
| const SpecialTokens & | getSpecialTokens () const |
| bool | isByteLevel () const |
| bool | isCaseSensitive () const |
| bool | shouldNormalizeUnicode () const |
| SerializationMetadata | toMetadata () const |
| Convert configuration to metadata for serialization. | |
| std::string | toString () const |
| Produce human-readable summary of configuration. | |
| void | validate () const |
| Validate configuration parameters. | |
| CharVocabularyConfig & | withByteLevel (bool byte_level) |
| Set whether to use byte-level encoding. | |
| CharVocabularyConfig & | withCaseSensitive (bool sensitive) |
| Set whether tokenization is case-sensitive. | |
| CharVocabularyConfig & | withNormalizeUnicode (bool normalize) |
| Set whether to normalize Unicode characters. | |
| CharVocabularyConfig & | withSpecialTokens (const SpecialTokens &tokens) |
| Configure special tokens. | |
Private Attributes | |
| bool | byte_level_ = false |
| bool | case_sensitive_ = true |
| bool | normalize_unicode_ = false |
| SpecialTokens | special_tokens_ {} |
Configuration for character-level tokenizer training.
Character tokenizers split text into individual characters (or bytes). This is the simplest tokenization approach with minimal configuration needs.
The fluent interface allows setter calls to be chained: each with* setter returns a reference to the configuration.
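A minimal sketch of the chaining pattern, assuming only the setter and getter signatures listed above. The class below is a hypothetical stand-in written for illustration, not the Mila implementation; it omits SpecialTokens and the metadata functions, and the helper makeExampleConfig is not part of the library.

```cpp
// Hypothetical stand-in that mirrors the documented setters and getters.
// The real class lives in Mila and also handles SpecialTokens and metadata.
class CharVocabularyConfig {
public:
    CharVocabularyConfig() = default;

    // Each setter returns *this so that calls can be chained.
    CharVocabularyConfig& withByteLevel(bool byte_level) {
        byte_level_ = byte_level;
        return *this;
    }
    CharVocabularyConfig& withCaseSensitive(bool sensitive) {
        case_sensitive_ = sensitive;
        return *this;
    }
    CharVocabularyConfig& withNormalizeUnicode(bool normalize) {
        normalize_unicode_ = normalize;
        return *this;
    }

    bool isByteLevel() const { return byte_level_; }
    bool isCaseSensitive() const { return case_sensitive_; }
    bool shouldNormalizeUnicode() const { return normalize_unicode_; }

private:
    // Defaults match the documented private attributes.
    bool byte_level_ = false;
    bool case_sensitive_ = true;
    bool normalize_unicode_ = false;
};

// Illustrative helper: a case-insensitive, Unicode-normalizing,
// character-level configuration built in one chained expression.
inline CharVocabularyConfig makeExampleConfig() {
    return CharVocabularyConfig()
        .withByteLevel(false)
        .withCaseSensitive(false)
        .withNormalizeUnicode(true);
}
```

Setters that are not called leave the corresponding default untouched, so a chain only needs to name the options that differ from the defaults.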
| CharVocabularyConfig () | default |

| void fromMetadata (const SerializationMetadata &meta) | inline |
Populate configuration from metadata.
Missing keys are ignored, leaving defaults intact.
| meta | Metadata to read configuration from. |
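The "missing keys are ignored" behavior can be sketched as follows. Everything here is an assumption made for illustration: MetadataSketch, tryGetBool, ConfigSketch, and the key names are hypothetical, and the real SerializationMetadata interface is not shown on this page. The sketch requires C++17 for std::optional.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for SerializationMetadata: a flat map of boolean flags.
struct MetadataSketch {
    std::unordered_map<std::string, bool> flags;

    // Returns the stored value, or std::nullopt when the key is absent.
    std::optional<bool> tryGetBool(const std::string& key) const {
        auto it = flags.find(key);
        if (it == flags.end()) return std::nullopt;
        return it->second;
    }
};

// Hypothetical stand-in for the configuration; defaults match the documented ones.
struct ConfigSketch {
    bool byte_level = false;
    bool case_sensitive = true;
    bool normalize_unicode = false;

    // Overwrite a field only when its key is present; otherwise keep the default.
    // The key names are assumptions, not the keys Mila actually uses.
    void fromMetadata(const MetadataSketch& meta) {
        if (auto v = meta.tryGetBool("byte_level")) byte_level = *v;
        if (auto v = meta.tryGetBool("case_sensitive")) case_sensitive = *v;
        if (auto v = meta.tryGetBool("normalize_unicode")) normalize_unicode = *v;
    }
};
```

With metadata that contains only the byte-level key, the other two fields keep their default values after the call.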


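| const SpecialTokens & getSpecialTokens () const | inline |
| bool isByteLevel () const | inline |
| bool isCaseSensitive () const | inline |
| bool shouldNormalizeUnicode () const | inline |
| SerializationMetadata toMetadata () const | inline |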
Convert configuration to metadata for serialization.

| std::string toString () const | inline |
Produce human-readable summary of configuration.
Suitable for logging and debugging.
| void validate () const | inline |
Validate configuration parameters.
| std::invalid_argument | if configuration is invalid. |

| CharVocabularyConfig & withByteLevel (bool byte_level) | inline |
Set whether to use byte-level encoding.
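When true, the tokenizer operates on raw UTF-8 bytes instead of Unicode characters. This guarantees that any text can be represented, but it increases sequence length for non-ASCII text, because each non-ASCII character occupies two to four bytes.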
| byte_level | True for byte-level, false for character-level (default). |
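The sequence-length trade-off can be illustrated with a small sketch that counts tokens both ways. The helper names are hypothetical and not part of Mila, and the code assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Sequence length under byte-level encoding: one token per UTF-8 byte.
inline std::size_t byteLevelLength(const std::string& utf8) {
    return utf8.size();
}

// Sequence length under character-level encoding: one token per code point.
// Assumes valid UTF-8, where continuation bytes have the bit pattern 10xxxxxx.
inline std::size_t characterLevelLength(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;  // count only leading bytes
    }
    return count;
}
```

For the string "h\xC3\xA9llo" (the word "héllo" with é encoded as the two bytes C3 A9), the byte-level length is 6 and the character-level length is 5. For pure ASCII text the two lengths are equal.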

| CharVocabularyConfig & withCaseSensitive (bool sensitive) | inline |
Set whether tokenization is case-sensitive.
When false, text is converted to lowercase before tokenization. This reduces vocabulary size but loses case information.
| sensitive | True for case-sensitive (default), false for case-insensitive. |
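The effect on vocabulary size can be sketched by counting distinct characters with and without lowercasing. This is an ASCII-only illustration with a hypothetical helper name; how Mila lowercases non-ASCII text is not described on this page.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Counts distinct characters, lowercasing first when case_sensitive is false.
// ASCII-only sketch: std::tolower is applied per byte and is not Unicode-aware.
inline std::size_t vocabularySize(const std::string& text, bool case_sensitive) {
    std::set<char> vocabulary;
    for (char c : text) {
        if (!case_sensitive) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        vocabulary.insert(c);
    }
    return vocabulary.size();
}
```

For the text "Aa Bb", the case-sensitive vocabulary has 5 entries (A, a, space, B, b), while the case-insensitive one has 3 (a, space, b).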

| CharVocabularyConfig & withNormalizeUnicode (bool normalize) | inline |
Set whether to normalize Unicode characters.
When true, applies Unicode normalization (e.g., NFC) so that characters with more than one code point sequence, such as precomposed and combining forms of an accented letter, are represented consistently.
| normalize | True to normalize Unicode, false otherwise (default). |
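The following sketch shows why normalization matters; it does not perform normalization itself, which in practice needs a Unicode library. The helper names are hypothetical. The same visible character "é" is spelled two ways, and NFC maps the decomposed spelling to the precomposed one.

```cpp
#include <string>

// "é" as one precomposed code point, U+00E9 (UTF-8 bytes C3 A9).
inline std::string composedEAcute() { return "\xC3\xA9"; }

// "é" as "e" followed by the combining acute accent U+0301 (UTF-8 bytes 65 CC 81).
inline std::string decomposedEAcute() { return "e\xCC\x81"; }

// Byte-for-byte comparison, which is what a tokenizer sees without normalization.
inline bool sameBytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

Without normalization the two spellings compare unequal (2 bytes versus 3 bytes), so a character-level vocabulary would treat them as different token sequences.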

| CharVocabularyConfig & withSpecialTokens (const SpecialTokens &tokens) | inline |
Configure special tokens.
| tokens | SpecialTokens configuration. |

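| bool byte_level_ = false | private |
| bool case_sensitive_ = true | private |
| bool normalize_unicode_ = false | private |
| SpecialTokens special_tokens_ {} | private |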