Mila 0.13.48
Deep Neural Network Library
Character vocabulary for tokenization.


Public Member Functions | |
| CharVocabulary ()=delete | |
| TokenId | charToIndex (char c) const |
| const CharVocabularyConfig & | getConfig () const |
| Get the configuration used to create this vocabulary. | |
| size_t | getSize () const override |
| Get the number of tokens in the vocabulary. | |
| bool | hasSpecialTokens () const |
| std::optional< std::string > | idToToken (TokenId id) const override |
| Map a numeric id back to its token string. | |
| char | indexToChar (TokenId idx) const |
| TokenId | padTokenId () const |
| void | save (const fs::path &path) const override |
| Serialize vocabulary to disk with configuration. | |
| std::optional< TokenId > | tokenToId (const std::string &token) const override |
| Map a token string to its numeric id. | |
| TokenId | unkTokenId () const |
| Public Member Functions inherited from Mila::Data::TokenizerVocabulary | |
| virtual | ~TokenizerVocabulary ()=default |
| Virtual destructor. | |
| virtual void | save (const std::filesystem::path &path) const =0 |
| Serialize the vocabulary to disk at the given path. | |
Static Public Member Functions | |
| static CharVocabulary | load (const fs::path &path) |
| Load vocabulary from Mila binary format. | |
| static CharVocabulary | train (const std::string &corpus, const CharVocabularyConfig &config) |
| Build a character vocabulary from text corpus. | |
| static CharVocabulary | trainFromFile (const fs::path &corpus_path, const CharVocabularyConfig &config) |
| Build a character vocabulary from corpus file. | |
Private Member Functions | |
| CharVocabulary (const CharVocabularyConfig &config) | |
| void | addRegularTokens (const std::vector< unsigned char > &sorted_bytes) |
| void | addSpecialTokensFromConfig () |
| void | buildFromText (const std::string &corpus) |
| std::unordered_map< unsigned char, bool > | extractUniqueBytes (const std::string &text) const |
| void | loadContent (std::istream &file) |
| std::string | normalizeText (const std::string &text) const |
| void | saveContent (std::ostream &file) const |
| std::vector< unsigned char > | sortBytes (const std::unordered_map< unsigned char, bool > &unique_bytes) const |
Private Attributes | |
| std::unordered_map< char, TokenId > | char_to_idx_ |
| CharVocabularyConfig | config_ |
| std::vector< char > | idx_to_char_ |
| TokenId | pad_token_id_ |
| TokenId | unk_token_id_ |
Character vocabulary for tokenization.
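Immutable vocabulary created only through the static factory methods train(), trainFromFile(), and load(); the default constructor is deleted and the config constructor is private. The CharVocabularyConfig used at creation is stored with the vocabulary, so it can be retrieved with getConfig() and is written out by save() for provenance tracking.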
Thread safety: Immutable after construction, safe for concurrent reads.
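The pattern described above (deleted default constructor, private constructor, static factory, read-only lookups) can be sketched with a self-contained toy class. Everything in it is an assumption made for illustration: `ToyCharVocab`, `ToyTokenId`, the `<pad>` and `<unk>` strings, and the id layout are hypothetical and are not taken from Mila's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for Mila's TokenId; the real type is not shown on this page.
using ToyTokenId = std::uint32_t;

// Toy immutable character vocabulary. Ids 0 and 1 are reserved for the
// assumed special tokens, then one id per distinct byte in ascending order.
class ToyCharVocab {
public:
    ToyCharVocab() = delete;

    // Static factory: the only way to obtain an instance.
    static ToyCharVocab train(const std::string& corpus) {
        return ToyCharVocab(corpus);
    }

    std::size_t getSize() const { return id_to_token_.size(); }
    ToyTokenId padTokenId() const { return 0; }
    ToyTokenId unkTokenId() const { return 1; }

    // Empty optional means "no mapping".
    std::optional<ToyTokenId> tokenToId(const std::string& token) const {
        auto it = token_to_id_.find(token);
        if (it == token_to_id_.end()) return std::nullopt;
        return it->second;
    }

    // Empty optional means the id is out of range.
    std::optional<std::string> idToToken(ToyTokenId id) const {
        if (id >= id_to_token_.size()) return std::nullopt;
        return id_to_token_[id];
    }

private:
    explicit ToyCharVocab(const std::string& corpus) {
        add("<pad>");
        add("<unk>");
        // std::set both deduplicates and sorts the bytes of the corpus.
        std::set<unsigned char> bytes(corpus.begin(), corpus.end());
        for (unsigned char b : bytes) {
            add(std::string(1, static_cast<char>(b)));
        }
        assert(token_to_id_.size() == id_to_token_.size());
    }

    void add(const std::string& token) {
        token_to_id_.emplace(token, static_cast<ToyTokenId>(id_to_token_.size()));
        id_to_token_.push_back(token);
    }

    std::unordered_map<std::string, ToyTokenId> token_to_id_;
    std::vector<std::string> id_to_token_;
};
```

Because no member mutates state after the private constructor returns, all public members are `const`, which is what makes concurrent reads safe in a design like this.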
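| CharVocabulary () | delete |

| CharVocabulary (const CharVocabularyConfig &config) | inline, explicit, private |

| void addRegularTokens (const std::vector< unsigned char > &sorted_bytes) | inline, private |

| void addSpecialTokensFromConfig () | inline, private |

| void buildFromText (const std::string &corpus) | inline, private |

| TokenId charToIndex (char c) const | inline |

| std::unordered_map< unsigned char, bool > extractUniqueBytes (const std::string &text) const | inline, private |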

| const CharVocabularyConfig & getConfig () const | inline |
Get the configuration used to create this vocabulary.

| size_t getSize () const | inline, override, virtual |
Get the number of tokens in the vocabulary.
Implements Mila::Data::TokenizerVocabulary.

| bool hasSpecialTokens () const | inline |

| std::optional< std::string > idToToken (TokenId id) const | inline, override, virtual |
Map a numeric id back to its token string.
Returns an empty optional if the id is out of range or not defined.
Parameters
| id | Token id to convert. |
Implements Mila::Data::TokenizerVocabulary.

| char indexToChar (TokenId idx) const | inline |

| static CharVocabulary load (const fs::path &path) | inline, static |
Load vocabulary from Mila binary format.
Reads vocabulary and configuration from file written by save().
Parameters
| path | Input file path. |
Exceptions
| std::runtime_error | on I/O errors or format incompatibility. |
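Since load() reports failure by throwing std::runtime_error, a caller that prefers a value-based result can wrap it. The sketch below is one possible wrapper, written as a template so it compiles without the Mila headers; `tryLoad` and `StubVocab` are hypothetical names introduced here, and the only thing assumed about the real class is the documented `static load(const fs::path&)` signature.

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Call Vocab::load(path) and convert a std::runtime_error into an empty
// optional. If `error` is non-null it receives the exception message.
template <class Vocab>
std::optional<Vocab> tryLoad(const fs::path& path, std::string* error = nullptr) {
    assert(!path.empty() && "path must not be empty");
    try {
        return Vocab::load(path);
    } catch (const std::runtime_error& e) {
        if (error != nullptr) *error = e.what();
        return std::nullopt;
    }
}

// Hypothetical stub that mimics only the documented failure mode of load();
// it is not part of Mila and reads nothing from the file.
struct StubVocab {
    static StubVocab load(const fs::path& path) {
        if (!fs::exists(path)) {
            throw std::runtime_error("stub: file not found");
        }
        return StubVocab{};
    }
};
```

With the real class, the call would presumably read `tryLoad<CharVocabulary>(path, &message)`; other exception types are deliberately left to propagate.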


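| void loadContent (std::istream &file) | inline, private |

| std::string normalizeText (const std::string &text) const | inline, private |

| TokenId padTokenId () const | inline |

| void save (const fs::path &path) const | inline, override |
Serialize vocabulary to disk with configuration.
File format includes MilaFileHeader with config metadata followed by binary vocabulary content.
Parameters
| path | Output file path. Parent directory must exist. |
Exceptions
| std::runtime_error | on I/O errors. |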


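| void saveContent (std::ostream &file) const | inline, private |

| std::vector< unsigned char > sortBytes (const std::unordered_map< unsigned char, bool > &unique_bytes) const | inline, private |

| std::optional< TokenId > tokenToId (const std::string &token) const | inline, override, virtual |
Map a token string to its numeric id.
The lookup returns an empty optional when the token is not present in the vocabulary. Implementations may provide an explicit unknown-token id; callers should interpret an empty optional as "no mapping".
Parameters
| token | UTF-8 encoded token string to look up. |
Implements Mila::Data::TokenizerVocabulary.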
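Because tokenToId() signals a missing token with an empty optional instead of substituting an id itself, the fallback is the caller's job. A minimal sketch of that caller-side handling follows, assuming the vocabulary was configured with an unknown token so that unkTokenId() is meaningful. `encodeChars` and `StubVocab` are hypothetical names; the template relies only on the two member signatures listed on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Encode text one byte at a time, substituting the unknown-token id whenever
// tokenToId() returns an empty optional. Vocab is any type exposing
// tokenToId(const std::string&) and unkTokenId().
template <class Vocab>
auto encodeChars(const Vocab& vocab, const std::string& text) {
    using Id = decltype(vocab.unkTokenId());
    std::vector<Id> ids;
    ids.reserve(text.size());
    for (char c : text) {
        ids.push_back(vocab.tokenToId(std::string(1, c)).value_or(vocab.unkTokenId()));
    }
    assert(ids.size() == text.size());
    return ids;
}

// Hypothetical stub used only to exercise encodeChars; not part of Mila.
struct StubVocab {
    std::optional<std::uint32_t> tokenToId(const std::string& token) const {
        if (token == "a") return 2u;
        if (token == "b") return 3u;
        return std::nullopt;  // "no mapping"
    }
    std::uint32_t unkTokenId() const { return 1u; }
};
```

Keeping the fallback outside the lookup lets other callers distinguish "absent" from "mapped to the unknown token", for example when validating input.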
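| static CharVocabulary train (const std::string &corpus, const CharVocabularyConfig &config) | inline, static |
Build a character vocabulary from text corpus.
Parameters
| corpus | Training text corpus. |
| config | Vocabulary configuration. |
Exceptions
| std::invalid_argument | if config is invalid. |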


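| static CharVocabulary trainFromFile (const fs::path &corpus_path, const CharVocabularyConfig &config) | inline, static |
Build a character vocabulary from corpus file.
Parameters
| corpus_path | Path to training corpus text file. |
| config | Vocabulary configuration. |
Exceptions
| std::runtime_error | if file cannot be opened. |
| std::invalid_argument | if config is invalid. |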
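The file-based factory differs from train() only in where the corpus text comes from. The sketch below shows one way the file-reading step with the documented std::runtime_error failure mode could look. `readCorpusFile` is a hypothetical helper, not a Mila function, and whether the library reads the file in one piece like this is an assumption.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Read a whole corpus file into memory. Throws std::runtime_error if the file
// cannot be opened, mirroring the failure mode documented for trainFromFile().
inline std::string readCorpusFile(const fs::path& corpus_path) {
    assert(!corpus_path.empty() && "corpus_path must not be empty");
    // Binary mode keeps bytes unchanged, which matters for a byte-level vocabulary.
    std::ifstream in(corpus_path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open corpus file: " + corpus_path.string());
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

The returned string could then be handed to a string-based factory such as train(), together with the same configuration object.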

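| TokenId unkTokenId () const | inline |

| std::unordered_map< char, TokenId > char_to_idx_ | private |
| CharVocabularyConfig config_ | private |
| std::vector< char > idx_to_char_ | private |
| TokenId pad_token_id_ | private |
| TokenId unk_token_id_ | private |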