Mila 0.13.48
Deep Neural Network Library
Factory for creating tokenizer trainers and loading vocabularies.
Static Public Member Functions
static std::unique_ptr< Tokenizer > loadTokenizer (TokenizerType type, const std::filesystem::path &vocabPath)
    Load a tokenizer from a saved vocabulary file.
template<typename Config>
static void trainVocabulary (TokenizerType type, const Config &config, std::span< const std::filesystem::path > corpusFiles, const std::filesystem::path &outputPath)
    Train a vocabulary from corpus files.
This class centralizes construction and loading logic so that CLI tools and applications can remain tokenizer-agnostic. It creates trainers and loads vocabularies according to the TokenizerType discriminator passed by the caller.
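static std::unique_ptr< Tokenizer > loadTokenizer (TokenizerType type, const std::filesystem::path &vocabPath) [inline, static]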
Load a tokenizer from a saved vocabulary file.
This loads the vocabulary and creates a tokenizer in one operation. The returned tokenizer owns its vocabulary.
Parameters:
    type       TokenizerType discriminator (Char, Bpe, etc.).
    vocabPath  Path to the saved vocabulary file.
Example:
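A minimal sketch of how a factory with this signature can dispatch on the discriminator, written so that it compiles on its own. Only the loadTokenizer signature and the Char and Bpe enumerators come from this page. The class name TokenizerFactory, the CharTokenizer type, the vocabSize() accessor, and the one-token-per-line file format are assumptions made for illustration and may not match Mila.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stand-in declarations, NOT the Mila headers.
enum class TokenizerType { Char, Bpe };

struct Tokenizer {
    virtual ~Tokenizer() = default;
    virtual std::size_t vocabSize() const = 0;  // hypothetical accessor
};

// Hypothetical character-level tokenizer that owns its vocabulary.
class CharTokenizer : public Tokenizer {
public:
    explicit CharTokenizer(std::vector<std::string> vocab) : vocab_(std::move(vocab)) {}
    std::size_t vocabSize() const override { return vocab_.size(); }

private:
    std::vector<std::string> vocab_;
};

// Hypothetical class name; the real class name is not shown on this page.
struct TokenizerFactory {
    static std::unique_ptr<Tokenizer> loadTokenizer(TokenizerType type,
                                                    const std::filesystem::path& vocabPath) {
        std::ifstream in(vocabPath);
        if (!in) {
            throw std::runtime_error("cannot open vocabulary: " + vocabPath.string());
        }

        // Assumed file format: one token per line.
        std::vector<std::string> vocab;
        for (std::string line; std::getline(in, line);) {
            vocab.push_back(line);
        }

        // Dispatch on the discriminator; this sketch only covers Char.
        if (type == TokenizerType::Char) {
            return std::make_unique<CharTokenizer>(std::move(vocab));
        }
        throw std::invalid_argument("tokenizer type not covered by this sketch");
    }
};
```

Under these assumptions a call site needs only the discriminator and a path, for example `auto tok = TokenizerFactory::loadTokenizer(TokenizerType::Char, vocabPath);`, and the returned std::unique_ptr keeps the tokenizer and its vocabulary alive together.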

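template<typename Config>
static void trainVocabulary (TokenizerType type, const Config &config, std::span< const std::filesystem::path > corpusFiles, const std::filesystem::path &outputPath) [inline, static]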
Train a vocabulary from corpus files.
Creates a trainer with the provided config, adds corpus files, trains the vocabulary, and saves it to disk.
Parameters:
    type         TokenizerType discriminator (Char, Bpe, etc.).
    corpusFiles  List of corpus file paths to train on.
    outputPath   Where to save the trained vocabulary.
Example:
