Mila 0.13.48
Deep Neural Network Library
Mila::Data::TrainerFactory Class Reference [export]

Factory for creating tokenizer trainers and loading vocabularies.

Static Public Member Functions

static std::unique_ptr< Tokenizer > loadTokenizer (TokenizerType type, const std::filesystem::path &vocabPath)
 Load a tokenizer from a saved vocabulary file.
template<typename Config>
static void trainVocabulary (TokenizerType type, const Config &config, std::span< const std::filesystem::path > corpusFiles, const std::filesystem::path &outputPath)
 Train a vocabulary from corpus files.

Detailed Description

Factory for creating tokenizer trainers and loading vocabularies.

This class centralizes construction and loading logic so that CLI tools and applications can remain tokenizer-agnostic. The factory creates trainers and loads vocabularies based on the TokenizerType discriminator passed by the caller.
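
The dispatch described above can be sketched in isolation. The following is a minimal illustration of a factory that switches on a type discriminator and returns an owning pointer to an abstract interface. `SketchType`, `SketchTokenizer`, `SketchCharTokenizer`, `SketchByteTokenizer`, and `makeSketchTokenizer` are hypothetical stand-ins invented for this sketch; they are not Mila types, and the actual TrainerFactory implementation may be organized differently.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Mila types.
enum class SketchType { Char, Byte };

// Abstract interface that callers program against.
class SketchTokenizer {
public:
    virtual ~SketchTokenizer() = default;
    virtual std::vector<int> encode(std::string_view text) const = 0;
    virtual std::string name() const = 0;
};

// Maps each character to its index in an owned alphabet (-1 if absent).
class SketchCharTokenizer final : public SketchTokenizer {
public:
    explicit SketchCharTokenizer(std::string alphabet)
        : alphabet_(std::move(alphabet)) {}

    std::vector<int> encode(std::string_view text) const override {
        std::vector<int> ids;
        ids.reserve(text.size());
        for (char c : text) {
            const auto pos = alphabet_.find(c);
            ids.push_back(pos == std::string::npos ? -1 : static_cast<int>(pos));
        }
        return ids;
    }
    std::string name() const override { return "char"; }

private:
    std::string alphabet_;  // the tokenizer owns its vocabulary
};

// Maps each byte to its unsigned value.
class SketchByteTokenizer final : public SketchTokenizer {
public:
    std::vector<int> encode(std::string_view text) const override {
        std::vector<int> ids;
        ids.reserve(text.size());
        for (unsigned char c : text) ids.push_back(static_cast<int>(c));
        return ids;
    }
    std::string name() const override { return "byte"; }
};

// Factory: the discriminator selects the concrete type; the caller only
// ever sees the abstract interface.
inline std::unique_ptr<SketchTokenizer> makeSketchTokenizer(SketchType type) {
    switch (type) {
        case SketchType::Char:
            return std::make_unique<SketchCharTokenizer>("abc");
        case SketchType::Byte:
            return std::make_unique<SketchByteTokenizer>();
    }
    throw std::invalid_argument("unknown tokenizer type");
}
```

A caller holds only a `std::unique_ptr<SketchTokenizer>` and calls `encode()` on it, so supporting a new tokenizer kind means adding one enumerator and one `case` rather than touching every call site.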

Member Function Documentation

◆ loadTokenizer()

std::unique_ptr< Tokenizer > Mila::Data::TrainerFactory::loadTokenizer ( TokenizerType type,
const std::filesystem::path & vocabPath )
inline static

Load a tokenizer from a saved vocabulary file.

This loads the vocabulary and creates a tokenizer in one operation. The returned tokenizer owns its vocabulary.

Parameters
type       TokenizerType discriminator (Char, Bpe, etc.).
vocabPath  Path to the saved vocabulary file.
Returns
A std::unique_ptr<Tokenizer> holding a ready-to-use tokenizer.

Example:

auto tokenizer = TrainerFactory::loadTokenizer(
    TokenizerType::Bpe,
    "vocab.bin"
);
auto tokens = tokenizer->encode("Hello, world!");
Definition TrainerFactory.ixx:136

◆ trainVocabulary()

template<typename Config>
void Mila::Data::TrainerFactory::trainVocabulary ( TokenizerType type,
const Config & config,
std::span< const std::filesystem::path > corpusFiles,
const std::filesystem::path & outputPath )
inline static

Train a vocabulary from corpus files.

Creates a trainer with the provided config, adds corpus files, trains the vocabulary, and saves it to disk.

Parameters
type         TokenizerType discriminator (Char, Bpe, etc.).
config       Configuration used to create the trainer.
corpusFiles  List of corpus file paths to train on.
outputPath   Where to save the trained vocabulary.

Example:

TrainerFactory::trainVocabulary(
    type,
    config,
    {"corpus1.txt", "corpus2.txt"},
    "vocab.bin"
);
Definition TrainerFactory.ixx:70

The documentation for this class was generated from the following file: