Mila 0.13.48
Deep Neural Network Library
Mila::Data::TokenizerTrainer Class Reference (abstract, export)

Abstract interface for training tokenizer vocabularies from text corpora.

Public Member Functions

virtual ~TokenizerTrainer ()=default
 Virtual destructor.
void addCorpus (std::string_view text)
 Add training corpus data from a string.
void addCorpusFromFile (std::string_view filepath)
 Add training corpus data from a file.
virtual void addCorpusFromStream (std::istream &stream)=0
 Add training corpus data from an input stream.
virtual std::shared_ptr< TokenizerVocabulary > train ()=0
 Train the tokenizer and return the resulting vocabulary.

Detailed Description

Abstract interface for training tokenizer vocabularies from text corpora.

TokenizerTrainer provides a generic contract for building tokenizer vocabularies by processing text data from various sources (streams, files, or strings) and producing a trained vocabulary that can be used immediately or serialized to disk for later use in tokenization.

Typical workflow:

  1. Create a concrete trainer instance (e.g., BpeTokenizerTrainer, CharTokenizerTrainer)
  2. Add corpus data via addCorpusFromStream(), addCorpusFromFile(), or addCorpus()
  3. Call train() to build the vocabulary, then save it to disk with the vocabulary's save() method
  4. Load the saved vocabulary later using the appropriate TokenizerVocabulary implementation

Thread safety:

  • Implementations are NOT required to be thread-safe
  • Callers must ensure external synchronization if corpus addition or training occurs concurrently

Design rationale:

  • Training is typically an offline, one-time process; vocabularies are saved to disk and loaded for inference rather than kept in memory
  • Stream-based corpus addition enables memory-efficient processing of large datasets
  • Abstract interface allows for different tokenization algorithms (BPE, character-level, WordPiece, etc.) with a consistent API
See also
TokenizerVocabulary

Constructor & Destructor Documentation

◆ ~TokenizerTrainer()

virtual Mila::Data::TokenizerTrainer::~TokenizerTrainer ( )
virtual, default

Virtual destructor.

Ensures derived trainer implementations are properly destroyed when deleted via base pointers.
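
The effect of the virtual destructor can be illustrated with a small, self-contained sketch. TrainerBase and TrackingTrainer are hypothetical stand-ins, not Mila types; the point is only that destroying through a base pointer runs the derived destructor.

```cpp
#include <memory>

// Hypothetical base with a virtual destructor, mirroring the interface's design.
struct TrainerBase {
    virtual ~TrainerBase() = default;
};

// Hypothetical derived trainer that records when its destructor runs.
struct TrackingTrainer : TrainerBase {
    explicit TrackingTrainer(bool& destroyed) : destroyed_(destroyed) {}
    ~TrackingTrainer() override { destroyed_ = true; }
    bool& destroyed_;
};

// Returns true if the derived destructor ran when the object was
// destroyed through a pointer to the base class.
bool derivedDestructorRuns() {
    bool destroyed = false;
    {
        std::unique_ptr<TrainerBase> trainer =
            std::make_unique<TrackingTrainer>(destroyed);
    }  // trainer goes out of scope here and is deleted via TrainerBase*
    return destroyed;
}
```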

Member Function Documentation

◆ addCorpus()

void Mila::Data::TokenizerTrainer::addCorpus ( std::string_view text)
inline

Add training corpus data from a string.

Convenience method for adding small text samples, useful for testing or supplementing file-based corpora. For large corpora, prefer addCorpusFromStream() or addCorpusFromFile() for better memory efficiency.

The input text is expected to be UTF-8 encoded.

Multiple calls to corpus addition methods are cumulative; all provided text contributes to the final trained vocabulary.

Behavior on error:

  • Propagates any exception thrown by addCorpusFromStream()
Parameters
text: UTF-8 encoded text to add to the training corpus.
Exceptions
Exceptions as documented by addCorpusFromStream()
Note
This method creates a temporary std::istringstream; for large strings (>1MB), consider using addCorpusFromStream() with a pre-constructed stream for better performance.
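
The delegation described in the note might look roughly like the sketch below. TrainerSketch and AccumulatingTrainer are hypothetical stand-ins, and the body of addCorpus() is an assumption based on the note, not the library's actual source.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical stand-in for the abstract interface.
class TrainerSketch {
public:
    virtual ~TrainerSketch() = default;
    virtual void addCorpusFromStream(std::istream& stream) = 0;

    // Assumed shape of the convenience method: the text is copied into a
    // temporary std::istringstream, which is why large strings cost extra memory.
    void addCorpus(std::string_view text) {
        std::istringstream stream{std::string(text)};
        addCorpusFromStream(stream);
    }
};

// Hypothetical concrete trainer that simply accumulates everything it reads.
class AccumulatingTrainer : public TrainerSketch {
public:
    void addCorpusFromStream(std::istream& stream) override {
        corpus_.append(std::istreambuf_iterator<char>(stream),
                       std::istreambuf_iterator<char>());
    }
    const std::string& corpus() const { return corpus_; }

private:
    std::string corpus_;
};
```

Calling addCorpus() twice on the same AccumulatingTrainer appends both inputs, which mirrors the cumulative behavior documented for all corpus addition methods.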

◆ addCorpusFromFile()

void Mila::Data::TokenizerTrainer::addCorpusFromFile ( std::string_view filepath)
inline

Add training corpus data from a file.

Convenience method that opens the specified file and delegates to addCorpusFromStream(). The file is opened in binary mode to preserve exact byte sequences for UTF-8 text.

Multiple calls to corpus addition methods are cumulative; all provided text contributes to the final trained vocabulary.

Behavior on error:

  • Throws std::runtime_error if the file cannot be opened
  • May throw other exceptions from addCorpusFromStream() during processing
Parameters
filepath: Path to a UTF-8 encoded text file.
Exceptions
std::runtime_error: if the file cannot be opened for reading
Other exceptions as documented by addCorpusFromStream()
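
A minimal sketch of the behavior described for this method, assuming a straightforward std::ifstream implementation. FileTrainerSketch and ByteCountingTrainer are hypothetical stand-ins, and the error message text is illustrative only.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for the abstract interface.
class FileTrainerSketch {
public:
    virtual ~FileTrainerSketch() = default;
    virtual void addCorpusFromStream(std::istream& stream) = 0;

    // Assumed shape of the convenience method: open in binary mode so bytes
    // (including "\r\n" sequences) reach the trainer unmodified.
    void addCorpusFromFile(std::string_view filepath) {
        const std::string path(filepath);
        std::ifstream file(path, std::ios::binary);
        if (!file.is_open()) {
            throw std::runtime_error("Cannot open corpus file: " + path);
        }
        addCorpusFromStream(file);
    }
};

// Hypothetical concrete trainer that counts the bytes it is given.
class ByteCountingTrainer : public FileTrainerSketch {
public:
    void addCorpusFromStream(std::istream& stream) override {
        bytes_ += static_cast<std::size_t>(std::distance(
            std::istreambuf_iterator<char>(stream),
            std::istreambuf_iterator<char>()));
    }
    std::size_t bytes() const { return bytes_; }

private:
    std::size_t bytes_ = 0;
};
```

Because the file is opened in binary mode, a file containing Windows line endings is counted byte for byte, with no newline translation.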

◆ addCorpusFromStream()

virtual void Mila::Data::TokenizerTrainer::addCorpusFromStream ( std::istream & stream)
pure virtual

Add training corpus data from an input stream.

This is the primary corpus addition method that all implementations must provide. The stream is read incrementally to support memory-efficient processing of large corpora. Text is expected to be UTF-8 encoded unless documented otherwise by the concrete implementation.

Multiple calls to corpus addition methods are cumulative; all provided text contributes to the final trained vocabulary.

Implementation requirements:

  • Must process the stream incrementally without requiring the entire corpus in memory
  • Should handle stream read failures gracefully (may throw exceptions)
  • Must support binary and text mode streams
  • Should document any encoding assumptions beyond UTF-8

Behavior on error:

  • Implementations may throw exceptions (e.g., std::runtime_error, std::ios_base::failure) on stream read failures or encoding errors
  • After an exception, the trainer state is implementation-defined; callers should typically discard the trainer instance
Parameters
stream: Input stream containing UTF-8 encoded text corpus. The stream is not closed by this method.
Exceptions
std::runtime_error or derived exceptions on stream read errors (implementation-defined)
Note
The stream position after this call is implementation-defined but typically will be at end-of-stream if the entire stream was consumed.

◆ train()

virtual std::shared_ptr< TokenizerVocabulary > Mila::Data::TokenizerTrainer::train ( )
pure virtual

Train the tokenizer and return the resulting vocabulary.

Analyzes all previously added corpus data to build a vocabulary according to the concrete implementation's algorithm (e.g., BPE merges, character collection). The returned vocabulary is ready for immediate use in tokenization.

The vocabulary can be saved to disk using its save() method for later reuse, and can be shared among multiple Tokenizer instances.

Returns
std::shared_ptr<TokenizerVocabulary> The trained vocabulary. Ownership is shared; the vocabulary can be used by multiple tokenizers or saved for later use.
Exceptions
std::runtime_error: if training fails
std::invalid_argument: if insufficient corpus data was provided

Example usage:

BpeTokenizerTrainer trainer(vocab_size);
trainer.addCorpusFromFile("corpus.txt");
auto vocab = trainer.train();
// Use immediately
BpeTokenizer tokenizer(vocab);
auto tokens = tokenizer.encode("hello world");
// Save for later
vocab->save("vocab.bin");

The documentation for this class was generated from the following file: