Abstract interface for training tokenizer vocabularies from text corpora.

Public Member Functions
virtual	~TokenizerTrainer ()=default
	Virtual destructor.
void	addCorpus (std::string_view text)
	Add training corpus data from a string.
void	addCorpusFromFile (std::string_view filepath)
	Add training corpus data from a file.
virtual void	addCorpusFromStream (std::istream &stream)=0
	Add training corpus data from an input stream.
virtual std::shared_ptr< TokenizerVocabulary >	train ()=0
	Train the tokenizer and return the resulting vocabulary.

Detailed Description

Abstract interface for training tokenizer vocabularies from text corpora.

TokenizerTrainer provides a generic contract for building tokenizer vocabularies by processing text data from various sources (streams, files, or strings) and producing a TokenizerVocabulary that can be used for tokenization immediately or saved to disk for later use.

Typical workflow:

Create a concrete trainer instance (e.g., BpeTokenizerTrainer, CharTokenizerTrainer)
Add corpus data via addCorpusFromStream(), addCorpusFromFile(), or addCorpus()
Call train() to build the vocabulary, which is returned as a std::shared_ptr<TokenizerVocabulary>
Use the vocabulary immediately, or call its save() method and load it later using the appropriate TokenizerVocabulary implementation

Thread safety:

Implementations are NOT required to be thread-safe
Callers must ensure external synchronization if corpus addition or training occurs concurrently
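
The external synchronization described above could be sketched as a caller-side guard that routes every corpus addition through a single mutex. UnsyncedTrainer and LockedCorpusFeeder are hypothetical names introduced for illustration only; they are not part of Mila, and the byte-counting body merely stands in for a real implementation.

```cpp
#include <cstddef>
#include <istream>
#include <mutex>
#include <sstream>
#include <string>

// Hypothetical trainer with no internal locking (stands in for any
// TokenizerTrainer implementation; not the Mila source).
class UnsyncedTrainer {
public:
    void addCorpusFromStream(std::istream& stream) {
        char c;
        while (stream.get(c)) ++bytes_;
    }
    std::size_t bytes() const { return bytes_; }

private:
    std::size_t bytes_ = 0;
};

// Caller-side guard: every corpus addition goes through one mutex.
class LockedCorpusFeeder {
public:
    explicit LockedCorpusFeeder(UnsyncedTrainer& trainer) : trainer_(trainer) {}

    void add(const std::string& text) {
        std::istringstream stream(text);           // per-call stream, built before locking
        std::lock_guard<std::mutex> lock(mutex_);  // serialize access to the trainer
        trainer_.addCorpusFromStream(stream);
    }

private:
    UnsyncedTrainer& trainer_;
    std::mutex mutex_;
};
```

With a guard of this kind, add() may be called from several worker threads. The sketch does not protect train(), so the assumption is that training starts only after all feeding threads have finished.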

Design rationale:

Training is typically an offline, one-time process; the resulting vocabulary is usually saved to disk and reloaded for inference rather than retrained on each run
Stream-based corpus addition enables memory-efficient processing of large datasets
Abstract interface allows for different tokenization algorithms (BPE, character-level, WordPiece, etc.) with a consistent API

See also: TokenizerVocabulary

Constructor & Destructor Documentation

◆ ~TokenizerTrainer()

virtual Mila::Data::TokenizerTrainer::~TokenizerTrainer ( )

virtual default

Virtual destructor.

Ensures derived trainer implementations are properly destroyed when deleted via base pointers.
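
As a minimal illustration of why the destructor is virtual, the following sketch deletes a derived object through a base pointer and reports whether the derived destructor ran. TrainerBase and DerivedTrainer are hypothetical stand-ins, not Mila types.

```cpp
#include <memory>

// Hypothetical base with a virtual destructor, mirroring the declaration above.
struct TrainerBase {
    virtual ~TrainerBase() = default;
};

// Hypothetical derived trainer that records its own destruction.
struct DerivedTrainer : TrainerBase {
    explicit DerivedTrainer(bool& flag) : destroyed(flag) {}
    ~DerivedTrainer() override { destroyed = true; }
    bool& destroyed;
};

// Returns true if deleting through the base pointer ran ~DerivedTrainer().
bool derivedDestructorRuns() {
    bool destroyed = false;
    {
        std::unique_ptr<TrainerBase> trainer = std::make_unique<DerivedTrainer>(destroyed);
    }  // unique_ptr deletes via TrainerBase*; the virtual destructor dispatches to the derived one
    return destroyed;
}
```

Without the virtual destructor, deleting a derived object through a base pointer would be undefined behavior.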

Member Function Documentation

◆ addCorpus()

void Mila::Data::TokenizerTrainer::addCorpus ( std::string_view text )

inline

Add training corpus data from a string.

Convenience method for adding small text samples, useful for testing or supplementing file-based corpora. For large corpora, prefer addCorpusFromStream() or addCorpusFromFile() for better memory efficiency.

The input text is expected to be UTF-8 encoded.

Multiple calls to corpus addition methods are cumulative; all provided text contributes to the final trained vocabulary.

Behavior on error:

May throw exceptions from addCorpusFromStream() during processing

Parameters

text	UTF-8 encoded text to add to the training corpus.

Exceptions

Exceptions as documented by addCorpusFromStream()

Note: This method creates a temporary std::istringstream; for large strings (>1MB), consider using addCorpusFromStream() with a pre-constructed stream for better performance.
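
The delegation described in the note can be sketched as follows. This is an assumption about the shape of the inline method based on the description above, not the actual Mila source; SketchTrainer and ByteCountingTrainer are hypothetical names, and the byte counter exists only to make the cumulative behavior observable.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical stand-in for the trainer interface; not the Mila source.
class SketchTrainer {
public:
    virtual ~SketchTrainer() = default;

    // Wraps the text in a temporary stream and forwards to the stream overload.
    void addCorpus(std::string_view text) {
        std::istringstream stream{std::string(text)};  // the text is copied into the stream
        addCorpusFromStream(stream);
    }

    virtual void addCorpusFromStream(std::istream& stream) = 0;
};

// Toy implementation that only counts bytes.
class ByteCountingTrainer : public SketchTrainer {
public:
    void addCorpusFromStream(std::istream& stream) override {
        char c;
        while (stream.get(c)) ++bytes_;
    }
    std::size_t bytes() const { return bytes_; }

private:
    std::size_t bytes_ = 0;
};
```

Because the string is copied into the temporary stream, very large inputs briefly exist in memory twice, which is the cost the note warns about.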


◆ addCorpusFromFile()

void Mila::Data::TokenizerTrainer::addCorpusFromFile ( std::string_view filepath )

inline

Add training corpus data from a file.

Convenience method that opens the specified file and delegates to addCorpusFromStream(). The file is opened in binary mode to preserve exact byte sequences for UTF-8 text.

Multiple calls to corpus addition methods are cumulative; all provided text contributes to the final trained vocabulary.

Behavior on error:

Throws std::runtime_error if the file cannot be opened
May throw other exceptions from addCorpusFromStream() during processing

Parameters

filepath	Path to a UTF-8 encoded text file.

Exceptions

std::runtime_error	if the file cannot be opened for reading
Other	exceptions as documented by addCorpusFromStream()
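
A minimal sketch of the behavior described above, written as a free function so it compiles on its own. The name addCorpusFromFileSketch and the consume callback (which plays the role of addCorpusFromStream()) are hypothetical; the actual inline member may differ in its details and error message.

```cpp
#include <fstream>
#include <istream>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical free-function version of the convenience method.
// `consume` stands in for addCorpusFromStream().
template <typename StreamConsumer>
void addCorpusFromFileSketch(std::string_view filepath, StreamConsumer&& consume) {
    const std::string path(filepath);            // owning copy for the stream constructor
    std::ifstream file(path, std::ios::binary);  // binary mode: no newline translation
    if (!file.is_open()) {
        throw std::runtime_error("Cannot open corpus file: " + path);
    }
    consume(file);  // the file is closed when `file` goes out of scope
}
```

Binary mode matters mainly on platforms that translate line endings in text mode; it keeps the bytes seen by the trainer identical to those on disk.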


◆ addCorpusFromStream()

virtual void Mila::Data::TokenizerTrainer::addCorpusFromStream ( std::istream & stream )

pure virtual

Add training corpus data from an input stream.

This is the primary corpus addition method that all implementations must provide. The stream is read incrementally to support memory-efficient processing of large corpora. Text is expected to be UTF-8 encoded unless documented otherwise by the concrete implementation.

Multiple calls to corpus addition methods are cumulative; all provided text contributes to the final trained vocabulary.

Implementation requirements:

Must process the stream incrementally without requiring the entire corpus in memory
Should handle stream read failures gracefully (may throw exceptions)
Must support streams opened in either binary or text mode
Should document any encoding assumptions beyond UTF-8

Behavior on error:

Implementations may throw exceptions (e.g., std::runtime_error, std::ios_base::failure) on stream read failures or encoding errors
After an exception, the trainer state is implementation-defined; callers should typically discard the trainer instance

Parameters

stream	Input stream containing UTF-8 encoded text corpus. The stream is not closed by this method.

Exceptions

std::runtime_error or derived exceptions on stream read errors (implementation-defined)

Note: The stream position after this call is implementation-defined but typically will be at end-of-stream if the entire stream was consumed.


◆ train()

virtual std::shared_ptr< TokenizerVocabulary > Mila::Data::TokenizerTrainer::train ( )

pure virtual

Train the tokenizer and return the resulting vocabulary.

Analyzes all previously added corpus data to build a vocabulary according to the concrete implementation's algorithm (e.g., BPE merges, character collection). The returned vocabulary is ready for immediate use in tokenization.

The vocabulary can be saved to disk using its save() method for later reuse, and can be shared among multiple Tokenizer instances.

Returns: std::shared_ptr<TokenizerVocabulary> The trained vocabulary. Ownership is shared; the vocabulary can be used by multiple tokenizers or saved for later use.

Exceptions

std::runtime_error	if training fails
std::invalid_argument	if insufficient corpus data was provided

Example usage:

BpeTokenizerTrainer trainer(vocab_size);
trainer.addCorpusFromFile("corpus.txt");
auto vocab = trainer.train();
 
// Use immediately
BpeTokenizer tokenizer(vocab);
auto tokens = tokenizer.encode("hello world");
 
// Save for later
vocab->save("vocab.bin");

The documentation for this class was generated from the following file:

/__w/Mila/Mila/Mila/Src/Data/Core/TokenizerTrainer.ixx
