|
Mila 0.13.48
Deep Neural Network Library
|
Abstract interface for training tokenizer vocabularies from text corpora.
Public Member Functions | |
| virtual | ~TokenizerTrainer ()=default |
| Virtual destructor. | |
| void | addCorpus (std::string_view text) |
| Add training corpus data from a string. | |
| void | addCorpusFromFile (std::string_view filepath) |
| Add training corpus data from a file. | |
| virtual void | addCorpusFromStream (std::istream &stream)=0 |
| Add training corpus data from an input stream. | |
| virtual std::shared_ptr< TokenizerVocabulary > | train ()=0 |
| Train the tokenizer and return the resulting vocabulary. | |
Abstract interface for training tokenizer vocabularies from text corpora.
TokenizerTrainer defines a generic contract for building tokenizer vocabularies: implementations consume text data from various sources (streams, files, or strings) and produce a vocabulary that can be used for tokenization immediately or saved for later reuse.
Typical workflow:
Thread safety:
Design rationale:
|
virtual default |
Virtual destructor.
Ensures derived trainer implementations are properly destroyed when deleted via base pointers.
|
inline |
Add training corpus data from a string.
Convenience method for adding small text samples, useful for testing or supplementing file-based corpora. For large corpora, prefer addCorpusFromStream() or addCorpusFromFile() for better memory efficiency.
The input text is expected to be UTF-8 encoded.
Multiple calls to corpus addition methods are cumulative; all provided text contributes to the final trained vocabulary.
Behavior on error:
| text | UTF-8 encoded text to add to the training corpus. |
| Exceptions | as documented by addCorpusFromStream() |

|
inline |
Add training corpus data from a file.
Convenience method that opens the specified file and delegates to addCorpusFromStream(). The file is opened in binary mode to preserve exact byte sequences for UTF-8 text.
Multiple calls to corpus addition methods are cumulative; all provided text contributes to the final trained vocabulary.
Behavior on error:
| filepath | Path to a UTF-8 encoded text file. |
| std::runtime_error | if the file cannot be opened for reading |
| Other | exceptions as documented by addCorpusFromStream() |
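The delegation described for addCorpus() and addCorpusFromFile() could look roughly like the sketch below. It is not the Mila source: TrainerSketch and ByteCountingTrainer are hypothetical stand-ins, and the error message text is an assumption. Only the method names, the binary open mode, and the std::runtime_error on open failure come from the documentation above.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for the documented interface; not the actual Mila class.
class TrainerSketch {
public:
    virtual ~TrainerSketch() = default;

    // Primary entry point that every concrete trainer implements.
    virtual void addCorpusFromStream(std::istream& stream) = 0;

    // Wrap the string in an in-memory stream and delegate.
    void addCorpus(std::string_view text) {
        std::istringstream stream{std::string(text)};
        addCorpusFromStream(stream);
    }

    // Open in binary mode so UTF-8 bytes pass through unchanged, then delegate.
    void addCorpusFromFile(std::string_view filepath) {
        const std::string path(filepath);
        std::ifstream file(path, std::ios::binary);
        if (!file.is_open()) {
            throw std::runtime_error("Cannot open corpus file: " + path);
        }
        addCorpusFromStream(file);
    }
};

// Toy implementation that only counts the bytes it receives.
class ByteCountingTrainer : public TrainerSketch {
public:
    void addCorpusFromStream(std::istream& stream) override {
        char buffer[4096];
        // read() fails on the final partial chunk, so gcount() is checked as well.
        while (stream.read(buffer, sizeof(buffer)) || stream.gcount() > 0) {
            bytes_ += static_cast<std::size_t>(stream.gcount());
        }
    }

    std::size_t bytes() const { return bytes_; }

private:
    std::size_t bytes_ = 0;  // accumulates across calls
};
```

Because both convenience methods funnel into addCorpusFromStream(), a concrete trainer only has to implement one ingestion path, and any exceptions it documents apply to all three methods.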

|
pure virtual |
Add training corpus data from an input stream.
This is the primary corpus addition method that all implementations must provide. The stream is read incrementally to support memory-efficient processing of large corpora. Text is expected to be UTF-8 encoded unless documented otherwise by the concrete implementation.
Multiple calls to corpus addition methods are cumulative; all provided text contributes to the final trained vocabulary.
Implementation requirements:
Behavior on error:
| stream | Input stream containing UTF-8 encoded text corpus. The stream is not closed by this method. |
| std::runtime_error | or derived exceptions on stream read errors (implementation-defined) |
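As an illustration of incremental, cumulative reading, here is a minimal sketch of how an implementation might consume a stream in fixed-size chunks. ByteFrequencyCollector, its members, and the 4096-byte buffer size are assumptions made for the example; a real trainer would typically decode UTF-8 and gather whatever statistics its algorithm needs.

```cpp
#include <cstddef>
#include <ios>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>

// Hypothetical trainer fragment; the class name and members are illustrative only.
class ByteFrequencyCollector {
public:
    // Reads the stream chunk by chunk so the whole corpus is never held in memory.
    // The stream is not closed here; the caller keeps ownership.
    void addCorpusFromStream(std::istream& stream) {
        char buffer[4096];
        while (stream.read(buffer, sizeof(buffer)) || stream.gcount() > 0) {
            const std::streamsize n = stream.gcount();
            for (std::streamsize i = 0; i < n; ++i) {
                ++counts_[static_cast<unsigned char>(buffer[i])];
            }
        }
        // bad() signals an unrecoverable I/O error, as opposed to plain end of input.
        if (stream.bad()) {
            throw std::runtime_error("Stream read error while adding corpus");
        }
    }

    std::size_t distinctBytes() const { return counts_.size(); }

    std::size_t countOf(unsigned char byte) const {
        const auto it = counts_.find(byte);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    std::map<unsigned char, std::size_t> counts_;  // accumulates across calls
};
```

Calling addCorpusFromStream() twice on different streams adds to the same counts, which matches the cumulative behavior described above.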

|
pure virtual |
Train the tokenizer and return the resulting vocabulary.
Analyzes all previously added corpus data to build a vocabulary according to the concrete implementation's algorithm (e.g., BPE merges, character collection). The returned vocabulary is ready for immediate use in tokenization.
The vocabulary can be saved to disk using its save() method for later reuse, and can be shared among multiple Tokenizer instances.
| std::runtime_error | if training fails |
| std::invalid_argument | if insufficient corpus data was provided |
Example usage:
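A minimal sketch of the add-then-train workflow, written against hypothetical stand-in types rather than the actual Mila classes. ToyVocabulary, ToyTrainerBase, ToyCharTrainer, and buildToyVocabulary are all assumptions for illustration; the real TokenizerVocabulary API (including save()) is not reproduced here.

```cpp
#include <istream>
#include <memory>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <string_view>

// Stand-in vocabulary type (hypothetical; not the real TokenizerVocabulary).
struct ToyVocabulary {
    std::set<char> symbols;
};

// Stand-in for the documented interface, reduced to what the example needs.
class ToyTrainerBase {
public:
    virtual ~ToyTrainerBase() = default;
    virtual void addCorpusFromStream(std::istream& stream) = 0;
    virtual std::shared_ptr<ToyVocabulary> train() = 0;

    void addCorpus(std::string_view text) {
        std::istringstream stream{std::string(text)};
        addCorpusFromStream(stream);
    }
};

// Character-collection trainer: the vocabulary is the set of bytes seen so far.
class ToyCharTrainer : public ToyTrainerBase {
public:
    void addCorpusFromStream(std::istream& stream) override {
        char c;
        while (stream.get(c)) {
            seen_.insert(c);
        }
    }

    std::shared_ptr<ToyVocabulary> train() override {
        if (seen_.empty()) {
            throw std::invalid_argument("No corpus data was provided");
        }
        auto vocab = std::make_shared<ToyVocabulary>();
        vocab->symbols = seen_;
        return vocab;
    }

private:
    std::set<char> seen_;
};

// Workflow: add corpus data (calls are cumulative), then train.
inline std::shared_ptr<ToyVocabulary> buildToyVocabulary() {
    ToyCharTrainer trainer;
    trainer.addCorpus("hello");
    trainer.addCorpus("world");
    return trainer.train();
}
```

With a real trainer the shape would presumably be the same: one or more corpus addition calls, a single train() call, and then sharing or saving the returned vocabulary. Calling train() before adding any corpus data is the case the std::invalid_argument entry above refers to.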