Character-level tokenizer trainer.
Manages corpus accumulation and delegates vocabulary building to CharVocabulary factory methods. Provides a convenient API for incremental corpus loading and batch training.
Note: Character tokenization is simple enough that direct use of CharVocabulary::train() or CharVocabulary::trainFromFile() is often preferred. This trainer is maintained for API consistency with BpeTrainer and potential future extensions.
◆ CharTrainer()
Construct with configuration.
Parameters
| config | Character vocabulary configuration. |
Exceptions
| std::invalid_argument | if config is invalid. |
◆ addCorpusFromFile()
void Mila::Data::CharTrainer::addCorpusFromFile(const fs::path & path) [inline]
Add corpus text from file.
Convenience method for loading corpus text from the filesystem.
Parameters
| path | Path to corpus text file. |
Exceptions
| std::runtime_error | if file cannot be opened. |
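The documented contract (read a file into the corpus, throw std::runtime_error if it cannot be opened) can be sketched with the standard library alone. This is an illustration under assumptions, not the Mila implementation: the free function appendFileToCorpus and its corpus parameter are hypothetical names, and opening the file in binary mode is a guess.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not part of Mila): appends the contents of `path`
// to `corpus`, throwing std::runtime_error if the file cannot be opened.
inline void appendFileToCorpus(std::string& corpus, const fs::path& path) {
    // Binary mode is an assumption; it avoids newline translation.
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        throw std::runtime_error("Cannot open corpus file: " + path.string());
    }
    // Read to end-of-file and append every byte to the existing corpus.
    corpus.append(std::istreambuf_iterator<char>(file),
                  std::istreambuf_iterator<char>());
}
```

In this sketch the open check happens before any append, so a failed call leaves the previously accumulated text untouched.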
◆ addCorpusFromStream()
void Mila::Data::CharTrainer::addCorpusFromStream(std::istream & stream) [inline]
Add corpus text from input stream.
Accumulates text for training. Can be called multiple times to add corpus from different sources.
Parameters
| stream | Input stream containing corpus text. |
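Accumulating text from several streams might look like the following sketch. It uses only the standard library; appendStreamToCorpus and accumulateTwoSources are hypothetical names, and appending each source directly with no separator is an assumption about how accumulation works.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical stand-in for the accumulation step (not Mila code):
// reads `stream` to end-of-file and appends everything to `corpus`.
inline void appendStreamToCorpus(std::string& corpus, std::istream& stream) {
    corpus.append(std::istreambuf_iterator<char>(stream),
                  std::istreambuf_iterator<char>());
}

// Two separate sources accumulate into a single corpus buffer.
inline std::string accumulateTwoSources() {
    std::string corpus;
    std::istringstream first("hello ");
    std::istringstream second("world");
    appendStreamToCorpus(corpus, first);
    appendStreamToCorpus(corpus, second);
    return corpus;  // "hello world"
}
```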
◆ clearCorpus()
void Mila::Data::CharTrainer::clearCorpus() [inline]
Clear accumulated corpus.
Frees memory used by corpus accumulation.
◆ getConfig()
◆ getCorpusSize()
size_t Mila::Data::CharTrainer::getCorpusSize() const [inline]
Get accumulated corpus size in bytes.
Returns
Size of accumulated corpus.
◆ train()
Build vocabulary from the accumulated corpus.
Delegates to CharVocabulary::train() factory method. Clears accumulated corpus after training to free memory.
Returns
Built CharVocabulary instance.
Exceptions
| std::runtime_error | if corpus is empty. |
| std::invalid_argument | if config is invalid. |
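The train() contract described above (fail on an empty corpus, build the vocabulary, then release the corpus) can be illustrated with a self-contained sketch. SketchTrainer and SketchVocabulary are hypothetical stand-ins, not the Mila classes. Treating the vocabulary as the set of distinct bytes is an assumption, and the real CharVocabulary may handle special tokens or configuration options that this sketch ignores.

```cpp
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for CharVocabulary: the set of distinct bytes seen.
struct SketchVocabulary {
    std::set<char> symbols;
};

// Hypothetical trainer sketch (not the Mila class).
class SketchTrainer {
public:
    // Accumulate more text; may be called repeatedly before train().
    void addText(const std::string& text) { corpus_ += text; }

    // Size of the accumulated corpus in bytes.
    std::size_t getCorpusSize() const { return corpus_.size(); }

    // Build the vocabulary, then release the corpus buffer.
    SketchVocabulary train() {
        if (corpus_.empty()) {
            throw std::runtime_error("corpus is empty");
        }
        SketchVocabulary vocab{std::set<char>(corpus_.begin(), corpus_.end())};
        // Swap with an empty string so the allocation is actually freed.
        std::string().swap(corpus_);
        return vocab;
    }

private:
    std::string corpus_;
};
```

Because the corpus is cleared after training, a second train() call in this sketch throws unless more text has been added in between.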
◆ config_
◆ corpus_
std::string Mila::Data::CharTrainer::corpus_ [private]