Mila 0.13.48
Deep Neural Network Library
Mila::Data::CharTrainer Class Reference [export]

Character-level tokenizer trainer.

Public Member Functions

 CharTrainer (const CharVocabularyConfig &config=CharVocabularyConfig{})
 Construct with configuration.
void addCorpusFromFile (const fs::path &path)
 Add corpus text from file.
void addCorpusFromStream (std::istream &stream)
 Add corpus text from input stream.
void clearCorpus ()
 Clear accumulated corpus.
const CharVocabularyConfig & getConfig () const
 Get the trainer configuration.
size_t getCorpusSize () const
 Get accumulated corpus size in bytes.
CharVocabulary train ()
 Build vocabulary from the accumulated corpus.

Private Attributes

CharVocabularyConfig config_
std::string corpus_

Detailed Description

Character-level tokenizer trainer.

Manages corpus accumulation and delegates vocabulary building to CharVocabulary factory methods. Provides a convenient API for incremental corpus loading and batch training.

Note: Character tokenization is simple enough that direct use of CharVocabulary::train() or CharVocabulary::trainFromFile() is often preferred. This trainer is maintained for API consistency with BpeTrainer and potential future extensions.
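
The accumulate-then-train workflow described above can be sketched with a small stand-in class. `SketchCharTrainer` and `sketchWorkflow` below are hypothetical: the class mirrors the documented method names but is not Mila's implementation, it ignores `CharVocabularyConfig`, and its `train()` returns a `std::set<char>` of distinct bytes because the interface of `CharVocabulary` is not shown on this page.

```cpp
#include <cstddef>
#include <istream>
#include <iterator>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in that mirrors the documented CharTrainer method names.
// It is NOT the Mila implementation and does not model CharVocabularyConfig.
class SketchCharTrainer {
public:
    // Append everything readable from the stream to the accumulated corpus.
    void addCorpusFromStream(std::istream& stream) {
        corpus_.append(std::istreambuf_iterator<char>(stream),
                       std::istreambuf_iterator<char>());
    }

    // Corpus size in bytes.
    std::size_t getCorpusSize() const { return corpus_.size(); }

    void clearCorpus() { corpus_.clear(); }

    // Assumption: the "vocabulary" is modeled as the set of distinct bytes.
    std::set<char> train() {
        if (corpus_.empty()) {
            throw std::runtime_error("corpus is empty");
        }
        std::set<char> vocab(corpus_.begin(), corpus_.end());
        corpus_.clear();  // documented behavior: corpus is cleared after training
        return vocab;
    }

private:
    std::string corpus_;
};

// Example workflow: two sources are accumulated, then trained in one pass.
// Returns the number of distinct bytes found.
inline std::size_t sketchWorkflow() {
    SketchCharTrainer trainer;
    std::istringstream first("hello ");
    std::istringstream second("world");
    trainer.addCorpusFromStream(first);
    trainer.addCorpusFromStream(second);
    return trainer.train().size();
}
```

Because `train()` clears the corpus, a second call without adding new text would throw in this sketch, which matches the exception documented for an empty corpus.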

Constructor & Destructor Documentation

◆ CharTrainer()

Mila::Data::CharTrainer::CharTrainer ( const CharVocabularyConfig & config = CharVocabularyConfig{})
inline explicit

Construct with configuration.

Parameters
config: Character vocabulary configuration.
Exceptions
std::invalid_argument if config is invalid.

Member Function Documentation

◆ addCorpusFromFile()

void Mila::Data::CharTrainer::addCorpusFromFile ( const fs::path & path)
inline

Add corpus text from file.

Convenience method for loading corpus from filesystem.

Parameters
path: Path to corpus text file.
Exceptions
std::runtime_error if file cannot be opened.
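
As an illustration of the open-or-throw behavior documented above, here is a minimal sketch. `appendCorpusFromFile` is a hypothetical free function, not part of Mila. It assumes `fs` is an alias for `std::filesystem` and that the file is read in binary mode; neither detail is stated on this page.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;  // assumption: the alias used in the signature

// Hypothetical helper, not part of Mila: appends a file's contents to `corpus`.
// Throws std::runtime_error if the file cannot be opened; `corpus` is left
// unchanged in that case.
inline void appendCorpusFromFile(std::string& corpus, const fs::path& path) {
    std::ifstream file(path, std::ios::binary);  // binary mode is an assumption
    if (!file) {
        throw std::runtime_error("Cannot open corpus file: " + path.string());
    }
    corpus.append(std::istreambuf_iterator<char>(file),
                  std::istreambuf_iterator<char>());
}
```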

◆ addCorpusFromStream()

void Mila::Data::CharTrainer::addCorpusFromStream ( std::istream & stream)
inline

Add corpus text from input stream.

Accumulates text for training. Can be called multiple times to add corpus from different sources.

Parameters
stream: Input stream containing corpus text.

◆ clearCorpus()

void Mila::Data::CharTrainer::clearCorpus ( )
inline

Clear accumulated corpus.

Frees memory used by corpus accumulation.
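
This page does not say how the memory is released. As a general C++ point, `std::string::clear()` empties the string but typically leaves its capacity unchanged, so code that needs the buffer returned often swaps with an empty temporary. The sketch below shows that idiom; `releaseCorpus` is a hypothetical name and is not taken from Mila.

```cpp
#include <string>

// Hypothetical helper, not part of Mila: empties `corpus` and hands its
// heap buffer to a temporary, which is destroyed at the end of the statement.
inline void releaseCorpus(std::string& corpus) {
    std::string().swap(corpus);
}
```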

◆ getConfig()

const CharVocabularyConfig & Mila::Data::CharTrainer::getConfig ( ) const
inline

Get the trainer configuration.

Returns
Const reference to the trainer configuration.

◆ getCorpusSize()

size_t Mila::Data::CharTrainer::getCorpusSize ( ) const
inline

Get accumulated corpus size in bytes.

Returns
Size of the accumulated corpus, in bytes.

◆ train()

CharVocabulary Mila::Data::CharTrainer::train ( )
inline

Build vocabulary from the accumulated corpus.

Delegates to CharVocabulary::train() factory method. Clears accumulated corpus after training to free memory.

Returns
Built CharVocabulary instance.
Exceptions
std::runtime_error if corpus is empty.
std::invalid_argument if config is invalid.
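
The contract above (reject an empty corpus, build, then clear) can be illustrated with a stand-in vocabulary. Everything in this sketch is an assumption made for illustration: `SketchVocabulary` and `trainAndClear` are hypothetical names, ids are assigned in sorted byte order, and neither special tokens nor `CharVocabularyConfig` validation are modeled. The real `CharVocabulary::train()` may behave differently.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a character vocabulary: distinct byte -> dense id.
using SketchVocabulary = std::map<char, std::size_t>;

// Hypothetical helper, not part of Mila: builds the map, then clears `corpus`.
inline SketchVocabulary trainAndClear(std::string& corpus) {
    if (corpus.empty()) {
        throw std::runtime_error("Cannot train: corpus is empty");
    }
    SketchVocabulary vocab;
    for (char c : corpus) {
        vocab.emplace(c, 0);  // collect distinct bytes; std::map keeps them sorted
    }
    std::size_t next_id = 0;
    for (auto& [ch, id] : vocab) {
        (void)ch;
        id = next_id++;       // assign ids in sorted byte order (an assumption)
    }
    corpus.clear();           // mirrors the documented post-training clear
    return vocab;
}
```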

Member Data Documentation

◆ config_

CharVocabularyConfig Mila::Data::CharTrainer::config_
private

◆ corpus_

std::string Mila::Data::CharTrainer::corpus_
private
