Mila 0.13.48
Deep Neural Network Library
Mila::Data::BpeTrainer Class Reference [export]

Corpus accumulator and trainer for BPE vocabularies.

Public Member Functions

 BpeTrainer (const BpeVocabularyConfig &config=BpeVocabularyConfig{})
 Construct with a vocabulary configuration.
void addCorpusFromFile (const std::filesystem::path &path)
 Append corpus text from a file.
void addCorpusFromStream (std::istream &stream)
 Append corpus text from a stream.
void clearCorpus ()
const BpeVocabularyConfig & getConfig () const
size_t getCorpusSize () const
BpeVocabulary train ()
 Train a BPE vocabulary on the accumulated corpus.

Private Attributes

BpeVocabularyConfig config_
std::string corpus_

Detailed Description

Corpus accumulator and trainer for BPE vocabularies.

Typical usage:

BpeTrainer trainer( BpeVocabularyConfig()
    .withVocabSize( 32000 )
    .withByteLevel( true )
    .withPreTokenization( PreTokenizationMode::Gpt2Regex )
    .withPreTokenizationPattern( GPT2_PRETOKENIZATION_PATTERN ) );
trainer.addCorpusFromFile( "corpus.txt" );
BpeVocabulary vocab = trainer.train();
vocab.save( "my_vocab.bin" );

Symbols referenced in the example above:

BpeTrainer(const BpeVocabularyConfig &config=BpeVocabularyConfig{})
 Construct with a vocabulary configuration. Definition: BpeTrainer.ixx:54
BpeVocabularyConfig
 Configuration for the BPE vocabulary. Definition: BpeVocabularyConfig.ixx:47
BpeVocabulary
 Unified Byte Pair Encoding (BPE) vocabulary. Definition: BpeVocabulary.ixx:55
void save(const fs::path &path) const override
 Serialize the vocabulary to Mila binary format (content version 2). Definition: BpeVocabulary.ixx:218
static BpeVocabulary train(const std::string &corpus, const BpeVocabularyConfig &config)
 Train a BPE vocabulary from a text corpus. Definition: BpeVocabulary.ixx:70
constexpr const char * GPT2_PRETOKENIZATION_PATTERN
 Definition: BpePreTokenizationMode.ixx:31
PreTokenizationMode::Gpt2Regex
 Definition: BpePreTokenizationMode.ixx:23

Constructor & Destructor Documentation

◆ BpeTrainer()

Mila::Data::BpeTrainer::BpeTrainer ( const BpeVocabularyConfig & config = BpeVocabularyConfig{})
inline explicit

Construct with a vocabulary configuration.

validate() is called immediately so misconfigured trainers fail at construction rather than at train() time.

Parameters
 config  BPE vocabulary configuration.
Exceptions
 std::invalid_argument  if config fails validation.
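
The validate-on-construct behaviour described above can be illustrated with a small stand-in. This is only a sketch: SketchConfig, SketchTrainer, the vocabSize field, and the rule that validate() checks are all hypothetical and are not taken from Mila's source.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for BpeVocabularyConfig. The field and the rule
// enforced by validate() are assumptions made for illustration only.
struct SketchConfig {
    std::size_t vocabSize = 32000;

    void validate() const {
        if (vocabSize == 0) {
            throw std::invalid_argument("vocabSize must be greater than zero");
        }
    }
};

// Hypothetical stand-in for BpeTrainer showing the validate-on-construct idea.
class SketchTrainer {
public:
    explicit SketchTrainer(const SketchConfig& config = SketchConfig{})
        : config_(config) {
        // Throws here, so an invalid trainer object never exists.
        config_.validate();
    }

    const SketchConfig& getConfig() const { return config_; }

private:
    SketchConfig config_;
};
```

With this pattern, a caller that wants to recover from a bad configuration wraps the construction, not the later train() call, in a try block.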

Member Function Documentation

◆ addCorpusFromFile()

void Mila::Data::BpeTrainer::addCorpusFromFile ( const std::filesystem::path & path)
inline

Append corpus text from a file.

Parameters
 path  Path to a UTF-8 text file.
Exceptions
 std::runtime_error  if the file cannot be opened.
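
One plausible way to implement this contract with the standard library is sketched below. The free function appendFileToCorpus is a hypothetical name, and reading in binary mode is an assumption; Mila's member function may do this differently.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: appends the whole file to `corpus`, or throws
// std::runtime_error if the file cannot be opened.
void appendFileToCorpus(std::string& corpus, const std::filesystem::path& path) {
    // Binary mode (an assumption) keeps the UTF-8 bytes untouched.
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        throw std::runtime_error("Cannot open corpus file: " + path.string());
    }
    // Read every byte up to end-of-file and append it to the corpus.
    corpus.append(std::istreambuf_iterator<char>(file),
                  std::istreambuf_iterator<char>());
}
```

Because the check happens before any bytes are appended, a failed open leaves the existing corpus unchanged in this sketch.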

◆ addCorpusFromStream()

void Mila::Data::BpeTrainer::addCorpusFromStream ( std::istream & stream)
inline

Append corpus text from a stream.

May be called multiple times to accumulate text from different sources before a single train() call.

Parameters
 stream  Input stream containing UTF-8 text.
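
Accumulating text from several sources before training might look like the sketch below. appendStreamToCorpus and accumulateTwoSources are hypothetical helpers built on the standard library only. The sketch concatenates the sources with no separator, which is an assumption; the reference does not say whether Mila inserts one.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: drains `stream` to end-of-file and appends the
// bytes to `corpus`.
void appendStreamToCorpus(std::string& corpus, std::istream& stream) {
    corpus.append(std::istreambuf_iterator<char>(stream),
                  std::istreambuf_iterator<char>());
}

// Two in-memory streams stand in for two different corpus sources.
std::string accumulateTwoSources() {
    std::istringstream first("low lower ");
    std::istringstream second("newest widest");

    std::string corpus;
    appendStreamToCorpus(corpus, first);   // first call
    appendStreamToCorpus(corpus, second);  // second call appends, not replaces
    return corpus;
}
```

Any std::istream works here, so the same call could take a std::ifstream, std::cin, or a string stream.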

◆ clearCorpus()

void Mila::Data::BpeTrainer::clearCorpus ( )
inline

◆ getConfig()

const BpeVocabularyConfig & Mila::Data::BpeTrainer::getConfig ( ) const
inline

◆ getCorpusSize()

size_t Mila::Data::BpeTrainer::getCorpusSize ( ) const
inline

◆ train()

BpeVocabulary Mila::Data::BpeTrainer::train ( )
inline

Train a BPE vocabulary on the accumulated corpus.

Delegates to BpeVocabulary::train() and clears the accumulated corpus afterwards to release memory. The returned vocabulary can be saved via BpeVocabulary::save() and later reloaded with BpeVocabulary::load().

Returns
Trained BpeVocabulary instance.
Exceptions
 std::runtime_error  if no corpus has been accumulated.
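
The check, delegate, and clear sequence described above can be sketched with stand-in types. SketchVocabulary, trainSketchVocabulary, SketchCorpusTrainer, and addText are all hypothetical. The training step here only records the corpus size, and the shrink_to_fit() call is an assumption about how the memory might be released.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for BpeVocabulary.
struct SketchVocabulary {
    std::size_t trainedOnBytes = 0;
};

// Hypothetical stand-in for the static BpeVocabulary::train(); it performs
// no real BPE training and only records how much text it received.
SketchVocabulary trainSketchVocabulary(const std::string& corpus) {
    return SketchVocabulary{corpus.size()};
}

// Hypothetical stand-in for BpeTrainer.
class SketchCorpusTrainer {
public:
    void addText(const std::string& text) { corpus_ += text; }
    std::size_t getCorpusSize() const { return corpus_.size(); }

    SketchVocabulary train() {
        if (corpus_.empty()) {
            throw std::runtime_error("No corpus has been accumulated");
        }
        SketchVocabulary vocab = trainSketchVocabulary(corpus_);  // delegate
        corpus_.clear();          // drop the text once training is done
        corpus_.shrink_to_fit();  // non-binding request to free the buffer
        return vocab;
    }

private:
    std::string corpus_;
};
```

In this sketch, a second train() call without adding new text throws, because the first call emptied the corpus. Given the documented behaviour, the real class likely behaves the same way, so corpus text has to be added again before retraining.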

Member Data Documentation

◆ config_

BpeVocabularyConfig Mila::Data::BpeTrainer::config_
private

◆ corpus_

std::string Mila::Data::BpeTrainer::corpus_
private

The documentation for this class was generated from the following file: