Corpus accumulator and trainer for BPE vocabularies.
Typical usage:
.withVocabSize( 32000 )
.withByteLevel( true )
trainer.addCorpusFromFile( "corpus.txt" );
vocab.save( "my_vocab.bin" );
◆ BpeTrainer()
Construct with a vocabulary configuration.
validate() is called immediately so misconfigured trainers fail at construction rather than at train() time.
- Parameters
| config | BPE vocabulary configuration. |
- Exceptions
| std::invalid_argument | if config fails validation. |
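The validate-on-construct behavior described above can be sketched as follows. This is a minimal illustration, not Mila's implementation: `SketchConfig`, `SketchTrainer`, `constructionFails`, and the "at least 256 entries" rule are all hypothetical stand-ins, since the real validation rules are not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for BpeVocabularyConfig.
struct SketchConfig {
    std::size_t vocab_size = 32000;

    // Assumed rule, for illustration only: require at least 256 entries.
    void validate() const {
        if (vocab_size < 256) {
            throw std::invalid_argument("vocab_size must be at least 256");
        }
    }
};

// Hypothetical stand-in for BpeTrainer showing validation in the constructor.
class SketchTrainer {
public:
    explicit SketchTrainer(const SketchConfig& config = SketchConfig{})
        : config_(config) {
        config_.validate();  // throws here, so no invalid trainer ever exists
    }

    const SketchConfig& getConfig() const { return config_; }

private:
    SketchConfig config_;
};

// Returns true if constructing a trainer from `config` throws std::invalid_argument.
bool constructionFails(const SketchConfig& config) {
    try {
        SketchTrainer trainer(config);
        (void)trainer;
        return false;
    } catch (const std::invalid_argument&) {
        return true;
    }
}

// Usage check: a bad configuration is rejected up front, a default one is accepted.
void demoFailFast() {
    SketchConfig bad;
    bad.vocab_size = 10;
    assert(constructionFails(bad));
    assert(!constructionFails(SketchConfig{}));
}
```

Validating in the constructor means a caller who holds a trainer object can assume its configuration is usable, so later calls do not need to repeat the check.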
◆ addCorpusFromFile()
inline void Mila::Data::BpeTrainer::addCorpusFromFile( const std::filesystem::path& path )
Append corpus text from a file.
- Parameters
| path | Path to a UTF-8 text file. |
- Exceptions
| std::runtime_error | if the file cannot be opened. |
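One way to get the documented behavior (append the file's contents, throw std::runtime_error if it cannot be opened) is sketched below. `appendFileToCorpus` and `demoAppendFile` are hypothetical names, and reading the file in binary mode without inserting a separator is an assumption; the actual member function may differ.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: appends the raw bytes of `path` to `corpus`.
void appendFileToCorpus(std::string& corpus, const std::filesystem::path& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("Cannot open corpus file: " + path.string());
    }
    // Read to end of file and append; existing corpus text is preserved.
    corpus.append(std::istreambuf_iterator<char>(in),
                  std::istreambuf_iterator<char>());
}

// Usage check: write a small temporary file, append it, then clean up.
void demoAppendFile() {
    const auto path =
        std::filesystem::temp_directory_path() / "bpe_sketch_corpus.txt";
    {
        std::ofstream out(path, std::ios::binary);
        out << "low lower lowest";
    }
    std::string corpus = "seed ";
    appendFileToCorpus(corpus, path);
    assert(corpus == "seed low lower lowest");
    std::filesystem::remove(path);
}
```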
◆ addCorpusFromStream()
inline void Mila::Data::BpeTrainer::addCorpusFromStream( std::istream& stream )
Append corpus text from a stream.
May be called multiple times to accumulate text from different sources before a single train() call.
- Parameters
| stream | Input stream containing UTF-8 text. |
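Accumulating text across several calls can be sketched with a small stand-in class. `CorpusAccumulator` is hypothetical; it borrows the member names documented on this page but assumes that sources are concatenated with no separator and that the corpus size is counted in bytes, neither of which is confirmed here.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical stand-in for the corpus-accumulating part of BpeTrainer.
class CorpusAccumulator {
public:
    // Drains the stream and appends its contents to the stored corpus.
    void addCorpusFromStream(std::istream& stream) {
        corpus_.append(std::istreambuf_iterator<char>(stream),
                       std::istreambuf_iterator<char>());
    }

    // Size of the accumulated corpus in bytes (assumed unit).
    std::size_t getCorpusSize() const { return corpus_.size(); }

    // Discards all accumulated text.
    void clearCorpus() { corpus_.clear(); }

private:
    std::string corpus_;
};

// Usage check: two sources accumulate into one corpus before any training step.
void demoAccumulate() {
    CorpusAccumulator acc;
    std::istringstream first("hello ");
    std::istringstream second("world");
    acc.addCorpusFromStream(first);
    acc.addCorpusFromStream(second);
    assert(acc.getCorpusSize() == 11);  // 6 bytes + 5 bytes
    acc.clearCorpus();
    assert(acc.getCorpusSize() == 0);
}
```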
◆ clearCorpus()
inline void Mila::Data::BpeTrainer::clearCorpus()
◆ getConfig()
◆ getCorpusSize()
inline size_t Mila::Data::BpeTrainer::getCorpusSize() const
◆ train()
◆ config_
◆ corpus_
std::string Mila::Data::BpeTrainer::corpus_ (private)