Mila 0.13.48
Deep Neural Network Library
Mila::Data::CharVocabulary Class Reference (export)

Character vocabulary for tokenization.

Public Member Functions

 CharVocabulary ()=delete
TokenId charToIndex (char c) const
const CharVocabularyConfig & getConfig () const
 Get the configuration used to create this vocabulary.
size_t getSize () const override
 Get the number of tokens in the vocabulary.
bool hasSpecialTokens () const
std::optional< std::string > idToToken (TokenId id) const override
 Map a numeric id back to its token string.
char indexToChar (TokenId idx) const
TokenId padTokenId () const
void save (const fs::path &path) const override
 Serialize vocabulary to disk with configuration.
std::optional< TokenId > tokenToId (const std::string &token) const override
 Map a token string to its numeric id.
TokenId unkTokenId () const
Public Member Functions inherited from Mila::Data::TokenizerVocabulary
virtual ~TokenizerVocabulary ()=default
 Virtual destructor.
virtual void save (const std::filesystem::path &path) const =0
 Serialize the vocabulary to disk at the given path.

Static Public Member Functions

static CharVocabulary load (const fs::path &path)
 Load vocabulary from Mila binary format.
static CharVocabulary train (const std::string &corpus, const CharVocabularyConfig &config)
 Build a character vocabulary from text corpus.
static CharVocabulary trainFromFile (const fs::path &corpus_path, const CharVocabularyConfig &config)
 Build a character vocabulary from corpus file.

Private Member Functions

 CharVocabulary (const CharVocabularyConfig &config)
void addRegularTokens (const std::vector< unsigned char > &sorted_bytes)
void addSpecialTokensFromConfig ()
void buildFromText (const std::string &corpus)
std::unordered_map< unsigned char, bool > extractUniqueBytes (const std::string &text) const
void loadContent (std::istream &file)
std::string normalizeText (const std::string &text) const
void saveContent (std::ostream &file) const
std::vector< unsigned char > sortBytes (const std::unordered_map< unsigned char, bool > &unique_bytes) const

Private Attributes

std::unordered_map< char, TokenId > char_to_idx_
CharVocabularyConfig config_
std::vector< char > idx_to_char_
TokenId pad_token_id_
TokenId unk_token_id_

Detailed Description

Character vocabulary for tokenization.

Immutable vocabulary created via static factory methods. Stores configuration for full provenance tracking and serialization.

Thread safety: Immutable after construction, safe for concurrent reads.
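
The factory-only, read-only design described above can be sketched in isolation. The class below is a hypothetical stand-in (`FrozenCharSet` and `fromText` are illustrative names, not part of Mila); it only shows the general shape of a deleted default constructor, a private constructor, and a static factory that finishes all mutation before the object is returned.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in, NOT the Mila class: it illustrates the
// "private constructor + static factory + const accessors" shape only.
class FrozenCharSet {
public:
    FrozenCharSet() = delete;  // no default-constructed, half-built objects

    // Factory: all mutation happens here, before the object is handed out.
    static FrozenCharSet fromText(const std::string& text) {
        std::vector<bool> seen(256, false);
        for (unsigned char b : text) {
            seen[b] = true;
        }
        std::vector<char> chars;
        for (std::size_t b = 0; b < seen.size(); ++b) {
            if (seen[b]) {
                chars.push_back(static_cast<char>(b));
            }
        }
        return FrozenCharSet(std::move(chars));
    }

    // The public interface is const-only, so readers do not modify state.
    std::size_t size() const { return chars_.size(); }
    char at(std::size_t i) const { return chars_.at(i); }

private:
    explicit FrozenCharSet(std::vector<char> chars) : chars_(std::move(chars)) {}

    std::vector<char> chars_;  // byte values in ascending order
};
```

Because every public member is `const`, several threads can call `size()` and `at()` on the same instance without synchronization, which is the property the thread-safety note relies on.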

Constructor & Destructor Documentation

◆ CharVocabulary() [1/2]

Mila::Data::CharVocabulary::CharVocabulary ( )
delete

◆ CharVocabulary() [2/2]

Mila::Data::CharVocabulary::CharVocabulary ( const CharVocabularyConfig & config)
inline explicit private

Member Function Documentation

◆ addRegularTokens()

void Mila::Data::CharVocabulary::addRegularTokens ( const std::vector< unsigned char > & sorted_bytes)
inline private

◆ addSpecialTokensFromConfig()

void Mila::Data::CharVocabulary::addSpecialTokensFromConfig ( )
inline private

◆ buildFromText()

void Mila::Data::CharVocabulary::buildFromText ( const std::string & corpus)
inline private

◆ charToIndex()

TokenId Mila::Data::CharVocabulary::charToIndex ( char c) const
inline

◆ extractUniqueBytes()

std::unordered_map< unsigned char, bool > Mila::Data::CharVocabulary::extractUniqueBytes ( const std::string & text) const
inline private

◆ getConfig()

const CharVocabularyConfig & Mila::Data::CharVocabulary::getConfig ( ) const
inline

Get the configuration used to create this vocabulary.

Returns
const CharVocabularyConfig& Configuration reference.

◆ getSize()

size_t Mila::Data::CharVocabulary::getSize ( ) const
inline override virtual

Get the number of tokens in the vocabulary.

Returns
size_t Number of entries (tokens) present in the vocabulary.

Implements Mila::Data::TokenizerVocabulary.

◆ hasSpecialTokens()

bool Mila::Data::CharVocabulary::hasSpecialTokens ( ) const
inline

◆ idToToken()

std::optional< std::string > Mila::Data::CharVocabulary::idToToken ( TokenId id) const
inline override virtual

Map a numeric id back to its token string.

Returns an empty optional if the id is out of range or not defined.

Parameters
id  Token id to convert.
Returns
std::optional<std::string> The token string if present, otherwise empty.

Implements Mila::Data::TokenizerVocabulary.

◆ indexToChar()

char Mila::Data::CharVocabulary::indexToChar ( TokenId idx) const
inline

◆ load()

CharVocabulary Mila::Data::CharVocabulary::load ( const fs::path & path)
inline static

Load vocabulary from Mila binary format.

Reads vocabulary and configuration from file written by save().

Parameters
path  Input file path.
Returns
Loaded CharVocabulary instance.
Exceptions
std::runtime_error  on I/O errors or format incompatibility.

◆ loadContent()

void Mila::Data::CharVocabulary::loadContent ( std::istream & file)
inline private

◆ normalizeText()

std::string Mila::Data::CharVocabulary::normalizeText ( const std::string & text) const
inline private

◆ padTokenId()

TokenId Mila::Data::CharVocabulary::padTokenId ( ) const
inline

◆ save()

void Mila::Data::CharVocabulary::save ( const fs::path & path) const
inline override

Serialize vocabulary to disk with configuration.

File format includes MilaFileHeader with config metadata followed by binary vocabulary content.

Parameters
path  Output file path. Parent directory must exist.
Exceptions
std::runtime_error  on I/O errors.
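
The header-then-content layout and the `std::runtime_error` failure mode can be illustrated with a small round trip. The layout below (a 4-byte magic string `CVOC`, a little-endian 32-bit count, then the raw bytes) is invented for this sketch and is not the actual `MilaFileHeader` format; `writeChars` and `readChars` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical layout for illustration only (NOT the Mila file format):
//   "CVOC" magic | 32-bit little-endian count | count raw bytes
inline void writeChars(std::ostream& out, const std::vector<char>& chars) {
    out.write("CVOC", 4);
    const std::uint32_t n = static_cast<std::uint32_t>(chars.size());
    for (int shift = 0; shift < 32; shift += 8) {
        out.put(static_cast<char>((n >> shift) & 0xFFu));
    }
    out.write(chars.data(), static_cast<std::streamsize>(chars.size()));
    if (!out) {
        throw std::runtime_error("write failed");
    }
}

inline std::vector<char> readChars(std::istream& in) {
    char magic[4] = {};
    in.read(magic, 4);
    if (!in || std::string(magic, 4) != "CVOC") {
        throw std::runtime_error("bad or missing header");
    }
    std::uint32_t n = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const int byte = in.get();
        if (byte == std::char_traits<char>::eof()) {
            throw std::runtime_error("truncated count");
        }
        n |= static_cast<std::uint32_t>(byte) << shift;
    }
    if (n > 256) {  // a byte-level table cannot hold more than 256 entries
        throw std::runtime_error("implausible entry count");
    }
    std::vector<char> chars(n);
    in.read(chars.data(), static_cast<std::streamsize>(n));
    if (!in) {
        throw std::runtime_error("truncated content");
    }
    return chars;
}
```

Validating the header before reading the content lets a loader reject files from an incompatible writer early, which is one plausible reading of the "format incompatibility" error documented for load().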

◆ saveContent()

void Mila::Data::CharVocabulary::saveContent ( std::ostream & file) const
inline private

◆ sortBytes()

std::vector< unsigned char > Mila::Data::CharVocabulary::sortBytes ( const std::unordered_map< unsigned char, bool > & unique_bytes) const
inline private

◆ tokenToId()

std::optional< TokenId > Mila::Data::CharVocabulary::tokenToId ( const std::string & token) const
inline override virtual

Map a token string to its numeric id.

The lookup returns an empty optional when the token is not present in the vocabulary. Implementations may provide an explicit unknown-token id; callers should interpret an empty optional as "no mapping".

Parameters
token  UTF-8 encoded token string to look up.
Returns
std::optional<TokenId> The token id if present, otherwise empty.

Implements Mila::Data::TokenizerVocabulary.
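
A caller that wants an unknown-token fallback instead of "no mapping" can apply one explicitly on top of the optional result. The sketch below uses a hypothetical stand-in (`ToyVocab`) with the same lookup contract; the `TokenId` alias is an assumption, since the real definition is not shown on this page, and `idOrFallback` is an illustrative helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

using TokenId = std::int32_t;  // assumption: Mila's TokenId may be a different type

// Hypothetical stand-in with the same contract as tokenToId():
// an empty optional means "no mapping".
struct ToyVocab {
    std::unordered_map<std::string, TokenId> ids;

    std::optional<TokenId> tokenToId(const std::string& token) const {
        const auto it = ids.find(token);
        if (it == ids.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};

// The caller decides what a missing token maps to, e.g. an unknown-token id.
inline TokenId idOrFallback(const ToyVocab& vocab, const std::string& token,
                            TokenId fallback) {
    return vocab.tokenToId(token).value_or(fallback);
}
```

With the real class, the fallback argument would presumably be the value returned by unkTokenId(), though this page does not document that accessor's behavior.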

◆ train()

CharVocabulary Mila::Data::CharVocabulary::train ( const std::string & corpus,
const CharVocabularyConfig & config )
inline static

Build a character vocabulary from text corpus.

Parameters
corpus  Training text corpus.
config  Vocabulary configuration.
Returns
Trained CharVocabulary instance.
Exceptions
std::invalid_argument  if config is invalid.
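
The private helpers listed for this class (extractUniqueBytes, sortBytes, addRegularTokens) suggest that training collects the distinct bytes of the corpus, sorts them, and assigns consecutive ids. The sketch below shows that general technique under stated assumptions: `ToyCharTable` and `buildCharTable` are hypothetical names, the `TokenId` alias is a guess, and placing reserved (special-token) ids before the regular ones is an assumption rather than documented behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using TokenId = std::int32_t;  // assumption: Mila's TokenId may be a different type

// Hypothetical stand-in for the two lookup tables.
struct ToyCharTable {
    std::unordered_map<char, TokenId> char_to_idx;
    std::vector<char> idx_to_char;
};

// reserved_ids: number of leading ids assumed to be set aside for special tokens.
inline ToyCharTable buildCharTable(const std::string& corpus, TokenId reserved_ids) {
    // Collect the bytes of the corpus, then sort and deduplicate them so
    // that id assignment is deterministic for a given corpus.
    std::vector<unsigned char> bytes(corpus.begin(), corpus.end());
    std::sort(bytes.begin(), bytes.end());
    bytes.erase(std::unique(bytes.begin(), bytes.end()), bytes.end());

    ToyCharTable table;
    // Placeholder slots for the reserved ids (no character maps to them).
    table.idx_to_char.assign(static_cast<std::size_t>(reserved_ids), '\0');

    // Regular characters receive consecutive ids in ascending byte order.
    for (unsigned char b : bytes) {
        const char c = static_cast<char>(b);
        table.char_to_idx[c] = static_cast<TokenId>(table.idx_to_char.size());
        table.idx_to_char.push_back(c);
    }
    return table;
}
```

Sorting before assigning ids means the same corpus always yields the same mapping, independent of hash-map iteration order.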

◆ trainFromFile()

CharVocabulary Mila::Data::CharVocabulary::trainFromFile ( const fs::path & corpus_path,
const CharVocabularyConfig & config )
inline static

Build a character vocabulary from corpus file.

Parameters
corpus_path  Path to training corpus text file.
config  Vocabulary configuration.
Returns
Trained CharVocabulary instance.
Exceptions
std::runtime_error  if file cannot be opened.
std::invalid_argument  if config is invalid.

◆ unkTokenId()

TokenId Mila::Data::CharVocabulary::unkTokenId ( ) const
inline

Member Data Documentation

◆ char_to_idx_

std::unordered_map<char, TokenId> Mila::Data::CharVocabulary::char_to_idx_
private

◆ config_

CharVocabularyConfig Mila::Data::CharVocabulary::config_
private

◆ idx_to_char_

std::vector<char> Mila::Data::CharVocabulary::idx_to_char_
private

◆ pad_token_id_

TokenId Mila::Data::CharVocabulary::pad_token_id_
private

◆ unk_token_id_

TokenId Mila::Data::CharVocabulary::unk_token_id_
private
