Mila 0.13.48
Deep Neural Network Library
Mila::Data::TokenizerVocabulary Class Reference (abstract, export)

Generic tokenizer vocabulary interface.

Inheritance diagram for Mila::Data::TokenizerVocabulary (image not reproduced).

Public Member Functions

virtual ~TokenizerVocabulary ()=default
 Virtual destructor.
virtual size_t getSize () const =0
 Get the number of tokens in the vocabulary.
virtual std::optional< std::string > idToToken (TokenId id) const =0
 Map a numeric id back to its token string.
virtual void save (const std::filesystem::path &path) const =0
 Serialize the vocabulary to disk at the given path.
virtual std::optional< TokenId > tokenToId (const std::string &token) const =0
 Map a token string to its numeric id.

Detailed Description

Generic tokenizer vocabulary interface.

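TokenizerVocabulary provides a small, implementation-agnostic contract for converting between token strings and numeric ids and for serializing vocabulary state to disk. Only save() appears among the members listed here; the load() counterpart mentioned below is not part of this listing.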

Notes for implementers:

  • Token strings are expected to be UTF-8 encoded; implementations must document any different encoding assumptions.
  • The use of std::optional in lookup methods indicates a missing token or id. Callers should handle the empty optional as "not found".
  • Thread-safety is implementation-defined; callers should synchronize concurrent modifications (e.g., load) and consult the documentation of the concrete class.
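
As an illustration of the notes above, here is a minimal sketch of what an implementation might look like. Everything in the `sketch` namespace is an assumption made for this example: the `TokenId` alias (its real definition is not shown on this page), the stand-in copy of the interface, the hypothetical `ListVocabulary` class, and the one-token-per-line file format. It is not how BpeVocabulary or CharVocabulary are implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

namespace sketch {

using TokenId = std::uint32_t;  // assumption: the real alias may differ

// Stand-in that mirrors the members documented on this page.
class TokenizerVocabulary {
public:
    virtual ~TokenizerVocabulary() = default;
    virtual std::size_t getSize() const = 0;
    virtual std::optional<std::string> idToToken(TokenId id) const = 0;
    virtual void save(const std::filesystem::path& path) const = 0;
    virtual std::optional<TokenId> tokenToId(const std::string& token) const = 0;
};

// Hypothetical implementation: a token's id is its index in the list.
class ListVocabulary final : public TokenizerVocabulary {
public:
    explicit ListVocabulary(std::vector<std::string> tokens)
        : tokens_(std::move(tokens)) {
        for (std::size_t i = 0; i < tokens_.size(); ++i) {
            // For duplicate strings, the first index is kept.
            ids_.emplace(tokens_[i], static_cast<TokenId>(i));
        }
    }

    std::size_t getSize() const override { return tokens_.size(); }

    // Empty optional means "not found", as the notes describe.
    std::optional<std::string> idToToken(TokenId id) const override {
        if (id >= tokens_.size()) return std::nullopt;
        return tokens_[id];
    }

    std::optional<TokenId> tokenToId(const std::string& token) const override {
        auto it = ids_.find(token);
        if (it == ids_.end()) return std::nullopt;
        return it->second;
    }

    // Assumed format: one UTF-8 token per line, in id order, so the
    // output is deterministic. Tokens containing '\n' are not supported.
    void save(const std::filesystem::path& path) const override {
        std::ofstream out(path, std::ios::binary);
        if (!out) throw std::runtime_error("cannot open " + path.string());
        for (const auto& token : tokens_) out << token << '\n';
        if (!out.flush()) throw std::runtime_error("write failed: " + path.string());
    }

private:
    std::vector<std::string> tokens_;
    std::unordered_map<std::string, TokenId> ids_;
};

}  // namespace sketch
```

The class holds no mutable state after construction, so concurrent lookups would be safe in this sketch; a real implementation with a load() member would need the synchronization mentioned in the last note.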

Constructor & Destructor Documentation

◆ ~TokenizerVocabulary()

virtual Mila::Data::TokenizerVocabulary::~TokenizerVocabulary ( )
virtual, default

Virtual destructor.

Ensures derived vocabulary implementations are properly destroyed when deleted via base pointers.
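
The following sketch shows why this matters when a vocabulary is owned through a base-class pointer. The `sketch` namespace, the `TokenId` alias, the stand-in interface, and the hypothetical `TrackedVocabulary` class are assumptions introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <memory>
#include <optional>
#include <string>

namespace sketch {

using TokenId = std::uint32_t;  // assumption: the real alias may differ

// Stand-in that mirrors the members documented on this page.
class TokenizerVocabulary {
public:
    virtual ~TokenizerVocabulary() = default;
    virtual std::size_t getSize() const = 0;
    virtual std::optional<std::string> idToToken(TokenId id) const = 0;
    virtual void save(const std::filesystem::path& path) const = 0;
    virtual std::optional<TokenId> tokenToId(const std::string& token) const = 0;
};

// Hypothetical empty vocabulary that records when its destructor runs.
class TrackedVocabulary final : public TokenizerVocabulary {
public:
    explicit TrackedVocabulary(bool& destroyed) : destroyed_(destroyed) {}
    ~TrackedVocabulary() override { destroyed_ = true; }

    std::size_t getSize() const override { return 0; }
    std::optional<std::string> idToToken(TokenId) const override { return std::nullopt; }
    void save(const std::filesystem::path&) const override {}
    std::optional<TokenId> tokenToId(const std::string&) const override { return std::nullopt; }

private:
    bool& destroyed_;
};

// Owns the object through a base-class pointer and reports whether the
// derived destructor ran when that pointer went out of scope.
inline bool derivedDestructorRuns() {
    bool destroyed = false;
    {
        std::unique_ptr<TokenizerVocabulary> vocab =
            std::make_unique<TrackedVocabulary>(destroyed);
    }  // unique_ptr deletes through TokenizerVocabulary* here
    return destroyed;
}

}  // namespace sketch
```

Without the virtual destructor in the base class, deleting through `TokenizerVocabulary*` would be undefined behavior and the derived destructor could be skipped.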

Member Function Documentation

◆ getSize()

virtual size_t Mila::Data::TokenizerVocabulary::getSize ( ) const
pure virtual

Get the number of tokens in the vocabulary.

Returns
size_t Number of entries (tokens) present in the vocabulary.

Implemented in Mila::Data::BpeVocabulary and Mila::Data::CharVocabulary.

◆ idToToken()

virtual std::optional< std::string > Mila::Data::TokenizerVocabulary::idToToken ( TokenId id) const
pure virtual

Map a numeric id back to its token string.

Returns an empty optional if the id is out of range or not defined.

Parameters
id    Token id to convert.
Returns
std::optional<std::string> The token string if present, otherwise empty.

Implemented in Mila::Data::BpeVocabulary and Mila::Data::CharVocabulary.
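
A caller-side sketch of decoding with idToToken() might look like the following. The `decode` helper, its `placeholder` parameter, the `FixedVocabulary` class, and the `TokenId` alias are all hypothetical names introduced for this example; only the idToToken() contract comes from this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
#include <vector>

namespace sketch {

using TokenId = std::uint32_t;  // assumption: the real alias may differ

// Stand-in that mirrors the members documented on this page.
class TokenizerVocabulary {
public:
    virtual ~TokenizerVocabulary() = default;
    virtual std::size_t getSize() const = 0;
    virtual std::optional<std::string> idToToken(TokenId id) const = 0;
    virtual void save(const std::filesystem::path& path) const = 0;
    virtual std::optional<TokenId> tokenToId(const std::string& token) const = 0;
};

// Hypothetical two-token vocabulary used only to exercise decode().
class FixedVocabulary final : public TokenizerVocabulary {
public:
    std::size_t getSize() const override { return tokens_.size(); }
    std::optional<std::string> idToToken(TokenId id) const override {
        if (id >= tokens_.size()) return std::nullopt;  // out of range
        return tokens_[id];
    }
    void save(const std::filesystem::path&) const override {}  // no-op here
    std::optional<TokenId> tokenToId(const std::string& token) const override {
        for (std::size_t i = 0; i < tokens_.size(); ++i) {
            if (tokens_[i] == token) return static_cast<TokenId>(i);
        }
        return std::nullopt;
    }

private:
    std::vector<std::string> tokens_{"he", "llo"};
};

// Concatenates the token strings for `ids`; any id without a mapping
// is rendered as `placeholder` instead of being dropped silently.
inline std::string decode(const TokenizerVocabulary& vocab,
                          const std::vector<TokenId>& ids,
                          const std::string& placeholder = "?") {
    std::string out;
    for (TokenId id : ids) {
        out += vocab.idToToken(id).value_or(placeholder);
    }
    return out;
}

}  // namespace sketch
```

Substituting a visible placeholder is one design choice; a caller that must not lose information could instead check has_value() and report the offending id.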

◆ save()

virtual void Mila::Data::TokenizerVocabulary::save ( const std::filesystem::path & path) const
pure virtual

Serialize the vocabulary to disk at the given path.

Implementations should produce a deterministic on-disk representation that can be consumed by the corresponding load() implementation.

Behavior on error:

  • Implementations may throw exceptions (e.g., std::runtime_error) on I/O or format errors. Callers should handle such exceptions as appropriate.
Parameters
path    Filesystem path to write the vocabulary to.
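
The error behavior described above can be handled on the caller side roughly as follows. This is a sketch under stated assumptions: `trySave`, `LineVocabulary`, the `TokenId` alias, and the one-token-per-line format are hypothetical, and the exact exception types thrown by the real implementations are not specified on this page beyond the std::runtime_error example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <filesystem>
#include <fstream>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

using TokenId = std::uint32_t;  // assumption: the real alias may differ

// Stand-in that mirrors the members documented on this page.
class TokenizerVocabulary {
public:
    virtual ~TokenizerVocabulary() = default;
    virtual std::size_t getSize() const = 0;
    virtual std::optional<std::string> idToToken(TokenId id) const = 0;
    virtual void save(const std::filesystem::path& path) const = 0;
    virtual std::optional<TokenId> tokenToId(const std::string& token) const = 0;
};

// Hypothetical implementation whose save() throws on I/O failure.
class LineVocabulary final : public TokenizerVocabulary {
public:
    explicit LineVocabulary(std::vector<std::string> tokens)
        : tokens_(std::move(tokens)) {}
    std::size_t getSize() const override { return tokens_.size(); }
    std::optional<std::string> idToToken(TokenId id) const override {
        if (id >= tokens_.size()) return std::nullopt;
        return tokens_[id];
    }
    std::optional<TokenId> tokenToId(const std::string& token) const override {
        auto it = std::find(tokens_.begin(), tokens_.end(), token);
        if (it == tokens_.end()) return std::nullopt;
        return static_cast<TokenId>(it - tokens_.begin());
    }
    void save(const std::filesystem::path& path) const override {
        std::ofstream out(path, std::ios::binary);
        if (!out) throw std::runtime_error("cannot open " + path.string());
        for (const auto& token : tokens_) out << token << '\n';  // id order
        if (!out.flush()) throw std::runtime_error("write failed: " + path.string());
    }

private:
    std::vector<std::string> tokens_;
};

// Calls save() and converts any std::exception into a false return,
// storing the message in `error` so the caller can log or retry.
inline bool trySave(const TokenizerVocabulary& vocab,
                    const std::filesystem::path& path,
                    std::string& error) {
    try {
        vocab.save(path);
        return true;
    } catch (const std::exception& e) {
        error = e.what();
        return false;
    }
}

}  // namespace sketch
```

Catching std::exception covers std::runtime_error and std::filesystem::filesystem_error alike; code that needs to distinguish them can add more specific handlers first.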

◆ tokenToId()

virtual std::optional< TokenId > Mila::Data::TokenizerVocabulary::tokenToId ( const std::string & token) const
pure virtual

Map a token string to its numeric id.

The lookup returns an empty optional when the token is not present in the vocabulary. Implementations may provide an explicit unknown-token id; callers should interpret an empty optional as "no mapping".

Parameters
token    UTF-8 encoded token string to look up.
Returns
std::optional<TokenId> The token id if present, otherwise empty.

Implemented in Mila::Data::BpeVocabulary and Mila::Data::CharVocabulary.
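
One way a caller might combine tokenToId() with an explicit unknown-token id is sketched below. The `encode` helper, the `unknownId` parameter, the `UnkVocabulary` class with its `<unk>` entry at id 0, and the `TokenId` alias are assumptions for illustration; this page does not say how the concrete vocabularies expose an unknown-token id.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
#include <vector>

namespace sketch {

using TokenId = std::uint32_t;  // assumption: the real alias may differ

// Stand-in that mirrors the members documented on this page.
class TokenizerVocabulary {
public:
    virtual ~TokenizerVocabulary() = default;
    virtual std::size_t getSize() const = 0;
    virtual std::optional<std::string> idToToken(TokenId id) const = 0;
    virtual void save(const std::filesystem::path& path) const = 0;
    virtual std::optional<TokenId> tokenToId(const std::string& token) const = 0;
};

// Hypothetical vocabulary that reserves id 0 for "<unk>".
class UnkVocabulary final : public TokenizerVocabulary {
public:
    std::size_t getSize() const override { return tokens_.size(); }
    std::optional<std::string> idToToken(TokenId id) const override {
        if (id >= tokens_.size()) return std::nullopt;
        return tokens_[id];
    }
    void save(const std::filesystem::path&) const override {}  // no-op here
    std::optional<TokenId> tokenToId(const std::string& token) const override {
        for (std::size_t i = 0; i < tokens_.size(); ++i) {
            if (tokens_[i] == token) return static_cast<TokenId>(i);
        }
        return std::nullopt;  // "no mapping"
    }

private:
    std::vector<std::string> tokens_{"<unk>", "he", "llo"};
};

// Maps each token to its id; a token with no mapping gets `unknownId`.
inline std::vector<TokenId> encode(const TokenizerVocabulary& vocab,
                                   const std::vector<std::string>& tokens,
                                   TokenId unknownId) {
    std::vector<TokenId> ids;
    ids.reserve(tokens.size());
    for (const auto& token : tokens) {
        ids.push_back(vocab.tokenToId(token).value_or(unknownId));
    }
    return ids;
}

}  // namespace sketch
```

Here the fallback is applied by the caller, which keeps the empty optional meaning "no mapping" exactly as the description states.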


The documentation for this class was generated from the following file: