Mila 0.13.48
Deep Neural Network Library
Generic tokenizer vocabulary interface.

Public Member Functions
| virtual | ~TokenizerVocabulary () = default |
| | Virtual destructor. |
| virtual size_t | getSize () const = 0 |
| | Get the number of tokens in the vocabulary. |
| virtual std::optional< std::string > | idToToken (TokenId id) const = 0 |
| | Map a numeric id back to its token string. |
| virtual void | save (const std::filesystem::path &path) const = 0 |
| | Serialize the vocabulary to disk at the given path. |
| virtual std::optional< TokenId > | tokenToId (const std::string &token) const = 0 |
| | Map a token string to its numeric id. |
Generic tokenizer vocabulary interface.
TokenizerVocabulary provides a small, implementation-agnostic contract for converting between token strings and numeric ids and for serializing/deserializing vocabulary state.
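The contract can be sketched in C++ from the member list above. This is a minimal illustration, not Mila's actual header: the `TokenId` alias is an assumption (a 32-bit integer stands in for the library's own definition), and `ListVocabulary` is a hypothetical implementation that exists only to show what a conforming derived class might look like.

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumption: stand-in for the TokenId alias that Mila defines elsewhere.
using TokenId = std::int32_t;

// Interface reconstructed from the member list on this page.
class TokenizerVocabulary {
public:
    virtual ~TokenizerVocabulary() = default;
    virtual std::size_t getSize() const = 0;
    virtual std::optional<std::string> idToToken(TokenId id) const = 0;
    virtual void save(const std::filesystem::path& path) const = 0;
    virtual std::optional<TokenId> tokenToId(const std::string& token) const = 0;
};

// Hypothetical implementation: a token's id is its index in the list.
class ListVocabulary : public TokenizerVocabulary {
public:
    explicit ListVocabulary(std::vector<std::string> tokens)
        : tokens_(std::move(tokens)) {
        for (std::size_t i = 0; i < tokens_.size(); ++i) {
            // emplace keeps the first id if a token appears twice.
            ids_.emplace(tokens_[i], static_cast<TokenId>(i));
        }
    }

    std::size_t getSize() const override { return tokens_.size(); }

    std::optional<std::string> idToToken(TokenId id) const override {
        if (id < 0 || static_cast<std::size_t>(id) >= tokens_.size()) {
            return std::nullopt;  // out of range
        }
        return tokens_[static_cast<std::size_t>(id)];
    }

    std::optional<TokenId> tokenToId(const std::string& token) const override {
        const auto it = ids_.find(token);
        if (it == ids_.end()) {
            return std::nullopt;  // no mapping
        }
        return it->second;
    }

    void save(const std::filesystem::path& path) const override {
        // One token per line, in id order (illustrative format only).
        std::ofstream out(path);
        for (const auto& token : tokens_) {
            out << token << '\n';
        }
    }

private:
    std::vector<std::string> tokens_;
    std::unordered_map<std::string, TokenId> ids_;
};
```

Both lookups return `std::optional`, so a missing mapping is reported in the return value rather than through an exception or a sentinel id.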
Notes for implementers:
| virtual ~TokenizerVocabulary () | virtual, default |
Virtual destructor.
Ensures derived vocabulary implementations are properly destroyed when deleted via base pointers.
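The effect of the virtual destructor can be illustrated with a reduced stand-in for the interface. `VocabularyBase` and `TrackedVocabulary` are hypothetical names used only for this sketch; they are not part of Mila.

```cpp
#include <cstddef>
#include <memory>

// Reduced stand-in for TokenizerVocabulary: only the members needed here.
class VocabularyBase {
public:
    virtual ~VocabularyBase() = default;
    virtual std::size_t getSize() const = 0;
};

// Hypothetical derived class that records when its destructor runs.
class TrackedVocabulary : public VocabularyBase {
public:
    explicit TrackedVocabulary(bool& destroyed) : destroyed_(destroyed) {}
    ~TrackedVocabulary() override { destroyed_ = true; }
    std::size_t getSize() const override { return 0; }

private:
    bool& destroyed_;
};

// Destroys a derived object through a base-class pointer and reports
// whether the derived destructor ran.
bool derivedDestructorRuns() {
    bool destroyed = false;
    {
        std::unique_ptr<VocabularyBase> vocab =
            std::make_unique<TrackedVocabulary>(destroyed);
    }  // vocab leaves scope here: the object is deleted via VocabularyBase*
    return destroyed;
}
```

Without `virtual` on the base destructor, deleting through `VocabularyBase*` would be undefined behavior and the derived destructor would not be guaranteed to run.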
| virtual size_t getSize () const | pure virtual |
Get the number of tokens in the vocabulary.
Implemented in Mila::Data::BpeVocabulary and Mila::Data::CharVocabulary.
| virtual std::optional< std::string > idToToken (TokenId id) const | pure virtual |
Map a numeric id back to its token string.
Returns an empty optional if the id is out of range or not defined.
| id | Token id to convert. |
Implemented in Mila::Data::BpeVocabulary and Mila::Data::CharVocabulary.
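A caller has to decide what to do when `idToToken` returns an empty optional. The sketch below shows one option, substituting a placeholder string. It assumes a 32-bit `TokenId`; `IdToTokenLookup`, `VectorLookup`, `decode`, and the `"<?>"` placeholder are all hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Assumption: stand-in for the TokenId alias that Mila defines elsewhere.
using TokenId = std::int32_t;

// Reduced stand-in for TokenizerVocabulary exposing only idToToken().
class IdToTokenLookup {
public:
    virtual ~IdToTokenLookup() = default;
    virtual std::optional<std::string> idToToken(TokenId id) const = 0;
};

// Hypothetical vocabulary backed by a vector; a token's id is its index.
class VectorLookup : public IdToTokenLookup {
public:
    explicit VectorLookup(std::vector<std::string> tokens)
        : tokens_(std::move(tokens)) {}

    std::optional<std::string> idToToken(TokenId id) const override {
        if (id < 0 || static_cast<std::size_t>(id) >= tokens_.size()) {
            return std::nullopt;  // out of range or not defined
        }
        return tokens_[static_cast<std::size_t>(id)];
    }

private:
    std::vector<std::string> tokens_;
};

// Concatenates the tokens for `ids`, writing `placeholder` for any id
// that has no token.
std::string decode(const IdToTokenLookup& vocab,
                   const std::vector<TokenId>& ids,
                   const std::string& placeholder = "<?>") {
    std::string text;
    for (const TokenId id : ids) {
        text += vocab.idToToken(id).value_or(placeholder);
    }
    return text;
}
```

Substituting a placeholder keeps decoding total. A caller that treats an undefined id as a bug could instead check `has_value()` and fail early.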
| virtual void save (const std::filesystem::path &path) const | pure virtual |
Serialize the vocabulary to disk at the given path.
Implementations should produce a deterministic on-disk representation that can be consumed by the corresponding load() implementation.
Behavior on error:
| path | Filesystem path to write the vocabulary to. |
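One way to get a deterministic on-disk representation is to write entries in id order, so the output does not depend on hash-map iteration order. The sketch below is an assumption-laden illustration, not the format used by BpeVocabulary or CharVocabulary. The free functions `saveVocabulary` and `loadVocabulary`, the tab-separated line format, and the 32-bit `TokenId` are all hypothetical. Because the error behavior is not listed on this page, throwing `std::runtime_error` is also just a choice made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumption: stand-in for the TokenId alias that Mila defines elsewhere.
using TokenId = std::int32_t;

// Writes one "id<TAB>token" line per entry, sorted by id.
// Assumes tokens do not contain a newline character.
void saveVocabulary(const std::unordered_map<std::string, TokenId>& tokenToId,
                    const std::filesystem::path& path) {
    std::vector<std::pair<TokenId, std::string>> entries;
    entries.reserve(tokenToId.size());
    for (const auto& [token, id] : tokenToId) {
        entries.emplace_back(id, token);
    }
    std::sort(entries.begin(), entries.end());  // fixed order: by id, then token

    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        throw std::runtime_error("cannot open " + path.string());
    }
    for (const auto& [id, token] : entries) {
        out << id << '\t' << token << '\n';
    }
    if (!out.flush()) {
        throw std::runtime_error("write failed: " + path.string());
    }
}

// Reads the format written by saveVocabulary().
std::unordered_map<std::string, TokenId> loadVocabulary(
    const std::filesystem::path& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path.string());
    }
    std::unordered_map<std::string, TokenId> tokenToId;
    std::string line;
    while (std::getline(in, line)) {
        const auto tab = line.find('\t');  // first tab separates id from token
        if (tab == std::string::npos) {
            throw std::runtime_error("malformed line in " + path.string());
        }
        const auto id = static_cast<TokenId>(std::stol(line.substr(0, tab)));
        tokenToId.emplace(line.substr(tab + 1), id);
    }
    return tokenToId;
}
```

Opening the stream in binary mode avoids platform-specific newline translation, which would otherwise make the bytes differ between operating systems.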
| virtual std::optional< TokenId > tokenToId (const std::string &token) const | pure virtual |
Map a token string to its numeric id.
The lookup returns an empty optional when the token is not present in the vocabulary. Implementations may provide an explicit unknown-token id; callers should interpret an empty optional as "no mapping".
| token | UTF-8 encoded token string to look up. |
Implemented in Mila::Data::BpeVocabulary and Mila::Data::CharVocabulary.
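The "no mapping" convention leaves the fallback policy to the caller. A minimal sketch of that policy follows, assuming a 32-bit `TokenId`. `TokenToIdLookup`, `MapLookup`, `encode`, and the `"<unk>"` token used in the usage note are hypothetical and are not taken from Mila.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumption: stand-in for the TokenId alias that Mila defines elsewhere.
using TokenId = std::int32_t;

// Reduced stand-in for TokenizerVocabulary exposing only tokenToId().
class TokenToIdLookup {
public:
    virtual ~TokenToIdLookup() = default;
    virtual std::optional<TokenId> tokenToId(const std::string& token) const = 0;
};

// Hypothetical vocabulary backed by a hash map.
class MapLookup : public TokenToIdLookup {
public:
    explicit MapLookup(std::unordered_map<std::string, TokenId> ids)
        : ids_(std::move(ids)) {}

    std::optional<TokenId> tokenToId(const std::string& token) const override {
        const auto it = ids_.find(token);
        if (it == ids_.end()) {
            return std::nullopt;  // token not present: "no mapping"
        }
        return it->second;
    }

private:
    std::unordered_map<std::string, TokenId> ids_;
};

// Maps each token to its id. A token without a mapping becomes `unknownId`
// when one is supplied; otherwise the whole call yields an empty optional.
std::optional<std::vector<TokenId>> encode(
    const TokenToIdLookup& vocab,
    const std::vector<std::string>& tokens,
    std::optional<TokenId> unknownId = std::nullopt) {
    std::vector<TokenId> ids;
    ids.reserve(tokens.size());
    for (const auto& token : tokens) {
        std::optional<TokenId> id = vocab.tokenToId(token);
        if (!id) {
            id = unknownId;  // caller-chosen fallback, possibly empty
        }
        if (!id) {
            return std::nullopt;  // no mapping and no fallback
        }
        ids.push_back(*id);
    }
    return ids;
}
```

A caller whose vocabulary contains an unknown-token entry could pass `vocab.tokenToId("<unk>")` as the third argument. If that entry is absent, the argument is itself empty and `encode` reports the failure instead of inventing an id.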