Mila 0.13.48
Deep Neural Network Library
Mila::Data::CharTokenizer Class Reference
export

Character-level tokenizer.


Public Member Functions

 CharTokenizer (CharVocabulary vocab)
 Construct a CharTokenizer with a vocabulary.
std::string decode (std::span< const TokenId > tokens) override
 Decode token ids back to text.
std::vector< TokenId > encode (const std::string &text) override
 Encode text into token ids (one id per input byte).
std::optional< TokenId > getBosTokenId () const override
 BOS id query - not supported for char-level tokenizer.
std::optional< TokenId > getEosTokenId () const override
 EOS id query - not supported for char-level tokenizer.
std::optional< TokenId > getPadTokenId () const override
 PAD id query - not supported for char-level tokenizer.
size_t getVocabSize () const override
 Number of tokens in the underlying vocabulary.
bool isValidToken (TokenId tokenId) const override
 Check if token id is valid in the vocabulary.
std::string tokenToString (TokenId tokenId) const override
 Convert a token id to a debug string.
Public Member Functions inherited from Mila::Data::Tokenizer
virtual ~Tokenizer ()=default

Static Public Member Functions

static CharTokenizer load (const std::filesystem::path &path)

Private Attributes

CharVocabulary vocab_

Detailed Description

Character-level tokenizer.

This tokenizer treats tokens as single bytes (single-character strings). It delegates all token <-> id mapping and persistence to a TokenizerVocabulary implementation.

Ownership:

  • The tokenizer holds a shared pointer to the vocabulary so the same vocabulary instance may be shared between tokenizers or other users.

Encoding/decoding semantics:

  • encode() produces one TokenId for each byte in the input string. If a byte's token is not found in the vocabulary, the encoder emits 0u as a fallback id.
  • decode() converts each TokenId back to the first byte of the token string returned by the vocabulary; ids missing from the vocabulary produce a '?' character.

Note: This implementation does not add or interpret BOS/EOS tokens; encodeWithSpecial() ignores the addBos/addEos flags because the generic TokenizerVocabulary interface does not expose special-token ids.

Constructor & Destructor Documentation

◆ CharTokenizer()

Mila::Data::CharTokenizer::CharTokenizer ( CharVocabulary vocab)
inline explicit

Construct a CharTokenizer with a vocabulary.

Parameters
vocab    Shared pointer to a TokenizerVocabulary implementation. Must remain valid for the lifetime of this tokenizer.

Member Function Documentation

◆ decode()

std::string Mila::Data::CharTokenizer::decode ( std::span< const TokenId > tokens)
inline override virtual

Decode token ids back to text.

Each token id is converted to its token string via the vocabulary, and the first byte of that token string is appended to the result. If an id is missing from the vocabulary, the character '?' is appended instead.

Parameters
tokens    Span of token ids to decode.
Returns
Decoded text string.

Implements Mila::Data::Tokenizer.

◆ encode()

std::vector< TokenId > Mila::Data::CharTokenizer::encode ( const std::string & text)
inline override virtual

Encode text into token ids (one id per input byte).

Parameters
text    UTF-8 encoded text to encode. Each input byte is treated as a separate token; callers should handle multi-byte characters if needed.
Returns
std::vector<TokenId> Vector of token ids; missing tokens map to 0u.

Implements Mila::Data::Tokenizer.
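
Because encoding is per byte, a multi-byte UTF-8 character yields several token ids. The following standalone sketch (not Mila code; `byteTokenCount` and `codePointCount` are hypothetical helper names) compares the number of ids a byte-level encoder would emit with the number of Unicode code points, assuming well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Number of ids a byte-level encoder emits: one per byte of the input.
std::size_t byteTokenCount(const std::string& utf8) {
    return utf8.size();
}

// Number of Unicode code points, counted by skipping UTF-8 continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is well-formed UTF-8.
std::size_t codePointCount(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0u) != 0x80u) {
            ++count;
        }
    }
    return count;
}
```

For the string "caf\xC3\xA9" (the word "café" with é encoded as the two bytes C3 A9), `byteTokenCount` gives 5 while `codePointCount` gives 4, so sequence lengths measured in tokens can exceed lengths measured in characters.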

◆ getBosTokenId()

std::optional< TokenId > Mila::Data::CharTokenizer::getBosTokenId ( ) const
inline override virtual

BOS id query - not supported for char-level tokenizer.

Returns an empty optional because the generic vocabulary interface does not expose special-token ids.

Implements Mila::Data::Tokenizer.
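
Since the BOS, EOS, and PAD queries return an empty optional for this tokenizer, calling code should check the optional before using it. A minimal caller-side sketch follows; `appendIfPresent` and the `TokenId` alias are hypothetical names for illustration, not part of Mila.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for Mila's TokenId; the real definition is not shown on this page.
using TokenId = std::uint32_t;

// Appends a special token only when the tokenizer reports one.
// Returns true if a token was appended, false if the optional was empty.
bool appendIfPresent(std::vector<TokenId>& ids, std::optional<TokenId> special) {
    if (!special.has_value()) {
        return false;
    }
    ids.push_back(*special);
    return true;
}
```

A caller would pass the result of a query such as getEosTokenId() as the second argument; with a char-level tokenizer the sequence is then left unchanged.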

◆ getEosTokenId()

std::optional< TokenId > Mila::Data::CharTokenizer::getEosTokenId ( ) const
inline override virtual

EOS id query - not supported for char-level tokenizer.

Implements Mila::Data::Tokenizer.

◆ getPadTokenId()

std::optional< TokenId > Mila::Data::CharTokenizer::getPadTokenId ( ) const
inline override virtual

PAD id query - not supported for char-level tokenizer.

Implements Mila::Data::Tokenizer.

◆ getVocabSize()

size_t Mila::Data::CharTokenizer::getVocabSize ( ) const
inline override virtual

Number of tokens in the underlying vocabulary.

Implements Mila::Data::Tokenizer.

◆ isValidToken()

bool Mila::Data::CharTokenizer::isValidToken ( TokenId tokenId) const
inline override virtual

Check if token id is valid in the vocabulary.

Implements Mila::Data::Tokenizer.

◆ load()

CharTokenizer Mila::Data::CharTokenizer::load ( const std::filesystem::path & path)
inline static

◆ tokenToString()

std::string Mila::Data::CharTokenizer::tokenToString ( TokenId tokenId) const
inline override virtual

Convert a token id to a debug string.

Returns the token string from the vocabulary or an empty string if not found.

Implements Mila::Data::Tokenizer.

Member Data Documentation

◆ vocab_

CharVocabulary Mila::Data::CharTokenizer::vocab_
private

The documentation for this class was generated from the following file: