Character-level tokenizer. More...

Inheritance diagram for Mila::Data::CharTokenizer:

Collaboration diagram for Mila::Data::CharTokenizer:

[legend]

Public Member Functions
	CharTokenizer (CharVocabulary vocab)
	Construct a CharTokenizer with a vocabulary.
std::string	decode (std::span< const TokenId > tokens) override
	Decode token ids back to text.
std::vector< TokenId >	encode (const std::string &text) override
	Encode text into token ids (one id per input byte).
std::optional< TokenId >	getBosTokenId () const override
	BOS id query - not supported for char-level tokenizer.
std::optional< TokenId >	getEosTokenId () const override
	EOS id query - not supported for char-level tokenizer.
std::optional< TokenId >	getPadTokenId () const override
	PAD id query - not supported for char-level tokenizer.
size_t	getVocabSize () const override
	Number of tokens in the underlying vocabulary.
bool	isValidToken (TokenId tokenId) const override
	Check if token id is valid in the vocabulary.
std::string	tokenToString (TokenId tokenId) const override
	Convert a token id to a debug string.
Public Member Functions inherited from Mila::Data::Tokenizer
virtual	~Tokenizer ()=default

Static Public Member Functions
static CharTokenizer	load (const std::filesystem::path &path)

Private Attributes
CharVocabulary	vocab_

Detailed Description

Character-level tokenizer.

This tokenizer treats tokens as single bytes (single-character strings). It delegates all token <-> id mapping and persistence to a TokenizerVocabulary implementation.

Ownership:

The tokenizer holds a shared pointer to the vocabulary so the same vocabulary instance may be shared between tokenizers or other users.

Encoding/decoding semantics:

encode() produces a TokenId for each byte in the input string. If a token is not found in the vocabulary the encoder emits 0u as a fallback id.
decode() converts each TokenId back to the first byte of the token string returned by the vocabulary; missing ids produce a '?' character.

Note: This implementation does not add or interpret BOS/EOS tokens; encodeWithSpecial() ignores the addBos/addEos flags because the generic TokenizerVocabulary interface does not expose special-token ids.

Constructor & Destructor Documentation

◆ CharTokenizer()

Mila::Data::CharTokenizer::CharTokenizer ( CharVocabulary vocab )

inlineexplicit

Construct a CharTokenizer with a vocabulary.

Parameters

vocab Shared pointer to a TokenizerVocabulary implementation. Must remain valid for the lifetime of this tokenizer.

Here is the caller graph for this function:

Member Function Documentation

◆ decode()

std::string Mila::Data::CharTokenizer::decode ( std::span< const TokenId > tokens )

inlineoverridevirtual

Decode token ids back to text.

Each token id is converted to its token string via the vocabulary, and the first byte of that token is appended to the result. If an id is missing the character '?' is appended.

Parameters

tokens Span of token ids to decode.

Returns: Decoded text string.

Implements Mila::Data::Tokenizer.

◆ encode()

std::vector< TokenId > Mila::Data::CharTokenizer::encode ( const std::string & text )

inlineoverridevirtual

Encode text into token ids (one id per input byte).

Parameters

text	UTF-8 encoded text to encode. Each input byte is treated as a separate token; callers should handle multi-byte characters if needed.

Returns: std::vector<TokenId> Vector of token ids; missing tokens map to 0u.

Implements Mila::Data::Tokenizer.

◆ getBosTokenId()

std::optional< TokenId > Mila::Data::CharTokenizer::getBosTokenId ( ) const

inlineoverridevirtual

BOS id query - not supported for char-level tokenizer.

Returns empty optional because the generic vocabulary does not expose special-token ids in the interface.

Implements Mila::Data::Tokenizer.

◆ getEosTokenId()

std::optional< TokenId > Mila::Data::CharTokenizer::getEosTokenId ( ) const

inlineoverridevirtual

EOS id query - not supported for char-level tokenizer.

Implements Mila::Data::Tokenizer.

◆ getPadTokenId()

std::optional< TokenId > Mila::Data::CharTokenizer::getPadTokenId ( ) const

inlineoverridevirtual

PAD id query - not supported for char-level tokenizer.

Implements Mila::Data::Tokenizer.

◆ getVocabSize()

size_t Mila::Data::CharTokenizer::getVocabSize ( ) const

inlineoverridevirtual

Number of tokens in the underlying vocabulary.

Implements Mila::Data::Tokenizer.

◆ isValidToken()

bool Mila::Data::CharTokenizer::isValidToken ( TokenId tokenId ) const

inlineoverridevirtual

Check if token id is valid in the vocabulary.

Implements Mila::Data::Tokenizer.

◆ load()

CharTokenizer Mila::Data::CharTokenizer::load ( const std::filesystem::path & path )

inlinestatic

Here is the call graph for this function:

◆ tokenToString()

std::string Mila::Data::CharTokenizer::tokenToString ( TokenId tokenId ) const

inlineoverridevirtual

Convert a token id to a debug string.

Returns the token string from the vocabulary or an empty string if not found.

Implements Mila::Data::Tokenizer.

Member Data Documentation

◆ vocab_

CharVocabulary Mila::Data::CharTokenizer::vocab_

private

The documentation for this class was generated from the following file:

/__w/Mila/Mila/Mila/Src/Data/Tokenizers/Char/CharTokenizer.ixx

Public Member Functions

Static Public Member Functions

Private Attributes

Detailed Description

Constructor & Destructor Documentation

◆ CharTokenizer()

Member Function Documentation

◆ decode()

◆ encode()

◆ getBosTokenId()

◆ getEosTokenId()

◆ getPadTokenId()

◆ getVocabSize()

◆ isValidToken()

◆ load()

◆ tokenToString()

Member Data Documentation

◆ vocab_