|
| | CharTokenizer (CharVocabulary vocab) |
| | Construct a CharTokenizer with a vocabulary.
|
| std::string | decode (std::span< const TokenId > tokens) override |
| | Decode token ids back to text.
|
| std::vector< TokenId > | encode (const std::string &text) override |
| | Encode text into token ids (one id per input byte).
|
| std::optional< TokenId > | getBosTokenId () const override |
| | BOS id query - not supported for char-level tokenizer.
|
| std::optional< TokenId > | getEosTokenId () const override |
| | EOS id query - not supported for char-level tokenizer.
|
| std::optional< TokenId > | getPadTokenId () const override |
| | PAD id query - not supported for char-level tokenizer.
|
| size_t | getVocabSize () const override |
| | Number of tokens in the underlying vocabulary.
|
| bool | isValidToken (TokenId tokenId) const override |
| | Check if token id is valid in the vocabulary.
|
| std::string | tokenToString (TokenId tokenId) const override |
| | Convert a token id to a debug string.
|
| virtual | ~Tokenizer ()=default |
Character-level tokenizer.
This tokenizer treats tokens as single bytes (single-character strings). It delegates all token <-> id mapping and persistence to a TokenizerVocabulary implementation.
Ownership:
- The tokenizer holds a shared pointer to the vocabulary so the same vocabulary instance may be shared between tokenizers or other users.
Encoding/decoding semantics:
- encode() produces one TokenId per byte of the input string. If a byte's single-character token is not found in the vocabulary, the encoder emits 0u as a fallback id.
- decode() converts each TokenId back to the first byte of the token string returned by the vocabulary; missing ids produce a '?' character.
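The ownership and encoding/decoding rules above can be illustrated with a minimal sketch. It is not the actual implementation: SketchVocabulary, SketchCharTokenizer, idOf(), tokenOf(), add() and sketchUsage() are hypothetical names introduced here, TokenId is assumed to be a 32-bit unsigned integer, and decode() takes a std::vector instead of std::span so the sketch also compiles as C++17.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using TokenId = std::uint32_t;  // assumption: the real TokenId type may differ

// Hypothetical stand-in for the vocabulary; not the library's interface.
struct SketchVocabulary {
    std::unordered_map<std::string, TokenId> toId;
    std::vector<std::string> toToken;

    // Registers a token and returns its id (existing id if already present).
    TokenId add(const std::string& token) {
        auto it = toId.find(token);
        if (it != toId.end()) return it->second;
        const TokenId id = static_cast<TokenId>(toToken.size());
        toId.emplace(token, id);
        toToken.push_back(token);
        return id;
    }
    std::optional<TokenId> idOf(const std::string& token) const {
        auto it = toId.find(token);
        if (it == toId.end()) return std::nullopt;
        return it->second;
    }
    std::optional<std::string> tokenOf(TokenId id) const {
        if (id >= toToken.size()) return std::nullopt;
        return toToken[id];
    }
};

class SketchCharTokenizer {
public:
    // Shared ownership: several tokenizers may hold the same vocabulary.
    explicit SketchCharTokenizer(std::shared_ptr<const SketchVocabulary> vocab)
        : vocab_(std::move(vocab)) {}

    // One id per input byte; bytes missing from the vocabulary map to 0u.
    std::vector<TokenId> encode(const std::string& text) const {
        std::vector<TokenId> ids;
        ids.reserve(text.size());
        for (char byte : text) {
            ids.push_back(vocab_->idOf(std::string(1, byte)).value_or(0u));
        }
        return ids;
    }

    // First byte of each token string; unknown ids decode to '?'.
    std::string decode(const std::vector<TokenId>& tokens) const {
        std::string text;
        text.reserve(tokens.size());
        for (TokenId id : tokens) {
            const auto token = vocab_->tokenOf(id);
            text.push_back(token && !token->empty() ? token->front() : '?');
        }
        return text;
    }

private:
    std::shared_ptr<const SketchVocabulary> vocab_;
};

// Usage: two tokenizers share one vocabulary instance.
inline void sketchUsage() {
    auto vocab = std::make_shared<SketchVocabulary>();
    vocab->add("a");  // id 0
    vocab->add("b");  // id 1
    SketchCharTokenizer first(vocab);
    SketchCharTokenizer second(vocab);
    const std::vector<TokenId> ids = first.encode("abz");
    assert((ids == std::vector<TokenId>{0u, 1u, 0u}));  // 'z' falls back to 0
    assert(second.decode(ids) == "aba");                // so 'z' comes back as 'a'
    assert(second.decode({1u, 99u}) == "b?");           // id 99 is unknown
}
```

In this sketch id 0 is an ordinary token, so the fallback makes the round trip lossy: an unknown byte decodes as whatever token holds id 0. A vocabulary that reserves id 0 for an unknown-token entry would avoid that ambiguity; whether the real vocabulary does so is not stated here.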
Note: This implementation does not add or interpret BOS/EOS tokens; encodeWithSpecial() ignores the addBos/addEos flags because the generic TokenizerVocabulary interface does not expose special-token ids.
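As a rough illustration of the note above, the sketch below shows how a caller might cope with a tokenizer that reports no special tokens. It rests on several assumptions: that "not supported" means the getters return std::nullopt, that encodeWithSpecial() takes the text plus two bool flags, and that TokenId is a 32-bit unsigned integer. NoSpecialTokenizerSketch and encodeFramed() are hypothetical names, and the byte-value-as-id encoding is a placeholder rather than the real vocabulary lookup.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using TokenId = std::uint32_t;  // assumption: the real TokenId type may differ

// Hypothetical stand-in for a tokenizer without special tokens.
struct NoSpecialTokenizerSketch {
    // Assumed behavior: unsupported special tokens are reported as nullopt.
    std::optional<TokenId> getBosTokenId() const { return std::nullopt; }
    std::optional<TokenId> getEosTokenId() const { return std::nullopt; }

    // Placeholder encoding: each byte's value is used directly as its id.
    std::vector<TokenId> encode(const std::string& text) const {
        std::vector<TokenId> ids;
        ids.reserve(text.size());
        for (char c : text) {
            ids.push_back(static_cast<TokenId>(static_cast<unsigned char>(c)));
        }
        return ids;
    }

    // Assumed signature: the flags are accepted but have no effect.
    std::vector<TokenId> encodeWithSpecial(const std::string& text,
                                           bool /*addBos*/,
                                           bool /*addEos*/) const {
        return encode(text);
    }
};

// Caller-side helper: frames the ids with BOS/EOS only if the tokenizer has them.
template <typename Tok>
std::vector<TokenId> encodeFramed(const Tok& tok, const std::string& text) {
    std::vector<TokenId> ids;
    if (const auto bos = tok.getBosTokenId()) ids.push_back(*bos);
    const std::vector<TokenId> body = tok.encode(text);
    ids.insert(ids.end(), body.begin(), body.end());
    if (const auto eos = tok.getEosTokenId()) ids.push_back(*eos);
    return ids;
}
```

Checking the optional before use keeps calling code generic: the same helper would add BOS/EOS ids for a tokenizer that provides them and leave the sequence unframed for a char-level one.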