|
Mila 0.13.48
Deep Neural Network Library
|
Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families. More...


Public Member Functions | |
| BpeTokenizer (BpeVocabulary vocab) | |
| std::string | decode (std::span< const TokenId > tokens) override |
| Decode a sequence of token IDs back to a UTF-8 string. | |
| std::vector< TokenId > | encode (const std::string &text) override |
| Encode text to a sequence of token IDs. | |
| std::optional< TokenId > | getBosTokenId () const override |
| std::optional< TokenId > | getEosTokenId () const override |
| std::optional< TokenId > | getPadTokenId () const override |
| const BpeVocabulary & | getVocab () const |
| size_t | getVocabSize () const override |
| bool | isValidToken (TokenId tokenId) const override |
| std::string | tokenToString (TokenId tokenId) const override |
| Public Member Functions inherited from Mila::Data::Tokenizer | |
| virtual | ~Tokenizer ()=default |
Static Public Member Functions | |
| static BpeTokenizer | load (const std::filesystem::path &path) |
| Load a tokenizer from a Mila binary vocabulary file. | |
| static std::shared_ptr< BpeTokenizer > | loadGpt2 (const std::filesystem::path &path) |
| Load a GPT-2 tokenizer from the binary produced by convert_gpt2_tokenizer.py. | |
| static std::shared_ptr< BpeTokenizer > | loadLlama32 (const std::filesystem::path &path) |
| Load a Llama 3.2 tokenizer from the binary produced by convert_llama_tokenizer.py. | |
| static std::shared_ptr< BpeTokenizer > | loadMistral (const std::filesystem::path &vocab_path, const std::filesystem::path &merges_path) |
| Load a Mistral tokenizer. | |
Private Member Functions | |
| void | decodeToken (const std::string &token, std::string &out) |
Reverse byte-encode a single token string and append to out. | |
| void | encodeSegment (const std::string &text, std::vector< TokenId > &out) |
| Encode a plain text segment (guaranteed to contain no special tokens). | |
| void | encodeSegmentBpe (const std::vector< std::string > &words, std::vector< TokenId > &out) |
| BPE merge encode for GPT-2 style vocabularies. | |
| void | encodeSegmentMaxMunch (const std::vector< std::string > &words, std::vector< TokenId > &out) |
| Max-munch encode for TikToken vocabularies (Llama 3.x). | |
| void | initializePreTokenization () |
| Build the pre-tokenization regex from the vocabulary config. | |
| std::vector< std::string > | preTokenize (const std::string &text) |
| Split text into pre-tokens using the configured regex. | |
Static Private Member Functions | |
| static size_t | utf8CharLength (unsigned char first_byte) |
Private Attributes | |
| std::optional< std::regex > | pre_tokenization_regex_ |
| BpeVocabulary | vocab_ |
Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families.
Construct from a pre-built vocabulary or via the convenience factory methods:
The special token pre-pass is enabled whenever the vocabulary registers at least one special token. For GPT-2, this means "<|endoftext|>" is intercepted before BPE runs; for Llama 3.x, the full set of named and extended tokens is intercepted.
|
inlineexplicit |


|
inlineoverridevirtual |
Decode a sequence of token IDs back to a UTF-8 string.
Each token string is byte-decoded using the GPT-2 style byte mapping. IDs with no vocabulary entry emit a '?' placeholder.
| tokens | Sequence of token IDs. |
Implements Mila::Data::Tokenizer.

|
inlineprivate |
Reverse byte-encode a single token string and append to out.
For byte-level vocabularies each UTF-8 character in the token string maps back to one raw byte via the GPT-2 byte decoder. Characters without a decoder entry emit '?'.
| token | Token string from the vocabulary. |
| out | Output UTF-8 string to append to. |


|
inlineoverridevirtual |
Encode text to a sequence of token IDs.
Performs the special token pre-pass first when the vocabulary has registered special tokens. Plain text segments between special tokens are processed through the standard pre-tokenization and BPE merge pipeline.
| text | Input text (UTF-8). |
Implements Mila::Data::Tokenizer.

|
inlineprivate |
Encode a plain text segment (guaranteed to contain no special tokens).
/**
Encode a plain text segment (guaranteed to contain no special tokens).
Dispatches to the BPE merge path when explicit merge rules are present, or to the max-munch path for TikToken-style vocabularies (Llama 3.x).
| text | Plain text segment. |
| out | Accumulator for output token IDs. |


|
inlineprivate |
BPE merge encode for GPT-2 style vocabularies.
Byte-encodes each pre-token then applies explicit merge rules greedily (lowest priority index first) until no more merges are possible.
| words | Pre-tokenized segments from the regex pass. |
| out | Accumulator for output token IDs. |


|
inlineprivate |
Max-munch encode for TikToken vocabularies (Llama 3.x).
Byte-encodes each pre-token then scans for the longest vocabulary match at each position. Falls back to ID 0 for unrecognised units.
| words | Pre-tokenized segments from the regex pass. |
| out | Accumulator for output token IDs. |


|
inlineoverridevirtual |
Implements Mila::Data::Tokenizer.
|
inlineoverridevirtual |
Implements Mila::Data::Tokenizer.
|
inlineoverridevirtual |
Implements Mila::Data::Tokenizer.
|
inline |
|
inlineoverridevirtual |
Implements Mila::Data::Tokenizer.
|
inlineprivate |
Build the pre-tokenization regex from the vocabulary config.
Attempts to compile the Unicode pattern first. If std::regex rejects it (MSVC ECMAScript mode does not support {L} / {N}), falls back to the ASCII-only approximation for the detected mode. Llama3Regex and Gpt2Regex each have a dedicated ASCII fallback; an unrecognised pattern that fails compilation is treated as a hard error.

|
inlineoverridevirtual |
Implements Mila::Data::Tokenizer.
|
inlinestatic |
Load a tokenizer from a Mila binary vocabulary file.
| path | Path to a vocabulary file written by BpeVocabulary::save(). |
| std::runtime_error | on I/O or format errors. |

|
inlinestatic |
Load a GPT-2 tokenizer from the binary produced by convert_gpt2_tokenizer.py.
| path | Path to the GPT-2 tokenizer binary. |
| std::runtime_error | on I/O or format errors. |

|
inlinestatic |
Load a Llama 3.2 tokenizer from the binary produced by convert_llama_tokenizer.py.
| path | Path to the Llama 3.2 tokenizer binary. |
| std::runtime_error | on I/O or format errors. |

|
inlinestatic |
Load a Mistral tokenizer.
| std::runtime_error | always. |

|
inlineprivate |
Split text into pre-tokens using the configured regex.
Returns the entire text as a single element when no regex is configured (e.g., vocabularies built with PreTokenizationMode::None).
| text | Input text segment. |

|
inlineoverridevirtual |
Implements Mila::Data::Tokenizer.
|
inlinestaticprivate |

|
private |
|
private |