Mila 0.13.48
Deep Neural Network Library
Mila::Data::BpeTokenizer Class Reference (export)

Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families.


Public Member Functions

 BpeTokenizer (BpeVocabulary vocab)
std::string decode (std::span< const TokenId > tokens) override
 Decode a sequence of token IDs back to a UTF-8 string.
std::vector< TokenId > encode (const std::string &text) override
 Encode text to a sequence of token IDs.
std::optional< TokenId > getBosTokenId () const override
std::optional< TokenId > getEosTokenId () const override
std::optional< TokenId > getPadTokenId () const override
const BpeVocabulary & getVocab () const
size_t getVocabSize () const override
bool isValidToken (TokenId tokenId) const override
std::string tokenToString (TokenId tokenId) const override
Public Member Functions inherited from Mila::Data::Tokenizer
virtual ~Tokenizer ()=default

Static Public Member Functions

static BpeTokenizer load (const std::filesystem::path &path)
 Load a tokenizer from a Mila binary vocabulary file.
static std::shared_ptr< BpeTokenizer > loadGpt2 (const std::filesystem::path &path)
 Load a GPT-2 tokenizer from the binary produced by convert_gpt2_tokenizer.py.
static std::shared_ptr< BpeTokenizer > loadLlama32 (const std::filesystem::path &path)
 Load a Llama 3.2 tokenizer from the binary produced by convert_llama_tokenizer.py.
static std::shared_ptr< BpeTokenizer > loadMistral (const std::filesystem::path &vocab_path, const std::filesystem::path &merges_path)
 Load a Mistral tokenizer.

Private Member Functions

void decodeToken (const std::string &token, std::string &out)
 Reverse byte-encode a single token string and append to out.
void encodeSegment (const std::string &text, std::vector< TokenId > &out)
 Encode a plain text segment (guaranteed to contain no special tokens).
void encodeSegmentBpe (const std::vector< std::string > &words, std::vector< TokenId > &out)
 BPE merge encode for GPT-2 style vocabularies.
void encodeSegmentMaxMunch (const std::vector< std::string > &words, std::vector< TokenId > &out)
 Max-munch encode for TikToken vocabularies (Llama 3.x).
void initializePreTokenization ()
 Build the pre-tokenization regex from the vocabulary config.
std::vector< std::string > preTokenize (const std::string &text)
 Split text into pre-tokens using the configured regex.

Static Private Member Functions

static size_t utf8CharLength (unsigned char first_byte)

Private Attributes

std::optional< std::regex > pre_tokenization_regex_
BpeVocabulary vocab_

Detailed Description

Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families.

Construct from a pre-built vocabulary or via the convenience factory methods:

// GPT-2
auto gpt2_tok = BpeTokenizer::loadGpt2( "gpt2_tokenizer.bin" );
auto gpt2_ids = gpt2_tok->encode( "Hello, world!" );

// Llama 3.2
auto llama_tok = BpeTokenizer::loadLlama32( "llama32_tokenizer.bin" );
auto llama_ids = llama_tok->encode( "<|begin_of_text|>Hello, world!" );

The special token pre-pass is enabled whenever the vocabulary registers at least one special token. For GPT-2, this means "<|endoftext|>" is intercepted before BPE runs; for Llama 3.x, the full set of named and extended tokens is intercepted.
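
The pre-pass described above can be pictured as a split of the input into special-token pieces and plain-text runs. The following is a minimal standalone sketch, not Mila's implementation: the `Segment` struct and `splitOnSpecialTokens` helper are hypothetical names, and the tie-breaking rule (longest token wins when two start at the same offset) is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// One piece of the input: a special token or a run of plain text.
struct Segment {
    std::string text;
    bool is_special;
};

// Hypothetical helper (not part of Mila): split text at every occurrence
// of a registered special token.
std::vector<Segment> splitOnSpecialTokens(
    std::string_view text, const std::vector<std::string>& specials )
{
    std::vector<Segment> out;
    std::size_t pos = 0;

    while ( pos < text.size() ) {
        // Earliest special token at or after pos; longest wins on a tie.
        std::size_t best_pos = std::string_view::npos;
        std::size_t best_len = 0;
        for ( const auto& s : specials ) {
            if ( s.empty() ) continue;
            const std::size_t p = text.find( std::string_view( s ), pos );
            if ( p == std::string_view::npos ) continue;
            if ( p < best_pos || ( p == best_pos && s.size() > best_len ) ) {
                best_pos = p;
                best_len = s.size();
            }
        }

        if ( best_pos == std::string_view::npos ) {
            // No more special tokens: the rest is plain text.
            out.push_back( { std::string( text.substr( pos ) ), false } );
            break;
        }
        if ( best_pos > pos ) {
            out.push_back( { std::string( text.substr( pos, best_pos - pos ) ), false } );
        }
        out.push_back( { std::string( text.substr( best_pos, best_len ) ), true } );
        pos = best_pos + best_len;
    }
    return out;
}
```

In this sketch, segments flagged `is_special` would map directly to their registered IDs, while the remaining segments would go through pre-tokenization and the merge or max-munch path.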

Constructor & Destructor Documentation

◆ BpeTokenizer()

Mila::Data::BpeTokenizer::BpeTokenizer ( BpeVocabulary vocab)
inline explicit

Member Function Documentation

◆ decode()

std::string Mila::Data::BpeTokenizer::decode ( std::span< const TokenId > tokens)
inline override virtual

Decode a sequence of token IDs back to a UTF-8 string.

Each token string is byte-decoded using the GPT-2 style byte mapping. IDs with no vocabulary entry emit a '?' placeholder.

Parameters
tokens    Sequence of token IDs.
Returns
Decoded UTF-8 string.

Implements Mila::Data::Tokenizer.


◆ decodeToken()

void Mila::Data::BpeTokenizer::decodeToken ( const std::string & token,
std::string & out )
inline private

Reverse byte-encode a single token string and append to out.

For byte-level vocabularies each UTF-8 character in the token string maps back to one raw byte via the GPT-2 byte decoder. Characters without a decoder entry emit '?'.

Parameters
token    Token string from the vocabulary.
out      Output UTF-8 string to append to.
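
For illustration, the reverse byte mapping can be sketched as follows. This is a standalone approximation based on the publicly known GPT-2 byte-to-unicode table (printable bytes map to themselves, all others to code points starting at U+0100); `buildByteDecoder` and `decodeTokenSketch` are hypothetical names, and Mila's actual decoder may be organised differently.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Build the reverse of the GPT-2 byte-to-unicode table: code point -> raw byte.
std::unordered_map<char32_t, unsigned char> buildByteDecoder()
{
    std::unordered_map<char32_t, unsigned char> decoder;
    char32_t next = 256;  // first code point used for non-printable bytes
    for ( int b = 0; b < 256; ++b ) {
        const bool printable = ( b >= 33 && b <= 126 ) ||
                               ( b >= 161 && b <= 172 ) ||
                               ( b >= 174 && b <= 255 );
        const char32_t cp = printable ? static_cast<char32_t>( b ) : next++;
        decoder[cp] = static_cast<unsigned char>( b );
    }
    return decoder;
}

// Map each UTF-8 character of a token back to one raw byte and append it.
// All mapped code points are below U+0800, so only 1- and 2-byte UTF-8
// sequences need decoding; anything else emits '?'.
void decodeTokenSketch( const std::string& token,
                        const std::unordered_map<char32_t, unsigned char>& decoder,
                        std::string& out )
{
    std::size_t i = 0;
    while ( i < token.size() ) {
        const unsigned char c = static_cast<unsigned char>( token[i] );
        char32_t cp = 0;
        if ( c < 0x80 ) {
            cp = c;
            i += 1;
        } else if ( ( c & 0xE0 ) == 0xC0 && i + 1 < token.size() ) {
            const unsigned char c2 = static_cast<unsigned char>( token[i + 1] );
            cp = static_cast<char32_t>( ( ( c & 0x1F ) << 6 ) | ( c2 & 0x3F ) );
            i += 2;
        } else {
            out.push_back( '?' );
            i += 1;
            continue;
        }
        const auto it = decoder.find( cp );
        out.push_back( it != decoder.end() ? static_cast<char>( it->second ) : '?' );
    }
}
```

Under this table the byte-encoded token "Ġworld" (where "Ġ" is U+0120) decodes to " world", because the space byte 0x20 is the 33rd non-printable byte and so maps to code point 256 + 32.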

◆ encode()

std::vector< TokenId > Mila::Data::BpeTokenizer::encode ( const std::string & text)
inline override virtual

Encode text to a sequence of token IDs.

Performs the special token pre-pass first when the vocabulary has registered special tokens. Plain text segments between special tokens are processed through the standard pre-tokenization and BPE merge pipeline.

Parameters
text    Input text (UTF-8).
Returns
Sequence of token IDs.

Implements Mila::Data::Tokenizer.


◆ encodeSegment()

void Mila::Data::BpeTokenizer::encodeSegment ( const std::string & text,
std::vector< TokenId > & out )
inline private

Encode a plain text segment (guaranteed to contain no special tokens).

Dispatches to the BPE merge path when explicit merge rules are present, or to the max-munch path for TikToken-style vocabularies (Llama 3.x).

Parameters
text    Plain text segment.
out     Accumulator for output token IDs.

◆ encodeSegmentBpe()

void Mila::Data::BpeTokenizer::encodeSegmentBpe ( const std::vector< std::string > & words,
std::vector< TokenId > & out )
inline private

BPE merge encode for GPT-2 style vocabularies.

Byte-encodes each pre-token then applies explicit merge rules greedily (lowest priority index first) until no more merges are possible.

Parameters
words    Pre-tokenized segments from the regex pass.
out      Accumulator for output token IDs.
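
The merge loop can be sketched in isolation as shown below. This is an illustrative version under stated assumptions, not the Mila code: `MergeRanks` and `applyMerges` are hypothetical names, the symbols are assumed to be already byte-encoded, and every occurrence of the best-ranked pair is merged in one sweep, as in the reference GPT-2 encoder.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical merge table: (left, right) -> priority index, lower merges first.
using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

// Repeatedly merge the adjacent pair with the lowest rank until none applies.
std::vector<std::string> applyMerges( std::vector<std::string> symbols,
                                      const MergeRanks& ranks )
{
    while ( symbols.size() > 1 ) {
        // Find the best-ranked adjacent pair in the current sequence.
        int best_rank = std::numeric_limits<int>::max();
        std::pair<std::string, std::string> best_pair;
        for ( std::size_t i = 0; i + 1 < symbols.size(); ++i ) {
            const auto it = ranks.find( std::make_pair( symbols[i], symbols[i + 1] ) );
            if ( it != ranks.end() && it->second < best_rank ) {
                best_rank = it->second;
                best_pair = it->first;
            }
        }
        if ( best_rank == std::numeric_limits<int>::max() ) break;  // nothing left to merge

        // Merge every occurrence of that pair, scanning left to right.
        std::vector<std::string> merged;
        for ( std::size_t i = 0; i < symbols.size(); ) {
            if ( i + 1 < symbols.size() &&
                 symbols[i] == best_pair.first && symbols[i + 1] == best_pair.second ) {
                merged.push_back( symbols[i] + symbols[i + 1] );
                i += 2;
            } else {
                merged.push_back( symbols[i] );
                i += 1;
            }
        }
        symbols = std::move( merged );
    }
    return symbols;
}
```

With ranks (l, o) = 0, (lo, w) = 1 and (e, r) = 2, the symbols l, o, w, e, r reduce to "low" and "er"; each surviving symbol would then be looked up in the vocabulary to obtain its token ID.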

◆ encodeSegmentMaxMunch()

void Mila::Data::BpeTokenizer::encodeSegmentMaxMunch ( const std::vector< std::string > & words,
std::vector< TokenId > & out )
inline private

Max-munch encode for TikToken vocabularies (Llama 3.x).

Byte-encodes each pre-token then scans for the longest vocabulary match at each position. Falls back to ID 0 for unrecognised units.

Parameters
words    Pre-tokenized segments from the regex pass.
out      Accumulator for output token IDs.
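
A longest-match scan of this kind might look like the sketch below. It is a simplified standalone illustration: the `TokenId` alias, the `maxMunchSketch` name, and the `max_len` parameter are assumptions, and the sketch advances one raw byte on a miss, whereas the real implementation may step by byte-encoded character.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using TokenId = std::uint32_t;  // assumption: the real alias is defined by Mila

// At each position take the longest vocabulary entry that matches;
// emit ID 0 and advance one byte when nothing matches.
void maxMunchSketch( const std::string& word,
                     const std::unordered_map<std::string, TokenId>& vocab,
                     std::size_t max_len,  // longest entry length to try
                     std::vector<TokenId>& out )
{
    std::size_t pos = 0;
    while ( pos < word.size() ) {
        bool matched = false;
        // Try candidate lengths from longest to shortest.
        for ( std::size_t len = std::min( max_len, word.size() - pos ); len > 0; --len ) {
            const auto it = vocab.find( word.substr( pos, len ) );
            if ( it != vocab.end() ) {
                out.push_back( it->second );
                pos += len;
                matched = true;
                break;
            }
        }
        if ( !matched ) {
            out.push_back( 0 );  // fallback ID for an unrecognised unit
            pos += 1;
        }
    }
}
```

Note that a fallback of 0 is indistinguishable from a genuine token with ID 0, so callers that need to detect unknown input would have to check coverage separately.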

◆ getBosTokenId()

std::optional< TokenId > Mila::Data::BpeTokenizer::getBosTokenId ( ) const
inline override virtual

Implements Mila::Data::Tokenizer.

◆ getEosTokenId()

std::optional< TokenId > Mila::Data::BpeTokenizer::getEosTokenId ( ) const
inline override virtual

Implements Mila::Data::Tokenizer.

◆ getPadTokenId()

std::optional< TokenId > Mila::Data::BpeTokenizer::getPadTokenId ( ) const
inline override virtual

Implements Mila::Data::Tokenizer.

◆ getVocab()

const BpeVocabulary & Mila::Data::BpeTokenizer::getVocab ( ) const
inline

◆ getVocabSize()

size_t Mila::Data::BpeTokenizer::getVocabSize ( ) const
inline override virtual

Implements Mila::Data::Tokenizer.

◆ initializePreTokenization()

void Mila::Data::BpeTokenizer::initializePreTokenization ( )
inline private

Build the pre-tokenization regex from the vocabulary config.

Attempts to compile the Unicode pattern first. If std::regex rejects it (MSVC's ECMAScript mode does not support \p{L} / \p{N}), falls back to the ASCII-only approximation for the detected mode. Llama3Regex and Gpt2Regex each have a dedicated ASCII fallback; an unrecognised pattern that fails compilation is treated as a hard error.
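
The try-then-fall-back step can be sketched with the standard library alone. `compileWithFallback` is a hypothetical helper, and the patterns a caller passes in are placeholders rather than the ones Mila ships; the only behaviour relied on is that the std::regex constructor throws std::regex_error for a pattern it cannot compile.

```cpp
#include <regex>
#include <string>

// Try the preferred pattern; if the regex engine rejects it, compile the
// ASCII-only approximation instead. A failure of the fallback propagates
// to the caller as std::regex_error.
std::regex compileWithFallback( const std::string& unicode_pattern,
                                const std::string& ascii_fallback )
{
    try {
        return std::regex( unicode_pattern, std::regex::ECMAScript );
    } catch ( const std::regex_error& ) {
        return std::regex( ascii_fallback, std::regex::ECMAScript );
    }
}
```

A caller without a known fallback for its pattern would skip the catch branch and let the exception surface, which matches the hard-error case described above.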


◆ isValidToken()

bool Mila::Data::BpeTokenizer::isValidToken ( TokenId tokenId) const
inline override virtual

Implements Mila::Data::Tokenizer.

◆ load()

BpeTokenizer Mila::Data::BpeTokenizer::load ( const std::filesystem::path & path)
inline static

Load a tokenizer from a Mila binary vocabulary file.

Parameters
path    Path to a vocabulary file written by BpeVocabulary::save().
Returns
Loaded BpeTokenizer instance.
Exceptions
std::runtime_error    on I/O or format errors.

◆ loadGpt2()

std::shared_ptr< BpeTokenizer > Mila::Data::BpeTokenizer::loadGpt2 ( const std::filesystem::path & path)
inline static

Load a GPT-2 tokenizer from the binary produced by convert_gpt2_tokenizer.py.

Parameters
path    Path to the GPT-2 tokenizer binary.
Returns
Shared tokenizer instance.
Exceptions
std::runtime_error    on I/O or format errors.

◆ loadLlama32()

std::shared_ptr< BpeTokenizer > Mila::Data::BpeTokenizer::loadLlama32 ( const std::filesystem::path & path)
inline static

Load a Llama 3.2 tokenizer from the binary produced by convert_llama_tokenizer.py.

Parameters
path    Path to the Llama 3.2 tokenizer binary.
Returns
Shared tokenizer instance.
Exceptions
std::runtime_error    on I/O or format errors.

◆ loadMistral()

std::shared_ptr< BpeTokenizer > Mila::Data::BpeTokenizer::loadMistral ( const std::filesystem::path & vocab_path,
const std::filesystem::path & merges_path )
inline static

Load a Mistral tokenizer.

Note
Not yet implemented. As a workaround, load a Mila binary written by BpeVocabulary::save() with load().
Exceptions
std::runtime_error    always.

◆ preTokenize()

std::vector< std::string > Mila::Data::BpeTokenizer::preTokenize ( const std::string & text)
inline private

Split text into pre-tokens using the configured regex.

Returns the entire text as a single element when no regex is configured (e.g., vocabularies built with PreTokenizationMode::None).

Parameters
text    Input text segment.
Returns
Vector of pre-token strings.
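
As a rough illustration of this behaviour, the split can be written with std::sregex_iterator. `preTokenizeSketch` is a hypothetical free function that takes the optional regex as a parameter instead of reading a member, and the pattern in the usage note is a simplified stand-in, not one of Mila's configured patterns.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Return every regex match as one pre-token, or the whole text as a
// single element when no regex is configured.
std::vector<std::string> preTokenizeSketch( const std::string& text,
                                            const std::optional<std::regex>& re )
{
    if ( !re ) {
        return { text };
    }
    std::vector<std::string> out;
    const std::sregex_iterator end;
    for ( std::sregex_iterator it( text.begin(), text.end(), *re ); it != end; ++it ) {
        out.push_back( it->str() );
    }
    return out;
}
```

With the simplified pattern ` ?[A-Za-z]+| ?[0-9]+| ?[^ A-Za-z0-9]+`, the input "Hello, world!" splits into "Hello", ",", " world" and "!". Text that no alternative matches is silently dropped by this approach, so a production pattern needs a catch-all branch.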

◆ tokenToString()

std::string Mila::Data::BpeTokenizer::tokenToString ( TokenId tokenId) const
inline override virtual

Implements Mila::Data::Tokenizer.

◆ utf8CharLength()

size_t Mila::Data::BpeTokenizer::utf8CharLength ( unsigned char first_byte)
inline static private
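
This member is undocumented. Judging by its name and signature, it presumably returns the byte length of a UTF-8 sequence given its lead byte. A minimal sketch of that standard computation follows; `utf8CharLengthSketch` is a hypothetical name, and returning 1 for continuation or invalid bytes is an assumption about the error policy.

```cpp
#include <cstddef>

// Length in bytes of the UTF-8 sequence that starts with first_byte.
std::size_t utf8CharLengthSketch( unsigned char first_byte )
{
    if ( first_byte < 0x80 ) return 1;              // 0xxxxxxx: ASCII
    if ( ( first_byte & 0xE0 ) == 0xC0 ) return 2;  // 110xxxxx
    if ( ( first_byte & 0xF0 ) == 0xE0 ) return 3;  // 1110xxxx
    if ( ( first_byte & 0xF8 ) == 0xF0 ) return 4;  // 11110xxx
    return 1;  // continuation or invalid lead byte: consume one byte (assumption)
}
```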

Member Data Documentation

◆ pre_tokenization_regex_

std::optional<std::regex> Mila::Data::BpeTokenizer::pre_tokenization_regex_
private

◆ vocab_

BpeVocabulary Mila::Data::BpeTokenizer::vocab_
private

The documentation for this class was generated from the following file: