Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families. More...

Inheritance diagram for Mila::Data::BpeTokenizer:

Collaboration diagram for Mila::Data::BpeTokenizer:

[legend]

Public Member Functions
	BpeTokenizer (BpeVocabulary vocab)
std::string	decode (std::span< const TokenId > tokens) override
	Decode a sequence of token IDs back to a UTF-8 string.
std::vector< TokenId >	encode (const std::string &text) override
	Encode text to a sequence of token IDs.
std::optional< TokenId >	getBosTokenId () const override
std::optional< TokenId >	getEosTokenId () const override
std::optional< TokenId >	getPadTokenId () const override
const BpeVocabulary &	getVocab () const
size_t	getVocabSize () const override
bool	isValidToken (TokenId tokenId) const override
std::string	tokenToString (TokenId tokenId) const override
Public Member Functions inherited from Mila::Data::Tokenizer
virtual	~Tokenizer ()=default

Static Public Member Functions
static BpeTokenizer	load (const std::filesystem::path &path)
	Load a tokenizer from a Mila binary vocabulary file.
static std::shared_ptr< BpeTokenizer >	loadGpt2 (const std::filesystem::path &path)
	Load a GPT-2 tokenizer from the binary produced by convert_gpt2_tokenizer.py.
static std::shared_ptr< BpeTokenizer >	loadLlama32 (const std::filesystem::path &path)
	Load a Llama 3.2 tokenizer from the binary produced by convert_llama_tokenizer.py.
static std::shared_ptr< BpeTokenizer >	loadMistral (const std::filesystem::path &vocab_path, const std::filesystem::path &merges_path)
	Load a Mistral tokenizer.

Private Member Functions
void	decodeToken (const std::string &token, std::string &out)
	Reverse byte-encode a single token string and append to `out`.
void	encodeSegment (const std::string &text, std::vector< TokenId > &out)
	Encode a plain text segment (guaranteed to contain no special tokens).
void	encodeSegmentBpe (const std::vector< std::string > &words, std::vector< TokenId > &out)
	BPE merge encode for GPT-2 style vocabularies.
void	encodeSegmentMaxMunch (const std::vector< std::string > &words, std::vector< TokenId > &out)
	Max-munch encode for TikToken vocabularies (Llama 3.x).
void	initializePreTokenization ()
	Build the pre-tokenization regex from the vocabulary config.
std::vector< std::string >	preTokenize (const std::string &text)
	Split text into pre-tokens using the configured regex.

Static Private Member Functions
static size_t	utf8CharLength (unsigned char first_byte)

Private Attributes
std::optional< std::regex >	pre_tokenization_regex_
BpeVocabulary	vocab_

Detailed Description

Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families.

Construct from a pre-built vocabulary or via the convenience factory methods:

// GPT-2
auto tok = BpeTokenizer::loadGpt2( "gpt2_tokenizer.bin" );
auto ids = tok->encode( "Hello, world!" );
 
// Llama 3.2
auto tok = BpeTokenizer::loadLlama32( "llama32_tokenizer.bin" );
auto ids = tok->encode( "<|begin_of_text|>Hello, world!" );

The special token pre-pass is enabled whenever the vocabulary registers at least one special token. For GPT-2, this means "<|endoftext|>" is intercepted before BPE runs; for Llama 3.x, the full set of named and extended tokens is intercepted.

Constructor & Destructor Documentation

◆ BpeTokenizer()

Mila::Data::BpeTokenizer::BpeTokenizer ( BpeVocabulary vocab )

inlineexplicit

Here is the call graph for this function:

Here is the caller graph for this function:

Member Function Documentation

◆ decode()

std::string Mila::Data::BpeTokenizer::decode ( std::span< const TokenId > tokens )

inlineoverridevirtual

Decode a sequence of token IDs back to a UTF-8 string.

Each token string is byte-decoded using the GPT-2 style byte mapping. IDs with no vocabulary entry emit a '?' placeholder.

Parameters

tokens Sequence of token IDs.

Returns: Decoded UTF-8 string.

Implements Mila::Data::Tokenizer.

Here is the call graph for this function:

◆ decodeToken()

void Mila::Data::BpeTokenizer::decodeToken	(	const std::string &	token,
		std::string &	out )

inlineprivate

Reverse byte-encode a single token string and append to out.

For byte-level vocabularies each UTF-8 character in the token string maps back to one raw byte via the GPT-2 byte decoder. Characters without a decoder entry emit '?'.

Parameters

token	Token string from the vocabulary.
out	Output UTF-8 string to append to.

Here is the call graph for this function:

Here is the caller graph for this function:

◆ encode()

std::vector< TokenId > Mila::Data::BpeTokenizer::encode ( const std::string & text )

inlineoverridevirtual

Encode text to a sequence of token IDs.

Performs the special token pre-pass first when the vocabulary has registered special tokens. Plain text segments between special tokens are processed through the standard pre-tokenization and BPE merge pipeline.

Parameters

text	Input text (UTF-8).

Returns: Sequence of token IDs.

Implements Mila::Data::Tokenizer.

Here is the call graph for this function:

◆ encodeSegment()

void Mila::Data::BpeTokenizer::encodeSegment	(	const std::string &	text,
		std::vector< TokenId > &	out )

inlineprivate

Encode a plain text segment (guaranteed to contain no special tokens).

/**

Encode a plain text segment (guaranteed to contain no special tokens).

Dispatches to the BPE merge path when explicit merge rules are present, or to the max-munch path for TikToken-style vocabularies (Llama 3.x).

Parameters

text	Plain text segment.
out	Accumulator for output token IDs.

Here is the call graph for this function:

Here is the caller graph for this function:

◆ encodeSegmentBpe()

void Mila::Data::BpeTokenizer::encodeSegmentBpe	(	const std::vector< std::string > &	words,
		std::vector< TokenId > &	out )

inlineprivate

BPE merge encode for GPT-2 style vocabularies.

Byte-encodes each pre-token then applies explicit merge rules greedily (lowest priority index first) until no more merges are possible.

Parameters

words	Pre-tokenized segments from the regex pass.
out	Accumulator for output token IDs.

Here is the call graph for this function:

Here is the caller graph for this function:

◆ encodeSegmentMaxMunch()

void Mila::Data::BpeTokenizer::encodeSegmentMaxMunch	(	const std::vector< std::string > &	words,
		std::vector< TokenId > &	out )

inlineprivate

Max-munch encode for TikToken vocabularies (Llama 3.x).

Byte-encodes each pre-token then scans for the longest vocabulary match at each position. Falls back to ID 0 for unrecognised units.

Parameters

words	Pre-tokenized segments from the regex pass.
out	Accumulator for output token IDs.

Here is the call graph for this function:

Here is the caller graph for this function:

◆ getBosTokenId()

std::optional< TokenId > Mila::Data::BpeTokenizer::getBosTokenId ( ) const

inlineoverridevirtual

Implements Mila::Data::Tokenizer.

◆ getEosTokenId()

std::optional< TokenId > Mila::Data::BpeTokenizer::getEosTokenId ( ) const

inlineoverridevirtual

Implements Mila::Data::Tokenizer.

◆ getPadTokenId()

std::optional< TokenId > Mila::Data::BpeTokenizer::getPadTokenId ( ) const

inlineoverridevirtual

Implements Mila::Data::Tokenizer.

◆ getVocab()

const BpeVocabulary & Mila::Data::BpeTokenizer::getVocab ( ) const

inline

◆ getVocabSize()

size_t Mila::Data::BpeTokenizer::getVocabSize ( ) const

inlineoverridevirtual

Implements Mila::Data::Tokenizer.

◆ initializePreTokenization()

void Mila::Data::BpeTokenizer::initializePreTokenization ( )

inlineprivate

Build the pre-tokenization regex from the vocabulary config.

Attempts to compile the Unicode pattern first. If std::regex rejects it (MSVC ECMAScript mode does not support {L} / {N}), falls back to the ASCII-only approximation for the detected mode. Llama3Regex and Gpt2Regex each have a dedicated ASCII fallback; an unrecognised pattern that fails compilation is treated as a hard error.

Here is the caller graph for this function:

◆ isValidToken()

bool Mila::Data::BpeTokenizer::isValidToken ( TokenId tokenId ) const

inlineoverridevirtual

Implements Mila::Data::Tokenizer.

◆ load()

BpeTokenizer Mila::Data::BpeTokenizer::load ( const std::filesystem::path & path )

inlinestatic

Load a tokenizer from a Mila binary vocabulary file.

Parameters

path	Path to a vocabulary file written by BpeVocabulary::save().

Returns: Loaded BpeTokenizer instance.

Exceptions

std::runtime_error on I/O or format errors.

Here is the call graph for this function:

◆ loadGpt2()

std::shared_ptr< BpeTokenizer > Mila::Data::BpeTokenizer::loadGpt2 ( const std::filesystem::path & path )

inlinestatic

Load a GPT-2 tokenizer from the binary produced by convert_gpt2_tokenizer.py.

Parameters

path	Path to the GPT-2 tokenizer binary.

Returns: Shared tokenizer instance.

Exceptions

std::runtime_error on I/O or format errors.

Here is the call graph for this function:

◆ loadLlama32()

std::shared_ptr< BpeTokenizer > Mila::Data::BpeTokenizer::loadLlama32 ( const std::filesystem::path & path )

inlinestatic

Load a Llama 3.2 tokenizer from the binary produced by convert_llama_tokenizer.py.

Parameters

path	Path to the Llama 3.2 tokenizer binary.

Returns: Shared tokenizer instance.

Exceptions

std::runtime_error on I/O or format errors.

Here is the call graph for this function:

◆ loadMistral()

std::shared_ptr< BpeTokenizer > Mila::Data::BpeTokenizer::loadMistral	(	const std::filesystem::path &	vocab_path,
		const std::filesystem::path &	merges_path )

inlinestatic

Load a Mistral tokenizer.

Note: Not yet implemented. Provide a Mila binary produced by save() as a workaround.

Exceptions

std::runtime_error always.

Here is the call graph for this function:

◆ preTokenize()

std::vector< std::string > Mila::Data::BpeTokenizer::preTokenize ( const std::string & text )

inlineprivate

Split text into pre-tokens using the configured regex.

Returns the entire text as a single element when no regex is configured (e.g., vocabularies built with PreTokenizationMode::None).

Parameters

text	Input text segment.

Returns: Vector of pre-token strings.

Here is the caller graph for this function:

◆ tokenToString()

std::string Mila::Data::BpeTokenizer::tokenToString ( TokenId tokenId ) const

inlineoverridevirtual

Implements Mila::Data::Tokenizer.

◆ utf8CharLength()

size_t Mila::Data::BpeTokenizer::utf8CharLength ( unsigned char first_byte )

inlinestaticprivate

Here is the caller graph for this function:

Member Data Documentation

◆ pre_tokenization_regex_

std::optional<std::regex> Mila::Data::BpeTokenizer::pre_tokenization_regex_

private

◆ vocab_

BpeVocabulary Mila::Data::BpeTokenizer::vocab_

private

The documentation for this class was generated from the following file:

/__w/Mila/Mila/Mila/Src/Data/Tokenizers/Bpe/BpeTokenizer.ixx

Public Member Functions

Static Public Member Functions

Private Member Functions

Static Private Member Functions

Private Attributes

Detailed Description

Constructor & Destructor Documentation

◆ BpeTokenizer()

Member Function Documentation

◆ decode()

◆ decodeToken()

◆ encode()

◆ encodeSegment()

◆ encodeSegmentBpe()

◆ encodeSegmentMaxMunch()

◆ getBosTokenId()

◆ getEosTokenId()

◆ getPadTokenId()

◆ getVocab()

◆ getVocabSize()

◆ initializePreTokenization()

◆ isValidToken()

◆ load()

◆ loadGpt2()

◆ loadLlama32()

◆ loadMistral()

◆ preTokenize()

◆ tokenToString()

◆ utf8CharLength()

Member Data Documentation

◆ pre_tokenization_regex_

◆ vocab_