Unified BPE tokenizer for GPT-2, Llama 3.x, and Mistral model families. More...

#include <string>
#include <string_view>
#include <vector>
#include <span>
#include <memory>
#include <optional>
#include <filesystem>
#include <chrono>
#include <iostream>
#include <regex>
#include <limits>
#include <stdexcept>
import Data.TokenizerVocabulary;
import Data.Tokenizer;
import Data.BpePreTokenizationMode;
import Data.BpeVocabulary;

Classes
class	Mila::Data::BpeTokenizer
	Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families. More...

Namespaces
namespace	Mila
	Mila main API namespace.
namespace	Mila::Data

Typedefs
using	Mila::Data::TokenId

Detailed Description

Unified BPE tokenizer for GPT-2, Llama 3.x, and Mistral model families.

Encode pipeline:

Special token pre-pass: split input on registered special token strings (longest-first scan) and emit their IDs directly, bypassing BPE entirely. GPT-2 vocabularies with no registered special tokens skip this via fast path.
Pre-tokenize each plain text segment with the configured regex pattern.
Byte-encode each pre-token using the GPT-2 style byte encoder. 4a. BPE path (GPT-2 / trained vocabularies): apply explicit merge rules greedily, lowest priority index first. 4b. Max-munch path (Llama 3.x / TikToken): find the longest vocabulary match at each position in the encoded unit sequence. Used when no merge rules are present; the merge order is implicit in the token ID assignment.
Map final tokens to IDs; fall back to 0 on a miss.

Decode pipeline: Concatenate token strings and reverse the byte encoding back to UTF-8.

Classes

Namespaces

Typedefs

Detailed Description