Mila 0.13.48
Deep Neural Network Library
BpeTokenizer.ixx File Reference

Unified BPE tokenizer for GPT-2, Llama 3.x, and Mistral model families.

#include <string>
#include <string_view>
#include <vector>
#include <span>
#include <memory>
#include <optional>
#include <filesystem>
#include <chrono>
#include <iostream>
#include <regex>
#include <limits>
#include <stdexcept>
import Data.TokenizerVocabulary;
import Data.Tokenizer;
import Data.BpePreTokenizationMode;
import Data.BpeVocabulary;

Classes

class  Mila::Data::BpeTokenizer
 Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families.

Namespaces

namespace  Mila
 Mila main API namespace.
namespace  Mila::Data

Typedefs

using Mila::Data::TokenId

Detailed Description

Unified BPE tokenizer for GPT-2, Llama 3.x, and Mistral model families.

Encode pipeline:

  1. Special token pre-pass: split the input on registered special token strings (longest-first scan) and emit their IDs directly, bypassing BPE entirely. GPT-2 vocabularies with no registered special tokens skip this step via a fast path.
  2. Pre-tokenize each plain text segment with the configured regex pattern.
  3. Byte-encode each pre-token using the GPT-2 style byte encoder.
  4a. BPE path (GPT-2 / trained vocabularies): apply explicit merge rules greedily, lowest priority index first.
  4b. Max-munch path (Llama 3.x / TikToken): find the longest vocabulary match at each position in the encoded unit sequence. Used when no merge rules are present; the merge order is implicit in the token ID assignment.
  5. Map final tokens to IDs; fall back to 0 on a miss.
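
The two tokenization paths and the final ID lookup can be sketched as follows. This is a minimal illustration, not the Mila implementation: `MergeRanks`, `Vocab`, `bpeMerge`, `maxMunch`, and `toIds` are hypothetical names, and each unit is assumed to be one byte-encoded character held in its own `std::string`. For brevity the merge loop merges one occurrence of the best pair per iteration, whereas production tokenizers usually merge every occurrence in a single pass.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not part of the Mila API.
using MergeRanks = std::map<std::pair<std::string, std::string>, std::size_t>;
using Vocab = std::unordered_map<std::string, int>;

// BPE path: repeatedly merge the adjacent pair with the lowest priority index.
std::vector<std::string> bpeMerge(std::vector<std::string> units, const MergeRanks& ranks) {
    constexpr std::size_t kNone = std::numeric_limits<std::size_t>::max();
    while (units.size() > 1) {
        std::size_t bestRank = kNone;
        std::size_t bestPos = 0;
        for (std::size_t i = 0; i + 1 < units.size(); ++i) {
            auto it = ranks.find(std::make_pair(units[i], units[i + 1]));
            if (it != ranks.end() && it->second < bestRank) {
                bestRank = it->second;
                bestPos = i;
            }
        }
        if (bestRank == kNone) break;  // no applicable merge rule remains
        units[bestPos] += units[bestPos + 1];
        units.erase(units.begin() + static_cast<std::ptrdiff_t>(bestPos) + 1);
    }
    return units;
}

// Max-munch path: at each position, take the longest run of units found in the vocabulary.
std::vector<int> maxMunch(const std::vector<std::string>& units, const Vocab& vocab) {
    std::vector<int> ids;
    std::size_t pos = 0;
    while (pos < units.size()) {
        std::string candidate;
        int bestId = 0;           // falls back to ID 0 when nothing matches
        std::size_t bestLen = 1;  // always advance by at least one unit
        for (std::size_t end = pos; end < units.size(); ++end) {
            candidate += units[end];
            auto it = vocab.find(candidate);
            if (it != vocab.end()) {
                bestId = it->second;
                bestLen = end - pos + 1;
            }
        }
        ids.push_back(bestId);
        pos += bestLen;
    }
    return ids;
}

// Final mapping for the BPE path: look up each merged token, using ID 0 on a miss.
std::vector<int> toIds(const std::vector<std::string>& tokens, const Vocab& vocab) {
    std::vector<int> ids;
    ids.reserve(tokens.size());
    for (const auto& token : tokens) {
        auto it = vocab.find(token);
        ids.push_back(it == vocab.end() ? 0 : it->second);
    }
    return ids;
}
```

Under these assumptions, the units `l`, `o`, `w` with merge rules `(l, o)` at index 0 and `(lo, w)` at index 1 collapse to the single token `low`. The max-munch sketch rescans from each position, which is quadratic in the segment length; a trie over the vocabulary would be the usual way to avoid that.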

Decode pipeline: Concatenate the token strings, then reverse the byte encoding to recover the original UTF-8 text.
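
The byte encoding and its reversal can be sketched as below, assuming the mapping follows the published GPT-2 scheme: bytes 33-126, 161-172, and 174-255 map to the code point of the same value, and the remaining 68 bytes map to code points 256 and up in ascending byte order. The function names `buildByteToCodePoint`, `byteEncode`, and `decodeTokens` are hypothetical and not taken from the Mila API; silently skipping code points outside the table is also an assumption of this sketch.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Byte -> code point table in the style of GPT-2's bytes_to_unicode.
std::array<char32_t, 256> buildByteToCodePoint() {
    std::array<char32_t, 256> table{};
    char32_t next = 256;
    for (std::size_t b = 0; b < 256; ++b) {
        const bool printable = (b >= 33 && b <= 126) || (b >= 161 && b <= 172) || (b >= 174);
        table[b] = printable ? static_cast<char32_t>(b) : next++;
    }
    return table;
}

// Encode raw bytes as UTF-8 text made of the mapped code points (all below U+0800).
std::string byteEncode(std::string_view raw) {
    static const auto table = buildByteToCodePoint();
    std::string out;
    for (unsigned char b : raw) {
        const char32_t cp = table[b];
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Decode: concatenate token strings, then map each code point back to its byte.
std::string decodeTokens(const std::vector<std::string>& tokenStrings) {
    static const auto table = buildByteToCodePoint();
    std::unordered_map<char32_t, unsigned char> reverse;
    for (std::size_t b = 0; b < 256; ++b) {
        reverse[table[b]] = static_cast<unsigned char>(b);
    }

    std::string joined;
    for (const auto& token : tokenStrings) joined += token;

    std::string out;
    for (std::size_t i = 0; i < joined.size();) {
        const unsigned char lead = static_cast<unsigned char>(joined[i]);
        char32_t cp = lead;
        std::size_t len = 1;
        // Only one- and two-byte UTF-8 sequences occur in the mapped range.
        if ((lead & 0xE0) == 0xC0 && i + 1 < joined.size()) {
            const unsigned char cont = static_cast<unsigned char>(joined[i + 1]);
            cp = static_cast<char32_t>(((lead & 0x1F) << 6) | (cont & 0x3F));
            len = 2;
        }
        auto it = reverse.find(cp);
        if (it != reverse.end()) out.push_back(static_cast<char>(it->second));  // unknown code points are skipped
        i += len;
    }
    return out;
}
```

With this table a space (byte 32) becomes U+0120, which is why GPT-2 style token strings show a leading `Ġ` for word-initial tokens. Decoding operates on the concatenated string rather than token by token because a multi-byte UTF-8 character in the original text may be split across adjacent tokens.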