Mila 0.13.48
Deep Neural Network Library
Mila::Data::SpecialTokens Struct Reference
export

Configuration for special tokens across all tokenizer types.

Public Member Functions

constexpr size_t count () const
 Count enabled named special tokens.
size_t countAll () const
 Count all special tokens including extended set.
std::vector< SpecialToken > getEnabledTokens () const
 Get all enabled named tokens in priority order.
constexpr size_t getIdOffset () const
 Get the ID offset for regular tokens.
std::string_view getString (SpecialToken token) const
 Get string representation of a named special token.
constexpr bool isEnabled (SpecialToken token) const
 Check if a named token type is enabled.
bool isSpecialToken (std::string_view str) const
 Check if a string matches any enabled special token (named or extended).

Static Public Member Functions

static SpecialTokens forClassification ()
 Configuration for sequence classification (BERT-style).
static SpecialTokens forMLM ()
 Configuration for masked language modeling.
static SpecialTokens gptStyle ()
 GPT-2 style configuration.
static SpecialTokens llamaStyle ()
 Llama 3.x configuration.
static SpecialTokens minimal ()
 Minimal configuration (PAD, UNK only).
static SpecialTokens none ()
 Configuration with no special tokens.
static SpecialTokens standard ()
 Standard configuration (PAD, UNK, BOS, EOS).

Public Attributes

std::string bos_token = "<BOS>"
std::string cls_token = "<CLS>"
std::string eos_token = "<EOS>"
std::unordered_map< std::string, int32_t > extended_special_tokens
 Extended special tokens beyond the seven named slots.
std::string mask_token = "<MASK>"
std::string pad_token = "<PAD>"
std::string sep_token = "<SEP>"
std::string unk_token = "<UNK>"
bool use_bos = true
bool use_cls = false
bool use_eos = true
bool use_mask = false
bool use_pad = true
bool use_sep = false
bool use_unk = true

Detailed Description

Configuration for special tokens across all tokenizer types.

Used by CharTokenizer, BpeTokenizer, Gpt4BpeTokenizer, WordPieceTokenizer, etc. Token strings are customizable to support different model conventions.

The seven named slots (PAD, UNK, BOS, EOS, MASK, SEP, CLS) cover the common case for all known model families. For models with additional special tokens beyond these seven (e.g. Llama 3.2's 256 reserved tokens), use the extended_special_tokens map.
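
As a minimal sketch of how the named slots and the extended map fit together, the following uses a hypothetical stand-in struct, `SpecialTokensSketch`, that mirrors only the public attributes and defaults listed on this page. It is not the Mila header, and `makeCustomConfig` is an illustrative helper. Pairing "<|reserved_special_token_0|>" with ID 128002 is an assumption based on the ID range quoted for the extended map.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in mirroring the documented public attributes.
// Only the field names and default values come from this page.
struct SpecialTokensSketch {
    std::string pad_token = "<PAD>";
    std::string unk_token = "<UNK>";
    std::string bos_token = "<BOS>";
    std::string eos_token = "<EOS>";
    std::string mask_token = "<MASK>";
    std::string sep_token = "<SEP>";
    std::string cls_token = "<CLS>";
    bool use_pad = true;
    bool use_unk = true;
    bool use_bos = true;
    bool use_eos = true;
    bool use_mask = false;
    bool use_sep = false;
    bool use_cls = false;
    std::unordered_map<std::string, std::int32_t> extended_special_tokens;
};

// Illustrative helper: customize two slot strings, enable CLS,
// and register one token outside the seven named slots.
inline SpecialTokensSketch makeCustomConfig() {
    SpecialTokensSketch cfg;
    cfg.pad_token = "[PAD]";  // model-specific spelling of the PAD slot
    cfg.cls_token = "[CLS]";
    cfg.use_cls = true;       // named slots are switched on per flag
    // Assumed ID for the first reserved token (see lead-in).
    cfg.extended_special_tokens.emplace("<|reserved_special_token_0|>", 128002);
    return cfg;
}
```

The point of the sketch is the split in responsibilities: the seven named slots carry a string and an enable flag each, while anything beyond them is stored as an explicit string-to-ID entry.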

Member Function Documentation

◆ count()

size_t Mila::Data::SpecialTokens::count ( ) const
inline constexpr

Count enabled named special tokens.

Does not include extended_special_tokens.

◆ countAll()

size_t Mila::Data::SpecialTokens::countAll ( ) const
inline

Count all special tokens including extended set.

◆ forClassification()

SpecialTokens Mila::Data::SpecialTokens::forClassification ( )
inline static

Configuration for sequence classification (BERT-style).

◆ forMLM()

SpecialTokens Mila::Data::SpecialTokens::forMLM ( )
inline static

Configuration for masked language modeling.

◆ getEnabledTokens()

std::vector< SpecialToken > Mila::Data::SpecialTokens::getEnabledTokens ( ) const
inline

Get all enabled named tokens in priority order.

◆ getIdOffset()

size_t Mila::Data::SpecialTokens::getIdOffset ( ) const
inline constexpr

Get the ID offset for regular tokens.

Named special tokens occupy IDs 0 to (count()-1), so regular tokens start at this offset. Extended tokens have explicit IDs and do not contribute to this offset.
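
The ID layout described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `NamedFlagsSketch`, `countEnabled`, `idOffset`, and `regularTokenId` are hypothetical names, and the sketch reads the description as meaning that the offset equals the number of enabled named tokens.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the seven use_* flags, with the documented defaults.
struct NamedFlagsSketch {
    bool use_pad = true;
    bool use_unk = true;
    bool use_bos = true;
    bool use_eos = true;
    bool use_mask = false;
    bool use_sep = false;
    bool use_cls = false;
};

// Number of enabled named slots (extended tokens are not counted).
constexpr std::size_t countEnabled(const NamedFlagsSketch& f) {
    std::size_t n = 0;
    if (f.use_pad) ++n;
    if (f.use_unk) ++n;
    if (f.use_bos) ++n;
    if (f.use_eos) ++n;
    if (f.use_mask) ++n;
    if (f.use_sep) ++n;
    if (f.use_cls) ++n;
    return n;
}

// Assumption: named tokens take IDs 0..count-1, so the offset is the count.
constexpr std::size_t idOffset(const NamedFlagsSketch& f) {
    return countEnabled(f);
}

// ID of the regular token at position vocab_index in the learned vocabulary.
constexpr std::size_t regularTokenId(const NamedFlagsSketch& f,
                                     std::size_t vocab_index) {
    return idOffset(f) + vocab_index;
}
```

With the default flags, four named tokens are enabled, so under this reading the first regular token would receive ID 4. Extended tokens carry their own IDs and leave the offset unchanged.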

◆ getString()

std::string_view Mila::Data::SpecialTokens::getString ( SpecialToken token) const
inline

Get string representation of a named special token.

◆ gptStyle()

SpecialTokens Mila::Data::SpecialTokens::gptStyle ( )
inline static

GPT-2 style configuration.

Uses <|endoftext|> for PAD, UNK, BOS, and EOS, because GPT-2 uses a single token string for all of these roles.

◆ isEnabled()

bool Mila::Data::SpecialTokens::isEnabled ( SpecialToken token) const
inline constexpr

Check if a named token type is enabled.

◆ isSpecialToken()

bool Mila::Data::SpecialTokens::isSpecialToken ( std::string_view str) const
inline

Check if a string matches any enabled special token (named or extended).
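
One way to picture this check is a two-stage lookup: first the enabled named slots, then the extended map. The sketch below assumes that ordering and uses hypothetical names (`TokenConfigSketch`, `isSpecialTokenSketch`); the actual member function may be implemented differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified configuration: each named slot is an
// (enabled, string) pair, and extended tokens map string -> ID.
struct TokenConfigSketch {
    std::vector<std::pair<bool, std::string>> named;
    std::unordered_map<std::string, std::int32_t> extended;
};

inline bool isSpecialTokenSketch(const TokenConfigSketch& cfg,
                                 std::string_view str) {
    // Stage 1: a named slot matches only if its flag is enabled.
    for (const auto& slot : cfg.named) {
        if (slot.first && slot.second == str) {
            return true;
        }
    }
    // Stage 2: extended tokens have no flag; presence in the map is enough.
    // A std::string key is built because this map has no heterogeneous lookup.
    return cfg.extended.find(std::string(str)) != cfg.extended.end();
}
```

Note that a disabled named slot does not match even when its string is passed in, which is the behaviour the word "enabled" in the description suggests.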

◆ llamaStyle()

SpecialTokens Mila::Data::SpecialTokens::llamaStyle ( )
inline static

Llama 3.x configuration.

Registers BOS, EOS, and all five instruct/tool-calling control tokens. These token IDs are fixed across the Llama 3.x family and exist in every Llama 3.x vocabulary regardless of whether the model is a base or instruct variant. Registering them ensures the encoder pre-pass matches them as single atomic tokens rather than subword fragments.

Token                  ID      Role
<|begin_of_text|>      128000  BOS
<|end_of_text|>        128001  EOS
<|start_header_id|>    128006  Opens a role header
<|end_header_id|>      128007  Closes a role header
<|eom_id|>             128008  Tool call boundary / stop
<|eot_id|>             128009  End of turn (primary stop)
<|python_tag|>         128010  Tool call open marker
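
For reference, the table above can be expressed as a plain lookup structure. This is only a sketch: `llama3TableSketch` and `isStopId` are hypothetical helpers, the page does not say how llamaStyle() stores these entries internally, and treating EOS as a stop ID alongside the two tokens whose Role column says "stop" is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// The seven token strings and IDs from the table, as a string -> ID map.
inline std::unordered_map<std::string, std::int32_t> llama3TableSketch() {
    return {
        {"<|begin_of_text|>", 128000},    // BOS
        {"<|end_of_text|>", 128001},      // EOS
        {"<|start_header_id|>", 128006},  // opens a role header
        {"<|end_header_id|>", 128007},    // closes a role header
        {"<|eom_id|>", 128008},           // tool call boundary / stop
        {"<|eot_id|>", 128009},           // end of turn (primary stop)
        {"<|python_tag|>", 128010},       // tool call open marker
    };
}

// Assumption: generation stops on EOS, <|eom_id|>, or <|eot_id|>.
inline bool isStopId(std::int32_t id) {
    return id == 128001 || id == 128008 || id == 128009;
}
```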

◆ minimal()

SpecialTokens Mila::Data::SpecialTokens::minimal ( )
inline static

Minimal configuration (PAD, UNK only).

◆ none()

SpecialTokens Mila::Data::SpecialTokens::none ( )
inline static

Configuration with no special tokens.

◆ standard()

SpecialTokens Mila::Data::SpecialTokens::standard ( )
inline static

Standard configuration (PAD, UNK, BOS, EOS).

Member Data Documentation

◆ bos_token

std::string Mila::Data::SpecialTokens::bos_token = "<BOS>"

◆ cls_token

std::string Mila::Data::SpecialTokens::cls_token = "<CLS>"

◆ eos_token

std::string Mila::Data::SpecialTokens::eos_token = "<EOS>"

◆ extended_special_tokens

std::unordered_map<std::string, int32_t> Mila::Data::SpecialTokens::extended_special_tokens

Extended special tokens beyond the seven named slots.

Used for model families with large special token sets, such as Llama 3.2's reserved tokens (IDs 128002-128255). These are matched during the encode pre-pass before BPE merges are applied.

Key: token string (e.g. "<|reserved_special_token_0|>")
Value: token ID
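
To illustrate what an encode pre-pass over this map might look like, the sketch below splits input text into special-token segments and ordinary text runs, so that only the ordinary runs would go on to BPE. It is a generic longest-match scan written for this page, not the tokenizer's actual code; `Segment` and `splitOnSpecialTokens` are hypothetical names, and the linear scan over the map at every position is chosen for clarity rather than speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// One piece of pre-pass output: a special token (id >= 0) or a run of
// ordinary text that still needs subword encoding (id == -1).
struct Segment {
    std::string text;
    std::int32_t id;
};

inline std::vector<Segment> splitOnSpecialTokens(
    std::string_view input,
    const std::unordered_map<std::string, std::int32_t>& specials) {
    std::vector<Segment> out;
    std::string pending;  // ordinary text collected since the last special token
    std::size_t pos = 0;
    while (pos < input.size()) {
        // Find the longest special token that starts exactly at pos.
        const std::string* best = nullptr;
        std::int32_t best_id = -1;
        for (const auto& entry : specials) {
            const std::string& tok = entry.first;
            if (!tok.empty() &&
                input.compare(pos, tok.size(), tok) == 0 &&
                (best == nullptr || tok.size() > best->size())) {
                best = &tok;
                best_id = entry.second;
            }
        }
        if (best != nullptr) {
            if (!pending.empty()) {  // flush the text run before the token
                out.push_back({pending, -1});
                pending.clear();
            }
            out.push_back({*best, best_id});
            pos += best->size();
        } else {
            pending.push_back(input[pos]);
            ++pos;
        }
    }
    if (!pending.empty()) {
        out.push_back({pending, -1});
    }
    return out;
}
```

Matching special tokens before any merges are applied is what keeps a string such as "<|eot_id|>" as one atomic ID instead of a sequence of subword fragments.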

◆ mask_token

std::string Mila::Data::SpecialTokens::mask_token = "<MASK>"

◆ pad_token

std::string Mila::Data::SpecialTokens::pad_token = "<PAD>"

◆ sep_token

std::string Mila::Data::SpecialTokens::sep_token = "<SEP>"

◆ unk_token

std::string Mila::Data::SpecialTokens::unk_token = "<UNK>"

◆ use_bos

bool Mila::Data::SpecialTokens::use_bos = true

◆ use_cls

bool Mila::Data::SpecialTokens::use_cls = false

◆ use_eos

bool Mila::Data::SpecialTokens::use_eos = true

◆ use_mask

bool Mila::Data::SpecialTokens::use_mask = false

◆ use_pad

bool Mila::Data::SpecialTokens::use_pad = true

◆ use_sep

bool Mila::Data::SpecialTokens::use_sep = false

◆ use_unk

bool Mila::Data::SpecialTokens::use_unk = true
