Configuration for special tokens across all tokenizer types. More...

Public Member Functions
constexpr size_t	count () const
	Count enabled named special tokens.
size_t	countAll () const
	Count all special tokens including extended set.
std::vector< SpecialToken >	getEnabledTokens () const
	Get all enabled named tokens in priority order.
constexpr size_t	getIdOffset () const
	Get the ID offset for regular tokens.
std::string_view	getString (SpecialToken token) const
	Get string representation of a named special token.
constexpr bool	isEnabled (SpecialToken token) const
	Check if a named token type is enabled.
bool	isSpecialToken (std::string_view str) const
	Check if a string matches any enabled special token (named or extended).

Static Public Member Functions
static SpecialTokens	forClassification ()
	Configuration for sequence classification (BERT-style).
static SpecialTokens	forMLM ()
	Configuration for masked language modeling.
static SpecialTokens	gptStyle ()
	GPT-2 style configuration.
static SpecialTokens	llamaStyle ()
	Llama 3.x configuration.
static SpecialTokens	minimal ()
	Minimal configuration (PAD, UNK only).
static SpecialTokens	none ()
	Configuration with no special tokens.
static SpecialTokens	standard ()
	Standard configuration (PAD, UNK, BOS, EOS).

Public Attributes
std::string	bos_token = "<BOS>"
std::string	cls_token = "<CLS>"
std::string	eos_token = "<EOS>"
std::unordered_map< std::string, int32_t >	extended_special_tokens
	Extended special tokens beyond the seven named slots.
std::string	mask_token = "<MASK>"
std::string	pad_token = "<PAD>"
std::string	sep_token = "<SEP>"
std::string	unk_token = "<UNK>"
bool	use_bos = true
bool	use_cls = false
bool	use_eos = true
bool	use_mask = false
bool	use_pad = true
bool	use_sep = false
bool	use_unk = true

Detailed Description

Configuration for special tokens across all tokenizer types.

Used by CharTokenizer, BpeTokenizer, Gpt4BpeTokenizer, WordPieceTokenizer, etc. Token strings are customizable to support different model conventions.

The seven named slots (PAD, UNK, BOS, EOS, MASK, SEP, CLS) cover the common case for all known model families. For models with additional special tokens beyond these seven (e.g. Llama 3.2's 256 reserved tokens), use the extended_special_tokens map.

Member Function Documentation

◆ count()

size_t Mila::Data::SpecialTokens::count ( ) const

inlineconstexpr

Count enabled named special tokens.

Does not include extended_special_tokens.

Here is the caller graph for this function:

◆ countAll()

size_t Mila::Data::SpecialTokens::countAll ( ) const

inline

Count all special tokens including extended set.

Here is the call graph for this function:

◆ forClassification()

SpecialTokens Mila::Data::SpecialTokens::forClassification ( )

inlinestatic

Configuration for sequence classification (BERT-style).

◆ forMLM()

SpecialTokens Mila::Data::SpecialTokens::forMLM ( )

inlinestatic

Configuration for masked language modeling.

◆ getEnabledTokens()

std::vector< SpecialToken > Mila::Data::SpecialTokens::getEnabledTokens ( ) const

inline

Get all enabled named tokens in priority order.

Here is the call graph for this function:

◆ getIdOffset()

size_t Mila::Data::SpecialTokens::getIdOffset ( ) const

inlineconstexpr

Get the ID offset for regular tokens.

Named special tokens occupy IDs 0 to (count()-1), so regular tokens start at this offset. Extended tokens have explicit IDs and do not contribute to this offset.

Here is the call graph for this function:

◆ getString()

std::string_view Mila::Data::SpecialTokens::getString ( SpecialToken token ) const

inline

Get string representation of a named special token.

◆ gptStyle()

SpecialTokens Mila::Data::SpecialTokens::gptStyle ( )

inlinestatic

GPT-2 style configuration.

Uses <|endoftext|> for PAD, UNK, BOS, and EOS — GPT-2 uses one token string for all roles.

Here is the caller graph for this function:

◆ isEnabled()

bool Mila::Data::SpecialTokens::isEnabled ( SpecialToken token ) const

inlineconstexpr

Check if a named token type is enabled.

Here is the caller graph for this function:

◆ isSpecialToken()

bool Mila::Data::SpecialTokens::isSpecialToken ( std::string_view str ) const

inline

Check if a string matches any enabled special token (named or extended).

◆ llamaStyle()

SpecialTokens Mila::Data::SpecialTokens::llamaStyle ( )

inlinestatic

Llama 3.x configuration.

Registers BOS, EOS, and all five instruct/tool-calling control tokens. These token IDs are fixed across the Llama 3.x family and exist in every Llama 3.x vocabulary regardless of whether the model is a base or instruct variant. Registering them ensures the encoder pre-pass matches them as single atomic tokens rather than subword fragments.

Token	ID	Role
<\|begin_of_text\|>	128000	BOS
<\|end_of_text\|>	128001	EOS
<\|start_header_id\|>	128006	Opens a role header
<\|end_header_id\|>	128007	Closes a role header
<\|eom_id\|>	128008	Tool call boundary / stop
<\|eot_id\|>	128009	End of turn — primary stop
<\|python_tag\|>	128010	Tool call open marker

Here is the caller graph for this function:

◆ minimal()

SpecialTokens Mila::Data::SpecialTokens::minimal ( )

inlinestatic

Minimal configuration (PAD, UNK only).

◆ none()

SpecialTokens Mila::Data::SpecialTokens::none ( )

inlinestatic

Configuration with no special tokens.

◆ standard()

SpecialTokens Mila::Data::SpecialTokens::standard ( )

inlinestatic

Standard configuration (PAD, UNK, BOS, EOS).

Member Data Documentation

◆ bos_token

std::string Mila::Data::SpecialTokens::bos_token = "<BOS>"

◆ cls_token

std::string Mila::Data::SpecialTokens::cls_token = "<CLS>"

◆ eos_token

std::string Mila::Data::SpecialTokens::eos_token = "<EOS>"

◆ extended_special_tokens

std::unordered_map<std::string, int32_t> Mila::Data::SpecialTokens::extended_special_tokens

Extended special tokens beyond the seven named slots.

Used for model families with large special token sets, such as Llama 3.2's reserved tokens (IDs 128002-128255). These are matched during the encode pre-pass before BPE merges are applied.

Key: token string (e.g. "<|reserved_special_token_0|>") Value: token ID

◆ mask_token

std::string Mila::Data::SpecialTokens::mask_token = "<MASK>"

◆ pad_token

std::string Mila::Data::SpecialTokens::pad_token = "<PAD>"

◆ sep_token

std::string Mila::Data::SpecialTokens::sep_token = "<SEP>"

◆ unk_token

std::string Mila::Data::SpecialTokens::unk_token = "<UNK>"

◆ use_bos

bool Mila::Data::SpecialTokens::use_bos = true

◆ use_cls

bool Mila::Data::SpecialTokens::use_cls = false

◆ use_eos

bool Mila::Data::SpecialTokens::use_eos = true

◆ use_mask

bool Mila::Data::SpecialTokens::use_mask = false

◆ use_pad

bool Mila::Data::SpecialTokens::use_pad = true

◆ use_sep

bool Mila::Data::SpecialTokens::use_sep = false

◆ use_unk

bool Mila::Data::SpecialTokens::use_unk = true

The documentation for this struct was generated from the following file:

/__w/Mila/Mila/Mila/Src/Data/Tokenizers/SpecialTokens.ixx

Public Member Functions

Static Public Member Functions

Public Attributes

Detailed Description

Member Function Documentation

◆ count()

◆ countAll()

◆ forClassification()

◆ forMLM()

◆ getEnabledTokens()

◆ getIdOffset()

◆ getString()

◆ gptStyle()

◆ isEnabled()

◆ isSpecialToken()

◆ llamaStyle()

◆ minimal()

◆ none()

◆ standard()

Member Data Documentation

◆ bos_token

◆ cls_token

◆ eos_token

◆ extended_special_tokens

◆ mask_token

◆ pad_token

◆ sep_token

◆ unk_token

◆ use_bos

◆ use_cls

◆ use_eos

◆ use_mask

◆ use_pad

◆ use_sep

◆ use_unk