Mila 0.13.48
Deep Neural Network Library
Configuration for special tokens across all tokenizer types.
Public Member Functions | |
| constexpr size_t | count () const |
| Count enabled named special tokens. | |
| size_t | countAll () const |
| Count all special tokens including extended set. | |
| std::vector< SpecialToken > | getEnabledTokens () const |
| Get all enabled named tokens in priority order. | |
| constexpr size_t | getIdOffset () const |
| Get the ID offset for regular tokens. | |
| std::string_view | getString (SpecialToken token) const |
| Get string representation of a named special token. | |
| constexpr bool | isEnabled (SpecialToken token) const |
| Check if a named token type is enabled. | |
| bool | isSpecialToken (std::string_view str) const |
| Check if a string matches any enabled special token (named or extended). | |
Static Public Member Functions | |
| static SpecialTokens | forClassification () |
| Configuration for sequence classification (BERT-style). | |
| static SpecialTokens | forMLM () |
| Configuration for masked language modeling. | |
| static SpecialTokens | gptStyle () |
| GPT-2 style configuration. | |
| static SpecialTokens | llamaStyle () |
| Llama 3.x configuration. | |
| static SpecialTokens | minimal () |
| Minimal configuration (PAD, UNK only). | |
| static SpecialTokens | none () |
| Configuration with no special tokens. | |
| static SpecialTokens | standard () |
| Standard configuration (PAD, UNK, BOS, EOS). | |
Public Attributes | |
| std::string | bos_token = "<BOS>" |
| std::string | cls_token = "<CLS>" |
| std::string | eos_token = "<EOS>" |
| std::unordered_map< std::string, int32_t > | extended_special_tokens |
| Extended special tokens beyond the seven named slots. | |
| std::string | mask_token = "<MASK>" |
| std::string | pad_token = "<PAD>" |
| std::string | sep_token = "<SEP>" |
| std::string | unk_token = "<UNK>" |
| bool | use_bos = true |
| bool | use_cls = false |
| bool | use_eos = true |
| bool | use_mask = false |
| bool | use_pad = true |
| bool | use_sep = false |
| bool | use_unk = true |
Configuration for special tokens across all tokenizer types.
Used by CharTokenizer, BpeTokenizer, Gpt4BpeTokenizer, WordPieceTokenizer, etc. Token strings are customizable to support different model conventions.
The seven named slots (PAD, UNK, BOS, EOS, MASK, SEP, CLS) cover the common case for all known model families. For models with additional special tokens beyond these seven (e.g. Llama 3.2's reserved tokens, IDs 128002-128255), use the extended_special_tokens map.
| constexpr size_t Mila::Data::SpecialTokens::count () const | inline, constexpr |
Count enabled named special tokens.
Does not include extended_special_tokens.

| size_t Mila::Data::SpecialTokens::countAll () const | inline |
Count all special tokens including extended set.

| static SpecialTokens Mila::Data::SpecialTokens::forClassification () | inline, static |
Configuration for sequence classification (BERT-style).
| static SpecialTokens Mila::Data::SpecialTokens::forMLM () | inline, static |
Configuration for masked language modeling.
| std::vector< SpecialToken > Mila::Data::SpecialTokens::getEnabledTokens () const | inline |
Get all enabled named tokens in priority order.

| constexpr size_t Mila::Data::SpecialTokens::getIdOffset () const | inline, constexpr |
Get the ID offset for regular tokens.
Named special tokens occupy IDs 0 to count()-1, so regular tokens start at the offset returned here. Extended tokens carry explicit IDs and do not contribute to this offset.
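The ID layout described above can be illustrated with a minimal sketch. NamedTokenFlags and regularTokenId are hypothetical stand-ins written for this example, not part of Mila; the sketch assumes that getIdOffset() returns the same value as count(), which is what the description implies but the page does not state outright.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in that mirrors only the documented use_* flags.
// It is not Mila's SpecialTokens class.
struct NamedTokenFlags {
    bool use_pad = true;
    bool use_unk = true;
    bool use_bos = true;
    bool use_eos = true;
    bool use_mask = false;
    bool use_sep = false;
    bool use_cls = false;

    // Number of enabled named slots (extended tokens are not counted).
    constexpr std::size_t count() const {
        return std::size_t(use_pad) + std::size_t(use_unk) + std::size_t(use_bos) +
               std::size_t(use_eos) + std::size_t(use_mask) + std::size_t(use_sep) +
               std::size_t(use_cls);
    }

    // Assumption: the offset equals count(), i.e. the first ID after the named slots.
    constexpr std::size_t getIdOffset() const { return count(); }
};

// Map a zero-based index in the regular vocabulary to its final token ID.
constexpr std::size_t regularTokenId(const NamedTokenFlags& flags, std::size_t vocab_index) {
    return flags.getIdOffset() + vocab_index;
}
```

With the default flags, four named slots are enabled, so under these assumptions the first regular token would receive ID 4; enabling MASK as well would shift every regular token up by one.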

| std::string_view Mila::Data::SpecialTokens::getString (SpecialToken token) const | inline |
Get string representation of a named special token.
| static SpecialTokens Mila::Data::SpecialTokens::gptStyle () | inline, static |
GPT-2 style configuration.
Uses <|endoftext|> for PAD, UNK, BOS, and EOS, because GPT-2 uses a single token string for all four roles.

| constexpr bool Mila::Data::SpecialTokens::isEnabled (SpecialToken token) const | inline, constexpr |
Check if a named token type is enabled.

| bool Mila::Data::SpecialTokens::isSpecialToken (std::string_view str) const | inline |
Check if a string matches any enabled special token (named or extended).
| static SpecialTokens Mila::Data::SpecialTokens::llamaStyle () | inline, static |
Llama 3.x configuration.
Registers BOS, EOS, and all five instruct/tool-calling control tokens. These token IDs are fixed across the Llama 3.x family and exist in every Llama 3.x vocabulary regardless of whether the model is a base or instruct variant. Registering them ensures the encoder pre-pass matches them as single atomic tokens rather than subword fragments.
| Token | ID | Role |
|---|---|---|
| <|begin_of_text|> | 128000 | BOS |
| <|end_of_text|> | 128001 | EOS |
| <|start_header_id|> | 128006 | Opens a role header |
| <|end_header_id|> | 128007 | Closes a role header |
| <|eom_id|> | 128008 | Tool call boundary / stop |
| <|eot_id|> | 128009 | End of turn — primary stop |
| <|python_tag|> | 128010 | Tool call open marker |
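To illustrate what matching control tokens "as single atomic tokens" means, here is a rough sketch of a pre-pass that splits input on special-token strings before any subword encoding. Segment, llamaControlTokens, and splitOnSpecialTokens are hypothetical names introduced for this example and are not Mila's encoder; only the token strings and IDs are taken from the table above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// One piece of the split input: a special token (with its ID) or plain text.
struct Segment {
    std::string text;
    bool is_special;
    std::int32_t id;  // meaningful only when is_special is true
};

// Token strings and IDs copied from the table above.
inline const std::unordered_map<std::string, std::int32_t>& llamaControlTokens() {
    static const std::unordered_map<std::string, std::int32_t> tokens = {
        {"<|begin_of_text|>", 128000},   {"<|end_of_text|>", 128001},
        {"<|start_header_id|>", 128006}, {"<|end_header_id|>", 128007},
        {"<|eom_id|>", 128008},          {"<|eot_id|>", 128009},
        {"<|python_tag|>", 128010},
    };
    return tokens;
}

// Hypothetical pre-pass: scan left to right, emit each special token as one
// segment, and leave the text in between for the subword encoder.
inline std::vector<Segment> splitOnSpecialTokens(
    std::string_view input,
    const std::unordered_map<std::string, std::int32_t>& specials) {
    std::vector<Segment> out;
    std::size_t plain_start = 0;
    std::size_t pos = 0;
    while (pos < input.size()) {
        // Pick the longest special token that starts at pos, if any.
        const std::string* best = nullptr;
        std::int32_t best_id = -1;
        for (const auto& [tok, id] : specials) {
            if (!tok.empty() &&
                input.compare(pos, tok.size(), std::string_view(tok)) == 0 &&
                (best == nullptr || tok.size() > best->size())) {
                best = &tok;
                best_id = id;
            }
        }
        if (best == nullptr) {
            ++pos;
            continue;
        }
        // Flush any plain text that preceded the special token.
        if (pos > plain_start) {
            out.push_back({std::string(input.substr(plain_start, pos - plain_start)), false, -1});
        }
        out.push_back({*best, true, best_id});
        pos += best->size();
        plain_start = pos;
    }
    // Flush trailing plain text.
    if (plain_start < input.size()) {
        out.push_back({std::string(input.substr(plain_start)), false, -1});
    }
    return out;
}
```

In this sketch, only the plain-text segments would be handed to the BPE encoder, so a string such as <|eot_id|> can never be broken into subword fragments. The linear scan over all special tokens at every position keeps the example short; a real encoder would more likely use a trie or a compiled pattern.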

| static SpecialTokens Mila::Data::SpecialTokens::minimal () | inline, static |
Minimal configuration (PAD, UNK only).
| static SpecialTokens Mila::Data::SpecialTokens::none () | inline, static |
Configuration with no special tokens.
| static SpecialTokens Mila::Data::SpecialTokens::standard () | inline, static |
Standard configuration (PAD, UNK, BOS, EOS).
| std::string Mila::Data::SpecialTokens::bos_token = "<BOS>" |
| std::string Mila::Data::SpecialTokens::cls_token = "<CLS>" |
| std::string Mila::Data::SpecialTokens::eos_token = "<EOS>" |
| std::unordered_map<std::string, int32_t> Mila::Data::SpecialTokens::extended_special_tokens |
Extended special tokens beyond the seven named slots.
Used for model families with large special token sets, such as Llama 3.2's reserved tokens (IDs 128002-128255). These are matched during the encode pre-pass before BPE merges are applied.
Key: token string (e.g. "<|reserved_special_token_0|>")
Value: token ID
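The relationship between the named slots and the extended map can be sketched as follows. SpecialTokensSketch is a hypothetical, trimmed-down stand-in with only two named slots, not the real class. The sketch assumes that countAll() is count() plus the size of the map and that isSpecialToken() does an exact string comparison. Pairing "<|reserved_special_token_0|>" with ID 128002 is also an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical stand-in: two named slots plus the extended map.
struct SpecialTokensSketch {
    std::string bos_token = "<BOS>";
    std::string eos_token = "<EOS>";
    bool use_bos = true;
    bool use_eos = true;
    std::unordered_map<std::string, std::int32_t> extended_special_tokens;

    // Named slots only; the extended map is excluded.
    std::size_t count() const { return std::size_t(use_bos) + std::size_t(use_eos); }

    // Assumption: named slots plus every entry in the extended map.
    std::size_t countAll() const { return count() + extended_special_tokens.size(); }

    // Assumption: exact match against enabled named strings, then the extended map.
    bool isSpecialToken(std::string_view str) const {
        if (use_bos && str == bos_token) return true;
        if (use_eos && str == eos_token) return true;
        return extended_special_tokens.find(std::string(str)) != extended_special_tokens.end();
    }
};

// Build a Llama-like configuration with one extended entry.
inline SpecialTokensSketch makeExample() {
    SpecialTokensSketch cfg;
    cfg.bos_token = "<|begin_of_text|>";
    cfg.eos_token = "<|end_of_text|>";
    // Illustrative ID; check the model's tokenizer files for the real value.
    cfg.extended_special_tokens.emplace("<|reserved_special_token_0|>", 128002);
    return cfg;
}
```

Because extended entries carry explicit IDs, adding one changes countAll() in this sketch but leaves count(), and therefore the ID offset for regular tokens, unchanged.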
| std::string Mila::Data::SpecialTokens::mask_token = "<MASK>" |
| std::string Mila::Data::SpecialTokens::pad_token = "<PAD>" |
| std::string Mila::Data::SpecialTokens::sep_token = "<SEP>" |
| std::string Mila::Data::SpecialTokens::unk_token = "<UNK>" |
| bool Mila::Data::SpecialTokens::use_bos = true |
| bool Mila::Data::SpecialTokens::use_cls = false |
| bool Mila::Data::SpecialTokens::use_eos = true |
| bool Mila::Data::SpecialTokens::use_mask = false |
| bool Mila::Data::SpecialTokens::use_pad = true |
| bool Mila::Data::SpecialTokens::use_sep = false |
| bool Mila::Data::SpecialTokens::use_unk = true |