Classes
class	BpeTokenizer
	Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families. More...
class	BpeTrainer
	Corpus accumulator and trainer for BPE vocabularies. More...
class	BpeVocabulary
	Unified Byte Pair Encoding (BPE) vocabulary. More...
class	BpeVocabularyConfig
	Configuration for the BPE vocabulary. More...
class	CharTokenizer
	Character-level tokenizer. More...
class	CharTrainer
	Character-level tokenizer trainer. More...
class	CharVocabulary
	Character vocabulary for tokenization. More...
class	CharVocabularyConfig
	Configuration for Character-level tokenizer training. More...
class	DataLoader
	Device-agnostic data loader interface using abstract tensor data types. More...
class	MilaFileHeader
	Common file header for Mila data files. More...
class	SerializationMetadata
	Type-safe metadata container for component serialization. More...
struct	SpecialTokens
	Configuration for special tokens across all tokenizer types. More...
class	Tokenizer
class	TokenizerTrainer
	Abstract interface for training tokenizer vocabularies from text corpora. More...
class	TokenizerVocabulary
	Generic tokenizer vocabulary interface. More...
class	TokenSequenceLoader
	Token sequence loader for autoregressive language models. More...
struct	TokenSequenceLoaderConfig
	Configuration for TokenSequenceLoader behavior. More...
class	TrainerFactory
	Factory for creating tokenizer trainers and loading vocabularies. More...

Typedefs
template<TensorDataType TInputDataType = TensorDataType::FP32, TensorDataType TTargetDataType = TInputDataType>
using	Mila::Data::CpuDataLoader = DataLoader<TInputDataType, TTargetDataType, CpuMemoryResource>
	CPU data loader that defaults to single-precision floating point (FP32).
using	json = nlohmann::json
using	Mila::Data::TokenId = int32_t

Enumerations
enum class	Mila::Data::MilaFileType : uint32_t { Unknown = 0 , BpeVocabulary = 1 , CharVocabulary = 2 , TokenizedCorpus = 3 , Dataset = 4 , TrainingCheckpoint = 5 , Gpt4BpeVocabulary = 6 }
	File type identifiers for Mila data files. More...
enum class	Mila::Data::PreTokenizationMode { None , Whitespace , Gpt2Regex , Llama3Regex }
	Pre-tokenization strategies for GPT-4 style BPE tokenizers. More...
enum class	Mila::Data::SpecialToken : uint8_t { PAD = 0 , UNK , BOS , EOS , MASK , SEP , CLS }
	Special token types for tokenization. More...
enum class	Mila::Data::TokenizerType : uint8_t { Unknown = 0 , Char , Bpe , SentencePiece , Word , Unigram }
	Tokenizer type discriminator used across tokenizer and vocabulary types. More...

Functions
TokenizerType	Mila::Data::stringToTokenizerType (std::string_view s) noexcept
	Parse a string into TokenizerType.
std::string_view	Mila::Data::tokenizerTypeToString (TokenizerType t) noexcept
	Convert TokenizerType to a stable string representation.
constexpr const char *	Mila::Data::toString (MilaFileType type)

Variables
constexpr const char *	Mila::Data::GPT2_PRETOKENIZATION_PATTERN
constexpr const char *	Mila::Data::GPT2_PRETOKENIZATION_PATTERN_ASCII_FALLBACK
constexpr const char *	Mila::Data::LLAMA3_PRETOKENIZATION_PATTERN
constexpr const char *	Mila::Data::LLAMA3_PRETOKENIZATION_PATTERN_ASCII_FALLBACK

Typedef Documentation

◆ CpuDataLoader

template<TensorDataType TInputDataType = TensorDataType::FP32, TensorDataType TTargetDataType = TInputDataType>

using Mila::Data::CpuDataLoader = DataLoader<TInputDataType, TTargetDataType, CpuMemoryResource>

export

CPU data loader that defaults to single-precision floating point (FP32).

Convenience alias for data loaders backed by standard CPU memory. Inputs default to FP32, and targets default to the input type. Suitable for CPU-only training and development workflows.

Template Parameters

TInputDataType	Input tensor data type (defaults to FP32)
TTargetDataType	Target tensor data type (defaults to input type)
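
The defaulting behavior of the two template parameters can be illustrated with a minimal sketch. The `TensorDataType`, `CpuMemoryResource`, and `DataLoader` definitions below are hypothetical stand-ins, since the real Mila types are not shown on this page; only the alias declaration itself is taken from the documentation above.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the real Mila types (assumptions for illustration).
enum class TensorDataType { FP32, FP16, INT32 };
struct CpuMemoryResource {};

template<TensorDataType TInput, TensorDataType TTarget, typename TMemoryResource>
class DataLoader {
public:
    static constexpr TensorDataType input_type = TInput;
    static constexpr TensorDataType target_type = TTarget;
};

// Alias as documented: the target type falls back to the input type.
template<TensorDataType TInputDataType = TensorDataType::FP32,
         TensorDataType TTargetDataType = TInputDataType>
using CpuDataLoader = DataLoader<TInputDataType, TTargetDataType, CpuMemoryResource>;

// No arguments: FP32 inputs and FP32 targets.
static_assert(std::is_same_v<
    CpuDataLoader<>,
    DataLoader<TensorDataType::FP32, TensorDataType::FP32, CpuMemoryResource>>);

// One argument: the target type follows the input type.
static_assert(CpuDataLoader<TensorDataType::INT32>::target_type == TensorDataType::INT32);

// Two arguments: input and target types are chosen independently.
static_assert(CpuDataLoader<TensorDataType::FP16, TensorDataType::INT32>::input_type
              == TensorDataType::FP16);
```

With this shape, a loader for floating-point inputs and integer token targets would be spelled `CpuDataLoader<TensorDataType::FP32, TensorDataType::INT32>`.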

◆ json

using Mila::Data::json = nlohmann::json

◆ TokenId

using Mila::Data::TokenId = int32_t

export

Enumeration Type Documentation

◆ MilaFileType

enum class Mila::Data::MilaFileType : uint32_t

export strong

File type identifiers for Mila data files.

Enumerator
Unknown
BpeVocabulary
CharVocabulary
TokenizedCorpus
Dataset
TrainingCheckpoint
Gpt4BpeVocabulary

◆ PreTokenizationMode

enum class Mila::Data::PreTokenizationMode

export strong

Pre-tokenization strategies for GPT-4 style BPE tokenizers.

Enumerator
None
Whitespace
Gpt2Regex
Llama3Regex

◆ SpecialToken

enum class Mila::Data::SpecialToken : uint8_t

export strong

Special token types for tokenization.

Enumerator
PAD
UNK
BOS
EOS
MASK
SEP
CLS

◆ TokenizerType

enum class Mila::Data::TokenizerType : uint8_t

export strong

Tokenizer type discriminator used across tokenizer and vocabulary types.

Enumerator
Unknown
Char
Bpe
SentencePiece
Word
Unigram

Function Documentation

◆ stringToTokenizerType()

TokenizerType Mila::Data::stringToTokenizerType ( std::string_view s )

inline export noexcept

Parse a string into TokenizerType.

Comparison is case-insensitive for the ASCII range.

◆ tokenizerTypeToString()

std::string_view Mila::Data::tokenizerTypeToString ( TokenizerType t )

inline export noexcept

Convert TokenizerType to a stable string representation.
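
The two conversion functions can be sketched as a round-trip pair. This is a minimal illustration under stated assumptions, not the library's implementation: the lowercase spellings ("char", "bpe", and so on) and the helper names `asciiLower` and `equalsIgnoreAsciiCase` are hypothetical, and `constexpr` is used here (instead of the documented `inline`) only so the behavior can be checked at compile time.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

enum class TokenizerType : std::uint8_t { Unknown = 0, Char, Bpe, SentencePiece, Word, Unigram };

// Assumed spellings; the strings actually used by Mila are not documented here.
constexpr std::string_view tokenizerTypeToString(TokenizerType t) noexcept {
    switch (t) {
        case TokenizerType::Char:          return "char";
        case TokenizerType::Bpe:           return "bpe";
        case TokenizerType::SentencePiece: return "sentencepiece";
        case TokenizerType::Word:          return "word";
        case TokenizerType::Unigram:       return "unigram";
        case TokenizerType::Unknown:       break;
    }
    return "unknown";
}

// Lowercase only 'A'..'Z'; every other byte (including non-ASCII) is left untouched.
constexpr char asciiLower(char c) noexcept {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

constexpr bool equalsIgnoreAsciiCase(std::string_view a, std::string_view b) noexcept {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (asciiLower(a[i]) != asciiLower(b[i])) return false;
    }
    return true;
}

// Unrecognized input maps to Unknown, which keeps the function noexcept.
constexpr TokenizerType stringToTokenizerType(std::string_view s) noexcept {
    constexpr TokenizerType kAll[] = {
        TokenizerType::Char, TokenizerType::Bpe, TokenizerType::SentencePiece,
        TokenizerType::Word, TokenizerType::Unigram };
    for (TokenizerType t : kAll) {
        if (equalsIgnoreAsciiCase(s, tokenizerTypeToString(t))) return t;
    }
    return TokenizerType::Unknown;
}

static_assert(stringToTokenizerType("BPE") == TokenizerType::Bpe);
static_assert(stringToTokenizerType("no-such-type") == TokenizerType::Unknown);
```

Restricting the case folding to the ASCII range avoids any locale dependence, which is one plausible reason the documentation calls the comparison ASCII-only.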

◆ toString()

const char * Mila::Data::toString ( MilaFileType type )

constexpr export
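
No description is given for this function. A minimal sketch of a `constexpr` conversion with this signature follows; the returned strings are an assumption (here, simply the enumerator names), while the enumerator values are taken from the MilaFileType declaration on this page.

```cpp
#include <cstdint>
#include <string_view>

enum class MilaFileType : std::uint32_t {
    Unknown = 0, BpeVocabulary = 1, CharVocabulary = 2, TokenizedCorpus = 3,
    Dataset = 4, TrainingCheckpoint = 5, Gpt4BpeVocabulary = 6
};

// Assumed behavior: return the enumerator name; unrecognized values yield "Unknown".
constexpr const char* toString(MilaFileType type) {
    switch (type) {
        case MilaFileType::BpeVocabulary:      return "BpeVocabulary";
        case MilaFileType::CharVocabulary:     return "CharVocabulary";
        case MilaFileType::TokenizedCorpus:    return "TokenizedCorpus";
        case MilaFileType::Dataset:            return "Dataset";
        case MilaFileType::TrainingCheckpoint: return "TrainingCheckpoint";
        case MilaFileType::Gpt4BpeVocabulary:  return "Gpt4BpeVocabulary";
        case MilaFileType::Unknown:            break;
    }
    return "Unknown";
}

static_assert(std::string_view{toString(MilaFileType::Dataset)} == "Dataset");
// A raw value outside the declared enumerators falls through to "Unknown".
static_assert(std::string_view{toString(static_cast<MilaFileType>(99))} == "Unknown");
```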

Variable Documentation

◆ GPT2_PRETOKENIZATION_PATTERN

const char* Mila::Data::GPT2_PRETOKENIZATION_PATTERN

constexpr export

Initial value:

= R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+)"

◆ GPT2_PRETOKENIZATION_PATTERN_ASCII_FALLBACK

const char* Mila::Data::GPT2_PRETOKENIZATION_PATTERN_ASCII_FALLBACK

constexpr export

Initial value:

= R"('s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+)"

◆ LLAMA3_PRETOKENIZATION_PATTERN

const char* Mila::Data::LLAMA3_PRETOKENIZATION_PATTERN

constexpr export

Initial value:

= R"((?i:'[sdmt]|'ll|'ve|'re)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+)"

◆ LLAMA3_PRETOKENIZATION_PATTERN_ASCII_FALLBACK

const char* Mila::Data::LLAMA3_PRETOKENIZATION_PATTERN_ASCII_FALLBACK

constexpr export

Initial value:

= R"((?:'[sdmt]|'ll|'ve|'re)|[^\r\nA-Za-z0-9]?[A-Za-z]+|[0-9]{1,3}| ?[^\sA-Za-z0-9]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+)"
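
The Unicode patterns rely on `\p{L}` and `\p{N}`, which the ECMAScript grammar of `std::regex` does not support; the ASCII fallbacks avoid those classes. The sketch below applies both fallback patterns with `std::regex` to show how they split text differently. Whether Mila itself uses `std::regex` is an assumption, and `preTokenize` is a hypothetical helper written only for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Pattern values copied from the documentation above.
constexpr const char* GPT2_PRETOKENIZATION_PATTERN_ASCII_FALLBACK =
    R"('s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+)";

constexpr const char* LLAMA3_PRETOKENIZATION_PATTERN_ASCII_FALLBACK =
    R"((?:'[sdmt]|'ll|'ve|'re)|[^\r\nA-Za-z0-9]?[A-Za-z]+|[0-9]{1,3}| ?[^\sA-Za-z0-9]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+)";

// Hypothetical helper: collect every successive match of `pattern` in `text`.
std::vector<std::string> preTokenize(const std::string& text, const char* pattern) {
    const std::regex re(pattern);  // default ECMAScript grammar
    std::vector<std::string> pieces;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        pieces.push_back(it->str());
    }
    return pieces;
}
```

For the input `"year 12345"`, the GPT-2 fallback keeps the leading space attached to the number and yields `"year"`, `" 12345"`, while the Llama 3 fallback caps digit runs at three characters and yields `"year"`, `" "`, `"123"`, `"45"`.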
