Mila 0.13.48
Deep Neural Network Library
Mila::Data Namespace Reference

Classes

class  BpeTokenizer
 Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families. More...
class  BpeTrainer
 Corpus accumulator and trainer for BPE vocabularies. More...
class  BpeVocabulary
 Unified Byte Pair Encoding (BPE) vocabulary. More...
class  BpeVocabularyConfig
 Configuration for the BPE vocabulary. More...
class  CharTokenizer
 Character-level tokenizer. More...
class  CharTrainer
 Character-level tokenizer trainer. More...
class  CharVocabulary
 Character vocabulary for tokenization. More...
class  CharVocabularyConfig
 Configuration for Character-level tokenizer training. More...
class  DataLoader
 Device-agnostic data loader interface using abstract tensor data types. More...
class  MilaFileHeader
 Common file header for Mila data files. More...
 Type-safe metadata container for component serialization. More...
struct  SpecialTokens
 Configuration for special tokens across all tokenizer types. More...
class  Tokenizer
class  TokenizerTrainer
 Abstract interface for training tokenizer vocabularies from text corpora. More...
class  TokenizerVocabulary
 Generic tokenizer vocabulary interface. More...
class  TokenSequenceLoader
 Token sequence loader for autoregressive language models. More...
struct  TokenSequenceLoaderConfig
 Configuration for TokenSequenceLoader behavior. More...
class  TrainerFactory
 Factory for creating tokenizer trainers and loading vocabularies. More...

Typedefs

template<TensorDataType TInputDataType = TensorDataType::FP32, TensorDataType TTargetDataType = TInputDataType>
using Mila::Data::CpuDataLoader = DataLoader<TInputDataType, TTargetDataType, CpuMemoryResource>
 CPU data loader with single precision floating point.
using json = nlohmann::json
using Mila::Data::TokenId = int32_t

Enumerations

enum class  Mila::Data::MilaFileType : uint32_t {
  Unknown = 0 , BpeVocabulary = 1 , CharVocabulary = 2 , TokenizedCorpus = 3 ,
  Dataset = 4 , TrainingCheckpoint = 5 , Gpt4BpeVocabulary = 6
}
 File type identifiers for Mila data files. More...
enum class  Mila::Data::PreTokenizationMode { None , Whitespace , Gpt2Regex , Llama3Regex }
 Pre-tokenization strategies for GPT-4 style BPE tokenizers. More...
enum class  Mila::Data::SpecialToken : uint8_t {
  PAD = 0 , UNK , BOS , EOS ,
  MASK , SEP , CLS
}
 Special token types for tokenization. More...
enum class  Mila::Data::TokenizerType : uint8_t {
  Unknown = 0 , Char , Bpe , SentencePiece ,
  Word , Unigram
}
 Tokenizer type discriminator used across tokenizer and vocabulary types. More...

Functions

TokenizerType Mila::Data::stringToTokenizerType (std::string_view s) noexcept
 Parse a string into TokenizerType.
std::string_view Mila::Data::tokenizerTypeToString (TokenizerType t) noexcept
 Convert TokenizerType to a stable string representation.
constexpr const char * Mila::Data::toString (MilaFileType type)

Variables

constexpr const char * Mila::Data::GPT2_PRETOKENIZATION_PATTERN
constexpr const char * Mila::Data::GPT2_PRETOKENIZATION_PATTERN_ASCII_FALLBACK
constexpr const char * Mila::Data::LLAMA3_PRETOKENIZATION_PATTERN
constexpr const char * Mila::Data::LLAMA3_PRETOKENIZATION_PATTERN_ASCII_FALLBACK

Typedef Documentation

◆ CpuDataLoader

template<TensorDataType TInputDataType = TensorDataType::FP32, TensorDataType TTargetDataType = TInputDataType>
using Mila::Data::CpuDataLoader = DataLoader<TInputDataType, TTargetDataType, CpuMemoryResource>
export

CPU data loader with single precision floating point.

Convenient alias for data loaders using standard CPU memory with FP32 data types for both inputs and targets. Suitable for CPU-only training and development workflows.

Template Parameters
TInputDataType   Input tensor data type (defaults to FP32)
TTargetDataType  Target tensor data type (defaults to the input type)
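
The defaulting rule above (the target type follows the input type unless it is given explicitly) can be illustrated with a minimal, self-contained sketch. The `TensorDataType`, `CpuMemoryResource`, and `DataLoader` definitions below are hypothetical stand-ins, not Mila's real declarations, and the `FP16` and `INT32` enumerators are assumptions; only the shape of the alias is taken from this page.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the Mila types named in the alias.
enum class TensorDataType { FP32, FP16, INT32 };
struct CpuMemoryResource {};

// Stand-in loader that only records its template arguments.
template <TensorDataType TInput, TensorDataType TTarget, typename TMemoryResource>
struct DataLoader {
    static constexpr TensorDataType input_type = TInput;
    static constexpr TensorDataType target_type = TTarget;
    using memory_resource = TMemoryResource;
};

// Same shape as the documented alias: the second parameter defaults to the first.
template <TensorDataType TInputDataType = TensorDataType::FP32,
          TensorDataType TTargetDataType = TInputDataType>
using CpuDataLoader = DataLoader<TInputDataType, TTargetDataType, CpuMemoryResource>;

// No arguments: FP32 inputs and FP32 targets.
static_assert(CpuDataLoader<>::input_type == TensorDataType::FP32);
static_assert(CpuDataLoader<>::target_type == TensorDataType::FP32);

// One argument: the target type follows the input type.
static_assert(CpuDataLoader<TensorDataType::FP16>::target_type == TensorDataType::FP16);

// Two arguments: input and target types differ (for example, float inputs, integer labels).
static_assert(CpuDataLoader<TensorDataType::FP32, TensorDataType::INT32>::target_type
              == TensorDataType::INT32);
```

Because the alias accepts any pair of data types, the "single precision" wording in the brief describes only the default instantiation.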

◆ json

using Mila::Data::json = nlohmann::json

◆ TokenId

using Mila::Data::TokenId = int32_t
export

Enumeration Type Documentation

◆ MilaFileType

enum class Mila::Data::MilaFileType : uint32_t
export strong

File type identifiers for Mila data files.

Enumerator
Unknown 
BpeVocabulary 
CharVocabulary 
TokenizedCorpus 
Dataset 
TrainingCheckpoint 
Gpt4BpeVocabulary 
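
As a rough illustration of how these identifiers might be used, the sketch below repeats the enumeration with the numeric values listed in the summary and adds a `constexpr` name lookup in the spirit of `toString(MilaFileType)`. The returned strings and the `fileTypeFromRaw` helper are assumptions made for this example; the page does not document either.

```cpp
#include <cstdint>
#include <string_view>

// Values copied from the enumeration summary on this page.
enum class MilaFileType : std::uint32_t {
    Unknown = 0, BpeVocabulary = 1, CharVocabulary = 2, TokenizedCorpus = 3,
    Dataset = 4, TrainingCheckpoint = 5, Gpt4BpeVocabulary = 6
};

// Sketch of a name lookup; the exact strings returned by Mila are not documented here.
constexpr const char* toString(MilaFileType type) {
    switch (type) {
        case MilaFileType::BpeVocabulary:      return "BpeVocabulary";
        case MilaFileType::CharVocabulary:     return "CharVocabulary";
        case MilaFileType::TokenizedCorpus:    return "TokenizedCorpus";
        case MilaFileType::Dataset:            return "Dataset";
        case MilaFileType::TrainingCheckpoint: return "TrainingCheckpoint";
        case MilaFileType::Gpt4BpeVocabulary:  return "Gpt4BpeVocabulary";
        case MilaFileType::Unknown:            break;
    }
    return "Unknown";
}

// Hypothetical helper: map a raw header field to the enum, rejecting unknown values.
constexpr MilaFileType fileTypeFromRaw(std::uint32_t raw) {
    return raw <= 6 ? static_cast<MilaFileType>(raw) : MilaFileType::Unknown;
}

static_assert(fileTypeFromRaw(3) == MilaFileType::TokenizedCorpus);
```

Validating the raw value before casting keeps an out-of-range header field from producing an enum value that no `switch` case handles.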

◆ PreTokenizationMode

enum class Mila::Data::PreTokenizationMode
export strong

Pre-tokenization strategies for GPT-4 style BPE tokenizers.

Enumerator
None 
Whitespace 
Gpt2Regex 
Llama3Regex 

◆ SpecialToken

enum class Mila::Data::SpecialToken : uint8_t
export strong

Special token types for tokenization.

Enumerator
PAD 
UNK 
BOS 
EOS 
MASK 
SEP 
CLS 

◆ TokenizerType

enum class Mila::Data::TokenizerType : uint8_t
export strong

Tokenizer type discriminator used across tokenizer and vocabulary types.

Enumerator
Unknown 
Char 
Bpe 
SentencePiece 
Word 
Unigram 

Function Documentation

◆ stringToTokenizerType()

TokenizerType Mila::Data::stringToTokenizerType ( std::string_view s)
inline export noexcept

Parse a string into TokenizerType.

Comparison is case-insensitive for the ASCII range.
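
A minimal sketch of ASCII-only, case-insensitive parsing is shown below. It is not Mila's implementation: the function is deliberately named `parseTokenizerType` to mark it as illustrative, the accepted spellings ("char", "bpe", and so on) are assumptions, and returning `Unknown` for unrecognized input is a guess based on the `noexcept` signature.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Enumerators copied from the TokenizerType summary on this page.
enum class TokenizerType : std::uint8_t { Unknown = 0, Char, Bpe, SentencePiece, Word, Unigram };

// Lower-case a single byte in the ASCII range only; other bytes pass through unchanged.
constexpr char asciiLower(char c) noexcept {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Compare two strings, ignoring case for ASCII letters.
constexpr bool equalsIgnoreAsciiCase(std::string_view a, std::string_view b) noexcept {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (asciiLower(a[i]) != asciiLower(b[i])) return false;
    }
    return true;
}

// Illustrative parser; the name table is an assumption, not Mila's documented spelling.
inline TokenizerType parseTokenizerType(std::string_view s) noexcept {
    struct Entry { std::string_view name; TokenizerType type; };
    static constexpr Entry table[] = {
        {"char", TokenizerType::Char},
        {"bpe", TokenizerType::Bpe},
        {"sentencepiece", TokenizerType::SentencePiece},
        {"word", TokenizerType::Word},
        {"unigram", TokenizerType::Unigram},
    };
    for (const auto& entry : table) {
        if (equalsIgnoreAsciiCase(s, entry.name)) return entry.type;
    }
    return TokenizerType::Unknown;  // assumed fallback for unrecognized input
}
```

Folding case by hand avoids `std::tolower`, whose result depends on the current locale and whose behavior is undefined for negative `char` values.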

◆ tokenizerTypeToString()

std::string_view Mila::Data::tokenizerTypeToString ( TokenizerType t)
inline export noexcept

Convert TokenizerType to a stable string representation.

◆ toString()

const char * Mila::Data::toString ( MilaFileType type)
constexpr export

Variable Documentation

◆ GPT2_PRETOKENIZATION_PATTERN

const char* Mila::Data::GPT2_PRETOKENIZATION_PATTERN
constexpr export
Initial value:
=
R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+)"

◆ GPT2_PRETOKENIZATION_PATTERN_ASCII_FALLBACK

const char* Mila::Data::GPT2_PRETOKENIZATION_PATTERN_ASCII_FALLBACK
constexpr export
Initial value:
=
R"('s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+)"
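
To show what this pattern does to a piece of text, here is a small sketch that applies the ASCII fallback with `std::regex`. The page does not say which regex engine Mila uses, so `std::regex` is an assumption chosen for self-containment; it has no `\p{...}` support, so only the fallback pattern can be demonstrated this way. `kGpt2AsciiPattern` and `preTokenize` are hypothetical names.

```cpp
#include <regex>
#include <string>
#include <vector>

// Copy of the ASCII fallback pattern shown above.
constexpr const char* kGpt2AsciiPattern =
    R"('s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+)";

// Split text into pre-tokenization pieces by collecting successive regex matches.
inline std::vector<std::string> preTokenize(const std::string& text, const char* pattern) {
    const std::regex re(pattern);  // ECMAScript grammar, which supports the (?!...) lookahead
    std::vector<std::string> pieces;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        pieces.push_back(it->str());
    }
    return pieces;
}
```

For the input `Hello world, it's 42!` this yields seven pieces: `Hello`, ` world`, `,`, ` it`, `'s`, ` 42`, `!`. A leading space stays attached to the word or number that follows it, and concatenating the pieces reproduces the input.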

◆ LLAMA3_PRETOKENIZATION_PATTERN

const char* Mila::Data::LLAMA3_PRETOKENIZATION_PATTERN
constexpr export
Initial value:
=
R"((?i:'[sdmt]|'ll|'ve|'re)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+)"

◆ LLAMA3_PRETOKENIZATION_PATTERN_ASCII_FALLBACK

const char* Mila::Data::LLAMA3_PRETOKENIZATION_PATTERN_ASCII_FALLBACK
constexpr export
Initial value:
=
R"((?:'[sdmt]|'ll|'ve|'re)|[^\r\nA-Za-z0-9]?[A-Za-z]+|[0-9]{1,3}| ?[^\sA-Za-z0-9]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+)"
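
Two details of this fallback are worth seeing in action: digits are matched in runs of at most three (`[0-9]{1,3}`), and numbers do not absorb a leading space. The sketch below applies the pattern with `std::regex`, which is an assumption made for self-containment; the page does not state which engine Mila uses. `kLlama3AsciiPattern` and `preTokenizeLlama3Ascii` are hypothetical names.

```cpp
#include <regex>
#include <string>
#include <vector>

// Copy of the Llama 3 ASCII fallback pattern shown above.
constexpr const char* kLlama3AsciiPattern =
    R"((?:'[sdmt]|'ll|'ve|'re)|[^\r\nA-Za-z0-9]?[A-Za-z]+|[0-9]{1,3}| ?[^\sA-Za-z0-9]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+)";

// Collect successive matches of the pattern as pre-tokenization pieces.
inline std::vector<std::string> preTokenizeLlama3Ascii(const std::string& text) {
    const std::regex re(kLlama3AsciiPattern);  // ECMAScript grammar
    std::vector<std::string> pieces;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        pieces.push_back(it->str());
    }
    return pieces;
}
```

For the input `year 2024` the pieces are `year`, a single space, `202`, and `4`. Note also that the Unicode pattern wraps the contraction alternatives in `(?i:...)`, while this fallback uses a plain `(?:...)` group, so as written the fallback matches only lower-case contractions such as `'s` and not `'S`.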