Mila 0.13.48
Deep Neural Network Library
Classes | |
| class | BpeTokenizer |
| Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families. More... | |
| class | BpeTrainer |
| Corpus accumulator and trainer for BPE vocabularies. More... | |
| class | BpeVocabulary |
| Unified Byte Pair Encoding (BPE) vocabulary. More... | |
| class | BpeVocabularyConfig |
| Configuration for the BPE vocabulary. More... | |
| class | CharTokenizer |
| Character-level tokenizer. More... | |
| class | CharTrainer |
| Character-level tokenizer trainer. More... | |
| class | CharVocabulary |
| Character vocabulary for tokenization. More... | |
| class | CharVocabularyConfig |
| Configuration for character-level tokenizer training. More... | |
| class | DataLoader |
| Device-agnostic data loader interface using abstract tensor data types. More... | |
| class | MilaFileHeader |
| Common file header for Mila data files. More... | |
| class | SerializationMetadata |
| Type-safe metadata container for component serialization. More... | |
| struct | SpecialTokens |
| Configuration for special tokens across all tokenizer types. More... | |
| class | Tokenizer |
| class | TokenizerTrainer |
| Abstract interface for training tokenizer vocabularies from text corpora. More... | |
| class | TokenizerVocabulary |
| Generic tokenizer vocabulary interface. More... | |
| class | TokenSequenceLoader |
| Token sequence loader for autoregressive language models. More... | |
| struct | TokenSequenceLoaderConfig |
| Configuration for TokenSequenceLoader behavior. More... | |
| class | TrainerFactory |
| Factory for creating tokenizer trainers and loading vocabularies. More... | |
Typedefs | |
| template<TensorDataType TInputDataType = TensorDataType::FP32, TensorDataType TTargetDataType = TInputDataType> | |
| using | Mila::Data::CpuDataLoader = DataLoader<TInputDataType, TTargetDataType, CpuMemoryResource> |
| CPU data loader defaulting to single-precision floating point (FP32). | |
| using | json = nlohmann::json |
| using | Mila::Data::TokenId = int32_t |
Enumerations | |
| enum class | Mila::Data::MilaFileType : uint32_t { Unknown = 0 , BpeVocabulary = 1 , CharVocabulary = 2 , TokenizedCorpus = 3 , Dataset = 4 , TrainingCheckpoint = 5 , Gpt4BpeVocabulary = 6 } |
| File type identifiers for Mila data files. More... | |
| enum class | Mila::Data::PreTokenizationMode { None , Whitespace , Gpt2Regex , Llama3Regex } |
| Pre-tokenization strategies for GPT-4 style BPE tokenizers. More... | |
| enum class | Mila::Data::SpecialToken : uint8_t { PAD = 0 , UNK , BOS , EOS , MASK , SEP , CLS } |
| Special token types for tokenization. More... | |
| enum class | Mila::Data::TokenizerType : uint8_t { Unknown = 0 , Char , Bpe , SentencePiece , Word , Unigram } |
| Tokenizer type discriminator used across tokenizer and vocabulary types. More... | |
Functions | |
| TokenizerType | Mila::Data::stringToTokenizerType (std::string_view s) noexcept |
| Parse a string into TokenizerType. | |
| std::string_view | Mila::Data::tokenizerTypeToString (TokenizerType t) noexcept |
| Convert TokenizerType to a stable string representation. | |
| constexpr const char * | Mila::Data::toString (MilaFileType type) |
Variables | |
| constexpr const char * | Mila::Data::GPT2_PRETOKENIZATION_PATTERN |
| constexpr const char * | Mila::Data::GPT2_PRETOKENIZATION_PATTERN_ASCII_FALLBACK |
| constexpr const char * | Mila::Data::LLAMA3_PRETOKENIZATION_PATTERN |
| constexpr const char * | Mila::Data::LLAMA3_PRETOKENIZATION_PATTERN_ASCII_FALLBACK |
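| template<TensorDataType TInputDataType = TensorDataType::FP32, TensorDataType TTargetDataType = TInputDataType> using Mila::Data::CpuDataLoader = DataLoader<TInputDataType, TTargetDataType, CpuMemoryResource> | export |
CPU data loader defaulting to single-precision floating point (FP32).
Convenience alias for data loaders backed by standard CPU memory. With the default template arguments, both inputs and targets use FP32. Suitable for CPU-only training and development workflows.
| Template parameter | Description |
|---|---|
| TInputDataType | Input tensor data type (defaults to FP32) |
| TTargetDataType | Target tensor data type (defaults to the input type) |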
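The default-argument behavior of the alias can be illustrated with a minimal, self-contained sketch. The `TensorDataType` enumerators, `CpuMemoryResource`, and `DataLoader` below are simplified stand-ins written for this example, not the real Mila definitions; only the shape of the alias follows the signature documented above.

```cpp
#include <type_traits>

// Hypothetical stand-ins: the real Mila types are not shown on this page.
enum class TensorDataType { FP32, FP16, INT32 };
struct CpuMemoryResource {};

template<TensorDataType TInput, TensorDataType TTarget, typename TMemoryResource>
class DataLoader {
public:
    static constexpr TensorDataType input_type = TInput;
    static constexpr TensorDataType target_type = TTarget;
    using MemoryResource = TMemoryResource;
};

// Same shape as the documented alias: the target type defaults to the input type.
template<TensorDataType TInputDataType = TensorDataType::FP32,
         TensorDataType TTargetDataType = TInputDataType>
using CpuDataLoader = DataLoader<TInputDataType, TTargetDataType, CpuMemoryResource>;

// No arguments: both types are FP32.
static_assert(CpuDataLoader<>::input_type == TensorDataType::FP32);
static_assert(CpuDataLoader<>::target_type == TensorDataType::FP32);

// One argument: the target type follows the input type.
static_assert(CpuDataLoader<TensorDataType::FP16>::target_type == TensorDataType::FP16);

// Two arguments: input and target types are chosen independently.
static_assert(CpuDataLoader<TensorDataType::FP32, TensorDataType::INT32>::target_type
              == TensorDataType::INT32);
```

Because the second parameter defaults to the first, a single explicit argument is enough for the common case where inputs and targets share a data type.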
| using Mila::Data::json = nlohmann::json |
| using Mila::Data::TokenId = int32_t | export |
| enum class Mila::Data::MilaFileType : uint32_t | export, strong |
File type identifiers for Mila data files.
| Enumerator | Value |
|---|---|
| Unknown | 0 |
| BpeVocabulary | 1 |
| CharVocabulary | 2 |
| TokenizedCorpus | 3 |
| Dataset | 4 |
| TrainingCheckpoint | 5 |
| Gpt4BpeVocabulary | 6 |
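Since the enumeration has a fixed `uint32_t` underlying type, a type identifier read from a file arrives as a raw integer and has to be range-checked before it is treated as a `MilaFileType`. The sketch below mirrors the enumerators listed above in a local enum; `fileTypeFromRaw` is a hypothetical helper invented for this example, and mapping out-of-range values to `Unknown` is an assumption rather than documented Mila behavior.

```cpp
#include <cstdint>

// Local mirror of the documented enumerators and values.
enum class MilaFileType : std::uint32_t {
    Unknown = 0,
    BpeVocabulary = 1,
    CharVocabulary = 2,
    TokenizedCorpus = 3,
    Dataset = 4,
    TrainingCheckpoint = 5,
    Gpt4BpeVocabulary = 6
};

// Hypothetical helper (not part of the documented API): converts a raw
// identifier to MilaFileType and maps anything out of range to Unknown.
constexpr MilaFileType fileTypeFromRaw(std::uint32_t raw) noexcept {
    constexpr auto last = static_cast<std::uint32_t>(MilaFileType::Gpt4BpeVocabulary);
    return raw <= last ? static_cast<MilaFileType>(raw) : MilaFileType::Unknown;
}

static_assert(fileTypeFromRaw(1) == MilaFileType::BpeVocabulary);
static_assert(fileTypeFromRaw(6) == MilaFileType::Gpt4BpeVocabulary);
static_assert(fileTypeFromRaw(7) == MilaFileType::Unknown);
```

The range check relies on the enumerator values being contiguous from 0 to 6, which holds for the values listed on this page.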
| enum class Mila::Data::TokenizerType : uint8_t | export, strong |
Tokenizer type discriminator used across tokenizer and vocabulary types.
| Enumerator | Value |
|---|---|
| Unknown | 0 |
| Char | 1 |
| Bpe | 2 |
| SentencePiece | 3 |
| Word | 4 |
| Unigram | 5 |
| TokenizerType Mila::Data::stringToTokenizerType (std::string_view s) noexcept | inline, export |
Parse a string into TokenizerType.
Comparison is case-insensitive for the ASCII range.
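One way an ASCII-only, case-insensitive parse could be written is sketched below. This is not the Mila implementation: `parseTokenizerType`, `asciiLower`, and `equalsIgnoreAsciiCase` are hypothetical names, the accepted spellings are assumptions, and returning `Unknown` for unrecognized input is also an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Local mirror of the documented enumeration.
enum class TokenizerType : std::uint8_t { Unknown = 0, Char, Bpe, SentencePiece, Word, Unigram };

// Lowercases 'A'..'Z' only; every other byte passes through unchanged,
// so non-ASCII text is compared byte for byte.
constexpr char asciiLower(char c) noexcept {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

constexpr bool equalsIgnoreAsciiCase(std::string_view a, std::string_view b) noexcept {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (asciiLower(a[i]) != asciiLower(b[i])) return false;
    }
    return true;
}

// Hypothetical parser: spellings and the Unknown fallback are assumptions.
constexpr TokenizerType parseTokenizerType(std::string_view s) noexcept {
    if (equalsIgnoreAsciiCase(s, "char")) return TokenizerType::Char;
    if (equalsIgnoreAsciiCase(s, "bpe")) return TokenizerType::Bpe;
    if (equalsIgnoreAsciiCase(s, "sentencepiece")) return TokenizerType::SentencePiece;
    if (equalsIgnoreAsciiCase(s, "word")) return TokenizerType::Word;
    if (equalsIgnoreAsciiCase(s, "unigram")) return TokenizerType::Unigram;
    return TokenizerType::Unknown;
}

static_assert(parseTokenizerType("BPE") == TokenizerType::Bpe);
static_assert(parseTokenizerType("not-a-type") == TokenizerType::Unknown);
```

Avoiding `std::tolower` keeps the comparison independent of the current locale and lets the function stay `constexpr` and `noexcept`.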
| std::string_view Mila::Data::tokenizerTypeToString (TokenizerType t) noexcept | inline, export |
Convert TokenizerType to a stable string representation.
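A stable representation is commonly produced with a `switch` over the enumerators that returns string literals. The sketch below shows that pattern under stated assumptions: `tokenizerTypeName` is a hypothetical name, and the lowercase spellings are invented for the example. The strings Mila actually returns are not listed on this page.

```cpp
#include <cstdint>
#include <string_view>

// Local mirror of the documented enumeration.
enum class TokenizerType : std::uint8_t { Unknown = 0, Char, Bpe, SentencePiece, Word, Unigram };

// Hypothetical mapping: the spellings are assumptions for this example.
// String literals have static storage duration, so the returned views
// remain valid for the lifetime of the program.
constexpr std::string_view tokenizerTypeName(TokenizerType t) noexcept {
    switch (t) {
        case TokenizerType::Char:          return "char";
        case TokenizerType::Bpe:           return "bpe";
        case TokenizerType::SentencePiece: return "sentencepiece";
        case TokenizerType::Word:          return "word";
        case TokenizerType::Unigram:       return "unigram";
        case TokenizerType::Unknown:       break;
    }
    // Also reached for values outside the enumerator list.
    return "unknown";
}

static_assert(tokenizerTypeName(TokenizerType::Bpe) == "bpe");
static_assert(tokenizerTypeName(TokenizerType::Unknown) == "unknown");
```

Listing every enumerator without a `default` label lets compilers warn when a new enumerator is added but not mapped, which helps keep the serialized names in sync with the enum.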