|
Mila 0.13.48
Deep Neural Network Library
|
Token sequence loader for autoregressive language models.


Public Types | |
| using | BaseLoader = DataLoader<TensorDataType::INT32, TensorDataType::INT32, TMemoryResource> |
| using | HostType = typename TensorHostTypeMap<TensorDataType::INT32>::host_type |
| using | TensorType = Tensor<TensorDataType::INT32, TMemoryResource> |
| Public Types inherited from Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource > | |
| using | InputDataType |
| Input tensor abstract data type. | |
| using | InputTensor |
| Input tensor type alias. | |
| using | MemoryResource |
| Memory resource type for tensor allocation. | |
| using | TargetDataType |
| Target tensor abstract data type. | |
| using | TargetTensor |
| Target tensor type alias. | |
Public Member Functions | |
| TokenSequenceLoader (const std::filesystem::path &tokens_file, int64_t batch_size, int64_t seq_length, bool is_training, DeviceId device, const TokenSequenceLoaderConfig &config=TokenSequenceLoaderConfig()) | |
| Constructs streaming autoregressive sequence loader. | |
| TokenSequenceLoader (const TokenSequenceLoader &)=delete | |
| TokenSequenceLoader (TokenSequenceLoader &&)=delete | |
| ~TokenSequenceLoader () noexcept | |
| const TensorType & | inputs () const override |
| Provides immutable access to input tensor for current batch. | |
| TensorType & | inputs () override |
| Provides mutable access to input tensor for current batch. | |
| void | nextBatch () override |
| Loads the next batch of data from the dataset. | |
| int64_t | numBatches () const override |
| Returns the total number of batches in the dataset. | |
| size_t | numTokens () const |
| size_t | numWindows () const |
| TokenSequenceLoader & | operator= (const TokenSequenceLoader &)=delete |
| TokenSequenceLoader & | operator= (TokenSequenceLoader &&)=delete |
| void | reset () override |
| Resets the loader to the beginning of the dataset. | |
| int64_t | sequenceLength () const |
| const TensorType & | targets () const override |
| Provides immutable access to target tensor for current batch. | |
| TensorType & | targets () override |
| Provides mutable access to target tensor for current batch. | |
| size_t | windowSizeTokens () const |
| Public Member Functions inherited from Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource > | |
| DataLoader (const DataLoader &)=delete | |
| Copy operations explicitly deleted for performance safety. | |
| DataLoader (DataLoader &&)=default | |
| Move operations for efficient ownership transfer. | |
| DataLoader (int64_t batch_size) | |
| Constructs data loader with specified batch configuration. | |
| virtual | ~DataLoader ()=default |
| Virtual destructor ensuring proper cleanup in derived classes. | |
| int64_t | batchSize () const noexcept |
| Returns the configured batch size. | |
| int64_t | currentBatch () const noexcept |
| Returns the current batch index. | |
| virtual std::string | getDatasetInfo () const |
| Returns dataset statistics for optimization and analysis. | |
| virtual bool | hasNext () const |
| Checks if more batches are available. | |
| DataLoader & | operator= (const DataLoader &)=delete |
| DataLoader & | operator= (DataLoader &&)=default |
| virtual bool | validateCurrentBatch () const |
| Validates current batch data integrity. | |
Private Member Functions | |
| void | allocateBuffers () |
| void | cleanupBuffers () noexcept |
| void | fillBatch (const TokenId *window_buffer, size_t batch_idx, HostType *input_dest, HostType *target_dest) |
| Fills a batch from the current window buffer. | |
| void | initializeDataset () |
| void | loadWindowFromFile (std::ifstream &file, TokenId *buffer, size_t window_idx) |
| Loads a window from the token file. | |
| void | prepareSequenceIndices () |
| void | producerThreadFunc () noexcept |
| Producer thread: streams windows from disk and fills batches. | |
| void | shuffleSequenceIndices () |
| void | swapBuffers () noexcept |
Static Private Member Functions | |
| static DeviceId | validateDeviceId (DeviceId device) |
Private Attributes | |
| std::atomic< bool > | back_buffer_ready_ |
| std::shared_ptr< TensorType > | back_input_tensor_ |
| std::shared_ptr< TensorType > | back_target_tensor_ |
| size_t | batches_per_window_ |
| TokenSequenceLoaderConfig | config_ |
| std::atomic< size_t > | current_batch_in_window_ |
| std::atomic< size_t > | current_window_idx_ |
| std::condition_variable | cv_consumer_ |
| std::condition_variable | cv_producer_ |
| DeviceId | device_ |
| size_t | file_size_ |
| std::atomic< bool > | front_buffer_ready_ |
| std::shared_ptr< TensorType > | front_input_tensor_ |
| std::shared_ptr< TensorType > | front_target_tensor_ |
| bool | is_training_ |
| std::mutex | mutex_ |
| int64_t | num_batches_ |
| size_t | num_tokens_ |
| size_t | num_windows_ |
| std::exception_ptr | producer_exception_ |
| std::thread | producer_thread_ |
| int64_t | seq_length_ |
| std::vector< size_t > | sequence_indices_ |
| size_t | sequences_per_window_ |
| std::atomic< bool > | stop_ |
| std::filesystem::path | tokens_file_path_ |
| size_t | window_size_tokens_ |
Additional Inherited Members | |
| Static Public Member Functions inherited from Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource > | |
| static constexpr bool | supportsMixedPrecision () noexcept |
| Checks if data loader supports mixed-precision workflows. | |
| static constexpr bool | usesPinnedMemory () noexcept |
| Checks if data loader uses pinned memory for GPU optimization. | |
| Static Public Attributes inherited from Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource > | |
| static constexpr TensorDataType | input_data_type |
| Compile-time input data type constant. | |
| static constexpr bool | is_mixed_precision |
| Mixed-precision workflow detection. | |
| static constexpr TensorDataType | target_data_type |
| Compile-time target data type constant. | |
| static constexpr bool | uses_pinned_memory |
| Pinned memory optimization (CUDA-only; false on CPU-only builds). | |
| Protected Member Functions inherited from Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource > | |
| void | incrementBatch () noexcept |
| Increments current batch counter. | |
| void | setCurrentBatch (int64_t batch_index) noexcept |
| Updates current batch counter. | |
Token sequence loader for autoregressive language models.
Loads tokenized text data for causal language modeling tasks such as GPT, LLaMA, and other transformer-based models. Reads from pre-tokenized binary .tokens files and produces batches of (input, target) sequence pairs where target[i] = input[i+1] (next-token prediction).
The implementation streams tokens from disk using a double-buffered producer-consumer pattern, which supports high-throughput training on large corpora.
Template Parameters
| TMemoryResource | CpuMemoryResource or CudaPinnedMemoryResource |
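The next-token relationship described above can be illustrated with a minimal standalone sketch. It does not use Mila; `makeShiftedPair` is a hypothetical helper introduced only for illustration, and it assumes each sequence is cut from `seq_length + 1` consecutive tokens so that the last input position still has a target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Mila: cuts seq_length + 1 consecutive
// tokens into an (input, target) pair for next-token prediction.
std::pair<std::vector<std::int32_t>, std::vector<std::int32_t>>
makeShiftedPair(const std::vector<std::int32_t>& tokens,
                std::size_t start, std::size_t seq_length) {
    // One extra token is required so the last input position has a target.
    assert(start + seq_length + 1 <= tokens.size());
    const auto first = tokens.begin() + static_cast<std::ptrdiff_t>(start);
    const auto len = static_cast<std::ptrdiff_t>(seq_length);
    std::vector<std::int32_t> input(first, first + len);           // tokens[start .. start+T)
    std::vector<std::int32_t> target(first + 1, first + len + 1);  // same run shifted by one
    return {std::move(input), std::move(target)};
}
```

For the tokens {10, 11, 12, 13, 14} with start 0 and seq_length 4, the input is {10, 11, 12, 13} and the target is {11, 12, 13, 14}.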
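| TokenSequenceLoader (const std::filesystem::path &tokens_file, int64_t batch_size, int64_t seq_length, bool is_training, DeviceId device, const TokenSequenceLoaderConfig &config=TokenSequenceLoaderConfig()) |
inline export |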
Constructs streaming autoregressive sequence loader.
Parameters
| tokens_file | Path to binary .tokens file (uint32_t format) |
| batch_size | Number of sequences per batch |
| seq_length | Context window length (tokens per sequence) |
| is_training | Enable shuffling and continuous epochs |
| device | Compute device for tensor allocation |
| config | Performance and streaming configuration |
Exceptions
| std::invalid_argument | If batch_size or seq_length is zero |
| std::runtime_error | If file operations or initialization fails |
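The documented exceptions suggest an argument check followed by a file check. The sketch below is one plausible shape for that, not the actual constructor body. `countTokensChecked` is a hypothetical name. Rejecting negative values as well as zero is an assumption, as is deriving the token count from the file size at 4 bytes per token.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical helper, not part of Mila: validates the batch shape and
// returns how many uint32_t tokens the file holds.
std::size_t countTokensChecked(const std::filesystem::path& tokens_file,
                               std::int64_t batch_size, std::int64_t seq_length) {
    // Assumption: negative values are rejected together with zero.
    if (batch_size <= 0 || seq_length <= 0) {
        throw std::invalid_argument("batch_size and seq_length must be positive");
    }
    std::error_code ec;
    const std::uintmax_t bytes = std::filesystem::file_size(tokens_file, ec);
    if (ec) {
        throw std::runtime_error("cannot read tokens file size: " + ec.message());
    }
    // Assumption: the file is a raw array of uint32_t with no header.
    if (bytes % sizeof(std::uint32_t) != 0) {
        throw std::runtime_error("tokens file size is not a multiple of 4 bytes");
    }
    const auto count = static_cast<std::size_t>(bytes / sizeof(std::uint32_t));
    assert(count * sizeof(std::uint32_t) == bytes);
    return count;
}
```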


| ~TokenSequenceLoader () noexcept |
inline export noexcept |

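| TokenSequenceLoader (const TokenSequenceLoader &)=delete |
export delete |

| TokenSequenceLoader (TokenSequenceLoader &&)=delete |
export delete |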

| void allocateBuffers () |
inline export private |


| void cleanupBuffers () noexcept |
inline export private noexcept |

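| void fillBatch (const TokenId *window_buffer, size_t batch_idx, HostType *input_dest, HostType *target_dest) |
inline export private |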
Fills a batch from the current window buffer.
Creates non-overlapping sequences where target[i] = input[i+1].
Parameters
| window_buffer | Source tokens for current window |
| batch_idx | Batch index within current window |
| input_dest | Destination for input sequences |
| target_dest | Destination for target sequences |
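A minimal sketch of how such a fill step might look, under stated assumptions. `fillBatchSketch` is a hypothetical stand-in that takes the batch shape and window length as explicit arguments, whereas the real member reads them from the loader. It assumes input sequences are laid out back to back with a stride of seq_length tokens, and that the window holds one token past the last input so the final target exists. The `TokenId` and `HostType` aliases are assumed to be `uint32_t` and `int32_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using TokenId = std::uint32_t;   // assumption: matches the uint32_t file format
using HostType = std::int32_t;   // assumption: host type for TensorDataType::INT32

// Hypothetical stand-in for fillBatch, not the Mila implementation.
void fillBatchSketch(const TokenId* window_buffer, std::size_t window_tokens,
                     std::size_t batch_idx, std::size_t batch_size,
                     std::size_t seq_length,
                     HostType* input_dest, HostType* target_dest) {
    for (std::size_t b = 0; b < batch_size; ++b) {
        // Non-overlapping inputs: each sequence starts seq_length tokens
        // after the previous one.
        const std::size_t start = (batch_idx * batch_size + b) * seq_length;
        assert(start + seq_length + 1 <= window_tokens);
        for (std::size_t t = 0; t < seq_length; ++t) {
            input_dest[b * seq_length + t] =
                static_cast<HostType>(window_buffer[start + t]);
            // The target is the token that follows the input token.
            target_dest[b * seq_length + t] =
                static_cast<HostType>(window_buffer[start + t + 1]);
        }
    }
}
```

With a 9-token window 0..8, batch_size 2 and seq_length 4, batch 0 receives inputs 0..7 and targets 1..8 in row-major order.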


| void initializeDataset () |
inline export private |


| const TensorType & inputs () const override |
inline override export virtual |
Provides immutable access to input tensor for current batch.
Derived classes must implement this method to provide read-only access to the tensor containing input data for the currently loaded batch.
Implements Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource >.
| TensorType & inputs () override |
inline override export virtual |
Provides mutable access to input tensor for current batch.
Derived classes must implement this method to provide access to the tensor containing input data for the currently loaded batch. The tensor should be properly shaped and contain valid data after nextBatch() call.
Implements Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource >.
| void loadWindowFromFile (std::ifstream &file, TokenId *buffer, size_t window_idx) |
inline export private |
Loads a window from the token file.
Parameters
| file | Input file stream |
| buffer | Destination buffer (must have space for window_size_tokens_) |
| window_idx | Which window to load |
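Loading a window by index typically amounts to a seek followed by a bulk read. The sketch below shows that idea with standard streams only. `loadWindowSketch` and `writeTokensSketch` are hypothetical names, the window size is passed explicitly instead of coming from `window_size_tokens_`, and returning the number of tokens actually read (to cover a short final window) is an assumption about how a partial read would be handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <vector>

using TokenId = std::uint32_t;  // assumption: raw uint32_t tokens, no header

// Hypothetical helper: writes tokens as a raw binary array.
void writeTokensSketch(const std::filesystem::path& path,
                       const std::vector<TokenId>& tokens) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(tokens.data()),
              static_cast<std::streamsize>(tokens.size() * sizeof(TokenId)));
    out.flush();
    if (!out) throw std::runtime_error("failed to write tokens file");
}

// Hypothetical stand-in for loadWindowFromFile. Returns the number of
// tokens read, which is smaller than window_size_tokens for a short tail.
std::size_t loadWindowSketch(std::ifstream& file, TokenId* buffer,
                             std::size_t window_idx,
                             std::size_t window_size_tokens) {
    assert(buffer != nullptr);
    file.clear();  // drop a stale EOF flag left by an earlier short read
    const auto offset = static_cast<std::streamoff>(
        window_idx * window_size_tokens * sizeof(TokenId));
    file.seekg(offset, std::ios::beg);
    if (!file) throw std::runtime_error("seek to window failed");
    file.read(reinterpret_cast<char*>(buffer),
              static_cast<std::streamsize>(window_size_tokens * sizeof(TokenId)));
    return static_cast<std::size_t>(file.gcount()) / sizeof(TokenId);
}
```

Opening the stream with `std::ios::binary` matters here, since text-mode translation would corrupt the byte offsets on some platforms.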

| void nextBatch () override |
inline override export virtual |
Loads the next batch of data from the dataset.
Derived classes must implement this method to load the next batch of data into the input and target tensors. Implementation should handle data preprocessing, memory allocation, and batch composition according to the specific dataset requirements.
Exceptions
| std::runtime_error | If no more batches are available |
| std::runtime_error | If data loading fails |
Implements Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource >.

| int64_t numBatches () const override |
inline override export virtual |
Returns the total number of batches in the dataset.
Derived classes must implement this method to report the total number of batches available in their specific dataset. This information is essential for training loop progress tracking and epoch management.
Implements Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource >.
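How the batch count follows from the token count is not stated in this reference. The sketch below shows one plausible derivation, using names that mirror the private attributes (num_windows_, sequences_per_window_, batches_per_window_). Every formula in it is an assumption: only full windows are counted, one token per window is reserved for the final target, and incomplete batches are dropped.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of the derived counts, not the Mila implementation.
struct BatchPlanSketch {
    std::size_t num_windows;
    std::size_t sequences_per_window;
    std::size_t batches_per_window;
    std::size_t num_batches;
};

BatchPlanSketch planBatchesSketch(std::size_t num_tokens,
                                  std::size_t window_size_tokens,
                                  std::size_t batch_size,
                                  std::size_t seq_length) {
    assert(batch_size > 0 && seq_length > 0);
    assert(window_size_tokens > seq_length);
    BatchPlanSketch plan{};
    // Assumption: a trailing partial window is ignored.
    plan.num_windows = num_tokens / window_size_tokens;
    // Assumption: one token is reserved so the last sequence has a target.
    plan.sequences_per_window = (window_size_tokens - 1) / seq_length;
    // Assumption: sequences that do not fill a whole batch are dropped.
    plan.batches_per_window = plan.sequences_per_window / batch_size;
    plan.num_batches = plan.num_windows * plan.batches_per_window;
    return plan;
}
```

Under these assumptions, 1000 tokens with 129-token windows, batch_size 4 and seq_length 8 give 7 windows, 16 sequences per window, 4 batches per window and 28 batches in total.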
| size_t numTokens () const |
inline export |
| size_t numWindows () const |
inline export |
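| TokenSequenceLoader & operator= (const TokenSequenceLoader &)=delete |
export delete |

| TokenSequenceLoader & operator= (TokenSequenceLoader &&)=delete |
export delete |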

| void prepareSequenceIndices () |
inline export private |


| void producerThreadFunc () noexcept |
inline export private noexcept |
Producer thread: streams windows from disk and fills batches.
Workflow:
Exception safety: Catches all exceptions and stores them for consumer.
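The double-buffered handoff and the stored-exception behaviour can be sketched with standard threading primitives. This is a simplified illustration, not the Mila producer: `DoubleBufferedSource` is a hypothetical class, batches are plain `std::vector<int>` rather than tensors, and the fill callback stands in for disk streaming. The member names loosely follow the private attributes listed above (back_buffer_ready_, cv_consumer_, cv_producer_, producer_exception_, stop_).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <exception>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: the producer fills the back buffer while the
// consumer reads the front buffer; next() swaps them.
class DoubleBufferedSource {
public:
    using Batch = std::vector<int>;
    using FillFn = std::function<void(std::size_t, Batch&)>;

    DoubleBufferedSource(std::size_t num_batches, FillFn fill)
        : num_batches_(num_batches), fill_(std::move(fill)) {
        assert(fill_ && "fill callback must be callable");
        producer_ = std::thread([this] { produce(); });
    }
    DoubleBufferedSource(const DoubleBufferedSource&) = delete;
    DoubleBufferedSource& operator=(const DoubleBufferedSource&) = delete;

    ~DoubleBufferedSource() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_producer_.notify_all();
        producer_.join();
    }

    // Blocks until the back buffer is ready, then swaps it to the front.
    // Calling this more than num_batches times would block indefinitely.
    const Batch& next() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_consumer_.wait(lock, [this] { return back_ready_ || error_; });
        if (error_) std::rethrow_exception(error_);  // surface producer failure
        std::swap(front_, back_);
        back_ready_ = false;
        lock.unlock();
        cv_producer_.notify_one();
        return front_;
    }

private:
    void produce() noexcept {
        try {
            for (std::size_t i = 0; i < num_batches_; ++i) {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_producer_.wait(lock, [this] { return !back_ready_ || stop_; });
                if (stop_) return;
                lock.unlock();
                fill_(i, back_);  // only the producer touches back_ while it is not ready
                lock.lock();
                back_ready_ = true;
                lock.unlock();
                cv_consumer_.notify_one();
            }
        } catch (...) {
            // Store the exception so the consumer can rethrow it.
            std::lock_guard<std::mutex> lock(mutex_);
            error_ = std::current_exception();
            cv_consumer_.notify_one();
        }
    }

    std::size_t num_batches_;
    FillFn fill_;
    Batch front_, back_;
    bool back_ready_ = false;
    bool stop_ = false;
    std::exception_ptr error_;
    std::mutex mutex_;
    std::condition_variable cv_consumer_, cv_producer_;
    std::thread producer_;
};
```

The sketch keeps the ready flag under the mutex for simplicity, whereas the attribute list above shows atomics. Either approach can work, and the choice mostly affects how much the consumer's fast path has to lock.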


| void reset () override |
inline override export virtual |
Resets the loader to the beginning of the dataset.
Resets the internal state to start iteration from the first batch. Derived classes may override this method to implement additional reset functionality such as dataset reshuffling or preprocessing pipeline reinitialization.
Reimplemented from Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource >.
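A reset that reshuffles in training mode could look like the sketch below. It is an illustration under assumptions, not the loader's actual reset(): `EpochStateSketch` and `resetSketch` are hypothetical, the random engine is supplied by the caller, and evaluation mode is assumed to restore sequential order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical per-epoch state, loosely mirroring sequence_indices_ and
// the current batch counter.
struct EpochStateSketch {
    std::vector<std::size_t> sequence_indices;
    std::size_t current_batch = 0;
};

// Hypothetical reset: rewinds to the first batch and rebuilds the index
// order, shuffled when training and sequential otherwise.
void resetSketch(EpochStateSketch& state, std::size_t num_sequences,
                 bool is_training, std::mt19937_64& rng) {
    assert(num_sequences > 0);
    state.current_batch = 0;
    state.sequence_indices.resize(num_sequences);
    std::iota(state.sequence_indices.begin(), state.sequence_indices.end(),
              std::size_t{0});
    if (is_training) {
        std::shuffle(state.sequence_indices.begin(),
                     state.sequence_indices.end(), rng);
    }
}
```

Passing the engine in keeps runs reproducible when the caller seeds it explicitly.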

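| int64_t sequenceLength () const |
inline export |
| void shuffleSequenceIndices () |
inline export private |

| void swapBuffers () noexcept |
inline export private noexcept |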

| const TensorType & targets () const override |
inline override export virtual |
Provides immutable access to target tensor for current batch.
Derived classes must implement this method to provide read-only access to the tensor containing target/label data for the currently loaded batch.
Implements Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource >.
| TensorType & targets () override |
inline override export virtual |
Provides mutable access to target tensor for current batch.
Derived classes must implement this method to provide access to the tensor containing target/label data for the currently loaded batch. The tensor should contain ground truth data corresponding to the inputs.
Implements Mila::Data::DataLoader< TensorDataType::INT32, TensorDataType::INT32, TMemoryResource >.
| static DeviceId validateDeviceId (DeviceId device) |
inline static export private |


| size_t windowSizeTokens () const |
inline export |