Mila 0.13.48
Deep Neural Network Library
| NMila | Mila main API namespace |
| NCore | |
| CRandomGenerator | Singleton class providing centralized random number generation |
| NData | |
| CBpeTokenizer | Unified BPE tokenizer targeting GPT-2, Llama 3.x, and Mistral model families |
| CBpeTrainer | Corpus accumulator and trainer for BPE vocabularies |
| CBpeVocabulary | Unified Byte Pair Encoding (BPE) vocabulary |
| CPairHash | |
| CPairViewHash | |
| CBpeVocabularyConfig | Configuration for the BPE vocabulary |
| CCharTokenizer | Character-level tokenizer |
| CCharTrainer | Character-level tokenizer trainer |
| CCharVocabulary | Character vocabulary for tokenization |
| CCharVocabularyConfig | Configuration for character-level tokenizer training |
| CDataLoader | Device-agnostic data loader interface using abstract tensor data types |
| CMilaFileHeader | Common file header for Mila data files |
| CSerializationMetadata | Type-safe metadata container for component serialization |
| CSpecialTokens | Configuration for special tokens across all tokenizer types |
| CTokenizer | |
| CTokenizerTrainer | Abstract interface for training tokenizer vocabularies from text corpora |
| CTokenizerVocabulary | Generic tokenizer vocabulary interface |
| CTokenSequenceLoader | Token sequence loader for autoregressive language models |
| CTokenSequenceLoaderConfig | Configuration for StreamingSequenceLoader behavior |
| CTrainerFactory | Factory for creating tokenizer trainers and loading vocabularies |
| NDnn | |
| NCompute | |
| NCpu | |
| CFillOps | CPU specialization of TensorOps for initialization operations |
| CMathOps | CPU specialization of TensorOps for mathematical operations |
| CTransferOps | CPU specialization of TensorOps for transfer operations |
| CZeroOps | |
| NCuda | |
| NDetail | |
| Ccuda_structural_kernels | |
| Ccuda_structural_kernels< float > | |
| Ccuda_structural_kernels< nv_bfloat16 > | |
| NGelu | |
| NDetail | |
| Ccuda_gelu_impl | |
| Ccuda_gelu_impl< float > | |
| Ccuda_gelu_impl< half > | |
| CCudaGeluOp | CUDA implementation of the GELU activation function for neural networks |
| CCudaGeluOpRegistrar | Class responsible for registering the CudaGeluOp operation |
| NGqa | |
| NDetail | |
| Ccuda_gqa_kernels | |
| Ccuda_gqa_kernels< float > | |
| Ccuda_gqa_kernels< nv_bfloat16 > | |
| CCudaGqaOp | CUDA Grouped-Query Attention operation |
| CCudaGroupedQueryAttentionOpRegistrar | |
| NLayerNorm | |
| NDetail | |
| Ccuda_layernorm_impl | CUDA kernel dispatcher for LayerNorm operations |
| Ccuda_layernorm_impl< float > | |
| Ccuda_layernorm_impl< half > | |
| CCudaLayerNormOp | CUDA implementation of Layer Normalization |
| CCudaLayerNormOpRegistrar | |
| NLinear | |
| NDetail | |
| Ccuda_matmul_impl | CUDA kernel dispatcher for Linear operations |
| Ccuda_matmul_impl< float > | |
| Ccuda_matvec_impl | CUDA kernel dispatcher for matrix-vector multiply (M=1 decode path) |
| Ccuda_matvec_impl< float, float > | |
| Ccuda_matvec_impl< nv_bfloat16, __nv_fp8_e4m3 > | |
| Ccuda_matvec_impl< nv_bfloat16, nv_bfloat16 > | |
| CCudaLinearOp | CUDA Linear operation with compile-time weight quantization policy dispatch |
| CCudaLinearOpRegistrar | |
| NLpe | |
| NDetail | |
| Ccuda_lpe_impl | CUDA kernel dispatcher for Lpe forward, backward, and positional decode |
| Ccuda_lpe_impl< float > | FP32 specialization of the Lpe CUDA kernel dispatcher |
| Ccuda_lpe_impl< half > | FP16 specialization of the Lpe CUDA kernel dispatcher |
| CCudaLpeOp | CUDA implementation of the Lpe (token + positional embedding) operation |
| CCudaLpeOpRegistrar | |
| NMatMulBiasGelu | |
| CCudaMatMulBiasGeluOp | CUDA implementation of the fused MatMul-Bias-GELU operation |
| CCudaMatMulBiasGeluOpRegistrar | Class responsible for registering the CudaMatMulBiasGeluOp operation |
| NMultiHeadAttention | |
| NDetail | |
| Ccuda_mha_kernels | CUDA kernel dispatcher for attention non-matmul operations |
| Ccuda_mha_kernels< float > | |
| Ccuda_mha_kernels< half > | |
| CCudaMultiHeadAttentionOp | CUDA implementation of Multi-Head Attention using column-major cuBLASLt optimization |
| CCudaMultiHeadAttentionOpRegistrar | |
| NResidual | |
| NDetail | CUDA residual kernel dispatch implementations |
| Ccuda_residual_impl | |
| Ccuda_residual_impl< float > | |
| Ccuda_residual_impl< nv_bfloat16 > | |
| CCudaResidualOp | CUDA Residual operation implementing the BinaryOperation interface |
| CCudaResidualOpRegistrar | |
| NRmsNorm | |
| NDetail | |
| Ccuda_rmsnorm_impl | CUDA kernel dispatcher for RMSNorm operations |
| Ccuda_rmsnorm_impl< float > | |
| Ccuda_rmsnorm_impl< nv_bfloat16 > | |
| CCudaRmsNormOp | CUDA implementation of RMS Normalization |
| CCudaRmsNormOpRegistrar | |
| NRope | |
| NDetail | |
| Ccuda_rope_impl | CUDA kernel dispatcher for RoPE forward, backward, cache build, and positional decode |
| Ccuda_rope_impl< __nv_bfloat16 > | |
| Ccuda_rope_impl< float > | |
| CCudaRopeOp | CUDA implementation of the Rope (rotary positional embedding) operation |
| CCudaRopeOpRegistrar | |
| CRopeCacheRegistry | Process-wide shared cache for RoPE cos/sin frequency tables |
| CAcquireResult | |
| CCacheEntry | |
| CCacheKey | |
| CCacheKeyHash | |
| NSoftmax | |
| NDetail | Namespace for CUDA softmax implementation details |
| Ccuda_softmax_impl | |
| Ccuda_softmax_impl< float > | |
| Ccuda_softmax_impl< half > | |
| CCudaSoftmaxOp | CUDA implementation of Softmax operation using abstract TensorDataType API |
| CCudaSoftmaxOpRegistrar | Class responsible for registering the CudaSoftmaxOp operation |
| NSoftmaxCrossEntropy | |
| NDetail | Namespace for CUDA fused softmax cross entropy implementation details |
| Ccuda_softmax_crossentropy_impl | CUDA kernel dispatcher for SoftmaxCrossEntropy operations |
| Ccuda_softmax_crossentropy_impl< float > | |
| Ccuda_softmax_crossentropy_impl< half > | |
| CCudaSoftmaxCrossEntropyOp | Fused CUDA implementation of Softmax + CrossEntropy using abstract TensorDataType API |
| CCudaSoftmaxCrossEntropyOpRegistrar | Registrar for fused Softmax+CrossEntropy CUDA operation |
| NSwiglu | |
| NDetail | |
| Ccuda_swiglu_impl | |
| Ccuda_swiglu_impl< __nv_bfloat16 > | |
| Ccuda_swiglu_impl< float > | |
| CCudaSwigluOp | |
| CCudaSwigluOpRegistrar | |
| NTokenEmbedding | |
| NDetail | |
| Ccuda_token_embedding_impl | |
| Ccuda_token_embedding_impl< __nv_bfloat16 > | |
| Ccuda_token_embedding_impl< float > | |
| CCudaTokenEmbeddingOp | |
| CCudaTokenEmbeddingOpRegistrar | |
| CCublasLtLinearPlan | RAII wrapper owning cuBLASLt descriptors for a Linear matmul |
| CCublasLtMatMulPlan | RAII wrapper owning cuBLASLt descriptors and the selected heuristic algorithm |
| CCublasLtPlanCache | Generic plan cache keyed on batch size bucket |
| CCudaDataTypeTraits | Compile-time mapping from TensorDataType -> cudaDataType_t |
| CCudaDataTypeTraits< TensorDataType::BF16 > | |
| CCudaDataTypeTraits< TensorDataType::FP16 > | |
| CCudaDataTypeTraits< TensorDataType::FP32 > | |
| CCudaDataTypeTraits< TensorDataType::FP8_E4M3 > | |
| CCudaDataTypeTraits< TensorDataType::FP8_E5M2 > | |
| CCudaDataTypeTraits< TensorDataType::INT32 > | |
| CCudaDataTypeTraits< TensorDataType::INT8 > | |
| CFillOps | CUDA specialization of TensorOps for initialization operations |
| CMathOps | CUDA specialization of TensorOps for mathematical operations |
| CRandomOps | |
| CStructuralOps | |
| CTensorDataTypeMap | Compile-time mapping from abstract TensorDataType -> CUDA native device type |
| CTensorDataTypeMap< TensorDataType::BF16 > | Maps TensorDataType::BF16 to CUDA __nv_bfloat16 |
| CTensorDataTypeMap< TensorDataType::FP16 > | Maps TensorDataType::FP16 to CUDA __half |
| CTensorDataTypeMap< TensorDataType::FP32 > | Maps TensorDataType::FP32 to CUDA float |
| CTensorDataTypeMap< TensorDataType::FP4_E2M1 > | Maps TensorDataType::FP4_E2M1 to std::uint8_t |
| CTensorDataTypeMap< TensorDataType::FP4_E3M0 > | Maps TensorDataType::FP4_E3M0 to std::uint8_t |
| CTensorDataTypeMap< TensorDataType::FP8_E4M3 > | Maps TensorDataType::FP8_E4M3 to CUDA __nv_fp8_e4m3 |
| CTensorDataTypeMap< TensorDataType::FP8_E5M2 > | Maps TensorDataType::FP8_E5M2 to CUDA __nv_fp8_e5m2 |
| CTensorDataTypeMap< TensorDataType::INT16 > | Maps TensorDataType::INT16 to std::int16_t |
| CTensorDataTypeMap< TensorDataType::INT32 > | Maps TensorDataType::INT32 to std::int32_t |
| CTensorDataTypeMap< TensorDataType::INT8 > | Maps TensorDataType::INT8 to std::int8_t |
| CTensorDataTypeMap< TensorDataType::UINT16 > | Maps TensorDataType::UINT16 to std::uint16_t |
| CTensorDataTypeMap< TensorDataType::UINT32 > | Maps TensorDataType::UINT32 to std::uint32_t |
| CTensorDataTypeMap< TensorDataType::UINT8 > | Maps TensorDataType::UINT8 to std::uint8_t |
| CTransferOps | CUDA specialization of TensorOps for tensor transfer operations |
| CZeroOps | |
| Calways_false | |
| CBinaryOperation | |
| CCpuAdamWOptimizer | CPU-specific AdamW optimizer implementation |
| CCpuAttentionOp | CPU implementation of Multi-Head Attention operation |
| CCpuAttentionOpRegistrar | |
| CCpuCrossEntropyOp | CPU implementation of the cross entropy loss operation for neural networks |
| CCpuCrossEntropyOpRegistrar | Class responsible for registering the CpuCrossEntropyOp operation |
| CCpuDevice | Class representing a CPU compute device |
| CCpuDeviceRegistrar | CPU device plugin for device-agnostic registration |
| CCpuEncoderOp | CPU implementation of the Encoder operation |
| CCpuEncoderOpRegistrar | Registrar for CpuEncoderOp operation |
| CCpuGeluOp | CPU implementation of GELU activation operation using abstract TensorDataType |
| CCpuGeluOpRegistrar | Class responsible for registering CPU GELU operations |
| CCpuLayerNormOp | CPU implementation of Layer Normalization using abstract TensorDataType API |
| CCpuLayerNormOpRegistrar | |
| CCpuLinearOp | CPU implementation of Linear operation using abstract TensorDataType API |
| CCpuLinearOpRegistrar | |
| CCpuMemoryResource | CPU memory resource for host-accessible memory allocation |
| CCpuResidualOp | CPU Residual operation (FP32) implementing BinaryOperation interface |
| CCpuResidualOpRegistrar | Registrar for CPU Residual operation (FP32) |
| CCpuSoftmaxCrossEntropyOp | Fused CPU implementation of Softmax + CrossEntropy using abstract TensorDataType API |
| CCpuSoftmaxCrossEntropyOpRegistrar | Registrar for fused Softmax+CrossEntropy operation |
| CCpuSoftmaxOp | CPU implementation of Softmax using abstract TensorDataType API |
| CCpuSoftmaxOpRegistrar | |
| CCublasLtError | |
| CCudaAdamWOptimizer | CUDA-specific AdamW optimizer implementation |
| CCudaBadAlloc | |
| CCudaDataTypeMap | Helper struct to map C++ types to CUDA data types for cuBLASLt |
| CCudaDataTypeMap< __nv_bfloat16 > | |
| CCudaDataTypeMap< float > | |
| CCudaDataTypeMap< half > | |
| CCudaDevice | Class representing a CUDA compute device instance |
| CCudaDeviceMemoryResource | CUDA device memory resource for GPU-accessible memory allocation |
| CCudaDeviceProps | Wrapper for CUDA device properties with cached values |
| CCudaDeviceRegistrar | CUDA device registrar for device-agnostic registration |
| CCudaError | Exception class for CUDA runtime errors |
| CCudaManagedMemoryResource | CUDA managed memory resource for unified host/device accessible memory |
| CCudaPinnedMemoryResource | CUDA pinned memory resource for fast host/device transfer memory |
| CCudaTimer | GPU-accurate interval timer using a CUDA event pair |
| CDevice | Abstract interface for compute device implementations |
| CDeviceAccessible | |
| CDeviceConstructionKey | Construction key for device factories |
| CDeviceId | Lightweight identifier for a compute device |
| CDeviceRegistrar | Device-agnostic registrar for automatic device discovery and registration |
| CDeviceRegistry | Registry of discovered compute devices with lazy instantiation |
| CDeviceTypeTraits | |
| CDeviceTypeTraits< DeviceType::Cpu > | DeviceTypeTraits specialization for the CPU device |
| CDeviceTypeTraits< DeviceType::Cuda > | DeviceTypeTraits specialization for the CUDA device |
| CExecutionContext | Templated execution context for device-specific operations |
| CExecutionContext< DeviceType::Cpu > | CPU execution context specialization |
| CExecutionContext< DeviceType::Cuda > | CUDA execution context specialization |
| CExecutionContext< DeviceType::Metal > | Metal execution context specialization |
| CExecutionContext< DeviceType::Vulkan > | Vulkan execution context specialization |
| CGqaState | Non-owning pointers to shared transient GQA scratch buffers |
| CHostAccessible | |
| CIExecutionContext | Type-erased execution context interface |
| CIKvCacheLifecycle | Capability interface for KV-cache state management |
| CIKvInference | Compute interface for attention operations that maintain a KV cache |
| CIPackedKvInference | KV-cache inference interface for packed-QKV MHA backends |
| CIPositionalDecode | Capability interface for position-dependent unary operations |
| CIPositionalPairedOp | Capability interface for position-dependent paired operations |
| CLinearOpTypeMap< DeviceType::Cpu, TensorDataType::FP32 > | |
| CMemoryResource | Clean memory resource abstraction for device-specific memory allocation |
| CMemoryResourceTraits | Memory resource traits for compile-time dispatch optimization |
| CMemoryResourceTraits< CpuMemoryResource > | CPU-specific memory resource traits providing detailed CPU backend characteristics |
| CMemoryResourceTraits< CudaDeviceMemoryResource > | CUDA device memory resource traits providing detailed GPU backend characteristics |
| CMemoryResourceTraits< CudaManagedMemoryResource > | CUDA managed memory resource traits providing unified memory characteristics |
| CMemoryResourceTraits< CudaPinnedMemoryResource > | CUDA pinned memory resource traits providing fast transfer characteristics |
| CMemoryStats | Global memory statistics for all TrackedMemoryResource instances |
| CMetalDevice | Class representing a Metal compute device instance |
| CMetalDevicePlugin | Metal device plugin for device-agnostic registration |
| CMetalMemoryResource | Stub implementation for non-Apple platforms |
| COperation | |
| COperationRegistry | Central registry for typed, device-aware compute operations |
| CTypeID | Composite key for registry lookup |
| CTypeIDHash | |
| COperationsRegistrar | Class to manage compute operations initialization |
| COperationTraits | Primary traits template for unified compile-time operation dispatch |
| COperationTraits< OperationType::CrossEntropyOp, DeviceType::Cuda, TensorDataType::BF16, void > | |
| COperationTraits< OperationType::CrossEntropyOp, DeviceType::Cuda, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::GeluOp, DeviceType::Cpu, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::GeluOp, DeviceType::Cuda, TensorDataType::BF16, void > | |
| COperationTraits< OperationType::GeluOp, DeviceType::Cuda, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::GroupedQueryAttentionOp, DeviceType::Cuda, TensorDataType::BF16, NoKvCompression > | Unquantized BF16 path. No KV cache compression. Standard inference precision |
| COperationTraits< OperationType::GroupedQueryAttentionOp, DeviceType::Cuda, TensorDataType::FP32, NoKvCompression > | Unquantized FP32 path. No KV cache compression |
| COperationTraits< OperationType::LinearOp, DeviceType::Cpu, TensorDataType::FP32, NoWeightQuant > | |
| COperationTraits< OperationType::LinearOp, DeviceType::Cuda, TensorDataType::BF16, NoWeightQuant > | Unquantized BF16 path. Standard inference precision |
| COperationTraits< OperationType::LinearOp, DeviceType::Cuda, TensorDataType::BF16, PerChannelFp8<> > | FP8 per-channel quantized BF16 path. Requires SM >= 8.0 (Ampere+) |
| COperationTraits< OperationType::LinearOp, DeviceType::Cuda, TensorDataType::BF16, PerGroupFp4< 128 > > | FP4 E2M1 per-group quantized BF16 path. W4A16 fused GEMM with E2M1 decode, group_size=128. Requires SM >= 8.0 |
| COperationTraits< OperationType::LinearOp, DeviceType::Cuda, TensorDataType::BF16, PerGroupFp4< 64 > > | FP4 E2M1 per-group quantized BF16 path. W4A16 fused GEMM with E2M1 decode, group_size=64. Requires SM >= 8.0 |
| COperationTraits< OperationType::LinearOp, DeviceType::Cuda, TensorDataType::BF16, PerGroupInt4< 128 > > | INT4 per-group quantized BF16 path. W4A16 fused GEMM, group_size=128. Requires SM >= 8.0 |
| COperationTraits< OperationType::LinearOp, DeviceType::Cuda, TensorDataType::BF16, PerGroupInt4< 64 > > | INT4 per-group quantized BF16 path. W4A16 fused GEMM, group_size=64. Requires SM >= 8.0 |
| COperationTraits< OperationType::LinearOp, DeviceType::Cuda, TensorDataType::FP32, NoWeightQuant > | Unquantized FP32 path. Retained for validation and reference |
| COperationTraits< OperationType::LpeOp, DeviceType::Cpu, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::LpeOp, DeviceType::Cuda, TensorDataType::BF16, void > | |
| COperationTraits< OperationType::LpeOp, DeviceType::Cuda, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::MultiHeadAttentionOp, DeviceType::Cpu, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::MultiHeadAttentionOp, DeviceType::Cuda, TensorDataType::BF16, void > | |
| COperationTraits< OperationType::MultiHeadAttentionOp, DeviceType::Cuda, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::ResidualOp, DeviceType::Cpu, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::ResidualOp, DeviceType::Cuda, TensorDataType::BF16, void > | |
| COperationTraits< OperationType::ResidualOp, DeviceType::Cuda, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::RmsNormOp, DeviceType::Cuda, TensorDataType::BF16, void > | |
| COperationTraits< OperationType::RmsNormOp, DeviceType::Cuda, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::RopeOp, DeviceType::Cuda, TensorDataType::BF16, void > | |
| COperationTraits< OperationType::RopeOp, DeviceType::Cuda, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::SoftmaxOp, DeviceType::Cpu, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::SoftmaxOp, DeviceType::Cuda, TensorDataType::BF16, void > | |
| COperationTraits< OperationType::SoftmaxOp, DeviceType::Cuda, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::SwigluOp, DeviceType::Cuda, TensorDataType::BF16, void > | |
| COperationTraits< OperationType::SwigluOp, DeviceType::Cuda, TensorDataType::FP32, void > | |
| COperationTraits< OperationType::TokenEmbeddingOp, DeviceType::Cuda, TensorDataType::BF16, void > | |
| COperationTraits< OperationType::TokenEmbeddingOp, DeviceType::Cuda, TensorDataType::FP32, void > | |
| COptimizer | Abstract base class for parameter optimizers |
| CPairedOperation | Abstract base for paired operations: two inputs -> two outputs |
| CTrackedMemoryResource | A memory resource wrapper that tracks allocation and deallocation statistics |
| CUnaryOperation | |
| CVulkanDevice | Class representing a Vulkan compute device instance |
| CVulkanMemoryResource | Stub implementation for platforms without Vulkan support |
| NDetail | |
| Cmlp_activation_impl | |
| Cmlp_activation_impl< ActivationType::Gelu, TDeviceType, TPrecision > | |
| Cmlp_activation_impl< ActivationType::Swiglu, TDeviceType, TPrecision > | |
| NExtensibility | |
| CIModulePlugin | |
| CPluginInfo | Plugin metadata |
| CPluginManager | Manages loading and querying of module plugins |
| CPluginEntry | |
| NOptimizers | |
| CAdamWConfig | Configuration for AdamW optimizer |
| CAdamWOptimizer | Device-agnostic AdamW optimizer |
| CSerializationMetadata | Type-safe metadata container for component serialization |
| NQuant | |
| NKvCache | |
| CNoKvCompression | |
| CPerChannelKvFp8 | Symmetric per-head per-token FP8 KV cache compression policy |
| NWeight | |
| CNoWeightQuant | |
| CPerChannelFp8 | |
| CPerGroupFp4 | |
| CPerGroupInt4 | |
| NSerialization | |
| CArchiveSerializer | Interface for hierarchical archive serializers |
| CITensorBlob | Type-erased interface for a serialized tensor blob |
| CModelArchive | ModelArchive provides high-level helpers for component serialization |
| CScopedScope | |
| CPretrainedMetadata | Metadata for pretrained model |
| CPretrainedModelReader | Reader for Mila pretrained binary format |
| CSerializationMetadata | Type-safe metadata container for component serialization |
| CSerializer | Minimal base interface for model serialization backends |
| CTensorBlob | Concrete tensor blob owning a TensorBuffer-backed raw byte buffer |
| CTensorBlobMetadata | Metadata for a tensor blob in pretrained model format |
| CTensorMetadata | Metadata describing a tensor in serialized form |
| CZipSerializer | ZIP archive serializer built on miniz |
| NVisualization | |
| CBlockVisualizer | |
| CColorLUT | |
| CFramebuffer | |
| CLayerNormVisualizer | |
| CMLPVisualizer | |
| CModuleVisualizer | |
| CRect | |
| CRGB | |
| CVisualizerContext | |
| CAxisPartition | Information about axis partitioning of a tensor |
| CBufferedTokenStreamer | Buffers BufSize tokens before forwarding a contiguous span to Sink |
| CBuildContext | Build-time context for Component::build() |
| CComponent | Abstract base class for neural network components |
| CComponentConfig | Abstract base for component configuration objects |
| CComponentFactory | Factory for reconstructing components from serialized archives |
| CCompositeComponent | A component that contains and manages child components |
| CConstantLRScheduler | Constant learning-rate scheduler |
| CCosineLRScheduler | Cosine annealing scheduler |
| CCpuTensorDataTypeTraits | CPU-specific traits for abstract tensor data types |
| CCrossEntropyConfig | Configuration for fused SoftmaxCrossEntropy loss |
| Cdependent_false | |
| CDropout | Dropout regularization module for neural networks |
| CDropoutConfig | Configuration class for Dropout module |
| CFusedComponent | DEPRECATED |
| CGelu | Gaussian Error Linear Unit (GELU) activation component |
| CGeluConfig | Configuration class for GELU module |
| CGenerateParams | |
| CGenerationStatistics | Statistics captured during a single generateStreaming() call |
| CGptBlock | Transformer encoder block as a composite component |
| CGptBlockConfig | Configuration class for GPT transformer blocks |
| CGptConfig | Network-level configuration for GPT-style transformer networks |
| CGptModel | GPT inference model |
| CGptTransformer | GPT-2 style transformer (decoder-only) for autoregressive token prediction |
| CGqaConfig | Configuration class for the Grouped-Query Attention module |
| CGroupedQueryAttention | Grouped-Query Attention module that accepts concatenated QKV input |
| CITensor | Abstract interface providing essential tensor information and data access |
| CLanguageModel | |
| CLanguageModelConfig | CRTP base configuration for all deployable Mila language models |
| CLanguageNetwork | |
| CLayerNorm | Device-templated Layer Normalization component |
| CLayerNormConfig | |
| CLearningRateScheduler | Abstract base for learning-rate schedulers |
| CLinear | Device-templated fully connected (linear) component |
| CLinearConfig | Configuration object for a Linear (fully connected) layer |
| CLinearLRScheduler | Linear decay scheduler |
| CLlamaBlock | |
| CLlamaConfig | Network-level configuration for LLaMA-style transformer networks |
| CLlamaModel | LLaMA 3 compatible inference model |
| CLlamaModelConfig | Deployment configuration for Llama language models |
| CLlamaTransformer | LLaMA-style transformer (decoder-only) for autoregressive token prediction |
| CLoss | Abstract base class for neural network loss functions |
| CLpe | Encoder module for token and positional embeddings (device-templated) |
| CLpeConfig | Configuration class for the Learned Positional Encoder |
| CMemoryStats | Memory allocation breakdown for a single component |
| CMLP | Multi-Layer Perceptron (MLP) composite component |
| CMLPConfig | Configuration class for the Multi-Layer Perceptron (MLP) block |
| CModel | |
| CModelConfig | Abstract base configuration for all deployable Mila models |
| CMultiAxisPartition | Multi-axis partition for normalization over trailing dimensions |
| CMultiHeadAttention | Multi-Head Attention module that accepts concatenated QKV input |
| CMultiHeadAttentionConfig | Configuration class for Attention module |
| CNetwork | Root composite network container |
| CNetworkFactory | Factory registry for Network deserialization |
| CResidual | Device-templated Residual connection component |
| CResidualConfig | Configuration class for Residual connection component |
| CRmsNorm | Device-templated RMS Normalization component |
| CRmsNormConfig | |
| CRope | Device-templated RoPE component |
| CRopeConfig | |
| CSerializationMetadata | Type-safe metadata container for component serialization |
| CSoftmax | Softmax activation module (device-templated) |
| CSoftmaxConfig | Configuration class for Softmax module |
| CSoftmaxCrossEntropy | Fused SoftmaxCrossEntropy loss module (device-templated) |
| CSwiglu | SwiGLU activation component |
| CSwigluConfig | |
| CTensor | Device-aware N-dimensional tensor |
| CTensorBuffer | Device-agnostic buffer for storing tensor data with abstract type system |
| CTensorDataTypeMap | Primary template for mapping concrete C++ types to TensorDataType |
| CTensorDataTypeMap< __nv_fp8_e4m3 > | |
| CTensorDataTypeMap< __nv_fp8_e5m2 > | |
| CTensorDataTypeMap< float > | Concrete type mapping for float (FP32) |
| CTensorDataTypeMap< half > | |
| CTensorDataTypeMap< nv_bfloat16 > | |
| CTensorDataTypeMap< std::int16_t > | Concrete type mapping for 16-bit signed integer |
| CTensorDataTypeMap< std::int32_t > | Concrete type mapping for 32-bit signed integer |
| CTensorDataTypeMap< std::int8_t > | Concrete type mapping for 8-bit signed integer |
| CTensorDataTypeMap< std::uint16_t > | Concrete type mapping for 16-bit unsigned integer |
| CTensorDataTypeMap< std::uint32_t > | Concrete type mapping for 32-bit unsigned integer |
| CTensorDataTypeMap< std::uint8_t > | Concrete type mapping for 8-bit unsigned integer |
| CTensorDataTypeTraits | Compile-time traits for TensorDataType enumeration values |
| CTensorDataTypeTraits< TensorDataType::BF16 > | Traits specialization for 16-bit brain floating point |
| CTensorDataTypeTraits< TensorDataType::FP16 > | Traits specialization for 16-bit half precision floating point |
| CTensorDataTypeTraits< TensorDataType::FP32 > | Traits specialization for 32-bit IEEE 754 floating point |
| CTensorDataTypeTraits< TensorDataType::FP4_E2M1 > | Traits specialization for 4-bit floating point with E2M1 format |
| CTensorDataTypeTraits< TensorDataType::FP4_E3M0 > | Traits specialization for 4-bit floating point with E3M0 format |
| CTensorDataTypeTraits< TensorDataType::FP8_E4M3 > | Traits specialization for 8-bit floating point with E4M3 format |
| CTensorDataTypeTraits< TensorDataType::FP8_E5M2 > | Traits specialization for 8-bit floating point with E5M2 format |
| CTensorDataTypeTraits< TensorDataType::INT16 > | Traits specialization for 16-bit signed integer |
| CTensorDataTypeTraits< TensorDataType::INT32 > | Traits specialization for 32-bit signed integer |
| CTensorDataTypeTraits< TensorDataType::INT8 > | Traits specialization for 8-bit signed integer |
| CTensorDataTypeTraits< TensorDataType::UINT16 > | Traits specialization for 16-bit unsigned integer |
| CTensorDataTypeTraits< TensorDataType::UINT32 > | Traits specialization for 32-bit unsigned integer |
| CTensorDataTypeTraits< TensorDataType::UINT8 > | Traits specialization for 8-bit unsigned integer |
| CTensorHostTypeMap | Maps abstract TensorDataType to host-compatible C++ type and TensorDataType |
| CTensorHostTypeMap< TensorDataType::BF16 > | Host type for 16-bit brain floating point |
| CTensorHostTypeMap< TensorDataType::FP16 > | Host type for 16-bit half precision floating point |
| CTensorHostTypeMap< TensorDataType::FP32 > | Host type for 32-bit IEEE 754 floating point |
| CTensorHostTypeMap< TensorDataType::FP8_E4M3 > | Host type for 8-bit floating point with E4M3 format |
| CTensorHostTypeMap< TensorDataType::FP8_E5M2 > | Host type for 8-bit floating point with E5M2 format |
| CTensorHostTypeMap< TensorDataType::INT16 > | Host type for 16-bit signed integer |
| CTensorHostTypeMap< TensorDataType::INT32 > | Host type for 32-bit signed integer |
| CTensorHostTypeMap< TensorDataType::INT8 > | Host type for 8-bit signed integer |
| CTensorHostTypeMap< TensorDataType::UINT16 > | Host type for 16-bit unsigned integer |
| CTensorHostTypeMap< TensorDataType::UINT32 > | Host type for 32-bit unsigned integer |
| CTensorHostTypeMap< TensorDataType::UINT8 > | Host type for 8-bit unsigned integer |
| CTensorOps | Device-dispatched TensorOps interface template |
| CTensorOps< Compute::DeviceType::Cpu > | |
| CTensorOps< Compute::DeviceType::Cuda > | |
| CTensorShape | Fixed-capacity inline shape descriptor for N-dimensional tensors |
| CTokenEmbedding | Pure token embedding component (device-templated) |
| CTokenEmbeddingConfig | Configuration for the TokenEmbedding component |
| CUniqueIdGenerator | Thread-safe generator for unique tensor identifiers |
| CVulkanTensorTraits | Vulkan-specific traits for abstract tensor data types |
| NLogging | |
| CConsoleSink | Thread-safe logging sink that writes formatted records to the console |
| CFileSink | Thread-safe logging sink that writes formatted records to a file |
| CLogger | Abstract logging interface and static facade |
| CNullSink | A logging sink that silently discards all records |
| NProfiling | |
| CNvtxRange | |
| NUtils | |
| CStepLogger | |
| CVersion | Semantic Version data |
| Nstd | |
| Chash< Mila::Dnn::Compute::DeviceId > | Hash specialization for DeviceId |
| CCudaException | |
| CMyCustomLayerPlugin | |