
🔢 Advanced Numerical Embeddings

Transform raw numerical features into powerful representations

Enhance your model's ability to learn from numerical data with KDP's sophisticated dual-branch embedding architecture.

📋 Architecture Overview

Advanced Numerical Embeddings in KDP transform continuous values into meaningful embeddings using a dual-branch architecture:

1. Continuous Branch: processes raw values through a small MLP for smooth pattern learning.

2. Discrete Branch: discretizes values into learnable bins with trainable boundaries.

The outputs from both branches are combined through a learnable gate mechanism, which adaptively balances the continuous and discrete representations for each feature.
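To make the gate mechanism concrete, here is a toy NumPy sketch of the combination step. This is an illustration of the idea, not KDP's actual implementation; the embedding values and gate logits below are made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy embeddings for one feature (embedding_dim = 4)
continuous = np.array([0.2, -0.5, 1.0, 0.3])   # output of the MLP branch
discrete   = np.array([0.8,  0.1, -0.2, 0.5])  # looked up from the bin table
gate_logits = np.array([2.0, -2.0, 0.0, 1.0])  # learned per dimension

# Sigmoid gate weights continuous vs. discrete per embedding dimension
gate = sigmoid(gate_logits)
combined = gate * continuous + (1.0 - gate) * discrete
```

A gate logit of 0 weights both branches equally (gate = 0.5); large positive logits favor the continuous branch, large negative ones the discrete branch.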

✨ Key Benefits

  • 🛠️ Dual-Branch Architecture: combines the best of both continuous and discrete processing
  • 📏 Learnable Boundaries: adapts bin edges during training for optimal discretization
  • 🎛️ Feature-Specific Processing: each feature gets its own specialized embedding
  • 💾 Memory Efficient: optimized for handling large-scale tabular datasets
  • 🔗 Flexible Integration: works seamlessly with other KDP features
  • 🔧 Residual Connections: ensure stability during training

🚀 Getting Started

1. Basic Usage

from kdp import PreprocessingModel, FeatureType

# Define numerical features
features_specs = {
    "age": FeatureType.FLOAT_NORMALIZED,
    "income": FeatureType.FLOAT_RESCALED,
    "credit_score": FeatureType.FLOAT_NORMALIZED
}

# Initialize model with numerical embeddings
preprocessor = PreprocessingModel(
    path_data="data/my_data.csv",
    features_specs=features_specs,
    use_numerical_embedding=True,  # Enable numerical embeddings
    numerical_embedding_dim=8,     # Size of each feature's embedding
    numerical_num_bins=10          # Number of bins for discretization
)
2. Advanced Configuration

from kdp import PreprocessingModel
from kdp.features import NumericalFeature
from kdp.enums import FeatureType

# Define numerical features with customized embeddings
features_specs = {
    "age": NumericalFeature(
        name="age",
        feature_type=FeatureType.FLOAT_NORMALIZED,
        use_embedding=True,
        embedding_dim=8,
        num_bins=10,
        init_min=18,  # Domain-specific minimum
        init_max=90   # Domain-specific maximum
    ),
    "income": NumericalFeature(
        name="income",
        feature_type=FeatureType.FLOAT_RESCALED,
        use_embedding=True,
        embedding_dim=12,
        num_bins=15,
        init_min=0,     # Cannot be negative
        init_max=500000 # Maximum expected
    )
}

# Create preprocessing model
preprocessor = PreprocessingModel(
    path_data="data/my_data.csv",
    features_specs=features_specs,
    use_numerical_embedding=True,
    numerical_mlp_hidden_units=16,   # Hidden layer size for continuous branch
    numerical_dropout_rate=0.1,      # Regularization
    numerical_use_batch_norm=True    # Normalize activations
)

🧠 How It Works

Individual Feature Embeddings (NumericalEmbedding)

The NumericalEmbedding layer processes each numerical feature through two parallel branches:

  1. Continuous Branch:
     • Processes each feature through a small MLP
     • Applies dropout and optional batch normalization
     • Includes a residual connection for stability

  2. Discrete Branch:
     • Maps each value to a bin using learnable min/max boundaries
     • Retrieves a learned embedding for each bin
     • Captures non-linear and discrete patterns

  3. Learnable Gate:
     • Combines outputs from both branches using a sigmoid gate
     • Adaptively weights continuous vs. discrete representations
     • Learns the optimal combination per feature and dimension
Input value
    ┌────────┐    ┌────────┐
    │  MLP   │    │Binning │
    └────────┘    └────────┘
         │             │
         ▼             ▼
   Continuous      Discrete
   Embedding       Embedding
         │             │
         └─────┬───────┘
               │
               ▼
       Gating Mechanism
               │
               ▼
       Final Embedding
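The discrete branch's binning step can be sketched in a few lines of NumPy. This is a simplified illustration, assuming the documented behavior (values are scaled between the trainable init_min/init_max boundaries, clipped, and mapped to one of num_bins embeddings); the function name bin_index and the random embedding table are illustrative, not part of KDP's API:

```python
import numpy as np

def bin_index(x, init_min, init_max, num_bins):
    # Scale the value into [0, 1] using the (trainable) boundaries, then clip
    # so out-of-range values fall into the first or last bin.
    t = (x - init_min) / (init_max - init_min)
    t = np.clip(t, 0.0, 1.0 - 1e-9)
    return int(t * num_bins)

# In KDP this table would be a trainable weight of shape (num_bins, embedding_dim)
rng = np.random.default_rng(0)
bin_embeddings = rng.normal(size=(10, 8))

# An "age" of 42 with boundaries [18, 90] and 10 bins lands in bin 3
idx = bin_index(42.0, init_min=18, init_max=90, num_bins=10)
discrete_embedding = bin_embeddings[idx]
```

Because init_min and init_max are trainable, the effective bin edges shift during training to match the data's distribution.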

Global Feature Embeddings (GlobalNumericalEmbedding)

The GlobalNumericalEmbedding layer processes all numerical features together and returns a single compact representation:

  1. Flattens input features (if needed)
  2. Applies NumericalEmbedding to process all features
  3. Performs global pooling (average or max) across feature dimensions
  4. Returns a single vector representing all numerical features

This approach is ideal for:

  • Processing large feature sets efficiently
  • Capturing cross-feature interactions
  • Reducing the dimensionality of numerical data
  • Learning a unified numerical representation
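The pooling step that produces the single vector can be sketched as follows. A toy NumPy illustration, assuming per-feature embeddings stacked along a feature axis; shapes and data are made up for demonstration:

```python
import numpy as np

# Per-feature embeddings: (batch, num_features, embedding_dim)
batch, num_features, dim = 2, 5, 8
rng = np.random.default_rng(1)
feature_embeddings = rng.normal(size=(batch, num_features, dim))

# "average" pooling collapses the feature axis into one vector per sample
global_avg = feature_embeddings.mean(axis=1)   # shape (batch, dim)

# "max" pooling instead keeps the strongest activation per dimension
global_max = feature_embeddings.max(axis=1)    # shape (batch, dim)
```

Either way the output is a single (batch, embedding_dim) vector regardless of how many numerical features go in, which is what makes the global variant attractive for wide tables.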

โš™๏ธ Configuration Options

Individual Embeddings

Parameter                   Type        Default  Description
use_numerical_embedding     bool        False    Enable numerical embeddings
numerical_embedding_dim     int         8        Size of each feature's embedding
numerical_mlp_hidden_units  int         16       Hidden layer size for the continuous branch
numerical_num_bins          int         10       Number of bins for discretization
numerical_init_min          float/list  -3.0     Initial minimum for scaling
numerical_init_max          float/list  3.0      Initial maximum for scaling
numerical_dropout_rate      float       0.1      Dropout rate for regularization
numerical_use_batch_norm    bool        True     Apply batch normalization

Global Embeddings

Parameter                       Type        Default    Description
use_global_numerical_embedding  bool        False      Enable global numerical embeddings
global_embedding_dim            int         8          Size of the global embedding
global_mlp_hidden_units         int         16         Hidden layer size for the continuous branch
global_num_bins                 int         10         Number of bins for discretization
global_init_min                 float/list  -3.0       Initial minimum for scaling
global_init_max                 float/list  3.0        Initial maximum for scaling
global_dropout_rate             float       0.1        Dropout rate for regularization
global_use_batch_norm           bool        True       Apply batch normalization
global_pooling                  str         "average"  Pooling method ("average" or "max")

🎯 Best Use Cases

When to Use Individual Embeddings

  • When each numerical feature conveys distinct information
  • When features have different scales or distributions
  • When you need fine-grained control of each feature's representation
  • When memory usage is a concern (more efficient with many features)
  • For explainability (each feature has its own embedding)

When to Use Global Embeddings

  • When you have many numerical features
  • When features have strong interdependencies
  • When dimensionality reduction is desired
  • When a unified representation of all numerical data is needed
  • For simpler model architectures (single vector output)

๐Ÿ” Examples

Financial Risk Modeling

from kdp import PreprocessingModel
from kdp.features import NumericalFeature
from kdp.enums import FeatureType

# Define financial features with domain knowledge
features_specs = {
    "income": NumericalFeature(
        name="income",
        feature_type=FeatureType.FLOAT_RESCALED,
        use_embedding=True,
        embedding_dim=8,
        num_bins=15,
        init_min=0,
        init_max=1000000
    ),
    "debt_ratio": NumericalFeature(
        name="debt_ratio",
        feature_type=FeatureType.FLOAT_NORMALIZED,
        use_embedding=True,
        embedding_dim=4,
        num_bins=8,
        init_min=0,
        init_max=1  # Ratio typically between 0-1
    ),
    "credit_score": NumericalFeature(
        name="credit_score",
        feature_type=FeatureType.FLOAT_NORMALIZED,
        use_embedding=True,
        embedding_dim=6,
        num_bins=10,
        init_min=300,
        init_max=850  # Standard credit score range
    ),
    "payment_history": NumericalFeature(
        name="payment_history",
        feature_type=FeatureType.FLOAT_NORMALIZED,
        use_embedding=True,
        embedding_dim=8,
        num_bins=5,
        init_min=0,
        init_max=1  # Simplified score between 0-1
    )
}

# Create preprocessing model
preprocessor = PreprocessingModel(
    path_data="data/financial_data.csv",
    features_specs=features_specs,
    use_numerical_embedding=True,
    numerical_mlp_hidden_units=16,
    numerical_dropout_rate=0.2,  # Higher dropout for financial data
    numerical_use_batch_norm=True
)

Healthcare Patient Analysis

from kdp import PreprocessingModel
from kdp.features import NumericalFeature
from kdp.enums import FeatureType

# Define patient features
features_specs = {
    # Define many health metrics
    "age": NumericalFeature(...),
    "bmi": NumericalFeature(...),
    "blood_pressure": NumericalFeature(...),
    "cholesterol": NumericalFeature(...),
    "glucose": NumericalFeature(...),
    # Many more metrics...
}

# Use global embedding to handle many numerical features
preprocessor = PreprocessingModel(
    path_data="data/patient_data.csv",
    features_specs=features_specs,
    use_global_numerical_embedding=True,  # Process all features together
    global_embedding_dim=32,              # Higher dimension for complex data
    global_mlp_hidden_units=64,
    global_num_bins=20,                   # More bins for medical precision
    global_dropout_rate=0.1,
    global_use_batch_norm=True,
    global_pooling="max"                  # Use max pooling to capture extremes
)

💡 Pro Tips

  1. Choose the Right Embedding Type
     • Use individual embeddings for interpretability and precise control
     • Use global embeddings for efficiency with many numerical features

  2. Distribution-Aware Initialization
     • Set init_min and init_max based on your data's actual distribution
     • Use domain knowledge to set meaningful boundary points
     • Initialize close to the anticipated feature range for faster convergence

  3. Dimensionality Guidelines
     • Start with an embedding_dim of 4-8 for simple features
     • Use 8-16 for complex features with non-linear patterns
     • For global embeddings, scale with the number of features (16-64)

  4. Performance Tuning
     • Increase num_bins for more granular discrete representations
     • Set mlp_hidden_units to 2-4x the embedding dimension
     • Use batch normalization for faster, more stable training
     • Adjust dropout to the dataset size (higher for small datasets)

  5. Combine with Other KDP Features
     • Pair with distribution-aware encoding for optimal numerical handling
     • Use with tabular attention to learn cross-feature interactions
     • Combine with feature selection for automatic dimensionality reduction

📊 Model Architecture

Advanced numerical embeddings transform your numerical features into rich representations:

[Diagram: Advanced Numerical Embeddings]

Global numerical embeddings allow coordinated embeddings across all features:

[Diagram: Global Numerical Embeddings]

These diagrams illustrate how KDP transforms numerical features into rich embedding spaces, capturing complex patterns and non-linear relationships.


🧩 Dependencies

Core Dependencies

  • ๐Ÿ Python 3.9+
  • ๐Ÿ”„ TensorFlow 2.18.0+
  • ๐Ÿ”ข NumPy 1.22.0+
  • ๐Ÿ“Š Pandas 2.2.0+
  • ๐Ÿ“ loguru 0.7.2+

Optional Dependencies

Package              Purpose                                            Install Command
scipy                🧪 Scientific computing and statistical functions  pip install "kdp[dev]"
ipython              🔍 Interactive Python shell and notebook support   pip install "kdp[dev]"
pytest               ✅ Testing framework and utilities                 pip install "kdp[dev]"
pydot                📊 Graph visualization for model architecture      pip install "kdp[dev]"
Development Tools    🛠️ All development dependencies                    pip install "kdp[dev]"
Documentation Tools  📚 Documentation generation tools                  pip install "kdp[doc]"