# Categorical Features in KDP

Learn how to effectively represent categories, leverage embeddings, and handle high-cardinality data.
## Overview

Categorical features represent data that belongs to a finite set of possible values or categories. KDP provides advanced techniques for handling categorical data, from simple encoding to neural embeddings that capture semantic relationships between categories.
## Types of Categorical Features

| Feature Type | Best For | Example | When to Use |
|---|---|---|---|
| `STRING_CATEGORICAL` | Text categories | `product_type`: "shirt", "pants", "shoes" | When categories are text strings |
| `INTEGER_CATEGORICAL` | Numeric categories | `education_level`: 1, 2, 3, 4 | When categories are already represented as integers |
| `STRING_HASHED` | High-cardinality sets | `user_id`: "user_12345", "user_67890" | When there are too many unique categories (>10K) |
| `MULTI_CATEGORICAL` | Multiple categories per sample | `interests`: ["sports", "music", "travel"] | When each sample can belong to multiple categories |
## Basic Usage

```python
from kdp import PreprocessingModel, FeatureType

# Simple categorical features
features = {
    "product_category": FeatureType.STRING_CATEGORICAL,
    "store_id": FeatureType.INTEGER_CATEGORICAL,
    "tags": FeatureType.MULTI_CATEGORICAL,
}

preprocessor = PreprocessingModel(
    path_data="product_data.csv",
    features_specs=features,
)
```
## Advanced Configuration

For more control over categorical processing, use the detailed configuration:
```python
from kdp import PreprocessingModel, FeatureType, CategoricalFeature

# Detailed configuration
features = {
    # Basic configuration
    "product_type": FeatureType.STRING_CATEGORICAL,

    # Full configuration with an explicit CategoricalFeature
    "store_location": CategoricalFeature(
        name="store_location",
        feature_type=FeatureType.STRING_CATEGORICAL,
        embedding_dim=16,                  # Size of embedding vector
        hash_bucket_size=1000,             # For hashed features
        vocabulary_size=250,               # Limit vocabulary size
        use_embedding=True,                # Use neural embeddings
        unknown_token="<UNK>",             # Token for out-of-vocabulary values
        oov_buckets=10,                    # Out-of-vocabulary buckets
        multi_hot=False                    # Single category per sample
    ),

    # High-cardinality feature using hashing
    "product_id": CategoricalFeature(
        name="product_id",
        feature_type=FeatureType.STRING_HASHED,
        hash_bucket_size=5000
    ),

    # Multi-categorical feature with separator
    "product_tags": CategoricalFeature(
        name="product_tags",
        feature_type=FeatureType.MULTI_CATEGORICAL,
        separator=",",                     # How values are separated in the data
        multi_hot=True                     # Enable multi-hot encoding
    )
}

preprocessor = PreprocessingModel(
    path_data="product_data.csv",
    features_specs=features,
)
```
## Key Configuration Parameters

| Parameter | Description | Default | Notes |
|---|---|---|---|
| `embedding_dim` | Size of embedding vectors | 8 | Higher values capture more complex relationships (8-128) |
| `hash_bucket_size` | Number of hash buckets for hashed features | 1000 | Larger values reduce collisions but increase dimensionality |
| `salt` | Salt value for the hash function | None | Custom salt to make hash values unique across features |
| `hash_with_embedding` | Apply an embedding after hashing | False | Combines hashing with embeddings for large vocabularies |
| `vocabulary_size` | Maximum number of categories to keep | None | None keeps all categories; otherwise keeps the top N by frequency |
| `use_embedding` | Enable neural embeddings vs. one-hot encoding | True | Neural embeddings improve performance for most models |
| `separator` | Character that separates values in multi-categorical features | "," | Only used for MULTI_CATEGORICAL features |
| `oov_buckets` | Number of buckets for out-of-vocabulary values | 1 | Higher values help handle new categories in production |
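The trade-off behind `hash_bucket_size` is easy to see directly: hashing N values into B buckets produces collisions, and raising B lowers the collision rate at the cost of a wider output. A small standalone sketch in plain Python with `hashlib` (illustrative only, not KDP's internal hash function):

```python
import hashlib

def hash_bucket(value: str, num_buckets: int, salt: int = 0) -> int:
    """Deterministically map a string to one of num_buckets buckets."""
    digest = hashlib.md5(f"{salt}:{value}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def collision_rate(values, num_buckets):
    """Fraction of values that land in a bucket already taken by another value."""
    occupied = {hash_bucket(v, num_buckets) for v in values}
    return 1 - len(occupied) / len(values)

ids = [f"user_{i}" for i in range(5_000)]
for buckets in (1_000, 5_000, 50_000):
    print(f"{buckets:>6} buckets -> collision rate {collision_rate(ids, buckets):.1%}")
```

With 5,000 IDs, 1,000 buckets collapse most values together, while 50,000 buckets keep collisions rare — which is exactly why very high-cardinality features warrant larger `hash_bucket_size` values.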
## Powerful Features

### Embedding Visualizations

KDP's categorical embeddings can be visualized to see relationships between categories:
```python
# Train the preprocessor
preprocessor.fit()
result = preprocessor.build_preprocessor()

# Extract embeddings for visualization
embeddings = preprocessor.get_feature_embeddings("product_category")

# Visualize with t-SNE or UMAP
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2)
embeddings_2d = tsne.fit_transform(embeddings)

plt.figure(figsize=(10, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])
plt.title("Category Embedding Visualization")
plt.show()
```
### Handling High-Cardinality Features

KDP provides multiple strategies for dealing with features that have many unique values:
```python
from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions

# Method 1: Limit vocabulary size (keeps most frequent)
user_id_limited = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    vocabulary_size=10000  # Keep top 10K users
)

# Method 2: Hash features into buckets (fast, fixed memory)
user_id_hashed = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=5000  # Hash into 5K buckets
)

# Method 3: Hash with embeddings (best balance)
user_id_hash_embed = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    hash_with_embedding=True,
    embedding_size=16
)
```
### Feature Hashing

Feature hashing transforms categorical values into a fixed-size vector representation, ideal for very high-cardinality features. It is fully integrated with the ModelAdvisor for automatic configuration:
```python
from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions

# Basic feature hashing
product_id = CategoricalFeature(
    name="product_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=1024  # Number of hash buckets
)

# Advanced feature hashing with a custom salt.
# The salt ensures different features use different hash spaces.
session_id = CategoricalFeature(
    name="session_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    salt=42  # Custom salt value
)

# Feature hashing followed by an embedding
user_id = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    hash_with_embedding=True,
    embedding_size=16  # Embedding dimension after hashing
)
```
### Auto-Configuration with ModelAdvisor

KDP's ModelAdvisor intelligently recommends hashing for high-cardinality features:
```python
from kdp.model_advisor import recommend_model_configuration
from kdp.stats import DatasetStatistics

# Analyze dataset statistics
stats_calculator = DatasetStatistics("high_cardinality_data.csv")
stats_calculator.compute_statistics()

# Get recommendations from the ModelAdvisor
recommendations = recommend_model_configuration(stats_calculator.features_stats)

# The recommendations will include HASHING for high-cardinality features.
# Example output:
'''
{
  "features": {
    "user_id": {
      "feature_type": "CategoricalFeature",
      "preprocessing": ["HASHING"],
      "config": {
        "category_encoding": "HASHING",
        "hash_bucket_size": 2048,
        "hash_with_embedding": true,
        "embedding_size": 16
      },
      "notes": ["High cardinality feature (10K+ values)", "Using hashing for efficiency"]
    },
    ...
  }
}
'''

# Generate code from recommendations
code_snippet = recommendations["code_snippet"]
print(code_snippet)
```
## Choosing Between Encoding Options

KDP offers multiple encoding options for categorical features. Here's how to choose:
| Encoding | Vocabulary Size | Memory Usage | New Categories | Semantic Information |
|---|---|---|---|---|
| One-Hot Encoding | Small (< 50) | High | ❌ Requires retraining | ❌ No relationship capture |
| Embeddings | Medium (50-10K) | Medium | ⚠️ Limited by OOV handling | ✅ Captures relationships |
| Hashing | Very Large (10K+) | Low (fixed) | ✅ Handles new values | ❌ No relationship capture |
| Hashing with Embeddings | Very Large (10K+) | Low-Medium | ✅ Handles new values | ✅ Some relationship capture |
The ModelAdvisor analyzes your data and automatically recommends the optimal encoding based on these criteria.
```python
# End-to-end example with automatic encoding selection
from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions
from kdp.processor import PreprocessingModel
from kdp.stats import DatasetStatistics
from kdp.model_advisor import recommend_model_configuration

# 1. Analyze the dataset
stats = DatasetStatistics("product_dataset.csv")
stats.compute_statistics()

# 2. Get recommendations
recommendations = recommend_model_configuration(stats.features_stats)

# 3. Create feature specs from the recommended config
features = {}
for name, feature_rec in recommendations["features"].items():
    if feature_rec["feature_type"] == "CategoricalFeature":
        # Extract the configuration for this categorical feature
        config = feature_rec["config"]
        features[name] = CategoricalFeature(
            name=name,
            feature_type=getattr(FeatureType, config.get("feature_type", "STRING_CATEGORICAL")),
            category_encoding=getattr(CategoryEncodingOptions, config.get("category_encoding", "EMBEDDING")),
            hash_bucket_size=config.get("hash_bucket_size"),
            hash_with_embedding=config.get("hash_with_embedding", False),
            embedding_size=config.get("embedding_size"),
            salt=config.get("salt")
        )

# 4. Create and build the preprocessing model
model = PreprocessingModel(
    path_data="product_dataset.csv",
    features_specs=features,
)
preprocessor = model.build_preprocessor()
```
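The vocabulary-size cutoffs from the comparison table can be condensed into a small selection helper. This is a sketch of the decision rule only, not the ModelAdvisor's actual logic; the thresholds mirror the table above:

```python
def choose_encoding(cardinality: int, must_handle_new_values: bool = False) -> str:
    """Pick an encoding using the vocabulary-size cutoffs from the table above."""
    if cardinality > 10_000:
        # Very large vocabularies: hashing keeps memory fixed, and pairing
        # it with an embedding recovers some semantic structure.
        return "HASHING_WITH_EMBEDDING"
    if must_handle_new_values:
        # Plain hashing absorbs unseen values without retraining.
        return "HASHING"
    if cardinality < 50:
        return "ONE_HOT"
    return "EMBEDDING"

print(choose_encoding(10))         # ONE_HOT
print(choose_encoding(500))        # EMBEDDING
print(choose_encoding(1_000_000))  # HASHING_WITH_EMBEDDING
```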
## Real-World Examples

### E-commerce Product Categorization
```python
from kdp import PreprocessingModel, FeatureType, CategoricalFeature

# E-commerce features with hierarchical categories
preprocessor = PreprocessingModel(
    path_data="products.csv",
    features_specs={
        # Main category, subcategory, and detailed category
        "main_category": FeatureType.STRING_CATEGORICAL,
        "subcategory": FeatureType.STRING_CATEGORICAL,
        "detailed_category": FeatureType.STRING_CATEGORICAL,
        # Product attributes as multi-categories
        "product_features": CategoricalFeature(
            name="product_features",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            separator="|",
            multi_hot=True
        ),
        # Brand as a high-cardinality feature
        "brand": CategoricalFeature(
            name="brand",
            feature_type=FeatureType.STRING_CATEGORICAL,
            embedding_dim=16,
            vocabulary_size=1000  # Top 1,000 brands
        )
    }
)
```
### Content Recommendation System
```python
from kdp import PreprocessingModel, FeatureType, CategoricalFeature

# Content recommendation with user and item features
preprocessor = PreprocessingModel(
    path_data="interaction_data.csv",
    features_specs={
        # User features
        "user_id": CategoricalFeature(
            name="user_id",
            feature_type=FeatureType.STRING_HASHED,
            hash_bucket_size=10000
        ),
        "user_interests": CategoricalFeature(
            name="user_interests",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            embedding_dim=32,
            separator=","
        ),
        # Content features
        "content_id": CategoricalFeature(
            name="content_id",
            feature_type=FeatureType.STRING_HASHED,
            hash_bucket_size=5000
        ),
        "content_tags": CategoricalFeature(
            name="content_tags",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            embedding_dim=24,
            separator="|"
        ),
        "content_type": FeatureType.STRING_CATEGORICAL
    }
)
```
## Pro Tips

### Choose Embedding Dimensions Wisely

For simple categories with few values (2-10), use 4-8 dimensions. For complex categories with many values (100+), use 16-64 dimensions. The more complex the relationships between categories, the more dimensions you need.
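One way to turn this tip into code is the common `1.6 * n^0.56` rule of thumb for embedding size, clipped to the ranges suggested above. This is a heuristic for illustration, not a KDP API:

```python
def suggest_embedding_dim(cardinality: int, low: int = 4, high: int = 64) -> int:
    """Heuristic embedding size: ~1.6 * n^0.56, clipped to [low, high]."""
    dim = round(1.6 * cardinality ** 0.56)
    return max(low, min(high, dim))

print(suggest_embedding_dim(8))        # small category set -> small vector
print(suggest_embedding_dim(100))      # mid-size vocabulary
print(suggest_embedding_dim(100_000))  # capped at 64
```

Any reasonable monotone rule works; the point is to scale dimensions sub-linearly with cardinality rather than pick one size for every feature.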
### Pre-Train Embeddings

KDP allows you to initialize embeddings with pre-trained vectors for faster convergence:
```python
from kdp import FeatureType, CategoricalFeature

# Create an initial embeddings dictionary
pretrained = {
    "sports": [0.1, 0.2, 0.3, 0.4],
    "music": [0.5, 0.6, 0.7, 0.8]
}

# Use the pre-trained embeddings
category_feature = CategoricalFeature(
    name="interest",
    feature_type=FeatureType.STRING_CATEGORICAL,
    embedding_dim=4,
    pretrained_embeddings=pretrained
)
```
### Combine Multiple Encoding Strategies

For critical features, consider using both embeddings and one-hot encoding in parallel:
```python
# Main feature with an embedding
features["product_type"] = CategoricalFeature(
    name="product_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    use_embedding=True
)

# The same feature with one-hot encoding
features["product_type_onehot"] = CategoricalFeature(
    name="product_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    use_embedding=False
)
```
### Handling Unknown Categories

Configure how KDP handles previously unseen categories in production:
```python
from kdp import FeatureType, CategoricalFeature

feature = CategoricalFeature(
    name="store_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    unknown_token="<NEW_STORE>",  # Custom token
    oov_buckets=5                 # Use 5 different embeddings
)
```
## Understanding Categorical Embeddings

Categorical embeddings transform categorical values into dense vector representations that capture semantic relationships between categories.
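Conceptually, an embedding is just a lookup into a table of dense vectors, with one extra row reserved for unseen values. The minimal sketch below uses plain Python with random initial vectors; in KDP these vectors are trained, not fixed:

```python
import random

random.seed(0)
vocab = ["shirt", "pants", "shoes"]
index = {cat: i for i, cat in enumerate(vocab)}
dim = 4

# One dense vector per category, plus a final row shared by unseen values.
table = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(len(vocab) + 1)]

def embed(category):
    # Unknown categories fall through to the OOV row.
    return table[index.get(category, len(vocab))]

print(len(embed("shirt")))         # 4
print(embed("hat") == table[-1])   # True: unseen value uses the OOV row
```

During training, gradient updates move these vectors so that categories appearing in similar contexts end up close together, which is the "semantic relationship" the visualizations earlier in this page reveal.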