# 🏷️ Categorical Features in KDP

Learn how to effectively represent categories, leverage embeddings, and handle high-cardinality data.
## 📋 Overview
Categorical features represent data that belongs to a finite set of possible values or categories. KDP provides advanced techniques for handling categorical data, from simple encoding to neural embeddings that capture semantic relationships between categories.
## 📊 Types of Categorical Features

| Feature Type | Best For | Example | When to Use |
|---|---|---|---|
| `STRING_CATEGORICAL` | Text categories | `product_type: "shirt", "pants", "shoes"` | When categories are text strings |
| `INTEGER_CATEGORICAL` | Numeric categories | `education_level: 1, 2, 3, 4` | When categories are already represented as integers |
| `STRING_HASHED` | High-cardinality sets | `user_id: "user_12345", "user_67890"` | When there are too many unique categories (>10K) |
| `MULTI_CATEGORICAL` | Multiple categories per sample | `interests: ["sports", "music", "travel"]` | When each sample can belong to multiple categories |
## 🚀 Basic Usage

```python
from kdp import PreprocessingModel, FeatureType

# Simple categorical features
features = {
    "product_category": FeatureType.STRING_CATEGORICAL,
    "store_id": FeatureType.INTEGER_CATEGORICAL,
    "tags": FeatureType.MULTI_CATEGORICAL,
}

preprocessor = PreprocessingModel(
    path_data="product_data.csv",
    features_specs=features
)
```
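Once the feature specs are declared, building the preprocessor is a single call (the full pipeline is shown in the end-to-end example later in this guide):

```python
# Build the preprocessing model from the declared feature specs
result = preprocessor.build_preprocessor()
```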
## 🧠 Advanced Configuration

For more control over categorical processing, use the detailed configuration:

```python
from kdp import PreprocessingModel, FeatureType, CategoricalFeature

# Detailed configuration
features = {
    # Basic configuration
    "product_type": FeatureType.STRING_CATEGORICAL,

    # Full configuration with an explicit CategoricalFeature
    "store_location": CategoricalFeature(
        name="store_location",
        feature_type=FeatureType.STRING_CATEGORICAL,
        embedding_dim=16,        # Size of embedding vector
        hash_bucket_size=1000,   # For hashed features
        vocabulary_size=250,     # Limit vocabulary size
        use_embedding=True,      # Use neural embeddings
        unknown_token="<UNK>",   # Token for out-of-vocabulary values
        oov_buckets=10,          # Out-of-vocabulary buckets
        multi_hot=False          # Single category per sample
    ),

    # High-cardinality feature using hashing
    "product_id": CategoricalFeature(
        name="product_id",
        feature_type=FeatureType.STRING_HASHED,
        hash_bucket_size=5000
    ),

    # Multi-categorical feature with a separator
    "product_tags": CategoricalFeature(
        name="product_tags",
        feature_type=FeatureType.MULTI_CATEGORICAL,
        separator=",",   # How values are separated in the data
        multi_hot=True   # Enable multi-hot encoding
    ),
}

preprocessor = PreprocessingModel(
    path_data="product_data.csv",
    features_specs=features
)
```
## ⚙️ Key Configuration Parameters

| Parameter | Description | Default | Notes |
|---|---|---|---|
| `embedding_dim` | Size of embedding vectors | 8 | Higher values capture more complex relationships (8-128) |
| `hash_bucket_size` | Number of hash buckets for hashed features | 1000 | Larger values reduce collisions but increase dimensionality |
| `salt` | Salt value for the hash function | None | Custom salt to make hash values unique across features |
| `hash_with_embedding` | Apply an embedding after hashing | False | Combines hashing with embeddings for large vocabularies |
| `vocabulary_size` | Maximum number of categories to keep | None | None uses all categories; otherwise keeps the top N by frequency |
| `use_embedding` | Use neural embeddings instead of one-hot encoding | True | Neural embeddings improve performance for most models |
| `separator` | Character that separates values in multi-categorical features | "," | Only used for MULTI_CATEGORICAL features |
| `oov_buckets` | Number of buckets for out-of-vocabulary values | 1 | Higher values help handle new categories in production |
## 💡 Powerful Features

### 🧿 Embedding Visualizations

KDP's categorical embeddings can be visualized to see relationships between categories:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Train the preprocessor
preprocessor.fit()
result = preprocessor.build_preprocessor()

# Extract embeddings for visualization
embeddings = preprocessor.get_feature_embeddings("product_category")

# Visualize with t-SNE (UMAP works equally well)
tsne = TSNE(n_components=2)
embeddings_2d = tsne.fit_transform(embeddings)

plt.figure(figsize=(10, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])
plt.title("Category Embedding Visualization")
plt.show()
```
### 📈 Handling High-Cardinality

KDP provides multiple strategies for dealing with features that have many unique values:

```python
from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions

# Method 1: Limit vocabulary size (keeps the most frequent categories)
user_id_limited = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    vocabulary_size=10000  # Keep top 10K users
)

# Method 2: Hash features into buckets (fast, fixed memory)
user_id_hashed = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=5000  # Hash into 5K buckets
)

# Method 3: Hash, then embed the buckets (best balance)
user_id_hash_embed = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    hash_with_embedding=True,
    embedding_size=16
)
```
### 🧮 Feature Hashing

Feature hashing transforms categorical values into a fixed-size vector representation, which is ideal for very high-cardinality features. It is now fully integrated with the ModelAdvisor for automatic configuration:

```python
# Basic feature hashing
product_id = CategoricalFeature(
    name="product_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=1024  # Number of hash buckets
)

# Feature hashing with a custom salt.
# The salt ensures different features use different hash spaces.
session_id = CategoricalFeature(
    name="session_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    salt=42  # Custom salt value
)

# Feature hashing followed by an embedding
user_id = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    hash_with_embedding=True,
    embedding_size=16  # Embedding dimension after hashing
)
```
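To build intuition for what hashing does under the hood, here is a minimal standalone sketch in plain TensorFlow (illustrative only, not KDP's internal code), showing how strings map deterministically to buckets and how a salt shifts the mapping:

```python
import tensorflow as tf

ids = tf.constant(["user_12345", "user_67890", "user_99999"])

# Each string is deterministically mapped to one of N buckets,
# with no vocabulary to store -- memory usage stays fixed.
buckets = tf.strings.to_hash_bucket_fast(ids, num_buckets=2048)

# Keyed ("salted") hashing: the same strings land in different buckets,
# so two features hashed with different salts don't share collision patterns.
salted = tf.strings.to_hash_bucket_strong(ids, num_buckets=2048, key=[42, 42])

print(buckets.numpy())
print(salted.numpy())
```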
### 🤖 Auto-configuration with ModelAdvisor

KDP's ModelAdvisor now intelligently recommends hashing for high-cardinality features:

```python
from kdp.model_advisor import recommend_model_configuration
from kdp.stats import DatasetStatistics

# Analyze dataset statistics
stats_calculator = DatasetStatistics("high_cardinality_data.csv")
stats_calculator.compute_statistics()

# Get recommendations from the ModelAdvisor
recommendations = recommend_model_configuration(stats_calculator.features_stats)

# The recommendations will include HASHING for high-cardinality features.
# Example output:
'''
{
  "features": {
    "user_id": {
      "feature_type": "CategoricalFeature",
      "preprocessing": ["HASHING"],
      "config": {
        "category_encoding": "HASHING",
        "hash_bucket_size": 2048,
        "hash_with_embedding": true,
        "embedding_size": 16
      },
      "notes": ["High cardinality feature (10K+ values)", "Using hashing for efficiency"]
    },
    ...
  }
}
'''

# Generate code from the recommendations
code_snippet = recommendations["code_snippet"]
print(code_snippet)
```
### 🔍 Choosing Between Encoding Options

KDP offers multiple encoding options for categorical features. Here's how to choose:

| Encoding | Vocabulary Size | Memory Usage | New Categories | Semantic Information |
|---|---|---|---|---|
| One-Hot Encoding | Small (< 50) | High | ❌ Requires retraining | ❌ No relationship capture |
| Embeddings | Medium (50-10K) | Medium | ⚠️ Limited by OOV handling | ✅ Captures relationships |
| Hashing | Very Large (10K+) | Low (fixed) | ✅ Handles new values | ❌ No relationship capture |
| Hashing with Embeddings | Very Large (10K+) | Low-Medium | ✅ Handles new values | ✅ Some relationship capture |

The ModelAdvisor analyzes your data and automatically recommends the optimal encoding based on these criteria.
```python
# End-to-end example with automatic encoding selection
from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions
from kdp.processor import PreprocessingModel
from kdp.stats import DatasetStatistics
from kdp.model_advisor import recommend_model_configuration

# 1. Analyze the dataset
stats = DatasetStatistics("product_dataset.csv")
stats.compute_statistics()

# 2. Get recommendations
recommendations = recommend_model_configuration(stats.features_stats)

# 3. Create feature specs from the recommended config
features = {}
for name, feature_rec in recommendations["features"].items():
    if feature_rec["feature_type"] == "CategoricalFeature":
        # Extract the configuration for this categorical feature
        config = feature_rec["config"]
        features[name] = CategoricalFeature(
            name=name,
            feature_type=getattr(FeatureType, config.get("feature_type", "STRING_CATEGORICAL")),
            category_encoding=getattr(CategoryEncodingOptions, config.get("category_encoding", "EMBEDDING")),
            hash_bucket_size=config.get("hash_bucket_size"),
            hash_with_embedding=config.get("hash_with_embedding", False),
            embedding_size=config.get("embedding_size"),
            salt=config.get("salt")
        )

# 4. Create and build the preprocessing model
model = PreprocessingModel(
    path_data="product_dataset.csv",
    features_specs=features
)
preprocessor = model.build_preprocessor()
```
## 🔧 Real-World Examples

### E-commerce Product Categorization

```python
# E-commerce features with hierarchical categories
preprocessor = PreprocessingModel(
    path_data="products.csv",
    features_specs={
        # Main category, subcategory, and detailed category
        "main_category": FeatureType.STRING_CATEGORICAL,
        "subcategory": FeatureType.STRING_CATEGORICAL,
        "detailed_category": FeatureType.STRING_CATEGORICAL,

        # Product attributes as multi-categories
        "product_features": CategoricalFeature(
            name="product_features",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            separator="|",
            multi_hot=True
        ),

        # Brand as a high-cardinality feature
        "brand": CategoricalFeature(
            name="brand",
            feature_type=FeatureType.STRING_CATEGORICAL,
            embedding_dim=16,
            vocabulary_size=1000  # Top 1000 brands
        )
    }
)
```
### Content Recommendation System

```python
# Content recommendation with user and item features
preprocessor = PreprocessingModel(
    path_data="interaction_data.csv",
    features_specs={
        # User features
        "user_id": CategoricalFeature(
            name="user_id",
            feature_type=FeatureType.STRING_HASHED,
            hash_bucket_size=10000
        ),
        "user_interests": CategoricalFeature(
            name="user_interests",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            embedding_dim=32,
            separator=","
        ),

        # Content features
        "content_id": CategoricalFeature(
            name="content_id",
            feature_type=FeatureType.STRING_HASHED,
            hash_bucket_size=5000
        ),
        "content_tags": CategoricalFeature(
            name="content_tags",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            embedding_dim=24,
            separator="|"
        ),
        "content_type": FeatureType.STRING_CATEGORICAL
    }
)
```
## 🏆 Pro Tips

### 📏 Choose Embedding Dimensions Wisely

For simple categories with few values (2-10), use 4-8 dimensions. For complex categories with many values (100+), use 16-64 dimensions. The more complex the relationships between categories, the more dimensions you need.
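If you want a programmatic starting point, one widely used rule of thumb (an assumption here, not part of KDP's API) scales the dimension slowly with vocabulary size:

```python
def suggest_embedding_dim(n_categories: int, lo: int = 4, hi: int = 64) -> int:
    """Rule-of-thumb embedding size that grows slowly with vocabulary size."""
    dim = round(1.6 * n_categories ** 0.56)  # common empirical heuristic
    return max(lo, min(hi, dim))

print(suggest_embedding_dim(5))    # small category set  -> 4
print(suggest_embedding_dim(500))  # larger category set -> 52
```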
### ⚡ Pre-train Embeddings

KDP allows you to initialize embeddings with pre-trained vectors for faster convergence:

```python
# Create an initial embeddings dictionary
pretrained = {
    "sports": [0.1, 0.2, 0.3, 0.4],
    "music": [0.5, 0.6, 0.7, 0.8]
}

# Use the pre-trained embeddings
category_feature = CategoricalFeature(
    name="interest",
    feature_type=FeatureType.STRING_CATEGORICAL,
    embedding_dim=4,  # Must match the length of the pre-trained vectors
    pretrained_embeddings=pretrained
)
```
### 🔄 Combine Multiple Encoding Strategies

For critical features, consider using both embeddings and one-hot encoding in parallel:

```python
# Main feature with an embedding
features["product_type"] = CategoricalFeature(
    name="product_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    use_embedding=True
)

# The same underlying column again, one-hot encoded
features["product_type_onehot"] = CategoricalFeature(
    name="product_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    use_embedding=False
)
```
### 🆕 Handling Unknown Categories

Configure how KDP handles previously unseen categories in production:

```python
feature = CategoricalFeature(
    name="store_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    unknown_token="<NEW_STORE>",  # Custom token for unseen values
    oov_buckets=5                 # Spread unseen values across 5 embeddings
)
```
## 📚 Understanding Categorical Embeddings

Categorical embeddings transform categorical values into dense vector representations that capture semantic relationships between categories.
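Conceptually, this is the same mechanism as a Keras `Embedding` layer: each category is assigned an integer index, and that index selects a trainable row of a weight matrix. A minimal standalone illustration (plain Keras, not KDP's internal code):

```python
import tensorflow as tf

# Map raw category strings to integer indices, then to dense trainable vectors.
lookup = tf.keras.layers.StringLookup(vocabulary=["shirt", "pants", "shoes"])
embed = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(), output_dim=8)

indices = lookup(tf.constant(["pants", "shoes", "sandals"]))  # "sandals" -> OOV index 0
vectors = embed(indices)
print(vectors.shape)  # (3, 8): one 8-dimensional vector per input category
```

Because these vectors are trained jointly with the rest of the model, categories that behave similarly end up close together in the embedding space.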