
๐Ÿท๏ธ Categorical Features

Categorical Features in KDP

Learn how to effectively represent categories, leverage embeddings, and handle high-cardinality data.

📋 Overview

Categorical features represent data that belongs to a finite set of possible values or categories. KDP provides advanced techniques for handling categorical data, from simple encoding to neural embeddings that capture semantic relationships between categories.

🚀 Types of Categorical Features

| Feature Type | Best For | Example | When to Use |
|---|---|---|---|
| STRING_CATEGORICAL | Text categories | product_type: "shirt", "pants", "shoes" | When categories are text strings |
| INTEGER_CATEGORICAL | Numeric categories | education_level: 1, 2, 3, 4 | When categories are already represented as integers |
| STRING_HASHED | High-cardinality sets | user_id: "user_12345", "user_67890" | When there are too many unique categories (>10K) |
| MULTI_CATEGORICAL | Multiple categories per sample | interests: ["sports", "music", "travel"] | When each sample can belong to multiple categories |

๐Ÿ“ Basic Usage

from kdp import PreprocessingModel, FeatureType

# Simple categorical features
features = {
    "product_category": FeatureType.STRING_CATEGORICAL,
    "store_id": FeatureType.INTEGER_CATEGORICAL,
    "tags": FeatureType.MULTI_CATEGORICAL
}

preprocessor = PreprocessingModel(
    path_data="product_data.csv",
    features_specs=features
)
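
With the features declared, build the preprocessor so KDP can fit vocabularies from the data. A minimal sketch based on the build_preprocessor() call used later on this page; the "model" key on the result is an assumption about the returned dict, not a documented guarantee:

# Build the preprocessing model (fits category vocabularies from the data)
result = preprocessor.build_preprocessor()

# Assumption: the result exposes the built Keras model under a "model" key
keras_preprocessor = result["model"]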

🧠 Advanced Configuration

For more control over categorical processing, use the detailed configuration:

from kdp import PreprocessingModel, FeatureType, CategoricalFeature

# Detailed configuration
features = {
    # Basic configuration
    "product_type": FeatureType.STRING_CATEGORICAL,

    # Full configuration with explicit CategoricalFeature
    "store_location": CategoricalFeature(
        name="store_location",
        feature_type=FeatureType.STRING_CATEGORICAL,
        embedding_dim=16,                  # Size of embedding vector
        hash_bucket_size=1000,             # For hashed features
        vocabulary_size=250,               # Limit vocabulary size
        use_embedding=True,                # Use neural embeddings
        unknown_token="<UNK>",             # Token for out-of-vocabulary values
        oov_buckets=10,                    # Out-of-vocabulary buckets
        multi_hot=False                    # For single category per sample
    ),

    # High-cardinality feature using hashing
    "product_id": CategoricalFeature(
        name="product_id",
        feature_type=FeatureType.STRING_HASHED,
        hash_bucket_size=5000
    ),

    # Multi-categorical feature with separator
    "product_tags": CategoricalFeature(
        name="product_tags",
        feature_type=FeatureType.MULTI_CATEGORICAL,
        separator=",",                     # How values are separated in data
        multi_hot=True                     # Enable multi-hot encoding
    )
}

preprocessor = PreprocessingModel(
    path_data="product_data.csv",
    features_specs=features
)

โš™๏ธ Key Configuration Parameters

| Parameter | Description | Default | Notes |
|---|---|---|---|
| embedding_dim | Size of embedding vectors | 8 | Higher values capture more complex relationships (8-128) |
| hash_bucket_size | Number of hash buckets for hashed features | 1000 | Larger values reduce collisions but increase dimensionality |
| salt | Salt value for the hash function | None | Custom salt to make hash values unique across features |
| hash_with_embedding | Apply embedding after hashing | False | Combines hashing with embeddings for large vocabularies |
| vocabulary_size | Maximum number of categories to keep | None | None uses all categories; otherwise keeps the top N by frequency |
| use_embedding | Enable neural embeddings vs. one-hot encoding | True | Neural embeddings improve performance for most models |
| separator | Character that separates values in multi-categorical features | "," | Only used for MULTI_CATEGORICAL features |
| oov_buckets | Number of buckets for out-of-vocabulary values | 1 | Higher values help handle new categories in production |
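
KDP builds on Keras preprocessing layers, so vocabulary_size and oov_buckets behave much like a capped Keras StringLookup with multiple OOV indices. A sketch of that analogy in plain TensorFlow (an illustration, not KDP's internal code):

import tensorflow as tf

# A capped vocabulary with 2 OOV slots, analogous to vocabulary_size + oov_buckets
lookup = tf.keras.layers.StringLookup(
    vocabulary=["shirt", "pants", "shoes"],  # the top-N categories kept by frequency
    num_oov_indices=2,                       # unseen values hash into one of 2 OOV slots
)

# In-vocabulary values get stable indices; new values land in the OOV range
print(lookup(tf.constant(["shirt", "hat"])).numpy())  # "hat" maps to an OOV index (0 or 1)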

💡 Powerful Features

🧿 Embedding Visualizations

KDP's categorical embeddings can be visualized to see relationships between categories:

# Build the preprocessor (fits vocabularies and embeddings from the data)
result = preprocessor.build_preprocessor()

# Extract embeddings for visualization
embeddings = preprocessor.get_feature_embeddings("product_category")

# Visualize with t-SNE or UMAP
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Note: t-SNE's perplexity must be below the number of categories;
# lower it (e.g. TSNE(n_components=2, perplexity=5)) for small vocabularies
tsne = TSNE(n_components=2)
embeddings_2d = tsne.fit_transform(embeddings)

plt.figure(figsize=(10, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])
plt.title("Category Embedding Visualization")
plt.show()

๐ŸŒ Handling High-Cardinality

KDP provides multiple strategies for dealing with features that have many unique values:

from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions

# Method 1: Limit vocabulary size (keeps the most frequent categories)
user_id_limited = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    vocabulary_size=10000  # Keep top 10K users
)

# Method 2: Hash features to buckets (fast, fixed memory)
user_id_hashed = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=5000  # Hash into 5K buckets
)

# Method 3: Hash with embeddings (best balance)
user_id_hash_embed = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    hash_with_embedding=True,
    embedding_size=16
)
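
Under the hood, the hashing trick simply maps each raw value to a fixed bucket index, so memory stays constant no matter how many distinct values appear. A minimal illustration of the idea (not KDP's actual implementation), including the role of the salt:

import hashlib

def hash_bucket(value: str, num_buckets: int, salt: int = 0) -> int:
    """Map a categorical value to a fixed bucket via a salted, stable hash."""
    digest = hashlib.md5(f"{salt}:{value}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Different salts give different hash spaces, so two features hashing
# the same string won't produce the same bucket pattern
print(hash_bucket("user_12345", num_buckets=5000))           # some bucket in [0, 5000)
print(hash_bucket("user_12345", num_buckets=5000, salt=42))  # generally a different bucket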

🧮 Feature Hashing

Feature hashing transforms categorical values into a fixed-size vector representation, ideal for very high-cardinality features. It's now fully integrated with the ModelAdvisor for automatic configuration:

# Basic feature hashing
product_id = CategoricalFeature(
    name="product_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=1024  # Number of hash buckets
)

# Advanced feature hashing with custom salt
# The salt ensures different features use different hash spaces
session_id = CategoricalFeature(
    name="session_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    salt=42  # Custom salt value
)

# Feature hashing followed by embedding
user_id = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    hash_with_embedding=True,
    embedding_size=16  # Embedding dimension after hashing
)

🤖 Auto-configuration with ModelAdvisor

KDP's ModelAdvisor now intelligently recommends hashing for high-cardinality features:

from kdp.model_advisor import recommend_model_configuration
from kdp.stats import DatasetStatistics

# Analyze dataset statistics
stats_calculator = DatasetStatistics("high_cardinality_data.csv")
stats_calculator.compute_statistics()

# Get recommendations from ModelAdvisor
recommendations = recommend_model_configuration(stats_calculator.features_stats)

# The recommendations will include HASHING for high-cardinality features
# Example output:
'''
{
  "features": {
    "user_id": {
      "feature_type": "CategoricalFeature",
      "preprocessing": ["HASHING"],
      "config": {
        "category_encoding": "HASHING",
        "hash_bucket_size": 2048,
        "hash_with_embedding": true,
        "embedding_size": 16
      },
      "notes": ["High cardinality feature (10K+ values)", "Using hashing for efficiency"]
    },
    ...
  }
}
'''

# Generate code from recommendations
code_snippet = recommendations["code_snippet"]
print(code_snippet)

๐Ÿ” Choosing Between Encoding Options

KDP offers multiple encoding options for categorical features. Here's how to choose:

| Encoding | Vocabulary Size | Memory Usage | New Categories | Semantic Information |
|---|---|---|---|---|
| One-Hot Encoding | Small (< 50) | High | ❌ Requires retraining | ❌ No relationship capture |
| Embeddings | Medium (50-10K) | Medium | ⚠️ Limited by OOV handling | ✅ Captures relationships |
| Hashing | Very Large (10K+) | Low (fixed) | ✅ Handles new values | ❌ No relationship capture |
| Hashing with Embeddings | Very Large (10K+) | Low-Medium | ✅ Handles new values | ✅ Some relationship capture |

The ModelAdvisor analyzes your data and automatically recommends the optimal encoding based on these criteria.
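For intuition, the table's decision rule can be written down directly; choose_encoding below is illustrative only, not a KDP API:

def choose_encoding(n_unique: int) -> str:
    """Pick an encoding strategy from the cardinality thresholds in the table above."""
    if n_unique < 50:
        return "ONE_HOT"
    if n_unique <= 10_000:
        return "EMBEDDING"
    return "HASHING_WITH_EMBEDDING"  # best balance at very high cardinality

print(choose_encoding(12))       # ONE_HOT
print(choose_encoding(3500))     # EMBEDDING
print(choose_encoding(250000))   # HASHING_WITH_EMBEDDING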

# End-to-end example with automatic encoding selection
from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions
from kdp.processor import PreprocessingModel
from kdp.stats import DatasetStatistics
from kdp.model_advisor import recommend_model_configuration

# 1. Analyze dataset
stats = DatasetStatistics("product_dataset.csv")
stats.compute_statistics()

# 2. Get recommendations
recommendations = recommend_model_configuration(stats.features_stats)

# 3. Create preprocessing model using recommended config
features = {}
for name, feature_rec in recommendations["features"].items():
    if feature_rec["feature_type"] == "CategoricalFeature":
        # Extract configuration for this categorical feature
        config = feature_rec["config"]
        features[name] = CategoricalFeature(
            name=name,
            feature_type=getattr(FeatureType, config.get("feature_type", "STRING_CATEGORICAL")),
            category_encoding=getattr(CategoryEncodingOptions, config.get("category_encoding", "EMBEDDING")),
            hash_bucket_size=config.get("hash_bucket_size"),
            hash_with_embedding=config.get("hash_with_embedding", False),
            embedding_size=config.get("embedding_size"),
            salt=config.get("salt")
        )

# 4. Create and build the preprocessing model
model = PreprocessingModel(
    path_data="product_dataset.csv",
    features_specs=features
)
preprocessor = model.build_preprocessor()

🔧 Real-World Examples

E-commerce Product Categorization

# E-commerce features with hierarchical categories
preprocessor = PreprocessingModel(
    path_data="products.csv",
    features_specs={
        # Main category, subcategory, and detailed category
        "main_category": FeatureType.STRING_CATEGORICAL,
        "subcategory": FeatureType.STRING_CATEGORICAL,
        "detailed_category": FeatureType.STRING_CATEGORICAL,

        # Product attributes as multi-categories
        "product_features": CategoricalFeature(
            name="product_features",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            separator="|",
            multi_hot=True
        ),

        # Brand as a high-cardinality feature
        "brand": CategoricalFeature(
            name="brand",
            feature_type=FeatureType.STRING_CATEGORICAL,
            embedding_dim=16,
            vocabulary_size=1000  # Top 1000 brands
        )
    }
)

Content Recommendation System

# Content recommendation with user and item features
preprocessor = PreprocessingModel(
    path_data="interaction_data.csv",
    features_specs={
        # User features
        "user_id": CategoricalFeature(
            name="user_id",
            feature_type=FeatureType.STRING_HASHED,
            hash_bucket_size=10000
        ),
        "user_interests": CategoricalFeature(
            name="user_interests",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            embedding_dim=32,
            separator=","
        ),

        # Content features
        "content_id": CategoricalFeature(
            name="content_id",
            feature_type=FeatureType.STRING_HASHED,
            hash_bucket_size=5000
        ),
        "content_tags": CategoricalFeature(
            name="content_tags",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            embedding_dim=24,
            separator="|"
        ),
        "content_type": FeatureType.STRING_CATEGORICAL
    }
)

💎 Pro Tips

๐Ÿ” Choose Embedding Dimensions Wisely

For simple categories with few values (2-10), use 4-8 dimensions. For complex categories with many values (100+), use 16-64 dimensions. The more complex the relationships between categories, the more dimensions you need.
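One widely used rule of thumb (a heuristic starting point, not a KDP API) scales the dimension sub-linearly with the vocabulary size:

def suggest_embedding_dim(n_categories: int, max_dim: int = 64) -> int:
    """Heuristic: dimension grows roughly with n_categories ** 0.56, capped at max_dim."""
    return int(min(max_dim, max(4, round(1.6 * n_categories ** 0.56))))

print(suggest_embedding_dim(10))    # ~6: small vocabularies need few dimensions
print(suggest_embedding_dim(5000))  # 64: large vocabularies hit the cap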

⚡ Pre-train Embeddings

KDP allows you to initialize embeddings with pre-trained vectors for faster convergence:

# Create initial embeddings dictionary
pretrained = {
    "sports": [0.1, 0.2, 0.3, 0.4],
    "music": [0.5, 0.6, 0.7, 0.8]
}

# Use pre-trained embeddings
category_feature = CategoricalFeature(
    name="interest",
    feature_type=FeatureType.STRING_CATEGORICAL,
    embedding_dim=4,
    pretrained_embeddings=pretrained
)

🌀 Combine Multiple Encoding Strategies

For critical features, consider using both embeddings and one-hot encoding in parallel:

# Main feature with embedding
features["product_type"] = CategoricalFeature(
    name="product_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    use_embedding=True
)

# Same feature with one-hot encoding
features["product_type_onehot"] = CategoricalFeature(
    name="product_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    use_embedding=False
)

🔄 Handling Unknown Categories

Configure how KDP handles previously unseen categories in production:

feature = CategoricalFeature(
    name="store_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    unknown_token="<NEW_STORE>",  # Custom token
    oov_buckets=5                 # Use 5 different embeddings
)

📊 Understanding Categorical Embeddings

graph TD
    A[Raw Category Data] -->|Vocabulary Creation| B[Category Vocabulary]
    B -->|Lookup| C[Integer Indices]
    C -->|Embedding Layer| D[Dense Vectors]
    style A fill:#f9f9f9,stroke:#ccc,stroke-width:2px
    style B fill:#e1f5fe,stroke:#4fc3f7,stroke-width:2px
    style C fill:#e8f5e9,stroke:#66bb6a,stroke-width:2px
    style D fill:#f3e5f5,stroke:#ce93d8,stroke-width:2px

Categorical embeddings transform categorical values into dense vector representations that capture semantic relationships between categories.
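
The flow in the diagram maps directly onto standard Keras layers. A minimal sketch of the same pipeline in plain TensorFlow (illustrative, not KDP internals):

import tensorflow as tf

# Vocabulary creation + lookup: raw string -> integer index
lookup = tf.keras.layers.StringLookup(vocabulary=["sports", "music", "travel"])

# Embedding layer: integer index -> dense vector;
# input_dim covers the vocabulary plus the default OOV index
embedding = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(), output_dim=4)

dense_vectors = embedding(lookup(tf.constant(["music", "sports"])))
print(dense_vectors.shape)  # (2, 4): one dense vector per category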