# 🏷️ Categorical Features in KDP

Learn how to effectively represent categories, leverage embeddings, and handle high-cardinality data.
## 📋 Overview
Categorical features represent data that belongs to a finite set of possible values or categories. KDP provides advanced techniques for handling categorical data, from simple encoding to neural embeddings that capture semantic relationships between categories.
## 📊 Types of Categorical Features

| Feature Type | Best For | Example | When to Use |
|---|---|---|---|
| `STRING_CATEGORICAL` | Text categories | `product_type: "shirt", "pants", "shoes"` | When categories are text strings |
| `INTEGER_CATEGORICAL` | Numeric categories | `education_level: 1, 2, 3, 4` | When categories are already represented as integers |
| `STRING_HASHED` | High-cardinality sets | `user_id: "user_12345", "user_67890"` | When there are too many unique categories (>10K) |
| `MULTI_CATEGORICAL` | Multiple categories per sample | `interests: ["sports", "music", "travel"]` | When each sample can belong to multiple categories |
## 🚀 Basic Usage

```python
from kdp import PreprocessingModel, FeatureType

# Simple categorical features
features = {
    "product_category": FeatureType.STRING_CATEGORICAL,
    "store_id": FeatureType.INTEGER_CATEGORICAL,
    "tags": FeatureType.MULTI_CATEGORICAL,
}

preprocessor = PreprocessingModel(
    path_data="product_data.csv",
    features_specs=features
)
```
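Once the feature specs are declared, building the preprocessor is a single call (the full pipeline is shown in the end-to-end example later in this guide):

```python
# Build the preprocessing model from the declared feature specs
result = preprocessor.build_preprocessor()
```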
## 🧠 Advanced Configuration

For more control over categorical processing, use the detailed configuration:

```python
from kdp import PreprocessingModel, FeatureType, CategoricalFeature

# Detailed configuration
features = {
    # Basic configuration
    "product_type": FeatureType.STRING_CATEGORICAL,

    # Full configuration with an explicit CategoricalFeature
    "store_location": CategoricalFeature(
        name="store_location",
        feature_type=FeatureType.STRING_CATEGORICAL,
        embedding_dim=16,        # Size of embedding vector
        hash_bucket_size=1000,   # For hashed features
        vocabulary_size=250,     # Limit vocabulary size
        use_embedding=True,      # Use neural embeddings
        unknown_token="<UNK>",   # Token for out-of-vocabulary values
        oov_buckets=10,          # Out-of-vocabulary buckets
        multi_hot=False          # Single category per sample
    ),

    # High-cardinality feature using hashing
    "product_id": CategoricalFeature(
        name="product_id",
        feature_type=FeatureType.STRING_HASHED,
        hash_bucket_size=5000
    ),

    # Multi-categorical feature with a separator
    "product_tags": CategoricalFeature(
        name="product_tags",
        feature_type=FeatureType.MULTI_CATEGORICAL,
        separator=",",   # How values are separated in the data
        multi_hot=True   # Enable multi-hot encoding
    ),
}

preprocessor = PreprocessingModel(
    path_data="product_data.csv",
    features_specs=features
)
```
## ⚙️ Key Configuration Parameters

| Parameter | Description | Default | Notes |
|---|---|---|---|
| `embedding_dim` | Size of embedding vectors | 8 | Higher values capture more complex relationships (8-128) |
| `hash_bucket_size` | Number of hash buckets for hashed features | 1000 | Larger values reduce collisions but increase dimensionality |
| `salt` | Salt value for the hash function | None | Custom salt to make hash values unique across features |
| `hash_with_embedding` | Apply an embedding after hashing | False | Combines hashing with embeddings for large vocabularies |
| `vocabulary_size` | Maximum number of categories to keep | None | None uses all categories; otherwise keeps the top N by frequency |
| `use_embedding` | Use neural embeddings instead of one-hot encoding | True | Neural embeddings improve performance for most models |
| `separator` | Character that separates values in multi-categorical features | "," | Only used for MULTI_CATEGORICAL features |
| `oov_buckets` | Number of buckets for out-of-vocabulary values | 1 | Higher values help handle new categories in production |
## 💡 Powerful Features

### 🧿 Embedding Visualizations

KDP's categorical embeddings can be visualized to see relationships between categories:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Train the preprocessor
preprocessor.fit()
result = preprocessor.build_preprocessor()

# Extract embeddings for visualization
embeddings = preprocessor.get_feature_embeddings("product_category")

# Visualize with t-SNE (UMAP works equally well)
tsne = TSNE(n_components=2)
embeddings_2d = tsne.fit_transform(embeddings)

plt.figure(figsize=(10, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])
plt.title("Category Embedding Visualization")
plt.show()
```
### 📈 Handling High-Cardinality

KDP provides multiple strategies for dealing with features that have many unique values:

```python
from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions

# Method 1: Limit vocabulary size (keeps the most frequent categories)
user_id_limited = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    vocabulary_size=10000  # Keep top 10K users
)

# Method 2: Hash features into buckets (fast, fixed memory)
user_id_hashed = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=5000  # Hash into 5K buckets
)

# Method 3: Hash, then embed the buckets (best balance)
user_id_hash_embed = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    hash_with_embedding=True,
    embedding_size=16
)
```
### 🧮 Feature Hashing

Feature hashing transforms categorical values into a fixed-size vector representation, which is ideal for very high-cardinality features. It is now fully integrated with the ModelAdvisor for automatic configuration:

```python
# Basic feature hashing
product_id = CategoricalFeature(
    name="product_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=1024  # Number of hash buckets
)

# Feature hashing with a custom salt.
# The salt ensures different features use different hash spaces.
session_id = CategoricalFeature(
    name="session_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    salt=42  # Custom salt value
)

# Feature hashing followed by an embedding
user_id = CategoricalFeature(
    name="user_id",
    feature_type=FeatureType.STRING_CATEGORICAL,
    category_encoding=CategoryEncodingOptions.HASHING,
    hash_bucket_size=2048,
    hash_with_embedding=True,
    embedding_size=16  # Embedding dimension after hashing
)
```
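To build intuition for what hashing does under the hood, here is a minimal standalone sketch in plain TensorFlow (illustrative only, not KDP's internal code), showing how strings map deterministically to buckets and how a salt shifts the mapping:

```python
import tensorflow as tf

ids = tf.constant(["user_12345", "user_67890", "user_99999"])

# Each string is deterministically mapped to one of N buckets,
# with no vocabulary to store -- memory usage stays fixed.
buckets = tf.strings.to_hash_bucket_fast(ids, num_buckets=2048)

# Keyed ("salted") hashing: the same strings land in different buckets,
# so two features hashed with different salts don't share collision patterns.
salted = tf.strings.to_hash_bucket_strong(ids, num_buckets=2048, key=[42, 42])

print(buckets.numpy())
print(salted.numpy())
```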
### 🤖 Auto-configuration with ModelAdvisor

KDP's ModelAdvisor now intelligently recommends hashing for high-cardinality features:

```python
from kdp.model_advisor import recommend_model_configuration
from kdp.stats import DatasetStatistics

# Analyze dataset statistics
stats_calculator = DatasetStatistics("high_cardinality_data.csv")
stats_calculator.compute_statistics()

# Get recommendations from the ModelAdvisor
recommendations = recommend_model_configuration(stats_calculator.features_stats)

# The recommendations will include HASHING for high-cardinality features.
# Example output:
'''
{
  "features": {
    "user_id": {
      "feature_type": "CategoricalFeature",
      "preprocessing": ["HASHING"],
      "config": {
        "category_encoding": "HASHING",
        "hash_bucket_size": 2048,
        "hash_with_embedding": true,
        "embedding_size": 16
      },
      "notes": ["High cardinality feature (10K+ values)", "Using hashing for efficiency"]
    },
    ...
  }
}
'''

# Generate code from the recommendations
code_snippet = recommendations["code_snippet"]
print(code_snippet)
```
### 🔍 Choosing Between Encoding Options

KDP offers multiple encoding options for categorical features. Here's how to choose:

| Encoding | Vocabulary Size | Memory Usage | New Categories | Semantic Information |
|---|---|---|---|---|
| One-Hot Encoding | Small (< 50) | High | ❌ Requires retraining | ❌ No relationship capture |
| Embeddings | Medium (50-10K) | Medium | ⚠️ Limited by OOV handling | ✅ Captures relationships |
| Hashing | Very Large (10K+) | Low (fixed) | ✅ Handles new values | ❌ No relationship capture |
| Hashing with Embeddings | Very Large (10K+) | Low-Medium | ✅ Handles new values | ✅ Some relationship capture |

The ModelAdvisor analyzes your data and automatically recommends the optimal encoding based on these criteria.
```python
# End-to-end example with automatic encoding selection
from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions
from kdp.processor import PreprocessingModel
from kdp.stats import DatasetStatistics
from kdp.model_advisor import recommend_model_configuration

# 1. Analyze the dataset
stats = DatasetStatistics("product_dataset.csv")
stats.compute_statistics()

# 2. Get recommendations
recommendations = recommend_model_configuration(stats.features_stats)

# 3. Create feature specs from the recommended config
features = {}
for name, feature_rec in recommendations["features"].items():
    if feature_rec["feature_type"] == "CategoricalFeature":
        # Extract the configuration for this categorical feature
        config = feature_rec["config"]
        features[name] = CategoricalFeature(
            name=name,
            feature_type=getattr(FeatureType, config.get("feature_type", "STRING_CATEGORICAL")),
            category_encoding=getattr(CategoryEncodingOptions, config.get("category_encoding", "EMBEDDING")),
            hash_bucket_size=config.get("hash_bucket_size"),
            hash_with_embedding=config.get("hash_with_embedding", False),
            embedding_size=config.get("embedding_size"),
            salt=config.get("salt")
        )

# 4. Create and build the preprocessing model
model = PreprocessingModel(
    path_data="product_dataset.csv",
    features_specs=features
)
preprocessor = model.build_preprocessor()
```
## 🔧 Real-World Examples

### E-commerce Product Categorization

```python
# E-commerce features with hierarchical categories
preprocessor = PreprocessingModel(
    path_data="products.csv",
    features_specs={
        # Main category, subcategory, and detailed category
        "main_category": FeatureType.STRING_CATEGORICAL,
        "subcategory": FeatureType.STRING_CATEGORICAL,
        "detailed_category": FeatureType.STRING_CATEGORICAL,

        # Product attributes as multi-categories
        "product_features": CategoricalFeature(
            name="product_features",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            separator="|",
            multi_hot=True
        ),

        # Brand as a high-cardinality feature
        "brand": CategoricalFeature(
            name="brand",
            feature_type=FeatureType.STRING_CATEGORICAL,
            embedding_dim=16,
            vocabulary_size=1000  # Top 1000 brands
        )
    }
)
```
### Content Recommendation System

```python
# Content recommendation with user and item features
preprocessor = PreprocessingModel(
    path_data="interaction_data.csv",
    features_specs={
        # User features
        "user_id": CategoricalFeature(
            name="user_id",
            feature_type=FeatureType.STRING_HASHED,
            hash_bucket_size=10000
        ),
        "user_interests": CategoricalFeature(
            name="user_interests",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            embedding_dim=32,
            separator=","
        ),

        # Content features
        "content_id": CategoricalFeature(
            name="content_id",
            feature_type=FeatureType.STRING_HASHED,
            hash_bucket_size=5000
        ),
        "content_tags": CategoricalFeature(
            name="content_tags",
            feature_type=FeatureType.MULTI_CATEGORICAL,
            embedding_dim=24,
            separator="|"
        ),
        "content_type": FeatureType.STRING_CATEGORICAL
    }
)
```
## 🏆 Pro Tips

### 📏 Choose Embedding Dimensions Wisely

For simple categories with few values (2-10), use 4-8 dimensions. For complex categories with many values (100+), use 16-64 dimensions. The more complex the relationships between categories, the more dimensions you need.
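If you want a programmatic starting point, one widely used rule of thumb (an assumption here, not part of KDP's API) scales the dimension slowly with vocabulary size:

```python
def suggest_embedding_dim(n_categories: int, lo: int = 4, hi: int = 64) -> int:
    """Rule-of-thumb embedding size that grows slowly with vocabulary size."""
    dim = round(1.6 * n_categories ** 0.56)  # common empirical heuristic
    return max(lo, min(hi, dim))

print(suggest_embedding_dim(5))    # small category set  -> 4
print(suggest_embedding_dim(500))  # larger category set -> 52
```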
### ⚡ Pre-train Embeddings

KDP allows you to initialize embeddings with pre-trained vectors for faster convergence:

```python
# Create an initial embeddings dictionary
pretrained = {
    "sports": [0.1, 0.2, 0.3, 0.4],
    "music": [0.5, 0.6, 0.7, 0.8]
}

# Use the pre-trained embeddings
category_feature = CategoricalFeature(
    name="interest",
    feature_type=FeatureType.STRING_CATEGORICAL,
    embedding_dim=4,  # Must match the length of the pre-trained vectors
    pretrained_embeddings=pretrained
)
```
### 🔄 Combine Multiple Encoding Strategies

For critical features, consider using both embeddings and one-hot encoding in parallel:

```python
# Main feature with an embedding
features["product_type"] = CategoricalFeature(
    name="product_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    use_embedding=True
)

# The same underlying column again, one-hot encoded
features["product_type_onehot"] = CategoricalFeature(
    name="product_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    use_embedding=False
)
```
### 🆕 Handling Unknown Categories

Configure how KDP handles previously unseen categories in production:

```python
feature = CategoricalFeature(
    name="store_type",
    feature_type=FeatureType.STRING_CATEGORICAL,
    unknown_token="<NEW_STORE>",  # Custom token for unseen values
    oov_buckets=5                 # Spread unseen values across 5 embeddings
)
```
## 📚 Understanding Categorical Embeddings

Categorical embeddings transform categorical values into dense vector representations that capture semantic relationships between categories.
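Conceptually, this is the same mechanism as a Keras `Embedding` layer: each category is assigned an integer index, and that index selects a trainable row of a weight matrix. A minimal standalone illustration (plain Keras, not KDP's internal code):

```python
import tensorflow as tf

# Map raw category strings to integer indices, then to dense trainable vectors.
lookup = tf.keras.layers.StringLookup(vocabulary=["shirt", "pants", "shoes"])
embed = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(), output_dim=8)

indices = lookup(tf.constant(["pants", "shoes", "sandals"]))  # "sandals" -> OOV index 0
vectors = embed(indices)
print(vectors.shape)  # (3, 8): one 8-dimensional vector per input category
```

Because these vectors are trained jointly with the rest of the model, categories that behave similarly end up close together in the embedding space.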