# Categorical Feature Hashing Example
This example demonstrates how to use feature hashing for categorical variables in the KDP library.
## What is Categorical Feature Hashing?
Feature hashing (also known as the "hashing trick") is a technique used to transform high-cardinality categorical features into a fixed-size vector representation. It's particularly useful for:
- Handling categorical features with very large numbers of unique values
- Dealing with previously unseen categories at inference time
- Reducing memory usage for high-cardinality features
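To make the idea concrete, here is a minimal, library-agnostic sketch of the hashing trick (KDP's internal implementation differs, but the principle is identical): a stable hash maps any string, including one never seen during training, into a fixed range of bucket indices.

```python
import hashlib

# Minimal sketch of the hashing trick. Python's built-in hash() is salted
# per process, so a stable digest is used here for reproducibility.
def hash_bucket(value: str, num_buckets: int = 8) -> int:
    """Map an arbitrary string to one of num_buckets fixed indices."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

for value in ["electronics", "clothing", "a_brand_new_category"]:
    # Unseen values still map to a valid bucket - no vocabulary needed
    print(value, "->", hash_bucket(value))
```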
## When to Use Hashing vs. Embeddings or One-Hot Encoding
- One-Hot Encoding: Best for low-cardinality features (typically <10 categories)
- Embeddings: Good for medium-cardinality features where the relationships between categories are important
- Hashing: Ideal for high-cardinality features (hundreds or thousands of unique values)
## Basic Example
```python
from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions
from kdp.processor import PreprocessingModel

# Define a categorical feature with hashing
features = {
    "high_cardinality_feature": CategoricalFeature(
        name="high_cardinality_feature",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.HASHING,
        hash_bucket_size=1024  # Number of hash buckets
    )
}

# Create a preprocessing model with the features
model = PreprocessingModel(features_specs=features)
```
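To materialize the pipeline, build the preprocessor from the model (the same call used in the end-to-end example at the end of this page; a full pipeline typically also passes `path_data` so dataset statistics can be computed first):

```python
# Build the preprocessing model from the feature specs above
preprocessor = model.build_preprocessor()
```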
## Advanced Hashing Options
### Hash with Embeddings
You can combine hashing with embeddings to reduce dimensionality further:
```python
from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions

features = {
    "hashed_with_embedding": CategoricalFeature(
        name="hashed_with_embedding",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.HASHING,
        hash_bucket_size=512,      # Number of hash buckets
        hash_with_embedding=True,  # Enable embedding layer after hashing
        embedding_size=16          # Size of the embedding vectors
    )
}
```
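A useful way to think about this configuration: with `hash_bucket_size=512` and `embedding_size=16`, the trainable embedding table holds 512 × 16 = 8,192 parameters no matter how many distinct raw values the feature takes, whereas a conventional embedding over, say, 100,000 raw values at the same width would need 100,000 × 16 = 1,600,000 parameters.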
### Custom Hash Salt
Adding a salt value changes how values are hashed, so two features that contain the same raw values (for example, overlapping ID spaces) do not end up with identical bucket assignments and collision patterns:
```python
features = {
    "product_id": CategoricalFeature(
        name="product_id",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.HASHING,
        hash_bucket_size=2048,
        salt=42  # Custom salt value for hashing
    )
}
```
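KDP runs on top of TensorFlow, and the effect of a salt is easy to see with the standalone `tf.keras.layers.Hashing` layer (shown purely as an illustration of the behavior, not of KDP's internals): the same raw value lands in different buckets under different salts.

```python
import tensorflow as tf

# Two hashers over the same bucket range, differing only in their salt
hash_a = tf.keras.layers.Hashing(num_bins=2048, salt=1)
hash_b = tf.keras.layers.Hashing(num_bins=2048, salt=2)

values = tf.constant([["p123"], ["p456"]])
print(hash_a(values).numpy())  # bucket indices under salt=1
print(hash_b(values).numpy())  # generally different indices under salt=2
```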
## Comparison of Different Encoding Methods
```python
features = {
    # Small cardinality - one hot encoding
    "product_category": CategoricalFeature(
        name="product_category",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.ONE_HOT_ENCODING
    ),
    # Medium cardinality - embeddings
    "store_id": CategoricalFeature(
        name="store_id",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.EMBEDDING,
        embedding_size=8
    ),
    # High cardinality - hashing
    "customer_id": CategoricalFeature(
        name="customer_id",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.HASHING,
        hash_bucket_size=1024
    ),
    # Very high cardinality - hashing with embedding
    "product_id": CategoricalFeature(
        name="product_id",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.HASHING,
        hash_bucket_size=2048,
        hash_with_embedding=True,
        embedding_size=16
    )
}
```
## Automatic Configuration with ModelAdvisor
KDP's `ModelAdvisor` can automatically determine the best encoding strategy for each feature based on data statistics:
```python
from kdp.stats import DatasetStatistics
from kdp.model_advisor import recommend_model_configuration

# First, analyze your dataset
dataset_stats = DatasetStatistics("e_commerce_data.csv")
dataset_stats.compute_statistics()

# Get recommendations from the ModelAdvisor
recommendations = recommend_model_configuration(dataset_stats.features_stats)

# Print feature-specific recommendations
for feature, config in recommendations["features"].items():
    if "HASHING" in config.get("preprocessing", []):
        print(f"Feature '{feature}' recommended for hashing:")
        print(f"  - Hash bucket size: {config['config'].get('hash_bucket_size')}")
        print(f"  - Use embeddings: {config['config'].get('hash_with_embedding', False)}")
        if config['config'].get('hash_with_embedding'):
            print(f"  - Embedding size: {config['config'].get('embedding_size')}")
        print(f"  - Salt value: {config['config'].get('salt')}")
        print(f"  - Notes: {', '.join(config.get('notes', []))}")
        print()

# Generate ready-to-use code
print("Generated code snippet:")
print(recommendations["code_snippet"])
```
The ModelAdvisor uses these heuristics for categorical features:
- For features with <50 unique values: ONE_HOT_ENCODING
- For features with 50-1000 unique values: EMBEDDING
- For features with >1000 unique values: HASHING
- For features with >10,000 unique values: HASHING with embeddings
It also automatically determines:
- The appropriate hash bucket size based on cardinality
- Whether to add salt values to prevent collisions
- Embedding dimensions when using hash_with_embedding=True
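These thresholds amount to a simple selection rule. The sketch below restates the documented heuristics as code; it is an illustration, not the ModelAdvisor's actual implementation:

```python
def recommend_encoding(num_unique: int) -> dict:
    """Illustrative restatement of the cardinality heuristics above."""
    if num_unique < 50:
        return {"encoding": "ONE_HOT_ENCODING"}
    if num_unique <= 1_000:
        return {"encoding": "EMBEDDING"}
    if num_unique <= 10_000:
        return {"encoding": "HASHING"}
    return {"encoding": "HASHING", "hash_with_embedding": True}

print(recommend_encoding(25_000))  # {'encoding': 'HASHING', 'hash_with_embedding': True}
```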
## Choosing the Right Hash Bucket Size
The number of hash buckets is a critical parameter that affects model performance:
- Too few buckets: Many categories will hash to the same bucket (high collision rate)
- Too many buckets: Sparse representation that might not generalize well
A good rule of thumb is to use a bucket size that is 2-4 times the number of unique categories in your data.
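Expressed as a helper (illustrative only, not a KDP API; rounding up to a power of two is a common practical convention, not a requirement):

```python
def suggested_bucket_size(num_unique: int, factor: int = 3) -> int:
    """2-4x the observed cardinality, rounded up to the next power of two."""
    target = num_unique * factor
    size = 1
    while size < target:
        size *= 2
    return size

print(suggested_bucket_size(700))  # 4096: 700 * 3 = 2100, next power of two
```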
## Handling Hash Collisions
Hash collisions occur when different category values hash to the same bucket. There are two common strategies to mitigate this:
- Increase bucket size: Use more buckets to reduce collision probability
- Multi-hashing: Apply multiple hash functions and use all outputs:
```python
# Example of using multi-hash technique (available in advanced settings)
features = {
    "complex_id": CategoricalFeature(
        name="complex_id",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.HASHING,
        hash_bucket_size=1024,
        hash_with_embedding=True,
        multi_hash=True,      # Enable multiple hash functions
        num_hash_functions=3  # Number of hash functions to use
    )
}
```
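Before reaching for either strategy, you can estimate the expected collision rate directly. Assuming a uniform hash, the expected number of distinct buckets occupied by n categories across b buckets is b(1 − (1 − 1/b)^n), so roughly n minus that many categories end up sharing a bucket:

```python
def expected_collisions(num_categories: int, num_buckets: int) -> float:
    """Expected number of categories that land in an already-occupied
    bucket, assuming a uniformly distributed hash function."""
    occupied = num_buckets * (1 - (1 - 1 / num_buckets) ** num_categories)
    return num_categories - occupied

# With the ~2x rule of thumb: 1000 categories into 2048 buckets
print(round(expected_collisions(1000, 2048)))  # ~209, i.e. about 21% collide
```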
## Performance Considerations
Hashing is computationally efficient compared to maintaining a large vocabulary mapping, especially when:
- You have a very large number of unique categories
- New categories appear frequently in production
- Memory is constrained
Feature hashing trades off a small amount of accuracy (due to potential collisions) for significant efficiency gains with very high-cardinality features.
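As rough illustrative arithmetic (not measured numbers): a vocabulary lookup over 1,000,000 string IDs must store every string plus its index, and a full embedding at width 16 would add 1,000,000 × 16 = 16,000,000 parameters; hashing the same feature into 4,096 buckets stores no per-category state at all, and adding `hash_with_embedding=True` at width 16 costs only 4,096 × 16 = 65,536 parameters.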
## Complete End-to-End Example
Here's a complete example showing how to use feature hashing for e-commerce product data:
```python
import pandas as pd

from kdp.features import CategoricalFeature, FeatureType, CategoryEncodingOptions, NumericalFeature
from kdp.processor import PreprocessingModel

# Create sample e-commerce data
data = {
    "product_id": [f"p{i}" for i in range(1000)],                     # High cardinality
    "category": ["electronics", "clothing", "books", "home"] * 250,  # Low cardinality
    "store_id": [f"store_{i % 100}" for i in range(1000)],           # Medium cardinality
    "user_id": [f"user_{i % 10000}" for i in range(1000)],           # Very high cardinality
    "price": [i * 0.1 for i in range(1000)]                          # Numerical
}
df = pd.DataFrame(data)
df.to_csv("ecommerce.csv", index=False)

# Define features with appropriate encoding strategies
features = {
    # Low cardinality - one hot encoding
    "category": CategoricalFeature(
        name="category",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.ONE_HOT_ENCODING
    ),
    # Medium cardinality - embedding
    "store_id": CategoricalFeature(
        name="store_id",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.EMBEDDING,
        embedding_size=8
    ),
    # High cardinality - hashing
    "product_id": CategoricalFeature(
        name="product_id",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.HASHING,
        hash_bucket_size=2048,
        salt=1  # Use different salt values for different features
    ),
    # Very high cardinality - hashing with embedding
    "user_id": CategoricalFeature(
        name="user_id",
        feature_type=FeatureType.STRING_CATEGORICAL,
        category_encoding=CategoryEncodingOptions.HASHING,
        hash_bucket_size=4096,
        hash_with_embedding=True,
        embedding_size=16,
        salt=2  # Different salt to avoid collisions with product_id
    ),
    # Numerical feature
    "price": NumericalFeature(
        name="price",
        feature_type=FeatureType.FLOAT_NORMALIZED
    )
}

# Create and build the model
model = PreprocessingModel(
    path_data="ecommerce.csv",
    features_specs=features,
    output_mode="CONCAT"
)

# Build the preprocessor
preprocessor = model.build_preprocessor()

# Use the preprocessor for inference
input_data = {
    "category": ["electronics"],
    "store_id": ["store_42"],
    "product_id": ["p999"],   # Known product
    "user_id": ["user_new"],  # New user, not seen in training
    "price": [99.99]
}

# Process the data - note how hashing handles both known and unknown values
processed = preprocessor(input_data)
print("Output shape:", processed.shape)
```