Skip to content

Defining Features for Preprocessing 🌟

Customize the preprocessing pipeline by setting up a dictionary that maps feature names to their respective types, tailored to your specific requirements.

đŸ’¯ Numeric Features

Explore various methods to define numerical features tailored to your needs:

features_specs = {
    "feat1": "float",
    "feat2": "FLOAT",
    "feat3": "FLOAT_NORMALIZED",
    "feat3": "FLOAT_RESCALED",
    ...
}

Utilize predefined preprocessing configurations with FeatureType.

from kdp.features import FeatureType

features_specs = {
    "feat1": FeatureType.FLOAT_NORMALIZED,
    "feat2": FeatureType.FLOAT_RESCALED,
    ...
}

Available FeatureType options:

  • FLOAT
  • FLOAT_NORMALIZED
  • FLOAT_RESCALED
  • FLOAT_DISCRETIZED

Customize preprocessing by passing specific parameters to NumericalFeature.

from kdp.features import NumericalFeature

features_specs = {
    "feat3": NumericalFeature(
        name="feat3",
        feature_type=FeatureType.FLOAT_DISCRETIZED,
        bin_boundaries=[(1, 10)],
    ),
    "feat4": NumericalFeature(
        name="feat4",
        feature_type=FeatureType.FLOAT,
    ),
    ...
}

Here's how the numeric preprocessing pipeline looks:

Numeric Feature Pipeline

🐈‍âŦ› Categorical Features

Define categorical features flexibly:

features_specs = {
    "feat1": "INTEGER_CATEGORICAL",
    "feat2": "STRING_CATEGORICAL",
    "feat3": "string_categorical",
    ...
}

Leverage default configurations with FeatureType.

from kdp.features import FeatureType

features_specs = {
    "feat1": FeatureType.INTEGER_CATEGORICAL,
    "feat2": FeatureType.STRING_CATEGORICAL,
    ...
}

Available FeatureType options:

  • STRING_CATEGORICAL
  • INTEGER_CATEGORICAL

Tailor feature processing by specifying properties in CategoricalFeature.

from kdp.features
from kdp.features import CategoricalFeature

features_specs = {
    "feat1": CategoricalFeature(
        name="feat7",
        feature_type=FeatureType.INTEGER_CATEGORICAL,
        embedding_size=100,
    ),
    "feat2": CategoricalFeature(
        name="feat2",
        feature_type=FeatureType.STRING_CATEGORICAL,
    ),
    ...
}

See how the categorical preprocessing pipeline appears:

Categorical Feature Pipeline

📝 Text Features

Customize text features in multiple ways to fit your project's demands:

features_specs = {
    "feat1": "text",
    "feat2": "TEXT",
    ...
}

Use FeatureType for automatic default preprocessing setups.

from kdp.features import FeatureType

features_specs = {
    "feat1": FeatureType.TEXT,
    "feat2": FeatureType.TEXT,
    ...
}

Available FeatureType options:

  • TEXT

Customize text preprocessing by passing specific arguments to TextFeature.

from kdp.features import TextFeature

features_specs = {
    "feat1": TextFeature(
        name="feat2",
        feature_type=FeatureType.TEXT,
        max_tokens=100,
        stop_words=["stop", "next"],
    ),
    "feat2": TextFeature(
        name="feat2",
        feature_type=FeatureType.TEXT,
    ),
    ...
}

Here's how the text feature preprocessing pipeline looks:

Text Feature Pipeline

❌ Cross Features

To implement cross features, specify a list of feature tuples in the PreprocessingModel like so:

from kdp.processor import PreprocessingModel

ppr = PreprocessingModel(
    path_data="data/data.csv",
    features_specs={
        "feat6": FeatureType.STRING_CATEGORICAL,
        "feat7": FeatureType.INTEGER_CATEGORICAL,
    },
    feature_crosses=[("feat6", "feat7", 5)],
)

Example cross feature between INTEGER_CATEGORICAL and STRING_CATEGORICAL:

Cross Features Pipeline

🚀 Custom Preprocessing Steps

If you require even more customization, you can define custom preprocessing steps using the Feature class, using preprocessors attribute.

The preprocessors attribute accepts a list of methods defined in PreprocessorLayerFactory.

from kdp.features import Feature
from kdp.layers_factory import PreprocessorLayerFactory

features_specs = {
    "feat1": FeatureType.FLOAT_NORMALIZED,
    "feat2": Feature(
        name="custom_feature_pipeline",
        feature_type=FeatureType.FLOAT_NORMALIZED,
        preprocessors=[
            PreprocessorLayerFactory.rescaling_layer,
            PreprocessorLayerFactory.normalization_layer,

        ],
        # leyers required kwargs
        scale=1,
    )
}

Here's how the text feature preprocessing pipeline looks:

Text Feature Pipeline

The full list of availble layers can be found: Preprocessing Layers Factory