Applying PolynomialFeatures() to a subset of features in your pipeline using ColumnTransformer

PolynomialFeatures, part of sklearn.preprocessing, lets us feed interactions between input features to our model and generate higher-order versions of those features. This helps us explore non-linear relationships, such as how income varies with age, and interactions between features, such as #bathrooms * #bedrooms when predicting real estate prices. However, the operation can dramatically increase the number of features. The sklearn documentation warns us of this:

Be aware that the number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.
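To put numbers on that warning, here is a small sketch. It assumes a full expansion (every term up to the given degree, not interaction-only), in which case the output has C(n + d, d) columns including the bias column. The helper name n_polynomial_features is made up for illustration and is not part of sklearn.

```python
from math import comb

def n_polynomial_features(n_features, degree, include_bias=True):
    """Count the columns a full polynomial expansion would produce."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

# Compare a few input widths and degrees
for n in (2, 7, 20):
    for d in (2, 3):
        print(f"{n} input features, degree {d}: "
              f"{n_polynomial_features(n, d)} output features")
```

Even at degree 2, going from 2 to 20 input features takes the output from 6 columns to 231.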

While PolynomialFeatures is a powerful addition to any feature engineering toolkit, it and some other sklearn transformers do not let us specify which columns to operate on. When the number of input features is high, this can significantly increase the size of our data. As data scientists, we must always beware the curse of dimensionality. Below we explore how to apply PolynomialFeatures to a selected subset of input features.
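Before the full walkthrough, here is a minimal sketch of the core idea using only sklearn and numpy. The data is a synthetic stand-in (not the tips dataset), and the names X_demo, poly_pipe and subset_transformer are placeholders chosen for this illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: 10 rows, 5 numeric columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 5))

# Scale, then expand, only the columns this pipeline is given
poly_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])

subset_transformer = ColumnTransformer(
    transformers=[('poly_subset', poly_pipe, [0, 1])],  # columns 0 and 1 only
    remainder='passthrough',  # columns 2-4 are passed through unchanged
)

X_out = subset_transformer.fit_transform(X_demo)
print(X_out.shape)  # 5 polynomial columns + 3 untouched columns
```

Without the ColumnTransformer, a degree-2 expansion of all five columns would produce 20 columns (21 with the bias term) instead of 8.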

#Import necessary packages
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (LabelBinarizer, StandardScaler,
                                   LabelEncoder, PolynomialFeatures, FunctionTransformer)
from sklearn_pandas import DataFrameMapper
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV

# Load the Tips dataset
data = sns.load_dataset("tips")

# Separate the target ('tip') from the input features
X = data.drop('tip', axis=1)
y = data['tip']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null category
time          244 non-null category
size          244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.3 KB

We are going to use a DataFrameMapper (from sklearn_pandas) to apply customized transformations to each of the categorical features in our dataset. For numeric features, we sequentially perform imputation, standard scaling, and then the polynomial feature transformation. In this example, the polynomial feature transformation is applied to only two columns, 'total_bill' and 'size'.

def unwrap(x):
    return np.ravel(x)

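mapper = DataFrameMapper([
    # The final encoder for 'sex' is missing from the original listing;
    # LabelEncoder() is assumed here because unwrap() flattens to 1-D.
    (['sex'], [SimpleImputer(strategy='constant', fill_value='missing')
               ,FunctionTransformer(unwrap, validate=False)
               ,LabelEncoder()]),
    ('smoker', LabelEncoder()),
    # strategy='most_frequent' fills with the mode; the original passed
    # 'most_frequent' as a literal constant fill value.
    (['day'], [SimpleImputer(strategy='most_frequent')
               ,FunctionTransformer(unwrap, validate=False)
               ,LabelBinarizer()]),
    ('time', LabelEncoder())
], df_out=True)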

numeric_features = ['total_bill','size']
transformer1 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures())])


categorical_features = ['sex','smoker','day','time']
transformer2 = Pipeline(steps=[

    ('mapper', mapper )])

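transformer = ColumnTransformer(
    transformers=[
        ('numeric_transformer', transformer1, numeric_features),
        ('categorical_transformer', transformer2, categorical_features)
    ])

# Fit on the training data and inspect the shape of the result
transformer.fit_transform(X_train).shape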
(195, 13)

As we can see, the number of features has expanded to 13.

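  • 4 from PolynomialFeatures() being applied to 'total_bill' and 'size': the constant bias column plus the three degree-2 terms (total_bill², total_bill × size, size²)
  • 4 from LabelBinarizer() being applied to 'day'
  • The remaining 5 represent 'sex', 'smoker', 'size', 'time', and 'total_bill'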
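The polynomial part of that count can be checked by listing the terms a degree-2 expansion of two columns generates. This is a plain-Python sketch that mirrors the default behaviour (degree 2, bias column included); polynomial_terms is a made-up helper, not an sklearn function.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree=2):
    """List every product of the given columns up to `degree`, bias first."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(names, d):
            # The empty combination (d == 0) is the constant bias column
            terms.append(' * '.join(combo) if combo else '1')
    return terms

terms = polynomial_terms(['total_bill', 'size'])
print(len(terms))
for term in terms:
    print(term)
```

Six terms come out: the bias, the two original columns, and three degree-2 products. Two of the six are the original columns, which is why only four are counted as new.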

It isn't necessary to separate columns into numeric and categorical. Below we apply the polynomial feature transformation to 'day', 'total_bill', 'time', and 'size'.

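mapper = DataFrameMapper([
    # The final encoder for 'sex' is missing from the original listing;
    # LabelEncoder() is assumed here because unwrap() flattens to 1-D.
    (['sex'], [SimpleImputer(strategy='constant', fill_value='missing')
               ,FunctionTransformer(unwrap, validate=False)
               ,LabelEncoder()]),
    ('smoker', LabelEncoder()),
], df_out=True)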

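mapper2 = DataFrameMapper([
    ('day', LabelBinarizer()),
    (['total_bill'], StandardScaler()),
    # 'size' is in features_set_one but missing from the original listing;
    # scaling it like 'total_bill' is an assumption.
    (['size'], StandardScaler()),
    ('time', LabelEncoder()),
], df_out=True)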

features_set_one = ['total_bill','size','time','day']
transformer1 = Pipeline(steps=[

    ('mapper2', mapper2),
    ('poly', PolynomialFeatures())])


features_set_two = ['sex','smoker']
transformer2 = Pipeline(steps=[

    ('mapper', mapper )])

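transformer2 = ColumnTransformer(
    transformers=[
        ('polynomial_transformer', transformer1, features_set_one),
        ('transformer2', transformer2, features_set_two)
    ])

# Fit on the training data and inspect the shape of the result
transformer2.fit_transform(X_train).shape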
(195, 38)

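The larger number of columns comes from the polynomial feature transformation being applied to more features than before: the seven columns derived from features_set_one (four 'day' indicators, 'total_bill', 'time', and 'size') expand to 36 polynomial terms, and 'sex' and 'smoker' add the remaining two. ColumnTransformer objects (like transformer2 in our case) can also be used to create pipelines, as can be seen below.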

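model = LassoCV(n_jobs=-1, max_iter=10000, cv=5)

# feature_pipe is not defined in the original listing; it is assumed
# here to be a pipeline wrapping the ColumnTransformer.
feature_pipe = make_pipeline(transformer2)

# Fit the preprocessing on the training data only, then reuse it on the test data
X_train = feature_pipe.fit_transform(X_train, y_train)
X_test = feature_pipe.transform(X_test)

model.fit(X_train, y_train)
model.score(X_train, y_train)

y_pred = model.predict(X_test)
print(f' R^2 (test): {r2_score(y_test,y_pred)}')
print(f' RMS: {mean_squared_error(y_test,y_pred)**0.5}')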
 R^2 (test): 0.34360014381555704
 RMS: 1.1732410711642152

In this post we have used ColumnTransformer, but similar operations can also be performed using FeatureUnion.
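As a rough illustration of the FeatureUnion route, the sketch below selects columns with FunctionTransformer and concatenates the two branches. It runs on synthetic numeric data rather than the tips dataset, and the helper functions and variable names are placeholders for this example only.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

def first_two_columns(X):
    """Select the columns that should be expanded."""
    return X[:, :2]

def remaining_columns(X):
    """Select the columns that should be left as they are."""
    return X[:, 2:]

# Synthetic stand-in data: 10 rows, 5 numeric columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 5))

union = FeatureUnion([
    # Branch 1: pick two columns, then expand them
    ('poly', make_pipeline(FunctionTransformer(first_two_columns),
                           PolynomialFeatures(degree=2, include_bias=False))),
    # Branch 2: pass the other columns through untouched
    ('rest', FunctionTransformer(remaining_columns)),
])

X_union = union.fit_transform(X_demo)
print(X_union.shape)  # 5 polynomial columns + 3 untouched columns
```

FeatureUnion runs every branch on the full input, so each branch has to select its own columns. ColumnTransformer does that selection for us, which is why it reads more directly for this use case.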