A Better OrdinalEncoder for Scikit-learn

If you have ever used the encoder classes in Python's scikit-learn package, you probably know LabelEncoder, OrdinalEncoder, and OneHotEncoder. These encoders transform categorical data into numerical data. In this blog, I develop a new ordinal encoder that makes up for the shortcomings of the current OrdinalEncoder in sklearn and can also be used seamlessly in a sklearn pipeline.

In scikit-learn, before we can train a machine learning model we need to convert all string- or object-typed data to integer or float type; otherwise the model will refuse to run. This is very different from R, where string data is automatically converted to a factor, a built-in data type designed specifically for categorical data. Encoding is therefore an essential data-preprocessing step in sklearn. However, the three encoders above each have their own shortcomings.
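To see the constraint concretely, here is a minimal sketch (the toy data and the choice of LogisticRegression are mine, purely for illustration) of what happens when raw string features are passed straight to an estimator:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([['red'], ['blue'], ['red'], ['blue']])  # raw string feature
y = np.array([0, 1, 0, 1])

raised = False
try:
    LogisticRegression().fit(X, y)  # sklearn tries to cast X to float and fails
except ValueError:
    raised = True
print('fit on raw strings raised ValueError:', raised)
```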

LabelEncoder can convert a list of objects (strings, integers, floats) to a list of integers. But it can only process one list or array-like at a time, since it was designed for the target variable. When we have a data frame with many features, it is much better to use an encoder that can handle multiple categorical features at once. That better encoder is OrdinalEncoder. Let’s see the difference between LabelEncoder and OrdinalEncoder.

# LabelEncoder can only convert one list at a time.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['one', 'two', 'three'])
le.transform(['one', 'two', 'three'])  # array([0, 2, 1]) -- classes are sorted alphabetically

# OrdinalEncoder is more powerful: it can encode multiple features at once.
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit([[1.0, 'one'], [2.0, 'two'], [3.0, 'three']])  # a pd.DataFrame or np.array works too
oe.transform([[2.0, 'one']])  # array([[1., 0.]])

However, OrdinalEncoder is not perfect. If a value that never appeared during the fit step shows up during the transform step, OrdinalEncoder raises an error. This becomes a problem whenever the testing set contains a value that is unknown to the training set. To avoid unknown values in the testing set, we would have to fit the OrdinalEncoder on the entire dataset, which means fitting it before splitting the data into training and testing sets. Even if we patch over unknown values this way, the workaround still breaks as soon as new samples containing unknown values are fed into the model. A more powerful encoder, OneHotEncoder, solves this issue perfectly with its ‘handle_unknown’ argument: set it to ‘ignore’ and no error is raised. OneHotEncoder was born for creating dummy features; it works much like pd.get_dummies, but it is better than pd.get_dummies precisely because of this handle_unknown capability.
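As a quick sketch of the two behaviours described above (the toy values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

train = np.array([['one'], ['two'], ['three']])
test = np.array([['four']])  # 'four' was never seen during fit

# OrdinalEncoder raises on the unknown value...
oe = OrdinalEncoder().fit(train)
oe_raised = False
try:
    oe.transform(test)
except ValueError:
    oe_raised = True
print('OrdinalEncoder raised on unknown value:', oe_raised)

# ...while OneHotEncoder with handle_unknown='ignore' just emits an all-zero row.
ohe = OneHotEncoder(handle_unknown='ignore').fit(train)
row = ohe.transform(test).toarray()
print(row)
```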

Before version 0.20, OneHotEncoder could only be applied to numeric features, which meant you had to use LabelEncoder or OrdinalEncoder to convert string or object features to numeric features before using OneHotEncoder. At that time, there was no way to construct a sklearn pipeline involving OrdinalEncoder and OneHotEncoder, since OrdinalEncoder could not handle unknown values in the testing set. My new OrdinalEncoder class was developed back then precisely so that it could be used together with OneHotEncoder in a pipeline. It seems the sklearn developers noticed this limitation: in version 0.20 they enhanced OneHotEncoder so that it can handle string features directly, which means that for converting string features to dummy features, OneHotEncoder alone is now enough. However, for converting string features to numeric or ordinal numeric features, we still need OrdinalEncoder. Notice that for tree-based models, dummy-variable conversion is neither necessary nor recommended, so my new OrdinalEncoder is useful in exactly that case. You only need to copy the following code to create the new OrdinalEncoder class.
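A minimal sketch of the post-0.20 behaviour, with made-up toy data: OneHotEncoder now accepts string features directly, with no LabelEncoder/OrdinalEncoder pre-pass.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['red', 'S'], ['blue', 'M'], ['red', 'L']])
ohe = OneHotEncoder(dtype=int, handle_unknown='ignore')
dummies = ohe.fit_transform(X).toarray()
print(ohe.categories_)  # per-column categories, sorted alphabetically
print(dummies)          # one row per sample, one column per category
```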

# Do not use "from sklearn.preprocessing import _BaseEncoder": it is a protected class!
import numpy as np
from sklearn.preprocessing._encoders import _BaseEncoder

class new_OrdinalEncoder(_BaseEncoder):
    def __init__(self, cat_index='all'):
        self.dicts = {}
        # cat_index is the list of categorical feature indices
        self.cat_index = cat_index

    def fit(self, df, *y):
        if self.cat_index == 'all':
            self.cat_index = list(range(df.shape[1]))
        for feat in self.cat_index:
            dic = np.unique(df.iloc[:, feat])
            dic = dict([(i, index) for index, i in enumerate(dic)])
            self.dicts[feat] = dic
        return self  # returning self keeps the estimator usable in a pipeline

    def fit_transform(self, df, *y):
        if self.cat_index == 'all':
            self.cat_index = list(range(df.shape[1]))
        df_output = df.copy()
        for feat in self.cat_index:
            dic = np.unique(df.iloc[:, feat])
            dic = dict([(i, index) for index, i in enumerate(dic)])
            self.dicts[feat] = dic
            df_output.iloc[:, feat] = df.iloc[:, feat].apply(lambda x: dic[x])
        return df_output

    def transform(self, df):
        df_output = df.copy()
        for feat in self.cat_index:
            dic = self.dicts[feat]
            df_output.iloc[:, feat] = df.iloc[:, feat].apply(self.unknown_value, args=(dic,))
        return df_output

    def unknown_value(self, value, dic):  # unknown values get a new integer: len(dic)
        try:
            return dic[value]
        except KeyError:
            return len(dic)
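Here is a quick demo of that fallback behaviour. To keep the snippet runnable on its own, I re-declare a condensed fit/transform-only copy of the class (same logic as above, under a different name) and invent two tiny toy frames; in your own code, just use the full class defined above.

```python
import numpy as np
import pandas as pd

# Condensed fit/transform-only copy of new_OrdinalEncoder, included here
# only so this demo is self-contained.
class DemoOrdinalEncoder:
    def __init__(self, cat_index='all'):
        self.dicts = {}
        self.cat_index = cat_index

    def fit(self, df, *y):
        if self.cat_index == 'all':
            self.cat_index = list(range(df.shape[1]))
        for feat in self.cat_index:
            cats = np.unique(df.iloc[:, feat])
            self.dicts[feat] = {c: i for i, c in enumerate(cats)}
        return self

    def transform(self, df):
        out = df.copy()
        for feat in self.cat_index:
            dic = self.dicts[feat]
            # unknown values fall back to len(dic), one past the known codes
            out.iloc[:, feat] = df.iloc[:, feat].apply(lambda x: dic.get(x, len(dic)))
        return out

train = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': ['S', 'M', 'L']})
test = pd.DataFrame({'color': ['green', 'blue'], 'size': ['S', 'XL']})

enc = DemoOrdinalEncoder().fit(train)
print(enc.transform(train))
print(enc.transform(test))  # 'green' -> 2 and 'XL' -> 3 instead of an error
```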

Let’s take a real-world dataset as an example to show how to use the new OrdinalEncoder. The dataset I use here is a historical transaction dataset from the KDD 99 Discovery Challenge; you can download it from the link below. There are six categorical features and two continuous features in the dataset. We want to convert two features (date, k_symbol) to ordinal numeric features, convert the rest of the categorical features to dummy features, and leave the continuous features as they are.

# Load data
import pandas as pd
import numpy as np
#Save the data and notebook in the same folder! 
data=pd.read_csv('transaction.csv')
 
# identify categorical, ordinal, and continuous features
cat=['account_id','type','operation','bank']
ordi=['date','k_symbol']
conti=['amount','balance']

# split data into training and testing set
from sklearn.model_selection import train_test_split
train, test = train_test_split(data,test_size=0.3) 

# With new_OrdinalEncoder, the transform step will not raise an error anymore!
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

column_trans = ColumnTransformer(
    [('ohe', OneHotEncoder(dtype='int', handle_unknown='ignore'), cat),
     ('ord', new_OrdinalEncoder(), ordi)],
    remainder='passthrough')

column_trans.fit_transform(train)

column_trans.transform(test)

Notice that in the column transformer above, if you use sklearn’s OrdinalEncoder instead of my new OrdinalEncoder, the transform step will raise a ValueError on the testing set.

The new OrdinalEncoder class above is just demo code that explains the core idea. For a more detailed version of this class that is fully compatible with sklearn, please feel free to reach out to me. Do you want to enhance your data science toolkit? Please subscribe to my blog!

Published by frank xu

I am a data science practitioner. I love math, artificial intelligence, and big data. I look forward to sharing experiences with all data science enthusiasts.
