Feature Encoding with Python


In machine learning, feature engineering is an important step. It is very common to see categorical features in a dataset, but most machine learning algorithms can only read numerical values, so it is essential to encode categorical features into numerical ones.

Here we will cover different ways of encoding categorical features:

  1.  OneHot encoding
  2.  Label encoding
  3.  Frequency OneHot encoding
  4.  Target mean encoding
  5.  Binary encoding
  6.  Frequency/Count encoding
  7.  Feature hashing

OneHot encoding

One-hot encoding produces one column per class, and the presence of a class is represented in binary format: each class is separated out into its own feature. The algorithm only cares about presence/absence, without making any assumptions about relationships between the classes.
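As a minimal sketch, pandas' `get_dummies` does exactly this (the `color` column and its values are a made-up toy example):

```python
import pandas as pd

# Toy dataset with one categorical feature (hypothetical values)
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One column per class; each row marks presence/absence of that class
encoded = pd.get_dummies(df, columns=["color"], prefix="color")
print(encoded)
```

Each original value is now spread across `color_blue`, `color_green`, and `color_red`, with exactly one "present" marker per row.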

Label encoding

Label encoding gives numerical aliases to the classes, so the resulting feature contains 0, 1, 2, and so on, up to the number of unique values in the column. The problem with this approach is that there may be no relation between the classes, yet the algorithm might consider them ordered (that is, assume 0 < 1 < 2), treating them as having some sort of hierarchy/order.
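A minimal sketch using pandas category codes (the `size` column is a made-up example; codes happen to be assigned alphabetically here):

```python
import pandas as pd

# Hypothetical feature with three unordered classes
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Assign an integer alias to each class: large=0, medium=1, small=2
df["size_encoded"] = df["size"].astype("category").cat.codes
print(df)
```

Note that an algorithm may now read "small" (2) as greater than "large" (0) purely because of the arbitrary code order, which is exactly the pitfall described above.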

Frequency OneHot encoding

Think of it this way: when a feature has too many categories, we don't make a column for every value, only for the top values with the highest frequency. Put another way, it is a way to use the frequency of the categories to decide which ones get their own one-hot column.
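A minimal sketch of this idea (the `city` column is made up, and keeping the top 2 categories is an arbitrary choice for illustration):

```python
import pandas as pd

# Hypothetical high-cardinality feature
df = pd.DataFrame({"city": ["NY", "NY", "LA", "NY", "LA", "SF", "Boston"]})

# One-hot encode only the 2 most frequent categories; the rest get no column
top = df["city"].value_counts().nlargest(2).index
for cat in top:
    df[f"city_{cat}"] = (df["city"] == cat).astype(int)
print(df)
```

Rare values like "SF" and "Boston" end up with all-zero indicators instead of columns of their own, keeping the feature count small.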

Target mean encoding

In target mean encoding, the label for each category in the feature is the mean value of the target variable for that category.
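A minimal sketch (the `city` feature and `price` target are made-up toy data):

```python
import pandas as pd

# Hypothetical feature and target variable
df = pd.DataFrame({
    "city":  ["NY", "LA", "NY", "LA", "SF"],
    "price": [10, 30, 30, 50, 50],   # target
})

# Replace each category with the mean target value of its group
means = df.groupby("city")["price"].mean()
df["city_encoded"] = df["city"].map(means)
print(df)
```

In practice the means are usually computed on training folds only, because computing them over all the data leaks target information into the feature.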

Binary encoding

Binary encoding first labels each value with an integer, then takes the binary representation of that integer and spreads its digits across columns. Put another way, binary encoding converts a category into binary digits.
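A minimal hand-rolled sketch of the two steps (the `animal` column is made up; libraries such as category_encoders provide this ready-made):

```python
import pandas as pd

# Hypothetical feature with four classes
df = pd.DataFrame({"animal": ["cat", "dog", "bird", "fish", "cat"]})

# Step 1: label-encode (codes assigned alphabetically: bird=0, cat=1, dog=2, fish=3)
codes = df["animal"].astype("category").cat.codes.to_numpy()

# Step 2: one column per binary digit of the integer code
n_bits = max(1, int(codes.max()).bit_length())
for bit in range(n_bits):
    df[f"animal_bit{bit}"] = (codes >> bit) & 1
print(df)
```

Four classes fit in just two bit-columns, whereas one-hot encoding would need four.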

Frequency/Count Encoding

Here we simply replace each string with its frequency count in the column.
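A minimal sketch with pandas (the `city` column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "LA", "NY", "SF"]})

# Replace each category with how often it appears in the column
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)
print(df)
```

A caveat of this scheme is that two different categories with the same frequency collapse into the same encoded value, as "LA" and "SF" do here.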

Feature hashing

Hashing converts categorical variables into a space of integer-indexed dimensions, where the distance between two vectors of categorical variables is approximately preserved by the transformation into numerical space. With hashing, the number of dimensions can be much smaller than with an encoding like one-hot encoding.
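A minimal sketch using scikit-learn's `FeatureHasher` (the city values are made up, and `n_features=8` is an arbitrarily small choice for illustration; smaller values make hash collisions between categories more likely):

```python
from sklearn.feature_extraction import FeatureHasher

# Hash each category string into one of a fixed number of columns
hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["NY"], ["LA"], ["NY"], ["SF"]]).toarray()
print(X.shape)
```

The output width is fixed at `n_features` no matter how many distinct categories appear, and identical inputs always hash to the same column.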

We see that the hash encoder created 1048576 additional columns, which again makes it harder to trace back which features contribute most to the target prediction. This makes it mostly useful for competitions and less suitable for real-world applications.

These were some methods to encode categorical values into numerical
values. I hope you enjoyed it!

Thank you for reading!