 

Encoding

Encoding, or continuization, is the transformation of categorical variables into binary or numerical counterparts. An example is coding gender as 1 for male and 0 for female. Many modeling methods (e.g., linear regression, SVM, neural networks) require categorical variables to be encoded first. The two main types of encoding are binary and target-based.
 

  

Binary Encoding

Binary encoding is the numerization of a categorical variable by creating variables that take the value 0 or 1 to indicate the absence or presence of each category. If the categorical variable has k categories, we create k binary variables (technically, k-1 would suffice, because the remaining category is implied when all the others are 0). In the following example, the categorical variable "Trend", which has three values, is transformed into three separate binary numerical variables. The main drawback of this method appears when the categorical variable has many values (e.g., city), which can greatly increase the dimensionality of the data.

Categorical: Trend. Encoded: Trend_Up, Trend_Down, Trend_Flat.

Trend Trend_Up Trend_Down Trend_Flat
Up 1 0 0
Up 1 0 0
Down 0 1 0
Flat 0 0 1
Down 0 1 0
Up 1 0 0
Down 0 1 0
Flat 0 0 1
Flat 0 0 1
Flat 0 0 1
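
The table above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binary_encode` and its `drop_first` parameter are illustrative assumptions, not part of the original text. Setting `drop_first=True` gives the k-1 variant mentioned earlier.

```python
def binary_encode(values, name, drop_first=False):
    """Return one 0/1 column per category, keyed as '<name>_<category>'."""
    # Keep categories in order of first appearance so the columns
    # come out as Up, Down, Flat, matching the table.
    categories = list(dict.fromkeys(values))
    if drop_first:
        # k-1 columns suffice: the dropped category is implied
        # when all remaining columns are 0.
        categories = categories[1:]
    return {
        f"{name}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


trend = ["Up", "Up", "Down", "Flat", "Down", "Up", "Down", "Flat", "Flat", "Flat"]
encoded = binary_encode(trend, "Trend")

# Print each original value followed by its binary columns.
print("Trend", *encoded.keys())
for row in zip(trend, *encoded.values()):
    print(*row)
```

With a high-cardinality variable such as city, the returned dictionary would hold one column per distinct value, which is the dimensionality problem described above.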
 

Target-based Encoding

Target-based encoding is the numerization of a categorical variable via the target. In this method, we replace the categorical variable with a single new numerical variable: each category is replaced by the corresponding probability of the target (if the target is categorical) or the average of the target (if the target is numerical). The main drawbacks of this method are its dependence on the distribution of the target and its lower predictive power compared to the binary encoding method.
 
Example 1:
An example of target-based encoding via a categorical target.
Trend Target Trend_Encoded
Up 1 0.66
Up 1 0.66
Down 0 0.33
Flat 0 0.5
Down 1 0.33
Up 0 0.66
Down 0 0.33
Flat 0 0.5
Flat 1 0.5
Flat 1 0.5
 

Trend Count (Target = 0) Count (Target = 1) Probability (Target = 1)
Up 1 2 0.66
Down 2 1 0.33
Flat 2 2 0.5
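
The probabilities in Example 1 can be computed by counting, per category, how often the target equals 1. The following is a minimal standard-library sketch; the function name `target_probability_encode` and the `positive` parameter are assumptions made for illustration. Note that the exact value for "Up" is 2/3, which the table shows as 0.66.

```python
from collections import Counter


def target_probability_encode(categories, targets, positive=1):
    """Replace each category with P(target == positive) within that category."""
    totals = Counter(categories)
    positives = Counter(c for c, t in zip(categories, targets) if t == positive)
    # Counter returns 0 for categories that never have a positive target.
    prob = {c: positives[c] / totals[c] for c in totals}
    return [prob[c] for c in categories], prob


trend = ["Up", "Up", "Down", "Flat", "Down", "Up", "Down", "Flat", "Flat", "Flat"]
target = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

trend_encoded, prob = target_probability_encode(trend, target)
for t, y, e in zip(trend, target, trend_encoded):
    print(t, y, round(e, 2))
```

Because the encoding is computed from the target itself, it changes whenever the target distribution changes, which is the dependence noted as a drawback above.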

 

 
Example 2:
An example of target-based encoding via a numerical target.
Trend Target Trend_Encoded
Up 21 23.7
Up 24 23.7
Down 8 10.3
Flat 15 14.5
Down 11 10.3
Up 26 23.7
Down 12 10.3
Flat 16 14.5
Flat 14 14.5
Flat 13 14.5

Trend Average of Target
Up 23.7
Down 10.3
Flat 14.5
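
For a numerical target, the same idea applies with the per-category mean in place of the probability. A minimal sketch, again using only the standard library; the function name `target_mean_encode` is a hypothetical helper introduced for illustration.

```python
from collections import defaultdict
from statistics import mean


def target_mean_encode(categories, targets):
    """Replace each category with the mean target value within that category."""
    groups = defaultdict(list)
    for c, t in zip(categories, targets):
        groups[c].append(t)
    averages = {c: mean(vals) for c, vals in groups.items()}
    return [averages[c] for c in categories], averages


trend = ["Up", "Up", "Down", "Flat", "Down", "Up", "Down", "Flat", "Flat", "Flat"]
target = [21, 24, 8, 15, 11, 26, 12, 16, 14, 13]

trend_encoded, averages = target_mean_encode(trend, target)
for t, y, e in zip(trend, target, trend_encoded):
    print(t, y, round(e, 1))
```

Rounded to one decimal place, the averages agree with the values listed in Example 2.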

 

 