2. Categorical and Ordinal Features

2.1 Label Encoding: maps categories to numbers

  • Alphabetical (sorted), e.g. sklearn.preprocessing.LabelEncoder
  • Order of appearance, e.g. pandas.factorize (see the sketch below)
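
A minimal sketch of both orderings, assuming a toy 'Embarked'-style Series: sklearn's LabelEncoder produces the alphabetical (sorted) mapping, while pandas.factorize numbers categories in order of appearance.

    >>> import pandas as pd
    >>> from sklearn.preprocessing import LabelEncoder
    >>> s = pd.Series(['S', 'C', 'Q', 'S'])
    >>> LabelEncoder().fit_transform(s)  # alphabetical: C=0, Q=1, S=2
    array([2, 0, 1, 2])
    >>> pd.factorize(s)[0]               # order of appearance: S=0, C=1, Q=2
    array([0, 1, 2, 0])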

2.2 Frequency Encoding: maps categories to their frequencies

[S,C,Q] -> [0.5, 0.3, 0.2]

  • No duplicated frequencies: every category has a distinct frequency, so the mapping below keeps categories distinguishable.

    # Frequency of each 'Embarked' category, used directly as the encoding.
    encoding = titanic.groupby('Embarked').size()
    encoding = encoding / len(titanic)
    titanic['enc'] = titanic.Embarked.map(encoding)
    
  • Has duplicated frequencies: several categories share the same frequency and become indistinguishable after frequency encoding. A rank transformation (scipy.stats.rankdata) applied on top of the frequency encoding keeps them apart; see the sketch below.
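
A minimal sketch, assuming the same titanic DataFrame as above; rankdata with method='ordinal' assigns distinct integer ranks even to categories whose frequencies are exactly equal.

    from scipy.stats import rankdata
    import pandas as pd

    # Per-category frequencies, as in the previous snippet; ties are possible here.
    freq = titanic.groupby('Embarked').size() / len(titanic)

    # Rank the frequencies; method='ordinal' breaks ties by order of appearance,
    # so tied categories still receive distinct codes.
    ranks = pd.Series(rankdata(freq, method='ordinal'), index=freq.index)
    titanic['enc_rank'] = titanic.Embarked.map(ranks)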
    

2.3 One-hot Encoding

>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
  • sklearn.preprocessing.OneHotEncoder
    >>> from sklearn.preprocessing import OneHotEncoder
    >>> enc = OneHotEncoder()
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    OneHotEncoder()
    >>> enc.categories_
    [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
    >>> enc.transform([[0, 1, 1]]).toarray()
    array([[1., 0., 0., 1., 0., 0., 1., 0., 0.]])
    

2.4 Summary

  • Label and Frequency encodings are often used for tree-based models, which can split on the encoded values directly.
  • One-hot encoding is often used for non-tree-based models (linear models, KNN, neural networks), which would otherwise treat label-encoded categories as ordered magnitudes.
  • Interactions of categorical features can help linear models and KNN; see the sketch below.
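
A minimal sketch of such an interaction, assuming hypothetical Titanic-style columns 'Pclass' and 'Sex': the two categories are concatenated into one combined feature, which is then one-hot encoded.

    import pandas as pd

    df = pd.DataFrame({'Pclass': [1, 3, 3, 2],
                       'Sex': ['male', 'female', 'male', 'female']})

    # Concatenate the two categorical columns into a single interaction feature ...
    df['Pclass_Sex'] = df['Pclass'].astype(str) + '_' + df['Sex']

    # ... and one-hot encode it so linear models / KNN can pick up the joint effect.
    interactions = pd.get_dummies(df['Pclass_Sex'])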
