2. Categorical and Ordinal Features

2.1 Label Encoding: maps categories to numbers

  • Alphabetical (sorted), e.g. sklearn.preprocessing.LabelEncoder
  • Order of appearance, e.g. pandas.factorize (see the sketch below)
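
A minimal sketch of both orderings, assuming a toy 'Embarked'-style Series: sklearn's LabelEncoder produces the alphabetical (sorted) mapping, while pandas.factorize numbers categories in order of appearance.

    >>> import pandas as pd
    >>> from sklearn.preprocessing import LabelEncoder
    >>> s = pd.Series(['S', 'C', 'Q', 'S'])
    >>> LabelEncoder().fit_transform(s)  # alphabetical: C=0, Q=1, S=2
    array([2, 0, 1, 2])
    >>> pd.factorize(s)[0]               # order of appearance: S=0, C=1, Q=2
    array([0, 1, 2, 0])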

2.2 Frequency Encoding: maps categories to their frequencies

[S,C,Q] -> [0.5, 0.3, 0.2]

  • No duplicated frequencies: every category has a distinct frequency, so the mapping below keeps categories distinguishable.

    # Frequency of each 'Embarked' category, used directly as the encoding.
    encoding = titanic.groupby('Embarked').size()
    encoding = encoding / len(titanic)
    titanic['enc'] = titanic.Embarked.map(encoding)
    
  • Has duplicated frequencies: several categories share the same frequency and become indistinguishable after frequency encoding. A rank transformation (scipy.stats.rankdata) applied on top of the frequency encoding keeps them apart; see the sketch below.
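
A minimal sketch, assuming the same titanic DataFrame as above; rankdata with method='ordinal' assigns distinct integer ranks even to categories whose frequencies are exactly equal.

    from scipy.stats import rankdata
    import pandas as pd

    # Per-category frequencies, as in the previous snippet; ties are possible here.
    freq = titanic.groupby('Embarked').size() / len(titanic)

    # Rank the frequencies; method='ordinal' breaks ties by order of appearance,
    # so tied categories still receive distinct codes.
    ranks = pd.Series(rankdata(freq, method='ordinal'), index=freq.index)
    titanic['enc_rank'] = titanic.Embarked.map(ranks)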
    

2.3 One-hot Encoding

>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
  • sklearn.preprocessing.OneHotEncoder
    >>> from sklearn.preprocessing import OneHotEncoder
    >>> enc = OneHotEncoder()
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    OneHotEncoder()
    >>> enc.categories_
    [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
    >>> enc.transform([[0, 1, 1]]).toarray()
    array([[1., 0., 0., 1., 0., 0., 1., 0., 0.]])
    

2.4 Summary

  • Label and Frequency encodings are often used for tree-based models, which can split on the encoded values directly.
  • One-hot encoding is often used for non-tree-based models (linear models, KNN, neural networks), which would otherwise treat label-encoded categories as ordered magnitudes.
  • Interactions of categorical features can help linear models and KNN; see the sketch below.
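
A minimal sketch of such an interaction, assuming hypothetical Titanic-style columns 'Pclass' and 'Sex': the two categories are concatenated into one combined feature, which is then one-hot encoded.

    import pandas as pd

    df = pd.DataFrame({'Pclass': [1, 3, 3, 2],
                       'Sex': ['male', 'female', 'male', 'female']})

    # Concatenate the two categorical columns into a single interaction feature ...
    df['Pclass_Sex'] = df['Pclass'].astype(str) + '_' + df['Sex']

    # ... and one-hot encode it so linear models / KNN can pick up the joint effect.
    interactions = pd.get_dummies(df['Pclass_Sex'])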
