1. Numeric Features
1.1 Scaling
First of all:
- Tree-based models do not depend on feature scaling.
- Non-tree-based models depend heavily on feature scaling.
The two most common options are (a code sketch follows this list):
- Normalization, to [0, 1]
  - sklearn.preprocessing.MinMaxScaler
- Standardization, to mean=0, std=1
  - sklearn.preprocessing.StandardScaler
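A minimal sketch of both scalers, assuming scikit-learn is available; the data column is made up for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])    # made-up single numeric column
X_minmax = MinMaxScaler().fit_transform(X)      # rescaled to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # mean 0, unit variance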
1.2 Remove Outliers
- np.percentile
  - Compute the q-th percentile of the data along the specified axis.
- np.clip
  - Given an interval, values outside the interval are clipped to the interval edges.
    For example, if an interval of [0, 1] is specified, values smaller than 0 become 0, and values larger than 1 become 1.
LOWERBOUND, UPPERBOUND = np.percentile(x, [1, 99])
y = np.clip(x, LOWERBOUND, UPPERBOUND)
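A self-contained sketch of this clipping (winsorization) with made-up data; the helper name is hypothetical:

import numpy as np

def clip_to_percentiles(x, lower_pct=1, upper_pct=99):
    # Hypothetical helper: clip a numeric array to its 1st/99th percentiles.
    lower, upper = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lower, upper)

x = np.concatenate([np.random.normal(size=1000), [50.0, -50.0]])  # two artificial outliers
y = clip_to_percentiles(x)  # extreme values are pulled back to the percentile bounds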
1.3 Rank - replaces values with their ranks, so the spaces between sorted values become equal.
scipy.stats.rankdata
>>> from scipy.stats import rankdata
>>> rankdata([0, 2, 3, 2])
array([ 1. , 2.5, 4. , 2.5])
>>> rankdata([0, 2, 3, 2], method='min')
array([ 1., 2., 4., 2.])
>>> rankdata([0, 2, 3, 2], method='max')
array([ 1., 3., 4., 3.])
>>> rankdata([0, 2, 3, 2], method='dense')
array([ 1., 2., 3., 2.])
>>> rankdata([0, 2, 3, 2], method='ordinal')
array([ 1., 2., 4., 3.])
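Used as a feature transform, ranking removes the influence of extreme values; a minimal sketch with made-up data:

import numpy as np
from scipy.stats import rankdata

income = np.array([30_000, 32_000, 31_000, 1_000_000])  # made-up column with one outlier
income_rank = rankdata(income)  # array([1., 3., 2., 4.]) - the outlier no longer dominates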
1.4 Other Useful Transformations
These transformations often help non-tree-based models, especially neural networks.
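Commonly used examples of such transformations are the log transform and raising values to a power less than one; a minimal sketch with made-up data (the 2/3 offset is just one common heuristic, not a requirement):

import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])  # made-up, heavily skewed feature
x_log = np.log1p(x)        # log transform: log(1 + x)
x_sqrt = np.sqrt(x + 2/3)  # power < 1 compresses large values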