1. Numeric Features

1.1 Scaling

First of all:

  • Tree-based models don't depend on feature scaling.
  • Non-tree-based models (linear models, kNN, neural networks) depend on scaling heavily.

Two common options:

  • Normalization: rescale to [0, 1]

  • Standardization: rescale to mean = 0, std = 1
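A minimal sketch of both scalings on toy data, using plain NumPy (scikit-learn's MinMaxScaler and StandardScaler do the same thing with fit/transform semantics):

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 25.0])

# Normalization (min-max): rescale to [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: rescale to mean = 0, std = 1
x_std = (x - x.mean()) / x.std()
```

Note that for non-tree-based models the same scaler fitted on the training data must be applied to the test data, not refitted.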

1.2 Remove Outliers

  • np.percentile
    • Compute the qth percentile of the data along the specified axis.
  • np.clip
    • Given an interval, values outside the interval are clipped to the interval edges. For example, if an interval of [0, 1] is specified, values smaller than 0 become 0, and values larger than 1 become 1.
      LOWERBOUND, UPPERBOUND = np.percentile(x, [1, 99])
      y = np.clip(x, LOWERBOUND, UPPERBOUND)
      

1.3 Rank - makes the spacing between sorted values equal.

scipy.stats.rankdata

>>> from scipy.stats import rankdata
>>> rankdata([0, 2, 3, 2])
array([ 1. ,  2.5,  4. ,  2.5])
>>> rankdata([0, 2, 3, 2], method='min')
array([ 1.,  2.,  4.,  2.])
>>> rankdata([0, 2, 3, 2], method='max')
array([ 1.,  3.,  4.,  3.])
>>> rankdata([0, 2, 3, 2], method='dense')
array([ 1.,  2.,  3.,  2.])
>>> rankdata([0, 2, 3, 2], method='ordinal')
array([ 1.,  2.,  4.,  3.])

1.4 Other useful transformations.

Often helps non-tree-based methods, especially neural networks.

  • np.log(1 + x)
  • np.sqrt(x + 2/3)
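Both transformations can be written in one line each; `np.log1p` is the numerically safe form of `np.log(1 + x)`:

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 100.0])

# log(1 + x): compresses large values, well-defined at x = 0
x_log = np.log1p(x)

# sqrt(x + 2/3): another compressing transform
x_sqrt = np.sqrt(x + 2 / 3)
```

These transforms pull large values closer together and spread out values near zero, which often helps neural networks.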

1.5 Feature generation is powered by 1) prior knowledge and 2) exploratory data analysis.
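A toy illustration of the prior-knowledge case, with hypothetical housing columns (`price`, `area` are made up for this example): domain knowledge suggests that price per square meter is more informative than either raw column.

```python
import numpy as np

# Hypothetical housing data: total price and floor area
price = np.array([300000.0, 450000.0, 120000.0])
area = np.array([100.0, 150.0, 40.0])

# Generated feature from prior knowledge: price per square meter
price_per_sqm = price / area
```

The EDA-driven case works the same way: a pattern spotted in plots (e.g. a fractional part of a price) is turned into a new column.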
