Decision Tree

Basics

  • Non-parametric, supervised learning algorithms
  • Given the training data, a decision tree algorithm divides the feature space into regions. For inference, we first find which region the test point falls in, then predict the mean label value of that region (regression) or its majority label (classification).
  • Construction: top-down; at each step, choose a variable and split point such that the target values within each resulting region are as homogeneous as possible. Two common split criteria are Gini impurity and information gain; in practice the choice rarely matters much (a minimal scikit-learn sketch is given below).
  • Advantage: simple to understand & interpret; mirrors human decision making
  • Disadvantage:

    • can overfit easily (and generalize poorly) if we don't limit the depth of the tree
    • can be non-robust: a small change in the training data can lead to a totally different tree
    • instability: sensitive to rotation of the training set, because the decision boundaries are axis-aligned (orthogonal)

[1] https://github.com/ShuaiW/ml-interview#Decision tree
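As a quick illustration of the points above, here is a minimal scikit-learn sketch (the iris dataset and max_depth=3 are arbitrary illustrative choices, not part of the notes above):

```python
# Minimal sketch: fit a depth-limited decision tree and use it for inference.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limiting max_depth is the simplest guard against overfitting.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Inference: each test point is routed to a leaf (region) and receives that
# region's majority label.
print("test accuracy:", clf.score(X_test, y_test))
```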

Least Squares Regression Tree

Input: training data $D$
Output: regression tree $f(x)$
Splitting criterion: MSE

  1. Find the optimal splitting variable (feature) $j$ and splitting point $s$ (a NumPy sketch of this search follows the list) according to: $$\min_{j,s}\left[\min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2+\min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2\right] \quad (1)$$

We iterate over every variable $j$ and, for each $j$, over every candidate splitting point $s$, and pick the pair $(j,s)$ that minimizes formula (1).

  2. Split the training data $D$ with $(j,s)$ into two regions and compute the output value of each region: $$R_1(j,s)=\{x \mid x^{(j)}\le s\}, \quad R_2(j,s)=\{x \mid x^{(j)}> s\}$$

     $$\hat{c}_m=\frac{1}{N_m}\sum_{x_i\in R_m(j,s)}y_i, \quad x\in R_m,\ m=1,2$$

  3. Repeat steps 1 and 2 on regions $R_1$ and $R_2$ until a stopping criterion is satisfied.

  4. With the input space divided into $M$ regions $R_1, R_2, \dots, R_M$, output the regression tree: $$f(x)=\sum_{m=1}^M \hat{c}_m\, I(x\in R_m)$$
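A minimal NumPy sketch of the split search in step 1 (the function name best_split and the exhaustive search over unique feature values are my own illustrative choices):

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the (feature j, threshold s) pair minimizing
    the total squared error of formula (1)."""
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue  # degenerate split, skip
            # The region means are the optimal c1, c2 for the inner minimizations.
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s

# Tiny usage example: y depends mainly on feature 1, so j=1 is expected.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X[:, 1] > 0).astype(float) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))
```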

Classification Tree

Gini index (only used by CART)

Suppose we have $K$ classes in a classification problem and $p_k$ denotes the probability that a sample belongs to class $k$. Then the Gini index is defined as: $$\text{Gini}(p)=\sum_{k=1}^K p_k(1-p_k)=1-\sum_{k=1}^K p_k^2$$

For a given sample set $D$, the Gini index is:

$$\text{Gini}(D) = 1- \sum_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^2$$ where $K$ is the total number of classes and $|C_k|$ is the number of samples belonging to class $k$.

If we divide $D$ into $D_1$ and $D_2$ based on some value $a$ of feature $A$, then the weighted average Gini index is:

$$\text{Gini}(D,A)=\frac{|D_1|}{|D|}\text{Gini}(D_1)+\frac{|D_2|}{|D|}\text{Gini}(D_2)$$

Similar to entropy, the higher the Gini index, the more uncertain (impure) the set is. Therefore, we want to minimize the Gini index.

[1] http://www.ke.tu-darmstadt.de/lehre/archiv/ws0809/mldm/dt.pdf

[2] Notes on 《统计学习方法》 (Statistical Learning Methods), Part 7: Decision Trees (II)

[3] https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
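A short sketch of the two Gini formulas above (the helper names gini and gini_split are mine):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Weighted Gini(D, A) of splitting D into D1 and D2."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

print(gini(["a", "a", "b", "b"]))          # 0.5: maximally impure for 2 classes
print(gini_split(["a", "a"], ["b", "b"]))  # 0.0: a perfect split
```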

Entropy/Information Gain (used by ID3 and C4.5)

Given a set of samples $S$, entropy is defined as: $$E(S)= -\sum_{i=1}^{K} p_i \log(p_i)$$ where $p_i$ is the proportion of samples in $S$ that belong to class $i$. Entropy measures the amount of disorder in the class distribution of $S$. When we split $S$ into subsets $S_i$ by feature $A$, the weighted average entropy over all subsets resulting from the split is: $$I(S,A) = \sum_{i}\frac{|S_i|}{|S|}E(S_i)$$

Information gain is then defined as Entropy(parent) minus the weighted sum of Entropy(children):

$$IG(S, A) = E(S) - I(S,A) = E(S) - \sum_{i}\frac{|S_i|}{|S|}E(S_i)$$

Maximizing the information gain is equivalent to minimizing the weighted average entropy $I(S,A)$, because $E(S)$ is constant across attributes.
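A short sketch of entropy and information gain as defined above (helper names are mine; base-2 log is used, which only changes the units):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, children_labels):
    """IG(S, A) = E(S) - sum_i |S_i|/|S| * E(S_i)."""
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in children_labels)
    return entropy(parent_labels) - weighted

parent = ["a", "a", "b", "b"]
print(information_gain(parent, [["a", "a"], ["b", "b"]]))  # 1.0 bit: a perfect split
```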

Information Gain Ratio

The main issue with information gain is that it is biased towards attributes with a large number of distinct values, for example a customer's credit card number, which splits the data into one branch per customer. To correct for this, we introduce the intrinsic information of a split: how much information we need to tell which branch an instance belongs to. It is defined as follows:

$$IntI(S,A) = -\sum_{i}\frac{|S_i|}{|S|} \log\frac{|S_i|}{|S|}$$ We can see that attributes with higher intrinsic information are less useful. To reduce information gain's bias towards multi-valued attributes, we take the number and size of branches into account when choosing an attribute, i.e., we correct the information gain by the intrinsic information of the split. The information gain ratio is defined as:

$$GR(S,A) = \frac{IG(S,A)}{IntI(S,A)}$$

The higher the gain ratio, the better.

[1] http://www.ke.tu-darmstadt.de/lehre/archiv/ws0809/mldm/dt.pdf
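A short sketch of the gain ratio with a small entropy helper; the last line illustrates how a many-valued split (one branch per sample) is penalized relative to plain information gain:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, children_labels):
    """GR(S, A) = IG(S, A) / IntI(S, A)."""
    n = len(parent_labels)
    weighted_child_entropy = sum(len(c) / n * entropy(c) for c in children_labels)
    ig = entropy(parent_labels) - weighted_child_entropy
    # Intrinsic information: entropy of the branch-size distribution itself.
    int_info = -sum((len(c) / n) * math.log2(len(c) / n) for c in children_labels)
    return ig / int_info if int_info > 0 else 0.0

parent = list("aabb")
print(gain_ratio(parent, [list("aa"), list("bb")]))  # 1.0: clean two-way split
print(gain_ratio(parent, [[x] for x in parent]))     # 0.5: many-valued split is penalized
```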

Compare Gini index, information gain and gain ratio

  • Information gain: biased towards multivalued attributes.
  • Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the other.
  • Gini index:
    • biased towards multivalued attributes
    • favors equal-sized partitions

[1] http://www.inf.unibz.it/dis/teaching/DWDM/slides2011/lesson5-Classification-2.pdf

CART

  1. Given training data $D$, calculate the Gini index (or IG) for all features.
    • For each feature $A$ and each possible value $a$, calculate the Gini index of dividing $D$ into $D_1$ and $D_2$ by whether $A=a$ (or $A \le a$ for continuous features).
  2. Select the pair of feature and split point $(A,a)$ with the minimum Gini index as the best feature and split.

  3. Divide $D$ into $D_1$ and $D_2$ accordingly.

  4. Repeat steps 1–3 on $D_1$ and $D_2$ until a stopping criterion is satisfied.

  5. Return the CART tree.
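A compact sketch of this recursive procedure on NumPy arrays, using a hypothetical nested-dict node format (production implementations such as scikit-learn's CART are far more optimized):

```python
import numpy as np
from collections import Counter

def gini(y):
    """Gini(D) = 1 - sum_k (|C_k|/|D|)^2."""
    n = len(y)
    return 1.0 - sum((c / n) ** 2 for c in Counter(y).values())

def best_gini_split(X, y):
    """Return (feature, threshold) of the binary split with the lowest weighted Gini."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for a in np.unique(X[:, j]):
            mask = X[:, j] <= a
            if mask.all() or not mask.any():
                continue  # degenerate split
            w = mask.mean()
            score = w * gini(y[mask]) + (1 - w) * gini(y[~mask])
            if score < best[2]:
                best = (j, a, score)
    return best[0], best[1]

def build_cart(X, y, depth=0, max_depth=3):
    """Recursively grow a classification tree as nested dicts."""
    node = {"majority": Counter(y).most_common(1)[0][0]}
    j, a = best_gini_split(X, y)
    # Stop on a pure node, no usable split, or maximum depth.
    if gini(y) == 0.0 or j is None or depth >= max_depth:
        return node
    mask = X[:, j] <= a
    node.update(feature=j, threshold=a,
                left=build_cart(X[mask], y[mask], depth + 1, max_depth),
                right=build_cart(X[~mask], y[~mask], depth + 1, max_depth))
    return node

# Tiny usage example.
X = np.array([[2.0, 1.0], [3.0, 1.0], [10.0, 2.0], [19.0, 2.0]])
y = np.array([0, 0, 1, 1])
tree = build_cart(X, y)
```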

CART pruning

Reduced error pruning

One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy (typically measured on a held-out validation set) is not affected, the change is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and speed.

[1] https://en.wikipedia.org/wiki/Pruning_%28decision_trees%29
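A sketch of reduced error pruning on the same hypothetical nested-dict tree format as the CART sketch above (every node stores its "majority" class), assuming a held-out validation set:

```python
def predict_one(node, x):
    # Descend until a leaf, then return its majority class.
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["majority"]

def accuracy(tree, X_val, y_val):
    return sum(predict_one(tree, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def reduced_error_prune(tree, X_val, y_val):
    """Bottom-up: collapse a subtree into a majority-class leaf whenever
    validation accuracy does not decrease."""
    def prune(node):
        if "feature" not in node:
            return node
        prune(node["left"])
        prune(node["right"])
        before = accuracy(tree, X_val, y_val)
        backup = dict(node)
        node.clear()
        node["majority"] = backup["majority"]  # tentatively make it a leaf
        if accuracy(tree, X_val, y_val) < before:
            node.clear()
            node.update(backup)                # pruning hurt accuracy: revert
        return node
    return prune(tree)
```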

Cost complexity pruning

  1. Generate a series of trees $T_0, T_1, T_2, \dots, T_n$ where $T_0$ is the initial tree and $T_n$ is the root alone.
  2. At step $i$, the tree is created by removing a subtree from tree $i-1$ and replacing it with a leaf node whose value is chosen by majority vote, as in the tree-building algorithm.
    • Define the error rate of tree $T$ over data set $S$ as $err(T, S)$. The subtree that minimizes $$\frac{err(prune(T,t),S)- err(T,S)}{|leaves(T)|-|leaves(prune(T,t))|}$$ is chosen for removal. The function $prune(T,t)$ denotes the tree obtained by pruning subtree $t$ from tree $T$. Once the series of trees has been created, the best tree is chosen by cross-validation on the training data.

Notes:

  • The cost (loss) of tree $T$ is defined as: $$err_a(T) = err(T) + a|T| \quad (2)$$
  • The cost of the pruned tree $T-t$ is defined as:

$$err_a(T,t) = err(T,t)+ a|T-t| \quad (3)$$ where $t$ is a subtree of $T$ and $|T-t|$ is the number of leaves remaining after pruning subtree $t$. Setting (2) equal to (3), we get: $$a = \frac{err(T,t)- err(T)}{|T|-|T-t|}$$

[1] http://daniellaah.github.io/2017/Statistical-Learning-Notes-Chapter5-DecisionTree-2.html
[2] https://en.wikipedia.org/wiki/Pruning_%28decision_trees%29
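Minimal cost-complexity pruning is exposed in scikit-learn through the ccp_alpha parameter and cost_complexity_pruning_path; a short sketch that picks the penalty $a$ (called alpha there) by cross-validation (the breast-cancer dataset is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# cost_complexity_pruning_path enumerates the effective alphas, i.e. the
# thresholds that produce the nested sequence of pruned trees T_0, ..., T_n.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha whose pruned tree has the best cross-validated accuracy.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print("best ccp_alpha:", best_alpha, "cv accuracy:", max(scores))
```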
