Decision Tree
Basics
- Non-parametric, supervised learning algorithms
- Given the training data, a decision tree algorithm divides the feature space into regions. For inference, we first find which region the test data point falls in, then take the mean label value of that region (regression) or the majority label value (classification); see the usage sketch after this list.
- Construction: top-down; at each node, choose the variable (and split point) that divides the data so that the target values within each resulting region are as homogeneous as possible. Two common splitting metrics are Gini impurity and information gain; in practice the choice rarely matters much.
- Advantages: simple to understand & interpret, mirrors human decision making
- Disadvantages:
- can overfit easily (and generalize poorly) if we don't limit the depth of the tree
- can be non-robust: a small change in the training data can lead to a totally different tree
- instability: sensitive to rotations of the training data, because its decision boundaries are axis-aligned (orthogonal)
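As a quick illustration of the fit/predict workflow described above, here is a minimal sketch using scikit-learn; the dataset, `max_depth=3`, and the train/test split are illustrative choices, not part of these notes:

```python
# Minimal sketch: fitting a depth-limited decision tree with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limiting max_depth is one way to control the overfitting mentioned above.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Inference: each test point is routed to a leaf (region) and receives that
# leaf's majority class.
print(clf.predict(X_test[:5]))
print("test accuracy:", clf.score(X_test, y_test))
```

Setting `criterion="entropy"` instead of `"gini"` switches the splitting metric; as noted above, this rarely changes the resulting tree much.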
Least Squares Regression Tree
Input: training data $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$. Output: a regression tree $f(x)$. Splitting criterion: MSE.
- Step 1: Find the optimal splitting variable (feature) $j$ and splitting point $s$ according to:
$$\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right] \tag{1}$$
Iterate over all variables $j$; for a given $j$, scan the splitting points $s$ and pick the pair $(j, s)$ that minimizes formula (1).
- Step 2: Divide the training data $D$ with the chosen pair $(j, s)$ and output the region means:
$$R_1(j,s) = \{x \mid x^{(j)} \le s\}, \quad R_2(j,s) = \{x \mid x^{(j)} > s\}, \quad \hat{c}_m = \frac{1}{N_m}\sum_{x_i \in R_m(j,s)} y_i, \; m = 1, 2$$
- Step 3: Repeat steps 1 and 2 on the resulting regions until the stopping criterion is satisfied.
- Step 4: With the input space divided into $M$ regions $R_1, R_2, \dots, R_M$, output the regression tree:
$$f(x) = \sum_{m=1}^{M} \hat{c}_m \, I(x \in R_m)$$
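A minimal sketch of the split search in step 1, assuming a NumPy feature matrix `X` and target vector `y`; the function name `best_split` and the exhaustive scan over observed feature values are illustrative choices:

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search for the (feature j, threshold s) pair minimizing
    the squared-error criterion in formula (1)."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)  # (j, s, loss)
    for j in range(n_features):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # The optimal constants c1, c2 are the region means; plug them in.
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best

# Example usage on toy data:
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.1, 0.9, 1.0, 5.0, 5.2, 4.8])
print(best_split(X, y))  # expected to split around x <= 3
```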
Classification Tree
Gini index (only used by CART)
Suppose we have $K$ classes in a classification problem, and $p_k$ denotes the probability that a sample belongs to class $k$; then the Gini index is defined as:
$$\text{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$
For a given sample set $D$, the Gini index is:
$$\text{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2$$
where $K$ is the total number of classes and $|C_k|$ is the number of samples belonging to class $k$.
If we divide $D$ into $D_1$ and $D_2$ based on some value $a$ of feature $A$, then the weighted average Gini index is:
$$\text{Gini}(D, A) = \frac{|D_1|}{|D|}\text{Gini}(D_1) + \frac{|D_2|}{|D|}\text{Gini}(D_2)$$
Similar to entropy, the higher the Gini index, the more uncertain (impure) the set is. Therefore, we want to choose the split that minimizes the weighted Gini index.
[1] http://www.ke.tu-darmstadt.de/lehre/archiv/ws0809/mldm/dt.pdf
[3] https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
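A small Python sketch of these Gini formulas (the helper names `gini` and `gini_split` are illustrative):

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels_d1, labels_d2):
    """Weighted Gini index of a binary split of D into D1 and D2."""
    n1, n2 = len(labels_d1), len(labels_d2)
    n = n1 + n2
    return n1 / n * gini(labels_d1) + n2 / n * gini(labels_d2)

y = np.array([0, 0, 1, 1, 1, 2])
print(gini(y))                   # impurity of the whole set
print(gini_split(y[:3], y[3:]))  # weighted impurity after a candidate split
```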
Entropy/Information Gain (used by ID3 and C4.5)
Given a set of samples $S$, entropy is defined as:
$$\text{Entropy}(S) = -\sum_{i=1}^{K} p_i \log_2 p_i$$
where $p_i$ represents the proportion of samples in $S$ that belong to class $i$ (e.g. the class percentages in a child node resulting from a split). It measures the amount of unorderedness in the class distribution of $S$. When we split the whole set $S$ into subsets $S_1, \dots, S_v$ by feature $A$, the weighted average entropy over all subsets resulting from the split is:
$$I(S, A) = \sum_{i=1}^{v} \frac{|S_i|}{|S|} \text{Entropy}(S_i)$$
Information gain is defined as:
$$\text{Gain}(S, A) = \text{Entropy}(S) - I(S, A)$$
i.e. information gain = Entropy(parent) - weighted sum of Entropy(children).
Maximizing the information gain is equivalent to minimizing the average entropy $I(S, A)$, because $\text{Entropy}(S)$ is constant for all attributes.
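A matching sketch for entropy and information gain (helper names are illustrative; inputs are NumPy label arrays as in the Gini sketch above):

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, child_label_groups):
    """Gain(S, A) = Entropy(S) - sum_i |S_i|/|S| * Entropy(S_i)."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

y = np.array([0, 0, 1, 1, 1, 1])
print(information_gain(y, [y[:2], y[2:]]))  # a perfect split gives maximal gain
```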
Information Gain Ratio
The main issue with information gain is that it biases the decision tree towards attributes with a large number of distinct values, for example the credit-card number attribute of a customer. Here we introduce the intrinsic information of a split (how much information we need to tell which branch an instance belongs to). It is defined as follows:
$$\text{IntI}(S, A) = -\sum_{i} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$
We can see that attributes with higher intrinsic information are less useful. To reduce information gain's bias towards multi-valued attributes, we take the number and size of branches into account when choosing an attribute, i.e., we correct the information gain by the intrinsic information of the split. The information gain ratio is defined as:
$$\text{GainRatio}(S, A) = \frac{\text{Gain}(S, A)}{\text{IntI}(S, A)}$$
The higher the gain ratio, the better.
[1] http://www.ke.tu-darmstadt.de/lehre/archiv/ws0809/mldm/dt.pdf
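Extending the previous sketch with intrinsic information and gain ratio (again, helper names are illustrative); the example shows how a many-valued split is penalized:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def intrinsic_information(parent_labels, child_label_groups):
    """IntI(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)."""
    n = len(parent_labels)
    fractions = np.array([len(g) / n for g in child_label_groups])
    return -np.sum(fractions * np.log2(fractions))

def gain_ratio(parent_labels, child_label_groups):
    """GainRatio(S, A) = Gain(S, A) / IntI(S, A)."""
    n = len(parent_labels)
    gain = entropy(parent_labels) - sum(
        len(g) / n * entropy(g) for g in child_label_groups)
    return gain / intrinsic_information(parent_labels, child_label_groups)

# A many-valued split (one branch per sample) has high intrinsic information,
# so its gain ratio is penalized relative to its raw information gain.
y = np.array([0, 0, 1, 1, 1, 1])
print(gain_ratio(y, [y[i:i + 1] for i in range(len(y))]))  # heavily penalized
print(gain_ratio(y, [y[:2], y[2:]]))                       # clean binary split
```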
Compare Gini index, information gain and gain ratio
- Information gain: biased towards multi-valued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the other.
- Gini index:
  - biased towards multi-valued attributes
  - favors equal-sized partitions
[1] http://www.inf.unibz.it/dis/teaching/DWDM/slides2011/lesson5-Classification-2.pdf
CART
- Step 1: Given a set of training data $D$, calculate the Gini index (or information gain) for every candidate split: for each feature $A$ and each of its possible values $a$, compute the weighted Gini index obtained by dividing $D$ into $D_1 = \{(x, y) \in D \mid x_A = a\}$ and $D_2 = D \setminus D_1$. Select the pair of feature $A$ and split point $a$ with the minimum Gini index as the best feature and split.
- Step 2: Divide $D$ into $D_1$ and $D_2$ accordingly.
- Repeat steps 1 and 2 on $D_1$ and $D_2$ until the stopping criterion is satisfied.
- Return the CART tree (a recursive sketch follows these steps).
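A compact, illustrative sketch of this recursive procedure for binary splits on numeric features; the `Node` class, `build_cart`, and the `<=` thresholding (rather than the equality test above) are assumptions of this sketch, not part of the notes:

```python
import numpy as np
from collections import Counter
from dataclasses import dataclass

@dataclass
class Node:
    feature: int = None      # index of the splitting feature (internal nodes)
    threshold: float = None  # split point s: go left if x[feature] <= s
    left: "Node" = None
    right: "Node" = None
    label: int = None        # majority class (leaf nodes)

def gini(y):
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def build_cart(X, y, depth=0, max_depth=3, min_samples=2):
    # Stop when the node is pure or a stopping criterion is reached.
    if depth >= max_depth or len(y) < min_samples or len(np.unique(y)) == 1:
        return Node(label=Counter(y.tolist()).most_common(1)[0][0])
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            mask = X[:, j] <= s
            if mask.all() or not mask.any():
                continue
            # Weighted Gini index of the candidate split.
            score = (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / len(y)
            if score < best[2]:
                best = (j, s, score)
    if best[0] is None:
        return Node(label=Counter(y.tolist()).most_common(1)[0][0])
    j, s, _ = best
    mask = X[:, j] <= s
    return Node(feature=j, threshold=s,
                left=build_cart(X[mask], y[mask], depth + 1, max_depth, min_samples),
                right=build_cart(X[~mask], y[~mask], depth + 1, max_depth, min_samples))

def predict_one(node, x):
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

X = np.array([[2.0, 1.0], [3.0, 1.0], [10.0, 2.0], [11.0, 2.0]])
y = np.array([0, 0, 1, 1])
tree = build_cart(X, y)
print([predict_one(tree, x) for x in X])  # expected: [0, 0, 1, 1]
```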
CART pruning
Reduced error pruning
One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected then the change is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and speed.
[1] https://en.wikipedia.org/wiki/Pruning_%28decision_trees%29
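A minimal sketch of this bottom-up procedure, assuming the illustrative `Node` class and `predict_one` helper from the CART sketch above plus a held-out validation set (`X_val`, `y_val`, which are hypothetical here):

```python
import copy
import numpy as np
from collections import Counter

# Assumes the illustrative Node class and predict_one helper from the CART
# sketch above, plus a separate validation set (X_val, y_val).

def accuracy(tree, X, y):
    return np.mean([predict_one(tree, x) == label for x, label in zip(X, y)])

def prune_reduced_error(node, tree, X_train, y_train, X_val, y_val):
    """Bottom-up: try replacing each internal node with a leaf holding the
    majority class of the training samples reaching it; keep the change only
    if validation accuracy does not drop."""
    if node.label is not None or len(y_train) == 0:
        return
    mask = X_train[:, node.feature] <= node.threshold
    prune_reduced_error(node.left, tree, X_train[mask], y_train[mask], X_val, y_val)
    prune_reduced_error(node.right, tree, X_train[~mask], y_train[~mask], X_val, y_val)
    before = accuracy(tree, X_val, y_val)
    backup = copy.copy(node)
    node.feature = node.threshold = node.left = node.right = None
    node.label = Counter(y_train.tolist()).most_common(1)[0][0]
    if accuracy(tree, X_val, y_val) < before:
        node.__dict__.update(backup.__dict__)  # revert: pruning hurt accuracy

# Usage with the toy tree from the CART sketch (X_val, y_val are hypothetical):
# prune_reduced_error(tree, tree, X, y, X_val, y_val)
```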
Cost complexity pruning
- Generate a series of trees $T_0, T_1, \dots, T_m$, where $T_0$ is the initial tree and $T_m$ is the root alone.
- At step $i$, the tree $T_i$ is created by removing a subtree from tree $T_{i-1}$ and replacing it with a leaf node whose value is chosen (by majority vote) as in the tree-building algorithm.
- Define the error rate of tree $T$ over data set $S$ as $\text{err}(T, S)$. The subtree $t$ that minimizes
$$\frac{\text{err}(\text{prune}(T, t), S) - \text{err}(T, S)}{|\text{leaves}(T)| - |\text{leaves}(\text{prune}(T, t))|}$$
is chosen for removal, where $\text{prune}(T, t)$ denotes the tree obtained by pruning the subtree $t$ from the tree $T$ and $\text{leaves}(T)$ denotes the set of leaves of $T$. Once the series of trees has been created, the best tree is chosen by cross-validation on the training data.
Notes:
- The loss cost of the subtree $T_t$ rooted at an internal node $t$ is defined as:
$$C_\alpha(T_t) = C(T_t) + \alpha |T_t| \tag{2}$$
- The loss cost of the single-node tree $t$ (i.e. the tree obtained after pruning $T_t$ down to a leaf) is defined as:
$$C_\alpha(t) = C(t) + \alpha \tag{3}$$
where $T_t$ is the subtree rooted at $t$ and $|T_t|$ is its number of leaves. If we let (2) = (3), we get:
$$\alpha = \frac{C(t) - C(T_t)}{|T_t| - 1}$$
At this value of $\alpha$, pruning $T_t$ no longer increases the overall loss, so at each round the internal node with the smallest such $\alpha$ (the weakest link) is pruned.
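In practice, scikit-learn implements this minimal cost-complexity pruning via `ccp_alpha` and `cost_complexity_pruning_path`; a short sketch of selecting $\alpha$ on a held-out set (the dataset and split are illustrative, and cross-validation could be used instead, as described above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Effective alphas at which subtrees become the "weakest link" and are pruned.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Fit one pruned tree per alpha and pick the best on held-out data.
scores = []
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    scores.append((clf.score(X_val, y_val), alpha))

best_score, best_alpha = max(scores)
print(f"best alpha = {best_alpha:.5f}, validation accuracy = {best_score:.3f}")
```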
[1] http://daniellaah.github.io/2017/Statistical-Learning-Notes-Chapter5-DecisionTree-2.html
[2] https://en.wikipedia.org/wiki/Pruning_%28decision_trees%29