Map > Data Mining > Predicting the Future > Modeling > Classification > Decision Tree > Overfitting  
Decision Tree  Overfitting 

Overfitting is a significant practical difficulty for decision tree
models and many other predictive models. Overfitting happens when the learning algorithm continues
to develop
hypotheses that reduce training set error at the cost of an increased test set error. There are several approaches to avoiding overfitting in building decision trees. 



Practically, the second approach of postpruning overfit trees is more successful because it is not easy to precisely estimate when to stop growing the tree.  
The important step of tree pruning is to define a criterion be used to determine the correct final tree size using one of the following methods:  


The first method is the most common approach. In this approach, the available data are separated into two sets of examples: a training set, which is used to build the decision tree, and a validation set, which is used to evaluate the impact of pruning the tree. The second method is also a common approach. Here, we explain the error estimation and Chi^{2} test.  
Postpruning using Error estimation 

Error estimate for a subtree is weighted sum of error estimates for all its leaves. The error estimate (e) for a node is:  
In the following example we set Z to 0.69 which is equal to a confidence level of 75%.  
The error rate at the parent node is 0.46 and since the error rate for its children (0.51) increases with the split, we do not want to keep the children.  
Postpruning using Chi^{2} test  
In Chi^{2} test we construct the corresponding frequency table and calculate the Chi^{2} value and its probability.  




Chi^{2} = 0.21 Probability = 0.90 degree of freedom=2 

If we require that the probability has to be less than a limit (e.g., 0.05), therefore we decide not to split the node.  