What is ERT?
- Extremely randomized trees: a tree-based ensemble method for supervised classification and regression problems.
- It essentially consists of strongly randomizing both attribute and cut-point choice when splitting a tree node.
- The strength of the randomization can be tuned to problem specifics by the appropriate choice of a parameter.
- Besides accuracy, the main strength of the resulting algorithm is computational efficiency.
- The Extra-Trees algorithm builds an ensemble of unpruned decision or regression trees according to the classical top-down procedure.
- Its two main differences with other tree-based ensemble methods are that it splits nodes by choosing cut-points fully at random and that it uses the whole learning sample (rather than a bootstrap replica) to grow the trees.
- For classification, the split finally retained at a node is, among the K random candidates, the one maximizing the normalized information gain (a code sketch follows this list):

  Score_c(s, S) = 2 I_c^s(S) / (H_s(S) + H_c(S)),

  where H_c(S) is the (log) entropy of the classification in S, H_s(S) is the split entropy (also called split information by Quinlan), and I_c^s(S) is the mutual information of the split outcome and the classification.
- From the bias-variance point of view, the rationale behind the Extra-Trees method is that the explicit randomization of cut-point and attribute choice, combined with ensemble averaging, should reduce variance more strongly than the weaker randomization schemes used by other methods.
- The use of the full original learning sample rather than bootstrap replicas is motivated by the aim of minimizing bias.
- Randomization increases the bias and variance of individual trees, but it may decrease their variance with respect to the learning sample.
- The part of the variance due to randomization can be canceled out by averaging over a sufficiently large ensemble of trees.
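
To make the procedure concrete, here is a minimal Python sketch of the Extra-Trees node split for numeric attributes and a classification target, scoring candidates with the formula above. All names (entropy, split_score, pick_random_split) are illustrative rather than from the paper, and the stopping tests of the full algorithm are omitted.

```python
import numpy as np

def entropy(labels):
    """Shannon (log) entropy of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_score(y, left):
    """Normalized information gain: 2 I_c^s(S) / (H_s(S) + H_c(S))."""
    n = len(y)
    h_c = entropy(y)                         # H_c(S): classification entropy
    h_s = entropy(left.astype(int))          # H_s(S): split entropy
    h_cond = ((left.sum() / n) * entropy(y[left])
              + ((~left).sum() / n) * entropy(y[~left]))
    i_sc = h_c - h_cond                      # I_c^s(S): mutual information
    return 2.0 * i_sc / (h_s + h_c)

def pick_random_split(X, y, K, rng):
    """Extra-Trees node split: draw K random (attribute, cut-point)
    candidates and keep the one with the highest score."""
    candidates = rng.choice(X.shape[1], size=K, replace=False)
    best = None
    for a in candidates:
        lo, hi = X[:, a].min(), X[:, a].max()
        if lo == hi:                         # constant attribute: no valid cut
            continue
        cut = rng.uniform(lo, hi)            # cut-point drawn fully at random
        left = X[:, a] < cut
        if not left.any() or left.all():     # degenerate split: skip
            continue
        s = split_score(y, left)
        if best is None or s > best[0]:
            best = (s, a, cut)
    return best                              # (score, attribute index, cut-point)

# Example: one random split on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 2] > 0).astype(int)                # class depends on attribute 2
print(pick_random_split(X, y, K=3, rng=rng))
```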
On the effect of parameters
- K determines the strength of the attribute selection process.
- nmin determines the strength of averaging out output noise.
- M determines the strength of the variance reduction achieved by aggregating the ensemble.
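
For concreteness, these three parameters map directly onto scikit-learn's ExtraTreesClassifier: K corresponds to max_features, nmin to min_samples_split, and M to n_estimators. The toy dataset below is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = ExtraTreesClassifier(
    n_estimators=100,      # M: averaging strength (paper default: 100)
    max_features="sqrt",   # K = √n: attribute selection strength
    min_samples_split=2,   # nmin: smoothing strength (classification default)
    random_state=0,
).fit(X, y)
print(clf.score(X, y))
```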
Attribute selection strength K
- For a given problem, the smaller K is, the stronger the randomization of the trees and the weaker the dependence of their structure on the output values of the learning sample.
- For classification problems, the default value is K = √n (with n the number of attributes); for regression, the default corresponds to the highest possible value, K = n, so that only the cut-points are randomized.
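
One illustrative way to observe this effect is to sweep max_features (the scikit-learn counterpart of K) on a synthetic task; the exact scores will of course vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# K = 1 gives totally randomized trees; K = 20 (= n here) uses all attributes.
for K in (1, 4, 10, 20):
    clf = ExtraTreesClassifier(n_estimators=100, max_features=K, random_state=0)
    print(f"K={K:2d}  CV accuracy={cross_val_score(clf, X, y, cv=5).mean():.3f}")
```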
Smoothing strength nmin
- nmin is the minimum number of samples required for splitting a node.
- Larger values of nmin lead to smaller trees, hence higher bias and smaller variance; the optimal value therefore depends in principle on the level of output noise in the dataset.
- Although possibly slightly suboptimal, the default values of nmin = 2 (classification) and nmin = 5 (regression) appear to be robust choices over a broad range of typical conditions.
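
As a quick, illustrative check of this noise dependence, one can sweep min_samples_split (the scikit-learn counterpart of nmin) on a deliberately noisy synthetic regression task.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

# The higher the output noise, the more a larger nmin should pay off.
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

for nmin in (2, 5, 20, 50):
    reg = ExtraTreesRegressor(n_estimators=100, min_samples_split=nmin,
                              random_state=0)
    print(f"nmin={nmin:2d}  CV R^2={cross_val_score(reg, X, y, cv=5).mean():.3f}")
```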
Averaging strength M
- The parameter M denotes the number of trees in the ensemble.
- It is well known that for randomization methods prediction error is a monotonically decreasing function of M (see, e.g., Breiman, 2001). So, in principle, the higher the value of M, the better from the accuracy point of view.
- The choice of an appropriate value will thus essentially depend on the resulting compromise between computational requirements and accuracy.
- Default value: M = 100
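
A simple way to observe this monotone behavior is to grow one ensemble incrementally with scikit-learn's warm_start option and score it at increasing sizes; the synthetic dataset and the chosen sizes are just for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# warm_start keeps the trees already grown and only adds new ones,
# so each iteration scores a strictly larger ensemble.
clf = ExtraTreesClassifier(n_estimators=1, warm_start=True, random_state=0)
for M in (1, 10, 50, 100, 500):
    clf.n_estimators = M
    clf.fit(X_tr, y_tr)
    print(f"M={M:3d}  test accuracy={clf.score(X_te, y_te):.3f}")
```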