**What is ERT?**

- Extremely randomized trees: a tree-based ensemble method for supervised **classification and regression** problems.
- It essentially consists of strongly randomizing both **attribute and cut-point choice** while splitting a tree node.
- The **strength of the randomization** can be tuned to problem specifics by the appropriate choice of a parameter.
- Besides **accuracy**, the main strength of the resulting algorithm is **computational efficiency**.

**Extra-Trees algorithm**

- The Extra-Trees algorithm builds an ensemble of **unpruned decision or regression trees** according to the classical top-down procedure.
- Its two main differences with other tree-based ensemble methods are that it splits nodes by **choosing cut-points fully at random** and that it **uses the whole learning sample** (rather than a bootstrap replica) to grow the trees.
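The node-splitting step described above can be sketched as follows. This is a hypothetical helper for illustration (names like `pick_random_split` are assumptions, not the published reference code): draw *K* candidate attributes, and for each one draw a cut-point uniformly at random between that attribute's minimum and maximum in the node's sample.

```python
import random

def pick_random_split(samples, K):
    """Sketch of the Extra-Trees split step: choose K attributes without
    replacement and, for each, a fully random cut-point between the
    attribute's min and max over the whole node sample (no bootstrap).
    samples: list of (feature_vector, label) pairs."""
    n_attributes = len(samples[0][0])
    candidates = random.sample(range(n_attributes), K)
    splits = []
    for a in candidates:
        values = [x[a] for x, _ in samples]
        cut = random.uniform(min(values), max(values))  # random cut-point
        splits.append((a, cut))
    return splits
```

Among the *K* candidate splits returned, the algorithm keeps the one with the best score; when *K* = 1 the choice is entirely random and the score is never used.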

For classification, a candidate split *s* of sample *S* is scored with the normalized measure

Score(s, S) = 2 · I_{s,c}(S) / (H_s(S) + H_c(S)),

where H_c(S) is the (log) **entropy of the classification** in S, H_s(S) is the **split entropy** (also called split information by Quinlan), and I_{s,c}(S) is the **mutual information** of the split outcome and the classification.
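This score can be computed directly from the definitions above; a minimal sketch using Shannon entropies in bits and the identity I_{s,c}(S) = H_c(S) + H_s(S) − H_{s,c}(S) (function names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_score(classes, sides):
    """Normalized score 2*I / (Hs + Hc) for one split.
    classes: class label of each sample; sides: split outcome of each
    sample. Mutual information via I = Hc + Hs - H(class, side)."""
    h_c = entropy(classes)
    h_s = entropy(sides)
    h_joint = entropy(list(zip(classes, sides)))
    mutual_info = h_c + h_s - h_joint
    return 2.0 * mutual_info / (h_s + h_c)
```

A split that separates the classes perfectly scores 1, while a split independent of the classes scores 0.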

**Bias/variance analysis**

- From the **bias/variance** point of view, the rationale behind the Extra-Trees method is that the **explicit randomization of the cut-point and attribute** combined with **ensemble averaging** should be able to **reduce variance more strongly** than the weaker randomization schemes used by other methods.
- The use of the full original learning sample rather than bootstrap replicas is motivated by the goal of **minimizing bias**.
- Randomization increases the bias and variance of individual trees, but may decrease their variance with respect to the learning sample.
- The part of the variance due to randomization can be canceled out by averaging over a sufficiently large ensemble of trees.
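The last point can be seen in a toy simulation (this is an illustration of averaging out randomization noise, not the Extra-Trees algorithm itself): model each randomized tree as the true target plus zero-mean randomization noise, and compare the variance of one predictor against an average of 100.

```python
import random
import statistics

def noisy_tree_prediction(rng):
    """Toy stand-in for one randomized tree: true target 0.5 plus
    zero-mean Gaussian noise representing the tree's randomization."""
    return 0.5 + rng.gauss(0.0, 0.2)

def ensemble_prediction(M, rng):
    """Average the predictions of M independently randomized 'trees'."""
    return sum(noisy_tree_prediction(rng) for _ in range(M)) / M

rng = random.Random(0)
single = [ensemble_prediction(1, rng) for _ in range(2000)]
averaged = [ensemble_prediction(100, rng) for _ in range(2000)]
print(statistics.variance(single), statistics.variance(averaged))
```

With independent noise, the randomization part of the variance shrinks roughly as 1/M, which is why a sufficiently large ensemble cancels it out.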

**On the effect of parameters**

- *K* determines the strength of the attribute selection process.
- *n*_{min} determines the strength of averaging output noise.
- *M* determines the strength of the variance reduction of the ensemble model aggregation.

**Attribute selection strength** *K*

- For a given problem, the **smaller** *K* is, the **stronger the randomization** of the trees and the weaker the dependence of their structure on the output values of the learning sample.
- For classification problems, the **default value** of *K* = √*n* is shown as a vertical line on the graphs; for regression the default corresponds to the highest possible value of *K* (i.e. *K* = *n*).

**Smoothing strength** *n*_{min}

- *n*_{min} is the minimum number of samples required for splitting a node.
- **Larger values of** *n*_{min} lead to smaller trees, higher bias, and smaller variance; its optimal value hence depends in principle on the **level of output noise** in the dataset.
- Although possibly slightly suboptimal, the **default values** of *n*_{min} = 2 (classification) and *n*_{min} = 5 (regression) appear to be robust choices in a broad range of typical conditions.

**Averaging strength** *M*

- The parameter *M* denotes the number of trees in the ensemble.
- It is well known that for randomization methods the prediction error is a monotonically decreasing function of *M* (see, e.g., Breiman, 2001). So in principle, the **higher** the value of *M*, the **better** from the accuracy point of view.
- The choice of an appropriate value will thus essentially depend on the compromise between computational requirements and accuracy.
- Default value: *M* = 100.
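The three parameters map directly onto scikit-learn's `ExtraTreesClassifier`: `max_features` plays the role of *K*, `min_samples_split` of *n*_{min}, and `n_estimators` of *M*. A minimal sketch with the classification defaults discussed above (the dataset here is synthetic, for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

model = ExtraTreesClassifier(
    n_estimators=100,     # M: averaging strength
    max_features="sqrt",  # K = sqrt(n): attribute selection strength
    min_samples_split=2,  # n_min: smoothing strength
    bootstrap=False,      # whole learning sample, no bootstrap replica
    random_state=0,
)
scores = cross_val_score(model, X, y, cv=5)
print(round(scores.mean(), 3))
```

Raising `n_estimators` trades computation for accuracy, while lowering `max_features` toward 1 strengthens the randomization, exactly as described for *K* above.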