
Extremely randomized trees (ERT)

What is ERT?

  • Extremely randomized trees: a tree-based ensemble method for supervised classification and regression problems.
  • It consists essentially of strongly randomizing both the attribute and the cut-point choice while splitting a tree node.
  • The strength of the randomization can be tuned to problem specifics by the appropriate choice of a parameter.
  • Besides accuracy, the main strength of the resulting algorithm is computational efficiency (a minimal usage sketch follows this list).
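The method is available off the shelf in scikit-learn as ExtraTreesClassifier / ExtraTreesRegressor. Below is a minimal, non-authoritative usage sketch; the dataset and parameter values are illustrative only.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    # An Extra-Trees ensemble with illustrative settings (the parameters
    # K, nmin and M discussed later map onto its keyword arguments).
    model = ExtraTreesClassifier(n_estimators=100, random_state=0)
    print(cross_val_score(model, X, y, cv=5).mean())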


Extra-Trees algorithm

  • The Extra-Trees algorithm builds an ensemble of unpruned decision or regression trees according to the classical top-down procedure.
  • Its two main differences from other tree-based ensemble methods are that it splits nodes by choosing cut-points fully at random and that it uses the whole learning sample (rather than a bootstrap replica) to grow the trees; a sketch of this splitting procedure is given below.
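The following is a compact Python sketch of the node-splitting step, written from the description above. The function names and the use of a variance-reduction score for the toy regression case are my own choices, not a fixed API, and the stopping conditions (e.g. the nmin test) are omitted.

    import random

    def pick_a_random_split(S, a):
        """Draw a cut-point uniformly between the min and max of attribute a in S."""
        values = [x[a] for x, _ in S]
        return a, random.uniform(min(values), max(values))

    def score(S, split):
        """Toy split score for regression: relative reduction of output variance."""
        a, cut = split
        left = [y for x, y in S if x[a] < cut]
        right = [y for x, y in S if x[a] >= cut]
        if not left or not right:
            return 0.0
        def var(ys):
            m = sum(ys) / len(ys)
            return sum((y - m) ** 2 for y in ys) / len(ys)
        total = var([y for _, y in S])
        if total == 0.0:
            return 0.0
        weighted = (len(left) * var(left) + len(right) * var(right)) / len(S)
        return (total - weighted) / total

    def split_a_node(S, K):
        """Pick K attributes at random, draw ONE random cut-point for each,
        and return the candidate split with the highest score."""
        n_attributes = len(S[0][0])
        attributes = random.sample(range(n_attributes), min(K, n_attributes))
        candidates = [pick_a_random_split(S, a) for a in attributes]
        return max(candidates, key=lambda s: score(S, s))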

For classification, candidate splits are scored with a normalized version of the information gain,

    Score_c(s, S) = 2 I_c^s(S) / (H_s(S) + H_c(S)),

where H_c(S) is the (log) entropy of the classification in S, H_s(S) is the split entropy (also called split information by Quinlan), and I_c^s(S) is the mutual information of the split outcome and the classification.
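As a concrete numeric check of this formula (the counts are made up: a node of ten samples with binary classes, split into two branches):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    S     = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # class labels of the node sample
    left  = [0, 0, 0, 0, 1]                  # labels falling left of the cut
    right = [1, 1, 1, 1, 1]                  # labels falling right of the cut

    H_c = entropy(S)                                   # classification entropy
    H_s = entropy([0] * len(left) + [1] * len(right))  # split entropy
    # Mutual information: H_c(S) minus the weighted child entropies.
    I = H_c - (len(left) * entropy(left) + len(right) * entropy(right)) / len(S)

    score = 2 * I / (H_s + H_c)
    print(f"H_c={H_c:.3f}, H_s={H_s:.3f}, I={I:.3f}, Score={score:.3f}")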

Bias/variance analysis

  • From the bias/variance point of view, the rationale behind the Extra-Trees method is that the explicit randomization of the cut-point and attribute choices, combined with ensemble averaging, should reduce variance more strongly than the weaker randomization schemes used by other methods.
  • The use of the full original learning sample rather than bootstrap replicas is motivated by the goal of minimizing bias.
  • Randomization increases the bias and variance of individual trees, but it may decrease the part of their variance that is due to the learning sample.
  • The part of the variance due to randomization can be canceled out by averaging over a sufficiently large ensemble of trees, as the small simulation below illustrates.
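To make the averaging argument tangible, here is a small simulation using scikit-learn's ExtraTreesRegressor on synthetic data (the dataset and seeds are arbitrary): for each ensemble size M, the model is retrained with different random seeds and the spread of its prediction at one fixed point is measured. The randomization-induced variance shrinks as M grows.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import ExtraTreesRegressor

    X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
    x0 = X[:1]  # a fixed query point

    for M in (1, 10, 100):
        # Retrain with 20 different seeds; only the internal randomization changes.
        preds = [ExtraTreesRegressor(n_estimators=M, random_state=seed)
                 .fit(X, y).predict(x0)[0] for seed in range(20)]
        print(f"M={M:3d}  prediction std over seeds: {np.std(preds):.2f}")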

On the effect of parameters

  • K determines the strength of the attribute selection process.
  • nmin determines the strength of averaging out output noise.
  • M determines the strength of the variance reduction obtained by aggregating the ensemble. (Their scikit-learn counterparts are sketched just below.)
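For orientation, these three parameters correspond roughly to the following keyword arguments in scikit-learn's implementation (a loose mapping; the library exposes further options of its own):

    from sklearn.ensemble import ExtraTreesClassifier

    # K    -> max_features       (number of attributes drawn at each node)
    # nmin -> min_samples_split  (minimum sample size required to split a node)
    # M    -> n_estimators       (number of trees in the ensemble)
    model = ExtraTreesClassifier(max_features="sqrt",   # K = sqrt(n)
                                 min_samples_split=2,   # nmin = 2 (classification)
                                 n_estimators=100)      # M = 100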

Attribute selection strength K

  • For a given problem, the smaller K is, the stronger the randomization of the trees and the weaker the dependence of their structure on the output values of the learning sample.
  • For classification problems, the default value is K = √n, where n is the number of attributes; for regression, the default corresponds to the highest possible value (i.e. K = n). A small sweep over K is sketched below.
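As an illustrative (not definitive) experiment, one can sweep K via scikit-learn's max_features on a toy classification dataset and watch the cross-validated accuracy change:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    n = X.shape[1]  # number of attributes

    for K in (1, int(n ** 0.5), n):
        model = ExtraTreesClassifier(n_estimators=100, max_features=K, random_state=0)
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(f"K={K:2d}  CV accuracy: {acc:.3f}")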

Smoothing strength nmin

  • nmin is the minimum number of samples required for splitting a node.
  • Larger values of nmin lead to smaller trees, hence higher bias and smaller variance, so its optimal value depends in principle on the level of output noise in the dataset.
  • Although possibly slightly suboptimal, the default values of nmin = 2 (classification) and nmin = 5 (regression) appear to be robust choices over a broad range of typical conditions. (The tree-size effect is checked below.)
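A quick empirical check of the nmin/tree-size relationship, again using scikit-learn, where min_samples_split plays the role of nmin:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import ExtraTreesClassifier

    X, y = load_breast_cancer(return_X_y=True)

    for nmin in (2, 10, 50):
        model = ExtraTreesClassifier(n_estimators=10, min_samples_split=nmin,
                                     random_state=0).fit(X, y)
        # Average node count across the ten trees: larger nmin -> smaller trees.
        avg_nodes = sum(t.tree_.node_count for t in model.estimators_) / 10
        print(f"nmin={nmin:2d}  average nodes per tree: {avg_nodes:.0f}")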

Averaging strength M

  • The parameter M denotes the number of trees in the ensemble.
  • It is well known that for randomization methods the prediction error is a monotonically decreasing function of M (see, e.g., Breiman, 2001), so in principle, the higher the value of M, the better from the accuracy point of view.
  • The choice of an appropriate value thus essentially depends on the resulting compromise between computational requirements and accuracy.
  • Default value: M = 100. (A quick illustration of the trend follows.)
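To see the trend on a toy dataset (an illustrative run, not a benchmark):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

    for M in (1, 5, 25, 100):
        model = ExtraTreesClassifier(n_estimators=M, random_state=0).fit(Xtr, ytr)
        # Test error generally shrinks (monotonically in expectation) as M grows.
        print(f"M={M:3d}  test error: {1 - model.score(Xte, yte):.3f}")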

Feixiang
