What's the idea of Random Forest?

A random forest consists of a (large) number of decision trees operating together as an ensemble (ensemble learning). Each tree makes its own prediction, and the class with the most votes across the trees becomes the forest's final prediction. Because the individual trees are relatively uncorrelated, their errors tend to cancel out: the ensemble protects itself from the mistakes of any single tree.

An illustration of the random forest's idea.
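As a minimal sketch of the voting step (the per-tree predictions below are made up for illustration), the forest simply takes the majority class among its trees:

from collections import Counter

# Hypothetical predictions from 7 individual trees for one sample
tree_votes = ["cat", "dog", "cat", "cat", "dog", "cat", "cat"]

# The forest predicts the majority class among the trees' votes
prediction = Counter(tree_votes).most_common(1)[0][0]
print(prediction)  # cat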

How are the (decision) trees chosen? RF ensures that the chosen trees are not too correlated with each other:

  1. Bagging: each tree is trained on a bootstrap sample, i.e. N examples drawn with replacement from the original training set of size N. For example, if our training data is [1, 2, 3, 4, 5] (size 5), one tree might be given the list [1, 2, 2, 5, 5].
  2. Feature randomness: instead of considering every feature, each tree only looks at a random subset of the features when splitting, so different trees base their decisions on different features (see the sketch after this list).
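A minimal NumPy sketch of both steps (the toy data and subset sizes are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
data = np.array([1, 2, 3, 4, 5])  # toy training set of size N = 5
n_features = 4                    # toy number of features

# 1. Bagging: draw N examples with replacement for one tree
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)  # e.g. [1 2 2 5 5]

# 2. Feature randomness: a split only considers a random subset of features
feature_subset = rng.choice(n_features, size=2, replace=False)
print(feature_subset)  # e.g. indices of 2 of the 4 features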

So in our random forest, we end up with trees that are not only trained on different sets of data (thanks to bagging) but also use different features to make decisions. (ref)

For each tree, we can use a decision tree classifier or a decision tree regressor depending on the type of our problem (classification or regression).
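In scikit-learn, these per-tree models are available directly (the random forest classes build them internally):

from sklearn.tree import DecisionTreeClassifier  # base learner for classification
from sklearn.tree import DecisionTreeRegressor   # base learner for regression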

When do we use Random Forest?

Using RF with Scikit-learn

Random forest classifier

Load the library,

from sklearn.ensemble import RandomForestClassifier
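A minimal end-to-end sketch (the Iris dataset and the hyperparameter values here are just for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data: 150 iris flowers, 4 features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 trees, each split considering sqrt(n_features) features
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out set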