What's the idea of Decision Tree Regression?

The basic intuition behind a decision tree is to map out all possible decision paths in the form of a tree. It can be used for classification (Decision Tree Classifier) and for regression. In this post, let's try to understand the regression case.

DT Regression is similar to the Decision Tree Classifier; however, we use Mean Squared Error (MSE, the default) or Mean Absolute Error (MAE) instead of cross-entropy or Gini impurity to determine splits.

$$ \begin{aligned} \text{MSE} &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2, \\ \text{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \vert y_i - \bar{y} \vert. \end{aligned} $$
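As a quick sanity check, both criteria can be computed directly with NumPy; the target values below are a toy example, not data from the post:

```python
import numpy as np

# Toy target values for one node; a regression tree predicts the node mean.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_bar = y.mean()  # 2.5

mse = np.mean((y - y_bar) ** 2)   # squared deviations from the node mean
mae = np.mean(np.abs(y - y_bar))  # absolute deviations from the node mean

print(mse, mae)  # 1.25 1.0
```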

Suppose that we have a dataset $S$ like in the figure below,

An example of dataset $S$.

A decision tree we want.

Some basic concepts

<aside> ☝ For other aspects of the decision tree algorithm, check this note.

</aside>

<aside> ☝ Looking for an example? Read this file.

</aside>

Below is a short algorithm:

  1. Calculate the Standard Deviation ($SD$) of the current node (let's say $S$, the parent node) using MSE or MAE,

    $$ \begin{aligned}SD(S) &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2, \\\text{or } SD(S) &= \frac{1}{n}\sum_{i=1}^{n} \vert y_i - \bar{y} \vert,\end{aligned} $$

    where the $y_i$ are the target values (Hours Played in the example above), $\bar{y}=\frac{\Sigma y}{n}$ is their mean, and $n$ is the number of examples in this node.

  2. Check the stopping conditions; if one is met, we stop splitting and this node becomes a leaf node. Otherwise, go to step 3.

  3. Calculate the Standard Deviation Reduction (SDR) obtained by splitting node $S$ on each attribute (for example, attribute $O$). The attribute with the biggest SDR is chosen!

    $$ \underbrace{SDR(S,O)}_{\text{Standard Deviation Reduction}} = \underbrace{SD(S)}_{\text{SD before split}} - \underbrace{\sum_j P(O_j \vert S) \times SD(S,O_j)}_{\text{weighted SD after split}} $$

    where $j$ ranges over the different properties (values) of $O$ and $P(O_j \vert S)$ is the probability (relative frequency) of property $O_j$ in $S$. Note that $SD(S,O_j)$ means the SD of the child node of $S$ that corresponds to $O_j$.

  4. After splitting, we have new child nodes. Each of them becomes a new parent node in the next step. Go back to step 1.
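The steps above can be sketched in plain NumPy. This is a toy illustration for a single categorical attribute; the helper names `sd` and `sdr` and the mini-dataset are assumptions, not the author's code:

```python
import numpy as np

def sd(y):
    """Impurity of a node: here the MSE form, i.e. mean squared deviation from the node mean."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

def sdr(y, attribute):
    """Standard Deviation Reduction of splitting a node on a categorical attribute."""
    y = np.asarray(y, dtype=float)
    attribute = np.asarray(attribute)
    n = len(y)
    weighted = 0.0
    for value in np.unique(attribute):              # each property O_j of O
        mask = attribute == value
        # P(O_j | S) * SD(S, O_j): weight each child's SD by its relative size
        weighted += (mask.sum() / n) * sd(y[mask])
    return sd(y) - weighted                          # SD before split - weighted SD after

# Hypothetical mini-dataset: attribute O with two properties, target y.
O = np.array(["a", "a", "b", "b"])
y = np.array([1.0, 2.0, 10.0, 11.0])
print(sdr(y, O))  # large reduction: the split separates the two groups well
```

The attribute with the largest `sdr` value would be picked as the split at this node, and the procedure recurses on each child.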

Using Decision Tree Regression with Scikit-learn

Load and create