What's the idea of Decision Tree Regression?

The basic intuition behind a decision tree is to map out all possible decision paths in the form of a tree. It can be used for classification (Decision Tree Classifier) and for regression. In this post, let's try to understand the regression case.

DT Regression is similar to the Decision Tree Classifier; however, we use Mean Squared Error (MSE, the default) or Mean Absolute Error (MAE) instead of cross-entropy or Gini impurity to determine splits.

$$ \begin{aligned} \text{MSE} &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2, \\ \text{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \vert y_i - \bar{y} \vert. \end{aligned} $$
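As a quick sanity check, both criteria can be computed directly with NumPy; the target values below are a toy example, not data from the post:

```python
import numpy as np

# Toy target values for one node; a regression tree predicts the node mean.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_bar = y.mean()  # 2.5

mse = np.mean((y - y_bar) ** 2)   # squared deviations from the node mean
mae = np.mean(np.abs(y - y_bar))  # absolute deviations from the node mean

print(mse, mae)  # 1.25 1.0
```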

Suppose that we have a dataset $S$ like in the figure below,

An example of dataset $S$.

A decision tree we want.

Some basic concepts

<aside> ☝ For other aspects of the decision tree algorithm, check this note.

</aside>

<aside> ☝ Looking for an example? Read this file.

</aside>

Below is a short algorithm:

  1. Calculate the Standard Deviation ($SD$) of the current node (let's say $S$, the parent node) using MSE or MAE,

    $$ \begin{aligned}SD(S) &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2, \\\text{or } SD(S) &= \frac{1}{n}\sum_{i=1}^{n} \vert y_i - \bar{y} \vert,\end{aligned} $$

    where the $y_i$ are the target values (Hours Played in the example above), $\bar{y}=\frac{\Sigma y}{n}$ is their mean, and $n$ is the number of examples in this node.

  2. Check the stopping conditions; if one is met, we stop splitting and this node becomes a leaf node. Otherwise, go to step 3.

  3. Calculate the Standard Deviation Reduction (SDR) obtained by splitting node $S$ on each attribute (for example, attribute $O$). The attribute with the biggest SDR is chosen!

    $$ \underbrace{SDR(S,O)}_{\text{Standard Deviation Reduction}} = \underbrace{SD(S)}_{\text{SD before split}} - \underbrace{\sum_j P(O_j \vert S) \times SD(S,O_j)}_{\text{weighted SD after split}} $$

    where $j$ ranges over the different properties (values) of $O$ and $P(O_j \vert S)$ is the probability (relative frequency) of property $O_j$ in $S$. Note that $SD(S,O_j)$ means the SD of the child node of $S$ that corresponds to $O_j$.

  4. After splitting, we have new child nodes. Each of them becomes a new parent node in the next step. Go back to step 1.
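The steps above can be sketched in plain NumPy. This is a toy illustration for a single categorical attribute; the helper names `sd` and `sdr` and the mini-dataset are assumptions, not the author's code:

```python
import numpy as np

def sd(y):
    """Impurity of a node: here the MSE form, i.e. mean squared deviation from the node mean."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

def sdr(y, attribute):
    """Standard Deviation Reduction of splitting a node on a categorical attribute."""
    y = np.asarray(y, dtype=float)
    attribute = np.asarray(attribute)
    n = len(y)
    weighted = 0.0
    for value in np.unique(attribute):              # each property O_j of O
        mask = attribute == value
        # P(O_j | S) * SD(S, O_j): weight each child's SD by its relative size
        weighted += (mask.sum() / n) * sd(y[mask])
    return sd(y) - weighted                          # SD before split - weighted SD after

# Hypothetical mini-dataset: attribute O with two properties, target y.
O = np.array(["a", "a", "b", "b"])
y = np.array([1.0, 2.0, 10.0, 11.0])
print(sdr(y, O))  # large reduction: the split separates the two groups well
```

The attribute with the largest `sdr` value would be picked as the split at this node, and the procedure recurses on each child.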

Using Decision Tree Regression with Scikit-learn

Load and create