> - One way to capture those interactions is either to **multiply** the features, or to use algorithms (like trees) that can handle non-linearity.

- Linear regression is parametric, i.e. we don't need to keep the training data after we obtain the model parameters (**w** (weights), **b** (intercept)).

- Parametric and non-parametric models differ in whether the model has a fixed set of parameters or needs the training data each time to make predictions.

- Trees are considered non-parametric models.

- Trees partition the feature space horizontally or vertically, thus creating axis-parallel hyper-rectangles, whereas linear models can divide the space in any direction or orientation.

- This is a direct consequence of the fact that trees select a single feature at a time, whereas linear models use a weighted combination of all features.

- In principle, a tree can cut up the instance space arbitrarily finely into very small regions.
- A linear model places only a single decision surface through the entire space. It has freedom in the orientation of the surface, but it is limited to a single division into two segments.

- This is a direct consequence of there being a single (linear) equation that uses all of the variables and must fit the entire data space.

#### A hyperplane created by a linear model

A hyperplane created by a linear model divides the feature space into two.

![regtree1](/images/regtree1.png)

#### Trees cut the space horizontally and vertically

![regtree2](/images/regtree2.png)

> - A 3-dimensional (3-feature) tree:
> - The first split (by the red vertical plane) cuts the space into 2 regions; then
> - each of the two regions is split again (by the green horizontal planes) into 2 regions.
> - Finally, those 4 regions are each split (by the 4 blue vertical planes) into 2 regions.
> - Since there is no more splitting, the whole space ends up divided into 8 regions.

## Advantages of Decision Trees:

- Simple to interpret.
- Trees can be visualised.
- Require little data preparation. Other techniques often require
  - data normalization,
  - dummy variables to be created.
- The cost of using the tree (i.e., predicting data) is **logarithmic** in the number of data points used to train the tree.
- Able to handle both numerical and categorical data (sklearn only supports numerical data at the moment).
- They can find non-linear relationships between features and target variables, as well as interactions between features.
- **Quadratic, exponential, cyclical**, and other relationships can all be revealed, as long as we have enough data to support all of the necessary cuts.
- Decision trees can also find **non-smooth behaviours, sudden jumps**, and **peaks** that other models like linear regression or artificial neural networks can hide.
- Decision trees can easily identify the most important (predictive) features and the relations between two or more features.
- With the help of decision trees, we can create new features that have better power to predict the target variable.

## Disadvantages of Decision Trees:

- Decision-tree learners can create **over-complex trees** that do not generalise the data well. This is called overfitting.
- Setting constraints such as
  - the minimum number of samples required at a leaf node, or
  - the maximum depth of the tree, and
  - pruning (not currently supported by sklearn)

  is necessary to avoid this problem.

- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an **ensemble**.
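A minimal sketch of what such constraints look like in scikit-learn — the dataset and the particular parameter values below are arbitrary choices for illustration, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# An unconstrained tree keeps splitting until every leaf is pure -> overfitting.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Constraining depth and leaf size limits the tree's complexity.
shallow_tree = DecisionTreeRegressor(
    max_depth=3,           # maximum depth of the tree
    min_samples_leaf=10,   # minimum number of samples required at a leaf node
    random_state=0,
).fit(X, y)

print(deep_tree.get_depth(), shallow_tree.get_depth())
```

The unconstrained tree grows much deeper than the constrained one on the same data.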


- Decision-tree learners create biased trees if some classes dominate. It is therefore recommended to **balance the dataset** prior to fitting the decision tree.
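As an alternative to resampling, scikit-learn's tree classifiers also accept a `class_weight` option that reweights classes inversely to their frequency; a small sketch on invented, imbalanced toy data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# Imbalanced toy data: only a small fraction of samples belong to class 1.
X = rng.randn(400, 3)
y = (X[:, 0] + 0.3 * rng.randn(400) > 1.6).astype(int)

# "balanced" weights each class inversely proportional to its frequency,
# so splits are not dominated by the majority class.
clf = DecisionTreeClassifier(class_weight="balanced", max_depth=3, random_state=0)
clf.fit(X, y)
print(np.bincount(y))
```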


- Practical decision-tree learning algorithms are based on heuristics such as the **greedy algorithm**, where **locally optimal decisions** are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree.
- This can be mitigated by training multiple trees in an ensemble learner, where the **features and samples are randomly sampled with replacement**.
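A sketch of such an ensemble in scikit-learn — a random forest, where each tree sees a bootstrap sample (drawn with replacement) and a random subset of features at each split; the data and parameter values are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=50,       # number of trees in the ensemble
    max_features="sqrt",   # random feature subset considered at each split
    bootstrap=True,        # each tree trains on samples drawn with replacement
    random_state=0,
).fit(X, y)

print(len(forest.estimators_))
```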


- Tree models can only interpolate but not extrapolate (the same is true for random forests and tree boosting). That is, a decision tree makes a **constant prediction** for objects that **lie beyond the bounding box** set by the training set in the feature space.

- In principle, you can approximate anything with decision trees, but in practice you can't approximate very well. Still, decision trees are very efficient, because in practice feature spaces are high-dimensional, that is, the data is very sparse.

## How Does a Regression Tree Split the Feature Space?

- Trees start testing features from the ones that potentially most quickly lead to a conclusion, i.e. the most predictive or important ones.
- In each node, trees test a feature.
- The order of the features matters.
- At each step, for each feature, the loss function (accuracy in the case of classification) on the target variable is calculated, and the feature that provides the least loss is selected as the split feature.
- If the features are continuous, like temperature in our example, internal nodes may test the value against a **threshold**.
- Every time the tree asks `Feature1 >= something` or `Feature2 >= something`, the result is a horizontal or vertical split of the feature space.
- Decision trees are nested **if-then** statements.
- They predict the value of a target variable by learning simple decision rules inferred from the features.
- For instance, in our dataset, decision trees learn from the data a set of **if-then-else** decision rules like: if the `"hour"` is later than `5h` and earlier than `6h`, else if the `temperature` is higher than `15C`, then...
- The deeper the tree, the more complex the decision rules and the fitter the model.

## Steps of the Splitting Process

Let's denote our features (like `temperature, humidity` etc.) as $X_1, X_2, \ldots, X_p$.

The split process has 2 steps:

1. We divide the feature space (the set of possible values for the features) into **J** distinct and non-overlapping regions: $R_1, R_2, \ldots, R_J$.

2. When we make a prediction for a sample, we check in which region (subset) our sample falls.
   - Say it is the region $R_j$.
   - Then the **mean** of the target values of the training samples in $R_j$ will be the predicted value for the given sample.

For instance, suppose that
- in Step 1 we obtain 2 regions, $R_1$ and $R_2$, and
- the target mean of the training samples in the first region is $10$, while
- the target mean of the training samples in the second region is $20$.
- Then, if a new sample falls in $R_1$, we will predict a value of $10$ for the target value of that sample, and if it falls in $R_2$ we will predict a value of $20$.

- For regression trees, the feature importance is measured by how much each feature reduces the variance when it splits the data.

### How do we construct the regions $R_1, \ldots, R_J$?

We divide the space into high-dimensional rectangles, or boxes. The goal is to find boxes $R_1, \ldots, R_J$ that minimize the
- **Mean Squared Error**, which minimizes the **L2 error** using the **mean** values of the regions, or
- **Mean Absolute Error**, which minimizes the **L1 error** using the **median** values of the regions.

`Mean Squared Error:`

![mse](/images/mse.png)

where $\hat{Y}_{R_j}$ is the target mean for the training observations within the $j$-th box and $Y_i$ is the actual target value of the test observation.

**Note:**
- As seen above, a regression tree's split depends on the mean of the targets; since the mean is severely affected by outliers, the regression tree will suffer from outliers.
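The mean-per-region prediction from Step 2 can be sketched with a depth-1 tree (a stump) on toy data chosen so the best first split is obvious:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tiny 1-feature dataset: two clearly separated groups of targets.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([10.0, 10.0, 10.0, 20.0, 20.0, 20.0])

# A stump creates exactly two regions.
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)

# Each new sample is predicted by the mean of the training
# targets in the region it falls into.
print(stump.predict([[2.5], [11.5]]))  # → [10. 20.]
```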

- There are two main approaches to solve this problem:
  - either removing the outliers, or
  - building the decision-tree algorithm so that it makes splits based on the median instead of the mean, as the median is not affected by outliers.

### Recursive Binary Splitting

For computational feasibility, we take a top-down, greedy approach known as **recursive binary splitting**.

The recursive binary splitting approach is top-down because
- it begins at the top of the tree (at which point **all observations belong to a single region**) and then
- successively splits the feature space; each split is indicated via 2 new branches further down on the tree.

It is greedy because
- at each step of the tree-building process, the best split is made at that particular step, rather than
- looking ahead and picking a split that will lead to a better tree in some future step.

In order to perform recursive binary splitting,
- we first select the feature $X_j$ ($j$-th feature) and the cutpoint $s$ such that splitting the feature space into the two regions where $X_j < s$ and $X_j \ge s$ leads to the greatest possible reduction in Mean Squared Error.

That is,
- we consider **all features** $X_1, \ldots, X_p$ and **all possible values of the cutpoint** $s$ for each of the features, and then
- choose the feature and cutpoint such that the resulting tree has the lowest mean squared error.

In greater detail,
- for any $j$ (index of the features) and $s$ (cutpoint), we define
- a half-plane $R_1(j, s)$ where $X_j < s$ and a half-plane $R_2(j, s)$ where $X_j \ge s$, and we seek the values of $j$ and $s$ that minimize the total squared error over the two resulting regions.
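That greedy search over all features and cutpoints can be sketched in a few lines. This is an illustrative brute-force scan (the function name `best_split` and the midpoint candidates are my choices, not scikit-learn's actual implementation):

```python
import numpy as np

def best_split(X, y):
    """Scan every feature j and candidate cutpoint s; return the (j, s)
    whose two half-planes give the lowest total squared error."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate cutpoints: midpoints between consecutive feature values.
        for s in (values[:-1] + values[1:]) / 2.0:
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            # Squared error of predicting each region by its mean.
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

# Toy data: the best cutpoint cleanly separates the two target groups.
X = np.array([[1.0], [2.0], [8.0], [9.0]])
y = np.array([5.0, 5.0, 15.0, 15.0])
j, s, sse = best_split(X, y)
print(j, s, sse)  # → 0 5.0 0.0
```

A real tree builder applies this search recursively inside each resulting region until a stopping criterion is met.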

### References

- https://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf#page=31
- https://scikit-learn.org/stable/modules/tree.html#classification
- https://classroom.udacity.com/courses/ud501/lessons/4802710867/concepts/47973133930923
- https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/