"
],
"text/plain": [
" holiday workingday atemp humidity windspeed count \\\n",
"datetime \n",
"2011-01-01 00:00:00 0 0 14.395 81 0.0 16 \n",
"2011-01-01 01:00:00 0 0 13.635 80 0.0 40 \n",
"\n",
" month weekday hour \n",
"datetime \n",
"2011-01-01 00:00:00 1 5 0 \n",
"2011-01-01 01:00:00 1 5 1 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Filter out the casual and registered columns \n",
"## Filter out the high correlated columns \"season\" (correlated with \"month\") and \"temp\" (correlated with \"atemp\") \n",
"## Also in the previous posts we noticed that the data record in the \"weather\" column is not relaible\n",
"## We will also drop it\n",
"bike_data= bike_data.drop([\"casual\", \"registered\", \"temp\", \"season\", \"weather\"], axis=1)\n",
"bike_data.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Split: Training and Hold-out Datasets\n",
"\n",
"- Even though there is a given test set (without target variable) for Kaggle submission we will analyse the data as an independent project and follow our data splitting principles for model evaluation\n",
"\n",
"- At the end we will use the Kaggle test set for submission\n",
"\n",
"- So after instantiating a model we will use **cross validation** for evaluating the model performance and hyperparameter tuning but we still need to keep an **hold-out set** for our final evaluation. \n",
"\n",
"- Since this is a timeseries dataset we must respect to the temporal order of the data. Thus, we must use only the past data to predict the future data. \n",
"\n",
"\n",
"- So we can take the **last %5** as our hold-out data. (In the Kaggle competition the test set is the last 10 days of the months)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The shape of the hold-out dataset: (522, 10)\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
datetime
\n",
"
holiday
\n",
"
workingday
\n",
"
atemp
\n",
"
humidity
\n",
"
windspeed
\n",
"
count
\n",
"
month
\n",
"
weekday
\n",
"
hour
\n",
"
\n",
" \n",
" \n",
"
\n",
"
9914
\n",
"
2012-11-16 06:00:00
\n",
"
0
\n",
"
1
\n",
"
15.91
\n",
"
61
\n",
"
6.0032
\n",
"
130
\n",
"
11
\n",
"
4
\n",
"
6
\n",
"
\n",
"
\n",
"
9915
\n",
"
2012-11-16 07:00:00
\n",
"
0
\n",
"
1
\n",
"
15.15
\n",
"
61
\n",
"
11.0014
\n",
"
367
\n",
"
11
\n",
"
4
\n",
"
7
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" datetime holiday workingday atemp humidity windspeed \\\n",
"9914 2012-11-16 06:00:00 0 1 15.91 61 6.0032 \n",
"9915 2012-11-16 07:00:00 0 1 15.15 61 11.0014 \n",
"\n",
" count month weekday hour \n",
"9914 130 11 4 6 \n",
"9915 367 11 4 7 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Find the starting indice of the last five percent\n",
"last_five_percent_ind= int(len(bike_data)* 0.95)\n",
"last_five_percent_ind\n",
"\n",
"# Create the hold-out dataset\n",
"hold_out_df=bike_data.reset_index().iloc[last_five_percent_ind: ,:]\n",
"\n",
"print(\"The shape of the hold-out dataset:\", hold_out_df.shape)\n",
"hold_out_df.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will do training and cross validation on the rest of the data. Lets create it"
]
},
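  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Training data: everything before the hold-out split point (the first 95%)\n",
    "train_df = bike_data.reset_index().iloc[:last_five_percent_ind, :]\n",
    "\n",
    "# Training target\n",
    "y = train_df[\"count\"]\n",
    "\n",
    "# Training features\n",
    "X = train_df.drop([\"datetime\", \"count\"], axis=1)\n",
    "\n",
    "print(\"The shape of the training dataset:\", X.shape)"
   ]
  },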
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" holiday workingday atemp humidity windspeed month weekday hour\n",
"9914 0 1 15.91 61 6.0032 11 4 6\n",
"9915 0 1 15.15 61 11.0014 11 4 7"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create the features and target datasets from hold_out data\n",
"# Hold-out target\n",
"y_hold=hold_out_df[\"count\"]\n",
"\n",
"# Hold-out features data\n",
"X_hold=hold_out_df.drop([\"datetime\", \"count\"], axis=1)\n",
"X_hold.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## RMSLE Calculator Function\n",
"\n",
"In this Kaggle competion we seek to identify the models that result in predictions which minimize the Root Mean Squared Logaritmic Error (RMSLE). In the earlier post we talk about this metric in detail"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"# Define the RMSLE function for error calculation: rmsle_calculator\n",
"# Using the vectorized numpy functions instead of loops always better for computation\n",
"def rmsle_calculator(predicted, actual):\n",
" assert len(predicted) == len(actual)\n",
" return np.sqrt(\n",
" np.mean(\n",
" np.power(np.log1p(predicted)-np.log1p(actual), 2)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Custom Scoring Function\n",
"- We should define a scoring function in order to use as a scoring parameter for `model_selection.cross_val_score`\n",
"\n",
"- We need this parameter for model-evaluation tools which rely on a scoring strategy when using cross-validation internally (such as `model_selection.cross_val_score` and `model_selection.GridSearchCV`)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"# Make a custom scorer \n",
"# rmsle_error will negate the return value of rmsle_calculator,\n",
"rmsle_error = make_scorer(rmsle_calculator, greater_is_better=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tips For Implementing Decision Trees\n",
"Before starting implemetation of Random Forest algorithm it would be nice to remind some practical tips about decision trees from Sklearn page. Here are some basic tips:\n",
"\n",
"- Decision trees tend to overfit on data with a large number of features.\n",
"- Getting the right ratio of **samples** to **number of features** is important, since a tree with few samples in high dimensional space is very likely to overfit.\n",
"- Consider performing dimensionality reduction (**PCA, ICA**, or **Feature selection**) beforehand to give your tree a better chance of finding features that are discriminative.\n",
"- Remember that the number of samples required to populate the tree **doubles for each additional level** the tree grows to. \n",
"- Use `max_depth` to control the size of the tree to prevent overfitting.\n",
"- Use `max_depth=3` as an initial tree depth to get a feel for how the tree is fitting to your data, and then increase the depth.\n",
"\n",
"- Use `min_samples_split` or `min_samples_leaf` to ensure that multiple samples inform every decision in the tree, by controlling which splits will be considered. \n",
" - `min_samples_split` can create arbitrarily small leaves,\n",
" - `min_samples_leaf` guarantees that each leaf has a minimum size, avoiding low-variance, over-fit leaf nodes\n",
" - A very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from learning the data. \n",
" - Try `min_samples_leaf=5` as an initial value. \n",
" - If the sample size varies greatly, a **float number** can be used as percentage in these two parameters. \n",
" - For classification **with few classes**, `min_samples_leaf=1` is often the best choice.\n",
"\n",
"\n",
"- Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. \n",
" - Class balancing can be done by sampling an equal number of samples from each class, or \n",
" - preferably by normalizing the sum of the sample weights (`sample_weight`) for each class to the same value. \n",
" - Also note that weight-based pre-pruning criteria, such as `min_weight_fraction_leaf`, will then be less biased toward dominant classes than criteria that are not aware of the sample weights, like `min_samples_leaf`.\n",
" \n",
" \n",
" \n",
"- If the samples are weighted, it will be easier to optimize the tree structure using weight-based pre-pruning criterion such as `min_weight_fraction_leaf`, which ensure that leaf nodes contain at least a fraction of the overall sum of the sample weights.\n",
" \n",
"- If the input matrix X (features) is **very sparse**, it is recommended to convert to sparse `csc_matrix` before calling fit and sparse `csr_matrix` before calling predict. \n",
"\n",
"- Training time can be orders of magnitude faster for a sparse matrix input compared to a dense matrix when features have zero values in most of the samples."
]
},
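  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustrative sketch of these tips (starting values only, not a tuned model; `tree_sketch` is just a throwaway name and is not used later):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative starting values from the tips above: a shallow tree with a minimum leaf size\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "tree_sketch = DecisionTreeRegressor(max_depth=3,         # start shallow to get a feel for the fit\n",
    "                                    min_samples_leaf=5,  # every leaf must hold at least 5 samples\n",
    "                                    random_state=1)\n",
    "# tree_sketch.fit(X, y)  # uncomment to fit it on the training features and target"
   ]
  },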
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Forest of Randomized Trees\n",
"\n",
"- In Sklearn, bagging algorithms takes a **user-specified base estimator** along with parameters specifying the **strategy to draw random subsets.** \n",
"\n",
"- In `RandomForestRegressor` the base estimators are decision trees\n",
"\n",
"- In random forests, each tree in the ensemble is built from a sample **drawn with replacement** (i.e., a bootstrap sample) from the training set. \n",
"\n",
"- When splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. \n",
"- Instead, the split that is picked is the best split among a **random subset of the features**. \n",
" \n",
"- As a result of this randomness, the **bias** of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its **variance** also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.\n",
"\n",
"\n",
"- Like in the other bagging algorithms in `RandomForestRegressor`\n",
" - `max_samples` and `max_features` control the size of the subsets (in terms of samples and features), \n",
" - `bootstrap` and `bootstrap_features` control whether samples and features are drawn with or without replacement. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parameters of Random Forest\n",
"\n",
"- The main parameters to adjust of random forest is `n_estimators` and `max_features`. \n",
"\n",
"- The `n_estimator` is the number of trees in the forest.\n",
" - The larger the better, but also the longer it will take to compute. \n",
" - In addition, note that results will stop getting significantly better beyond a critical number of trees. \n",
"\n",
"\n",
"\n",
"- `max_features` is the size of the **random subsets of features** to consider when splitting a node. \n",
" - The lower the greater the reduction of variance, but also the greater the increase in bias.\n",
"\n",
"\n",
"- Empirical good default values are\n",
" - `max_features=n_features/3` for regression problems, and \n",
" - `max_features=sqrt(n_features)` for classification tasks (where n_features is the number of features in the data). \n",
"\n",
"\n",
"- Good results are often achieved when setting `max_depth=None` in combination with `min_samples_split=2` (i.e., when fully developing the trees). \n",
"\n",
"- Bear in mind though that these values are usually not optimal, and might result in models that consume a lot of RAM. \n",
"- The best parameter values should always be **cross-validated**. \n",
"\n",
"- In addition, note that in random forests, bootstrap samples are used by default (`bootstrap=True`) \n",
"\n",
"### Note\n",
"\n",
"- The size of the model with the default parameters is $O(M*N*log(N))$ , where $M$ is the number of trees and $N$ is the number of samples. \n",
"- In order to reduce the size of the model, you can change these parameters: \n",
" - `min_samples_split`, \n",
" - `max_leaf_nodes`,\n",
" - `max_depth` and \n",
" - `min_samples_leaf`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Random Forest Model\n",
"After reminding the practical tips we can instantiate our Random Forest model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Instantiate a Random Forest object with parameters\n",
"random_forest=RandomForestRegressor(n_estimators=300,\n",
" max_depth=6,\n",
" max_features=6, \n",
" min_samples_leaf=8,\n",
" random_state=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cross Validation Scores\n",
"\n",
"Let's define a function to utilize the `cross_val_score` function from `sklearn.metrics` to get the scores of each split. We will calculate the scores with the metrics:\n",
"- **R_squared** \n",
"- **RMSE**\n",
"- **MAE**, and\n",
"- **RMSLE** (Root Mean Squared Logarithmic Error, Kaggle's metric for this dataset. We will mainly take into account this metric)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"# Define a function for calculating the cross validation scores with different metrics\n",
"def scores(X, y, split, metric_lst, estimator=random_forest):\n",
" '''Takes features and target sets, a list of metrics and an estimator -> \n",
" returns the scores of the metrics in the list '''\n",
" # Fit and score the model with cross-validation\n",
" for metric_desc, metric_name in metric_lst:\n",
" score= cross_val_score(estimator, X, y, cv=split, scoring=metric_name)\n",
" if metric_name==\"neg_mean_squared_error\":\n",
" print(f\"RMSE values:{np.sqrt(-score)}\", \"\\n\")\n",
" \n",
" elif metric_name==\"r2\":\n",
" print(f\"{metric_desc} values:{score}, \"\\n\"\")\n",
" \n",
" else:\n",
" print(f\"{metric_desc} values:{-score}\")"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAE values:[72.19103188 46.93845747 67.0675416 74.70634337 68.92369856]\n",
"\n",
"RMSE values:[104.52779911 71.88854555 105.68288502 105.67601192 97.75787222]\n",
"\n",
"R^2 values:[0.48259751 0.73514714 0.29747846 0.60058387 0.64609849]\n",
"\n",
"MSLE values:[0.61212597 0.52121142 0.70357754 0.50449061 0.46967002]\n",
"\n",
"MSLE values:[0.37469821 0.27166134 0.49502135 0.25451078 0.22058993]\n",
"\n"
]
}
],
"source": [
"# Create a metric list\n",
"metrics_lst=[(\"MAE\", \"neg_mean_absolute_error\"), \n",
" (\"MSE\", \"neg_mean_squared_error\"), \n",
" (\"R^2\", \"r2\"),\n",
" (\"MSLE\", rmsle_error), # Our custom defined RMSLE \n",
" (\"MSLE\", \"neg_mean_squared_log_error\")]\n",
"\n",
"# Split the timeseries data with TimeSeriesSplit\n",
"time_split = TimeSeriesSplit(n_splits=5)\n",
"\n",
"scores(X, y, time_split, metrics_lst, estimator=random_forest)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.45870403895429723"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Validate with the hold-out data\n",
"# Fit the data to random_forest\n",
"random_forest.fit(X, y)\n",
"\n",
"# Predict the hold-out test set\n",
"pred=random_forest.predict(X_hold)\n",
"\n",
"# Score the predictions with rmsle_calculator\n",
"rmsle_calculator(y_hold , pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualizing Features Importances\n",
"- Tree-based methods enable measuring the importance (predictivity) of each feature in prediction\n",
"- It is calculated by regarding how much the tree nodes use a particular feature to split the data and reduce the variance\n",
"- Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples.\n",
"\n",
"- In Sklearn we can retreive the feature importance by using the attribute `feature_importance_`"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAaQAAAEBCAYAAAA3ndFoAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3XtcVXWi///X5iJJaJJuTcJvTuhB66hhR8UriqYoAgIp1RlKZ7Ss03Q8mYXpmJdsZiTKS9MQjXNqHBNNUEmPih4zndHUo6RO6bE5XklT3HhBUHCz1+8Pf+4iLyC3vZa+n48Hjwd7r7U+6732Y8vb9dlr720zDMNARETEw7w8HUBERARUSCIiYhIqJBERMQUVkoiImIIKSURETEGFJCIipqBCEhERU1AhiYiIKaiQRETEFFRIIiJiCiokERExBRWSiIiYggpJRERMwcfTAazizJliXC5rfDB606YBOBwXPB3jllgts9XygvUyWy0vKPNVXl42AgPvvuXtVEhV5HIZlikkwFJZr7JaZqvlBetltlpeUOaa0JSdiIiYggpJRERMQYUkIiKmoEISERFTsBmGYY5Xs0RExDRKSlwUFxdXa1svLxtNmwbc8na6yq6KWreGI0c8nUJEpH4YhhfV7KNq05SdiIiYgmUKadu2bSQnJ3s6hoiI1BHLFJKIiNzeLFVIhYWFjBkzhkGDBjF27FjKysrIyspi6NChxMTEkJKS4n4RLjQ01L1ddnY2KSkpAERGRjJu3DgGDRqEw+HwyHGIiMi1LHVRw/Hjx0lPT+f+++9nxIgRLFq0iL/85S8sWbKEwMBApk2bxnvvvcdrr71203H69OnD7Nmz6ym1iIg12e2N6nV/liqkdu3a0apVKwBCQkIoKiqiX79+BAYGApCUlMTEiRMrHadTp051mlNE5HZQUFBUre2qe9m3pabsfHx+6E+bzUbjxo0rLDcMA6fTWeE2UOE+AD8/vzpMKSIi1WGpQrqeDRs2cPbsWQCWLFlCt27dAAgMDOTbb7/FMAw2bNjgyYgiIlIFlpqy+6mAgACee+45kpOTuXz5Mg8//DDTpk0DYPz48YwdO5ZmzZrx6KOPcubMGQ+nFRGRm9FHB1WRPqlBRO4khqHXkERE5A5l6Sm7+nT4sKcTiIjUn5ISV73vU4VURQ7HBdN8zW9l7PZG1T7V9hSrZbZaXrBeZqvlBWWuKU3ZiYiIKaiQRETEFFRIIiJiCiokERExBRWSiIiYggpJRERMQYUkIiKmoEISERFTUCGJiIgpqJBERMQUVEgiImIKKiQRETEFfbhqFVXnuz08yW5vVK/7KylxUVxcXK/7FJHbiwqpivQFfTdnGF6oj0SkJjRlJyIipmDqQkpOTvZ0BBERqSemLqTt27d7OoKIiNQTU7yG5HQ6mTp1Kt9++y2nT58mNDSUe++9F4Dhw4fz6aefsmnTJubOnYvT6SQ4OJgZM2YQGBhIZGQk0dHR/O1vf8PHx4cXXniBP/3pTxw5coTXXnuNIUOGkJKSgp+fH3v37qW4uJjnn3+eYcOGefioRUTkx0xxhpSXl4evry+LFy9m3bp1FBUV0bt3bwA+/fRTCgsLSUtLY/78+SxfvpxevXrx9ttvu7dv1qwZ2dnZhISEkJGRwZ/+9CdSU1PJyMhwr3Ps2DEWL17Mxx9/zKxZsygoKKj34xQRkRszxRlSly5daNKkCQsXLuTgwYMcPnyYkpIS9/Ldu3dz4sQJnn76aQBcLhf33HOPe3mfPn0ACAoKonnz5vj4+BAUFMT58+fd6yQkJODr68t9991H586d2blzJ1FRUfV0hHeGml5qXt+XqteU1fKC9TJbLS8oc02YopD++7//m7lz5/L000+TkJDAmTNnMAzDvby8vJzOnTuTnp4OQGlpaYX3vPj6+rp/9/G5/iF5e3u7f3e5XDdcT6qvoKCo2tva7Y1qtH19s1pesF5mq+UFZb7Ky8tWrfdummLKbuvWrQwePJjExEQaN27Mtm3bKC8vx9vbG6fTSadOnfjqq684dOgQAO+//z6zZs26pX2sXr0awzD47rvv2LNnD48++mhdHIqIiFSTKU4Thg8fziuvvMKqVavw9fWlc+fO5Ofn079/f+Li4sjOzuatt95i3LhxuFwuWrRoQWpq6i3t49KlSyQmJlJWVsb06dMJDAyso6MREZHqsBk/nhu7TaWkpNC1a1cSEhKqPYY+qeHmDENTdmZntcxWywvKfJWlp+xERERMMWVX137729/WeIzDh2ue43ZWUuLydAQRsbg7opBqg8NxAZfLGrObVpw2EBHRlJ2IiJiCCklERExBhSQiIqagQhIREVNQIYmIiCmokERExBRUSCIiYgoqJBERMQUVkoiImIIKSURETEGFJCIipqBCEhERU9CHq1ZRdb7b46dKSlwVvnpdRER+oDOkKmrdGmy2mv34++vhFhG5Ef2FFBERU6jTQtq2bRvJycm1OuakSZPYu3fvNfenpKSQnZ3NyZMnGTNmDACff/45//mf/1mr+xcRkbphudeQZs6cedPlLVq04MMPPwTg73//e31EEhGRWlDnU3aFhYWMGTOGQYMGMXbsWA4ePEhkZKR7+bx585g3bx4APXv2ZMqUKQwbNozRo0ezevVqnnrqKSIjI9m+fTsAycnJbNu2DcMw+M1vfsOgQYNITk7m6NGjAOTn5xMZGck//vEPMjMzyczM5NNPPyUyMpJDhw4BUFJSQkREBKWlpXV9+CIiUkV1XkjHjx9nypQprF69mtOnT7N169Ybrnv69Gn69OnD8uXLKS0tZf369XzyySf86le/4uOPP66w7tq1a/nmm29YuXIlc+bMcRfSVW3atOGJJ57giSeeYPjw4QwbNoycnBwAcnNz6du3L35+frV/wCIiUi11PmXXrl07WrVqBUBISAhnzpy56fp9+vQB4P777+fRRx8FICgoiPPnz1dYb/v27QwcOBBfX1/uvfde93Y3kpCQwKhRo/j3f/93li1bxssvv1zdQ6oRu73RbbWf2mS1zFbLC9bLbLW8oMw1UeeF5OPzwy5sNhsAhmG473M6nRXWadCggft3b2/vG45rs9kqjPPjMa4nODiYoKAgcnNzcTgcdOrUqeoHUYsKCorqfB92e6N62U9tslpmq+UF62W2Wl5Q5qu8vGzVeu9mvV/23ahRI86ePUthYSFlZWVs3ry5WuN0796d1atXU1ZWxrlz5647jre3N06n0307MTGRN998k9jY2GrnFxGRuuGRQho9ejSPP/44I0eOpEOHDtUaZ8CAAXTt2pWhQ4fy/PPPExIScs06Xbp04bPPPmPBggUADBw4kHPnzhEXF1ejYxARkdpnM34873UbMwyDTZs2sWjRItLT0295+9at4ciRmmbQlN2NWC2z1fKC9TJbLS8o81XVnbKz3PuQquutt97i888/d79HSUREzOWO+eigSZMmsX79en72s595OoqIiFzHHXOGVFOHD9d8jJISV80HERG5TamQqsjhuIDLdUe83CYi4
hF3zJSdiIiYmwpJRERMQYUkIiKmoEISERFTUCGJiIgpqJBERMQUVEgiImIKKiQRETEFFZKIiJiCCklERExBhSQiIqagz7Kroup8t8dPlZS4KC4uroU0IiK3H50hVVHr1mCz1ezH318Pt4jIjegvpIiImMJtW0hLlixh5cqVAKSkpJCdne3hRCIicjO3bSHt2rWLsrIyT8cQEZEqMsVFDdu2bSM9PR1fX1/y8/OJjIzE39+f9evXA5CRkcHevXuZPXs2LpeLVq1aMX36dJo1a0ZkZCSxsbH89a9/5eLFi/zud7/j/PnzbNiwgS+//BK73Q7Axo0b+eSTT3A4HIwdO5akpCRPHrKIiPyEac6Qdu/ezbRp08jKymLhwoXce++9ZGdnExoaSmZmJlOmTOH3v/89n332GZ07d2b69OnubZs0acLSpUt54okn+OCDD+jRoweRkZG89NJL9O7dG4CysjI+/fRTPvjgA959911PHaaIiNyAKc6QAP7pn/6Jli1bAhAYGEj37t0BCAoKYsOGDXTs2JHg4GAAkpKSyMjIcG97tXTatm1Lbm7udcfv378/NpuNtm3bcubMmbo8lJuy2xvdVvupTVbLbLW8YL3MVssLylwTpikkX1/fCre9vb3dvxuGUWGZYRg4nU73bT8/PwBsNtsNx7863s3WqQ8FBUV1vg+7vVG97Kc2WS2z1fKC9TJbLS8o81VeXrZqvXfTNFN2N9OxY0d2795Nfn4+AIsXL6Zbt2433cbb25vy8vL6iCciIrXANGdIN9OsWTOmT5/Oiy++yOXLlwkKCmLmzJk33aZHjx688847NGpkjlNRERG5OZvx0/kwua7WreHIkZqNYRiasrsRq2W2Wl6wXmar5QVlvuq2nrITEZHbnwpJRERMwRKvIZnB4cM1H6OkxFXzQUREblMqpCpyOC7gcunlNhGRuqIpOxERMQUVkoiImIIKSURETEGFJCIipqBCEhERU1AhiYiIKaiQRETEFFRIIiJiCiokERExBRWSiIiYggpJRERMQZ9lV0WVfbdHSYmL4uLiekojInL70RlSFbVuDTbbjX/8/fVQiojUhP6KioiIKaiQRETEFOq8kLKzs0lJSbnm/jFjxnDy5Mkajz9v3jzmzZtX43FERMSzPHZRw4cffuipXYuIiAndtJBiYmKYPXs2ISEhjB8/noCAAKZNm0ZeXh5/+MMf6Ny5Mzk5OXh7e9OzZ08mTJjAiRMnGD16NIGBgdx1113ExMS4x5s5cyYOh4PU1FQee+wx/vznP7N9+3Y2b97MuXPnOHbsGD179mTq1KkApKWlsXbtWgIDA7Hb7URGRpKQkMAf//hHlixZQmBgII0bN6Zjx44A/OUvf2HFihVcvHgRX19f0tLSOHnyJHPmzCEzMxO4csa2e/dupk2bVkcPqYiIVMdNCykiIoKtW7cSEhLCgQMH3Pdv3ryZvn37snz5crKysvD19eVXv/oVmZmZREREcOjQIf74xz8SHBxMdnY2cGVq7eTJk7zzzjt4e3tX2E9eXh4rV67E29ubqKgonnzySb777jt27tzJypUruXjxIvHx8URGRrJ3716ysrJYtmwZNpuNpKQkOnbsyIULF1i/fj0LFizgrrvuYs6cOSxcuJDJkyczefJkjh49yv/7f/+P5cuXM378+Dp4KMFub1Qn41aHmbJUldUyWy0vWC+z1fKCMtdEpYX00UcfER4eTps2bTh48CAOh4NNmzbRtm1boqOjadiwIQCJiYksX76ciIgImjZtSnBwsHucTZs2UVhYyNKlS/HxuXaXYWFhBARceZ9Pq1atOHfuHFu2bGHw4ME0aNCABg0aMGDAAAC2b99OREQEd999NwBRUVG4XC4CAgJIS0tj1apVHD58mM2bN9O+fXtsNhvx8fHk5OSQkJCAw+GgU6dOtfPo/URBQVGdjHur7PZGpslSVVbLbLW8YL3MVssLynyVl5et0vduXne7my0MCwtj//79bNmyha5du9KlSxfWrFmD0+mkcePG16zvdDoBuOuuuyrcf//99zNjxgymT5+Oy+W6Zjs/Pz/37zabDcMw8PLyuu66V5dfdbXgTpw4QVJSEkVFRfTp04f4+Hj3evHx8axatYqVK1cSFxd3s0MWEREPuWkh+fj40LFjRxYsWEDXrl0JDw8nPT2diIgIwsPDWbVqFZcuXcLpdJKVlUV4ePh1xwkJCWH48OE0bNiQhQsXVilYjx49yM3NpaysjAsXLrBx40ZsNhvdu3fn888/p6ioiNLSUtatWwfA3r17eeCBBxg5ciQdOnRg/fr1lJeXA1cK8b777iMzM1OFJCJiUpVeZRcREcGOHTsICQnBbrfjcDjo27cvYWFh7Nu3j8TERJxOJ7169eLnP/8533///Q3Hmjp1Kk8++SSPPfZYpcH69u1LXl4e8fHx3HPPPTRv3hw/Pz/at2/PM888w+OPP07jxo0JCgoCoGfPnixatIghQ4ZgGAZdunTh22+/dY83ZMgQcnNzadGiRVUeFxERqWc248fzXyaSl5fH4cOHiY+P5/LlyyQlJfHWW2/Rrl27Wx7L6XTy6quvEhUVxcCBA6uVp3VrOHLkxssNQ68h1YTVMlstL1gvs9XygjJfVSevIXnSz372M1auXElsbCwJCQlER0dXq4wMw6B3797YbDb3hREiImI+pv207yZNmjB//vwaj2Oz2di6dWuNxzl8+ObLS0quvQBDRESqzrSFZDYOxwVcLlPOboqI3BZMO2UnIiJ3FhWSiIiYggpJRERMQYUkIiKmoEISERFTUCGJiIgpqJBERMQUVEgiImIKKiQRETEFFZKIiJiCCklERExBhSQiIqagD1etop9+t0dJiYvi4mIPpRERuf3oDKmKWrcGm+2HH39/PXQiIrVJf1VFRMQUTF9IoaGhla4TGRlJfn5+PaQREZG6YvpCEhGRO0OtF1JMTAz/93//B8D48eN54403AMjLy+PZZ58lIyOD+Ph4YmNjmTVrFoZx5VtYly9fTnx8PHFxcbz++uuUlpZWGHfXrl0MHDiQI0eOcPbsWcaMGUNMTAzjxo1zr3vhwgVeeuklkpKS6NevH6+//jqGYTBhwgSWLFniHis5OZndu3fX9qGLiEgN1PpVdhEREWzdupWQkBAOHDjgvn/z5s307duXL7/8kqVLl2Kz2ZgwYQI5OTk89NBDLFmyhMzMTPz8/EhLS2P+/Pm88MILAOzfv59JkyaRnp7OAw88wPTp03nooYf48MMP2bFjB6tXrwZg48aNtG/fnrlz51JWVkZ0dDRff/01iYmJzJs3jxEjRvDdd99RWFhIp06danysdnujGo9RV8yc7UasltlqecF6ma2WF5S5JuqkkD766CPCw8Np06YNBw8exOFwsGnTJtq2bcuePXtISEgA4NKlSwQFBVFUVMSRI0cYMWIEAJcvX+ahhx5yj/nLX/6SqKgoHnzwQQC2b99OWloaAF26dKFVq1YADB06lD179vDRRx9x8OBBzp49S0lJCd26dePXv/41+fn5rFixgri4uFo51oKColoZp7bZ7Y1M
m+1GrJbZannBepmtlheU+SovL9s1b5WpilovpLCwMFJSUtiyZQtdu3aladOmrFmzBqfTSaNGjXjmmWcYNWoUAOfPn8fb25ulS5cyePBgJk+eDEBxcTHl5eXuMd9++21effVVhg8fTrt27bDZbO6pPgBvb28AFixYwNq1axkxYgQ9evTgwIEDGIaBzWZj2LBhrFq1itWrVzN//vzaPmwREamhWn8NycfHh44dO7JgwQK6du1KeHg46enpREREEB4ezooVKyguLsbpdPJv//ZvrF27lm7durFu3TocDgeGYTB16lQ+/vhj95jdu3dn/PjxTJ48GZfLRffu3VmxYgUAe/bs4ejRowD87W9/IykpidjYWEpLS9m/fz8ulwuAhIQEMjMzadmyJS1atKjtwxYRkRqqk09qiIiIYMeOHYSEhGC323E4HPTt25ewsDD279/PiBEjKC8vp3fv3sTHx2Oz2XjxxRd55plncLlctG/fnmeffbbCmMOGDSM7O5sFCxbw0ksvkZKSQnR0NA8++KB7yu6ZZ55h6tSpZGRkEBAQQFhYmPty8JYtW9KyZUvi4+Pr4pBFRKSGbMaP575uU4ZhcOrUKZKTk1m5ciUNGjS45TFat4YjR348pl5Dqk1Wy2y1vGC9zFbLC8p8VXVfQ7oj3oe0du1a4uLiePnll6tVRiIiUvfuiA9XjYqKIioqqkZjHD5c8XZJiatG44mISEV3RCHVBofjAi7XbT+7KSLiMXfElJ2IiJifCklERExBhSQiIqagQhIREVNQIYmIiCmokERExBRUSCIiYgoqJBERMQUVkoiImIIKSURETEGFJCIipqBCEhERU1AhiYiIKaiQRETEFFRIIiJiCrVSSHv37mXSpEm3tE1oaGht7PqW5OfnExkZWe/7FRGRytXKF/R16NCBDh061MZQIiJyh6pyIcXExDB79mxCQkIYP348AQEBTJs2jby8PEaNGkWHDh1YsGABycnJdOjQgZ07d1JYWMjkyZOJiIggPz+fCRMmUFJSQqdOndzjbt26ldTUVADuuece0tLSKCkp4fnnn+fBBx/kH//4B0FBQaSmptKkSRM2bdrE3LlzcTqdBAcHM2PGDAIDA9mzZw+/+c1vuHTpEoGBgUybNo1WrVrxzTffuM/e2rVrV8sPn4iI1JYqT9lFRESwdetWAA4cOMCuXbsA2Lx5M6+++mqFdS9fvszixYuZOHEic+bMAWDGjBkkJCSwYsUKOnfu7F73/fffZ+rUqWRnZ9OjRw+++eYb9z6eeuopVq1aRUhICO+99x6FhYWkpaUxf/58li9fTq9evXj77bcpKytj8uTJpKWlsWzZMkaNGsWvf/1rAF577TVeeeUVli1bRnBwcA0eKhERqUtVPkOKiIjgo48+Ijw8nDZt2nDw4EEcDgebNm3i5z//eYV1e/fuDUDbtm05e/YsANu3byctLQ2A2NhYJk+eDED//v158cUXGTBgAP3796dnz57k5+fTunVrunXrBsCwYcN45ZVX6NmzJydOnODpp58GwOVycc8993D48GGOHTvG888/785w4cIFCgsLOXXqFD179gQgISGBrKysaj1QTZsGVGs7T7HbG3k6wi2zWmar5QXrZbZaXlDmmqhyIYWFhZGSksKWLVvo2rUrTZs2Zc2aNTidTlq2bFlhXT8/PwBsNluF+w3DcN/v5XXl5GzkyJH069ePzz//nNTUVPbs2UNMTAw+Pj4VtvP29qa8vJzOnTuTnp4OQGlpKcXFxZw6dYrg4GBWrFgBQHl5OadPn8Zms7n3CeDt7V3lB+anHI4LuFxG5SuagN3eiIKCIk/HuCVWy2y1vGC9zFbLC8p8lZeXrVr/ia/ylJ2Pjw8dO3ZkwYIFdO3alfDwcNLT04mIiKjS9j169CAnJweA3NxcSktLARg+fDjFxcWMHDmSkSNHuqfsDh06xL59+wDIysqiT58+dOrUia+++opDhw4BV6b7Zs2axYMPPsi5c+f4n//5H/f6r7zyCoGBgQQFBbFx40YAVq5cWdXDFRGRenZLV9lFRESwY8cOQkJCsNvtOBwO+vbtS1lZWaXbTpkyhQkTJrB48WL++Z//mbvvvhuAl19+mZSUFHx8fPD39+fNN98ErlzgMHfuXI4ePUpoaChvvvkm/v7+vPXWW4wbNw6Xy0WLFi1ITU2lQYMGzJkzh5kzZ1JaWkpAQAC/+93vAEhNTWXixInMnj2bRx555FYfHxERqSc248dzWiaRn5/P008/zYYNGzwdxU1TdnXLapmtlhesl9lqeUGZr6rzKTsREZG6ZMpCCg4ONtXZkYiI1D1TFpKIiNx5VEgiImIKKiQRETEFFZKIiJiCCklERExBhSQiIqagQhIREVNQIYmIiCmokERExBRUSCIiYgoqJBERMQUVkoiImIIKSURETEGFJCIipqBCEhERU1AhiYiIKXikkLZt20ZycnKV1w8NDQVg0aJFLFq06Jrl2dnZpKSk1Fo+ERGpfz6eDnArnnzySU9HEBGROuKxKbvCwkLGjBnDoEGDGDt2LGVlZWRlZTF06FBiYmJISUmhuLi4wjbz5s1j3rx5ACxfvpxBgwaRmJjIxo0b3eusXr2aESNGEBsbS1RUFLt27eLIkSP07dsXl8sFXDlDGz16dL0dq4iIVM5jhXT8+HGmTJnC6tWrOX36NIsWLSI9PZ0FCxbw2Wef0bBhQ957773rbnvy5EnefvttFi5cyOLFi93F5XK5yMzMJD09nZycHEaPHk1GRgYPPPAAwcHBbNu2DbhSZgkJCfV2rCIiUjmPTdm1a9eOVq1aARASEkJRURH9+vUjMDAQgKSkJCZOnHjdbfPy8ggLC6NZs2YAxMTE8OWXX+Ll5cXvf/97NmzYwKFDh9i+fTteXlc6NzExkZycHB555BG+/PJLpk6dekt5mzYNqOaReobd3sjTEW6Z1TJbLS9YL7PV8oIy14THCsnH54dd22w2GjduzPnz5933GYaB0+m87rY2mw3DMK4Zq7i4mMcff5zY2Fi6dOlCaGgoCxcuBCAqKop3332XtWvX0qdPH/z8/G4pr8NxAZfLqHxFE7DbG1FQUOTpGLfEapmtlhesl9lqeUGZr/LyslXrP/Gmuux7w4YNnD17FoAlS5bQrVu366736KOP8tVXX3Hy5ElcLhf/9V//BcDhw4ex2WyMHTuWbt26sW7dOsrLywFo2LAhffr04Z133tF0nYiICZnmKruAgACee+45kpOTuXz5Mg8//DDTpk277rrNmjVj8uTJjBw5koYNG9KmTRvgyjRg+/btGTx4MDabjV69erFz5073dtHR0ezatYtOnTrVyzGJiEjV2Ywfz33dxsrLy3n33Xdp2rQpo0aNuuXtNWVXt6yW2Wp5wXqZrZYXlPmq6k7ZmeYMqa4lJiYSGBjIH/7wB09HERGR67hjCmn58uWejiAiIjdhqosaRETkzqVCEhERU1AhiYiIKaiQRETEFFRIIiJiCnfMVXY15eVl83SEW2K1vGC9zFbLC9bLbLW8oMw1Ge+OeWOsiIiYm6bsRETEFFRIIiJ
iCiokERExBRWSiIiYggpJRERMQYUkIiKmoEISERFTUCGJiIgpqJBERMQUVEj/v88++4whQ4YwcOBAFi5ceM3yffv2kZCQwKBBg5g0aRJOp9MDKSuqLPNVr776KtnZ2fWY7MYqy7x+/Xri4uKIjY3lhRde4Ny5cx5I+YPK8q5bt46YmBiio6NJSUmhrKzMAykrqurzYuPGjURGRtZjsuurLO97771Hv379iIuLIy4u7qbHVF8qy3zw4EGSk5OJjY3ll7/8pamfx/v27XM/tnFxcfTu3ZuhQ4d6Jqghxvfff2/069fPOHPmjFFcXGzExMQY3377bYV1oqOjjby8PMMwDGPixInGwoULPRHVrSqZv//+e+O5554zOnbsaGRlZXkoacU8N8tcVFRk9OzZ0/j+++8NwzCM2bNnGzNmzPBU3ErzFhcXG7169TIKCgoMwzCMcePGGZmZmZ6KaxhG1Z4Zy3WzAAAEpElEQVQXhmEYBQUFRlRUlNGvXz8PpPxBVfI+99xzxq5duzyU8FqVZXa5XMbAgQONL774wjAMw0hNTTVmzZrlqbhVfk4YhmGUlJQY0dHRxo4dO+o55RU6QwK2bNlCeHg4TZo0wd/fn0GDBrFmzRr38u+++45Lly7xyCOPAJCQkFBhuSdUlhmu/K+of//+DB482EMpK6os8+XLl3njjTdo0aIFAKGhoZw4ccJTcSvN6+/vz4YNG2jWrBkXL17E4XDQuHFjj+WFqj0vACZPnsyLL77ogYQVVSXv3//+dz744ANiYmKYPn06paWlHkp7RWWZv/76a/z9/enTpw8AY8eO5V//9V89FbfKzwmADz74gC5duvAv//Iv9ZzyChUScOrUKex2u/t28+bNOXny5A2X2+32Css9obLMAKNHj2b48OH1He2GKsscGBjIY489BsClS5fIyMhgwIAB9Z7zqqo8xr6+vnzxxRf07duXM2fO0KtXr/qOWUFVMv/5z3/moYceolOnTvUd7xqV5S0uLqZ9+/ZMmDCBZcuWcf78ed5//31PRHWrLPPRo0dp1qwZr7/+OvHx8bzxxhv4+/t7IipQtecEQFFREUuWLPHof1RUSIDL5cJm++Hj0g3DqHC7suWeYMZMlalq5qKiIp599lnatWtHfHx8fUasoKp5IyIi2LZtG/369WPq1Kn1mPBalWU+cOAAubm5vPDCC56Id43K8t599918+OGHhISE4OPjwy9+8Qu++OILT0R1qyyz0+lk+/btPPnkkyxbtoxWrVrx29/+1hNRgao/j3NychgwYABNmzatz3gVqJCA++67j4KCAvftgoICmjdvfsPlp0+frrDcEyrLbEZVyXzq1CmeeuopQkNDmTlzZn1HrKCyvGfPnuWvf/2r+3ZMTAz/+7//W68Zf6qyzGvWrKGgoIDExESeffZZ9+PtKZXlPX78OEuXLnXfNgwDHx/Pfo1bZZntdjsPPPAAHTp0AGDo0KHs2bOn3nNeVdW/FevXr2fIkCH1Ge0aKiSgR48ebN26lcLCQi5evEhubq57/hfg/vvvx8/Pj507dwKwYsWKCss9obLMZlRZ5vLycsaOHcvgwYOZNGmSx8/4KstrGAYTJkzg+PHjwJU/9p07d/ZUXKDyzC+99BJr165lxYoVZGRk0Lx5cz755BPT5r3rrrtITU3l2LFjGIbBwoUL3dO6nlJZ5rCwMAoLC9m/fz8AGzZs4OGHH/ZU3Cr9rTAMg6+//pqwsDAPpfwhiBiGkZOTY0RHRxsDBw40MjIyDMMwjNGjRxt79uwxDMMw9u3bZyQmJhqDBg0yXn75ZaO0tNSTcQ3DqDzzVa+99poprrIzjJtnzs3NNUJDQ43Y2Fj3z+uvv27avIZhGOvWrTOGDh1qxMTEGP/xH/9hnD9/3pNxDcOo+vPi2LFjHr/KzjAqz7tmzRr38pSUFEv82/vqq6+MxMREY8iQIcYvfvEL4/Tp056MW2ne06dPGz169PBkRMMwDEPfGCsiIqagKTsRETEFFZKIiJiCCklERExBhSQiIqagQhIREVNQIYmIiCmokERExBRUSCIiYgr/H76jFWWrPa4VAAAAAElFTkSuQmCC\n",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Create a pandas Series of features importances: importances\n",
"# containing the feature names as index and their importances as values\n",
"importances = pd.Series(data=random_forest.feature_importances_ , index= X.columns)\n",
"\n",
"# Get the sorted importance values: importance_sorted\n",
"importance_sorted=importances.sort_values()\n",
"\n",
"# Plot the sorted importance values by using horizontal bars\n",
"importance_sorted.plot(kind=\"barh\", color=\"blue\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obviously, `hour` and `atemp` (temperature) are the most predictive features according to our `random_forest` model. The importances of these two features add up to `more than 90%`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kaggle Submission for Random Forest Model\n",
"\n",
"Now time to predict the given Kaggle test set and submit the predictions to Kaggle to get our score.\n",
"- First, lets read the Kaggle test data and apply all the same steps that we did for the training set like creating new features in order to make them match\n",
"\n",
"- Second, it would be better if we combine our `X` and `X_hold` and `y` and `y_hold` datasets then train our model with more data in order to let our model learn better before predicting on a new test data i.e Kaggle test dataset which we have not yet uploaded."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" holiday workingday atemp humidity windspeed month weekday hour\n",
"0 0 1 11.365 56 26.0027 1 3 0\n",
"1 0 1 13.635 56 0.0000 1 3 1\n",
"2 0 1 13.635 56 0.0000 1 3 2"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Drop the index\n",
"X_kaggle_test= kaggle_test.reset_index(drop=True)\n",
"# Drop the unnecessary columns\n",
"X_kaggle_test=X_kaggle_test.drop([\"temp\", \"season\", \"weather\"], axis=1)\n",
"X_kaggle_test.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit Combined Data to Random Forest Model"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"# Combine X and X hold: combined_train\n",
"combined_X=pd.concat([X, X_hold])\n",
"\n",
"# Combine the y and y hold: combined_test\n",
"combined_y =pd.concat([y, y_hold])\n",
"\n",
"random_forest.fit(combined_X, combined_y)\n",
"\n",
"# Predict the Kaggle test set\n",
"final_predictions_rf=random_forest.predict(X_kaggle_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Having predicted the test targets now we need to create a dataframe complying with Kaggle submission format "
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
count
\n",
"
\n",
"
\n",
"
datetime
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
2011-01-20 00:00:00
\n",
"
17.469270
\n",
"
\n",
"
\n",
"
2011-01-20 01:00:00
\n",
"
10.278210
\n",
"
\n",
"
\n",
"
2011-01-20 02:00:00
\n",
"
6.495004
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" count\n",
"datetime \n",
"2011-01-20 00:00:00 17.469270\n",
"2011-01-20 01:00:00 10.278210\n",
"2011-01-20 02:00:00 6.495004"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kaggle_sub_rf=pd.DataFrame({\"datetime\":kaggle_test.index, \"count\":final_predictions_rf}).set_index(\"datetime\")\n",
"kaggle_sub_rf.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"# Save the submission dataframe\n",
"kaggle_sub_rf.to_csv(\"kaggle_sub_rf.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Kaggle Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## OOB ('Out of the Bag' or 'Out of the Boot') Validation\n",
"- The single trees in Random Forest algorithm take the input data by randomly selecting with replacement from the original data (bootstrap samples)\n",
"- This means each tree very highly containing dublicated samples and not containing some of the samples in the original training set\n",
"- On average, \n",
" - for each model, `60%` of the training instances are sampled\n",
" - each single tree does not use `40%` of the training instance. \n",
" - They are called **\"Out of the bag\"**. A better name is **\"Out of Boot\"** because they are the ones not choosen by the bootstrap method\n",
" \n",
" \n",
"- This unseen samples can be used as validation set\n",
"- Random forest algorithm can run internally the single tree predictions on these OOB data\n",
"- Sometimes, our dataset can be very small and if we dont want to split it for the Validation set, OOB option can be a good alternative\n",
"- This allows us to see whether the model is `over-fitting`, without needing a `separate validation set`. \n",
"- In Sklearn RandomForestRegressor the parameter for OOB option is `oob_score = True`. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### OOB Error and Timeseries Data\n",
"- OOB is good for especially small dataset which are **not timeseries**\n",
"- However OOB validation will give us a misleading evaluation of performance on a time-series dataset because it will be evaluating performance on past data using future data due to the random selection during the bootstrap method application\n",
"- Therefore, we need to use a methodology which respect to time order like `TimeSeriesSplit`.\n",
"- In this project we will not use OOB"
]
},
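  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration on a toy index (not our actual data), `TimeSeriesSplit` always trains on an initial segment and validates on the segment that immediately follows it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustration of TimeSeriesSplit on a toy range: training folds always precede the test folds\n",
    "import numpy as np\n",
    "from sklearn.model_selection import TimeSeriesSplit\n",
    "\n",
    "toy = np.arange(10)\n",
    "for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(toy):\n",
    "    print(\"train:\", train_idx, \"test:\", test_idx)"
   ]
  },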
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Adaboost\n",
"\n",
"After implementing RandomForestRegressor now let's implement `AdaBoostRegressor`. \n",
"\n",
"Here are some reminders:\n",
"- In **Adaboost** we fit a new estimator repeatedly by **modifing the data each time**. \n",
"- The data modifications at each boosting iteration consist of applying weights $w_1,...w_n$ to each of the training samples. \n",
"- Initially, those weights are all equally set to $w_i=1/N$, so that the first step simply trains a weak learner **on the original data**. \n",
"- For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data.\n",
"\n",
"- At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. \n",
"\n",
"- As iterations proceed, examples that are difficult to predict receive ever-increasing influence. \n",
"- Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.\n",
"\n",
"**Important Parameters of Sklearn `AdaBoostRegressor`**\n",
"- `base_estimator` : The base estimator from which the boosted ensemble is built. Default is `DecisionTreeRegressor(max_depth=3)`. We can use a developped tree\n",
"\n",
"- The complexity of the base estimators is important (e.g., its depth `max_depth` or minimum required number of samples to consider a split `min_samples_split`).\n",
"\n",
"- `n_estimators` : The maximum number of estimators at which boosting is terminated.In case of perfect fit, the learning procedure is stopped early. Default is 50 \n",
"\n",
"- `learning_rate` : Controls the contribution of the weak learners in the final combination. \n",
" - Learning rate shrinks the contribution of each regressor by the provided value. \n",
" - There is a trade-off between learning_rate and n_estimators. \n",
" - If learning rate is small the number of estimators will be large. \n",
" - A number beetween 0 and 1. Default is 0.1\n",
"\n",
"- `loss` : The loss function to use when updating the weights after each boosting iteration.\n",
"Options are `linear`, `square`, `exponential`. Default is `linear`\n",
"\n",
"- `random_state` : Since Boosting involves randomness when choosing the input data, it is important to seed this parameter in order to reproduce the same results later"
]
},
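  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a toy sketch of the reweighting idea on synthetic data (a simplified illustration, not the exact AdaBoost.R2 update used by Sklearn): samples with larger errors get larger weights, so the next weak learner concentrates on them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration of boosting by reweighting (simplified; not the exact AdaBoost.R2 formula)\n",
    "import numpy as np\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "rng = np.random.RandomState(1)\n",
    "X_toy = rng.uniform(0, 6, size=(200, 1))\n",
    "y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.2, size=200)\n",
    "\n",
    "# Initially all weights are equal: w_i = 1/N\n",
    "w = np.full(len(X_toy), 1.0 / len(X_toy))\n",
    "\n",
    "stumps = []\n",
    "for _ in range(3):\n",
    "    stump = DecisionTreeRegressor(max_depth=1, random_state=1)\n",
    "    stump.fit(X_toy, y_toy, sample_weight=w)    # train the weak learner on the weighted data\n",
    "    err = np.abs(y_toy - stump.predict(X_toy))  # per-sample absolute error\n",
    "    w = w * (1 + err / err.max())               # poorly predicted samples get larger weights\n",
    "    w = w / w.sum()                             # renormalize the weights\n",
    "    stumps.append(stump)\n",
    "\n",
    "print(\"Weights now range from\", w.min(), \"to\", w.max())"
   ]
  },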
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"# Instantiate a DecisionTreeRegressor with arguments\n",
"dtree = DecisionTreeRegressor(max_depth=6,\n",
" max_features=6, \n",
" min_samples_leaf=8,\n",
" random_state=1)\n",
"\n",
"# Instantiate an AdaBoostClassifier with 300 trees, \n",
"adaboost = AdaBoostRegressor(base_estimator=dtree, \n",
" n_estimators=300,\n",
" learning_rate=0.02,\n",
" random_state=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's evaluate the performance of AdaboostRegressor with scores function we defined earlier"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAE values:[66.53556939 43.77378891 65.3348207 72.22024961 65.9730083 ]\n",
"\n",
"RMSE values:[ 94.81543874 64.38501852 102.36121393 102.0862418 94.38518 ]\n",
"\n",
"R^2 values:[0.57428101 0.78755088 0.34094573 0.62725897 0.67009678]\n",
"\n",
"MSLE values:[0.57223361 0.51306689 0.68107254 0.48182514 0.45777072]\n",
"\n",
"MSLE values:[0.3274513 0.26323764 0.46385981 0.23215547 0.20955404]\n",
"\n"
]
}
],
"source": [
"# Calculate the scores\n",
"scores(X, y, time_split, metrics_lst, estimator=adaboost)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4269805955614845"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Validate with the hold-out data\n",
"ada=adaboost.fit(X, y)\n",
"pred=ada.predict(X_hold)\n",
"rmsle_calculator(y_hold , pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kaggle Submission of Adaboost Model"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
count
\n",
"
\n",
"
\n",
"
datetime
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
2011-01-20 00:00:00
\n",
"
16.488889
\n",
"
\n",
"
\n",
"
2011-01-20 01:00:00
\n",
"
9.366667
\n",
"
\n",
"
\n",
"
2011-01-20 02:00:00
\n",
"
6.297357
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" count\n",
"datetime \n",
"2011-01-20 00:00:00 16.488889\n",
"2011-01-20 01:00:00 9.366667\n",
"2011-01-20 02:00:00 6.297357"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ada=adaboost.fit(combined_X, combined_y)\n",
"final_predictions_ada= ada.predict(X_kaggle_test)\n",
"kaggle_sub_ada=pd.DataFrame({\"datetime\":kaggle_test.index, \"count\":final_predictions_ada}).set_index(\"datetime\")\n",
"kaggle_sub_ada.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"# Save the submission dataframe\n",
"kaggle_sub_ada.to_csv(\"kaggle_sub_ada.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Grid Search Adaboost\n",
"Lets implement a short grid search for tuning some hyperparameters of adaboost"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'base_estimator__criterion': 'mse',\n",
" 'base_estimator__max_depth': 6,\n",
" 'base_estimator__max_features': 6,\n",
" 'base_estimator__max_leaf_nodes': None,\n",
" 'base_estimator__min_impurity_decrease': 0.0,\n",
" 'base_estimator__min_impurity_split': None,\n",
" 'base_estimator__min_samples_leaf': 8,\n",
" 'base_estimator__min_samples_split': 2,\n",
" 'base_estimator__min_weight_fraction_leaf': 0.0,\n",
" 'base_estimator__presort': False,\n",
" 'base_estimator__random_state': 1,\n",
" 'base_estimator__splitter': 'best',\n",
" 'base_estimator': DecisionTreeRegressor(criterion='mse', max_depth=6, max_features=6,\n",
" max_leaf_nodes=None, min_impurity_decrease=0.0,\n",
" min_impurity_split=None, min_samples_leaf=8,\n",
" min_samples_split=2, min_weight_fraction_leaf=0.0,\n",
" presort=False, random_state=1, splitter='best'),\n",
" 'learning_rate': 0.02,\n",
" 'loss': 'linear',\n",
" 'n_estimators': 300,\n",
" 'random_state': 1}"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# First get the parameters of adaboost\n",
"adaboost.get_params()"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"# Create the parameters grid for adaboost:params_ada\n",
"params_ada = {'n_estimators':[40, 50, 60],\n",
" 'learning_rate':[0.01, 0.02]}\n",
" \n",
"# Instantiate a 3-fold CV grid search object:grid_ada\n",
"grid_ada= GridSearchCV(estimator=adaboost,\n",
" param_grid=params_ada, \n",
" cv=5,\n",
" scoring=\"neg_mean_squared_log_error\",\n",
" #verbose=True,\n",
" n_jobs=-1)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"# Fit the combined data to grid_ada \n",
"grid_ada.fit(combined_X, combined_y);"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'learning_rate': 0.01, 'n_estimators': 50}"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the best parameters of the grid search\n",
"grid_ada.best_params_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This grid search on two parameters did not improved the score. We need to do a more complex grid search with more parameters together but adding each new parameter makes the grid search computationaly more expensive. If you have time and gpu power you can give it a try"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gradient Boosting\n",
"Let's continue our analysis by applying a `GradientBoostingRegressor` model to our data\n",
"\n",
"- In contrast to Adaboost, in **Gradient Boosting** the weights of training samples are not modified between squential predictors.\n",
"- Instead, each predictor is trained using the residual errors `(y-ลท)` of its predecessors **as target values(labels)**\n",
"- One disadvantage of boosting algorithms is **scalability**, due to the sequential nature of boosting it can hardly be parallelized.\n",
"\n",
"**Important Parameters of Sklearn `GradientBoostingRegressor`**\n",
"\n",
"Since in each stage a regression tree is fit on the Sklearn `GradientBoostingRegressor` the parameters of DecisionTreeRegressor are also the parameters of GradientBoostingRegressor. However in `AdaboostRegressor` we provide the base estimator with a seperate parameter. \n",
"\n",
"- `n_estimators`: number of boosting stages, or trees, to use.\n",
"- `learning_rate`: A number beetween 0 and 1. Learning rate shrinks the contribution of each regressor by the provided value. There is a trade-off between learning_rate and n_estimators. If learning rate is small the number of estimators will be large. Default is 0.1\n",
"- `subsample`: The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.\n",
"- `in_samples_leaf`: The minimum number of samples required to be at a leaf node\n",
"- `max_depth` : Maximum depth of the individual regression trees."
]
},
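  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before using Sklearn's implementation, here is a minimal two-stage sketch of the residual-fitting idea (illustrative only; `stage1`, `stage2` and the clipping are not part of the model we fit below):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal two-stage sketch of the residual-fitting idea behind gradient boosting\n",
    "import numpy as np\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "stage1 = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X, y)\n",
    "residuals = y - stage1.predict(X)                 # what stage 1 got wrong\n",
    "\n",
    "# Stage 2 is trained on the residuals of stage 1, not on the original target\n",
    "stage2 = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X, residuals)\n",
    "\n",
    "learning_rate = 0.1                               # shrink the correction added by stage 2\n",
    "boosted_pred = stage1.predict(X_hold) + learning_rate * stage2.predict(X_hold)\n",
    "boosted_pred = np.clip(boosted_pred, 0, None)     # bike counts cannot be negative\n",
    "print(\"Hold-out RMSLE of the two-stage sketch:\", rmsle_calculator(boosted_pred, y_hold))"
   ]
  },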
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAE values:[63.78243526 29.52160019 58.18626779 74.91291222 72.97491302]\n",
"\n",
"RMSE values:[ 90.93426958 45.57951373 88.49253049 102.05948781 97.819009 ]\n",
"\n",
"R^2 values:[0.60842039 0.89353064 0.507435 0.62745431 0.64565569]\n",
"\n",
"MSLE values:[0.55231098 0.41224693 0.63309638 0.47929402 0.44532874]\n",
"\n",
"MSLE values:[0.30504742 0.16994753 0.40081102 0.22972276 0.19831768]\n",
"\n"
]
}
],
"source": [
"# Instantiate a GradientBoostingRegressor object\n",
"gbr=GradientBoostingRegressor(n_estimators=80, \n",
" learning_rate=0.05,\n",
" max_depth=10,\n",
" min_samples_leaf=20,\n",
" random_state=5)\n",
"\n",
"# Calculate the scores of GradientBoostingRegressor\n",
"scores(X, y, time_split, metrics_lst, estimator=gbr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kaggle Submission For Gradient Boosting"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
count
\n",
"
\n",
"
\n",
"
datetime
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
2011-01-20 00:00:00
\n",
"
17.592650
\n",
"
\n",
"
\n",
"
2011-01-20 01:00:00
\n",
"
9.359919
\n",
"
\n",
"
\n",
"
2011-01-20 02:00:00
\n",
"
7.708385
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" count\n",
"datetime \n",
"2011-01-20 00:00:00 17.592650\n",
"2011-01-20 01:00:00 9.359919\n",
"2011-01-20 02:00:00 7.708385"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gbr.fit(combined_X, combined_y)\n",
"final_predictions_gbr= gbr.predict(X_kaggle_test)\n",
"kaggle_sub_gbr=pd.DataFrame({\"datetime\":kaggle_test.index, \"count\":final_predictions_gbr}).set_index(\"datetime\")\n",
"kaggle_sub_gbr[\"count\"]=kaggle_sub_gbr[\"count\"].abs()\n",
"kaggle_sub_gbr.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"# Save the submission dataframe\n",
"kaggle_sub_gbr.to_csv(\"kaggle_sub_gbr.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Stochastic Gradient Boosting\n",
"\n",
"Each tree in the ensemble is trained to find the best features and the best split points. This may cause the decesion trees use the **same features** and the **same split points**. To decrease this effect we can use `Stochastic Gradient Boosting (SGB)` by introducing further randomization to Gradient Boosting\n",
"- In SGB each decision tree is trained on a random subset of the training data\n",
"- The subsets are choosen without replacement\n",
"- **At each node** to choose the best-splits the features are also choosen without replacement \n",
" - This create further diversity in the ensemble and add more variance \n",
" \n",
" \n",
"- SGB combines gradient boosting with bootstrap averaging (bagging)\n",
"\n",
"- Here we are using a similar sampling method as in Random Forest algorithm\n",
"- The difference is that the trees continue to be trained as in the Gradient Boosting ie after training the first tree the subsequent trees are trained regarding the residual errors of the preciding trees\n",
"\n",
"#### `learning_rate` and `subsampling` effect\n",
"Some tips from Sklearn page:\n",
"- The figure below illustrates the effect of `shrinkage` and `subsampling` on the goodness-of-fit of the model. \n",
"- We can clearly see that shrinkage outperforms no-shrinkage. \n",
"- Subsampling with shrinkage can further increase the accuracy of the model. \n",
"- Subsampling without shrinkage, on the other hand, does poorly.\n",
"\n",
"\n",
"\n",
" - The number of subsampled features can be controlled via the `max_features` parameter.\n",
" \n",
"\n",
"**Note**: Using a small `max_features` value can significantly decrease the runtime. \n",
"\n",
"### Interpretation\n",
"- Individual decision trees can be interpreted easily by simply visualizing the tree structure. \n",
"- Gradient boosting models, however, comprise hundreds of regression trees thus they cannot be easily interpreted by visual inspection of the individual trees. \n",
"- But we can take the average of the feature importance of each tree to get the important features of the ensembles.\n",
"- The feature importance scores of a fit gradient boosting model can be accessed via the `feature_importances_` property"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAE values:[61.99985965 40.33608158 60.5692487 74.9969457 68.99776331]\n",
"\n",
"RMSE values:[ 88.02700477 57.99335732 91.11700132 104.47887358 95.44953937]\n",
"\n",
"R^2 values:[0.63305857 0.8276379 0.47778522 0.60958209 0.66261434]\n",
"\n",
"MSLE values:[0.64242847 0.55233378 0.7128809 0.57816768 0.52222838]\n",
"\n"
]
}
],
"source": [
"# Instantiate a GradientBoostingRegressor object\n",
"sgb=GradientBoostingRegressor(n_estimators=500,\n",
" subsample=0.5,\n",
" max_depth=4, \n",
" min_samples_split= 2,\n",
" learning_rate=0.01, \n",
" max_features=0.75,\n",
" random_state=5)\n",
"\n",
"# Calculate the scores of GradientBoostingRegressor\n",
"scores(X, y, time_split, metrics_lst, estimator=sgb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kaggle Submission For Stochastic Gradient Boosting"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sgb.fit(combined_X, combined_y)\n",
"final_predictions_sgb= sgb.predict(X_kaggle_test)\n",
"kaggle_sub_sgb=pd.DataFrame({\"datetime\":kaggle_test.index, \"count\":final_predictions_sgb}).set_index(\"datetime\")\n",
"kaggle_sub_sgb[\"count\"]=kaggle_sub_sgb[\"count\"].abs()\n",
"kaggle_sub_sgb.head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save the submission dataframe\n",
"kaggle_sub_sgb.to_csv(\"kaggle_sub_sgb.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final Submission\n",
"Here are the models and the scores on Kaggle submission\n",
"- Random Forest :0.5271\n",
"- Adaboost:0.6493\n",
"- GradientBoostingRegressor: 0.5386\n",
"- Stochastic GradientBoosting: 0.6690\n",
"\n",
"So we will submit the Random Forest.\n",
"\n",
"- We should notice that ensemble algorithms have a lot of parameters to tune, i.e a lot of knobs that control the model. \n",
"\n",
"- The best way to optimize these knobs to make comprehensive Grid Searh or Randomized Search. In our notebook we did not do rigorous optimization though all the models performed better than the linear model in the earlier post."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a closed competion though to get a feeling of ranking here is our ranking falls in among 3,251 teams :)\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sources: \n",
"https://scikit-learn.org/stable/modules/ensemble.html \n",
"https://rpmcruz.github.io/machine%20learning/2017/02/07/random-forests.html \n",
"https://datascience.stackexchange.com/questions/16800/why-max-features-n-features-does-not-make-the-random-forest-independent-of-num \n",
"https://stackoverflow.com/questions/23939750/understanding-max-features-parameter-in-randomforestregressor "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.5"
},
"nikola": {
"category": "",
"date": "2018-12-08 22:08:04 UTC+02:00",
"description": "",
"link": "",
"slug": "bikeshare part3",
"tags": "timeseries data ,Kaggle, Randomized Forest, Adaboost, Gradient Boosting, TimeSeriesSplit, Stochastic Gradient Boosting",
"title": "Ensemble Models - Kaggle Submission",
"type": "text"
}
},
"nbformat": 4,
"nbformat_minor": 2
}