{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "- With this notebook we continue our analysis of the `bikeshare` dataset.\n", "- In this part we will apply machine learning steps in order to predict the bike rentals for the given Kaggle test set and submit the predictions to Kaggle\n", "\n", "![time](/images/timecycle.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- In the previous post, we prepared the dataset partially \n", "- We created new columns and dropped the outliers with respect to the `\"count\"` column \n", "- Now we start by loading the dataset and continue to prepare it before applying machine learning algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline:\n", " \n", "- About Time Data\n", "- Trigonometric functions for cyclic time data transformation\n", "- Data Split\n", "- TimeSeriesSplit for the cross validation \n", "- Sklearn Pipeline\n", "- Interpretation of the RMSLE metric\n", "- Creating a custom scoring function \n", "- Kaggle submission " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import the necessary modules\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import TimeSeriesSplit\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.metrics import r2_score\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.metrics import make_scorer\n", "from sklearn.pipeline import Pipeline\n", "\n", "import warnings \n", "warnings.filterwarnings(\"ignore\")\n", "\n", "%matplotlib inline\n", "sns.set()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's recall our 
dataset " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Kaggle terms for the dataset:\n", "\n", "- The training set comprises the first `19 days` of each month, while the test set comprises the days from the `20th` to the end of each month. \n", "\n", "- Predict the `total count of bikes rented during each hour` covered by the test set, using only `information available prior to the rental period`.\n", "\n", "Here is the [Kaggle link for the dataset](https://www.kaggle.com/c/bike-sharing-demand/data)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th>datetime</th><th>season</th><th>holiday</th><th>workingday</th><th>weather</th><th>temp</th><th>atemp</th><th>humidity</th><th>windspeed</th><th>casual</th><th>registered</th><th>count</th><th>month</th><th>weekday</th><th>hour</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>2011-01-01 00:00:00</th><td>1</td><td>0</td><td>0</td><td>1</td><td>9.84</td><td>14.395</td><td>81</td><td>0.0</td><td>3</td><td>13</td><td>16</td><td>1</td><td>5</td><td>0</td></tr>\n",
"<tr><th>2011-01-01 01:00:00</th><td>1</td><td>0</td><td>0</td><td>1</td><td>9.02</td><td>13.635</td><td>80</td><td>0.0</td><td>8</td><td>32</td><td>40</td><td>1</td><td>5</td><td>1</td></tr>\n",
"<tr><th>2011-01-01 02:00:00</th><td>1</td><td>0</td><td>0</td><td>1</td><td>9.02</td><td>13.635</td><td>80</td><td>0.0</td><td>5</td><td>27</td><td>32</td><td>1</td><td>5</td><td>2</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " season holiday workingday weather temp atemp \\\n", "datetime \n", "2011-01-01 00:00:00 1 0 0 1 9.84 14.395 \n", "2011-01-01 01:00:00 1 0 0 1 9.02 13.635 \n", "2011-01-01 02:00:00 1 0 0 1 9.02 13.635 \n", "\n", " humidity windspeed casual registered count month \\\n", "datetime \n", "2011-01-01 00:00:00 81 0.0 3 13 16 1 \n", "2011-01-01 01:00:00 80 0.0 8 32 40 1 \n", "2011-01-01 02:00:00 80 0.0 5 27 32 1 \n", "\n", " weekday hour \n", "datetime \n", "2011-01-01 00:00:00 5 0 \n", "2011-01-01 01:00:00 5 1 \n", "2011-01-01 02:00:00 5 2 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the bike_data_inliers\n", "bike_data=pd.read_csv(\"bike_data_inliers.csv\", parse_dates=[\"datetime\"], index_col=\"datetime\")\n", "bike_data.head(3)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th>datetime</th><th>holiday</th><th>workingday</th><th>temp</th><th>humidity</th><th>windspeed</th><th>count</th><th>month</th><th>weekday</th><th>hour</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>2011-01-01 00:00:00</th><td>0</td><td>0</td><td>9.84</td><td>81</td><td>0.0</td><td>16</td><td>1</td><td>5</td><td>0</td></tr>\n",
"<tr><th>2011-01-01 01:00:00</th><td>0</td><td>0</td><td>9.02</td><td>80</td><td>0.0</td><td>40</td><td>1</td><td>5</td><td>1</td></tr>\n",
"<tr><th>2011-01-01 02:00:00</th><td>0</td><td>0</td><td>9.02</td><td>80</td><td>0.0</td><td>32</td><td>1</td><td>5</td><td>2</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " holiday workingday temp humidity windspeed count \\\n", "datetime \n", "2011-01-01 00:00:00 0 0 9.84 81 0.0 16 \n", "2011-01-01 01:00:00 0 0 9.02 80 0.0 40 \n", "2011-01-01 02:00:00 0 0 9.02 80 0.0 32 \n", "\n", " month weekday hour \n", "datetime \n", "2011-01-01 00:00:00 1 5 0 \n", "2011-01-01 01:00:00 1 5 1 \n", "2011-01-01 02:00:00 1 5 2 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Filter out the casual and registered columns\n", "## The highly correlated columns \"season\" (correlated with \"month\") and \"atemp\" (correlated with \"temp\") should also be omitted\n", "## In the previous post we also noticed that the data recorded in the \"weather\" column is not reliable\n", "## We will drop it as well\n", "bike_data= bike_data.drop([\"casual\", \"registered\", \"season\", \"atemp\", \"weather\"], axis=1)\n", "bike_data.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## About Time Data\n", "- Since our data is a timeseries, i.e. data recorded with a time dimension, we need to follow timeseries practices.\n", "\n", "\n", "- Firstly, we need to be aware of the `cyclic` nature of our time data\n", "\n", "\n", "- As we walked through the time frame charts in the previous post, we saw that the `month` and `hour` variables are very predictive of the number of bike rentals. \n", "\n", "\n", "- This is very reasonable, as we might expect bike usage and rentals in `January` to be lower than in `May` due to the weather conditions, and rentals at `3am` to be lower than at `14h`, during the day. \n", "\n", "\n", "- So we should use the `\"month\"` and `\"hour\"` data as features for our models. However, we have the problem of how to utilize these features. \n", "\n", "- We need to keep them **numeric** for the machine learning algorithms, but using them without any transformation would be poor practice. 
\n", "\n", "To make it clearer, here is a question:\n", "- Which would you expect to be more similar to the bike rentals in `January`: rentals in `December` or rentals in `May`? \n", "\n", "\n", "- Given the seasonal effects, we can tell that the number of bike rentals in `December` is more similar to the rentals in `January`. However, we represent `December` with the number `12` and `May` with the number `5`, so numerically `May` looks closer to `January` (number `1`). \n", "\n", "- This is a misleading representation of the time features, especially for algorithms that use distances or for linear models such as linear regression.\n", "\n", "> As an example let's assume our model is $$ bikerentals = 15 + 2 \\times hour $$ \n", "At 0h the number of estimated rentals = 15
\n", "At 23h the number of estimated rentals = 61
\n", "Even though 0h and 23h are only 1 hour apart, with this representation the difference between their estimates is the largest possible \n", "\n", "\n", "- Because of the cyclic nature of the months and hours we need to find a way of representing these features such that \n", "    - the `12th` month is closer to the `1st` month than the `5th` month is, or \n", "    - the `23rd` hour is closer to the `1st` hour than the `7th` hour is. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Representation of Cyclic Time Features\n", "- We map each cyclical variable onto a circle such that the `lowest value` for that variable appears right next to the `largest value`. \n", "\n", "- We represent each variable with two components: an x-axis value and a y-axis value\n", "\n", "- We compute the `x` and `y` components of that point using the `sin` and `cos` functions. \n", "\n", "![hours](/images/hours.jpg)\n", "\n", "- When we perform this transformation for the `\"month\"` variable, we also shift the values down by one so that they range from `0 to 11`, for convenience\n", "\n", "**NOTE:** Instead of representing the time features with trigonometric functions we can also use dummy variables. 
Here we will go with the trigonometric functions" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Create the sin and cos components of hour, month and weekday columns\n", "bike_data['hour_sin'] = np.sin(bike_data.index.hour*(2.*np.pi/24))\n", "bike_data['hour_cos'] = np.cos(bike_data.index.hour*(2.*np.pi/24))\n", "bike_data['month_sin'] = np.sin((bike_data.index.month-1)*(2.*np.pi/12))\n", "bike_data['month_cos'] = np.cos((bike_data.index.month-1)*(2.*np.pi/12))\n", "bike_data[\"weekday_cos\"]= np.cos((bike_data.index.weekday)*(2.*np.pi/7))\n", "bike_data['weekday_sin'] = np.sin((bike_data.index.weekday)*(2.*np.pi/7))\n", "\n", "# Now we can drop the columns \"month\", \"weekday\" and \"hour\"\n", "bike_data=bike_data.drop([\"month\", \"weekday\", \"hour\"], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split the data\n", "- Even though there is a given test set (without the target variable) for the Kaggle submission, we will analyse the data as an independent project and follow our data splitting principles\n", "\n", "- At the end we will use the Kaggle test set for submission\n", "\n", "- So, after instantiating a model we will use **cross validation** for evaluating model performance and hyperparameter tuning, but we still need to keep a **hold-out set** for our final evaluation. \n", "\n", "- Since this is a timeseries dataset we must respect the temporal order of the data. Thus, we must use only past data to predict future data. \n", "- So we can take the **last 5%** as our hold-out data. (In the Kaggle competition the test set is roughly the last 10 days of each month)\n", "\n", "- We need to split the data before doing any transformation (in case we do one) because test data points represent real-world data. 
\n", "- Transformations of the variables use information from the dataset, like the **mean** and **standard deviation**\n", "- If we transform the data before splitting, we take information like the mean and variance of the whole dataset, thus we will be **introducing future information to the training data**. \n", "\n", "- Therefore, we should fit the transformation on the training data only. Then we perform the same transformation on the testing data as well, but this time using the parameters (like mean and variance) of the training data. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The shape of the hold-out dataset: (522, 13)\n" ] }, { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>datetime</th><th>holiday</th><th>workingday</th><th>temp</th><th>humidity</th><th>windspeed</th><th>count</th><th>hour_sin</th><th>hour_cos</th><th>month_sin</th><th>month_cos</th><th>weekday_cos</th><th>weekday_sin</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>9914</th><td>2012-11-16 06:00:00</td><td>0</td><td>1</td><td>12.30</td><td>61</td><td>6.0032</td><td>130</td><td>1.000000</td><td>6.123234e-17</td><td>-0.866025</td><td>0.5</td><td>-0.900969</td><td>-0.433884</td></tr>\n",
"<tr><th>9915</th><td>2012-11-16 07:00:00</td><td>0</td><td>1</td><td>12.30</td><td>61</td><td>11.0014</td><td>367</td><td>0.965926</td><td>-2.588190e-01</td><td>-0.866025</td><td>0.5</td><td>-0.900969</td><td>-0.433884</td></tr>\n",
"<tr><th>9916</th><td>2012-11-16 09:00:00</td><td>0</td><td>1</td><td>13.94</td><td>53</td><td>7.0015</td><td>351</td><td>0.707107</td><td>-7.071068e-01</td><td>-0.866025</td><td>0.5</td><td>-0.900969</td><td>-0.433884</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " datetime holiday workingday temp humidity windspeed \\\n", "9914 2012-11-16 06:00:00 0 1 12.30 61 6.0032 \n", "9915 2012-11-16 07:00:00 0 1 12.30 61 11.0014 \n", "9916 2012-11-16 09:00:00 0 1 13.94 53 7.0015 \n", "\n", " count hour_sin hour_cos month_sin month_cos weekday_cos \\\n", "9914 130 1.000000 6.123234e-17 -0.866025 0.5 -0.900969 \n", "9915 367 0.965926 -2.588190e-01 -0.866025 0.5 -0.900969 \n", "9916 351 0.707107 -7.071068e-01 -0.866025 0.5 -0.900969 \n", "\n", " weekday_sin \n", "9914 -0.433884 \n", "9915 -0.433884 \n", "9916 -0.433884 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find the starting index of the last five percent\n", "last_five_percent_ind= int(len(bike_data)* 0.95)\n", "last_five_percent_ind\n", "\n", "# Create the hold-out dataset\n", "hold_out_df=bike_data.reset_index().iloc[last_five_percent_ind: ,:]\n", "\n", "print(\"The shape of the hold-out dataset:\", hold_out_df.shape)\n", "hold_out_df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will do training and cross validation on the rest of the data. Let's create it." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>datetime</th><th>holiday</th><th>workingday</th><th>temp</th><th>humidity</th><th>windspeed</th><th>count</th><th>hour_sin</th><th>hour_cos</th><th>month_sin</th><th>month_cos</th><th>weekday_cos</th><th>weekday_sin</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>9911</th><td>2012-11-16 03:00:00</td><td>0</td><td>1</td><td>12.3</td><td>65</td><td>8.9981</td><td>6</td><td>0.707107</td><td>0.707107</td><td>-0.866025</td><td>0.5</td><td>-0.900969</td><td>-0.433884</td></tr>\n",
"<tr><th>9912</th><td>2012-11-16 04:00:00</td><td>0</td><td>1</td><td>12.3</td><td>65</td><td>7.0015</td><td>5</td><td>0.866025</td><td>0.500000</td><td>-0.866025</td><td>0.5</td><td>-0.900969</td><td>-0.433884</td></tr>\n",
"<tr><th>9913</th><td>2012-11-16 05:00:00</td><td>0</td><td>1</td><td>12.3</td><td>65</td><td>6.0032</td><td>36</td><td>0.965926</td><td>0.258819</td><td>-0.866025</td><td>0.5</td><td>-0.900969</td><td>-0.433884</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " datetime holiday workingday temp humidity windspeed \\\n", "9911 2012-11-16 03:00:00 0 1 12.3 65 8.9981 \n", "9912 2012-11-16 04:00:00 0 1 12.3 65 7.0015 \n", "9913 2012-11-16 05:00:00 0 1 12.3 65 6.0032 \n", "\n", " count hour_sin hour_cos month_sin month_cos weekday_cos \\\n", "9911 6 0.707107 0.707107 -0.866025 0.5 -0.900969 \n", "9912 5 0.866025 0.500000 -0.866025 0.5 -0.900969 \n", "9913 36 0.965926 0.258819 -0.866025 0.5 -0.900969 \n", "\n", " weekday_sin \n", "9911 -0.433884 \n", "9912 -0.433884 \n", "9913 -0.433884 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data= bike_data.reset_index().iloc[:last_five_percent_ind, :]\n", "data.tail(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Features and Target Variable\n", "Having split off the hold-out dataset, it is now time to create the features (X) and the target (y) datasets and fit a Linear Regression model" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Target data\n", "y=data[\"count\"]\n", "\n", "# Features data\n", "X=data.drop([\"datetime\", \"count\"], axis=1)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>holiday</th><th>workingday</th><th>temp</th><th>humidity</th><th>windspeed</th><th>hour_sin</th><th>hour_cos</th><th>month_sin</th><th>month_cos</th><th>weekday_cos</th><th>weekday_sin</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>0</td><td>0</td><td>9.84</td><td>81</td><td>0.0</td><td>0.000000</td><td>1.000000</td><td>0.0</td><td>1.0</td><td>-0.222521</td><td>-0.974928</td></tr>\n",
"<tr><th>1</th><td>0</td><td>0</td><td>9.02</td><td>80</td><td>0.0</td><td>0.258819</td><td>0.965926</td><td>0.0</td><td>1.0</td><td>-0.222521</td><td>-0.974928</td></tr>\n",
"<tr><th>2</th><td>0</td><td>0</td><td>9.02</td><td>80</td><td>0.0</td><td>0.500000</td><td>0.866025</td><td>0.0</td><td>1.0</td><td>-0.222521</td><td>-0.974928</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " holiday workingday temp humidity windspeed hour_sin hour_cos \\\n", "0 0 0 9.84 81 0.0 0.000000 1.000000 \n", "1 0 0 9.02 80 0.0 0.258819 0.965926 \n", "2 0 0 9.02 80 0.0 0.500000 0.866025 \n", "\n", " month_sin month_cos weekday_cos weekday_sin \n", "0 0.0 1.0 -0.222521 -0.974928 \n", "1 0.0 1.0 -0.222521 -0.974928 \n", "2 0.0 1.0 -0.222521 -0.974928 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notes about Cross Validation of Timeseries Data\n", "\n", "Here are the notes from the [official sklearn page](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation):\n", "\n", "Time series data is characterised by the **correlation between observations that are near in time (autocorrelation)**.\n", "\n", "However, classical cross-validation techniques such as `KFold` and `ShuffleSplit` \n", "- assume the samples are **independent** and **identically distributed**, and \n", "- would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalisation error) on time series data. \n", "\n", "Therefore, it is very important to evaluate our model for time series data on the “future” observations least like those that are used to train the model. \n", "\n", "### Time Series Split\n", "\n", "`TimeSeriesSplit` is a variation of k-fold which \n", "- returns the first `k` folds as the train set and \n", "- the `k+1 th` fold as the test set. \n", "\n", "\n", "- We should **not shuffle** our data when making predictions with timeseries.\n", "- Unlike standard cross-validation methods, successive training sets are **supersets** of those that come before them. \n", "- Also, it adds all **surplus data** to the **first training partition**, which is always used to train the model." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sklearn Pipeline \n", "Whenever possible, using Sklearn `Pipeline` objects is good practice because they \n", "- are powerful tools to standardise our operations,\n", "- create an easy-to-understand workflow with a clear order of steps,\n", "- are reproducible\n", "\n", "We will create a pipeline with `StandardScaler` and `LinearRegression` objects" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Instantiate the pipeline with the StandardScaler and LinearRegression\n", "pipeline=Pipeline(steps= [(\"scaler\", StandardScaler()),\n", "                          (\"linreg\", LinearRegression())])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross validation scores\n", "\n", "Let's use `cross_val_score` from `sklearn.model_selection` to get the score of each split. With no `scoring` argument, a regressor is scored with R^2 " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R^2 scores of each split: [0.27492746 0.32817026 0.19586317 0.37070607 0.41382806]\n" ] } ], "source": [ "# Split the timeseries data\n", "split = TimeSeriesSplit(n_splits=5)\n", "\n", "# Fit and score the model with cross-validation\n", "scores = cross_val_score(pipeline, X, y, cv=split)\n", "\n", "print(\"R^2 scores of each split:\", scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing CV Splits for Timeseries: \"TimeSeriesSplit\"\n", "\n", "Visualize the cross validation splits produced by the `TimeSeriesSplit` object" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Initialize the cross-validation iterator with 10 splits\n", "cv = TimeSeriesSplit(n_splits=10)\n", "\n", "fig, ax = plt.subplots(figsize=(10, 5)) \n", "\n", "# Loop over the cross validation splits\n", "# cv.split() method creates train and test arrays for each split\n", "for idx, (train, test) in enumerate(cv.split(X, y)):\n", "\n", "    # Plot training and test indices\n", "    indices1 = ax.scatter(train, [idx] * len(train), c=[plt.cm.coolwarm(.1)], marker='_', lw=8)\n", "    indices2 = ax.scatter(test, [idx] * len(test), c=[plt.cm.coolwarm(.9)], marker='_', lw=8)\n", "    \n", "    ax.set(ylim=[10, -1], title='TimeSeriesSplit behavior', xlabel='Data index', ylabel='CV iteration')\n", "    ax.legend([indices1, indices2], ['Training', 'Validation'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scoring Metric: \"RMSLE\"\n", "\n", "Let's try to understand the metric which we will use for the Kaggle evaluation. Here is the screenshot from the Kaggle evaluation page.\n", "\n", "![scoring](/images/scoring.png)\n", "\n", "So we need to take into account the metric `Root Mean Squared Log Error (RMSLE)` for this project. 
For the interpretation of RMSLE take a look at [this page](https://hrngok.github.io/posts/metrics/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RMSLE Calculator Function" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Define the RMSLE function for error calculation: rmsle_calculator\n", "# Vectorized numpy functions are generally faster than Python loops\n", "def rmsle_calculator(predicted, actual):\n", "    assert len(predicted) == len(actual)\n", "    return np.sqrt(\n", "        np.mean(\n", "            np.power(np.log1p(predicted)-np.log1p(actual), 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Custom Scoring Function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Having defined our function for the RMSLE calculation, we should now define a scoring function to pass as the `scoring` parameter of `model_selection.cross_val_score`\n", "\n", "- We need this parameter for model-evaluation tools which rely on a scoring strategy when using cross-validation internally (such as `model_selection.cross_val_score` and `model_selection.GridSearchCV`) \n", " \n", " \n", " We will use the `make_scorer` function of Sklearn to generate a callable object (from our `rmsle_calculator` function) for scoring.
\n", "When defining a custom scorer via `sklearn.metrics.make_scorer`, the convention is that \n", "- custom functions ending in `_score` return a value to maximize and\n", "- custom functions ending in `_loss` or `_error` return a value to minimize. \n", "- We can use this functionality by setting the `greater_is_better` parameter inside `make_scorer`. \n", "- This parameter would be \n", "    - `True` for scorers where higher values are better, and\n", "    - `False` for scorers where lower values are better. (This will be our choice since lower RMSLE values are better)\n", "\n", "**NOTE:**\n", "- A loss returned by the Python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models, i.e.\n", "- when `greater_is_better` is `False`, the scorer object will sign-flip the outcome of the `score_func`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Make a custom scorer \n", "# rmsle_error will negate the return value of rmsle_calculator\n", "rmsle_error = make_scorer(rmsle_calculator, greater_is_better=False)\n", "\n", "# Fit and score the model with cross-validation\n", "scores = cross_val_score(pipeline, X, y, cv=split, scoring=rmsle_error)\n", "scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As explained, the scores are negative due to the Sklearn convention. 
We can multiply the results by `-1` for our interpretation" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The cross validation scores: [1.06115722 0.97511162 1.02056734 0.93509628 0.92447915]\n" ] } ], "source": [ "print(\"The cross validation scores:\", scores*-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hold-out Data Prediction Scores" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R^2 score of hold-out: 0.3889173616885714\n", "Root mean squared error of hold-out: 120.48857034926871\n", "RMSLE value for linear regression of hold-out: 0.715555147546642\n" ] } ], "source": [ "# Hold-out target data \n", "y_hold=hold_out_df[\"count\"]\n", "\n", "# Hold-out features data\n", "X_hold=hold_out_df.drop([\"datetime\", \"count\"], axis=1)\n", "\n", "# Fit the pipeline to train data\n", "pipeline.fit(X, y)\n", "\n", "# Generate predictions for hold-out data\n", "predictions_hold = pipeline.predict(X_hold)\n", "\n", "# R^2 score\n", "score = r2_score(y_hold, predictions_hold)\n", "\n", "print(\"R^2 score of hold-out:\", score)\n", "print(\"Root mean squared error of hold-out:\", np.sqrt(mean_squared_error(y_hold, predictions_hold)))\n", "# abs() keeps negative linear-regression predictions out of the logarithm\n", "print(\"RMSLE value for linear regression of hold-out: \", rmsle_calculator(abs(predictions_hold), y_hold))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kaggle submission\n", "Let's predict the given Kaggle test set and submit the predictions to Kaggle to get our score.
\n", "First, it would be better to combine our `X` and `X_hold` datasets and our `y` and `y_hold` datasets, and then train our model on this larger dataset so that it can learn better before predicting on new test data, i.e. the Kaggle test dataset, which we have not loaded yet." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linreg', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Combine X and X_hold: combined_X\n", "combined_X=pd.concat([X, X_hold])\n", "\n", "# Combine y and y_hold: combined_y\n", "combined_y =pd.concat([y, y_hold])\n", "\n", "# Fit the model to the combined datasets\n", "pipeline.fit(combined_X, combined_y)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th>datetime</th><th>season</th><th>holiday</th><th>workingday</th><th>weather</th><th>temp</th><th>atemp</th><th>humidity</th><th>windspeed</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>2011-01-20 00:00:00</th><td>1</td><td>0</td><td>1</td><td>1</td><td>10.66</td><td>11.365</td><td>56</td><td>26.0027</td></tr>\n",
"<tr><th>2011-01-20 01:00:00</th><td>1</td><td>0</td><td>1</td><td>1</td><td>10.66</td><td>13.635</td><td>56</td><td>0.0000</td></tr>\n",
"<tr><th>2011-01-20 02:00:00</th><td>1</td><td>0</td><td>1</td><td>1</td><td>10.66</td><td>13.635</td><td>56</td><td>0.0000</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " season holiday workingday weather temp atemp \\\n", "datetime \n", "2011-01-20 00:00:00 1 0 1 1 10.66 11.365 \n", "2011-01-20 01:00:00 1 0 1 1 10.66 13.635 \n", "2011-01-20 02:00:00 1 0 1 1 10.66 13.635 \n", "\n", " humidity windspeed \n", "datetime \n", "2011-01-20 00:00:00 56 26.0027 \n", "2011-01-20 01:00:00 56 0.0000 \n", "2011-01-20 02:00:00 56 0.0000 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the Kaggle test data\n", "kaggle_test=pd.read_csv(\"bike_kaggle_test.csv\", parse_dates=[\"datetime\"], index_col=\"datetime\")\n", "kaggle_test.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's apply to the test data all the steps that we followed for the train data, so that the two datasets have the same structure" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th>datetime</th><th>holiday</th><th>workingday</th><th>temp</th><th>humidity</th><th>windspeed</th><th>hour_sin</th><th>hour_cos</th><th>month_sin</th><th>month_cos</th><th>weekday_cos</th><th>weekday_sin</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>2011-01-20 00:00:00</th><td>0</td><td>1</td><td>10.66</td><td>56</td><td>26.0027</td><td>0.000000</td><td>1.000000</td><td>0.0</td><td>1.0</td><td>-0.900969</td><td>0.433884</td></tr>\n",
"<tr><th>2011-01-20 01:00:00</th><td>0</td><td>1</td><td>10.66</td><td>56</td><td>0.0000</td><td>0.258819</td><td>0.965926</td><td>0.0</td><td>1.0</td><td>-0.900969</td><td>0.433884</td></tr>\n",
"<tr><th>2011-01-20 02:00:00</th><td>0</td><td>1</td><td>10.66</td><td>56</td><td>0.0000</td><td>0.500000</td><td>0.866025</td><td>0.0</td><td>1.0</td><td>-0.900969</td><td>0.433884</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " holiday workingday temp humidity windspeed \\\n", "datetime \n", "2011-01-20 00:00:00 0 1 10.66 56 26.0027 \n", "2011-01-20 01:00:00 0 1 10.66 56 0.0000 \n", "2011-01-20 02:00:00 0 1 10.66 56 0.0000 \n", "\n", " hour_sin hour_cos month_sin month_cos weekday_cos \\\n", "datetime \n", "2011-01-20 00:00:00 0.000000 1.000000 0.0 1.0 -0.900969 \n", "2011-01-20 01:00:00 0.258819 0.965926 0.0 1.0 -0.900969 \n", "2011-01-20 02:00:00 0.500000 0.866025 0.0 1.0 -0.900969 \n", "\n", " weekday_sin \n", "datetime \n", "2011-01-20 00:00:00 0.433884 \n", "2011-01-20 01:00:00 0.433884 \n", "2011-01-20 02:00:00 0.433884 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kaggle_test[\"month\"]=kaggle_test.index.month\n", "kaggle_test[\"weekday\"]=kaggle_test.index.dayofweek\n", "kaggle_test[\"hour\"]=kaggle_test.index.hour\n", "\n", "# Create the sin and cos components of hour, month and weekday columns\n", "kaggle_test['hour_sin'] = np.sin(kaggle_test.hour*(2.*np.pi/24))\n", "kaggle_test['hour_cos'] = np.cos(kaggle_test.hour*(2.*np.pi/24))\n", "kaggle_test['month_sin'] = np.sin((kaggle_test.month-1)*(2.*np.pi/12))\n", "kaggle_test['month_cos'] = np.cos((kaggle_test.month-1)*(2.*np.pi/12))\n", "kaggle_test[\"weekday_cos\"]= np.cos((kaggle_test.weekday)*(2.*np.pi/7))\n", "kaggle_test['weekday_sin'] = np.sin((kaggle_test.weekday)*(2.*np.pi/7))\n", "\n", "# Drop the raw (non-cyclic) representation of time: month, weekday, hour\n", "# Drop the correlated features: atemp, season\n", "# Drop the weather column (unreliable data records)\n", "X_kaggle_test= kaggle_test.drop([\"season\", \"month\", \"weekday\", \"hour\", \"atemp\", \"weather\"], axis=1)\n", "X_kaggle_test.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our Kaggle test dataset now has the same structure as our `combined_X` dataset. Time to predict and submit!" 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "final_predictions=pipeline.predict(X_kaggle_test)" ] }, { "attachments": { "resim.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "Having predicted the test targets, we now need to create a dataframe that complies with the Kaggle submission format, as shown in the screenshot\n", "![resim.png](attachment:resim.png)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th>datetime</th><th>count</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>2011-01-20 00:00:00</th><td>29.463588</td></tr>\n",
"<tr><th>2011-01-20 01:00:00</th><td>32.955071</td></tr>\n",
"<tr><th>2011-01-20 02:00:00</th><td>22.045803</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " count\n", "datetime \n", "2011-01-20 00:00:00 29.463588\n", "2011-01-20 01:00:00 32.955071\n", "2011-01-20 02:00:00 22.045803" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kaggle_sub=pd.DataFrame({\"datetime\":kaggle_test.index, \"count\":final_predictions}).set_index(\"datetime\")\n", "kaggle_sub[\"count\"]=kaggle_sub[\"count\"].abs()\n", "kaggle_sub.head(3)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Save the submission dataframe\n", "kaggle_sub.to_csv(\"kaggle_sub.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kaggle Submission Score\n", "\n", "When we predicted the Kaggle test set with our model and submitted the predictions, we got the score below. This competition is closed, so there is no ranking anymore, but to get an idea of our model's performance, the mean value benchmark and the two scores in the ranking that our score falls between are shown below. 
\n", "\n", "For the start of our project this is a pretty good score. \n", "- So far we have only tried Linear Regression, and \n", "- we did not use any new features apart from the time features. \n", "- Later we can work more on this, and I think our ranking can get better" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our score:\n", "![kaggle1](/images/kaggle1.jpg)\n", "\n", "Mean value benchmark:\n", "\n", "![kaggle2](/images/kaggle2.jpg)\n", "\n", "The ranking that our score falls in:\n", "\n", "![kaggle3](/images/kaggle3.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrap Up\n", "\n", "This was the second part of our analysis of the bikeshare dataset
\n", "\n", "In this notebook we\n", "\n", "- continued to prepare our dataset for machine learning\n", "- used trigonometric functions for time data transformation in order to better represent the cyclic nature of time features\n", "- split our dataset and kept a hold-out test set for final evaluation\n", "- used the `TimeSeriesSplit` iterator for the cross validation of timeseries data. This was important in order to use only past data when evaluating our model on the “future” observations \n", "- used Sklearn Pipeline for a better workflow\n", "- understood the mechanism and the functionality of the RMSLE metric\n", "- created a custom scoring function which uses RMSLE for the sklearn model evaluation tools \n", "- prepared and predicted the Kaggle test set and made a submission\n", "\n", "In the next post we will continue our analysis of the bikeshare data with tree-based models and focus on model performance, especially with visualization tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sources:
\n", "https://scikit-learn.org/stable/modules/model_evaluation.html#defining-your-scoring-strategy-from-metric-functions
\n", "http://blog.davidkaleko.com/feature-engineering-cyclical-features.html
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" }, "nikola": { "category": "", "date": "2018-11-16 22:08:04 UTC+02:00", "description": "", "link": "", "slug": "bikeshare part2", "tags": "timeseries data ,Kaggle, TimeSeriesSplit, cyclic time features", "title": "Cyclic Nature of Time- Kaggle Submission", "type": "text" } }, "nbformat": 4, "nbformat_minor": 2 }