In this notebook we will be working on the Kaggle project: SF Bay Area Bike Share

Project In a nutshell¶

We are trying to predict the net change in the bike stock (bikes returned - bikes taken) at a specific station at a specific hour.
We have 3 datasets: station data, trip data, weather data

Table of contents¶

Feature preprocessing

Monthly, daily, hourly trips charts

Create hourly trip dataframe: hourly_trips

Modeling approach

Model with 140 stations

Model with 70 stations

In [1]:

# Notebook setup
import pandas as pd
import numpy as np
import glob
from pandas.tseries.holiday import USFederalHolidayCalendar
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import train_test_split


from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.externals import joblib

from rfpimp import *
from pprint import pprint

import folium

sns.set()

%matplotlib inline
# Set the option to display the max number of columns and rows 
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 20000)

'Pen & Paper' thinking¶

What can be the drivers of the bike rentals in each station?¶

Bike rentals in each station can flactuate by two kind of reasons:

Station specific factors which are more about the location and the environment of each station, like centrality, height difference etc
Dynamic factors affecting all the stations like weather and time.

Here are some main factors:

Weather:
Since bike users are exposed directly to the weather conditions during the ride it is expected to be one of the main parameters.

Snow
Rain
Ice etc

Time:
Since we are working with data created by humans, it is intuitive to expect different characteristics in different time periods

Month
Day of the week
Holiday days
Weekdays and weekends
Hour of the day

Mobility in the city:
Since bikes are an option of transportation we need to take in to account the nature of the mobility (distribution of the mobility) in the city

Centrality
Population density of the area
Residential or office(job) area
Closeness to "hot" spots like parks, universities, culture and convention centers other easy public transportation availibilities
Critical events' times (concert, a sport activity etc)

Comparision with other transportation alternatives:

Cost of other options
Price of gas
Time spend in the traffic jam

Infrastructure:

Safe bike lanes
connections

Data exploration¶

Loading datasets & summaries¶

Station data¶

In [2]:

# Read the `station_data` 
station_df = pd.read_csv("station_data.csv")
station_df.head(2)

Out[2]:

	Id	Name	Lat	Long	Dock Count	City
0	2	San Jose Diridon Caltrain Station	37.329732	-121.901782	27	San Jose
1	3	San Jose Civic Center	37.330698	-121.888979	15	San Jose

In [3]:

# Summary of station_data
station_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 6 columns):
Id            76 non-null int64
Name          76 non-null object
Lat           76 non-null float64
Long          76 non-null float64
Dock Count    76 non-null int64
City          76 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 3.6+ KB

Station locations on the map¶

To have an image of the stations let's see where the stations are on the map

Trip data¶

In [4]:

# Read the 'trip_data': trip_df
trip_df = pd.read_csv("trip_data.csv", 
                    parse_dates=['Start Date', 'End Date'], 
                    infer_datetime_format=True)

trip_df.head(2)

Out[4]:

	Trip ID	Start Date	Start Station	End Date	End Station	Subscriber Type
0	913460	2015-08-31 23:26:00	50	2015-08-31 23:39:00	70	Subscriber
1	913459	2015-08-31 23:11:00	31	2015-08-31 23:28:00	27	Subscriber

In [5]:

# Summary of trip_df
trip_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354152 entries, 0 to 354151
Data columns (total 6 columns):
Trip ID            354152 non-null int64
Start Date         354152 non-null datetime64[ns]
Start Station      354152 non-null int64
End Date           354152 non-null datetime64[ns]
End Station        354152 non-null int64
Subscriber Type    354152 non-null object
dtypes: datetime64[ns](2), int64(3), object(1)
memory usage: 16.2+ MB

There is an object type column: Subscriber Type. We need to encode this column's values to categories

Weather data¶

The given weather dataset provides weather measurements with daily precision however we will make our analysis with samples that are in an hour range.
Even though the given dataset provides zip code specific weather data, during a day-time period the measurement differences can be significant. So it would be better to have an hourly weather dataset.
Therefore instead of using the given dataset, we will use the weather data taken from Kaggle Datasets (Historical Hourly Weather Data 2012-2017)
Here, hourly weather measurements data of various weather attributes, such as temperature, humidity, air pressure, etc. are provided for many cities, including San Francisco,
Additionally, for each city we also have the country, latitude and longitude information in a separate file.

Kaggle hourly weather dataset¶

Since datasets are given by a common datetime column, we can read all the datasets and exctract the "San Francisco" columns in order to create a weather dataset related to our are of interest

In [6]:

# Pattern of weather attributes datasets: pattern
pattern = 'kaggle_data\*.csv'

# Save all matching files with glob function: weather_files
weather_files = glob.glob(pattern)

# Aggregate all the datasets in a list
# by subsetting the 'datetime' and 'San Francisco' column of each dataset
weather_df_lst = [pd.read_csv(file, usecols=["datetime", "San Francisco"]) for file in weather_files]

# Concat all the dataframes in the weather_df_lst
weather_df = pd.concat(weather_df_lst, axis=1)
print(weather_df.head(2))

# Set the first 'datetime' column as index and
# Drop the other 'datetime' columns
weather_df = weather_df.set_index(weather_df.iloc[:, 0]).drop("datetime", axis=1)

# Convert the index to datetime
weather_df.index=pd.to_datetime(weather_df.index)

# Set the column names 
column_names = ['Humidity', 'Pressure', 'Temperature', 'Weather Description', 'Wind Direction', 'Wind Speed']
weather_df.columns = column_names
weather_df

# Subset the date interval [2014-09-01: 2015-08-31] (interval of bike trip dates)
weather_df = weather_df["2014-09-01": "2015-08-31"]
weather_df.head(3)

              datetime  San Francisco             datetime  San Francisco  \
0  2012-10-01 12:00:00            NaN  2012-10-01 12:00:00            NaN   
1  2012-10-01 13:00:00           88.0  2012-10-01 13:00:00         1009.0   

              datetime  San Francisco             datetime  San Francisco  \
0  2012-10-01 12:00:00            NaN  2012-10-01 12:00:00            NaN   
1  2012-10-01 13:00:00         289.48  2012-10-01 13:00:00     light rain   

              datetime  San Francisco             datetime  San Francisco  
0  2012-10-01 12:00:00            NaN  2012-10-01 12:00:00            NaN  
1  2012-10-01 13:00:00          150.0  2012-10-01 13:00:00            2.0

Out[6]:

	Humidity	Pressure	Temperature	Weather Description	Wind Direction	Wind Speed
datetime
2014-09-01 00:00:00	72.0	1024.0	293.995500	sky is clear	237.0	2.0
2014-09-01 01:00:00	69.0	1024.0	294.414333	sky is clear	241.0	2.0
2014-09-01 02:00:00	72.0	1024.0	293.579667	sky is clear	239.0	2.0

In [7]:

weather_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8760 entries, 2014-09-01 00:00:00 to 2015-08-31 23:00:00
Data columns (total 6 columns):
Humidity               8760 non-null float64
Pressure               8760 non-null float64
Temperature            8760 non-null float64
Weather Description    8760 non-null object
Wind Direction         8759 non-null float64
Wind Speed             8760 non-null float64
dtypes: float64(5), object(1)
memory usage: 479.1+ KB

There is a missing value in Wind Direction column.
Weather Description column is categorical. We need to convert the categories into dummy variables

In [8]:

# Summary statistics of station_df
weather_df.describe().T

Out[8]:

	count	mean	std	min	25%	50%	75%	max
Humidity	8760.0	86.355594	12.037149	26.000000	78.000	88.000000	97.000000	100.00
Pressure	8760.0	1026.396119	8.010935	985.000000	1023.000	1028.000000	1032.000000	1044.00
Temperature	8760.0	288.063666	4.538506	276.145333	284.987	287.466917	290.150625	309.63
Wind Direction	8759.0	208.150474	86.875425	0.000000	170.000	232.000000	270.000000	360.00
Wind Speed	8760.0	2.476370	1.904894	0.000000	1.000	2.000000	3.000000	13.00

Missing values¶

In [9]:

# Look at the missing values in 3 datasets
print(station_df.isna().values.any())
print(trip_df.isna().values.any())
print(weather_df.isna().values.any())

False
False
True

In [10]:

# Impute the single missing value in weather data
weather_df["Wind Direction"].fillna(method='ffill', inplace=True)

Now we don't have expected missing values (the ones pandas can detect)

Categorical features¶

Let's encode the categorical columns Weather Description and Subscriber Type in weather_df and trip_df respectively with dummy variables.
We will do the encoding for the categorical variables of the trip_df after doing some plotting

In [11]:

# Create dummy variables from 'Weather Description' column categories
dummies= pd.get_dummies(weather_df["Weather Description"], drop_first=True)
# Add the dummy variables to the weather_df and
# Drop the original features from the weather_df
weather_df= pd.concat([weather_df, dummies], axis=1).drop("Weather Description", axis=1)
weather_df.head(2)

Out[11]:

	Humidity	Pressure	Temperature	Wind Direction	Wind Speed	drizzle	few clouds	fog	haze	heavy intensity rain	light intensity drizzle	light intensity shower rain	light rain	mist	moderate rain	overcast clouds	proximity shower rain	proximity thunderstorm	proximity thunderstorm with rain	scattered clouds	shower rain	sky is clear	smoke	thunderstorm	thunderstorm with heavy rain	thunderstorm with light rain	thunderstorm with rain	very heavy rain
datetime
2014-09-01 00:00:00	72.0	1024.0	293.995500	237.0	2.0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0
2014-09-01 01:00:00	69.0	1024.0	294.414333	241.0	2.0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0

Stations departures count¶

Let's count the departures from each stations

In [12]:

fig, ax=plt.subplots(figsize=(16, 5))
sns.countplot(trip_df["Start Station"], ax=ax);

We see that stations traffic is not balanced

Moved stations on the map¶

Let's see the moved stations on the map.
Please zoom in and hoover over the markers to see the position and the number of each moved stations.
The blue circles represent the new locations.

In [13]:

## Moved stations
moved_stations = [23, 24, 49, 69, 72]
new_stations1 = [85, 86, 87, 88, 89]
new_stations2 = [90]

# Create station list
station_lst = station_df["Id"].tolist()
station_moved = station_df[station_df["Id"].isin(moved_stations)]
station_new1 = station_df[station_df["Id"].isin(new_stations1)]
station_new2 = station_df[station_df["Id"].isin(new_stations2)]

# Create the list of coordinates
coord_list1 = list(zip(station_moved["Lat"], station_moved["Long"]))
coord_list2 = list(zip(station_new1["Lat"], station_new1["Long"]))
coord_list3 = list(zip(station_new2["Lat"], station_new2["Long"]))

# Zip the coordinates and the stations ids
moved_coords1 = list(zip(coord_list1, moved_stations))
moved_coords2 = list(zip(coord_list2, new_stations1))
moved_coords3 = list(zip(coord_list3, new_stations2))

## See the moved stations
stations_map = folium.Map(location=[37.56236, -122.150876], 
                          tiles='cartodbpositron',
                          zoom_start=10) 

# Add the pointers to the map by iterating over the coordinates
for point, station in moved_coords1: #
            marker = folium.Marker(location=point, popup='<i>Mt. Hood Meadows</i>', tooltip=station)
            marker.add_to(stations_map)    
            
for point, station in moved_coords2:
    marker2= folium.CircleMarker(location=point, popup='<i>Mt. Hood Meadows</i>', tooltip=station, radius=7)
    marker2.add_to(stations_map)
    
for point, station in moved_coords3:
    marker3= folium.CircleMarker(location=point, popup='<i>Mt. Hood Meadows</i>', tooltip=station, radius=7)
    marker3.add_to(stations_map)

stations_map

Out[13]:

Combine the stations¶

Generally stations moved to close points. Even though station 24 moved quite far away as station 86 for the we will combine all the stations

In [14]:

moved_stations=[23, 24, 49, 69, 72]
new_stations1=[85, 86, 87, 88, 89]
new_stations2=[90]

# Zip the moved stations and the new ones
replace_zip= list(zip(moved_stations, new_stations1))

# Replace the moved station values in 'Start Station' column with the new ones
for s1, s2 in replace_zip:
    trip_df.loc[trip_df["Start Station"]==s1, "Start Station"]=s2

# Replace the station 89 in 'Start Station' column with 90    
trip_df.loc[trip_df["Start Station"]==89, "Start Station"]=90

# Replace the moved station values in 'End Station' column with the new ones
for s1, s2 in replace_zip:
    trip_df.loc[trip_df["End Station"]==s1, "End Station"]=s2

# Replace the station 89 in 'End Station' column with 90         
trip_df.loc[trip_df["End Station"]==89, "End Station"]=90

Delete the moved stations from `station_df`¶

Since we converted the moved stations to new stations we should filter out the old stations from station_df.

In [15]:

# Drop the old stations
station_df= station_df[~station_df['Id'].isin(moved_stations + [89])]

Duplicates¶

Let's check the duplicates in 3 datasets

In [16]:

# Check the duplicates in 'station_df' values
print(station_df["Id"].nunique()==len(station_df["Id"]))

# Check the duplicates in 'Trip ID' values in trip_df
print(trip_df["Trip ID"].nunique()==len(trip_df["Trip ID"]))

# Check the duplicates in 'Date' values in weather_df
print(weather_df.index.nunique()==len(weather_df))

True
True
True

Feature preprocessing¶

Add time features¶

Since the bike usage is very related with the breakdowns of the time we will add them as seperate features.
Here we need to be aware of the cyclic nature of our time data and the non-linearity dependence between the bike rentals and the hours of the day.

Examples:

Regarding the seasonal effects we can expect that number of bike rentals in December is more similar to rentals in January, than rentals in May. However we represent December with the number 12, January with 1 and May with 5. Contrary to the reality, in the numeric represantation January is closer to May.

Similarly we can expect the number of rentals in the 23th hour is closer the number in the 0th hour in the night than the number in the 8th hour in the morning. Even though there is only 1 hour difference between 0h-23h with numeric(0 and 23) represantation the difference is the largest
This is not a good represantation for the linear models as the effect of time becomes monotonic i.e either the target will increase or decrease with time.
The models that take the distance into account will be misinformed.
For decision trees, time values close to each other will be grouped together. So 23h will be in different group than 0h.
We should create 24 different columns for hours and 12 different columns for month with binary (0,1) values
Here are the new features:
- Month(0-11)
- Day (day of the week)
- Hour(0-23)
- Holiday (1 or 0)
We can add this features to trip_df

In [17]:

# Use the 'Start Date' column to create the time features
trip_df["Month"]= trip_df["Start Date"].dt.month
trip_df["Day"]= trip_df["Start Date"].dt.dayofweek
trip_df["Hour Start"]= trip_df["Start Date"].dt.hour

Add 'Duration' column¶

In [18]:

# Create the trip duration column
trip_df["Duration"]= trip_df["End Date"]- trip_df["Start Date"]
# Convert the Duration into minutes
trip_df['Duration']=trip_df['Duration']/np.timedelta64(1,'m')

Outliers in `Duration`¶

We will take out the trips longer than two hours.

In [19]:

# Plot the boxplot
ax=sns.boxplot(x="Duration", data=trip_df, width=0.2, orient="h")

# Add the violinplot on the same figure
sns.violinplot(x="Duration", data=trip_df, bw=.2, orient="h")

# Add the label and title
ax.set(xlabel='Duration', title="Violin & Box Plot of Duration");

In [20]:

# Find the inter quantile of the duration
q1=trip_df['Duration'].quantile(0.25)
q3= trip_df['Duration'].quantile(0.75)
iqr = q3 - q1

print("Lower bound of outliers:", q1 - 1.5 * iqr)
print("Upper bound of outliers:", q3 + 1.5 * iqr, "\n")

# Only take the rows with duration is less then 120 min: trips
trips = trip_df[trip_df["Duration"] < 120]

# Print the percetage of removed data
print("The percentage of data removed:",\
      np.around((trip_df["Duration"] > 120).sum()/len(trip_df["Duration"])*100, decimals=2))

Lower bound of outliers: -3.0
Upper bound of outliers: 21.0 

The percentage of data removed: 1.53

In [21]:

# Create a figure and two axes for boxplot and distplot
fig, (ax_box, ax_dist) = plt.subplots(nrows=2, 
                                      sharex=True, 
                                      figsize=(10, 6), 
                                      gridspec_kw={"height_ratios": (.15, .85)})
 
# Plot the boxplot on the ax_box axis
sns.boxplot(trips["Duration"], ax=ax_box)
# Plot the distplot on the ax_dist axis
sns.distplot(trips["Duration"], ax=ax_dist)
 # Remove xlabel for the boxplot
ax_box.set(xlabel='');

Monthly, daily, hourly trips charts¶

Let's see the global (non station based) trip patterns by the breakdown of time. It is intiutive to expect to see meaningful relations on the charts.

Here are our steps:

Observe the data by splitting it into "weekdays" and "weekends".
We will also start by the longer time frames and follow by the shorter time frames

In [22]:

# Create the 'weekdays' and 'weekends' dataframes
weekdays=trips[trips["Start Date"].dt.weekday <5]
weekends=trips[trips["Start Date"].dt.weekday >=5]

A plot function for groups: `group_plot`¶

Let's define a plot function for plotting the dataframes groupped by different columns and different agregation functions

In [23]:

def group_plot(df, lst, agg_func, labelx, labely, title):
    '''
    Takes a dataframe, a list of columns to groupby, an aggregation function for the groups, 
    3 strings: label of x-axis, label of y-axis and title returns a line plot
    Plots the data with labels and title.
    '''
    plt.style.use('fivethirtyeight')
    group= getattr(df.groupby(lst), agg_func)().unstack(lst[1])/1000
    ax= group.plot(figsize=(11,4), rot=70, fontsize=11)        
    ax.set_xlabel(labelx)
    ax.set_ylabel(labely)
    ax.set_title(title)
    plt.show()

Monthly trips by subscription type in weekdays¶

In [24]:

labelx= "Month"
labely="Number of trips"
title= "Monthly weekdays trips by subscriber type (thousands)"
group_plot(weekdays,["Month", "Subscriber Type"], 'size', labelx, labely, title)

Number of subscriber type users are dominating during weekdays. Also number of subscribers' usage vary monthly. Especially after October till December decreasing steadily.

Monthly trips by subscription type in weekends¶

In [25]:

title= "Monthly weekend trips by subscriber type (thousands)"
group_plot(weekends,["Month", "Subscriber Type"], 'size', labelx, labely, title)

From the chart we can see that total daily usage is less than weekdays. Also the relative usage between subscribers and the casual users are different. Number of casual users increased in weekends compare to weekdays. It might happen due to the visitors.

Daily trips by subscription type in weekdays¶

In [26]:

labelx= "Day"
title= "Daily trips by subscriber type (thousands)"
group_plot(trips,["Day", "Subscriber Type"], 'size', labelx, labely, title)

Bike usage decrease on fridays compare to other weekdays.

Hourly trips by subscription type in weekdays¶

In [27]:

labelx= "Hour"
title= "Hourly weekdays trips by subscriber type (thousands)"
group_plot(weekdays,["Hour Start", "Subscriber Type"], 'size', labelx, labely, title)

Firstly, bike usage between 0h-4h is close to zero.
Subscriber usage is increasing after 5h and making first peak around 8h which is high probably due to commuting to work or schools.
The second peak is around 17h which is high probably due to commuting back from work or schools
It is intiutive that people who have a regular schedule prefer to subscribe to the system
However we will not use the subscriber type information in our model because when we want to predict future data of bike stock net change in the next hour at that moment we will not have the subscriber type data for the coming hour(s).

Hourly trips by subscription type in weekends¶

In [28]:

title= "Hourly weekends trips by subscriber type (thousands)"
group_plot(weekends,["Hour Start", "Subscriber Type"], 'size', labelx, labely, title)

Hourly trips by the day of the week¶

Let's create a dataframe with the columns showing the global count of bike trips started in each hour in order to use for the Seaborn's poinplot function.

In [29]:

# Group the trips based on 'Hour Start' and 'Day'
# Aggregate by size of the groups
hourly=trips.groupby(["Hour Start", "Day"]).size().reset_index("Day").reset_index()
# Change the name of the columns
hourly.columns=["Hour Start", "Day", "Total"]
hourly.head(2)

Out[29]:

	Hour Start	Day	Total
0	0	0	87
1	0	1	117

In [30]:

# Plot the hourly rental means in the weekdays and weekends on poinplot
fig, ax=plt.subplots(figsize=(10, 5))

sns.pointplot(x="Hour Start", 
              y="Total", 
              hue="Day", 
              data=hourly, 
              errwidth=0,
              scale=.5, 
              ax=ax);

hue_labels = ["Sun", "Mon","Tue","Wed","Thu","Fri", "Sat"]
leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, hue_labels);

ax.set_title("Hourly trips across the days");

We can see the difference in the bike usage hours patterns of the weekdays and weekends

Create hourly trip dataframe: `hourly_trips`¶

After getting some insights through the charts now we can start creating our hourly features dataframe.

We have two datasets to create our features: hourly weather data set weather_df and trips dataset trips_df
Weather data is already hourly sampled so we will also resample the trips data hourly by counting the number of departures from a station and the number of arrivals to a station.
We will later merge the hourly weather data and hourly trips data.

To create hourly_trips dataframe, instead of indidividual informations of bike trips like start station, end station etc we will only use hourly counts of bike trips and the time features like the month of the year, day of the week, hour of the day etc
We start by resampling the trip_df by hour and taking the size of each sample(number of bike trips in each hour)

In [31]:

# Resample trips_df by hour and get the size of each hour samples
hourly_total=trip_df.set_index("Start Date").resample("H").size()
# Create hourly_trips dataframe 
hourly_trips=hourly_total.to_frame(name="Total")
# Change the index name
hourly_trips.index.rename('Date', inplace=True)

# Lets add again time variables 
# Use the index to create the time features
hourly_trips["Month"]= hourly_trips.index.month
hourly_trips["Day"]= hourly_trips.index.dayofweek
hourly_trips["Hour"]= hourly_trips.index.hour

# Drop the 'Total' column
hourly_trips.drop("Total", axis=1, inplace=True)

Encode the `Month` column¶

Represent the months with binary encoding in seperate columns like m0, m1,..m12

In [32]:

# Add month dummy variables: m0, m1,..m12
dummy_months = pd.get_dummies(hourly_trips["Month"], drop_first=True)

# Add the dummy variables to the hourly_trips and
# Drop the original features from the hourly_trips
hourly_trips = pd.concat([hourly_trips, dummy_months], axis=1).drop("Month", axis=1)
# Rename the month columns
hourly_trips.columns = ["m"+ str(col) if type(col)==int else col for col in hourly_trips.columns]
hourly_trips.head(2)

Out[32]:

	Day	Hour	m2	m3	m4	m5	m6	m7	m8	m9	m10	m11	m12
Date
2014-09-01 00:00:00	0	0	0	0	0	0	0	0	0	1	0	0	0
2014-09-01 01:00:00	0	1	0	0	0	0	0	0	0	1	0	0	0

Flags for day groups¶

Based on the daily number of trips, we can also group the data into 3 as

"WD1": weekdays 1, group of (Monday, Tuesday, Wednesday)
"WD2": weekdays 2, group of (Thursday, Friday)
"WKD": weekends

In [33]:

# Boolean dictionary to map the 'True' values into 1 and "False" values into 0
bool_dct = {True:1, False:0}
# Only two flags would be enough to represent 3 partitions of the days
# hourly_trips["WD1"]=hourly_trips["Day"].isin([0,1,2]).map(bool_dct)
hourly_trips["WD2"] = hourly_trips["Day"].isin([3,4]).map(bool_dct)
hourly_trips["WKD"] = hourly_trips["Day"].isin([5,6]).map(bool_dct)
hourly_trips.head(2)

Out[33]:

	Day	Hour	m2	m3	m4	m5	m6	m7	m8	m9	m10	m11	m12	WD2	WKD
Date
2014-09-01 00:00:00	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
2014-09-01 01:00:00	0	1	0	0	0	0	0	0	0	1	0	0	0	0	0

Encode the `Day` column¶

Represent the days with binary encoding in seperate columns like d0, d1,..d6

In [34]:

## Add day dummy variables: d0, d1,..d6
dummy_days = pd.get_dummies(hourly_trips["Day"], drop_first=True)

# Add the dummy variables to the hourly_trips and
# Drop the original features from the hourly_trips
hourly_trips= pd.concat([hourly_trips, dummy_days], axis=1).drop("Day", axis=1)
# Rename the day columns
hourly_trips.columns= ["d"+ str(col) if type(col)==int else col for col in hourly_trips.columns]
hourly_trips.head(2)

Out[34]:

	Hour	m2	m3	m4	m5	m6	m7	m8	m9	m10	m11	m12	WD2	WKD	d1	d2	d3	d4	d5	d6
Date
2014-09-01 00:00:00	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0
2014-09-01 01:00:00	1	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0

Encode the '`Hour`' column¶

Represent the hours with binary encoding in seperate columns like h0, h1,..h23

In [35]:

## Add hour dummy variables: h0, h1,..h23
dummy_hours= pd.get_dummies(hourly_trips["Hour"], drop_first=True)

# Add the dummy variables to the hourly_trips and
# Drop the original features from the hourly_trips
hourly_trips= pd.concat([hourly_trips, dummy_hours], axis=1).drop("Hour", axis=1)
# Rename the hour columns
hourly_trips.columns= ["h"+str(col) if type(col)==int else col for col in hourly_trips.columns]
hourly_trips.head(2)

Out[35]:

	m2	m3	m4	m5	m6	m7	m8	m9	m10	m11	m12	WD2	WKD	d1	d2	d3	d4	d5	d6	h1	h2	h3	h4	h5	h6	h7	h8	h9	h10	h11	h12	h13	h14	h15	h16	h17	h18	h19	h20	h21	h22	h23
Date
2014-09-01 00:00:00	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2014-09-01 01:00:00	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Add holiday column¶

We might also expect the users behave differently on holidays. So it can be a good idea to add an holiday indicator.

In [36]:

# Use pandas US calendar
calendar = USFederalHolidayCalendar()
holidays = calendar.holidays(start=hourly_trips.index.min(), end=hourly_trips.index.max())
# Convert the calendar index into array
holidays = pd.to_datetime(holidays, format='%Y/%m/%d').date
# Create 'Holiday' column with binary values
hourly_trips["Holiday"]=[1 if day==True else 0 for day in hourly_trips.index.isin(holidays)]
hourly_trips.head(2)

Out[36]:

	m2	m3	m4	m5	m6	m7	m8	m9	m10	m11	m12	WD2	WKD	d1	d2	d3	d4	d5	d6	h1	h2	h3	h4	h5	h6	h7	h8	h9	h10	h11	h12	h13	h14	h15	h16	h17	h18	h19	h20	h21	h22	h23	Holiday
Date
2014-09-01 00:00:00	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
2014-09-01 01:00:00	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Features dataset¶

As we mentioned before hourly trip data and the weather data will be our features data
Let's now merge the hourly_trips and weather_df to create the features dataset

In [37]:

features_df = pd.concat([hourly_trips, weather_df], axis=1).reset_index(drop=True)
features_df.head(2)

Out[37]:

	m2	m3	m4	m5	m6	m7	m8	m9	m10	m11	m12	WD2	WKD	d1	d2	d3	d4	d5	d6	h1	h2	h3	h4	h5	h6	h7	h8	h9	h10	h11	h12	h13	h14	h15	h16	h17	h18	h19	h20	h21	h22	h23	Holiday	Humidity	Pressure	Temperature	Wind Direction	Wind Speed	drizzle	few clouds	fog	haze	heavy intensity rain	light intensity drizzle	light intensity shower rain	light rain	mist	moderate rain	overcast clouds	proximity shower rain	proximity thunderstorm	proximity thunderstorm with rain	scattered clouds	shower rain	sky is clear	smoke	thunderstorm	thunderstorm with heavy rain	thunderstorm with light rain	thunderstorm with rain	very heavy rain
0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	72.0	1024.0	293.995500	237.0	2.0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0
1	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	69.0	1024.0	294.414333	241.0	2.0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0

Finally, we have our features dataset: features_df

In [38]:

features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 71 columns):
m2                                  8760 non-null uint8
m3                                  8760 non-null uint8
m4                                  8760 non-null uint8
m5                                  8760 non-null uint8
m6                                  8760 non-null uint8
m7                                  8760 non-null uint8
m8                                  8760 non-null uint8
m9                                  8760 non-null uint8
m10                                 8760 non-null uint8
m11                                 8760 non-null uint8
m12                                 8760 non-null uint8
WD2                                 8760 non-null int64
WKD                                 8760 non-null int64
d1                                  8760 non-null uint8
d2                                  8760 non-null uint8
d3                                  8760 non-null uint8
d4                                  8760 non-null uint8
d5                                  8760 non-null uint8
d6                                  8760 non-null uint8
h1                                  8760 non-null uint8
h2                                  8760 non-null uint8
h3                                  8760 non-null uint8
h4                                  8760 non-null uint8
h5                                  8760 non-null uint8
h6                                  8760 non-null uint8
h7                                  8760 non-null uint8
h8                                  8760 non-null uint8
h9                                  8760 non-null uint8
h10                                 8760 non-null uint8
h11                                 8760 non-null uint8
h12                                 8760 non-null uint8
h13                                 8760 non-null uint8
h14                                 8760 non-null uint8
h15                                 8760 non-null uint8
h16                                 8760 non-null uint8
h17                                 8760 non-null uint8
h18                                 8760 non-null uint8
h19                                 8760 non-null uint8
h20                                 8760 non-null uint8
h21                                 8760 non-null uint8
h22                                 8760 non-null uint8
h23                                 8760 non-null uint8
Holiday                             8760 non-null int64
Humidity                            8760 non-null float64
Pressure                            8760 non-null float64
Temperature                         8760 non-null float64
Wind Direction                      8760 non-null float64
Wind Speed                          8760 non-null float64
drizzle                             8760 non-null uint8
few clouds                          8760 non-null uint8
fog                                 8760 non-null uint8
haze                                8760 non-null uint8
heavy intensity rain                8760 non-null uint8
light intensity drizzle             8760 non-null uint8
light intensity shower rain         8760 non-null uint8
light rain                          8760 non-null uint8
mist                                8760 non-null uint8
moderate rain                       8760 non-null uint8
overcast clouds                     8760 non-null uint8
proximity shower rain               8760 non-null uint8
proximity thunderstorm              8760 non-null uint8
proximity thunderstorm with rain    8760 non-null uint8
scattered clouds                    8760 non-null uint8
shower rain                         8760 non-null uint8
sky is clear                        8760 non-null uint8
smoke                               8760 non-null uint8
thunderstorm                        8760 non-null uint8
thunderstorm with heavy rain        8760 non-null uint8
thunderstorm with light rain        8760 non-null uint8
thunderstorm with rain              8760 non-null uint8
very heavy rain                     8760 non-null uint8
dtypes: float64(5), int64(3), uint8(63)
memory usage: 1.1 MB

Modeling approach¶

We will try to predict the hourly net change in the bike stock in each station.
We created our features set with time and weather attributes.
Feature dataset has the shape [n_hours, n_time&weather_attributes]

Here the question is "how should we represent the targets dataset?"
We can try two options:
- 1) a matrix of hours and stations net changes (trips ends in a station - trips starts in a station)
- 2) a matrix of hours and arrivals and departures for each stations separetely

In the first option the shape will be [n_hours, n_stations], in the second option the shape will be [n_hours, 2*n_stations]
Since arrivals and departures are independent from each other we can represent them separetely.

At first sight each option has it is own advantages and disadvantages:
- First one is less complicated. In our case targets will have only 70 columns.
- Also this(net change) is the direct goal asked in the question.
- However taking the arrival and departure differences offsets the additional information related to each station.
- For instance, let's think of two cases: a combination of the stations in a quite hour with no traffic and a combination of stations with intense but balanced(arrivals equal to departures) traffic.
- In both cases targets will be the same. Here the difference is not discriminative.
- In separete version it reveals the information about entire traffic. It makes the targets space more sparse.
- On the other hand it is more complicated, it doubles the targets dimensions. In our case there will be 140 columns. Maybe it might affect severely the performance of the model.
- Since in this case we just care the net change using denser target space might be better.

Though we will try the both options to see the results
We start modeling by with the targets in which the arrivals and departures are separeted.

Targets dataset with 140 columns¶

Targets dataset will have the shape [n_hours, 2*n_stations] in our example [8760, 140]
For each hour row, there will be columns with the count of departures from each station and the count of arrivals to each station

Here are the steps to create the targets dataframe:

We will extract the station ids from station_df and
create station arrival and station departures columns by iterating over trips_df

Arrivals and departures hourly count¶

In [39]:

# Create a list containing station ids
station_lst=station_df["Id"].tolist()

# Create departure columns' names
# 'd' stands for departures
departure_lst=[str(station) + "d" for station in station_lst]
# Create arrival columns' names
# 'a' stands for arrivals
arrival_lst=[str(station)+ "a" for station in station_lst]

# Create arrival and departure dictionaries
arrival_dict=dict(zip(station_lst, arrival_lst))
departure_dict=dict(zip(station_lst, departure_lst))

# Create targets dataframe: station_matrix
station_matrix=pd.DataFrame(columns=arrival_lst + departure_lst)

# Concat the trips_df and station_matrix: trips_extended
trips_extended=pd.concat([trip_df, station_matrix], sort=False).reset_index()

# Iterate over the trips_extended df and 
# assign 1 to the arrival column of the corresponding the Start Station on the same the row
# if on a row the 'Start Station' is 50 then the 50_a column value of the same row will be 1
# The same procedure is applied for the departure columns also ie
# if on a row the 'End Station' is 70 then the '70_d' column value of the same row will be 1
for i in trips_extended.index:
    trips_extended.at[i, arrival_dict[trips_extended.at[i,"Start Station"]]]=1
    trips_extended.at[i, departure_dict[trips_extended.at[i,"End Station"]]]=1

# After assigning the 1 value for each station's relevant arrival and departure columns
# fill the other NaN values with 0
trips_extended= trips_extended.fillna(0)
trips_extended.head(2)

Out[39]:

	index	Trip ID	Start Date	Start Station	End Date	End Station	Subscriber Type	Month	Day	Hour Start	Duration	2a	3a	4a	5a	6a	7a	8a	9a	10a	11a	12a	13a	14a	16a	21a	22a	85a	25a	86a	26a	27a	28a	29a	30a	31a	32a	33a	34a	35a	36a	37a	38a	41a	42a	45a	46a	47a	48a	87a	50a	51a	39a	54a	55a	56a	57a	58a	59a	60a	61a	62a	63a	64a	65a	66a	67a	68a	88a	70a	71a	90a	73a	74a	75a	76a	77a	80a	82a	83a	84a	2d	3d	4d	5d	6d	7d	8d	9d	10d	11d	12d	13d	14d	16d	21d	22d	85d	25d	86d	26d	27d	28d	29d	30d	31d	32d	33d	34d	35d	36d	37d	38d	41d	42d	45d	46d	47d	48d	87d	50d	51d	39d	54d	55d	56d	57d	58d	59d	60d	61d	62d	63d	64d	65d	66d	67d	68d	88d	70d	71d	90d	73d	74d	75d	76d	77d	80d	82d	83d	84d
0	0	913460.0	2015-08-31 23:26:00	50.0	2015-08-31 23:39:00	70.0	Subscriber	8.0	0.0	23.0	13.0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0
1	1	913459.0	2015-08-31 23:11:00	31.0	2015-08-31 23:28:00	27.0	Subscriber	8.0	0.0	23.0	17.0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Now we have trips_extended dataframe whoose rows are containing every single trip information with a departure column and an arrival column encoded with 1. Because every single trip starts in a station and ends in a station

Afterwards, we will resample trips_extended dataframe hourly and count the arrivals and the departures

In [40]:

# Sort the 'trips_extended' df by the date index
trips_extended= trips_extended.set_index("Start Date").sort_index()

# Create a dataframe with hourly resampling the 'trips_extended' df
# Sum the number of departures and arrivals in each station and each hour
trips_extended_hourly= trips_extended.resample('H').sum()

# See the departures distibution of station 70
trips_extended_hourly["70d"].plot(kind="hist");

In [41]:

# Create stations_hourly df by taking just the stations column
stations_hourly=trips_extended_hourly[arrival_lst+departure_lst].reset_index(drop=True)
stations_hourly.head(2)

Out[41]:

	2a	3a	4a	5a	6a	7a	8a	9a	10a	11a	12a	13a	14a	16a	21a	22a	85a	25a	86a	26a	27a	28a	29a	30a	31a	32a	33a	34a	35a	36a	37a	38a	41a	42a	45a	46a	47a	48a	87a	50a	51a	39a	54a	55a	56a	57a	58a	59a	60a	61a	62a	63a	64a	65a	66a	67a	68a	88a	70a	71a	90a	73a	74a	75a	76a	77a	80a	82a	83a	84a	2d	3d	4d	5d	6d	7d	8d	9d	10d	11d	12d	13d	14d	16d	21d	22d	85d	25d	86d	26d	27d	28d	29d	30d	31d	32d	33d	34d	35d	36d	37d	38d	41d	42d	45d	46d	47d	48d	87d	50d	51d	39d	54d	55d	56d	57d	58d	59d	60d	61d	62d	63d	64d	65d	66d	67d	68d	88d	70d	71d	90d	73d	74d	75d	76d	77d	80d	82d	83d	84d
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

In [42]:

stations_hourly.shape

Out[42]:

(8760, 140)

Now we have our targets dataset: stations_hourly

Data split with `TimeSeriesSplit`¶

Since our dataset is a timeseries we must respect to the temporal order of the data. Thus, we must use only the past data to predict the future data.
In order to stict to this principle we will take the last %10 of the sorted dataset as a hold-out set and use sklearn TimeSeriesSplit object for cross-validation

In TimeSeriesSplit unlike standard cross-validation methods, successive training sets are supersets of those that come before them
Also we should not shuffle the data. Sklearn cross_validation defaults to not to shuffle

Features and targets datasets split¶

In [43]:

# Targets dataset:y
y= stations_hourly
# Features dataset:X
X= features_df

# Find the starting indice of the last 10 percent for hold-out datasets
idx= int(len(y)* 0.90)
# Create the features train dataset: X_train
X_train= X.iloc[:idx, :]
# Create the features test dataset: X_test
X_test= X.iloc[idx:, :]
# Create the targets train dataset: y_train
y_train= y.iloc[:idx, :]
# Create the targets test dataset: y_test
y_test= y.iloc[idx: ,:]

print("X_train shape:", X_train.shape, "X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape, "y_test shape:", y_test.shape)

X_train shape: (7884, 71) X_test shape: (876, 71)
y_train shape: (7884, 140) y_test shape: (876, 140)

Multi-target models¶

Now we have our train and test datasets so we can fit a model and make the predictions. Here we have two approaches:
- 1) to build one model per station
- 2) to predict all the stations together in one model

One model per station approach is not able to take into account of the similarities between stations.

In Sklearn some algorithms like random forest, knn and linear regression natively support multi-target regression thus suitable for second approach
The others process the multi-target-regression only with the MultiOutputRegressor but this works as the first approach.
Here is the explanation from the Sklearn page: "This strategy consists of fitting one regressor per target. This is a simple strategy for extending regressors that do not natively support multi-target regression. As MultiOutputRegressor fits one regressor per target it can not take advantage of correlations between targets.

Till finding a better solution we will work with the algorithms that natively support multi-target regression

Model with 140 stations¶

First baseline for the model with 140 stations¶

In this model a target corresponding to a features sample will be a point with the permutations of 140 dimensions(70 arrivals and 70 departures)
Here the baseline will be the estimate of these permutations.
We will take the average of the these permutations i.e hourly arrivals and departures as a baseline for this model and try to beat it.

In [44]:

# Add the time index to stations_hourly df
stations_hourly.index= trips_extended_hourly.index
# Take the averages of the columns
averages= stations_hourly.mean()
# Convert the averages into array
baseline= averages.values
# Stack vertically the baseline to match the y_test shape
baseline= np.tile(averages,(y_test.shape[0], 1))
# Check the shape of baseline
#print("baseline shape:", baseline.shape)

# Baseline errors are the mean of the differences between y_test and the baseline predictions
baseline_error= round (mean_squared_error(y_test.values, baseline)**0.5, 2)
print('Average baseline error: ', baseline_error)

Average baseline error:  1.7

The baseline estimates 1.7 arrivals or departures per hour.

Training the first model¶

Time to build our first model, using random forest implementation from Sklearn.
We will use the first 90% of the data for training (the training data set is sorted by time), and the remaining 10% for testing.
Let's write a simple function that returns the performance of the model and then train our first regressor.

`get_val_score_rf` function¶

Since we need to calculate model performance during training with tuned parameters it would be better to define a function in order to reuse

In [45]:

def get_val_score_rf(X_tr, y_tr, dct):
    '''
    takes X_train, y_train and a dictionary containing hyperparameter inputs for random forest -> 
    returns the cross validation and X_train scores of the model. Both are calculated as 
    root mean squared errors 
    '''
    # Instantiate a RandomForestRegressor object by unpacking the parameters dictionary
    forest = RandomForestRegressor(**dct)
    # Instantiate TimeSeriesSplit object with 5 splits
    split=TimeSeriesSplit(n_splits=5)
    # Calculate the performance with cross-validation
    score = cross_val_score(forest, X_tr, y_tr, cv=split, scoring='neg_mean_squared_error', n_jobs=-1)
    # Take the sqrt all the splits 
    score= np.sqrt(abs(score))
    print("Cross-validation scores:", score,"\n","Cross-validation mean:", f'{np.mean(score):.2f}')
    # fit the model
    forest.fit(X_train, y_train)
    # Predict the train set to see if there is over fitting
    pred_train=forest.predict(X_train)
    print(f'RMSE Train:{np.sqrt(mean_squared_error(y_train, pred_train)):.2f}')

First scores of the model¶

Let's get the first scores of our model with initial hyperparameter settings. We will do the feature selection and hyperparameter tuning afterwards.
Let's start with some parameter tips from Sklearn's site to see the first performance.

In [46]:

# Create the params dictionary with basic hyperparameters, include the random_state
initial_params={'n_estimators':100,
        'max_depth':3,
        'min_samples_leaf':2,
        'random_state':37}

# Call the get_val_score_rf on the params dictionary
get_val_score_rf(X_train, y_train, initial_params)

Cross-validation scores: [1.21253239 1.17777628 1.23724672 1.23315127 1.27263581] 
 Cross-validation mean: 1.23
RMSE Train:1.23

Model looks little bit underfitting because $rmseTRAIN \approx rmseCV$
Additional data would be great
We can try
- feature selection
- increase the number of estimators, max_depth or decrease min_samples_leaf and look for better combination with other hyperparameters also.
We will use this scores as baseline2 and continue to improve our model with feature selection

Feature selection¶

Feature importance¶

Even though we are using an ensemble method still random forest is dependent on decision trees and decision trees tend to overfit on data with a large number of features
We can try omitting the features without prediction power in order to potentially increase the performance of our model and make it computationaly efficient.
Let's check the feature importances by using two approach.
We can get the feature importances from Sklearn RandomForestRegressor however there are other methods also.
Here there is an interesting article about the biasedness of the Sklearn Random Forest future importance method.
For this work we will try both

Feature importance by `sklearn RandomForestRegressor`¶

For regression trees the future importance is measured by how much each feature reduce the variance when they split the data

In [47]:

rf1 = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf1.fit(X_train, y_train)
# Train the regressor
rf1.predict(X_test)

# Create a pandas Series of features importances: importances
# containing the feature names as index and their importances as values
importances_rf = pd.Series(data=rf1.feature_importances_ , index= X_train.columns)

# Get the sorted importance values: importance_sorted
importance_sorted=importances_rf.sort_values()

# Plot the sorted importance values by using horizontal bars
importance_sorted.plot(kind="barh", color='lightgreen', figsize=(10, 12));

Cumulative importances¶

We can get the cumulative importances from the sorted importances and fixed the level that we want to cut off.
Below the 99% level is drawn for the Sklearn importance ranking

In [48]:

# Sort the importances in descending order
importance_desc=importances_rf.sort_values(ascending=False)
sorted_features=importance_desc.index

# Cumulate the importances
cumulative_importances = np.cumsum(importance_desc)

# list of x locations for plotting
x_values = list(range(len(importance_desc)))

plt.figure(figsize=(14,5))
# Make a line graph
plt.plot(x_values, cumulative_importances, 'g-')
# Draw line at 99% of importance retained
plt.hlines(y = 0.99, xmin=0, xmax=len(importance_desc), color = 'r', linestyles = 'dashed')
# Format x ticks and labels
plt.xticks(x_values, sorted_features, rotation = 'vertical')
# Axis labels and title
plt.xlabel('Variable'); plt.ylabel('Cumulative Importance'); plt.title('Cumulative Importances');

Now we can find number of features for cumulative importance of 99%

In [49]:

# Add 1 because Python is zero-indexed
print('Number of features for 99.5% importance:', np.where(cumulative_importances > 0.995)[0][0] + 1)

# Extract the names of the most important features
important_features = [feature for feature in importance_desc[0:48].index]

# Create training and testing sets with only the important features
X_train_red_sk = X_train[important_features] # red_sk stands for reduced sklearn
X_test_red_sk = X_test[important_features]

# Sanity check
print('Important train features shape:', X_train_red_sk.shape)
print('Important test features shape:', X_test_red_sk.shape)

Number of features for 99.5% importance: 47
Important train features shape: (7884, 48)
Important test features shape: (876, 48)

In [51]:

# Check the scores of the X_train_red_sk to compare with the first feature df X_train
get_val_score_rf(X_train_red_sk, y_train, initial_params)

Cross-validation scores: [1.21231799 1.17721283 1.23724747 1.23315189 1.27263622] 
 Cross-validation mean: 1.23
RMSE Train:1.23

After dropping 26 columns from the features dataset we get the same CV mean and RMSE train. It is better to continue with reduced dataset

Feature importance by `rfpimp`¶

In [ ]:

# Instantiate a random forest object
rf_perm = RandomForestRegressor(n_estimators=100, n_jobs=-1)
# fit the model
rf_perm.fit(X_train, y_train)
# Apply the permutation function
imp = importances(rf_perm, X_test, y_test)
# plot the future importances
viz = plot_importances(imp)
viz.view()

The results are quite different from each other however in both case as expected the night hours are not predictive.
In our chart analysis we have seen the bike usage was almost zero between 23h-04h.
Binary features related with rain are not predictive in both cases, probably the number of the cases are not enough to learn and also it is corelated with humidity
We can test both of the performances of the two reduced dataset regarding the results of two ranking of importances

Most important futures with `rfpimp`¶

In [ ]:

# Cumulate the importances
cumulative_imp = np.cumsum(imp.values)
# Add 1 because Python is zero-indexed
print('Number of features for 99.9% importance:', np.where(cumulative_imp > 0.999)[0][0] + 1)

# Extract the names of the most important features
important_perm = [feature for feature in imp[0:20].index]
# Create training and testing sets with only the important features
X_train_perm = X_train[important_perm] # red_sk stands for reduced sklearn
X_test_perm= X_test[important_perm]

# Sanity check
print('Important train features shape:', X_train_perm.shape)
print('Important test features shape:', X_test_perm.shape)

In [ ]:

# Check the scores of the X_train_perm to compare with 
# the first feature df(X_train) and the X_train_red_sk 
get_val_score_rf(X_train_perm, y_train, params)

We get the same score just with 20 columns using rfpimp.

Hyperparameters tuning¶

Randomized Search CV¶

Let's run a randomized search for hyperparameter tuning.
We can define a grid of hyperparameter ranges, and randomly sample from the grid by performing cross-validation on each combination of values using Sklearn's RandomizedSearchCV object
Even though RandomizedSearchCV is not an exhaustive search like GridSearch it can be computationally heavy depending on our choise of number of iteration and cross-validation and ranges of hyperparameters.

Both in Randomized and Grid Search we should use
- scoring='neg_mean_absolute_error' argument as our loss function is RMSE and
- TimeSeriesSplit as CV splits as we work with timeseries data

Here we will do a simple search with 10 iteration and 2 cross validation and will not tune all the hyperparameters.

In [ ]:

# Create the base model to tune
forest_tune = RandomForestRegressor()
# print and see all the parameters of RandomForestRegressor
pprint(forest_tune.get_params())

As we see there are quite a lot parameters to tune.
We will focus on the most important ones: 'n_estimators', 'max_depth' and 'min_samples_leaf'.
However we will create the full grid to make it ready for furher searches. We can play with the ranges depending on our time.
Please uncomment the parameters in the dictionary to run a wider search.

In [ ]:

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop =1000, num = 3)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(3, 50, num = 2)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               #'max_features': max_features,
               #'min_samples_split': min_samples_split,
               #'bootstrap': bootstrap
               'min_samples_leaf': min_samples_leaf}

pprint(random_grid)

Since it is not efficient doing random search with the personal laptop at the moment we will try the random search on Google Colab with these parameters.

In [ ]:

# Create the splits for timeseries data 
split=TimeSeriesSplit(n_splits=2)
# Instantiate a Random Search object with parameters: n_iter = 10, cv = 2, 
forest_random = RandomizedSearchCV(estimator = forest_tune, 
                                   param_distributions = random_grid, 
                                   n_iter = 10, 
                                   cv = split,
                                   scoring='neg_mean_absolute_error',
                                   verbose=3, 
                                   random_state=37, 
                                   n_jobs = -1)

# Fit the random search model on reduced X_train_red_sk
#forest_random.fit(X_train_red_sk, y_train) #(pls uncomment to execute)

First results of the simple randomized search from Colab: params1

Evaluate random search results¶

To find out if random search produced better model parameters than the previous one with initial_params we will run with new parameters and compare the results

In [ ]:

# Create the best params dictionary
best_params_random={'n_estimators':100,
                    'max_depth':50,
                    'min_samples_leaf':4,
                    'random_state':37}
# Check the scores of the X_train_red_sk
get_val_score_rf(X_train_red_sk, y_train, best_params_random)

Compare the scores before and after search¶

Our first random forest model have the scores:
- Cross-validation mean: 1.23
- RMSE Train:1.23
After applying the parameters of randomized search the scores becomes:
- Cross-validation mean: 0.97
- RMSE Train:0.74

The scores are both better and also this time it looks like it does not ovetfit

Grid Search CV¶

Random search allowed us to narrow down the range for each hyperparameter.
Now we can specify more the combination of settings with GridSearchCV

In [ ]:

# Create the parameter grid based on the results of random search 
param_grid = {'max_depth':[50, 60],
              'min_samples_leaf':[4, 5],
              'n_estimators':[100, 150]}
# Instantiate a RandomForestRegressor object
rf = RandomForestRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, 
                           param_grid = param_grid, 
                           cv = split, 
                           scoring='neg_mean_absolute_error',
                           n_jobs = -1, 
                           verbose = 2)
pprint(param_grid)

In [ ]:

# (pls uncomment to execute)
# Fit the grid search to the data 
#grid_search.fit(X_train_red_sk, y_train)
#grid_search.best_params_

First results of the simple grid search from Colab: params2

All the parameters augmented little bit.

Evaluate the GridSearch results¶

In [ ]:

# Create the best params dictionary
best_params_grid={'n_estimators':150,
                  'max_depth':60,
                  'min_samples_leaf':5,
                  'random_state':37}
# Check the scores 
get_val_score_rf(X_train_red_sk, y_train, best_params_grid)

Cross-validation mean score did not change. (Only changed in the 3rd decimal probably).
Train score increased by 0.03 because of maximum depth difference.
Even though there is chance to improve results by tweaking the ranges.
For now we can stop our hypermeter tuning here. We will continue to use the previous random search results.
Again, we have focused only on 3 hyperparameters due to the computational and time constraints.

Hold-out set score¶

Now we can see our final test prediction performance by creating our final model

In [ ]:

# Instantiate a RandomForestRegressor object with the best parameters of random search
model_rf=RandomForestRegressor(**best_params_random)
# Fit the model 
model_rf.fit(X_train_red_sk, y_train)
# Predict the train set
pred_final=model_rf.predict(X_test_red_sk)

print(f'RMSE Test(hold-out):{np.sqrt(mean_squared_error(y_test, pred_final)):.2f}')

Our test score improved to 0.90 from 0.97 CV mean.
It is high probaly thanks to the higher amount of training data (we used all the 5 folds together for training)

Test with the reduced `X_train_perm`¶

In [ ]:

# Instantiate a RandomForestRegressor object with the best parameters of random search
model_rf=RandomForestRegressor(**best_params_random)
# Fit the model 
model_rf.fit(X_train_perm, y_train)
# Predict the train set
pred_final=model_rf.predict(X_test_perm)

print(f'RMSE Test(hold-out):{np.sqrt(mean_squared_error(y_test, pred_final)):.2f}')

We get the same test result by using the features dataset reduced by the rfpimp
Since X_train_perm has less columns (20), it looks a better idea to use this dataset for model training and testing.

Reminder:¶

So far the model we created predicts the number of arrivals and departures not their differences (net bike stock change)
Our baselines were also about how close we are to the number of arrivals and departures not the net change.
To get the net change predictions we need to convert the predictions of the model.

In [ ]:

# Save the model for future use 
filename = 'model_rf.model'
joblib.dump(model_rf, filename)

In [ ]:

# Check if the model properly saved
model=joblib.load(filename)
pred_joblib=model.predict(X_test_perm)
print(f'RMSE Test(hold-out):{np.sqrt(mean_squared_error(y_test, pred_joblib)):.2f}')

Our model by construction predicting the number of arrivals and departure seperately thus when we give it unseen data and it returns [n_unseen_data, 2*stations] matrix.
So we need to define a function which converts the predictions into net change (n_departures - n_arrivals)

In [ ]:

def net_change(model, df):
    '''
    Takes our model and dataframe to predict returns->
    a dataframe with the half number of columns of the predictons made by our model
    by taking the difference of the departures and arrival columns
    '''
    # Create a dataframe from predictions
    predictions_df = pd.DataFrame(model.predict(df))
    # Create list of column names
    columns_lst = list(predictions_df.columns)
    # Create a list to aggregate the differences of departures and arrivals
    difference_cols =[]
    # Iterate over the column names list till the half point
    for idx in range(len(columns_lst)//2):
        # Collect the column differences in the list
        difference_cols.append(predictions_df[idx+70]-predictions_df[idx])
    # At the end of the loop concat all the columns at once
    net_rate_df = pd.concat(difference_cols, axis=1)
    return net_rate_df

In [ ]:

# Call the net_change on our model with test data
net_change(model_rf, X_test_perm).head(2)

As seen above we converted the predictions of arrivals and departures to the differences with 70 columns

Net rates dataframe¶

Lets create the net rates dataframe by taking the differences of departures and arrivals for each hour
In this option the shape will be [n_hours, n_stations], in our case [8760, 70]

In [ ]:

# Create an empty list to aggregate the net change in every row(hour)
columns_lst=[]

## Loop over stations_hourly df columns to find the net change (arrivals-departures)
for station in stations_hourly.columns:
    if station[-1]=="a":
        station_arr = station
        station_dep = station[:-1]+ "d"
        # Append the column difference in the list
        columns_lst.append((stations_hourly[station_arr]- stations_hourly[station_dep]))

# Create the change_df by concatenating the columns in the column_lst
net_df = pd.concat(columns_lst, axis=1) 
# Create the column names of the net_df
net_cols =[]
for station in stations_hourly.columns:
    if station[-1]=="a":
        net_cols.append(station[:-1]+ "c")
# Add the column names        
net_df.columns = net_cols       
net_df.head(2)

In [ ]:

net_df.shape

RMSE of the `predicted net changes` and the `actual net changes`¶

Now we can make the final evaluation of our 140 stations approach by finding the RMSE of the prediction of net change and the actual net changes.

In [ ]:

# Get the predictions of the test 
predicted_net_change= net_change(model_rf, X_test_perm)
# Actual net changes
actual_net_change=y_test_net
# Get the Root Mean Squared Errors of differences
rmse_model_rf=round(mean_squared_error(y_test_net, predicted_net_change)**0.5, 2)
rmse_model_rf

Our model's performance decreased to 2.89 on prediction the net rate change

Model with 70 stations¶

`Net Rate` targets approach¶

Now we will try the net rate approach by directly taking the net bike stock change in an hour in a station as targets data
Here we will repeat the same steps like we did before for the dataset with targets as arrivals and departures.

Target datasets : `y_train_net`, `y_test_net`¶

We only need to split the targets dataset (net_df)
We can contunie to use the splitted X_train and X_test as features datasets

In [ ]:

# Targets dataset:y
y_net= net_df

# Find the starting indice of the last 10 percent for hold-out datasets
idx= int(len(y_net)* 0.90)

# Create the targets train dataset: y_train_net
y_train_net= y_net.iloc[:idx, :]
# Create the targets test dataset: y_test_net
y_test_net= y_net.iloc[idx: ,:]

# Check the shapes
print("X_train shape:", X_train.shape, "X_test shape:", X_test.shape)
print("y_train shape:", y_train_net.shape, "y_test shape:", y_test_net.shape)

First baseline method for net rate approach¶

In [ ]:

# Take the averages of the columns
averages= net_df.mean()
# Convert the averages into array
baseline= averages.values
# Stack vertically the baseline to match the y_test shape
baseline= np.tile(averages,(y_test_net.shape[0], 1))
# Check the shape of baseline
#print("baseline shape:", baseline.shape)
# Baseline errors are the mean of the differences between y_test and the baseline predictions
baseline_error= round (mean_squared_error(y_test_net.values, baseline)**0.5, 2)
print('Average baseline error: ', baseline_error)

Quite interesting just taking the daily averages can estimates 1.72 bikes difference on the average

First model performance with net rate targets¶

We will use the validation function get_val_score_rf that we defined in the previous part to evaluate the first performance of random forest algorithm with the parameters below.

In [ ]:

# Create the params dictionary with basic hyperparameters, include the random_state
params={'n_estimators':100,
        'max_depth':3,
        'min_samples_leaf':2,
        'random_state':37}

# Call the get_val_score_rf on X_train_net the params dictionary
get_val_score_rf(X_train, y_train_net, params)

Even with these parameters net change model beats the first model (140 targets)
Also these scores are better than the first baseline of this approach
Score improved from 1.72 to 1.23

Feature importance `rfpimp`¶

Now we can continue with the features importance function

In [ ]:

# Instantiate a random forest object
rf_perm = RandomForestRegressor(n_estimators=100, n_jobs=-1)
# fit the model
rf_perm.fit(X_train, y_train_net)
# Apply the permutation function
imp = importances(rf_perm, X_test, y_test_net)
# Cumulate the importances
cumulative_imp = np.cumsum(imp.values)
# Add 1 because Python is zero-indexed
print('Number of features for 99.9% importance:', np.where(cumulative_imp > 0.999)[0][0] + 1)

According to model importance funtion only 4 features are enough to get the 99.9% prediction power.

In [ ]:

# Extract the names of the most important features. Let's keep 20 of them
important_net = [feature for feature in imp[0:20].index]
# Create training and testing sets with only the important features
X_train_red_net = X_train[important_perm] # red_sk stands for reduced sklearn
X_test_red_net= X_test[important_perm]

# Sanity check
print('Important train features shape:', X_train_red_net.shape)
print('Important test features shape:', X_test_red_net.shape)

Model performance with randomized search parameters¶

We did a RandomizedSearchCV on Colab and get the parameters below. Let's get the cross validation scores on reduced features with tweaked parameters

In [ ]:

# Parameters from RandomizedSearchCV
params_random={'n_estimators':100,
               'max_depth':50,
               'min_samples_leaf':4,
               'random_state':37}

# Call the get_val_score_rf on X_train_net the params dictionary
get_val_score_rf(X_train_red_net, y_train_net, params_random)

Cross-validation mean improved from 1.29 to 1.19
RMSE Train improved from 1.23 to 0.74

Model performance with Grid Search parameters¶

After narrowing down the range we ran a simple grid search on Colab to tweak the parameters.

In [ ]:

params_grid={'n_estimators':100,
             'max_depth':60,
             'min_samples_leaf':5,
             'random_state':37}

# Call the get_val_score_rf on X_train, y_train_net and the params dictionary
get_val_score_rf(X_train_red_net, y_train_net, params_grid)

Cross validation score just improved 0.01 and RMSE train get worse 0.03 point. We can use the previous parameters.

Test on hold-out set¶

Now we can see our final test prediction performance with net rate targets by creating our final model

In [ ]:

# Instantiate a RandomForestRegressor object with the best parameters of random search
model_net_final=RandomForestRegressor(**best_params_random)
# Fit the model 
model_net_final.fit(X_train_red_net, y_train_net)
# Predict the train set
pred_final_net=model_net_final.predict(X_test_red_net)

print(f'RMSE Test(hold-out):{np.sqrt(mean_squared_error(y_test_net, pred_final_net)):.2f}')

Our test score on final hold-out set is 1.16 and little better than last cross validation score

Save model for future use ¶

In [ ]:

filename = 'model_net.model'
joblib.dump(model_net_final, filename)

In [ ]:

# Check if the model properly saved
model=joblib.load(filename)

pred_joblib=model.predict(X_test_red_net)
print(f'RMSE Test(hold-out):{np.sqrt(mean_squared_error(y_test_net, pred_joblib)):.2f}')

Performance analysis¶

Model with 140 columns targets¶

The model with 140 targets columns performance on predicting the arrivals and departures:

Model with 70 columns targets¶

results70

With this final evaluation we completed our steps for this takehome project.

	2a	3a	4a	5a	6a	7a	8a	9a	10a	11a	12a	13a	14a	16a	21a	22a	85a	25a	86a	26a	27a	28a	29a	30a	31a	32a	33a	34a	35a	36a	37a	38a	41a	42a	45a	46a	47a	48a	87a	50a	51a	39a	54a	55a	56a	57a	58a	59a	60a	61a	62a	63a	64a	65a	66a	67a	68a	88a	70a	71a	90a	73a	74a	75a	76a	77a	80a	82a	83a	84a	2d	3d	4d	5d	6d	7d	8d	9d	10d	11d	12d	13d	14d	16d	21d	22d	85d	25d	86d	26d	27d	28d	29d	30d	31d	32d	33d	34d	35d	36d	37d	38d	41d	42d	45d	46d	47d	48d	87d	50d	51d	39d	54d	55d	56d	57d	58d	59d	60d	61d	62d	63d	64d	65d	66d	67d	68d	88d	70d	71d	90d	73d	74d	75d	76d	77d	80d	82d	83d	84d
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

	2a	3a	4a	5a	6a	7a	8a	9a	10a	11a	12a	13a	14a	16a	21a	22a	85a	25a	86a	26a	27a	28a	29a	30a	31a	32a	33a	34a	35a	36a	37a	38a	41a	42a	45a	46a	47a	48a	87a	50a	51a	39a	54a	55a	56a	57a	58a	59a	60a	61a	62a	63a	64a	65a	66a	67a	68a	88a	70a	71a	90a	73a	74a	75a	76a	77a	80a	82a	83a	84a	2d	3d	4d	5d	6d	7d	8d	9d	10d	11d	12d	13d	14d	16d	21d	22d	85d	25d	86d	26d	27d	28d	29d	30d	31d	32d	33d	34d	35d	36d	37d	38d	41d	42d	45d	46d	47d	48d	87d	50d	51d	39d	54d	55d	56d	57d	58d	59d	60d	61d	62d	63d	64d	65d	66d	67d	68d	88d	70d	71d	90d	73d	74d	75d	76d	77d	80d	82d	83d	84d
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Project In a nutshell¶

Table of contents¶

'Pen & Paper' thinking¶

What can be the drivers of the bike rentals in each station?¶

Data exploration¶

Loading datasets & summaries¶

Station data¶

Station locations on the map¶

Trip data¶

Weather data¶

Kaggle hourly weather dataset¶

Missing values¶

Categorical features¶

Stations departures count¶

Moved stations on the map¶

Combine the stations¶

Delete the moved stations from station_df¶

Duplicates¶

Feature preprocessing¶

Add time features¶

Add 'Duration' column¶

Outliers in Duration¶

Monthly, daily, hourly trips charts¶

A plot function for groups: group_plot¶

Monthly trips by subscription type in weekdays¶

Monthly trips by subscription type in weekends¶

Daily trips by subscription type in weekdays¶

Hourly trips by subscription type in weekdays¶

Hourly trips by subscription type in weekends¶

Hourly trips by the day of the week¶

Create hourly trip dataframe: hourly_trips¶

Encode the Month column¶

Flags for day groups¶

Encode the Day column¶

Encode the 'Hour' column¶

Add holiday column¶

Features dataset¶

Modeling approach¶

Targets dataset with 140 columns¶

Arrivals and departures hourly count¶

Data split with TimeSeriesSplit¶

Features and targets datasets split¶

Multi-target models¶

Model with 140 stations¶

First baseline for the model with 140 stations¶

Training the first model¶

get_val_score_rf function¶

First scores of the model¶

Feature selection¶

Feature importance¶

Feature importance by sklearn RandomForestRegressor¶

Cumulative importances¶

Feature importance by rfpimp¶

Most important futures with rfpimp¶

Hyperparameters tuning¶

Randomized Search CV¶

Evaluate random search results¶

Compare the scores before and after search¶

Grid Search CV¶

Evaluate the GridSearch results¶

Hold-out set score¶

Test with the reduced X_train_perm¶

Reminder:¶

Net rates dataframe¶

RMSE of the predicted net changes and the actual net changes¶

Model with 70 stations¶

Net Rate targets approach¶

Target datasets : y_train_net, y_test_net¶

First baseline method for net rate approach¶

First model performance with net rate targets¶

Feature importance rfpimp¶

Model performance with randomized search parameters¶

Model performance with Grid Search parameters¶

Test on hold-out set¶

Save model for future use ¶

Performance analysis¶

Model with 140 columns targets¶

Model with 70 columns targets¶

Comments

Delete the moved stations from `station_df`¶

Outliers in `Duration`¶

A plot function for groups: `group_plot`¶

Create hourly trip dataframe: `hourly_trips`¶

Encode the `Month` column¶

Encode the `Day` column¶

Encode the '`Hour`' column¶

Data split with `TimeSeriesSplit`¶

`get_val_score_rf` function¶

Feature importance by `sklearn RandomForestRegressor`¶

Feature importance by `rfpimp`¶

Most important futures with `rfpimp`¶

Test with the reduced `X_train_perm`¶

RMSE of the `predicted net changes` and the `actual net changes`¶

`Net Rate` targets approach¶

Target datasets : `y_train_net`, `y_test_net`¶

Feature importance `rfpimp`¶

	2a	3a	4a	5a	6a	7a	8a	9a	10a	11a	12a	13a	14a	16a	21a	22a	85a	25a	86a	26a	27a	28a	29a	30a	31a	32a	33a	34a	35a	36a	37a	38a	41a	42a	45a	46a	47a	48a	87a	50a	51a	39a	54a	55a	56a	57a	58a	59a	60a	61a	62a	63a	64a	65a	66a	67a	68a	88a	70a	71a	90a	73a	74a	75a	76a	77a	80a	82a	83a	84a	2d	3d	4d	5d	6d	7d	8d	9d	10d	11d	12d	13d	14d	16d	21d	22d	85d	25d	86d	26d	27d	28d	29d	30d	31d	32d	33d	34d	35d	36d	37d	38d	41d	42d	45d	46d	47d	48d	87d	50d	51d	39d	54d	55d	56d	57d	58d	59d	60d	61d	62d	63d	64d	65d	66d	67d	68d	88d	70d	71d	90d	73d	74d	75d	76d	77d	80d	82d	83d	84d
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0