SF Bay Area Bike Data
In this notebook we will be working on the Kaggle project: SF Bay Area Bike Share.
Project in a nutshell
- We are trying to predict the net change in the bike stock (bikes returned - bikes taken) at a specific station at a specific hour (see the sketch after this list).
- We have 3 datasets:
  - station data
  - trip data
  - weather data
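To make the target concrete, here is a minimal sketch of how the hourly net change could be computed from the trip records once trip_df is loaded below; the 'End Station' column and the 'Station'/'Hour' index names are illustrative assumptions.
# Hedged sketch: hourly net change = arrivals - departures per station,
# computed from trip_df once it is read in the trip data section below
departures = trip_df.groupby(
    [trip_df["Start Station"], trip_df["Start Date"].dt.floor("H")]).size()
arrivals = trip_df.groupby(
    [trip_df["End Station"], trip_df["End Date"].dt.floor("H")]).size()
departures.index.names = arrivals.index.names = ["Station", "Hour"]
net_change = arrivals.sub(departures, fill_value=0).astype(int)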
Table of contents
- Model with 140 stations
  - First baseline for the model with 140 stations
  - 'get_val_score_rf' function
  - First scores of the model
  - Feature importance by Sklearn random forest
  - Feature importance by 'rfpimp'
  - Hyperparameter tuning
  - Grid Search CV
  - Hold-out set score
  - Net rates dataframe
  - RMSE of the 'predicted net changes' and the 'actual net changes'
# Notebook setup
import pandas as pd
import numpy as np
import glob
from pandas.tseries.holiday import USFederalHolidayCalendar
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import joblib  # 'sklearn.externals.joblib' is deprecated and removed in newer scikit-learn
from rfpimp import *
from pprint import pprint
import folium
sns.set()
%matplotlib inline
# Set the option to display the max number of columns and rows
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 20000)
'Pen & Paper' thinking
What can be the drivers of the bike rentals in each station?
Bike rentals in each station can fluctuate for two kinds of reasons:
- Station-specific factors, which are more about the location and the environment of each station, like centrality, height difference etc.
- Dynamic factors affecting all the stations, like weather and time.
Here are some main factors:
Weather:
Since bike users are directly exposed to the weather conditions during the ride, weather is expected to be one of the main parameters.
- Snow
- Rain
- Ice etc.
Time:
Since we are working with data created by humans, it is intuitive to expect different characteristics in different time periods (see the sketch after this list).
- Month
- Day of the week
- Holidays
- Weekdays and weekends
- Hour of the day
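A minimal sketch of how such time features could be extracted from an hourly DatetimeIndex, using the USFederalHolidayCalendar imported in the setup cell; the index and feature names here are illustrative assumptions, not the notebook's actual features.
# Hedged sketch: derive time features from an hourly DatetimeIndex;
# 'idx' and the feature names are illustrative
idx = pd.date_range("2014-09-01", "2015-08-31 23:00", freq="H")
time_features = pd.DataFrame(index=idx)
time_features["Month"] = idx.month
time_features["Day of Week"] = idx.dayofweek            # 0 = Monday
time_features["Hour"] = idx.hour
time_features["Weekend"] = (idx.dayofweek >= 5).astype(int)
# Flag US federal holidays with the calendar from the setup cell
holidays = USFederalHolidayCalendar().holidays(start=idx.min(), end=idx.max())
time_features["Holiday"] = idx.normalize().isin(holidays).astype(int)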
Mobility in the city:
Since bikes are one option for transportation, we need to take into account the nature (distribution) of the mobility in the city.
- Centrality
- Population density of the area
- Residential or office (job) area
- Closeness to "hot" spots like parks, universities, culture and convention centers, and other easy public transportation options
- Times of critical events (a concert, a sports activity etc.)
Comparison with other transportation alternatives:
- Cost of other options
- Price of gas
- Time spent in traffic jams
Infrastructure:
- Safe bike lanes
- Connections
# Read the `station_data`
station_df = pd.read_csv("station_data.csv")
station_df.head(2)
# Summary of station_data
station_df.info()
Station locations on the map
To get a picture of the stations, let's see where they are on the map.
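A minimal sketch of how the stations could be plotted with folium (imported in the setup cell); the 'Lat', 'Long' and 'Name' column names are assumptions about station_df.
# Hedged sketch: plot the stations with folium;
# assumes station_df has 'Lat', 'Long' and 'Name' columns
station_map = folium.Map(location=[station_df["Lat"].mean(),
                                   station_df["Long"].mean()],
                         zoom_start=12)
for _, row in station_df.iterrows():
    folium.CircleMarker(location=[row["Lat"], row["Long"]],
                        radius=3,
                        popup=row["Name"]).add_to(station_map)
station_map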
Trip data
# Read the 'trip_data': trip_df
trip_df = pd.read_csv("trip_data.csv",
parse_dates=['Start Date', 'End Date'],
infer_datetime_format=True)
trip_df.head(2)
# Summary of trip_df
trip_df.info()
There is an object-type column, Subscriber Type. We need to encode this column's values as categories.
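A minimal sketch of one way this encoding could look; the actual encoding of trip_df is done later in the notebook, after some plotting, and 'trip_encoded' is a hypothetical name.
# Hedged sketch: one-hot encode 'Subscriber Type' with dummy variables;
# the notebook performs the actual encoding later, after some plotting
subscriber_dummies = pd.get_dummies(trip_df["Subscriber Type"], drop_first=True)
trip_encoded = pd.concat([trip_df, subscriber_dummies], axis=1)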
Weather data
- The given weather dataset provides weather measurements with daily precision; however, we will make our analysis with samples at hourly resolution.
- Even though the given dataset provides zip-code-specific weather data, measurements can differ significantly over the course of a day. So it would be better to have an hourly weather dataset.
- Therefore, instead of using the given dataset, we will use the weather data taken from Kaggle Datasets (Historical Hourly Weather Data 2012-2017).
- Here, hourly measurements of various weather attributes, such as temperature, humidity, air pressure, etc., are provided for many cities, including San Francisco.
- Additionally, for each city we also have the country, latitude and longitude information in a separate file.
Kaggle hourly weather dataset
- Since the datasets share a common datetime column, we can read all of them and extract the "San Francisco" column from each in order to create a weather dataset for our area of interest.
# Pattern of weather attributes datasets: pattern
pattern = 'kaggle_data/*.csv'  # forward slashes work with glob on Windows too
# Save all matching files with glob function: weather_files
weather_files = glob.glob(pattern)
# Aggregate all the datasets in a list
# by subsetting the 'datetime' and 'San Francisco' column of each dataset
weather_df_lst = [pd.read_csv(file, usecols=["datetime", "San Francisco"]) for file in weather_files]
# Concat all the dataframes in the weather_df_lst
weather_df = pd.concat(weather_df_lst, axis=1)
print(weather_df.head(2))
# Set the first 'datetime' column as index and
# Drop the other 'datetime' columns
weather_df = weather_df.set_index(weather_df.iloc[:, 0]).drop("datetime", axis=1)
# Convert the index to datetime
weather_df.index=pd.to_datetime(weather_df.index)
# Set the column names
column_names = ['Humidity', 'Pressure', 'Temperature', 'Weather Description', 'Wind Direction', 'Wind Speed']
weather_df.columns = column_names
weather_df
# Subset the date interval [2014-09-01: 2015-08-31] (interval of bike trip dates)
weather_df = weather_df["2014-09-01": "2015-08-31"]
weather_df.head(3)
weather_df.info()
- There is a missing value in the Wind Direction column.
- The Weather Description column is categorical. We need to convert the categories into dummy variables.
# Summary statistics of weather_df
weather_df.describe().T
Missing values
# Look at the missing values in 3 datasets
print(station_df.isna().values.any())
print(trip_df.isna().values.any())
print(weather_df.isna().values.any())
# Impute the single missing value in weather data
weather_df["Wind Direction"].fillna(method='ffill', inplace=True)
Now we have no explicit missing values (the ones pandas can detect as NaN).
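There may still be implicit gaps, i.e. hours missing from the index entirely. A minimal sketch of how such gaps could be checked; the expected hourly frequency is an assumption.
# Hedged sketch: check for hours missing from the index entirely,
# which isna() cannot detect; hourly frequency is assumed
expected_index = pd.date_range(weather_df.index.min(),
                               weather_df.index.max(), freq="H")
missing_hours = expected_index.difference(weather_df.index)
print(len(missing_hours), "missing hourly timestamps")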
Categorical features
- Let's encode the categorical columns Weather Description and Subscriber Type (in weather_df and trip_df respectively) with dummy variables.
- We will do the encoding for the categorical variables of trip_df after doing some plotting.
# Create dummy variables from 'Weather Description' column categories
dummies = pd.get_dummies(weather_df["Weather Description"], drop_first=True)
# Add the dummy variables to the weather_df and
# drop the original feature from the weather_df
weather_df = pd.concat([weather_df, dummies], axis=1).drop("Weather Description", axis=1)
weather_df.head(2)
# Count of trips per start station
fig, ax = plt.subplots(figsize=(16, 5))
sns.countplot(x=trip_df["Start Station"], ax=ax)
plt.xticks(rotation=90);