SF Bay Area Bike Data
In this notebook we will be working on the Kaggle project: SF Bay Area Bike Share
Project in a nutshell
We are trying to predict the net change in the bike stock (bikes returned - bikes taken) at a specific station at a specific hour.
We have 3 datasets:
- station data
- trip data
- weather data
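To make the prediction target concrete, here is a minimal sketch of how the net change (bikes returned minus bikes taken) per station and hour could be computed from trip records. The toy data and the `End Station` column name are assumptions for illustration; only `Start Date`, `End Date` and `Start Station` appear in this notebook.

```python
import pandas as pd

# Toy trips (hypothetical values); 'End Station' is an assumed column name.
trips = pd.DataFrame({
    "Start Date": pd.to_datetime(["2014-09-01 08:05", "2014-09-01 08:40", "2014-09-01 09:10"]),
    "End Date": pd.to_datetime(["2014-09-01 08:20", "2014-09-01 09:05", "2014-09-01 09:30"]),
    "Start Station": [1, 1, 2],
    "End Station": [2, 2, 1],
})


def net_change(trips):
    """Return bikes returned minus bikes taken per (station, hour)."""
    # Bikes taken: count trips by start station and start hour
    taken = (trips.groupby(["Start Station", trips["Start Date"].dt.floor("h")])
                  .size().rename_axis(["station", "hour"]))
    # Bikes returned: count trips by end station and end hour
    returned = (trips.groupby(["End Station", trips["End Date"].dt.floor("h")])
                     .size().rename_axis(["station", "hour"]))
    # Station-hours missing on one side count as zero
    return returned.sub(taken, fill_value=0).astype(int).rename("net_change")


net = net_change(trips)
print(net)
```

Summed over all stations and hours, the net change is zero in this toy example, because every bike taken is also returned somewhere.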
Table of contents
Model with 140 stations
- First baseline for the model with 140 stations
- 'get_val_score_rf' function
- First scores of the model
- Feature importance by Sklearn random forest
- Feature importance by 'rfpimp'
- Hyperparameter tuning
- Grid Search CV
- Hold-out set score
- Net rates dataframe
- RMSE of the 'predicted net changes' and the 'actual net changes'
# Notebook setup
import pandas as pd
import numpy as np
import glob
from pandas.tseries.holiday import USFederalHolidayCalendar
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from rfpimp import *
from pprint import pprint
import folium
%matplotlib inline
# Set the option to display the max number of columns and rows
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 20000)
'Pen & Paper' thinking
What can be the drivers of bike rentals at each station?
Bike rentals at each station can fluctuate for two kinds of reasons:
- Station-specific factors, which relate to the location and environment of each station, such as centrality and elevation difference
- Dynamic factors that affect all stations, such as weather and time
Here are some main factors:
Since bike users are directly exposed to the weather during the ride, weather is expected to be one of the main drivers.
- Snow
- Rain
- Ice etc
Since we are working with data generated by human activity, it is intuitive to expect different patterns in different time periods:
- Month
- Day of the week
- Holiday days
- Weekdays and weekends
- Hour of the day
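As a rough illustration of how these time-based factors could become model features, the sketch below derives them from an hourly `DatetimeIndex`. The function name `add_time_features` and the feature names are hypothetical; `USFederalHolidayCalendar` is the calendar already imported in the notebook setup.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar


def add_time_features(index):
    """Build a frame of calendar features for an hourly DatetimeIndex."""
    # Federal holidays falling inside the covered period
    holidays = USFederalHolidayCalendar().holidays(start=index.min(), end=index.max())
    features = pd.DataFrame(index=index)
    features["month"] = index.month
    features["day_of_week"] = index.dayofweek        # Monday=0 ... Sunday=6
    features["hour"] = index.hour
    features["is_weekend"] = index.dayofweek >= 5    # Saturday or Sunday
    # Compare on calendar days, ignoring the hour component
    features["is_holiday"] = index.normalize().isin(holidays)
    return features


# Saturday 2014-08-30 through Tuesday 2014-09-02 (Labor Day is Monday 2014-09-01)
hours = pd.date_range("2014-08-30", "2014-09-02 23:00", freq="h")
time_features = add_time_features(hours)
```

Whether to keep `day_of_week` and `is_weekend` together is a modelling choice; tree-based models can usually cope with the redundancy.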
Mobility in the city
Since bikes are a means of transportation, we need to take into account how mobility is distributed across the city
- Centrality
- Population density of the area
- Residential or office (job) area
- Closeness to "hot" spots like parks, universities, and culture and convention centers, and the availability of other convenient public transportation
- Times of major events (concerts, sports events, etc.)
Comparison with other transportation alternatives
- Cost of other options
- Price of gas
- Time spent in traffic jams
- Safe bike lanes
- Connections
# Read the `station_data`
station_df = pd.read_csv("station_data.csv")
# Summary of station_data
Station locations on the map
To get a picture of the stations, let's see where they are on the map.
Trip data
# Read the 'trip_data': trip_df
trip_df = pd.read_csv("trip_data.csv",
                      parse_dates=['Start Date', 'End Date'],
                      )
# Summary of trip_df
There is an object-type column, Subscriber Type. We need to encode this column's values as categories.
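One possible way to do this is pandas' categorical dtype, which stores each distinct value once and exposes integer codes. This is only a sketch on toy data; the values `Subscriber` and `Customer` are assumed for illustration and the real categories in `trip_df` may differ.

```python
import pandas as pd

# Toy stand-in for trip_df (hypothetical values)
toy_trips = pd.DataFrame({"Subscriber Type": ["Subscriber", "Customer", "Subscriber"]})

# Convert the object column to the categorical dtype
toy_trips["Subscriber Type"] = toy_trips["Subscriber Type"].astype("category")

# Categories are sorted alphabetically; codes are their integer positions
categories = list(toy_trips["Subscriber Type"].cat.categories)
codes = toy_trips["Subscriber Type"].cat.codes
```

Integer codes imply an ordering that has no meaning for unordered categories, which is one reason dummy variables are often preferred as model input.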
Weather data
The provided weather dataset has daily resolution, whereas our analysis uses hourly samples.
Although the provided dataset is specific to each zip code, weather can change significantly within a single day, so an hourly weather dataset is a better fit.
Therefore, instead of the provided dataset, we will use weather data from Kaggle Datasets (Historical Hourly Weather Data 2012-2017).
It provides hourly measurements of various weather attributes, such as temperature, humidity, and air pressure, for many cities, including San Francisco.
Additionally, the country, latitude, and longitude of each city are given in a separate file.
Kaggle hourly weather dataset
- Since the datasets share a common datetime column, we can read all of them and extract the "San Francisco" columns to create a weather dataset for our area of interest.
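# Pattern of weather attributes datasets: pattern
# (a forward slash works on Windows as well and avoids the invalid '\*' escape)
pattern = 'kaggle_data/*.csv'
# Save all matching files with glob function: weather_files
# (sorted, because glob returns files in arbitrary order and the column names are assigned by position)
weather_files = sorted(glob.glob(pattern))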
# Aggregate all the datasets in a list
# by subsetting the 'datetime' and 'San Francisco' column of each dataset
weather_df_lst = [pd.read_csv(file, usecols=["datetime", "San Francisco"]) for file in weather_files]
# Concat all the dataframes in the weather_df_lst
weather_df = pd.concat(weather_df_lst, axis=1)
# Set the first 'datetime' column as index and
# Drop the other 'datetime' columns
weather_df = weather_df.set_index(weather_df.iloc[:, 0]).drop("datetime", axis=1)
# Convert the index to datetime
weather_df.index = pd.to_datetime(weather_df.index)
# Set the column names
column_names = ['Humidity', 'Pressure', 'Temperature', 'Weather Description', 'Wind Direction', 'Wind Speed']
weather_df.columns = column_names
# Subset the date interval [2014-09-01: 2015-08-31] (interval of bike trip dates)
weather_df = weather_df.loc["2014-09-01":"2015-08-31"]
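The assembly above relies on every file having the same rows in the same order. As an alternative sketch (not the approach used in this notebook), each file can be read with `datetime` as a parsed index so that `pd.concat` aligns rows by timestamp and names each column explicitly. The in-memory CSVs, their values, and the `Portland` column are hypothetical stand-ins for the real files.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for two per-attribute CSV files (hypothetical values)
humidity_csv = io.StringIO(
    "datetime,Portland,San Francisco\n"
    "2014-09-01 00:00:00,80,88\n"
    "2014-09-01 01:00:00,81,87\n"
)
temperature_csv = io.StringIO(
    "datetime,Portland,San Francisco\n"
    "2014-09-01 00:00:00,290.1,289.5\n"
    "2014-09-01 01:00:00,289.8,289.0\n"
)
sources = {"Humidity": humidity_csv, "Temperature": temperature_csv}

columns = []
for name, source in sources.items():
    # Keep only the San Francisco column, indexed by the parsed datetime
    series = pd.read_csv(source, usecols=["datetime", "San Francisco"],
                         index_col="datetime", parse_dates=True)["San Francisco"]
    columns.append(series.rename(name))

# Rows are aligned on the datetime index rather than on row position
toy_weather = pd.concat(columns, axis=1)

# Partial-string slicing on a DatetimeIndex includes the whole end day
subset = toy_weather.loc["2014-09-01":"2014-09-01"]
```

Naming each column at read time removes the dependency on file order that the positional `column_names` assignment has.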
- There is a missing value in the Wind Direction column.
- The Weather Description column is categorical. We need to convert the categories into dummy variables.
# Summary statistics of station_df
Missing values
# Look at the missing values in 3 datasets
# Impute the single missing value in weather data
weather_df["Wind Direction"] = weather_df["Wind Direction"].ffill()
Now there are no remaining missing values of the kind pandas can detect (NaN).
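The check and the forward fill can be sketched on a toy frame as follows. The values are hypothetical; `isnull().sum()` counts the NaN entries per column, and `ffill()` copies the previous hour's value into the gap.

```python
import numpy as np
import pandas as pd

# Toy hourly weather frame with one gap (hypothetical values)
toy = pd.DataFrame(
    {"Wind Direction": [270.0, np.nan, 250.0], "Wind Speed": [3.0, 4.0, 2.0]},
    index=pd.date_range("2014-09-01", periods=3, freq="h"),
)

# Count missing values per column before imputation
missing_before = toy.isnull().sum()

# Forward fill: reuse the last observed value
toy["Wind Direction"] = toy["Wind Direction"].ffill()

# Count again to confirm nothing is left
missing_after = toy.isnull().sum()
```

Forward filling is a reasonable choice for a single hourly gap, since it never uses information from the future.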
Categorical features
Let's encode the categorical columns Weather Description and Subscriber Type, in weather_df and trip_df respectively, as dummy variables.
- We will encode the categorical variables of trip_df after doing some plotting.
# Create dummy variables from 'Weather Description' column categories
dummies = pd.get_dummies(weather_df["Weather Description"], drop_first=True)
# Add the dummy variables to the weather_df and
# Drop the original features from the weather_df
weather_df = pd.concat([weather_df, dummies], axis=1).drop("Weather Description", axis=1)
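To show what `drop_first=True` does in the encoding above, here is a small sketch on hypothetical weather descriptions. The first category in sorted order is dropped, and it is then represented implicitly by a row in which all remaining dummy columns are zero.

```python
import pandas as pd

# Hypothetical weather descriptions
desc = pd.Series(["mist", "sky is clear", "light rain", "mist"],
                 name="Weather Description")

# One column per category
full = pd.get_dummies(desc)

# One column fewer: the first category in sorted order ("light rain") is dropped
reduced = pd.get_dummies(desc, drop_first=True)

# The "light rain" row (position 2) has no active dummy column
light_rain_row_sum = int(reduced.iloc[2].sum())
```

Dropping one level avoids perfectly collinear columns, which matters mostly for linear models; for a random forest it mainly saves one column.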
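fig, ax = plt.subplots(figsize=(16, 5))
# Pass the column via the x keyword; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=trip_df["Start Station"], ax=ax);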