
SF Bay Area Bike Data

In this notebook we will be working on the Kaggle project: SF Bay Area Bike Share


Project in a nutshell

  • We are trying to predict the net change in the bike stock (bikes returned - bikes taken) at a specific station at a specific hour.

  • We have 3 datasets: station data, trip data, weather data
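As a quick sketch of the target variable, the net change can be computed from a toy trip log shaped like trip_df (the trips and station ids below are made up for illustration):

```python
import pandas as pd

# A made-up mini trip log with the same columns as trip_df
trips = pd.DataFrame({
    "Start Date": pd.to_datetime(["2015-08-31 23:26", "2015-08-31 23:11", "2015-08-31 23:50"]),
    "Start Station": [50, 31, 50],
    "End Date": pd.to_datetime(["2015-08-31 23:39", "2015-08-31 23:28", "2015-09-01 00:05"]),
    "End Station": [70, 27, 31],
})

# Bikes taken: departures per station per hour
departures = trips.groupby([trips["Start Date"].dt.floor("h"), "Start Station"]).size()
departures.index.names = ["Hour", "Station"]

# Bikes returned: arrivals per station per hour
arrivals = trips.groupby([trips["End Date"].dt.floor("h"), "End Station"]).size()
arrivals.index.names = ["Hour", "Station"]

# Net change in bike stock = bikes returned - bikes taken
net_change = arrivals.subtract(departures, fill_value=0).astype(int)
print(net_change)
```

Note that every departure is someone else's arrival, so the net change summed over all stations and hours is always zero; only its distribution across stations varies.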

In [1]:
# Notebook setup
import pandas as pd
import numpy as np
import glob
from pandas.tseries.holiday import USFederalHolidayCalendar
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import train_test_split


from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import joblib

from rfpimp import *
from pprint import pprint

import folium

sns.set()

%matplotlib inline
# Set the option to display the max number of columns and rows 
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 20000)

'Pen & Paper' thinking

What could drive bike rentals at each station?

Bike rentals at each station can fluctuate for two kinds of reasons:

  1. Station-specific factors related to each station's location and environment, like centrality, elevation difference, etc.
  2. Dynamic factors affecting all stations, like weather and time.

Here are some main factors:

Weather:
Since riders are directly exposed to the weather during a trip, weather conditions are expected to be among the main drivers.

  • Snow
  • Rain
  • Ice, etc.

Time:
Since we are working with data generated by human activity, it is natural to expect different usage patterns in different time periods:

  • Month
  • Day of the week
  • Holiday days
  • Weekdays and weekends
  • Hour of the day
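A minimal sketch of how these calendar features could be derived with pandas (the hourly index and column names below are illustrative, not from the project data):

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Illustrative hourly index covering one week
idx = pd.date_range("2014-09-01", "2014-09-07 23:00", freq="h")
time_df = pd.DataFrame(index=idx)
time_df["Month"] = idx.month
time_df["Day of Week"] = idx.dayofweek              # Monday=0 ... Sunday=6
time_df["Hour"] = idx.hour
time_df["Is Weekend"] = (idx.dayofweek >= 5).astype(int)

# Flag US federal holidays (2014-09-01 is Labor Day)
holidays = USFederalHolidayCalendar().holidays(start=idx.min(), end=idx.max())
time_df["Is Holiday"] = idx.normalize().isin(holidays).astype(int)
```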

Mobility in the city:
Since bikes are one transportation option, we need to take into account the nature of mobility (its distribution) in the city

  • Centrality
  • Population density of the area
  • Residential or office (job) area
  • Proximity to "hot" spots like parks, universities, culture and convention centers, and other convenient public transportation
  • Timing of major events (a concert, a sports game, etc.)

Comparison with other transportation alternatives:

  • Cost of other options
  • Price of gas
  • Time spent in traffic jams

Infrastructure:

  • Safe bike lanes
  • Connections

Data exploration

Loading datasets & summaries

Station data
In [2]:
# Read the `station_data` 
station_df = pd.read_csv("station_data.csv")
station_df.head(2)
Out[2]:
Id Name Lat Long Dock Count City
0 2 San Jose Diridon Caltrain Station 37.329732 -121.901782 27 San Jose
1 3 San Jose Civic Center 37.330698 -121.888979 15 San Jose
In [3]:
# Summary of station_data
station_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 6 columns):
Id            76 non-null int64
Name          76 non-null object
Lat           76 non-null float64
Long          76 non-null float64
Dock Count    76 non-null int64
City          76 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 3.6+ KB
Station locations on the map

To get a picture of the network, let's see where the stations are on the map.

Trip data
In [4]:
# Read the 'trip_data': trip_df
trip_df = pd.read_csv("trip_data.csv", 
                    parse_dates=['Start Date', 'End Date'], 
                    infer_datetime_format=True)

trip_df.head(2)
Out[4]:
Trip ID Start Date Start Station End Date End Station Subscriber Type
0 913460 2015-08-31 23:26:00 50 2015-08-31 23:39:00 70 Subscriber
1 913459 2015-08-31 23:11:00 31 2015-08-31 23:28:00 27 Subscriber
In [5]:
# Summary of trip_df
trip_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354152 entries, 0 to 354151
Data columns (total 6 columns):
Trip ID            354152 non-null int64
Start Date         354152 non-null datetime64[ns]
Start Station      354152 non-null int64
End Date           354152 non-null datetime64[ns]
End Station        354152 non-null int64
Subscriber Type    354152 non-null object
dtypes: datetime64[ns](2), int64(3), object(1)
memory usage: 16.2+ MB

There is one object-type column, Subscriber Type. We will need to encode its values as categories.

Weather data
  • The given weather dataset provides daily measurements; however, our analysis uses samples at hourly resolution.

  • Even though the given dataset provides zip-code-specific weather data, measurements can differ significantly over the course of a day, so an hourly weather dataset would serve better.

  • Therefore, instead of the given dataset, we will use weather data taken from Kaggle Datasets (Historical Hourly Weather Data 2012-2017).

  • There, hourly measurements of various weather attributes, such as temperature, humidity, and air pressure, are provided for many cities, including San Francisco.

  • Additionally, for each city, the country, latitude, and longitude are provided in a separate file.

Kaggle hourly weather dataset
  • Since the datasets share a common datetime column, we can read all of them and extract the "San Francisco" column from each to build a weather dataset for our area of interest.
In [6]:
# Pattern matching the weather attribute datasets: pattern
pattern = 'kaggle_data/*.csv'

# Save all matching files with glob function: weather_files
weather_files = glob.glob(pattern)

# Aggregate all the datasets in a list
# by subsetting the 'datetime' and 'San Francisco' column of each dataset
weather_df_lst = [pd.read_csv(file, usecols=["datetime", "San Francisco"]) for file in weather_files]

# Concat all the dataframes in the weather_df_lst
weather_df = pd.concat(weather_df_lst, axis=1)
print(weather_df.head(2))

# Set the first 'datetime' column as index and
# Drop the other 'datetime' columns
weather_df = weather_df.set_index(weather_df.iloc[:, 0]).drop("datetime", axis=1)

# Convert the index to datetime
weather_df.index=pd.to_datetime(weather_df.index)

# Set the column names 
column_names = ['Humidity', 'Pressure', 'Temperature', 'Weather Description', 'Wind Direction', 'Wind Speed']
weather_df.columns = column_names

# Subset the date interval [2014-09-01: 2015-08-31] (interval of bike trip dates)
weather_df = weather_df["2014-09-01": "2015-08-31"]
weather_df.head(3)
              datetime  San Francisco             datetime  San Francisco  \
0  2012-10-01 12:00:00            NaN  2012-10-01 12:00:00            NaN   
1  2012-10-01 13:00:00           88.0  2012-10-01 13:00:00         1009.0   

              datetime  San Francisco             datetime  San Francisco  \
0  2012-10-01 12:00:00            NaN  2012-10-01 12:00:00            NaN   
1  2012-10-01 13:00:00         289.48  2012-10-01 13:00:00     light rain   

              datetime  San Francisco             datetime  San Francisco  
0  2012-10-01 12:00:00            NaN  2012-10-01 12:00:00            NaN  
1  2012-10-01 13:00:00          150.0  2012-10-01 13:00:00            2.0  
Out[6]:
Humidity Pressure Temperature Weather Description Wind Direction Wind Speed
datetime
2014-09-01 00:00:00 72.0 1024.0 293.995500 sky is clear 237.0 2.0
2014-09-01 01:00:00 69.0 1024.0 294.414333 sky is clear 241.0 2.0
2014-09-01 02:00:00 72.0 1024.0 293.579667 sky is clear 239.0 2.0
In [7]:
weather_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8760 entries, 2014-09-01 00:00:00 to 2015-08-31 23:00:00
Data columns (total 6 columns):
Humidity               8760 non-null float64
Pressure               8760 non-null float64
Temperature            8760 non-null float64
Weather Description    8760 non-null object
Wind Direction         8759 non-null float64
Wind Speed             8760 non-null float64
dtypes: float64(5), object(1)
memory usage: 479.1+ KB
  • There is a missing value in Wind Direction column.
  • Weather Description is categorical. We need to convert its categories into dummy variables.
In [8]:
# Summary statistics of weather_df
weather_df.describe().T
Out[8]:
count mean std min 25% 50% 75% max
Humidity 8760.0 86.355594 12.037149 26.000000 78.000 88.000000 97.000000 100.00
Pressure 8760.0 1026.396119 8.010935 985.000000 1023.000 1028.000000 1032.000000 1044.00
Temperature 8760.0 288.063666 4.538506 276.145333 284.987 287.466917 290.150625 309.63
Wind Direction 8759.0 208.150474 86.875425 0.000000 170.000 232.000000 270.000000 360.00
Wind Speed 8760.0 2.476370 1.904894 0.000000 1.000 2.000000 3.000000 13.00

Missing values

In [9]:
# Look at the missing values in 3 datasets
print(station_df.isna().values.any())
print(trip_df.isna().values.any())
print(weather_df.isna().values.any())
False
False
True
In [10]:
# Forward-fill the single missing value in the weather data
weather_df["Wind Direction"] = weather_df["Wind Direction"].ffill()

Now there are no missing values left that pandas can detect.

Categorical features

  • Let's encode the categorical columns Weather Description and Subscriber Type in weather_df and trip_df respectively with dummy variables.

  • We will do the encoding for the categorical variables of the trip_df after doing some plotting

In [11]:
# Create dummy variables from the 'Weather Description' categories
dummies = pd.get_dummies(weather_df["Weather Description"], drop_first=True)
# Add the dummy variables to weather_df and
# drop the original column
weather_df = pd.concat([weather_df, dummies], axis=1).drop("Weather Description", axis=1)
weather_df.head(2)
Out[11]:
Humidity Pressure Temperature Wind Direction Wind Speed drizzle few clouds fog haze heavy intensity rain light intensity drizzle light intensity shower rain light rain mist moderate rain overcast clouds proximity shower rain proximity thunderstorm proximity thunderstorm with rain scattered clouds shower rain sky is clear smoke thunderstorm thunderstorm with heavy rain thunderstorm with light rain thunderstorm with rain very heavy rain
datetime
2014-09-01 00:00:00 72.0 1024.0 293.995500 237.0 2.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
2014-09-01 01:00:00 69.0 1024.0 294.414333 241.0 2.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0

Stations departures count

Let's count the departures from each station.

In [12]:
fig, ax=plt.subplots(figsize=(16, 5))
sns.countplot(x=trip_df["Start Station"], ax=ax);