Feature Selection by Visualization
Thanks to Scikit-learn's instantiate-fit-transform-predict template, we can run our formatted data through as many models as we like without having to re-transform the data to fit each kind of model.
In Scikit-learn our main concern is optimizing the models.
For supervised problems,
- we know the target, so we can score the model,
- we can easily throw the data at all the different classifiers, and
- score them to find the optimal classifier (as sketched below).
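As a minimal sketch of this idea (using a synthetic dataset and an arbitrary pair of classifiers purely for illustration, not the bike data we analyze below), we could loop over candidate estimators and compare their cross-validated scores:
# A minimal sketch: compare a few candidate classifiers by cross-validated score
# (the synthetic data and the classifier choices are illustrative assumptions)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())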
We can automate these steps, but the difficulties for automation are:
- For a tightly constrained hyperparameter space, optimization tools might work.
- GridSearch can work if we have a known range for our hyperparameters (see the sketch after this list).
- As the hyperparameter space gets larger, it turns into a blind search.
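For instance, a grid search over a known, small hyperparameter range might look like the sketch below (the estimator and the grid values are illustrative assumptions, not the tuning we will do for the bike data):
# A sketch of grid search over a known hyperparameter range
# (RandomForestClassifier and the grid values are illustrative choices)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)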
So, in order to enhance
- feature selection,
- feature engineering,
- model selection and
- hyperparameter tuning,

we will try to utilize some tools for visualizing the machine learning steps.
Being able to visualize these steps lets us bring our intuition to bear on ML problems.
In this notebook we will work on the bike rental dataset and utilize visualization tools for feature selection.
Outline:¶
- About Dataset
- Data Summary
- Feature Engineering
- Outlier Analysis
- Visualizing Distribution Of Data
- Feature Selection
- Correlation Analysis
- Plot of Correlated Columns
- Dealing with the Input Errors
- Visualizing Riders Monthly, Daily, Hourly
About Dataset¶
- This dataset contains the hourly count of rental bikes between the years 2011 and 2012, with the corresponding weather and seasonal information.
- In bike sharing systems, like Vilo in Brussels, the whole process from membership to rental and return has become automatic.
- Through these systems, a user is able to easily rent a bike from a particular position and return it at another position.
- The characteristics of the data generated by these systems make them attractive for research.
- The duration of travel and the departure and arrival positions are explicitly recorded in these systems.
- This turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city.
- Hence, it is expected that most important events in the city could be detected by monitoring these data.
Kaggle terms for the dataset:¶
- The training set comprises the first 19 days of each month, while the test set runs from the 20th to the end of the month.
- Predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
Here is the Kaggle link for the dataset
# Import the necessary modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set()
Data Summary¶
# Read the bikeshare data
bike_data=pd.read_csv("train.csv", parse_dates=["datetime"], index_col="datetime")
bike_data.head(3)
# Get the summary of the dataset
bike_data.info()
Descriptions of Some Data Fields¶
- datetime - hourly date + timestamp
- season - 1 = winter, 2 = spring, 3 = summer, 4 = autumn
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor a holiday
- weather:
  - 1: Clear, Few clouds, Partly cloudy, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals
Feature Engineering¶
Since bike usage is closely related to the hour of the day and the month of the year, it can be a good idea to extract these as features from the date.
Let's create the new columns "hour", "weekday" and "month".
bike_data["month"]=bike_data.index.month
bike_data["weekday"]=bike_data.index.dayofweek
bike_data["hour"]=bike_data.index.hour
bike_data.head(3)
Outlier Analysis¶
Let's take a look at the difference between the mean and median of some columns to detect the existence of outlier values.
bike_data[["temp", "atemp", "humidity", "windspeed", "count"]].describe()
We notice that the count column has a large mean-median difference. Now let's see the boxplot and distribution of the count column.
Visualizing Distribution Of Data¶
# Create a figure and two axes for boxplot and distplot
fig, (ax_box, ax_dist) = plt.subplots(nrows=2, sharex=True, figsize=(10, 6), gridspec_kw={"height_ratios": (.15, .85)})
# Plot the boxplot on the ax_box axis
sns.boxplot(bike_data["count"], ax=ax_box)
# Plot the distplot on the ax_dist axis
sns.distplot(bike_data["count"], ax=ax_dist)
# Remove xlabel for the boxplot
ax_box.set(xlabel='');
# Plot the boxplot
ax=sns.boxplot(x="count", data=bike_data, width=0.2, orient="h")
# Add the violinplot on the same figure
sns.violinplot(x="count", data=bike_data, bw=.2, orient="h")
# Add the label and title
ax.set(xlabel='Riders', title="Violin & Box Plot of Riders");
By default, Seaborn boxplot whiskers extend to the points within 1.5 IQR of the upper and lower quartiles, so the points beyond the upper whisker are outliers. Let's remove the outliers.
# Remove the outliers
bike_data_inliers = bike_data.loc[bike_data["count"] <590, :]
# Reset the index of bike_data_inliers dataframe
bike_data_inliers.reset_index(inplace=True)
# Print the data size before and after removing the outliers
print ("Dataset size before outliers: ", len(bike_data))
print ("Dataset size after outliers: ", len(bike_data_inliers))
Feature Selection¶
After data preparation, the machine learning process usually continues with the questions:
- "Which features should I use?"
- "Which features are predictive?"

The machine learning workflow is the art of creating model selection triples:
- a combination of features,
- an algorithm, and
- hyperparameters that uniquely identify a model fitted on a specific data set.

As part of our feature selection, we want to
- identify features that have a linear relationship with each other, and
- avoid using variables that have strong correlations with each other.
Correlation Analysis¶
It is better to avoid feature redundancy for a few reasons:
- To keep the model simple and improve interpretability
- Too many features can lead us to the risk of overfitting.
- When our datasets are very large, using fewer features can speed up our computation time.
We will create the correlation matrix with the pandas .corr() method.
# The columns which will be taken into account for correlation analysis
bike_data_core= bike_data_inliers[["season", "month", "hour", "holiday", "weekday", "workingday","weather", "temp", "atemp", "humidity", "windspeed", "count"]]
# Create correlation matrix
corr_matx=bike_data_core.corr()
corr_matx
With this table, it is not easy to detect the high correlations at a glance. It is easier to make sense of the table when we apply some styling to it.
# Fit a style to the table
corr_matx.style.background_gradient()
- It looks better, but it is still not very descriptive.
- Especially with a large number of features this table can be hard to use.
- Extra filtering can help us select the relevant values.
- We can filter out the correlation values which are greater than 0.75 and not equal to 1.
- The values equal to 1 are the diagonal values.
# Create a boolean mask of the given conditions
# Use absolute value function to get the negative correlations also
corr_mask= corr_matx.apply(lambda col: (abs(col)> 0.75) & (abs(col)!=1))
# Count the True values in each column
corr_mask.sum()
By the count of True values we detected high correlations in the "season", "month", "atemp" and "temp" columns.
We can identify the related pairs by checking the table above, particularly for these columns (see the sketch below).
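As a small sketch, we can also list the strongly correlated pairs directly from the boolean mask instead of reading them off the table:
# A sketch: list the strongly correlated pairs from the boolean mask
pairs = corr_mask.stack()    # long format: (row, column) -> True/False
pairs = pairs[pairs]         # keep only the True entries
print(pairs.index.tolist())  # each pair appears twice, once in each order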
Filtering out the strong correlations by computing them like this is always important, especially for large datasets, because a visualization alone is not quantifiable.
Still, having an efficient visual tool would be great.
- Here we will utilize the Rank2D visualizer of the Yellowbrick library to compute Pearson correlations between all pairs of features.
- Yellowbrick classes are compatible with the Sklearn import-instantiate-fit-transform-predict template.
- Yellowbrick visualizers use Matplotlib under the hood, so we can apply the axes and figure functions to these visualizers.
# Import Rank2D class from yellowbrick
from yellowbrick.features import Rank2D
# Modify the figure size
fig, ax=plt.subplots(figsize=(14,7))
# Instantiate the Rank2D object with default arguments: visualizer
visualizer = Rank2D(algorithm="pearson")
# fit the visualizer
visualizer.fit_transform(bike_data_core)
# Plot the visualizer with .poof() method
visualizer.poof()
- This figure shows us the Pearson correlation between pairs of features.
- The density of the colors displays the magnitude of the correlation.
- A Pearson correlation of 1.0 means that there is a strong positive, linear relationship between the pair of variables, and -1.0 indicates a strong negative, linear relationship (a value of zero indicates no relationship).
- Therefore we are looking for dark red and dark blue boxes to identify strong correlations.
In this chart, we see that the temp and atemp columns and the season and month columns have a strong correlation.
This seems to make sense;
- the apparent temperature we feel outside depends on the actual temperature and other air quality factors, and
- the season of the year is described by the month!

In the cells above we found the same result by filtering, but having a visual table like this is awesome, again especially for larger datasets.
Plot of Correlated Columns¶
Now let's plot the temp and atemp columns to take a closer look.
# Use seaborn jointplot with kind=reg
g = sns.jointplot(x="temp", y= "atemp", data=bike_data_core, kind="reg", height=8)
The joint plot shows a scatter diagram of the apparent temperature on the y axis against the actual measured temperature on the x axis, and draws a line of best fit using a simple linear regression.
At a glance we see
- a very strong positive correlation between the features, and
- the range and distribution of each feature.

There appear to be some data errors in the dataset.
- These instances may need to be manually removed in order to improve the quality of the final model.
- By just looking at the scatter diagram above, it is not possible to tell which column caused the erroneous input.
- Even so, the horizontal line of data (constant y axis values) makes us suspicious about an input error in the atemp column.
- We will find them manually below.
- We can ultimately confirm the selection of the columns to remove by training our model on either value and scoring the results (see the sketch below).
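As a rough sketch of that confirmation step (the Ridge regressor, the base feature list and the cross-validation settings are illustrative assumptions, not the final modeling of the next post), we could score a simple model once with temp and once with atemp and compare:
# A sketch: compare a simple model trained with "temp" vs. "atemp"
# (Ridge and the base feature list are illustrative choices)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
base_features = ["hour", "weekday", "month", "humidity", "windspeed"]
y = bike_data_inliers["count"]
for extra in ["temp", "atemp"]:
    X = bike_data_inliers[base_features + [extra]]
    print(extra, cross_val_score(Ridge(), X, y, cv=5).mean())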
Erroneous inputs in "temp" or "atemp" columns¶
If we observe the values of the temp or atemp column without pairing them with each other, it is not easy to detect the data input errors, because these values are not extreme with respect to the ranges of the temp and atemp columns.
Let's search for them in pairs.
- First subset the bike_data with the condition 12 < atemp < 13 or 25 < temp < 35 (because the outliers fall in that interval, as seen in the scatter diagram), and
- select the columns "hour", "temp", and "atemp".
# Subset the relevant part of the bike_data: df_err
df_err= bike_data_inliers[(bike_data_inliers["atemp"]>12) & (bike_data_inliers["atemp"] <13)]
df_err= df_err[["hour", "temp", "atemp"]]
df_err.head()
Let's plot this subset dataframe with the erroneous inputs: df_err
df_err.temp.plot(marker="o", linewidth=0, figsize=(10,4));
We can narrow down the index values of the dataframe on the chart (x axis) with the xlim parameter of the plot method and find the indices of the erroneous values.
df_err.temp.plot(marker="o", linewidth=0, xlim=(8700, 9000), figsize=(10,2));
df_err.temp.plot(marker="o", linewidth=0, xlim=(8710, 8740), figsize=(10,2));
Now we know where the erroneous inputs are. Their index values are between 8710 and 8740.
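Instead of reading the indices off the chart, a programmatic check is also possible; the sketch below simply flags the rows where temp and atemp diverge by an unusually large amount (the 10-degree threshold is an arbitrary illustrative choice):
# A sketch: flag rows where "temp" and "atemp" diverge suspiciously
# (the 10-degree threshold is an arbitrary illustrative choice)
suspects = bike_data_inliers[(bike_data_inliers["temp"] - bike_data_inliers["atemp"]).abs() > 10]
print(suspects.index.min(), suspects.index.max())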
- Let's extract that slice of the bike_data,
- subset only the columns "datetime", "temp", "atemp" for ease of observation, and
- apply a styling to the erroneous inputs.
# Inspect the range 8710:8740 of the bike_data_inliers df
erroneous_df= bike_data_inliers.loc[8710: 8740, ["datetime", "temp","atemp"]]
def highlight_cols(s):
color = 'yellow'
return 'background-color: %s' % color
erroneous_df.style.applymap(highlight_cols, subset=pd.IndexSlice[8714:8734, "atemp"])
With this output we can see that the erroneous inputs are in the atemp column. We can replace them by interpolation.
# Extract the erroneous slice of atemp column
erroneous_slice=bike_data_inliers.loc[8712:8736, "atemp"]
# Replace the erroneous 12.12 values with nan
erroneous_slice.replace(12.12, np.nan, inplace=True)
erroneous_slice
# Interpolate the nan values
erroneous_slice.interpolate()
# Replace the erroneous part of the data frame with the interpolated data
bike_data_inliers.loc[8712:8736, "atemp"]= erroneous_slice.interpolate()
# Plot the "temp" and "atemp" columns again
g = sns.jointplot(x="temp", y= "atemp", data=bike_data_inliers, kind="reg", height=6)
It looks better.
Visualizing Riders Monthly, Daily, Hourly¶
In the final section of this notebook we want to see the bike rental patterns broken down by time. Since it is intuitive that bike usage is related to weather and time, we expect to see meaningful relations in the charts.
Here are our steps:
- Since we are working with time-based data created by humans, it is good practice to observe it by splitting it into "weekdays" and "weekends". If it were a dataset created by butterflies, of course we would not have done it this way!
- We will also start from the longer time frames and follow with the shorter time frames.

We will be plotting repeatedly, so it is more efficient to define a plot function instead of writing the script again each time.
# Define a barplot function: bar_plot
def bar_plot(df, label_x, label_y, str_title):
    '''Takes a dataframe (or series) and three strings: label of x-axis, label of y-axis and title;
    draws a bar plot'''
    plt.style.use('fivethirtyeight')
    # df.plot creates its own figure, so no extra plt.subplots() call is needed
    ax = df.plot(kind="bar",
                 figsize=(10, 5),
                 rot=70,
                 fontsize=11)
    ax.set_xlabel(label_x)
    ax.set_ylabel(label_y)
    ax.set_title(str_title)
    plt.show()
bike_data_inliers.dtypes
# Convert the date column to the datetimeindex
bike_data_inliers.set_index("datetime", inplace=True)
# Create the weekdays and weekends dataframes
weekdays=bike_data_inliers[bike_data_inliers.index.dayofweek.isin([0,1,2,3,4])]
weekends=bike_data_inliers[bike_data_inliers.index.dayofweek.isin([5, 6])]
# Groupby the monthly average riders
monthly=bike_data_inliers.groupby(bike_data_inliers.index.month)["count"].mean()
# Plot the monthly bike rentals
labelx= "Month"
labely="Average riders"
title= "Monthly Average Riders"
bar_plot(monthly, labelx, labely, title)
# Groupby the daily average bike rentals
daily= bike_data_inliers.groupby(bike_data_inliers.index.dayofweek)["count"].mean()
daily.index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', "Sat", "Sun"]
# Plot the daily bike rentals
labelx= "Day"
labely="Average riders"
title= "Daily Average Riders"
bar_plot(daily, labelx, labely, title)
# Groupby weekdays data by the hourly average bike rentals
hourly_weekdays= weekdays.groupby("hour")["count"].mean()
# Plot the hourly weekdays bike rentals
labelx= "Hour"
labely="Average riders"
title= "Hourly Average Riders in Weekdays"
bar_plot(hourly_weekdays, labelx, labely, title)
# Groupby the weekends data by the hourly average bike rentals
hourly_weekends= weekends.groupby("hour")["count"].mean()
# Plot the hourly weekends bike rentals
labelx= "Hour"
labely="Average riders"
title= "Hourly Average Riders in Weekends"
bar_plot(hourly_weekends, labelx, labely, title)
We can also use sns.pointplot to see the hourly rentals across all the days separately on one chart.
Here is some brief info about sns.pointplot from the official documentation:
A pointplot
- shows point estimates and confidence intervals using scatter plot glyphs.
- represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars. (in our chart we are not displaying the bars)
- Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables.
- They are particularly adept at showing interactions: how the relationship between levels of one categorical variable changes across levels of a second categorical variable.
- The lines that join each point from the same hue level allow interactions to be judged by differences in slope, which is easier for the eyes than comparing the heights of several groups of points or bars.
- It is important to keep in mind that a pointplot shows only the mean (or other estimator) value, but in many cases it may be more informative to show the distribution of values at each level of the categorical variables.
- In that case, other approaches such as a box or violin plot may be more appropriate.
# Plot the hourly rental means across weekdays and weekends on a pointplot
fig, ax=plt.subplots(figsize=(10, 6))
sns.pointplot(x="hour",
y="count",
hue="weekday",
data=bike_data_inliers,
errwidth=0,
scale=.5,
ax=ax);
hue_labels = ["Sun", "Mon","Tue","Wed","Thu","Fri","Sat"]
leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, hue_labels);
ax.set_title("Hourly Rental Mean Across the Days");
# Groupby by "weather" column and plot
weather_state=bike_data_inliers.groupby("weather")["count"].mean()
# Plot the average bike rentals across the weather state
labelx= "Weather"
labely="Average riders"
title= "Average Riders Across Weather State"
bar_plot(weather_state, labelx, labely, title)
Pointplot of weather state¶
# Plot the rental counts across the weather states on a pointplot
fig, ax=plt.subplots(figsize=(10, 4))
sns.pointplot(x="weather",
y="count",
hue="weather",
data=bike_data_inliers,
scale=.5,
join=True,
ax=ax);
hue_order = ["1", "2", "3", "4"]
leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, hue_order);
Regarding the description of the weather column given below,

weather:
1: Clear, Few clouds, Partly cloudy, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

the chart above does not look intuitive, because the mean of rentals at state-4 is higher than at state-3.
We should think about this and maybe question the reliability of the "weather" column (see the quick check below).
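One plausible explanation, not verified here, is that very few observations fall into weather state 4, which would make its mean unreliable; a quick check like the sketch below shows how many rows each state actually has:
# A quick check: how many observations does each weather state have?
print(bike_data_inliers["weather"].value_counts())
# If state 4 has only a handful of rows, its mean is not a reliable estimate.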
Wrap Up¶
In this notebook we
- explored the bike rental dataset,
- did outlier analysis and removed the outliers,
- made a correlation analysis for feature selection (to be used in the next post),
- utilized Yellowbrick's visualization tool for the correlation matrix,
- found input errors by plotting the features,
- replaced the erroneous inputs by interpolation, and
- finally displayed the bike rentals in different time frames.

During all these steps we tried to take advantage of visual tools in order to do our analysis more efficiently.
In the next post we will continue with the predictive analysis of the bike rentals and work on machine learning steps.