Feature Selection by Visualization
Thanks to Scikit-learn's instantiate-fit-transform-predict template, we can run our formatted data through as many models as we like without having to re-transform the data to fit each kind of model.
In Scikit-learn our main concern is optimizing the models.
For supervised problems,
- we know the target, so we can score the model,
- we can easily throw the data at all the different classifiers, and
- score them to find the optimal classifier (as sketched below).
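As a minimal sketch of this idea (using a synthetic dataset and an arbitrary pair of classifiers purely for illustration, not the bike data we analyze below), we could loop over candidate estimators and compare their cross-validated scores:
# A minimal sketch: compare a few candidate classifiers by cross-validated score
# (the synthetic data and the classifier choices are illustrative assumptions)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())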
We can automate these steps, but the difficulties for automation are:
- For a tightly constrained hyperparameter space, optimization tools might work.
- GridSearch can work if we have a known range for our hyperparameters (see the sketch after this list).
- As the hyperparameter space gets larger, it turns into a blind search.
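For instance, a grid search over a known, small hyperparameter range might look like the sketch below (the estimator and the grid values are illustrative assumptions, not the tuning we will do for the bike data):
# A sketch of grid search over a known hyperparameter range
# (RandomForestClassifier and the grid values are illustrative choices)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)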
So, in order to enhance
- feature selection,
- feature engineering,
- model selection and
- hyperparameter tuning,

we will try to utilize some tools for visualizing the machine learning steps.
Being able to visualize these steps lets us bring our intuition to bear on ML problems.
In this notebook we will work on the bike rental dataset and utilize visualization tools for feature selection.
Outline:¶
- About Dataset
- Data Summary
- Feature Engineering
- Outlier Analysis
- Visualizing Distribution Of Data
- Feature Selection
- Correlation Analysis
- Plot of Correlated Columns
- Dealing with the Input Errors
- Visualizing Riders Monthly, Daily, Hourly
About Dataset¶
- This dataset contains the hourly count of rental bikes between the years 2011 and 2012, with the corresponding weather and seasonal information.
- In bike sharing systems, like Vilo in Brussels, the whole process from membership to rental and return has become automatic.
- Through these systems, a user is able to easily rent a bike from a particular position and return it at another position.
- The characteristics of the data generated by these systems make them attractive for research.
- The duration of travel and the departure and arrival positions are explicitly recorded in these systems.
- This turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city.
- Hence, it is expected that most important events in the city could be detected by monitoring these data.
Kaggle terms for the dataset:¶
- The training set comprises the first 19 days of each month, while the test set runs from the 20th to the end of the month.
- Predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
Here is the Kaggle link for the dataset
# Import the necessary modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set()
Data Summary¶
# Read the bikeshare data
bike_data=pd.read_csv("train.csv", parse_dates=["datetime"], index_col="datetime")
bike_data.head(3)
# Get the summary of the dataset
bike_data.info()
Descriptions of Some Data Fields¶
- datetime - hourly date + timestamp
- season - 1 = winter, 2 = spring, 3 = summer, 4 = autumn
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor a holiday
- weather:
  - 1: Clear, Few clouds, Partly cloudy, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals
Feature Engineering¶
Since bike usage is closely related to the hour of the day and the month of the year, it can be a good idea to extract these as features from the date.
Let's create the new columns "hour", "weekday" and "month".
bike_data["month"]=bike_data.index.month
bike_data["weekday"]=bike_data.index.dayofweek
bike_data["hour"]=bike_data.index.hour
bike_data.head(3)
Outlier Analysis¶
Let's take a look at the difference between the mean and median of some columns to detect the existence of outlier values.
bike_data[["temp", "atemp", "humidity", "windspeed", "count"]].describe()
We notice that the count column has a large mean-median difference. Now let's see the boxplot and distribution of the count column.
Visualizing Distribution Of Data¶
# Create a figure and two axes for boxplot and distplot
fig, (ax_box, ax_dist) = plt.subplots(nrows=2, sharex=True, figsize=(10, 6), gridspec_kw={"height_ratios": (.15, .85)})
# Plot the boxplot on the ax_box axis
sns.boxplot(bike_data["count"], ax=ax_box)
# Plot the distplot on the ax_dist axis
sns.distplot(bike_data["count"], ax=ax_dist)
# Remove xlabel for the boxplot
ax_box.set(xlabel='');
# Plot the boxplot
ax=sns.boxplot(x="count", data=bike_data, width=0.2, orient="h")
# Add the violinplot on the same figure
sns.violinplot(x="count", data=bike_data, bw=.2, orient="h")
# Add the label and title
ax.set(xlabel='Riders', title="Violin & Box Plot of Riders");
By default, Seaborn boxplot whiskers extend to the points within 1.5 IQR of the upper and lower quartiles, so the points beyond the upper whisker are outliers. Let's remove the outliers.
# Remove the outliers
bike_data_inliers = bike_data.loc[bike_data["count"] <590, :]
# Reset the index of bike_data_inliers dataframe
bike_data_inliers.reset_index(inplace=True)
# Print the data size before and after removing the outliers
print ("Dataset size before outliers: ", len(bike_data))
print ("Dataset size after outliers: ", len(bike_data_inliers))
Feature Selection¶
After data preparation, the machine learning process usually continues with the questions:
- "Which features should I use?"
- "Which features are predictive?"

The machine learning workflow is the art of creating model selection triples:
- a combination of features,
- an algorithm, and
- hyperparameters that uniquely identify a model fitted on a specific data set.

As part of our feature selection, we want to
- identify features that have a linear relationship with each other, and
- avoid using variables that have strong correlations with each other.
Correlation Analysis¶
It is better to avoid feature redundancy for a few reasons:
- To keep the model simple and improve interpretability
- Too many features can lead us to the risk of overfitting.
- When our datasets are very large, using fewer features can speed up our computation time.
We will create the correlation matrix with the pandas .corr() method.
# The columns which will be taken into account for correlation analysis
bike_data_core= bike_data_inliers[["season", "month", "hour", "holiday", "weekday", "workingday","weather", "temp", "atemp", "humidity", "windspeed", "count"]]
# Create correlation matrix
corr_matx=bike_data_core.corr()
corr_matx
With this table, it is not easy to detect the high correlations at a glance. It is easier to make sense of the table when we apply some styling to it.
# Fit a style to the table
corr_matx.style.background_gradient()
- It looks better, but it is still not very descriptive.
- Especially with a large number of features this table can be hard to use.
- Extra filtering can help us select the relevant values.
- We can filter out the correlation values which are greater than 0.75 and not equal to 1.
- The values equal to 1 are the diagonal values.
# Create a boolean mask of the given conditions
# Use absolute value function to get the negative correlations also
corr_mask= corr_matx.apply(lambda col: (abs(col)> 0.75) & (abs(col)!=1))
# Count the True values in each column
corr_mask.sum()
By the count of True values we detected high correlations in the "season", "month", "atemp" and "temp" columns.
We can identify the related pairs by checking the table above, particularly for these columns (see the sketch below).
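As a small sketch, we can also list the strongly correlated pairs directly from the boolean mask instead of reading them off the table:
# A sketch: list the strongly correlated pairs from the boolean mask
pairs = corr_mask.stack()    # long format: (row, column) -> True/False
pairs = pairs[pairs]         # keep only the True entries
print(pairs.index.tolist())  # each pair appears twice, once in each order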
Filtering out the strong correlations by computing them like this is always important, especially for large datasets, because a visualization alone is not quantifiable.
Still, having an efficient visual tool would be great.
- Here we will utilize the Rank2D visualizer of the Yellowbrick library to compute Pearson correlations between all pairs of features.
- Yellowbrick classes are compatible with the Sklearn import-instantiate-fit-transform-predict template.
- Yellowbrick visualizers use Matplotlib under the hood, so we can apply the axes and figure functions to these visualizers.
# Import Rank2D class from yellowbrick
from yellowbrick.features import Rank2D
# Modify the figure size
fig, ax=plt.subplots(figsize=(14,7))
# Instantiate the Rank2D object with default arguments: visualizer
visualizer = Rank2D(algorithm="pearson")
# fit the visualizer
visualizer.fit_transform(bike_data_core)
# Plot the visualizer with .poof() method
visualizer.poof()
- This figure shows us the Pearson correlation between pairs of features.
- The density of the colors displays the magnitude of the correlation.
- A Pearson correlation of 1.0 means that there is a strong positive, linear relationship between the pair of variables, and -1.0 indicates a strong negative, linear relationship (a value of zero indicates no relationship).
- Therefore we are looking for dark red and dark blue boxes to identify strong correlations.
In this chart, we see that the temp and atemp columns and the season and month columns have a strong correlation.
This seems to make sense;
- the apparent temperature we feel outside depends on the actual temperature and other air quality factors, and
- the season of the year is described by the month!

In the cells above we found the same result by filtering, but having a visual table like this is awesome, again especially for larger datasets.
Plot of Correlated Columns¶
Now let's plot the temp and atemp columns to take a closer look.
# Use seaborn jointplot with kind=reg
g = sns.jointplot(x="temp", y= "atemp", data=bike_data_core, kind="reg", height=8)
The joint plot shows a scatter diagram of the apparent temperature on the y axis against the actual measured temperature on the x axis, and draws a line of best fit using a simple linear regression.
At a glance we see
- a very strong positive correlation between the features, and
- the range and distribution of each feature.

There appear to be some data errors in the dataset.
- These instances may need to be manually removed in order to improve the quality of the final model.
- By just looking at the scatter diagram above, it is not possible to tell which column caused the erroneous input.
- Even so, the horizontal line of data (constant y axis values) makes us suspicious about an input error in the atemp column.
- We will find them manually below.
- We can ultimately confirm the selection of the columns to remove by training our model on either value and scoring the results (see the sketch below).
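As a rough sketch of that confirmation step (the Ridge regressor, the base feature list and the cross-validation settings are illustrative assumptions, not the final modeling of the next post), we could score a simple model once with temp and once with atemp and compare:
# A sketch: compare a simple model trained with "temp" vs. "atemp"
# (Ridge and the base feature list are illustrative choices)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
base_features = ["hour", "weekday", "month", "humidity", "windspeed"]
y = bike_data_inliers["count"]
for extra in ["temp", "atemp"]:
    X = bike_data_inliers[base_features + [extra]]
    print(extra, cross_val_score(Ridge(), X, y, cv=5).mean())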
Erroneous inputs in "temp" or "atemp" columns¶
If we observe the values of the temp or atemp column without pairing them with each other, it is not easy to detect the data input errors, because these values are not extreme with respect to the ranges of the temp and atemp columns.
Let's search for them in pairs.
- First subset the bike_data with the condition 12 < atemp < 13 or 25 < temp < 35 (because the outliers fall in that interval, as seen in the scatter diagram), and
- select the columns "hour", "temp", and "atemp".
# Subset the relevant part of the bike_data: df_err
df_err= bike_data_inliers[(bike_data_inliers["atemp"]>12) & (bike_data_inliers["atemp"] <13)]
df_err= df_err[["hour", "temp", "atemp"]]
df_err.head()
Let's plot this subset dataframe with the erroneous inputs: df_err
df_err.temp.plot(marker="o", linewidth=0, figsize=(10,4));
We can narrow down the index values of the dataframe on the chart (x axis) with the xlim parameter of the plot method and find the indices of the erroneous values.
df_err.temp.plot(marker="o", linewidth=0, xlim=(8700, 9000), figsize=(10,2));
df_err.temp.plot(marker="o", linewidth=0, xlim=(8710, 8740), figsize=(10,2));
Now we know where the erroneous inputs are. Their index values are between 8710 and 8740.
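Instead of reading the indices off the chart, a programmatic check is also possible; the sketch below simply flags the rows where temp and atemp diverge by an unusually large amount (the 10-degree threshold is an arbitrary illustrative choice):
# A sketch: flag rows where "temp" and "atemp" diverge suspiciously
# (the 10-degree threshold is an arbitrary illustrative choice)
suspects = bike_data_inliers[(bike_data_inliers["temp"] - bike_data_inliers["atemp"]).abs() > 10]
print(suspects.index.min(), suspects.index.max())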
- Let's extract that slice of the bike_data,
- subset only the columns "datetime", "temp", "atemp" for ease of observation, and
- apply a styling to the erroneous inputs.
# Inspect the range 8710:8740 of the bike_data_inliers df
erroneous_df= bike_data_inliers.loc[8710: 8740, ["datetime", "temp","atemp"]]
def highlight_cols(s):
color = 'yellow'
return 'background-color: %s' % color
erroneous_df.style.applymap(highlight_cols, subset=pd.IndexSlice[8714:8734, "atemp"])
With this output we can see that the erroneous inputs are in the atemp column. We can replace them by interpolation.
# Extract the erroneous slice of atemp column
erroneous_slice=bike_data_inliers.loc[8712:8736, "atemp"]
# Replace the erroneous 12.12 values with nan
erroneous_slice.replace(12.12, np.nan, inplace=True)
erroneous_slice
# Interpolate the nan values
erroneous_slice.interpolate()
# Replace the erroneous part of the data frame with the interpolated data
bike_data_inliers.loc[8712:8736, "atemp"]= erroneous_slice.interpolate()
# Plot the "temp" and "atemp" columns again
g = sns.jointplot(x="temp", y= "atemp", data=bike_data_inliers, kind="reg", height=6)
It looks better.
Visualizing Riders Monthly, Daily, Hourly¶
In the final section of this notebook we want to see the bike rental patterns broken down by time. Since it is intuitive that bike usage is related to weather and time, we expect to see meaningful relations in the charts.
Here are our steps:
- Since we are working with time-based data created by humans, it is good practice to observe it by splitting it into "weekdays" and "weekends". If it were a dataset created by butterflies, of course we would not have done it this way!
- We will also start from the longer time frames and follow with the shorter time frames.

We will be plotting repeatedly, so it is more efficient to define a plot function instead of writing the script again each time.
# Define a barplot function: bar_plot
def bar_plot(df, label_x, label_y, str_title):
    '''Takes a dataframe (or series) and three strings: label of x-axis, label of y-axis and title;
    draws a bar plot'''
    plt.style.use('fivethirtyeight')
    # df.plot creates its own figure, so no extra plt.subplots() call is needed
    ax = df.plot(kind="bar",
                 figsize=(10, 5),
                 rot=70,
                 fontsize=11)
    ax.set_xlabel(label_x)
    ax.set_ylabel(label_y)
    ax.set_title(str_title)
    plt.show()
bike_data_inliers.dtypes
# Convert the date column to the datetimeindex
bike_data_inliers.set_index("datetime", inplace=True)
# Create the weekdays and weekends dataframes
weekdays=bike_data_inliers[bike_data_inliers.index.dayofweek.isin([0,1,2,3,4])]
weekends=bike_data_inliers[bike_data_inliers.index.dayofweek.isin([5, 6])]
# Groupby the monthly average riders
monthly=bike_data_inliers.groupby(bike_data_inliers.index.month)["count"].mean()
# Plot the monthly bike rentals
labelx= "Month"
labely="Average riders"
title= "Monthly Average Riders"
bar_plot(monthly, labelx, labely, title)
# Groupby the daily average bike rentals
daily= bike_data_inliers.groupby(bike_data_inliers.index.dayofweek)["count"].mean()
daily.index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', "Sat", "Sun"]
# Plot the daily bike rentals
labelx= "Day"
labely="Average riders"
title= "Daily Average Riders"
bar_plot(daily, labelx, labely, title)
# Groupby weekdays data by the hourly average bike rentals
hourly_weekdays= weekdays.groupby("hour")["count"].mean()
# Plot the hourly weekdays bike rentals
labelx= "Hour"
labely="Average riders"
title= "Hourly Average Riders in Weekdays"
bar_plot(hourly_weekdays, labelx, labely, title)
# Groupby the weekends data by the hourly average bike rentals
hourly_weekends= weekends.groupby("hour")["count"].mean()
# Plot the hourly weekends bike rentals
labelx= "Hour"
labely="Average riders"
title= "Hourly Average Riders in Weekends"
bar_plot(hourly_weekends, labelx, labely, title)
We can also use sns.pointplot to see the hourly rentals across all the days separately on one chart.
Here is some brief info about sns.pointplot from the official documentation:
A pointplot
- shows point estimates and confidence intervals using scatter plot glyphs.
- represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars. (in our chart we are not displaying the bars)
- Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables.
- They are particularly adept at showing interactions: how the relationship between levels of one categorical variable changes across levels of a second categorical variable.
- The lines that join each point from the same hue level allow interactions to be judged by differences in slope, which is easier for the eyes than comparing the heights of several groups of points or bars.
- It is important to keep in mind that a pointplot shows only the mean (or other estimator) value, but in many cases it may be more informative to show the distribution of values at each level of the categorical variables.
- In that case, other approaches such as a box or violin plot may be more appropriate.
# Plot the hourly rental means across weekdays and weekends on a pointplot
fig, ax=plt.subplots(figsize=(10, 6))
sns.pointplot(x="hour",
y="count",
hue="weekday",
data=bike_data_inliers,
errwidth=0,
scale=.5,
ax=ax);
hue_labels = ["Sun", "Mon","Tue","Wed","Thu","Fri","Sat"]
leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, hue_labels);
ax.set_title("Hourly Rental Mean Across the Days");
# Groupby by "weather" column and plot
weather_state=bike_data_inliers.groupby("weather")["count"].mean()
# Plot the average bike rentals across the weather state
labelx= "Weather"
labely="Average riders"
title= "Average Riders Across Weather State"
bar_plot(weather_state, labelx, labely, title)
Pointplot of weather state¶
# Plot the rental counts across the weather states on a pointplot
fig, ax=plt.subplots(figsize=(10, 4))
sns.pointplot(x="weather",
y="count",
hue="weather",
data=bike_data_inliers,
scale=.5,
join=True,
ax=ax);
hue_order = ["1", "2", "3", "4"]
leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, hue_order);
Regarding the description of the weather column given below,

weather:
1: Clear, Few clouds, Partly cloudy, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

the chart above does not look intuitive, because the mean of rentals at state-4 is higher than at state-3.
We should think about this and maybe question the reliability of the "weather" column (see the quick check below).
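One plausible explanation, not verified here, is that very few observations fall into weather state 4, which would make its mean unreliable; a quick check like the sketch below shows how many rows each state actually has:
# A quick check: how many observations does each weather state have?
print(bike_data_inliers["weather"].value_counts())
# If state 4 has only a handful of rows, its mean is not a reliable estimate.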
Wrap Up¶
In this notebook we
- explored the bike rental dataset,
- did outlier analysis and removed the outliers,
- made a correlation analysis for feature selection (to be used in the next post),
- utilized Yellowbrick's visualization tool for the correlation matrix,
- found input errors by plotting the features,
- replaced the erroneous inputs by interpolation, and
- finally displayed the bike rentals in different time frames.

During all these steps we tried to take advantage of visual tools in order to do our analysis more efficiently.
In the next post we will continue with the predictive analysis of the bike rentals and work on machine learning steps.