
Ensemble Models - Kaggle Submission

  • In this notebook we continue our analysis of the bike-sharing data by applying tree-based models.
  • We will implement the Random Forest, AdaBoost, Gradient Boosting and Stochastic Gradient Boosting algorithms.
  • We will submit our predictions to Kaggle at the end.
  • Since trees and their ensembles have many hyperparameters, we also cover some good practices for using them.
  • We will also run a Grid Search to tune these hyperparameters (see the sketch after this list).
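As a rough illustration of that grid-search step, here is a minimal sketch; a synthetic regression dataset stands in as a placeholder for the bike-share features, and the estimator and parameter values are illustrative, not the notebook's actual settings.

```python
# Minimal sketch: tuning a tree ensemble with grid search.
# A synthetic dataset stands in for the bike-share features; values are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [1.0, 0.8],  # subsample < 1.0 gives stochastic gradient boosting
}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```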


Tree-based Models

  • In part 1 and part 2 of our analysis of the bike-share dataset, we did exploratory data analysis (EDA) and used Linear Regression for our prediction and Kaggle submission.
  • We will try to improve our prediction score on the same dataset with more complex tree-based models.
  • Before diving directly into the project, it is worth reviewing tree-based models first.
  • So in this post we review trees, and in the next post we will apply these models to our dataset.


Feature Selection by Visualization

Thanks to Scikit-learn's instantiate-fit-transform-predict template, we can run our formatted data through as many models as we like without having to re-transform the data for each kind of model.

In Scikit-learn our main concern is optimizing the models.

For supervised problems,

  • we know the target so we can score the model,
  • we can easily throw the data at a range of different classifiers and
  • score them to find the optimal classifier.

We can automate these steps, but automation comes with its own difficulties. The basic fit-and-score loop is sketched below.
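To make the instantiate-fit-score loop concrete, here is a minimal sketch; the estimators and the synthetic dataset are illustrative placeholders, not the notebook's actual setup.

```python
# Minimal sketch: the same fit/score template applied to several classifiers.
# A synthetic classification dataset stands in for the real data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(), SVC()]:
    model.fit(X_train, y_train)          # same template for every estimator
    score = model.score(X_test, y_test)  # mean accuracy for classifiers
    print(f"{model.__class__.__name__}: {score:.3f}")
```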

Handling Missing Values

  • In this notebook we summarize methods for dealing with missing values.
  • Missing values can happen because of human error, machine sensor malfunction, database failures, etc.
  • Missing values can reduce predictive power and lead to biased results if not handled properly.
  • There are different solutions for data imputation depending on the kind of problem, such as time series analysis or regression (see the sketch after this list).
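As one possible illustration, here is a minimal sketch of median imputation with scikit-learn's SimpleImputer; the toy DataFrame and its columns are hypothetical.

```python
# Minimal sketch: filling missing values with a median strategy.
# The DataFrame and its columns are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"temp": [21.0, np.nan, 19.5], "humidity": [0.4, 0.55, np.nan]})

imputer = SimpleImputer(strategy="median")  # other strategies: "mean", "most_frequent"
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_filled)
```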


ROC Curve & AUC

  • We will talk about performance evaluation methods for classification models:
    • the Receiver Operating Characteristic (ROC) Curve and
    • the Area Under the ROC Curve (AUC).
  • Classifiers generally don't return a simple “Yes-or-No” answer.
  • Instead, a classification procedure usually returns a score along a range.
  • Ideally, the positive instances score towards one end of the scale, and the negative instances towards the other end.
  • It is up to the analyst to set a threshold on that score that separates the POSITIVE class from the NEGATIVE class.

  • The ROC Curve is a tool that helps to set the best threshold (see the sketch after this list).
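Here is a minimal sketch of computing a ROC curve, its AUC and a candidate threshold; the synthetic dataset, the logistic regression stand-in and the Youden's J heuristic are assumptions for illustration, not the notebook's actual procedure.

```python
# Minimal sketch: ROC curve and AUC from classifier scores.
# Synthetic data and a logistic regression stand in for the real setup.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_scores = clf.predict_proba(X_test)[:, 1]  # scores for the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_scores)
print("AUC:", roc_auc_score(y_test, y_scores))

# One simple heuristic for choosing a threshold: maximize tpr - fpr (Youden's J).
best = (tpr - fpr).argmax()
print("candidate threshold:", thresholds[best])
```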


Metrics - Precision, Recall, F1 Score

  • We continue with methods for measuring the performance of classification models.

  • The confusion matrix is very helpful for understanding the details of a classifier's performance; however, we still need metrics that can be interpreted more directly when choosing the best model.

  • So in this notebook we will talk about metrics like sensitivity (recall), specificity, precision and the F1 score.

  • These metrics can be computed from a confusion matrix and are more convenient for choosing the best model.
  • In the first part we review the concepts, and in the second part we apply them to evaluate some classifiers' performance.
  • Scikit-learn provides all these scores via classification_report (see the sketch after this list).
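As a rough illustration, here is a minimal sketch of producing these scores with scikit-learn's classification_report; the synthetic dataset and the random forest stand-in are assumptions, not the notebook's actual models.

```python
# Minimal sketch: precision, recall and F1 per class from classification_report.
# Synthetic data and a random forest stand in for the real models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall, F1 and support in one table.
print(classification_report(y_test, y_pred))
```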

GPS Data Analysis

In this project we will

  • analyze a GPS dataset,
  • try to infer some ideas mainly from the coordinates (latitude, longitude) of the data,
  • do some exploratory data analysis by filtering and grouping the data,
  • look for patterns in different time frames, and
  • start analyzing the subject of interest over longer time frames, then drill down to shorter ones (see the sketch after this list).
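As a rough illustration of filtering and grouping by time frame, here is a minimal pandas sketch; the column names and the tiny inline sample are hypothetical, not the project's actual data.

```python
# Minimal sketch: grouping GPS points by time frame with pandas.
# Column names (timestamp, lat, lon) and the inline sample are hypothetical.
import pandas as pd

gps = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2019-01-05 08:10", "2019-01-05 18:40", "2019-02-11 08:05", "2019-02-12 19:20"]
        ),
        "lat": [50.846, 50.851, 50.844, 50.853],
        "lon": [4.351, 4.362, 4.349, 4.365],
    }
)

# Longer time frame first: how many points per month?
monthly_counts = gps.groupby(gps["timestamp"].dt.to_period("M")).size()

# Then a shorter time frame: average position per hour of day.
hourly_center = gps.groupby(gps["timestamp"].dt.hour)[["lat", "lon"]].mean()

print(monthly_counts)
print(hourly_center)
```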

Metrics - Confusion Matrix

  • In a typical data science project we try several models (logistic regression, SVM, tree-based classifiers, etc.) on our data.
  • Then we measure the predictive performance of these models to find the best-performing one.
  • Finally we decide to implement the best-performing model.
  • In this notebook we talk about one of the classification model evaluation tools: the confusion matrix.

  • It can help us see more deeply how reliable our models are.

  • We are going to look at the confusion matrices of a variety of Scikit-Learn models and compare them using visual diagnostic tools from Yellowbrick in order to select the best model for our data (see the sketch after this list).
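As a rough illustration, here is a minimal sketch of comparing two classifiers with Yellowbrick's ConfusionMatrix visualizer; the synthetic dataset and the chosen models are assumptions, not the notebook's actual setup.

```python
# Minimal sketch: comparing classifiers with Yellowbrick's ConfusionMatrix visualizer.
# A synthetic dataset stands in for the real data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ConfusionMatrix

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]:
    viz = ConfusionMatrix(model)
    viz.fit(X_train, y_train)
    viz.score(X_test, y_test)  # fills the confusion matrix from test predictions
    viz.show()                 # renders the matrix as a visual diagnostic
```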
