Project in a Nutshell¶
We are trying to predict the net change in the bike stock (bikes returned - bikes taken) at a specific station at a specific hour.
In this notebook we continue our analysis of the bike share dataset by applying tree-based models:

- In part 1 and part 2 of our analysis, we did exploratory data analysis (EDA) and used Linear Regression for our prediction and Kaggle submission.
- Here we will implement the Random Forest, AdaBoost, Gradient Boosting, and Stochastic Gradient Boosting algorithms, trying to improve our prediction score with these more complex models.
- Before diving directly into the project, we will briefly review tree-based models.
- Since trees and ensembles have many hyperparameters, we will cover some good practices for using them and implement Grid Search to tune them.
- We will carry out the machine learning steps to predict the bike rentals for the given Kaggle test set and submit our predictions to Kaggle at the end.
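As a preview of the four ensemble models listed above, here is a minimal sketch on synthetic regression data (the real notebook uses the Kaggle bike sharing files, which are not reproduced here; the hyperparameter grid is purely illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data; the real features are the engineered bike share columns.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    # subsample < 1.0 turns ordinary gradient boosting into its stochastic
    # variant: each tree is fit on a random fraction of the rows.
    "Stochastic GB": GradientBoostingRegressor(subsample=0.8, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R^2 = {model.score(X_test, y_test):.3f}")

# Grid Search over a small hyperparameter grid for one of the models.
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=3,
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
```

The same fit/score pattern applies to all four estimators, which is what makes comparing them (and grid-searching over them) straightforward.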
Thanks to Scikit-learn's instantiate-fit-transform-predict template, we can run our formatted data through as many models as we like without having to re-transform the data to fit each kind of model. In Scikit-learn our main concern is optimizing the models.
For supervised problems:
- we know the target, so we can score the model,
- we can easily throw the data at many different classifiers, and
- score them to find the optimal one.
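The uniform interface described above can be sketched as follows; the three model families and the synthetic data are illustrative, not the notebook's actual setup:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

# Very different model families, identical interface: the same (X, y)
# arrays are scored by each candidate with no per-model reshaping.
candidates = (
    LinearRegression(),
    KNeighborsRegressor(),
    DecisionTreeRegressor(random_state=0),
)
for model in candidates:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV scores
    print(type(model).__name__, round(scores.mean(), 3))
```

Because every estimator exposes the same `fit`/`predict`/`score` methods, the comparison loop stays identical no matter which models we plug in.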
We can automate these steps but the difficulties for automation are: