
Ensemble Models - Kaggle Submission

  • In this notebook we will continue our analysis of the bike data by applying tree-based models.
  • We will implement the Random Forest, AdaBoost, Gradient Boosting, and Stochastic Gradient Boosting algorithms.
  • We will submit our predictions to Kaggle at the end.
  • Since trees and ensembles have many hyperparameters, in this notebook we will explain some good practices for using them.
  • We will also implement Grid Search in order to tune these hyperparameters.
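As a preview of the models listed above, here is a minimal sketch of how the four ensemble regressors can be instantiated in scikit-learn. The hyperparameter values are illustrative placeholders, not tuned choices; note that "stochastic" gradient boosting is obtained simply by setting `subsample` below 1.0 on `GradientBoostingRegressor`.

```python
# Illustrative instantiation of the four ensemble regressors covered below.
from sklearn.ensemble import (
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "adaboost": AdaBoostRegressor(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
    # Stochastic gradient boosting: subsample < 1.0 fits each tree on a
    # random fraction of the training rows, adding variance reduction.
    "stochastic_gb": GradientBoostingRegressor(
        n_estimators=100, subsample=0.8, random_state=42
    ),
}

for name, model in models.items():
    print(name, "->", type(model).__name__)
```

All four share the same `fit`/`predict` interface, which is what makes swapping them in and out of a pipeline straightforward.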


Tree-based Models

  • In part1 and part2 of our analysis of the bikeshare dataset, we did exploratory data analysis (EDA) and used Linear Regression for our prediction and Kaggle submission.
  • We will try to improve our prediction score on the same dataset with more complex tree-based models.
  • Before diving directly into the project, it is worth reviewing tree-based models first.
  • So in this post we will review trees, and in the next post we will apply these models to our dataset.


Feature Selection by Visualization

Thanks to Scikit-learn's instantiate-fit-transform-predict template, we can run our formatted data through as many models as we like without having to re-transform the data to fit each kind of model.
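The template can be sketched in a few lines on toy data; the same three calls (instantiate, fit, predict) work unchanged for any scikit-learn estimator. The data here is made up purely for illustration.

```python
# Sketch of the instantiate-fit-predict template on toy data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.5, 2.5, 3.5, 4.5])

model = RandomForestRegressor(n_estimators=10, random_state=0)  # instantiate
model.fit(X, y)                                                 # fit
preds = model.predict(X)                                        # predict

print(preds.shape)  # one prediction per row of X
```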

In Scikit-learn our main concern is optimizing the models.

For supervised problems,

  • we know the target, so we can score the model,
  • we can easily throw the data at many different classifiers and
  • score them to find the optimal classifier.
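Because every estimator shares the same interface, the steps above reduce to a loop over candidate models, scoring each with cross-validation. This is a sketch on synthetic regression data (our actual target is the bikeshare count); the candidate set and scoring metric are illustrative choices.

```python
# Sketch: loop over candidate models and score each with cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the bikeshare features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Mean R^2 over 3 folds for each candidate.
scores = {
    name: cross_val_score(est, X, y, cv=3, scoring="r2").mean()
    for name, est in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```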

We can automate these steps but the difficulties for automation are: