Terence Parr and Jeremy Howard
(We teach in University of San Francisco's MS in Data Science program and have other nefarious projects underway. You might know Terence as the creator of the ANTLR parser generator. Jeremy is a founding researcher at fast.ai, a research institute dedicated to making deep learning more accessible.)
Please send comments, suggestions, or fixes to Terence.
Gradient boosting machines (GBMs) are currently very popular and so it's a good idea for machine learning practitioners to understand how GBMs work. The problem is that understanding all of the mathematical machinery is tricky and, unfortunately, these details are needed to tune the hyper-parameters. (Tuning the hyper-parameters is required to get a decent GBM model unlike, say, Random Forests.) Our goal in this article is to explain the intuition behind gradient boosting, provide visualizations for model construction, explain the mathematics as simply as possible, and answer thorny questions such as why GBM is performing “gradient descent in function space.” We've split the discussion into three morsels and a FAQ for easier digestion.
As Ben Gorman points out in A Kaggle Master Explains Gradient Boosting, “This is the part that gets butchered by a lot of gradient boosting explanations.” His blog post does a good job of explaining it, but we give our own perspective here.
To get started with GBM, those with very strong mathematical backgrounds can go directly to the super-tasty 1999 paper by Jerome Friedman called Greedy Function Approximation: A Gradient Boosting Machine. To get the practical and implementation perspective, though, we recommend Ben Gorman's excellent blog A Kaggle Master Explains Gradient Boosting and Prince Grover's Gradient Boosting from scratch (written by a data science graduate student at the University of San Francisco; we also thank Prince for reviewing this article.) To go beyond basic GBM, see XGBoost: A Scalable Tree Boosting System by Chen and Guestrin. To get really fancy, you can even add momentum to the gradient descent performed by boosting machines, as shown in the recent article: Accelerated Gradient Boosting.
All of the code used to generate the graphs and data in these articles is available in the Notebooks directory at github. Warning: the code is a just good enough to generate the graphs in these articles; the code is provided purely for transparency and completeness (i.e., we ain't proud of it).