Gradient Boosting Machines

Gradient Boosting Machines as a family of methods have been the “talk of the town” in the machine learning world for a while now, and the specific flavour of gradient-boosted trees has been regarded as “the best off-the-shelf classifier in the world” (Breiman, cited in Hastie et al., 2013) [1]. A wonderful review of GBMs was published by Natekin and Knoll (2013) [2].

Let’s have a look at the actual gradient boosting algorithm as proposed by Friedman (1999) [3]. The problem the algorithm tries to solve is how to optimize a model over a wide variety of learners and loss functions, while at the same time avoiding hard optimization problems. The solution the GBM proposes is to impose functional constraints, e.g. to use only a specific type of learner h, and to estimate the parameters of a single weak learner at a time, namely the one that best reduces the loss (i.e. the learner that is maximally correlated with the negative gradient). To avoid “over-shooting”, an additional step-size parameter rho is fitted for each learner. Finally, by using an iterative procedure for the numerical optimization of these softened problems, non-convergence problems are avoided.
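To make the “maximally correlated with the negative gradient” step concrete, the generic update can be written as follows (a sketch in standard GBM notation, not taken verbatim from the paper; L is the loss, h_m the m-th base learner and rho_m its step size):

```latex
% Pseudo-residuals: the negative gradient of the loss at the current estimate F_{m-1}
r_{im} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad i = 1,\dots,N

% Fit the base learner h_m to the pseudo-residuals, then line-search for the step size rho_m
\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\bigl(y_i,\; F_{m-1}(x_i) + \rho\, h_m(x_i)\bigr)

% Update the ensemble estimate with the new learner
F_m(x) = F_{m-1}(x) + \rho_m\, h_m(x)
```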

[Figure: Friedman’s (1999) gradient boosting algorithm, reproduced from Natekin & Knoll (2013)]

If you look at the algorithm, you can see what input it requires. After an initialization, the algorithm runs over the given number of iterations, i.e. the number of learners. In each step, the negative gradient is calculated and used to fit a new base learner. The incremental estimate for that learner is then fine-tuned with the optimal step size rho. Finally, the overall function estimate of the ensemble is updated to include the new learner.
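As a rough illustration of these steps (this is not Friedman’s reference implementation: the squared-error loss, the shallow regression trees and a fixed shrinkage factor in place of the line search for rho are assumptions made for the sketch), a least-squares boosting loop might look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gbm_fit(X, y, n_learners=100, learning_rate=0.1, max_depth=1):
    """Least-squares gradient boosting sketch: in each iteration, fit a small
    regression tree to the negative gradient (for squared error, simply the
    residuals) and add it to the ensemble with a shrunken step size."""
    # Initialization: a constant model (the mean minimizes squared error)
    f0 = float(np.mean(y))
    prediction = np.full(len(y), f0)
    learners = []

    for _ in range(n_learners):
        # Negative gradient of the squared-error loss = residuals y - F(x)
        residuals = y - prediction
        # Fit the next weak learner to the negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Update the ensemble; learning_rate plays the role of the step size
        # (a fixed shrinkage instead of an optimized rho)
        prediction += learning_rate * tree.predict(X)
        learners.append(tree)

    return f0, learners


def gbm_predict(X, f0, learners, learning_rate=0.1):
    # Must use the same learning_rate that was used during fitting
    prediction = np.full(X.shape[0], f0)
    for tree in learners:
        prediction += learning_rate * tree.predict(X)
    return prediction


# Tiny usage example on synthetic data
if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
    f0, learners = gbm_fit(X, y)
    print("train MSE:", np.mean((gbm_predict(X, f0, learners) - y) ** 2))
```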

To use an analogy, imagine you want to find the peak of a large mountain massif that is shrouded in heavy fog. You could send a top mountaineer, like Reinhold Messner, on a mission and check whether he reaches the summit, gets lost on the way, or is eaten by a Yeti (non-convergence). The GBM way would be to ask 10,000 amateur hikers to find the summit, but to return once they have made some progress or are afraid of getting lost. Additionally, they should report the last barrier they faced. When a hiker returns, you choose the most suitable of your remaining hikers to overcome the reported problem and tell him to follow in the footsteps of his comrades.

So it seems that such a team effort is an overall very successful strategy for challenging problems.

[1] Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning. http://www-stat.stanford.edu/~tibs/ElemStatLearn/download.html

[2] Natekin, A., & Knoll, A. (2013). Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, 7, 21. http://doi.org/10.3389/fnbot.2013.00021

[3] Friedman, J. H. (1999, updated 2000, 2001). Greedy function approximation: A gradient boosting machine. Retrieved 2016-09-27, from https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

Other blogs

  • Brownlee, J. (2016-09-09). A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning. Retrieved 2016-09-27, from http://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
  • Rogozhnikov, A. (2016-06-24). Gradient Boosting explained [demonstration]. Retrieved 2016-09-27, from http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
  • Saitta, S. (2016-06-01). Gradient Boosting Machine – A Brief Introduction. Retrieved 2016-09-23, from http://www.dataminingblog.com/gradient-boosting-machine-a-brief-introduction/