Block-distributed Gradient Boosted Trees

  title={Block-distributed Gradient Boosted Trees},
  author={Theodore Vasiloudis and Hyunsu Cho and Henrik Bostr{\"o}m},
  journal={Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval},
The Gradient Boosted Tree (GBT) algorithm is one of the most popular machine learning algorithms used in production, for tasks that include Click-Through Rate (CTR) prediction and learning-to-rank. To deal with the massive datasets available today, many distributed GBT methods have been proposed. However, they all assume a row-distributed dataset, addressing scalability only with respect to the number of data points and not the number of features, and increasing communication cost for high… 
2 Citations

Figures from this paper

Challenges and Opportunities of Building Fast GBDT Systems

This survey paper reviews the recent GBDT systems with respect to accelerations with emerging hardware as well as cluster computing, and compares the advantages and disadvantages of the existing implementations.

Net-DNF: Effective Deep Modeling of Tabular Data

This work presents Net-DNF, a novel generic architecture whose inductive bias elicits models whose structure corresponds to logical Boolean formulas in disjunctive normal form (DNF) over affine soft-threshold decision terms, which opens the door to practical end-to-end handling of tabular data using neural networks.



LightGBM: A Highly Efficient Gradient Boosting Decision Tree

It is proved that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size, and is called LightGBM.

Fast Ranking with Additive Ensembles of Oblivious and Non-Oblivious Regression Trees

QuickScorer is presented, a new algorithm that adopts a novel cache-efficient representation of a given tree ensemble, performs an interleaved traversal by means of fast bitwise operations, and supports ensembles of oblivious trees.

QuickScorer: A Fast Algorithm to Rank Documents with Additive Ensembles of Regression Trees

This paper presents QuickScorer, a new algorithm that adopts a novel bitvector representation of the tree-based ranking model, and performs an interleaved traversal of the ensemble by means of simple logical bitwise operations.

DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions

This paper conducts a careful investigation of existing systems by developing a performance model with respect to the dimensionality of the data and implements a series of optimizations to further optimize the performance of collective communications.

Communication Efficient Distributed Machine Learning with the Parameter Server

An in-depth analysis of two large scale machine learning problems ranging from l1 -regularized logistic regression on CPUs to reconstruction ICA on GPUs, using 636TB of real data with hundreds of billions of samples and dimensions is presented.

Scaling Distributed Machine Learning with the Parameter Server

View on new challenges identified are shared, and some of the application scenarios such as micro-blog data analysis and data processing in building next generation search engines are covered.

A Communication-Efficient Parallel Algorithm for Decision Tree

Experiments on real-world datasets show that PV-Tree significantly outperforms the existing parallel decision tree algorithms in the tradeoff between accuracy and efficiency.

McRank: Learning to Rank Using Multiple Classification and Gradient Boosting

This work considers the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval, and proposes using the Expected Relevance to convert class probabilities into ranking scores.

CatBoost: unbiased boosting with categorical features

This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit and provides a detailed analysis of this problem and demonstrates that proposed algorithms solve it effectively, leading to excellent empirical results.

Practical Lessons from Predicting Clicks on Ads at Facebook

This paper introduces a model which combines decision trees with logistic regression, outperforming either of these methods on its own by over 3%, an improvement with significant impact to the overall system performance.