
XGBoost 2.0 | XGBoosting
Key Points
- XGBoost 2.0 introduces significant performance enhancements by making `hist` the default tree method, improving external memory support with memory mapping, and simplifying device selection via a new `device` parameter.
- Key new functionalities include initial support for multi-target trees with vector-leaf outputs, a revamped learning-to-rank module with NDCG as the default objective, and the addition of quantile regression.
- The release also brings optimizations such as CPU histogram size control, improved input handling, and enhanced PySpark integration, and introduces some breaking changes, including a requirement to specify the text input format.
XGBoost 2.0 represents a significant update to the gradient boosting library, introducing a wide array of new features, optimizations, and improvements aimed at enhancing performance, efficiency, and user experience.
A cornerstone development is the initial work on multi-target trees with vector-leaf outputs. Traditionally, XGBoost required building a separate model for each target in multi-target tasks such as multi-target regression, multi-label classification, and multi-class classification. XGBoost 2.0 begins to enable the construction of a single tree that simultaneously predicts multiple targets by having leaves output vectors instead of scalar values. This methodological shift offers several advantages, including the potential to prevent overfitting, produce more compact models, and inherently account for the correlations between different targets within a unified tree structure.
The user experience for device specification has been streamlined with the introduction of a new, unified `device` parameter. This replaces and consolidates previous, more fragmented options such as `gpu_id`, `gpu_hist`, `gpu_predictor`, `cpu_predictor`, and `gpu_coord_descent`, simplifying the selection of computation devices and their ordinals.
From version 2.0, the `hist` tree method becomes the default choice for tree construction. In prior versions, the library dynamically selected between the `approx` and `exact` methods based on input data characteristics and the training environment. By defaulting to `hist`, XGBoost aims to provide more efficient and consistent training performance across diverse scenarios, leveraging its histogram-based approach for faster splits.
To offer finer control over memory footprint, particularly on the CPU, a new parameter, `max_cached_hist_node`, has been introduced. This parameter allows users to limit the CPU cache size for histograms, and reining in overly aggressive caching can be crucial when training very deep trees. Additionally, memory usage for both the `hist` and `approx` tree methods in distributed systems has been reduced by halving the cache size.
Significant improvements to external memory support have been implemented, particularly for the `hist` tree method. This feature, while still experimental, has seen substantial performance gains due to the replacement of the older file I/O logic with memory mapping techniques. This change not only boosts training speed when data exceeds RAM but also reduces CPU memory consumption, serving as an alternative memory-saving strategy when `QuantileDMatrix` is insufficient.
The learning-to-rank task has received a brand-new, advanced implementation. This includes new parameters for selecting the pair construction strategy, enhanced control over the number of samples per group, and an experimental implementation of unbiased learning-to-rank. It also introduces support for custom gain functions alongside NDCG (Normalized Discounted Cumulative Gain), deterministic GPU computation, and improved metric performance facilitated by caching mechanisms. Notably, NDCG is now the default objective function for learning-to-rank tasks.
Other notable enhancements include automatic estimation of the `base_score` based on input labels for optimal accuracy, and the introduction of quantile regression support, allowing minimization of the quantile loss. Both the L1 and quantile regression objectives now support the learning rate. Users can also export the quantile values utilized by the `hist` tree method. Progress has been made in federated learning with support for column-based splits, specifically for vertical federated learning. The PySpark integration has received optimizations, including GPU-based prediction and improved data initialization. Furthermore, general input handling and performance have been enhanced for common data structures like NumPy arrays.
It is important to note that XGBoost 2.0 introduces some breaking changes: users are now required to specify the format for text input, and the `predictor` parameter has been removed.