Gradient Boosting Decision Trees (GBDT) for Regression Tasks


Gradient Boosting Decision Trees (GBDT) are a highly effective machine learning method for tabular materials data with moderate sample sizes and complex non-linear relationships. This article explains the core principles of GBDT, compares it with linear models, Gaussian Processes, and deep learning, and highlights its strengths in handling heterogeneous data and feature interactions. It also discusses limitations such as lack of extrapolation and uncertainty quantification, providing practical guidance for materials scientists selecting appropriate models.


FAUSTEN Tobias

MI-6 Ltd., Machine Learning Engineer

Studied economics and Japanese at the University of Duisburg-Essen. After graduating, worked as a software engineer at an MLOps-focused startup, building and operating ML products. Currently manages the ML Engineering team at MI-6, leading the development of ML infrastructure and pipelines for Bayesian optimization and machine learning.

Introduction: From Linear Models to Ensemble Learning

The Model Selection Challenge

In materials informatics, selecting an appropriate machine learning model depends critically on data characteristics and the nature of the underlying phenomena. Linear models excel when relationships are additive and interpretability is paramount. Gaussian Processes (GPs) are ideal for small datasets where uncertainty quantification is essential. Deep learning methods shine with large datasets containing many features, such as images or spectra.

However, materials scientists frequently encounter a common scenario: tabular experimental data with moderate sample sizes (hundreds to thousands of samples), many mixed-type features (continuous and categorical), and complex non-linear relationships. This is where Gradient Boosting Decision Trees (GBDT) have emerged as a powerful solution.

Why GBDT Has Risen to Prominence

Over the past decade, GBDT has become widely adopted in materials property prediction, process optimization, and composition design. Several factors explain this rise:

  • Practical balance: GBDT offers flexibility to capture intricate feature interactions without requiring the massive datasets that deep learning demands
  • Computational efficiency: GBDT scales well beyond the limitations of kernel-based methods like Gaussian Processes
  • Strong performance on structured data: GBDT excels with synthesis logs, compositional spreadsheets, and experimental databases
  • Reduced tuning burden: GBDT often delivers superior results with less hyperparameter tuning than alternatives

Modern GBDT implementations such as XGBoost (Chen & Guestrin, 2016) and LightGBM (Ke et al., 2017) have refined the original algorithm with additional regularization, histogram-based split finding, and GPU support, making them the dominant choices in practice.
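The basic workflow is the same across implementations. The sketch below uses scikit-learn's `GradientBoostingRegressor` purely for self-containedness (XGBoost and LightGBM expose an analogous fit/predict API); the dataset is synthetic and stands in for a tabular materials dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for, e.g., composition/process descriptors
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=300,    # number of sequentially added trees
    learning_rate=0.05,  # per-tree contribution (the learning rate in the boosting update)
    max_depth=3,         # shallow trees keep feature interactions low-order
    random_state=0,
)
model.fit(X_train, y_train)
print(f"test R^2 = {r2_score(y_test, model.predict(X_test)):.3f}")
```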

Algorithmic Foundations: How Boosting Works

The Base Unit: Decision Trees

A decision tree partitions the feature space through a series of binary splits, creating regions where predictions are made. In materials science applications, these splits correspond to interpretable conditions such as "Temperature > 800°C" or "Ti content > 30%."

Each leaf node contains a prediction value, typically the mean of training samples that fall into that region.
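This leaf-mean behaviour can be seen directly on a toy dataset. The sketch below (hypothetical data, scikit-learn's `DecisionTreeRegressor`) fits a depth-1 tree whose single split separates two temperature regimes; each leaf then predicts the mean of its training samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a property that jumps across a boundary near 800 °C
temperature = np.array([[600.0], [700.0], [750.0], [820.0], [900.0], [950.0]])
hardness = np.array([1.0, 1.1, 0.9, 3.0, 3.2, 3.1])

# A depth-1 tree learns one split (here "Temperature > 785 °C", the midpoint
# between 750 and 820); each leaf predicts the mean of its training samples
tree = DecisionTreeRegressor(max_depth=1).fit(temperature, hardness)
print(tree.predict([[650.0], [880.0]]))  # ≈ [1.0, 3.1] (the two leaf means)
```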


Fig. 1. Decision tree structure for materials property prediction

The Boosting Mechanism: Sequential Correction

A single decision tree has high variance and overfits easily. GBDT addresses this by combining many shallow trees, but with a specific strategy: rather than building trees independently (as in Random Forests), each new tree is trained on the errors of the current ensemble.

The process starts from a simple baseline — typically the mean of the training targets — then refines it step by step:

$$ F_m(x) = F_{m-1}(x) + \alpha \cdot h_m(x) $$

Here, $$ h_m $$ is a shallow tree fitted to the residuals $$ r_i = y_i - F_{m-1}(x_i) $$, and $$ \alpha $$ is the learning rate, which controls how much each tree contributes. Each iteration reduces the remaining error, with early trees capturing the dominant patterns and later trees correcting the finer details.

Smaller $$ \alpha $$ requires more trees but provides stronger regularization against overfitting. The name "gradient boosting" comes from the fact that the residuals are the negative gradient of the squared-error loss, so fitting trees to residuals amounts to gradient descent on the loss function; this view allows the same framework to work with any differentiable loss.
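The update rule above can be implemented in a few lines. The following is a minimal from-scratch sketch (using scikit-learn's `DecisionTreeRegressor` as the weak learner on a synthetic 1-D dataset), not a production implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

alpha, M = 0.1, 200
F = np.full_like(y, y.mean())  # F_0: baseline prediction = mean of targets
trees = []
for m in range(M):
    residuals = y - F                                 # r_i = y_i - F_{m-1}(x_i)
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + alpha * h.predict(X)                      # F_m = F_{m-1} + alpha * h_m
    trees.append(h)

def predict(X_new):
    """Apply the full ensemble to new points."""
    return y.mean() + alpha * sum(h.predict(X_new) for h in trees)

print("training MSE:", np.mean((y - F) ** 2))
```

Early iterations capture the coarse shape of the sine curve; later trees fit progressively smaller residuals, exactly as described above.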


Fig. 2. Sequential tree addition process in boosting (simplified illustration with two trees)

Positioning GBDT Against Other Methods

GBDT vs. Linear Models

Linear models with regularization (such as LASSO) are valuable tools in materials informatics, particularly for interpretability and when relationships are approximately linear.

Advantages of GBDT:

  • Captures complex property-structure interactions without manual feature engineering
  • Handles sharp discontinuities naturally (e.g., phase boundaries, glass transition points)

Disadvantages of GBDT:

  • Predictions cannot be expressed as a simple equation, making it harder to extract physical insights

GBDT vs. Gaussian Processes (GP)

Gaussian Processes provide probabilistic predictions, making them popular for Bayesian optimization in materials discovery.

Advantages of GBDT:

  • Significantly faster training and inference on moderate-sized datasets (200–10,000 samples)
  • Better handling of high-dimensional mixed feature spaces
  • More robust to sharp discontinuities in the target function

Disadvantages of GBDT:

  • Lacks intrinsic uncertainty quantification (unlike GP's probabilistic outputs)
  • Cannot naturally incorporate prior knowledge about smoothness or periodicity

GBDT vs. Deep Learning

Deep neural networks have revolutionized computer vision and NLP, but their advantages are less clear for tabular materials data.

Advantages of GBDT:

  • Superior performance on structured, tabular data typical of materials databases
  • Requires substantially less training data than neural networks
  • More interpretable feature contributions
  • Faster training without GPU requirements

Disadvantages of GBDT:

  • Cannot naturally process raw images, spectra, or unstructured data (requires manual feature extraction)
  • Less effective with very high-dimensional feature spaces (>1000 features)

Practical Advantages for Materials Scientists

Handling Heterogeneous Experimental Data

GBDT handles mixed data types natively — continuous variables (temperature, pressure, composition) alongside categorical ones (synthesis method, crystal structure) — without the normalization or one-hot encoding required by neural networks and Gaussian Processes.

Interpretability Through Feature Importance

Despite being an ensemble of potentially hundreds of trees, GBDT provides accessible methods for understanding which features drive predictions. The three most common approaches each measure "importance" differently, and they do not always agree (Fig. 3):

  • Split count: measures how frequently each feature is selected for a split across all trees. This is the simplest metric and is computationally cheap, but it systematically favors features with high cardinality (many distinct values) since these offer more candidate splits. A feature used in many shallow, low-quality splits can appear more important than one used rarely but at critical decision points.
  • Gain-based importance: measures the average reduction in loss (e.g., mean squared error) achieved by splits on each feature. Features that produce large, consistent improvements in prediction accuracy score highly. However, this metric is biased toward continuous features with many possible split points, and can overstate the importance of features that appear in only a few but very effective splits.
  • SHAP values: rather than examining the tree structure, SHAP (SHapley Additive exPlanations) computes the contribution of each feature to each individual prediction using principles from cooperative game theory. Unlike split count and gain, SHAP values satisfy formal fairness and consistency properties, making them the most theoretically sound importance measure. Global importance is the mean of absolute SHAP values across all samples.

Fig. 3. Comparison of three feature importance metrics in a red wine quality prediction model (https://doi.org/10.1016/j.dss.2009.05.016), top 8 features shown

In miHub®'s GBDT implementation, feature importance is computed using the global SHAP methodology.
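For readers who want to reproduce such a comparison, the sketch below (synthetic data, scikit-learn) computes gain-based importance via `feature_importances_` and a simple split count by walking each tree's internal structure; SHAP values would additionally require the `shap` package:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=5, n_informative=2, random_state=0)
model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X, y)

# Gain-based importance: scikit-learn normalizes total impurity reduction per feature
gain = model.feature_importances_

# Split count: tally how often each feature is chosen as a split variable
split_count = np.zeros(X.shape[1])
for tree in model.estimators_.ravel():
    used = tree.tree_.feature  # leaf nodes are marked with a negative sentinel
    split_count += np.bincount(used[used >= 0], minlength=X.shape[1])

print("gain-based:", np.round(gain, 3))
print("split count:", split_count)
# SHAP values would need the `shap` package, e.g. shap.TreeExplainer(model).shap_values(X)
```

Running this typically shows the two metrics agreeing on the dominant features while ranking the minor ones differently, which is exactly the discrepancy Fig. 3 illustrates.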

Limitations and When Not to Use GBDT

The Extrapolation Boundary

Tree-based models cannot extrapolate beyond the range of their training data — decision trees assign the mean of training samples in each leaf, so predictions are bounded by observed values.

Concrete example: A model trained on alloys with 4–11% Ti content cannot reliably predict properties at 13% Ti when considering Ti content alone.


Note: In practice, materials are represented in a high-dimensional feature space. If samples exist that contain a similar element at around 13% with comparable descriptors, the model may still produce reasonable predictions. In this sense, the limitation is not extrapolation in a single variable, but coverage of the overall feature space.
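This boundedness is easy to demonstrate. The sketch below (synthetic data mimicking the Ti example above) trains a GBDT on a linear trend over 4–11% Ti and shows that predictions beyond the training range flatten at the boundary:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
ti_content = rng.uniform(4, 11, size=(200, 1))  # training range: 4-11% Ti
strength = 50 + 10 * ti_content[:, 0] + rng.standard_normal(200)

model = GradientBoostingRegressor(random_state=0).fit(ti_content, strength)

print(model.predict([[7.0]]))   # inside the range: close to 50 + 10*7 = 120
# Beyond the range, the trend is not continued: both inputs fall past every
# split threshold, land in the same leaves, and get the same prediction
print(model.predict([[11.0]]), model.predict([[13.0]]))
```

A linear model fitted to the same data would continue the trend to 13% Ti (correctly here, but only because the true relationship happens to be linear).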



Fig. 4. Comparison of extrapolation failure and interpolation accuracy in a one-dimensional feature space

This contrasts with linear models, which will extrapolate trends (though often erroneously). The practical implication is that GBDT excels at interpolation within explored parameter space but should be used with caution when predicting outside the training data range.

Hyperparameter Sensitivity and Data Requirements

Key hyperparameters include the learning rate (smaller values improve generalization but require more trees), tree depth (controls interaction complexity vs. overfitting risk), number of estimators, and L1/L2 regularization on leaf weights.
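A routine way to manage this sensitivity is a small cross-validated grid search over these knobs. The sketch below uses scikit-learn's `GradientBoostingRegressor` on synthetic data (note that L1/L2 regularization on leaf weights, as named above, is an XGBoost/LightGBM feature; scikit-learn's estimator exposes a different but related set of controls):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

param_grid = {
    "learning_rate": [0.05, 0.1],  # smaller -> stronger regularization, more trees needed
    "max_depth": [2, 3],           # interaction complexity vs. overfitting risk
    "n_estimators": [100, 300],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best:", search.best_params_)
```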

For small-sample datasets, GBDT can be prone to overfitting unless hyperparameters are tuned with particular care. The most suitable model is determined not only by sample size, but also by the balance between the number of samples and feature dimensionality, as well as the degree of regularization. As a practical guideline, for small datasets (for example $$ N < 50 $$), alternatives such as Gaussian Processes or regularized linear models may be preferable.

Conclusion

GBDT has established itself as a high-efficiency "workhorse" algorithm for intermediate-sized materials datasets with complex non-linear relationships. Its key strengths include computational speed, flexibility in handling heterogeneous data, interpretability through feature importance, and native handling of mixed data types.

However, materials scientists should remain aware of its limitations: the inability to extrapolate beyond training data ranges, sensitivity to key hyperparameter settings, and the lack of intrinsic uncertainty quantification. 

References