Advanced Mass Spectrometry Analysis: Parameter Optimization in XCMS for Untargeted GC-MS and LC-MS Data Processing (Part 2)


This article explores advanced optimization strategies for XCMS parameter tuning, focusing on metaheuristic methods such as Genetic Algorithms (GA) and Bayesian Optimization (BO). Unlike IPO, which relies on isotopologue patterns, these methods navigate complex parameter spaces through iterative search and probabilistic modeling. We compare their strengths and limitations, highlighting trade-offs between exploration capability and computational efficiency. Finally, we discuss emerging deep learning–based approaches that aim to predict optimal parameters directly, suggesting a shift from iterative search toward data-driven automation in metabolomics workflows.


Sivakorn Kanharattanachai

Data Scientist, MI-6 Ltd.

He is an experienced nanomaterial engineer who transitioned to computer engineering, with expertise in text mining, deep learning, and imbalanced data learning. He previously served as a data scientist at Charoen Pokphand Group (CP) in Thailand, where he specialized in time series analysis, satellite image processing, optical character recognition (OCR) development, and automated data systems implementation. He is currently a data scientist at MI-6, focusing on advanced feature extraction and spectral data analysis through deep learning methodologies.

Introduction

In our previous miLab article, Parameter Optimization in XCMS for Untargeted GC-MS (Part 1), we focused on Isotopologue Parameter Optimization (IPO), a label-free method that leverages natural ¹³C isotopologues to tune XCMS parameters for peak picking, retention time correction, and grouping. IPO provides a systematic and reproducible alternative to manual parameter tuning, reducing analyst bias and improving the consistency of peak tables.

Building on that foundation, this article shifts toward advanced optimization strategies for XCMS parameter tuning. When isotopologue-based scoring is not applicable or sufficient, parameter tuning must be approached more generally as a search problem rather than relying on specific chemical features. In particular, we explore Metaheuristic Search Methods, a powerful class of algorithms designed to efficiently navigate large and complex parameter spaces. These approaches include Genetic Algorithms (GA) and Bayesian Optimization (BO), among others. Unlike IPO, which relies on isotopologue patterns as a scoring principle, metaheuristic approaches provide more flexible frameworks that can incorporate a wide range of objective functions and are especially useful when direct analytical solutions are not feasible.

By examining how these algorithms can be applied to XCMS, we aim to highlight their potential to improve parameter selection in both untargeted GC-MS and LC-MS workflows, where the diversity of data types and experimental designs makes robust and automated optimization strategies increasingly necessary.

1. What are Metaheuristic Search Methods?

Metaheuristic search methods are high-level strategies used to find, generate, or select a heuristic that can provide a sufficiently good solution to an optimization problem, especially when the search space is large or complex. Unlike exact algorithms that guarantee finding the optimal solution, metaheuristics aim to find a good solution in a reasonable amount of time. They are particularly useful for problems where finding the absolute best solution is computationally expensive or impossible.

The core idea behind metaheuristics is to intelligently explore the vast landscape of possible solutions. They do this by balancing two key components: intensification and diversification. Intensification involves focusing the search on promising areas of the solution space, while diversification encourages the exploration of new, unvisited areas to avoid getting stuck in local optima.

Fitness Evaluation

A central step in applying metaheuristic search methods for XCMS optimization is the fitness evaluation process, where each candidate parameter set is scored for performance. Commonly used metrics include:

  • Peak Picking Score (PPS): Quality of isotopologue pair detection.
  • Retention Time Correction Score (RCS): Consistency of aligned retention times.
  • Grouping Score (GS): Accuracy of feature grouping across replicates.

These metrics are typically combined into a weighted fitness function to balance detection accuracy, alignment quality, and grouping completeness. This unified evaluation framework ensures that optimization in GA and BO improves multiple aspects of the XCMS pipeline simultaneously.
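As a concrete illustration, such a weighted fitness function can be sketched as follows. The metric values and weights here are hypothetical; in practice, each score would be computed from the peak table produced by running XCMS with the candidate parameters.

```python
# Illustrative sketch: combine PPS, RCS, and GS into one weighted
# fitness score. Weights are hypothetical and would be chosen to
# reflect the priorities of a given study (higher score is better).

def fitness(pps: float, rcs: float, gs: float,
            w_pps: float = 0.5, w_rcs: float = 0.25, w_gs: float = 0.25) -> float:
    """Weighted sum of the three XCMS evaluation metrics."""
    return w_pps * pps + w_rcs * rcs + w_gs * gs

# Example: a candidate parameter set that picks peaks well but groups poorly.
score = fitness(pps=0.9, rcs=0.8, gs=0.4)   # composite score used by GA/BO
```

With this composite score in hand, any candidate parameter set can be reduced to a single number, which is exactly what GA and BO need to compare candidates.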

Key Characteristics

Several characteristics define metaheuristic search methods:

  • Stochasticity: Many metaheuristics incorporate randomness into their search process. This helps in exploring different parts of the solution space and escaping local optima.
  • Approximation: They do not guarantee finding the global optimum but aim to find a near-optimal solution.
  • Flexibility: Metaheuristics can be adapted to solve a wide variety of optimization problems with little modification.
  • Inspiration from Nature: A significant number of metaheuristic algorithms are inspired by natural phenomena, such as the foraging behavior of ants, the process of natural selection, or the physical process of annealing.

In this article, we focus on two of the most widely applied metaheuristic search methods: Genetic Algorithms (GA) and Bayesian Optimization (BO). These methods represent distinct approaches to navigating complex parameter landscapes: GA draws inspiration from natural selection, while BO leverages probabilistic modeling to guide the search efficiently.

2. Genetic Algorithms (GA)

Genetic Algorithms (GA) are among the most widely used metaheuristic methods for optimization problems with complex, non-linear search spaces. Inspired by the principles of natural selection and genetics, GA iteratively evolves a population of candidate parameter sets until it converges toward optimal or near-optimal solutions. In this article, we demonstrate the basic process of applying GA within XCMS parameter optimization, while a more comprehensive introduction to the theory and practice of GA can be found in our miLab article “Basic Introduction to Genetic Algorithms and Hints for Applications”.

How does a GA work?


Figure 1. Workflow of GA for XCMS parameter optimization.

The GA begins with XCMS default settings or a randomly initialized population of parameter sets (e.g., values for ppm, peakwidth, snthresh, and mzwid). Each set is passed into Fitness Evaluation, where XCMS runs with those parameters and produces a score (PPS, RCS, GS, or a composite). The algorithm then performs a Convergence Check: if the score has plateaued (no improvement over recent generations), the process stops and outputs the XCMS optimized settings. If not, GA applies Genetic Operators:

  • Selection: chooses high-performing parameter sets as “parents.”
  • Crossover: recombines parent parameter values to create new “offspring” candidates.
  • Mutation: introduces random perturbations to maintain diversity and avoid local optima.

This loop of evaluate → select → recombine → mutate continues until convergence is achieved, ensuring that XCMS parameters are systematically optimized toward the best possible settings.
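The evaluate → select → recombine → mutate loop above can be sketched in a few lines. This is a toy illustration: `ppm` and `snthresh` are real XCMS parameter names, but the objective here is a stand-in with a known optimum, whereas a real run would call XCMS and score the result (e.g., with a weighted PPS/RCS/GS fitness).

```python
# Minimal GA sketch for tuning two XCMS-style parameters.
# evaluate() is a toy surrogate for "run XCMS + compute fitness",
# peaking at ppm=15, snthresh=6; higher scores are better.
import random

random.seed(0)
BOUNDS = {"ppm": (1.0, 50.0), "snthresh": (1.0, 20.0)}

def evaluate(ind):
    return -((ind["ppm"] - 15) ** 2 + (ind["snthresh"] - 6) ** 2)

def random_individual():
    return {k: random.uniform(*b) for k, b in BOUNDS.items()}

def crossover(a, b):
    # Recombine parent values gene by gene.
    return {k: random.choice((a[k], b[k])) for k in BOUNDS}

def mutate(ind, rate=0.2):
    # Random perturbation, clipped to the parameter bounds.
    for k, (lo, hi) in BOUNDS.items():
        if random.random() < rate:
            ind[k] = min(hi, max(lo, ind[k] + random.gauss(0, (hi - lo) * 0.1)))
    return ind

population = [random_individual() for _ in range(20)]
for generation in range(30):
    population.sort(key=evaluate, reverse=True)
    parents = population[:10]                         # selection (elitist)
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(10)]                  # crossover + mutation
    population = parents + offspring

best = max(population, key=evaluate)                  # optimized settings
```

Because the top half of each generation is carried over unchanged, the best score never degrades, mirroring the convergence check in Figure 1.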

Strengths and Limitations

The strength of GA lies in its ability to escape local optima, making it suitable for high-dimensional parameter landscapes where traditional grid search or IPO might struggle. However, the trade-off is computational cost: running many generations of XCMS on large datasets can be time-intensive.

3. Bayesian Optimization (BO)

Bayesian Optimization (BO) takes a fundamentally different approach to optimizing XCMS parameters compared to GA. Rather than exploring the parameter space through evolutionary operators, BO builds a probabilistic model of the objective function and uses it to strategically select the most informative parameter sets to test. Here, we illustrate the basic process of using BO within XCMS parameter optimization, while a deeper theoretical explanation can be found in Basic Concepts of Bayesian Optimization. Within MI-6’s framework, we have developed proficiency in applying BO, and this expertise allows us to extend the method to metabolomics workflows such as XCMS.

How does BO work?

At the core of BO is a surrogate model, often a Gaussian Process (GP), which estimates the relationship between XCMS parameters (e.g., peakwidth, snthresh, ppm) and performance metrics (e.g., PPS, RCS, GS). The surrogate predicts not only the expected score at untested points but also the uncertainty of those predictions. This dual knowledge allows BO to balance:

  • Exploration — testing uncertain parameter regions.
  • Exploitation — focusing on promising areas.

The decision of where to sample next is governed by an acquisition function, such as Expected Improvement (EI) or Upper Confidence Bound (UCB), which formalizes the trade-off between exploration and exploitation.
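The exploration–exploitation trade-off has a simple closed form in the case of Expected Improvement: given the surrogate's predicted mean and standard deviation at a candidate point, EI can be computed directly. The sketch below uses the standard Gaussian formula for a maximization problem; the example numbers are hypothetical.

```python
# Expected Improvement (EI) for maximization, given the surrogate's
# predicted mean (mu) and standard deviation (sigma) at a candidate
# parameter set, and the best score observed so far.
import math

def expected_improvement(mu: float, sigma: float, best: float) -> float:
    if sigma <= 0.0:                      # no uncertainty: improvement is known
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)   # standard normal pdf
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))          # standard normal cdf
    return (mu - best) * cdf + sigma * pdf

# A highly uncertain point (exploration) can outscore a marginally
# better but near-certain one (exploitation).
ei_uncertain = expected_improvement(mu=0.70, sigma=0.20, best=0.75)
ei_certain   = expected_improvement(mu=0.76, sigma=0.01, best=0.75)
```

Here the uncertain candidate wins despite a lower predicted mean, which is exactly how EI keeps the search from fixating on one region too early.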

In XCMS optimization, BO is particularly powerful when the parameter space is large and evaluations are computationally expensive. Instead of exhaustively testing hundreds of parameter combinations, BO can identify near-optimal settings with relatively few iterations. For example, BO can efficiently narrow in on the ideal peakwidth or snthresh values by modeling how isotopologue pair quality changes across the search space.


Figure 2. Workflow of BO for XCMS parameter optimization.

In Figure 2, BO begins with a few randomly chosen parameter sets to build an initial surrogate model. At each iteration, the acquisition function, applied to the surrogate's predictions, selects the most promising new parameter set. XCMS is then run with those parameters, and the resulting score is fed back to refine the surrogate. Over time, the algorithm homes in on the optimal parameter set while minimizing unnecessary trials. A typical BO diagram shows the surrogate curve fitting the objective function, with the acquisition function guiding new sampling points toward the global optimum.
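The loop in Figure 2 can be sketched end to end. To keep the sketch self-contained, a deliberately crude surrogate stands in for a full Gaussian Process: it predicts the score of an untested value from its nearest observation and treats distance to that observation as uncertainty. The objective is a toy stand-in for "run XCMS and score the result", and the single tuned parameter is labeled loosely after peakwidth; none of this is a real XCMS or GP implementation.

```python
# Toy BO loop over one XCMS-style parameter, guided by an Upper
# Confidence Bound (UCB) acquisition on a crude surrogate.
import math

def objective(x):
    # Stand-in for "run XCMS, compute fitness": toy peak at x = 12.
    return math.exp(-((x - 12.0) ** 2) / 20.0)

observed = [(x, objective(x)) for x in (2.0, 25.0)]    # initial design

def surrogate(x):
    # Nearest-observation prediction; farther away -> more uncertain.
    nearest_x, nearest_y = min(observed, key=lambda p: abs(p[0] - x))
    return nearest_y, abs(x - nearest_x) / 25.0

def ucb(x, kappa=2.0):
    mu, sigma = surrogate(x)
    return mu + kappa * sigma                           # optimism bonus

candidates = [i * 0.5 for i in range(1, 60)]            # grid over [0.5, 29.5]
for _ in range(10):                                     # BO iterations
    x_next = max(candidates, key=ucb)                   # acquisition step
    observed.append((x_next, objective(x_next)))        # "run XCMS", update model

best_x, best_y = max(observed, key=lambda p: p[1])
```

Even with this crude surrogate, the optimism bonus first sends samples into unexplored gaps and then concentrates them around the emerging peak, which is the qualitative behavior a GP-based BO exhibits on real XCMS scores.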

Strengths and Limitations

BO offers a highly efficient strategy for tuning XCMS parameters by building a surrogate model that predicts the performance of unexplored settings, allowing it to reach high-quality solutions with relatively few evaluations compared to brute-force methods. This balance of exploration and exploitation makes BO particularly useful when computational resources are limited or when parameter evaluation is costly. However, its effectiveness depends on the accuracy of the surrogate model, which may degrade in very high-dimensional or highly irregular parameter spaces such as those in XCMS. Moreover, while BO reduces the number of experiments needed, each iteration requires computational overhead for updating the surrogate model, which can become demanding for large-scale metabolomics datasets.

Table 1. Comparison of GA and BO for XCMS parameter optimization.

| Aspect | Genetic Algorithms (GA) | Bayesian Optimization (BO) |
| --- | --- | --- |
| Search Mechanism | Evolutionary process with genetic operators: selection, crossover, mutation. | Probabilistic model-based search using surrogate models (e.g., Gaussian Processes). |
| Exploration vs. Exploitation | Strong exploration via mutation and crossover; may converge slowly. | Uses an acquisition function to balance exploration and exploitation efficiently. |
| Scoring Metrics | Evaluates a population of parameter sets using fitness scores (e.g., PPS, RCS, GS). | Builds a surrogate model of the score function and updates it after each experiment. |
| Convergence Speed | Moderate; requires many generations for stability. | Fast; fewer experiments required due to model-guided search. |
| Computational Cost | High, due to repeated fitness evaluation across large populations. | Lower overall; focuses experiments on promising regions, though each iteration adds surrogate-update overhead. |
| Interpretability | Easy to understand, but stochastic results vary between runs. | High; the surrogate model provides insight into the parameter–performance landscape. |
| Best Use Case in XCMS | Broad, complex parameter spaces where global exploration is needed. | Expensive optimization tasks with a limited evaluation budget. |

4. Toward Deep Learning–Based Parameter Prediction

Despite their strengths, both GA and BO still rely on iterative evaluation of parameter sets, which can be computationally demanding for large-scale datasets. To address this limitation, MI-6 is currently developing a deep learning–based optimization framework for XCMS. This strategy leverages large-scale data simulation to generate diverse chromatographic scenarios, which are then used to train neural networks capable of learning complex parameter–performance relationships. Once trained, the model can rapidly predict optimal parameter settings for new datasets, reducing the need for extensive iterative evaluations. By combining simulation, training, and prediction, this approach aims to deliver a faster, more adaptive, and highly scalable solution for untargeted GC-MS and LC-MS data processing.

References

  1. C. Jirayupat, “Advanced Mass Spectrometry Analysis: Machine Learning Applications in GC-MS and LC-MS Data Processing”, miLab, MI-6, https://mi-6.co.jp/milab/article/t0025en/#h5828db6fc9
  2. E. H. Houssein, M. K. Saeed, and G. Hu, “Metaheuristics for Solving Global and Engineering Optimization Problems: Review, Applications, Open Issues and Challenges,” Archives of Computational Methods in Engineering, vol. 31, pp. 4485–4519, 2024. DOI: 10.1007/s11831-024-10168-6.
  3. H. Treutler and S. Neumann, “Prediction, Detection, and Validation of Isotope Clusters in Mass Spectrometry Data,” Metabolites, vol. 6, no. 4, p. 37, 2016. DOI: 10.3390/metabo6040037.
  4. S.S. Saima, “Hints for introducing and applying genetic algorithms”, miLab, MI-6, https://milab.mi-6.co.jp/article/t0049
  5. L. Scrucca, “GA: A Package for Genetic Algorithms in R,” Journal of Statistical Software, vol. 53, pp. 1–37, 2013. DOI: 10.18637/jss.v053.i04.
  6. MI-6 MLR Team, “Basic concepts of Bayesian optimization”, miLab, MI-6, https://milab.mi-6.co.jp/article/t0013
  7. T. Huan, H. Yu, P. Biswas, E. Rideout, and Y. Cao, “Bayesian optimization of separation gradients to maximize the performance of untargeted LC-MS,” 2023. DOI: 10.21203/rs.3.rs-3338667/v1.