Advanced Mass Spectrometry Analysis: Parameter Optimization in XCMS for Untargeted GC-MS and LC-MS Data Processing (Part 1)

This article addresses the critical challenge of XCMS parameter tuning for GC-MS/LC-MS by introducing Isotopologue Parameter Optimization (IPO) as a systematic, data-driven alternative to subjective manual workflows. By using natural ¹³C isotopologue patterns as an internal quality metric and employing an intelligent search strategy based on Design of Experiments and Response Surface Modeling, IPO automates the optimization of peak picking, retention time correction, and grouping. This process yields a more robust and reproducible peak table, providing a stronger foundation for subsequent statistical analysis and compound identification.

Sivakorn Kanharattanachai

MI-6 Ltd. Data Scientist

He is an experienced nanomaterial engineer who transitioned to computer engineering, with expertise in text mining, deep learning, and imbalanced data learning. He previously served as a data scientist at Charoen Pokphand Group (CP) Thailand, where he specialized in time series analysis, satellite image processing, optical character recognition (OCR) development, and automated data systems implementation. He is currently working as a data scientist at MI-6, focusing on advanced feature extraction and spectral data analysis through deep learning methodologies.

Introduction

In our previous miLab article, Advanced Mass Spectrometry Analysis: Machine Learning Applications in GC-MS and LC-MS Data Processing, we outlined the key challenges in untargeted MS workflows. One of the most important intermediate outcomes before molecular identification is the peak table, which provides the foundation for quantification, annotation, and interpretation.

However, generating accurate and reproducible peak tables from GC-MS and LC-MS data has long been hindered by several issues, including variability in peak shapes, retention time drifts across runs, and overlapping chromatographic signals, as discussed previously. These challenges often result in analyst-dependent variability and limited reproducibility, particularly when manual parameter tuning is required.

To address these challenges, XCMS has become one of the most widely used open-source tools for preprocessing untargeted GC-MS and LC-MS data. XCMS provides an integrated workflow covering peak detection, retention time correction, grouping, and peak filling, contributing substantially to improved reproducibility. However, its performance depends strongly on the tuning of key parameters; suboptimal settings can compromise both accuracy and consistency. In this series, we focus on the use of XCMS for untargeted GC-MS data analysis, discuss the practical challenges and constraints encountered when applying it, and present systematic approaches for obtaining high-precision, reproducible peak tables.

This article is presented as Part 1 of our series on XCMS optimization methods, where we introduce Isotopologue Parameter Optimization (IPO) as a systematic approach for parameter tuning. In the following parts, we will extend the discussion to other strategies such as metaheuristic search methods and AI-driven optimization.

XCMS: An Open-Source R Package for GC-MS/LC-MS Data Processing

XCMS is an open-source R package that was originally developed for LC-MS and has since become one of the most widely adopted platforms for untargeted mass spectrometry data processing. Its modular design provides automated workflows for peak detection, retention time correction, and peak grouping, all of which are critical steps in generating reproducible peak tables.

For peak detection, XCMS implements algorithms such as centWave, which is specifically designed for high-resolution MS data. The centWave algorithm uses a continuous wavelet transform (CWT) to detect chromatographic peaks with high sensitivity, even when peaks are narrow, overlapping, or buried in noise. For retention time correction, algorithms like obiwarp correct systematic drift across samples to ensure consistent feature matching.

XCMS has since been successfully extended to GC-MS, where similar challenges arise. Because of its flexibility and open-source nature, XCMS has been widely applied across diverse fields, including metabolomics, environmental analysis, food chemistry, toxicology, and clinical diagnostics, making it a cornerstone tool for untargeted mass spectrometry research.

The Challenge of Using XCMS in GC-MS Data Analysis

Although XCMS provides a powerful and flexible framework for preprocessing GC-MS/LC-MS data, its performance is highly dependent on parameter settings. Parameters such as peak width, signal-to-noise threshold, and m/z tolerance directly influence the quality of the resulting peak table.

Historically, parameter tuning has been carried out manually by expert analysts, who iteratively adjust values and validate results through visual inspection of extracted ion chromatograms (EICs). While this approach allows flexibility, it also introduces several major problems:

  • Subjective – Results depend heavily on the analyst’s experience and intuition.
  • Time-consuming – Manual adjustment and validation are impractical for large datasets.
  • Not reproducible – Different analysts may arrive at different peak tables from the same raw data.

When parameters are poorly set, XCMS can miss peaks, report false positives, or misalign retention times. This sensitivity makes parameter optimization one of the critical bottlenecks in applying XCMS to GC-MS data, and highlights the need for systematic and automated methods to replace manual trial and error.

Fig.1 XCMS’s peak picking example code

Why the Parameters Are Difficult to Set

XCMS exposes a wide range of parameters that govern peak detection, integration, and alignment. Table 1 summarizes the most important parameters of the main XCMS functions.

Table 1. Key XCMS parameters for peak picking, retention time correction, and grouping, and the typical effects when values are set too low or too high.

Practical peak picking interpretation:

  • ppm and mzdiff control m/z accuracy and peak separation.
  • peakwidth and snthresh strongly affect peak detection sensitivity.
  • prefilter and noise reduce false positives but risk discarding weak analytes.

Figure 2 shows the effect of XCMS parameter tuning on peak detection in a GC-MS sample of organic bergamot essential oil. Increasing ppm raises the peak count, but excessive values inflate false positives (Figure 2, top-right). Increasing peakwidth (min) lowers the peak count: values that are too small over-segment peaks (false positives), while values that are too large miss narrow peaks (false negatives) (Figure 2, bottom-left). Raising snthresh (the S/N threshold) also reduces the peak count: low thresholds admit noise, while high thresholds discard true peaks (Figure 2, bottom-right). The red markers highlight the optimal points, which were manually labeled by field experts.

Fig.2 Effect of XCMS parameter settings on peak detection in a GC-MS sample.
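The snthresh trade-off described for Figure 2 can be reproduced on a toy signal. The sketch below is not centWave: it simply counts local maxima whose height, divided by an assumed noise level, reaches the threshold. The function name `count_peaks`, the intensity values, and `NOISE_LEVEL` are all hypothetical and chosen only for illustration.

```python
# Toy stand-in for an S/N filter; NOT the centWave algorithm used by XCMS.
# All values below are made up for illustration.

NOISE_LEVEL = 2.0  # assumed baseline noise amplitude (hypothetical)

# Synthetic chromatogram trace: three clear peaks plus small noise bumps.
signal = [1, 2, 1, 3, 50, 4, 1, 2, 1, 12, 2, 1, 3, 1, 120, 5, 1]


def count_peaks(trace, noise_level, snthresh):
    """Count local maxima whose height / noise_level is at least snthresh."""
    count = 0
    for i in range(1, len(trace) - 1):
        is_local_max = trace[i] > trace[i - 1] and trace[i] >= trace[i + 1]
        if is_local_max and trace[i] / noise_level >= snthresh:
            count += 1
    return count


for threshold in (1, 3, 10, 100):
    n = count_peaks(signal, NOISE_LEVEL, threshold)
    print(f"snthresh={threshold:>3}: {n} peaks")
```

With a very low threshold the noise bumps are counted as peaks, and with a very high threshold even the real peaks disappear, which mirrors the behavior described for the bottom-right panel.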

Optimizing XCMS Parameters: Isotopologue Parameter Optimization (IPO)

Despite its flexibility, XCMS lacks a standardized or automated procedure for parameter selection. Several strategies have emerged to address this, ranging from manual tuning to machine learning-based optimization.

IPO is an automated package designed to tune parameters for liquid chromatography-high-resolution mass spectrometry (LC-HRMS) workflows. Its primary goal is to maximize the detection of genuine chemical features while minimizing noise. The core of its strategy is leveraging the predictable relationship between the natural abundance of carbon isotopes, specifically ¹²C and ¹³C, as a powerful internal reference. The principle is that any real analyte peak will have a predictable isotopologue pattern, so the quality of a parameter set is judged by how well it detects these patterns. Improving peak detection accuracy yields a cleaner feature set for downstream processing, which in turn enhances the robustness and reproducibility of retention time correction and grouping.

Use of Isotopologue Patterns in IPO

Virtually all organic molecules contain carbon. An isotopologue composed only of the abundant ¹²C creates the primary, most intense peak (M) which is the main signal you want to find. However, random noise can sometimes mimic this signal. To validate that a signal is real, IPO looks for a tell-tale "watermark" that is the smaller, but predictable M+1 peak, produced by an isotopologue with one ¹³C atom.

Fig.3 Explanation of carbon isotope
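To make the size of the M+1 "watermark" concrete, the following minimal sketch estimates the expected M+1/M intensity ratio from carbon alone, assuming a natural ¹³C abundance of about 1.07% and a simple binomial model. It ignores contributions from other elements (for example ²H or ¹⁵N), and the function names are our own, not part of IPO or XCMS.

```python
from math import comb

P13C = 0.0107  # approximate natural abundance of carbon-13 (assumption)


def isotopologue_fraction(n_carbons, k, p=P13C):
    """Probability that exactly k of n carbon atoms are 13C (binomial model)."""
    return comb(n_carbons, k) * p**k * (1 - p) ** (n_carbons - k)


def m1_to_m_ratio(n_carbons, p=P13C):
    """Expected M+1 / M intensity ratio, considering carbon only."""
    m = isotopologue_fraction(n_carbons, 0, p)       # all-12C isotopologue
    m_plus_1 = isotopologue_fraction(n_carbons, 1, p)  # exactly one 13C
    return m_plus_1 / m


for n in (6, 10, 20):
    print(f"C{n}: expected M+1/M ratio = {m1_to_m_ratio(n):.3f}")
```

Under this model the ratio grows by roughly 1.1% per carbon atom, which is why the M+1 peak is small but predictable for typical analytes.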

In conventional XCMS workflows, analysts manually inspect extracted ion chromatograms (EICs) to adjust parameters. IPO replaces this subjective judgment with an objective, isotope-based metric. As noted above, a genuine peak from a carbon-containing compound is expected to be accompanied by an M+1 peak derived from ¹³C. IPO quantifies this isotopic pattern as a score, allowing the quality of parameter settings to be evaluated quantitatively. This approach enables consistent condition selection without relying on individual experience or intuition, thereby improving both reproducibility and efficiency. Let us now examine how this mechanism works in detail.

IPO optimizes XCMS parameters by following the three ordered stages of the XCMS workflow (see Figure 4):

  1. Peak Picking
  2. Retention Time Correction
  3. Grouping

At each stage, IPO runs an intelligent search to find the best parameter settings. Instead of relying on brute-force grid searches, IPO uses a “try → score → zoom-in → pick best” strategy that balances efficiency and accuracy. This intelligent strategy is powered by two statistical techniques: Design of Experiments (DoE) and Response Surface Modeling (RSM).

  • Design of Experiments (DoE):

IPO begins by testing a small but carefully chosen set of parameter combinations. Instead of exhaustively evaluating all possible values, DoE ensures that the selected trials are representative of the entire parameter space. This gives IPO an informed “map” of where the promising regions lie.

  • Response Surface Modeling (RSM):

Using the results of the initial trials, IPO constructs a mathematical response surface. Imagine a 3D landscape where:

  • The horizontal plane represents parameter combinations (e.g., peakwidth and snthresh).
  • The vertical axis represents the quality score (e.g., PPS, RCS, or GS depending on the stage).

RSM interpolates this surface to predict which regions of the parameter space are most likely to contain the optimum. IPO then focuses further trials on those high-potential regions, continuously refining the surface until it converges on the optimal parameter set.
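The "try → score → zoom-in → pick best" loop can be sketched in one dimension. The example below is a deliberately simplified illustration, not IPO's implementation: IPO optimizes several parameters at once in R, whereas this sketch tunes a single hypothetical parameter with a three-level design and a quadratic response surface. The scoring function `toy_score` and all function names are assumptions made for the example.

```python
def quadratic_vertex(xs, ys):
    """Fit y = a*x^2 + b*x + c exactly through three points.

    Returns (a, vertex_x); vertex_x is None when the fit is a straight line.
    """
    (x0, x1, x2), (y0, y1, y2) = xs, ys
    slope01 = (y1 - y0) / (x1 - x0)
    slope12 = (y2 - y1) / (x2 - x1)
    a = (slope12 - slope01) / (x2 - x0)
    b = slope01 - a * (x0 + x1)
    vertex = -b / (2 * a) if a != 0 else None
    return a, vertex


def optimize_1d(score, low, high, rounds=4):
    """Toy 'try -> score -> zoom-in -> pick best' loop for one parameter."""
    best = None
    for _ in range(rounds):
        xs = (low, (low + high) / 2, high)      # DoE: small three-level design
        ys = tuple(score(x) for x in xs)        # score each trial
        a, vertex = quadratic_vertex(xs, ys)    # RSM: fit a response surface
        if a < 0 and low <= vertex <= high:
            best = vertex                       # predicted maximum of the surface
        else:
            best = xs[ys.index(max(ys))]        # fall back to the best trial
        half_width = (high - low) / 4           # zoom in around the best point
        low, high = best - half_width, best + half_width
    return best


def toy_score(peakwidth_min):
    """Hypothetical quality score that peaks at peakwidth_min = 7."""
    return 100 - (peakwidth_min - 7) ** 2


print(optimize_1d(toy_score, low=0, high=20))
```

Only three trials are scored per round, yet the search narrows quickly toward the optimum, which is the efficiency argument for combining DoE with RSM instead of running an exhaustive grid search.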

Fig.4 IPO workflow

Optimization Workflow at Each Stage

  1. Peak Picking

    The goal here is to detect as many genuine chemical features as possible while ignoring noise. A real chemical signal shows the predictable ¹²C/¹³C pattern described above, whereas random noise does not. IPO first identifies a candidate signal, the tall M peak, and then evaluates the current parameter set by how well it also finds the small M+1 peak at its expected location and with a plausible relative intensity. If both the main peak and its ¹³C "watermark" are found, the signal is considered real and the parameter set scores higher. If the main peak is found without its expected M+1 partner, it is likely noise and the settings score lower. IPO focuses on the key parameters of the peak detection algorithm (centWave), such as peakwidth, ppm, snthresh, and prefilter. For each parameter combination it tests, IPO calculates a Peak Picking Score (PPS). By running dozens of trials, IPO builds a model of how the parameters affect the PPS and then selects the parameter set that maximizes this score, so that the settings are tuned to detect real, chemically consistent signals.

  2. Retention Time Correction 

    As in peak picking, IPO applies the same “try → score → zoom-in → pick best” strategy, but here the focus shifts to aligning peaks across runs. Even with a stable GC-MS/LC-MS system, retention time drift can occur, making identical features appear at slightly different RTs in different samples. IPO tests alignment settings such as profStep, center, and span, then scores them based on how tightly features line up in pooled QC samples. The scoring metric, the Retention Time Correction Score (RCS), is higher when RT variance across replicates is lower. The end result is a set of parameters that produces the most consistent and reliable alignment possible.

  3. Grouping 

    The final stage is grouping, which, like the earlier stages, uses the same optimization cycle but with a different goal: building a clean feature matrix. After peaks are picked and aligned, IPO evaluates grouping parameters such as bw, mzwid, and minfrac. It scores these settings using the Grouping Score (GS), which reflects how well QC samples contribute exactly one peak to each group. A high GS indicates feature groups free from missing or duplicate peaks. By maximizing GS, IPO ensures that the final data matrix is accurate and ready for downstream analysis.
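The three stage scores can be illustrated with simplified stand-ins. The functions below follow the ideas described above (isotope-supported peaks, tight retention times across replicates, exactly one peak per sample in each group), but they are our own toy formulas, not the published PPS, RCS, and GS definitions from the IPO paper (Libiseller et al., 2015), which are more elaborate and, for example, treat low-intensity peaks separately. All names, tolerances, and data values are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev

C13_SHIFT = 1.003355  # mass difference between 13C and 12C, in Da


def toy_pps(peaks, mz_tol=0.005, rt_tol=1.0, max_ratio=0.5):
    """Peak picking: reward peaks that belong to an M / M+1 pair.

    peaks: list of (mz, rt_seconds, intensity) tuples from one run.
    """
    if not peaks:
        return 0.0
    supported = set()
    for i, (mz_a, rt_a, int_a) in enumerate(peaks):
        for j, (mz_b, rt_b, int_b) in enumerate(peaks):
            is_m_plus_1 = (
                abs(mz_b - mz_a - C13_SHIFT) <= mz_tol
                and abs(rt_b - rt_a) <= rt_tol
                and 0 < int_b <= max_ratio * int_a
            )
            if is_m_plus_1:
                supported.update((i, j))
    return len(supported) ** 2 / len(peaks)


def toy_rcs(feature_rts):
    """Retention time correction: reward small RT spread across replicates.

    feature_rts: one list of retention times (s) per feature, one per run.
    """
    spreads = [pstdev(rts) for rts in feature_rts if len(rts) > 1]
    if not spreads:
        return 0.0
    avg_spread = mean(spreads)
    return float("inf") if avg_spread == 0 else 1.0 / avg_spread


def toy_gs(groups, samples):
    """Grouping: fraction of groups with exactly one peak from every sample.

    groups: one list of sample names per feature group, one entry per peak.
    """
    if not groups:
        return 0.0

    def reliable(group):
        counts = Counter(group)
        return set(counts) == set(samples) and all(c == 1 for c in counts.values())

    return sum(reliable(g) for g in groups) / len(groups)


# Hypothetical example data.
clean = [(150.0, 60.0, 1000), (151.0034, 60.1, 80),
         (200.0, 90.0, 500), (201.0033, 90.0, 50)]
noisy = clean + [(170.2, 30.0, 40), (310.7, 75.0, 35),
                 (95.1, 120.0, 60), (402.9, 45.0, 20)]
print("PPS clean vs noisy:", toy_pps(clean), toy_pps(noisy))

qc = ["QC1", "QC2", "QC3"]
groups = [["QC1", "QC2", "QC3"], ["QC1", "QC1", "QC2", "QC3"],
          ["QC1", "QC3"], ["QC2", "QC1", "QC3"]]
print("GS:", toy_gs(groups, qc))
```

In this toy setting, a parameter set that admits unsupported noise peaks, leaves retention times scattered, or produces groups with missing or duplicate peaks receives a lower score, which is the behavior the optimizer exploits at each stage.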

Summary and Outlook

IPO represents one of the most systematic and widely adopted approaches for automating XCMS parameter optimization. By exploiting natural ¹³C isotopologue information, it provides a powerful way to fine-tune peak picking, retention time correction, and grouping in a structured and reproducible manner. However, IPO is not the only strategy available.

Key Highlights of IPO

  • Label-free: relies on natural isotopologues, no spiking needed.
  • Efficient: avoids massive grid search.
  • Best use case: good-quality LC-MS/GC-MS data.
  • Limitations: if data quality is poor, isotopologue signals may be noisy, and IPO might suggest unrealistic settings. Always validate with a few EICs afterward.

In the next part of this series, we will turn to Metaheuristic Search Methods, such as Genetic Algorithms, Particle Swarm Optimization, and Bayesian Optimization. Unlike IPO, which leverages isotopologue consistency as its guiding principle, these methods rely on iterative search strategies inspired by nature or probability theory. They open the door to flexible optimization even when isotopologue information is sparse or when datasets exhibit more complex parameter landscapes.

References

  1. C. Jirayupat, “Advanced Mass Spectrometry Analysis: Machine Learning Applications in GC-MS and LC-MS Data Processing”, miLab, MI-6, https://mi-6.co.jp/milab/article/t0025en/#h5828db6fc9
  2. G. Baccolo, B. Quintanilla-Casas, S. Vichi, D. Augustijn, and R. Bro, "From untargeted chemical profiling to peak tables – A fully automated AI driven approach to untargeted GC-MS", TrAC Trends in Analytical Chemistry, vol. 145, Elsevier, 2021, p. 116451, DOI: 10.1016/j.trac.2021.116451.
  3. R. Tautenhahn, G. J. Patti, D. Rinehart, and G. Siuzdak, "XCMS Online: A Web-Based Platform to Process Untargeted Metabolomic Data", Analytical Chemistry, vol. 84, no. 11, pp. 5035–5039, American Chemical Society, 2012, DOI: 10.1021/ac300698c.
  4. H. Gowda, J. Ivanisevic, C. H. Johnson, M. E. Kurczy, H. P. Benton, D. Rinehart, T. Nguyen, J. Ray, J. Kuehl, B. Arevalo, P. D. Westenskow, J. Wang, A. P. Arkin, A. M. Deutschbauer, G. J. Patti, and G. Siuzdak, "Interactive XCMS Online: Simplifying Advanced Metabolomic Data Processing and Subsequent Statistical Analyses", Analytical Chemistry, vol. 86, no. 14, pp. 6931–6939, American Chemical Society, 2014, DOI: 10.1021/ac500734c.
  5. C. Jirayupat, K. Nagashima, T. Hosomi, T. Takahashi, W. Tanaka, B. Samransuksamer, G. Zhang, J. Liu, M. Kanai, and T. Yanagida, "Image Processing and Machine Learning for Automated Identification of Chemo-/Biomarkers in Chromatography–Mass Spectrometry", Analytical Chemistry, vol. 93, no. 44, pp. 14708–14715, American Chemical Society, 2021, DOI: 10.1021/acs.analchem.1c03163.
  6. O. E. Albóniga, O. González, R. M. Alonso, et al., "Optimization of XCMS parameters for LC–MS metabolomics: an assessment of automated versus manual tuning and its effect on the final results", Metabolomics, vol. 16, p. 14, 2020, DOI: 10.1007/s11306-020-1636-9.
  7. Fundamentals of GC/MS: Mass Number and Isotope, SHIMADZU, https://www.shimadzu.com/an/service-support/technical-support/analysis-basics/gcms/fundamentals/what/mass_number_isotope.html
  8. G. Libiseller, M. Dvorzak, U. Kleb, et al., "IPO: a tool for automated optimization of XCMS parameters", BMC Bioinformatics, vol. 16, p. 118, 2015, DOI: 10.1186/s12859-015-0562-8.