2026.07.02

Automating Spectral Peak-Detection Parameter Selection Through Effective Parameter Space Discovery

Peak detection turns raw spectra into interpretable information. Threshold-based methods are easy to interpret and flexible, but they depend on choosing detection parameters that suit each spectrum, a choice that has no fixed guideline and typically relies on expert judgement. MI-6 developed a model that combines a threshold-based detector with a Random Forest model to automate parameter selection based on spectral characteristics. Trained on only 2,000 generated spectra and validated on more than 100 unseen experimental spectra, it reaches an F1-score of about 0.94 on generated spectra and 0.952, 0.946, and 0.893 on experimental XRD, GC–MS, and Raman spectra, respectively.

I am a master's student at the Institute of Science Tokyo specializing in time series data analysis and machine learning. I am currently working as a data scientist intern at MI-6, where I focus on developing spectral analysis techniques, including preprocessing, peak detection, and peak grouping, for integrating miHub and LA. My experience covers the analysis of various spectral data types, including XRD, FTIR, and Raman spectra.

1. The challenge of setting peak-detection parameters
2. Discovering the “effective parameter space” and building the model
3. Performance and benchmarking
4. Conclusion
References

1. The challenge of setting peak-detection parameters

Peak detection is the step that turns a raw spectrum into interpretable information — peak positions, intensities, and widths that can be linked to structure, composition, or concentration. A threshold-based detector works by identifying candidate local maxima and keeping only those that satisfy criteria defined by detection parameters. The strengths of this approach are that it is transparent, flexible, and does not require a large volume of training data.

The real difficulty, however, lies in setting the parameters themselves. A widely used threshold-based peak-detection function, SciPy's find_peaks (the detector studied in Reference [4]), offers up to eight tunable parameters, covering criteria such as minimum peak height, minimum distance between peaks, peak prominence, and peak width. Most of these parameters take continuous real values, and there is no universal guideline or default that tells you which value a given spectrum requires.
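The filtering idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code of any particular library: the helper names `local_maxima` and `detect_peaks`, the two criteria shown (`min_height`, `min_distance`), and the toy spectrum are all hypothetical.

```python
def local_maxima(y):
    """Indices of samples strictly higher than both neighbours."""
    return [i for i in range(1, len(y) - 1) if y[i - 1] < y[i] > y[i + 1]]


def detect_peaks(y, min_height=None, min_distance=1):
    """Keep local maxima that pass a height and a spacing criterion.

    Hypothetical helper for illustration only.
    """
    candidates = local_maxima(y)
    if min_height is not None:
        candidates = [i for i in candidates if y[i] >= min_height]
    # Enforce spacing by accepting the tallest candidates first.
    kept = []
    for i in sorted(candidates, key=lambda i: y[i], reverse=True):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)


spectrum = [0, 1, 0, 5, 0, 2, 6, 1, 0, 1, 0]

all_maxima = detect_peaks(spectrum)                            # [1, 3, 6, 9]
tall_only = detect_peaks(spectrum, min_height=2)               # [3, 6]
spaced = detect_peaks(spectrum, min_height=2, min_distance=4)  # [6]
```

With no criteria every local maximum is returned; each added criterion removes candidates, and the result depends on the combination of values rather than on any one of them.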

More importantly, the parameters interact with one another: the optimal value of one parameter depends on the values of the others. For example, when prominence is changed, the optimal width may shift as well, so the parameters cannot be tuned one at a time in isolation. Finding suitable values is therefore a continuous, multi-dimensional search that depends on the specific character of each spectrum, which makes manual tuning difficult and impossible to scale when many spectra must be processed.

Among these parameters, the ones with the greatest influence on detection results are:

Prominence: the vertical distance between a peak’s apex and its lowest contour line, i.e. the higher of the two bases on either side of the peak. Because it measures how far a peak stands out from its surroundings rather than its absolute height, it is effective at separating real peaks from noise.
Width: the horizontal extent of the peak measured at a chosen height, reflecting how broad the peak is. At a relative height of 0.5 it is the full width at half prominence, which coincides with the full width at half maximum (FWHM) when the peak sits on a zero baseline.
Relative height: the fraction of the prominence, measured down from the apex, at which the width is read, ranging from 0 (at the apex) to 1 (at the base). It does not describe the peak directly but controls where the width is read, and so is coupled to the width parameter.
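The three definitions above can be checked on small hand-made signals. The following is a simplified sketch, assuming the definitions exactly as stated; the function names `prominence` and `width` are illustrative, and unlike a production implementation the width search here is not restricted to the interval between the peak's bases.

```python
def prominence(y, peak):
    """Height of y[peak] above the higher of its left and right bases."""
    bases = []
    for step in (-1, 1):
        i, lowest = peak, y[peak]
        # Walk outward until a sample higher than the peak, or the signal edge.
        while 0 <= i < len(y) and y[i] <= y[peak]:
            lowest = min(lowest, y[i])
            i += step
        bases.append(lowest)
    return y[peak] - max(bases)


def width(y, peak, rel_height=0.5):
    """Width measured `rel_height` of the prominence below the apex."""
    level = y[peak] - rel_height * prominence(y, peak)
    edges = []
    for step in (-1, 1):
        i = peak
        # Step outward while the next sample is still above the evaluation level.
        while 0 <= i + step < len(y) and y[i + step] > level:
            i += step
        j = i + step
        if 0 <= j < len(y) and y[i] != y[j]:
            # Linear interpolation between the samples around the crossing.
            edges.append(i + step * (y[i] - level) / (y[i] - y[j]))
        else:
            edges.append(float(i))
    return edges[1] - edges[0]


triangle = [0, 1, 2, 3, 4, 3, 2, 1, 0]  # isolated peak at index 4
shoulder = [0, 5, 3, 4, 0]              # small peak at index 3 beside a tall one

triangle_prominence = prominence(triangle, 4)    # 4
shoulder_prominence = prominence(shoulder, 3)    # 1: height 4, but its base is at 3
half_width = width(triangle, 4, rel_height=0.5)  # 4.0
base_width = width(triangle, 4, rel_height=1.0)  # 8.0
```

The shoulder example shows why prominence is a relative measure: the peak has height 4 but stands only 1 unit above its surroundings. The two width values show the coupling between width and relative height: the same peak yields 4.0 or 8.0 depending on where the width is read.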

A major strength of the model introduced in the next section is its ability to propose parameters that detect both high- and low-intensity peaks within the same spectrum while minimizing false positives (noise mistaken for real peaks), a balance between capturing faint but genuine signals and rejecting background noise that is notoriously difficult to achieve with other methods. This capability is crucial for untargeted analysis, where it is impossible to know beforehand which peaks will be important. Capturing as many true peaks as possible up front reduces the risk that critical signals are lost before downstream screening and feature selection during post-processing.

How to set these values automatically for each spectrum — without manual tuning — first requires understanding what an “optimal” value actually looks like, which is the subject of the next section.

Fig.1 Schematic illustrations of the three main peak properties used as detection parameters: Prominence, Width, and Relative height. The figure is partly modified from Reference [4].

2. Discovering the “effective parameter space” and building the model

By studying the relationship between parameter values and detection performance, MI-6 made a key observation: each spectrum has not one workable set of parameters but many. In other words, near-optimal performance does not sit at a single point in parameter space but spans a region, which we call the effective parameter space. This region has a complex shape and shifts or changes size according to the spectrum’s characteristics, with simpler spectra tending to have broader regions.

Fig.2 (a,c) Different spectra have (b,d) distinct effective parameter spaces within the F1-score maps across combinations of parameters, such as prominence vs. width. The F1-score reflects peak detection performance, with 0 being the worst and 1 being perfect detection. The figure is partly modified from Reference [4].

This finding has two implications. On one hand, it reflects the complexity of the problem, because a single “best” value cannot fully represent the truth, and the relationship between a spectrum’s characteristics and its effective region is non-linear and complex. On the other hand, it is helpful for building a model, because the model does not need to predict an exactly correct value — it only needs to choose a value that falls inside the effective region in a way that matches the character of each spectrum.
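The effective-region idea can be made concrete with a small grid search. The sketch below is illustrative only and rests on several assumptions that are not from the study: a synthetic two-peak spectrum with a deterministic ripple standing in for noise, simplified `prominence` and `width` helpers, a matching tolerance of 2 samples, and a cut-off of 95% of the best F1-score to define the region.

```python
import math


def local_maxima(y):
    return [i for i in range(1, len(y) - 1) if y[i - 1] < y[i] > y[i + 1]]


def prominence(y, peak):
    """Height of y[peak] above the higher of its left and right bases."""
    bases = []
    for step in (-1, 1):
        i, lowest = peak, y[peak]
        while 0 <= i < len(y) and y[i] <= y[peak]:
            lowest = min(lowest, y[i])
            i += step
        bases.append(lowest)
    return y[peak] - max(bases)


def width(y, peak, rel_height=0.5):
    """Width at `rel_height` of the prominence (linear interpolation)."""
    level = y[peak] - rel_height * prominence(y, peak)
    edges = []
    for step in (-1, 1):
        i = peak
        while 0 <= i + step < len(y) and y[i + step] > level:
            i += step
        j = i + step
        if 0 <= j < len(y) and y[i] != y[j]:
            edges.append(i + step * (y[i] - level) / (y[i] - y[j]))
        else:
            edges.append(float(i))
    return edges[1] - edges[0]


def detect(y, min_prominence, min_width):
    return [p for p in local_maxima(y)
            if prominence(y, p) >= min_prominence and width(y, p) >= min_width]


def f1_score(detected, truth, tol=2):
    """F1 after greedy one-to-one matching within `tol` samples."""
    unmatched, tp = list(detected), 0
    for t in truth:
        near = [d for d in unmatched if abs(d - t) <= tol]
        if near:
            unmatched.remove(min(near, key=lambda d: abs(d - t)))
            tp += 1
    fp, fn = len(unmatched), len(truth) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Synthetic spectrum: a sharp tall peak, a broad weak peak, and a small ripple.
truth = [50, 120]
spectrum = [
    math.exp(-((i - 50) ** 2) / (2 * 3 ** 2))
    + 0.4 * math.exp(-((i - 120) ** 2) / (2 * 5 ** 2))
    + (0.02 if i % 2 == 0 else -0.02)
    for i in range(200)
]

prominences = [0.0, 0.01, 0.1, 0.2, 0.3, 0.5, 0.8]
widths = [0, 2, 5, 10, 20]
f1_map = {(p, w): f1_score(detect(spectrum, p, w), truth)
          for p in prominences for w in widths}

best = max(f1_map.values())
effective = sorted(k for k, v in f1_map.items() if v >= 0.95 * best)
```

On this toy signal several grid cells reach the best score, so `effective` is a set of parameter pairs rather than a single point, while the all-zero setting, which accepts every ripple maximum, falls outside it.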

This understanding led to the development of a Random Forest (RF) model that learns the relationship between a spectrum’s shape and distinctive characteristics (such as noise level and average peak width) and the suitable parameter values. When given a new spectrum, the model analyses its characteristics and sets the parameters automatically. The trends the model learns are also physically sensible: for example, it sets a higher prominence for noisier spectra in order to suppress noise-driven maxima, and it sets a width consistent with the peak breadth of that spectrum. This indicates that the model is learning a meaningful relationship between spectral characteristics and parameters rather than guessing at random.
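As an illustration of the kind of spectrum-level descriptor such a model could take as input, the sketch below estimates a noise level from the first differences of a spectrum. This is an assumption made for illustration, not the feature definition used in Reference [4]: the estimator (median absolute deviation of first differences), the function names, and the synthetic data are all hypothetical.

```python
import math
import random
import statistics


def noise_level(y):
    """Robust noise estimate from the first differences of a spectrum."""
    diffs = [b - a for a, b in zip(y, y[1:])]
    centre = statistics.median(diffs)
    mad = statistics.median([abs(d - centre) for d in diffs])
    # 1.4826 scales a MAD to a Gaussian standard deviation; differencing two
    # independent noisy samples inflates the spread by sqrt(2), so undo that too.
    return 1.4826 * mad / math.sqrt(2)


def synthetic_spectrum(sigma_noise, n=2000, seed=0):
    """One broad Gaussian peak plus white noise (illustrative data only)."""
    rng = random.Random(seed)
    return [
        math.exp(-((i - n / 2) ** 2) / (2 * 20.0 ** 2)) + rng.gauss(0.0, sigma_noise)
        for i in range(n)
    ]


quiet = noise_level(synthetic_spectrum(0.01))
noisy = noise_level(synthetic_spectrum(0.05))
```

Differencing suppresses the slowly varying peak shape, so the estimate tracks the injected noise level rather than the peak itself. A regressor trained on descriptors of this kind could then associate a larger value with a higher prominence setting, which is the trend described above.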

3. Performance and benchmarking

On generated spectra, the MI-6 model reaches an F1-score (the harmonic mean of precision and recall, which balances false positives against false negatives) of about 0.94. Using the default parameters without any tuning, by contrast, gives nearly unusable results, because the detector then accepts every local maximum as a peak and produces a large number of false positives. This comparison shows that choosing parameters appropriate to the spectrum is a decisive factor in detection quality.
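To make the metric concrete, here is a minimal sketch of how precision, recall, and F1 can be computed for peak positions. The matching rule (greedy one-to-one matching within a tolerance of 2 samples), the function names, and the example positions are assumptions for illustration; the evaluation protocol of the study may differ.

```python
def match_peaks(detected, truth, tol=2):
    """Greedy one-to-one matching of detections to annotated peaks.

    Returns (true positives, false positives, false negatives).
    """
    unmatched = list(detected)
    tp = 0
    for t in truth:
        near = [d for d in unmatched if abs(d - t) <= tol]
        if near:
            unmatched.remove(min(near, key=lambda d: abs(d - t)))
            tp += 1
    return tp, len(unmatched), len(truth) - tp


def precision_recall_f1(detected, truth, tol=2):
    tp, fp, fn = match_peaks(detected, truth, tol)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


truth = [50, 120, 180]
tuned = [51, 120]                 # one annotated peak missed, no false alarms
untuned = list(range(2, 200, 2))  # stands in for "every local maximum"

p_tuned, r_tuned, f1_tuned = precision_recall_f1(tuned, truth)
p_raw, r_raw, f1_raw = precision_recall_f1(untuned, truth)
```

The untuned list finds every annotated peak, so its recall is perfect, but the flood of false positives drives precision and therefore F1 close to zero. The tuned list misses one peak yet scores an F1 of 0.8.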

More significantly, the MI-6 model was trained on a small dataset — about 500 times smaller than the deep-learning model it was compared against (2,000 spectra versus roughly one million simulated spectra). Both models were then validated on more than 100 unseen experimental spectra not encountered during training, spanning XRD, Raman, and GC–MS, which have different characteristics (for example, XRD shows sharp peaks while Raman shows broader, often overlapping peaks). Peak positions were established as ground truth through manual analysis by expert researchers.

Fig.3 Peak detection results in experimental spectra of (a) XRD, (b) GC–MS, and (c) Raman, comparing peaks identified by human annotation (True peaks), our method, and the CNN model. Triangles below each spectrum mark detected peaks. The figure is partly modified from Reference [4].

Table 1. Peak detection performance (precision, recall, and F1-score) of the MI-6 model and the deep-learning baseline on unseen experimental XRD, GC–MS, and Raman spectra, evaluated against expert-annotated ground truth.

Spectral type	Method	Precision	Recall	F1
XRD	MI-6 model	0.925	0.984	0.952
XRD	Deep learning	0.515	0.252	0.327
GC–MS	MI-6 model	0.965	0.929	0.946
GC–MS	Deep learning	0.378	0.176	0.236
Raman	MI-6 model	0.890	0.915	0.893
Raman	Deep learning	0.092	0.140	0.105

The MI-6 model generalises well across all three techniques despite their very different characteristics and despite being trained on far less data. Its detected peak positions also agree closely with those established by expert researchers across all three spectral types — showing that, for the specific task of locating peaks, the model reproduces expert annotations reliably on this test set.

The comparison should nevertheless be read in context: the deep-learning model’s low scores primarily reflect poor cross-domain transfer. In contrast, the clear strength of the MI-6 approach lies in its use of an interpretable detector, which enables it to transfer effectively across different spectral types. Because the experimental test set is modest in size and performance varies across techniques, the absolute numbers are best treated as indicative. Looking ahead, we expect performance to improve further, since the model can easily be updated and redesigned for specific tasks and specialised spectral datasets.

4. Conclusion

This approach preserves the interpretability and flexibility of threshold-based detection while removing its main practical bottleneck — per-spectrum parameter tuning — making it well suited to high-throughput analysis where detected peaks feed downstream property-prediction models. The effective-parameter-space view also points out that, in threshold-based detection, the correct mental model is a viable region of parameter combinations rather than a single optimal value, and the same idea extends naturally to more complex data such as two-dimensional spectra.

For an overview of how peak-based and full-spectrum methods fit into materials-informatics workflows, see the companion article “Spectral Data Analysis in Materials Science”; for the deep-learning side of peak detection, see “Deep Learning for Spectral Peak Detection”.

References

[1] A. Kensert et al., Convolutional neural network for automated peak detection in reversed-phase liquid chromatography, J. Chromatogr. A 1672 (2022) 463005. https://doi.org/10.1016/j.chroma.2022.463005
[2] B. Lafuente et al., The power of databases: the RRUFF project, Highlights in Mineralogical Crystallography (2015) 1–30.
[3] J. Sundberg, Extended characterization of petroleum aromatics by off-line LC-GC-MS (dataset), 2021. https://doi.org/10.5281/zenodo.5121065
[4] T. Yoongsomporn, S. Kanharattanachai, P. Lueangratana, T. Yamashita, C.H. Chen, M. Irie, C. Jirayupat, Automated spectral peak detection with machine learning: Parameter optimization and effective parameter space analysis with SciPy's find_peaks, Chemometrics and Intelligent Laboratory Systems 265 (2026) 105651. https://doi.org/10.1016/j.chemolab.2026.105651
