Advanced mass spectrometric and spectroscopic methods coupled with machine learning for in vitro diagnosis

In vitro diagnosis (IVD) is one vital component of medical tests that detects biological samples of tissues or bio‐fluids. Recently, mass spectrometry and spectroscopy have been increasingly employed in the field of IVD, due to their high accuracy, facile sample preparation, and rapid detection. Notably, the large datasets generated by these two technology methods provide a wealth of information but subsequently involve complex and time‐consuming processing works. Machine learning (ML), an important branch of artificial intelligence (AI), has emerged as a promising solution for the decoding of big data. ML imitates the human brain to process data, significantly improving accuracy and efficiency compared with traditional processing methods. In this review, we first introduce the commonly used ML algorithms and advanced mass spectrometry and spectroscopy techniques in the field of IVD, respectively. The ML algorithms are summarized as four aspects according to different learning tasks. Then, the combinations of ML with mass spectrometry, spectroscopy, and multi‐modal analysis for IVD are presented, and the roles of ML in these combinations are elucidated by some representative examples. This review aims to provide a systematic and comprehensive summary of the literature on ML‐assisted mass spectrometry or spectroscopy. We believe that it will facilitate researchers to select suitable ML algorithms for supplementing existing detection techniques or to develop the potential of coupling more detection techniques with ML, thus promoting the development of mass spectrometry and spectroscopy in IVD.


INTRODUCTION
In vitro diagnosis (IVD) is an important subfield of medical tests. It is performed on the biological samples from the living body, such as blood plasma, 1 serum, 2 urine, 3 and tissues. 4 IVD accounts for about two-thirds of all clinical diagnoses. 5 The purpose of IVD is to provide information on the current or future status of the patient, often used in tasks such as risk assessment, 6 drug monitoring 7 and prognostic evaluation. 8 IVD shows great strengths in medical procession. First, IVD is usually non-or mildly-invasive, causing less harm to patients and greatly reducing their suffering compared with in vivo diagnosis. 9 For example, some conventional invasive imaging modalities, such as gastroscope and enteroscope, 10 can cause significant discomfort to the patient. Second, IVD owns high throughput, simple pretreatment, and rapid detection by carrying out advanced analytical techniques. Therefore, doctors can quickly obtain accurate and pivotal information on patient status and make treatment decisions as soon as possible. 9 Based on the above two points, IVD has emerged as an integral part of the early diagnosis of disease clinically. It is playing an increasingly important role in medical technologies.
Traditional technologies for IVD mainly include immunoassays such as enzyme-linked immunosorbent assay (ELISA), electrochemical-based methods, surface plasmon resonance (SPR)-based methods, and so on. ELISA is regarded as a clinical gold standard for protein detection, which combines immunoreactions between antigens and their specific antibodies with enzyme catalysis. 11 It is one of the most commonly used methods for IVD due to its high sensitivity (limit of detection: 0.1-1.5 ng/ml), 9,[12][13][14] high specificity, 15 and ease of operation. 11 However, the detection time of ELISA usually achieves 2-6 h, 16,17 which is not favorable for the rapid diagnosis of bio-samples. Electrochemical approaches have demonstrated great potential for their sensitivity (limit of detection: 0.17-5 ng/ml), [18][19][20][21] rapid diagnosis (5-10 min), 22,23 and low cost. 24 Electrochemical sensor has been widely applied in the detection of glucose, 25 lactic acids, 25 enzymes, 26 miRNA, 24 and other biomolecules. However, the high sensitivity to the surrounding environment of electrochemical sensors could reduce the selectivity to biosamples, 27 resulting in wrong decisions for diseases. SPR is a label-free biosensing analysis technique that can obtain specific signals of interactions between biomolecules through refractive index changes. 28 It can be used for kinetic analysis of biological processes and biomarker identification like exosomal proteins from nonsmall cell lung cancer (NSCLC), 29,30 due to its high sensitivity (limit of detection: 0.2-100 ng/ml). 31,32 However, limited throughput and high cost limit wider applications of SPR in the field of IVD. 33 Recently, optical spectroscopy and mass spectrometry are more and more widely applied in the field of IVD, such as surface-enhanced Raman spectroscopy (SERS) and matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS). 
Analysis strategies based on these two techniques have the potential to enable high-detection sensitivity and high throughput, which are beneficial for rapid and early disease diagnosis so that the patients can be treated as soon as possible. For example, SERS has been developed for a sensitive and reliable determination of prostate-specific antigen protein based on a target-triggered and self-calibration aptasensor, with limit of detection of 0.536 ng/ml in blood samples. 34 Also, MALDI-MS has been developed with precisely engineered metal-organic frameworks as the matrix, allowing extraction of the serum metabolic fingerprints (SMFs) of 110-250 metabolic signals by using only 0.1 µl of serum per sample within 1 min. 35 However, optical spectroscopy and mass spectrometry usually generate tedious and complex datasets. 36 These large volume datasets not only lead to time-consuming data processing but also obscure in the back-end analysis, which can result in the omission of valid information. Therefore, novel data collection and analysis methods are crucial to promote the application of spectroscopy and mass spectrometry in IVD. To analyze the redundant signals, researchers have introduced some statistical algorithms to analyze the data, such as regression algorithms and clustering analysis, which are usually combined with artificial intelligence (AI) platforms for automated processing and analysis of big data ( Figure 1).
Machine learning (ML) is one of the most important subfields of AI and is increasingly capable of interpreting massive datasets and correctly evaluating complex patterns. [36][37][38] ML teaches a computer to learn automatically from a given dataset and algorithm. The learning process itself refers to finding the best set of model parameters to transform the features in the input data into accurate predictions of the labels. 39,40 Thus, ML can use optimal models to make the best decisions or predictions for unknown data. ML can mainly be classified into four forms according to different learning tasks: clustering algorithms, dimensionality reduction (DR), regression analysis, and classification algorithms. ML is becoming increasingly popular in the field of IVD for two main reasons. First, to ensure the accuracy and reproducibility of IVD results, a massive number of samples need to be collected and analyzed by mass spectrometry and optical spectroscopy. These datasets require ML to extract the key information hidden within them. Second, IVD may require a combination of mass spectrometry and spectroscopic techniques to obtain more comprehensive molecular information. However, the mass spectrometry and spectroscopic data cannot be directly compared with each other due to the incompatibility of the datasets. To overcome this limitation, researchers use ML to process the two datasets separately and visualize them, so that the corresponding information can be derived from the ML-processed spectra. In short, ML can reduce the dimensionality of datasets and improve the accuracy of prediction, providing a powerful method of information extraction and classification for the field of IVD.
FIGURE 1 An overview of machine learning-assisted mass spectrometry, spectroscopy, and multi-modal analysis for in vitro diagnosis (IVD)
As ML algorithms become increasingly important for processing and simplifying mass spectrometry and spectroscopic datasets, more and more literature on ML-assisted mass spectrometry or spectroscopy for IVD has been published. However, to date, most articles only introduce one or several algorithms in combination with mass spectrometry or spectroscopy, and previous reviews related to ML are not comprehensive, as they focus on only one kind of technology. 36,41 Here, we present an authoritative review of the literature in this field that brings together two common categories of methods (mass spectrometry and optical spectroscopy) and ML toward IVD. We first introduce different ML algorithms and then focus on the combination of ML with mass spectrometry and with spectroscopy, respectively. Finally, we discuss the multi-modal analysis of mass spectrometry and spectroscopy in combination with ML and follow it with a brief outlook on future developments. By introducing some representative examples, we hope that this comprehensive review can demonstrate how ML handles different types of datasets with high efficiency and accuracy and can provide a better understanding of the recent advances of ML combined with mass spectrometry and optical spectroscopy in the field of IVD.

MACHINE LEARNING ALGORITHMS USED FOR IN VITRO DIAGNOSIS
IVD based on spectroscopy and mass spectrometry methods often generates large, unstructured, and complex data that requires data analysis using ML. Depending on the diagnostic needs, different ML approaches can be selected purposefully. In this section, we will review several commonly used ML algorithms in the field of IVD and give some examples to help readers understand the role of ML.

Clustering algorithms
Clustering algorithms, including hierarchical clustering analysis (HCA), k-means, and so on, are unsupervised learning methods that partition a given dataset according to defined classification parameters. HCA is the most commonly used clustering algorithm. It first treats each object of the dataset as a single group. Then, the groups separated by the shortest distance are merged into a new group, and the process repeats until all objects are merged into one large group. 42 HCA has been applied to assess similarities in the fingerprints of analytes from cells. 43 K-means is another popular clustering algorithm used in many contexts due to its simplicity and availability. 44 K-means first randomly selects k objects as initial cluster centers, then measures the distance between every object and every cluster center, and finally assigns each object to the cluster center closest to it. 45 Cluster analysis tends to group objects with similar characteristics together, after which the differences between the groups can be analyzed. For instance, it can be used to reveal significant differences between malignant and healthy cells. 46
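The k-means procedure described above can be illustrated with a minimal sketch. It assumes the common iterative (Lloyd-style) variant, in which the assignment step is repeated after each center is moved to the mean of its members; the function name `kmeans`, the toy two-feature "spectra", and the seed are hypothetical and are not taken from any of the cited studies.

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Minimal Lloyd-style k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly chosen initial cluster centers
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Toy data (assumption): two well-separated groups of two-feature "spectra".
spectra = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9),
           (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centers = kmeans(spectra, k=2)
print(labels)
```

The numeric label given to each group depends on the random initialization, so only the grouping itself is meaningful.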

Dimensionality reduction
DR is the process of transforming high-dimensional data into low-dimensional data, aiming at visualization and differentiation of data. 47 Typical DR methods include principal component analysis (PCA) and linear discriminant analysis (LDA). PCA is a popular unsupervised DR algorithm. The main idea of PCA is to map n-dimensional features onto k dimensions that retain the key information. These k features span an orthogonal linear space and are known as principal components. 48 PCA can eliminate the correlation between features in the initial dataset as well as explain the potential relationships between features. 49 LDA is a supervised linear DR technique. LDA also projects high-dimensional data into a low-dimensional space by a linear transformation matrix. 50 LDA aims to minimize within-class distances while maximizing between-class distances. 51 However, LDA has the disadvantage that it ignores local data changes. 50 DR can be combined with clustering analysis to avoid overfitting and obtain better ML models.
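As an illustration of the PCA idea described above, the following sketch extracts only the first principal component by power iteration on the sample covariance matrix. This is one simple way to compute it, chosen here for brevity; library implementations typically use a singular value decomposition instead. The function name and the toy data are assumptions made for this example.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return the leading principal axis and the scores of each row along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the mean-centered features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    # Assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the principal axis.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Toy data (assumption): features 1 and 2 vary together, feature 3 is near-constant.
rows = [(1.0, 1.0, 0.0), (2.0, 2.0, 0.1), (3.0, 3.0, 0.0), (4.0, 4.0, 0.1)]
pc1, scores = first_principal_component(rows)
print(pc1, scores)
```

In this toy case the first component weights the two correlated features almost equally and nearly ignores the third, so three features are summarized by one score per sample.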

Regression analysis
Regression analysis belongs to supervised learning methods and is often applied to developing predictive models. Regression analysis is widely used in the field of IVD. Its algorithms consist of principal component regression (PCR), partial least squares (PLS), the least absolute shrinkage and selection operator (LASSO), orthogonal PLS (OPLS), ridge regression, and so on. Regression analysis has been a very important technique in statistics. Its target is to study how a response variable depends on one or more predictor variables. PCR is an essential yet powerful multivariate analytical method. PCR combines PCA and least-squares regression. 52 It can find out the obvious variables for the target of regression by analyzing the absolute value of the regression coefficients. 49 PLS analysis is similar to PCA. But PLS belongs to supervised learning, while PCA belongs to unsupervised learning. PLS can correlate several variables in the multi-label datasets via a weighted manner to discover the relationship between the variables and obtain a latent variable space. 53 Partial least squares-discriminant analysis (PLS-DA) is developed from PLS. PLS-DA begins by regrouping and coding the variables, which is useful for finding similarities and differences between groups. 54 Then, PLS-DA performs the construction of PLS components and the construction of predictive models. 54 PLS-DA is often used to deal with prediction and discrimination problems. 55 LASSO has been widely used in regression analysis. It can both perform variable selection and regularization to improve prediction accuracy. 56 LASSO models are more and more popular in the field of IVD since they can find out key predictors for diagnosis and simulate clinical outcomes. 57 OPLS is a new multivariate data-processing method developed based on PLS. OPLS adds an orthogonal signal correction filter to PLS, to distinguish the variations in the data. 
58 OPLS can further be developed into the orthogonal partial least squares-discriminant analysis (OPLS-DA). OPLS-DA is considered a simple model for data interpretation since it focuses on predictive information. OPLS-DA has been widely used for feature selection and classification in the field of IVD. 59 Ridge regression is proposed on top of the ordinary least squares regression. 60 It belongs to linear regression and can offer reliable evaluation of regression coefficients, suitable for interpreting high-dimensional data. 61 In addition to the regression algorithms mentioned above, recent deep learning models further improve the performance of regression prediction by using multiple levels (or layers) of representation that allow models to get complex relationships from their inputs. 36 For application in the field of IVD, regression analysis is usually used for the prediction of individual health status.
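The variable-selection behavior of LASSO mentioned above can be sketched with a small coordinate-descent solver. The objective assumed here is (1/(2n))·||y − Xw||² + alpha·||w||₁ without an intercept, which is one common formulation; the function names and the toy design matrix are hypothetical, and the data are synthetic.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n  # assumed non-zero
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Synthetic data (assumption): orthogonal columns, a strong first predictor
# (true weight 2.0) and a very weak second predictor (true weight 0.05).
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0 * a + 0.05 * b for a, b in X]
w = lasso_coordinate_descent(X, y, alpha=0.1)
print(w)
```

With this penalty the weak predictor is set exactly to zero while the strong one is only shrunk, which is the sense in which LASSO performs variable selection and regularization at the same time.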

Classification algorithms
Classification algorithms also belong to supervised learning methods. With the appropriate spectral pre-processing procedure, different classical ML algorithms, including support vector machine (SVM), k-nearest neighbor (kNN), decision tree, random forest (RF), artificial neural networks (ANN), etc., have been developed to classify complex datasets. SVM is an optimization-based discrimination model. The main idea of SVM is to find optimal hyperplanes or decision boundaries to best separate different objects in a multi-dimensional space. The hyperplanes are typically constructed by analyzing data points near the candidate hyperplanes. These data points, referred to as support vectors, are then iteratively weighted during the training phase to maximize the distance between the classes separated by the hyperplane. 62 The kNN is a simple data classification method. For classification purposes, kNN can classify different objects by computing the distance between different feature values. The kNN can cluster similar objects in close locations. And then the unknown sample is clustered into the space, where the k samples most similar to it are located. 63 The decision tree is a supervised data mining technique. It creates a tree-like structure and gives a given attribute by the non-leaf node tests. 64 RF can be regarded as a classifier with multiple decision trees, which builds a hyperplane at every nonterminal node to make it easier to split at child nodes than in a certain decision tree. 64 Neural network, also known as ANN, is a subfield of ML. The essence of ANN is the parallel information processing functions through network transformation and dynamic activities. The specific operation of ANN is to map artificial neurons or nodes to a basic processing unit and then it can imitate the information processing styles of the human brain and nervous system. 65 ANN has been a hotspot algorithm to solve classification problems for IVD. For example, Alafeef et al. 
used the ANN algorithm to identify cancer cell types from 36 unknown breast cancer samples with an overall accuracy of >98%. 66 Currently, ANN has been extended into several branches, such as the multi-layer perceptron (MLP) and the convolutional neural network (CNN), which have also been widely used in the field of IVD. 36 Some classification algorithms can also be used for regression analysis; for example, SVM and RF can be applied to mortality risk prediction. 67 In general, depending on the specific purpose, algorithms used for ML can be mainly divided into four types: clustering, DR, regression, and classification. As different algorithms have different emphases, they should be rationally selected and optimized according to the purpose of different IVD segments and the characteristics of the data. In addition, a combination of several different ML algorithms, such as DR plus classification, is a promising approach in some scenarios with multiple purposes. ML contributes to the interpretation of the data, and in turn, the increase in the amount of data allows the benefits of ML to be fully developed. In recent years, detection technologies have developed rapidly, and the quantity and quality of the obtained data have been significantly improved. Among them, mass spectrometry and spectroscopy have received plenty of attention in the field of IVD. Since the data collected by mass spectrometry and spectroscopy contain a large amount of information, ML-assisted analysis is especially needed.
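Of the classifiers described in this section, kNN is the simplest to sketch. The example below assumes Euclidean distance and a plain majority vote among the k nearest training samples; the function name, the class labels "A" and "B", and the feature values are hypothetical toy data rather than clinical measurements.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label a query sample by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every labeled training sample.
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Toy training set (assumption): two features per sample, two classes.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

pred_near_a = knn_predict(train_X, train_y, (1.1, 1.0))
pred_near_b = knn_predict(train_X, train_y, (4.9, 5.0))
print(pred_near_a, pred_near_b)
```

The choice of k and of the distance measure are tuning decisions; in practice spectra are usually pre-processed and reduced in dimension before a distance-based classifier is applied.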

ADVANCED MASS SPECTROMETRIC METHODS FOR IN VITRO DIAGNOSIS
Several advanced technologies, such as ELISA, electrochemical techniques, and SPR, have facilitated the development of IVD, but the technologies still face limitations in clinical practice (Table 1). Mass spectrometry has emerged as a promising technology for IVD because it can provide both qualitative and quantitative analysis with high resolution. Researchers have developed a series of mass spectrometry techniques, including liquid chromatography-mass spectrometry (LC-MS), MALDI-MS, and paper spray ionization-mass spectrometry (PSI-MS). Mass spectrometry-based methods can potentially improve the detection speed and detection throughput of IVD tests. Furthermore, researchers have combined these mass spectrometry technologies with ML for better data interpretation to improve the precision and accuracy of IVD.

Liquid chromatography-mass spectrometry
LC-MS is a common technique for profiling complex biological samples owing to its strong separation power and adaptability to almost all compounds. 82 In the field of IVD, LC-MS is mostly used for the acquisition of omics data, such as proteomics and metabolomics. Among these applications, the filtering of features in raw data (biomarker discovery) is one main research direction. Anderson et al. presented a simple strategy for unknown feature identification in untargeted metabolomics by modifying the chromatography parameters of LC-MS. They built an LDA model and used it to classify a total of 576 and 749 unidentified features in the hydrophilic interaction liquid chromatography (HILIC) data and the reversed-phase liquid chromatography (RPLC) data (Figure 2A). 83 Wang et al. developed an LC-MS-based targeted assay and applied SVM to analyze the LC-MS datasets, building a diagnostic panel of 17 optimized characteristic metabolites for the early detection of pancreatic ductal adenocarcinoma, which achieved an accuracy of 85.00% with an area under curve (AUC) value of 0.9389 in the clinical cohort. 84 Aicheler et al. reported a retention time model based on support vector regression (a branch of SVM) and combined this model with reversed-phase ultra-high pressure liquid chromatography mass spectrometry (UHPLC-MS) for identification in nontargeted lipidomics with 94.7% of the correct candidates (Figure 2B). 85 Cui et al. employed broad-spectrum metabolomics profiling with LC-MS/MS followed by multiple ML algorithms, achieving the identification of metabolic signatures associated with future angina recurrence risk in two large cohorts. The accuracy, sensitivity, and specificity in the discovery cohort were 97.6%, 98.6%, and 97.2%, respectively, while those in the additional discovery cohort were 93.2%, 90.0%, and 94.4%, respectively. 86 Poss et al. 
introduced RF and LASSO regression in LC-MS/MS sphingolipid analysis for a candidate biomarker for coronary artery disease (CAD). They generated a novel sphingolipid-inclusive CAD risk score that included the highest-performing sphingolipid RF-and LASSO-generated components (AUC = 0.79), and finally

Matrix-assisted laser desorption/ionization mass spectrometry
MALDI-MS has received wide interest in recent years because of its fast detection speed, small sample consumption, high throughput, and high sensitivity. MALDI-MS can analyze samples with volumes down to 100 nl and detect biomolecules within seconds. 93 As previously mentioned, this technique generates large and complex datasets, which require ML to provide key insights in data mining. Thus, the combination of MALDI-MS and ML creates powerful tools for applications such as biomarker screening, antimicrobial resistance prediction, and single-cell classification. MALDI-MS is usually used for the detection of biological samples, such as tears, urine, serum, and aqueous fluid (Figure 3). 94 Wu et al. used nanoparticle-enhanced MALDI-MS to rapidly collect tear metabolic fingerprint information from down to 10 nl of tears and combined ridge regression and other algorithms to build a glaucoma analysis platform, constructing a biomarker panel of six metabolites for glaucoma characterization (including screening, subtyping, and early diagnosis) with an AUC value of 0.827-0.891. 74 Li et al. built a multi-shelled hollow Cr2O3 sphere (MHCSs) assisted laser desorption/ionization mass spectrometry platform for direct metabolic profiling of biofluids toward schizophrenia (SZ) diagnostics. This platform identified urine and serum metabolites (≈1 µl) with enhanced LDI efficacy in seconds and finally discriminated SZ patients (SZs) from healthy controls (HCs) with an AUC value of 1.000 for the blind test by using PCA and OPLS-DA. 3 Shu et al. constructed a plasmonic chip with Au nanoparticles deposited on a dopamine-bubble layer as a new LDI-MS matrix for clinical metabolic fingerprints. This platform successfully detected metabolites at concentrations down to 0.005 mg/ml and differentiated cervical cancer patients from healthy controls by using OPLS-DA. 
95 Besides, MALDI-MS has shown great utility for rapidly identifying microbial species. Weis et al. used MALDI-MS for pathogens validation and applied three ML approaches to predict resistance to each antimicrobial within 24 h. They evaluated to what extent the respective best model was capable of predicting resistance to antibiotics. For 31 antibiotics, an AUC value of above 0.80 was reached, implying highly accurate predictions. 96 MALDI-MS can also be employed for cell classification. Xie et al. used MALDI-MS for multiplexed chemical analysis of single cells and developed a supervised ML workflow, including PCA, RF, and other algorithms, to classify single cells like neurons and astrocytes according to their mass spectra based on cell groups of interest (GOI) with an accuracy of over 80%. 97 Overall, due to the rapid detection speed, high-detection throughput, and small-sample consumption, MALDI-MS has become a popular technique that has great potential for a variety of tasks in the field of IVD. When combined with ML, the data mining and knowledge discovery of MALDI-MS datasets can be efficiently carried out, leading to better-presented diagnosis results. However, the detection performance of MALDI-MS mainly relies on the physical and chemical properties of matrix materials and is rarely used for real-time detection. Therefore, developments of novel MALDI-MS matrices with improved metabolite ionization efficiency and novel real-time detection techniques are in great demand.
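Several of the studies above report performance as an AUC value. As a reminder of what that number measures, the following sketch computes the AUC of the receiver operating characteristic curve through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the scores and labels are invented for illustration and do not reproduce any cited result.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    # Each correctly ordered pair counts 1, each tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy classifier outputs (assumption): 1 = positive class, 0 = negative class.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.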

Ambient ionization mass spectrometry
Ambient ionization mass spectrometry (AIMS) is another pioneering mass spectrometry technique that enables rapid sample analysis with practically free sample pretreatment. 98 Compared with MALDI-MS, AIMS can be conducted under ambient conditions (open-air sampling) rapidly to acquire information on the analyte in almost real-time. Different variants of AIMS have been developed, including desorption electrospray ionization mass spectrometry (DESI-MS), paper spray ionization mass spectrometry (PSI-MS), touch spray ionization mass spectrometry (TSI-MS), and so on. With the help of ML, DESI-MS can be applied to differentiate cancerous and normal tissue. Kerian et al. first used DESI-MS to establish the relationship between mass spectrometry features and pathology and then used touch spray to characterize unknown tissue samples. They evaluated the two methods by LDA and PCA with an accuracy of over 95% for correct sample identification. 99 DESI-MS can also be used for mass spectrometry imaging in the field of IVD, which will be discussed in detail in the next section. PSI-MS is usually combined with ML for disease prediction and biomarker screening. Huang et al. constructed a classification model for the classification and prediction of datasets from PEI-MS and showed an overall accuracy of 87.5% for an instantaneous differentiation between cancerous and noncancerous breast tissues ( Figure 2C). 100 Mahmud et al. used PSI-MS-based global metabolomics of urine liquid biopsies to classify healthy and progressive prostate cancer states by multivariate partial least-squares-discriminate analysis (PLS-DA) and demonstrated a specific metabolic pattern associated with progressive disease ( Figure 2D). 101 Other ambient ionization techniques, such as direct analysis in real-time (DART), have been developed into proper portable DART instrumentation for drug testing since they can be used in the open environment to analyze samples. 
102 However, the combination of these techniques and ML is seldom reported in the field of IVD compared with LC-MS or MALDI-MS. Although AIMS can perform almost real-time detection, they still face some limitations. For example, one of the challenges of DESI-MS is the accuracy and reproducibility of quantitation analysis. 103 DESI-MS is usually used for the detection of solid substances on a surface, so the detection effects of DESI-MS mainly depend on the surface, substrate homogeneity, and matrix effects. To date, surface design has been innovated to improve the reproducibility of DESI-MS like electrospun nanofiber mats. 104 In addition, the development of the liquid sample DESI-MS can reduce interference from sample unevenness. 105 DESI-MS is expected to have further applications in IVD through the emergence of novel surfaces and improvements in DESI-MS instruments.

Mass spectrometry imaging
Mass spectrometry imaging (MSI) is a rapidly evolving molecular imaging method, which can map simultaneously the spatial distribution of hundreds of biomolecules (metabolites, lipids, proteins, etc.) across the tissues of the living body. 41,106 The two most common MSI techniques are DESI-MSI and MALDI-MSI. MALDI-MSI is often used for the visualization of the metabolite distribution and cancer tissue classification with a combination of ML. Mittal et al. distinguished colorectal cancer from normal tissue with an overall accuracy of 98% and predicted the presence of lymph node metastasis in primary cancer of endometrial cancer with an overall accuracy of 80% by combining MALDI-MSI with extended supervised ML. 107 Bakker et al. used MALDI-MSI to characterize the lipid profiles of 3D pellets of human primary chondrocytes in normoxia (20% oxygen) and hypoxia (2.5% oxygen). Then, LDA and PCA were applied to data processing of MALDI-MSI, revealing different lipid types in hypoxic and normoxia conditions ( Figure 4A). 108 Saigusa et al. developed conductive adhesive films for MALDI-MSI and measured the differences in the ion intensity of cryosections obtained from a mouse brain by using multivariate analysis (PCA and OPLS-DA), achieving lipids high localization of cryosections obtained from a mouse ( Figure 4B). 109 With the intervention of ML techniques, DESI-MSI enables the identification of disease markers and the metabolic study of heterogeneous cells. Zhu et al. built an organ-specific, metabolite, database-driven annotation approach for whole-body MSI based on airflow-assisted desorption electrospray ionization (AFADESI)-MSI. They revealed organ metabolism remains highly specific ( Figure 4C). 110 Yan et al. combined DESI-MSI and immunofluorescence to realize cell-typespecific and in situ metabolic profiling in tissue samples. 
They performed PLS-DA based on detected metabolic features and obtained deconvolved cell-type-specific metabolic profiles by convex optimization, finally differentiating neurons and astrocytes in the external septal nucleus of the adult mouse forebrain (Figure 4D). 111 Owing to its refined spatial resolution (∼20 µm) and its ability to extract multiplex molecular information, MSI has emerged as a powerful medical imaging technique for disease diagnosis. In summary, mass spectrometry has emerged in IVD applications for its high sensitivity, high specificity, and ability to obtain rich molecular structure information. However, some drawbacks remain. For example, the structural libraries of mass spectrometry are not abundant enough, which causes difficulties in the analysis of substances with unknown structures that do not exist in current structural libraries. 112 To overcome this problem, mass spectrometry needs to be used in conjunction with other techniques, such as nuclear magnetic resonance. Another drawback is the underutilization of mass spectrometry in various clinical settings due to the high cost of mass spectrometry instruments and their low degree of automation. 113 To address this limitation, low-cost and highly automated equipment should be developed to promote large-scale clinical application. A further limitation of clinical mass spectrometry lies in the identification of blood biomarkers for the detection of a disease at a very early stage, since biomarker levels are extremely low in the early stages of the disease. Therefore, detection methods of higher sensitivity need further development. The methods to improve the sensitivity of mass spectrometry mainly include optimization of instrument parameters, design of analytical strategies, and excellent sample pretreatment. 
For instance, mass tag-based mass spectrometry improves ionization efficiency by transferring the detection signal from biomolecules to mass tags. 114 In addition, nanofluidic analytical pretreatment methods have been constructed that downsize chemical unit operations to fL-pL volumes in nanochannels to achieve ultrahigh sensitivity. 115 As ML-coupled mass spectrometry techniques continue to develop, they can bring significant improvements to the field of IVD.
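Several of the MSI studies in this section use PCA to reduce spectra with many m/z features to a few components. As a minimal sketch of the idea only, and not the procedure used in any of the cited works, the Python example below finds the first principal component of a small intensity table by power iteration on the covariance matrix and reports the share of total variance it explains. The function name, the toy data, and the choice of power iteration are illustrative assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, explained-variance ratio) of the first PC.

    rows: list of samples, each a list of feature intensities
    (hypothetical toy input; real MSI data would have many more features).
    """
    n = len(rows)
    d = len(rows[0])
    # Centre each feature (e.g. each m/z bin) on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the dominant eigenvector (the first PC direction).
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("data have no variance along the start vector")
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


if __name__ == "__main__":
    toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
    direction, ratio = first_principal_component(toy)
    print(direction, ratio)
```

In practice a numerical library would be used instead of hand-written loops; the sketch only makes explicit what "the variance explained by a principal component" refers to.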

ADVANCED SPECTROSCOPIC METHODS FOR IN VITRO DIAGNOSIS
Besides mass spectrometry, spectroscopic analyses are also widely used in IVD because of their rapid analysis, simple operation, high selectivity, and high sensitivity. Both mass spectrometry and spectroscopy can be used to identify the structure of substances: mass spectrometry reflects the relative molecular mass of substances, while spectroscopy determines their main functional groups. Typical spectroscopic techniques in IVD include infrared spectroscopy, fluorescence spectroscopy, and Raman spectroscopy. These techniques, especially Raman spectroscopy, often produce a large amount of data in label-free detection; ML is therefore also needed in spectral data analysis to extract meaningful information.

Raman spectroscopy
Raman spectroscopy is a sensitive, rapid, and nondestructive detection technique based on the inelastic scattering of photons by molecules excited by a laser source. 116 Surface-enhanced Raman spectroscopy (SERS) uses the surface of a substrate to enhance the signal from nearby molecules. The emergence of SERS has further improved the detection sensitivity of Raman spectroscopy, with the inelastic light scattering by molecules enhanced by factors of up to 10^8 or even larger in some cases. 117 Because SERS increases the Raman signal intensity by several orders of magnitude and can reach single-molecule detection, it has become popular in a wide range of application scenarios.
In vitro SERS based on nanostructured surfaces is often used for early cancer screening, monitoring of cellular metabolism, and rapid antimicrobial susceptibility testing. Lin et al. embedded Ag nanoparticles in multi-layer black phosphorus nanosheets (Ag/BP-NS) as an SERS sensor and classified different tumor exosomes by SVM, with a sensitivity of 100% for the trained model and 99.17% for the test set. 118 Lussier et al. reported a combination of SERS and ANN as a nondestructive and label-free method, revealing metabolite gradients for a series of characterized cells and a panel of metabolites (Figure 5A). 119 Thrift et al. built a platform composed of SERS sensors combined with both deep neural network models and unsupervised Bayesian Gaussian mixture analysis (a classification ML model). The deep neural network models discriminated the responses of two bacteria to antibiotics from SERS data within 10 min with greater than 99% accuracy, and the unsupervised Bayesian Gaussian mixture analysis achieved 99.3% accuracy in discriminating between antibiotic-susceptible and antibiotic-resistant cultures (Figure 5B). 120 Huang et al. developed a novel SERS-in-a-capillary platform for the diagnosis of heparin-induced thrombocytopenia (HIT). They employed principal components-linear discriminant analysis (PC-LDA), PLS-DA, and LASSO-DA to evaluate the capability of SERS to differentiate between HIT-positive and HIT-negative samples, and achieved the highest sensitivity (86%) and specificity (81%) with LASSO-DA (Figure 5C). 121 In addition, SERS can be used as an important tool for virus detection and future outbreak preparedness. Paria et al. reported a large-area and label-free testing platform based on SERS for rapid and accurate detection of SARS-CoV-2, and four different kinds of "enveloped" RNA viruses could be identified with an accuracy of over 83% by PCA and RF classification (Figure 5D). 122 As mentioned above, SERS has become a common tool in the IVD field owing to its rapid, non-destructive, and highly sensitive detection. However, SERS still needs to overcome some challenges, such as limited reproducibility and sensitivity, 123 before it can be used on a large clinical scale. These challenges might be overcome by better-designed substrates or by post-optimization of ML algorithms. Another limitation is the lack of an open-source standard SERS database needed to attain the quantitative validation parameters of current clinical standards. 124 To address this issue, efficient classification algorithms and in-depth research on spectral preprocessing with ML are necessary. Overall, the development of SERS substrates and ML can enable more accurate, standardized, and credible analysis results, which will advance SERS as a powerful diagnostic tool in the field of IVD.
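The studies above report classifier performance as sensitivity, specificity, and accuracy. As a reminder of how these figures relate to the underlying confusion counts, the following minimal Python sketch computes all three from true and predicted labels. The function name and the example labels are hypothetical and are not taken from any of the cited datasets.

```python
def diagnostic_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity and accuracy for a binary test.

    y_true / y_pred: equal-length sequences of labels; `positive` marks
    the diseased (or resistant, etc.) class. Assumes both classes occur.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true positives found
        "specificity": tn / (tn + fp),   # fraction of true negatives found
        "accuracy": (tp + tn) / len(pairs),
    }


if __name__ == "__main__":
    # Hypothetical labels: 4 positive and 6 negative samples.
    truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    print(diagnostic_metrics(truth, predicted))
```

Because accuracy depends on the class balance of the cohort, sensitivity and specificity are usually reported alongside it, as in the studies summarized here.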

FIGURE 5 Surface-enhanced Raman spectroscopy (SERS) strategies based on machine learning (ML) for in vitro diagnosis (IVD). (A) An SERS nanoprobe combined with convolutional neural network (CNN) to reveal multiplexed metabolite gradients near cells. 119 Copyright 2019, American Chemical Society. (B) An SERS nanosensor platform combined with ML for rapid antimicrobial susceptibility testing. 120 Copyright 2020, American Chemical Society. (C) Rapid and label-free SERS for the diagnosis of heparin-induced thrombocytopenia (HIT). 121 Copyright 2020, Wiley-VCH. (D) A large-area and label-free testing platform that combines SERS and ML for rapid and accurate detection of SARS-CoV-2. 122 Copyright 2022, American Chemical Society.

Fluorescence spectroscopy
Fluorescence spectroscopy has been a typical detection method for many years owing to its sensitivity and selectivity. In recent years, fluorescence spectroscopy-based IVD has developed with suitable fluorescent probes or fluorescent quantum dots used for biomarker identification and early diagnosis of diseases. To better discover biomarkers, feature extraction and classification based on the biochemical signatures of samples are necessary. However, the key signal features are difficult to extract from fluorescence spectra because of background noise from the instrument itself and interference from external molecules. It is therefore vital to process and simplify the obtained spectra with suitable ML algorithms to eliminate background noise and external interference. With ML algorithms, fluorescence-based detection, including laser-induced fluorescence (LIF) spectroscopy and fluorescence imaging, has been applied in the field of IVD. Raghushaker et al. reported an integrated approach that uses SVM to analyze the fluorescence and photoacoustic spectral properties of tryptophan, achieving clear differentiation between mitochondria isolated from normal and cancer tissues for both fluorescence (86.6% sensitivity and 90% specificity) and photoacoustic (86.6% sensitivity and 96.6% specificity) measurements. 125 Tan et al. developed an explainable deep learning-assisted visualized fluorometric array-based sensing method and investigated the efficient qualitative and quantitative analysis of six aminoglycoside antibiotics. They used a CNN algorithm to build models and predicted the categories or concentrations of the six aminoglycoside antibiotics with a 100% prediction accuracy rate (Figure 6A). 126 Xu et al. built a dual-emission fluorescence/colorimetric sensor array for the determination of nine antibiotics and constructed a unified SX-model using a "stepwise prediction" strategy to overcome a bottleneck in array detection, discriminating different concentrations of antibiotics in bio-samples with a classification accuracy of 98.21%. 127 Another clinical application of fluorescence technology is fluorescence imaging, which uses consumer- or laboratory-level imaging instruments to detect fluorescence intensity on a two-dimensional scale and present the spatial distribution of different fluorescence intensities. Fluorescence imaging, in combination with ML, enables the screening of disease markers and the identification of antibiotic molecules at the spatial scale. Squire et al.
demonstrated a photonic crystal-enhanced fluorescence imaging immunoassay biosensor to quantitatively detect N-terminal pro-B-type natriuretic peptide (NT-proBNP) and classified the NT-proBNP levels by SVM with a specificity of 93% and an accuracy of 78% (Figure 6B). 128 It is worth noting that there is not much work on IVD using fluorescence imaging alone, since fluorescence imaging is often used as an adjunct to other techniques. In general, fluorescence-based techniques, including fluorescence spectroscopy and fluorescence imaging, have been implemented for disease diagnosis and antibiotic identification with the assistance of ML. However, fluorescence techniques still face the challenge of obtaining important information from living organisms sensitively and accurately. Novel fluorescent probes based on fluorescence resonance energy transfer (FRET) can effectively improve the sensitivity of fluorescence technologies. For example, a FRET-based nanoprobe modified by CdSe/ZnS quantum dots can specifically measure human neutrophil elastase with an excellent sensitivity of 7.15 pM in aqueous solution. 129 In addition, fluorescent probes that respond to dual or multiple cancer biomarkers are being progressively developed to improve the accuracy of disease diagnosis. 130 Furthermore, the accuracy of disease diagnosis can also be improved by reducing interference from background signals in the fluorescence spectra, which could be achieved by applying appropriate ML algorithms for data processing. Thus, the development of novel fluorescent probes and suitable ML algorithms may be a promising research direction for the application of fluorescence spectroscopy in the field of IVD.

FIGURE 6 Fluorescence and Fourier transform infrared spectroscopy (FTIR) strategies based on machine learning (ML) for in vitro diagnosis (IVD). (A) Schematic illustration of using the convolutional neural network (CNN)-assisted fluorometric array-based sensing method for qualitative and quantitative analysis of aminoglycoside antibiotics (AGs). 126 Copyright 2022, American Chemical Society. (B) Schematic view of a diatom-based immunoassay for NT-proBNP detection. 128 Copyright 2019, Elsevier. (C) General workflow that combined FTIR with support vector machine (SVM) for probing molecular changes in disease. 131 Copyright 2021, Wiley-VCH. (D) FTIR spectroscopy in combination with ML as a powerful and effective tool for susceptibility determination of Klebsiella pneumoniae. 134 Copyright 2021, American Chemical Society.

Infrared spectroscopy
Infrared spectroscopy is a well-established method for studying chemical substances by analyzing the vibrational transitions that are characteristic of their molecular structure. Infrared light can be divided into near-infrared, mid-infrared, and far-infrared according to the wavelength range. A typical infrared spectroscopy technique is Fourier transform infrared spectroscopy (FTIR). This technology can detect molecules in living bodies at different wavelengths and provide corresponding molecular spectra reflecting the structures of molecules. These spectra can be further analyzed by suitable ML algorithms to obtain information about the state of the human body. FTIR is one of the most commonly used techniques in infrared spectroscopy since it is a fast, simple, and non-destructive technique for obtaining structural information on molecules. FTIR is often combined with ML to reveal changes in the abundance of targeted molecules during cancer progression or to distinguish cancer subtypes. Voronina et al. characterized proteomic information using FTIR spectra in combination with ML, demonstrating that 12 highly abundant proteins dominated the infrared molecular fingerprints (IMFs) and that the disease-related differences in IMFs could reach an AUC of 0.82 ± 0.1 (Figure 6C). 131 Butler et al. used attenuated total reflection (ATR)-FTIR spectroscopy along with ML for the detection of brain cancer, enabling the classification of cancer and control patients at a sensitivity and specificity of 93.2% and 92.8%, respectively. 132 In addition, FTIR can be combined with ML algorithms to explore the mechanism of a drug reaction or to detect the susceptibility of bacteria to antibiotics. Mizera et al. used transmission FTIR (tFTIR) and complementary ATR-FTIR to determine the formation of complexes of β-lactam antibiotics with cyclodextrins (CDs) and the interactions involved in this process.
Further, they developed an ML model to distinguish samples with formed complexes from uncomplexed samples with a high cross-validation accuracy of 90.4%. 133 Sharaha et al. used FTIR microscopy combined with RF and extreme gradient boosting classifiers to analyze 1,190 different isolates of Klebsiella pneumoniae, confirming that sensitive and resistant isolates can be classified with a success rate higher than 80% (Figure 6D). 134 In conclusion, although infrared spectroscopy is a powerful analytical tool owing to its ability to detect small changes in characteristic molecules, it still has some limitations in the detection of biological fluids. One main challenge is that not every peak in an infrared spectrum can be annotated, since some peaks are difficult to match with functional groups. To solve this problem, infrared spectroscopy is often used in coupled formats, such as infrared-mass spectrometry or infrared-UV absorption spectroscopy, to identify the structure of molecules. In addition, infrared spectra may overlap because of the high complexity of the molecules in biological fluids. The multiplicity and overlap of infrared spectra may make it difficult to establish correlation models for spectral data. Therefore, advanced data preprocessing methods need to be established. For example, applying appropriate ML algorithms to infrared spectroscopy may improve the robustness of spectral data, which is critical for disease classification and prediction.
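The AUC quoted for the FTIR fingerprinting study above is the area under the receiver operating characteristic curve. One way to read it is as the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one. The minimal Python sketch below computes the AUC directly from that pairwise definition. The function name and the scores are illustrative assumptions, not data from the cited work.

```python
def roc_auc(scores, labels, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: classifier outputs, higher meaning "more likely positive".
    labels: true class labels. Ties between a positive and a negative
    score count as half a correctly ranked pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    ranked_correctly = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                ranked_correctly += 1.0
            elif p == q:
                ranked_correctly += 0.5
    return ranked_correctly / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical scores for three patients and two controls.
    print(roc_auc([0.9, 0.8, 0.4, 0.35, 0.1], [1, 1, 0, 1, 0]))
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; larger cohorts would normally use a rank-based or library implementation.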

MULTIMODAL ANALYSIS
Disease is a complex physiological process that involves many molecular changes in the body. Different detection techniques each have their specific focus, which precludes a comprehensive analysis of bio-samples by any single technique. To achieve a comprehensive analysis of biological samples, the structures and contents of as many biomolecules (lipids, proteins, amino acids, etc.) as possible need to be obtained. 135 Multi-modal analysis is therefore crucial in the field of IVD, owing to its ability to perform a comprehensive analysis of biological samples and improve diagnostic accuracy. In this review, we focus on the combination of mass spectrometry and spectroscopic techniques. Currently, there are three common strategies to combine them: (1) MS/spectroscopy, (2) MS/spectral imaging, and (3) MSI/spectral imaging. Compared with single-modal analysis, multi-modal analysis has two main advantages: it provides multi-dimensional complementary information, and it yields better ML performance.
Thus, multi-modal analysis can not only reveal the complex mechanisms of disease progression or treatment, 136 but also enable more accurate and personalized diagnosis. On the one hand, as mentioned earlier, multi-modal analysis provides complementary information on disease progression from different dimensions. More specifically, the combination of mass spectrometry and spectroscopy can enhance the understanding of biomolecules by combining mass information with structural information, facilitating the recognition of molecular changes in disease progression. For example, Ali et al. coupled SERS measurements with LC-MS-based metabolomics and proteomics to identify the chemical species responsible for the observed changes in SERS band intensities during plasmonic photothermal therapy. They found that integrating SERS with LC-MS-based metabolomics and proteomics can assist in assigning signals in SERS spectra, revealing the power of combining SERS with MS for studying cellular processes following photothermal therapy. 137 On the other hand, multi-modal analysis can improve ML performance by fusing datasets of different dimensions. ML algorithms are applied to obtain better feature representations and thus improve the accuracy of classification and prediction. Shu et al. designed a multi-modal serum profiling protocol using a PdAu@Au alloy by integrating SERS and LDI-MS for precision diagnosis of stroke. This multi-modal platform predicted the probability of stroke with an AUC value of 0.911 in the testing cohort by polynomial regression, which was higher than the AUC values of 0.755 and 0.822 obtained by single-modal SERS and LDI-MS diagnosis, respectively. The single-modal SERS and LDI-MS data were analyzed using RF and LASSO, respectively. This work demonstrated the emerging role of a highly efficient multi-modal serum profiling platform in precision medicine. 138 Han et al.
integrated SERS and MALDI-MS for rapid identification of osteosarcoma. They used PCA to divide the fused data into two groups and then built a classification model by PLS-DA using all the training data to discriminate the osteosarcoma patients from the healthy controls. The PCs accounted for 20.1% in the PCA on SERS spectra, 33.4% in the PCA on MALDI-MS spectra, and 55.5% in the PCA on the combined spectra. The multi-modal analysis was more effective than either technique alone at profiling plasma-derived exosomes for the identification of osteosarcoma. 139 Neumann et al. devised a protocol for performing MALDI-Fourier transform ion cyclotron resonance-MSI, followed by infrared spectroscopic imaging, on the same mouse brain. The fused dataset showed additional anatomical structures within the hippocampus that could not be identified by single-modal imaging, and significant differences in the abundances of four types of phosphatidylcholine lipids were detected between relevant structures within the hippocampus (Figure 7). 140,141 In summary, multimodal analysis plays a critical role in the field of IVD. Mass spectrometry and spectroscopy have their specific advantages in providing complementary information. As an example, fluorescence imaging can provide the location information of cells, 142 while MSI provides the current metabolic state of cells, together contributing to in situ metabolomics research on cells. In addition, different types of datasets can be fused to improve the accuracy of predictions or judgments, which helps to discover more potential biomarkers. For example, multimodal imaging of the same tissue can provide precise spatial alignment from different dimensions, enabling advanced image processing to discover more feature molecules. 140 However, there are still some limitations of multimodal analysis because of incompatibilities between substrates, sample preparation methods, 139 and data pre-processing. 143 Recently, a nanostructured substrate has been developed to enable surface-assisted laser desorption/ionization mass spectrometry imaging (SALDI-MSI) and SERS multimodal imaging. This method can acquire molecular images by SERS and SALDI-MSI, respectively, from the same sample using the same nanostructured silicon substrate. 143 Nonetheless, new substrate materials for data collection, new ML algorithms for data integration, and new instrument combinations for multimodal imaging remain to be further developed.

FIGURE 7 Multimodal analysis that combined MALDI-MSI and infrared spectroscopic imaging for localization of lipids in hippocampus. 140 Copyright 2018, American Chemical Society.
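The fused datasets described in this section combine features measured on different scales, for example Raman intensities and MS ion counts. A simple and commonly used way to build such a dataset is feature-level fusion: standardize each modality column by column, then concatenate the feature vectors of the same sample. The Python sketch below illustrates this under stated assumptions. It is not the fusion procedure of any cited study, and the function names and toy values are hypothetical.

```python
import math


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) variance.

    Constant columns carry no information and are mapped to 0.0.
    """
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(d)]
            for r in rows]


def fuse_modalities(spectral_rows, ms_rows):
    """Feature-level fusion: scale each modality, then concatenate per sample.

    Row i of both inputs must describe the same biological sample.
    """
    if len(spectral_rows) != len(ms_rows):
        raise ValueError("both modalities must cover the same samples")
    scaled_spec = zscore_columns(spectral_rows)
    scaled_ms = zscore_columns(ms_rows)
    return [a + b for a, b in zip(scaled_spec, scaled_ms)]


if __name__ == "__main__":
    # Hypothetical toy data: two samples, two spectral bands, one m/z feature.
    spectra = [[1.0, 10.0], [3.0, 10.0]]
    ms = [[100.0], [300.0]]
    print(fuse_modalities(spectra, ms))
```

In a real workflow the scaling parameters would be estimated on the training cohort only and then applied to the test cohort, so that no information leaks from test samples into the model.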

CONCLUSION
Developments in mass spectrometry and spectroscopy, together with ML, have advanced IVD. In this review, we summarized the features of different types of mass spectrometry and spectroscopy and their applications in IVD coupled with ML. The techniques covered include LC-MS, MALDI-MS, and AIMS in mass spectrometry; SERS, fluorescence spectroscopy, and FTIR in spectroscopy; and the corresponding imaging techniques. These classical or advanced techniques have been widely used in the detection of bio-samples and generate large datasets containing massive amounts of sample information. Large datasets make information processing time-consuming and may obscure critical information. To better decode these data and understand their regularities, ML has been introduced into the analysis process. Many ML strategies for processing big data have been studied, including clustering analysis, DR, regression analysis, and classification algorithms. In addition, multi-modal analysis further improves the accuracy of diagnosis and differentiation. Through these means, the information contained in biological samples is more fully analyzed in clinical applications, such as biomarker identification, disease diagnosis, and cell classification. However, some challenges should be taken into account before these techniques are widely applied in large-scale clinical diagnosis.
Mass spectrometry and spectroscopy face common limitations: (1) data reproducibility, 124 since environment, instrument, and human handling may all cause batch differences; (2) detection sensitivity and accuracy, since ion suppression and background noise from solvent or other components are the main reasons for insufficient sensitivity in complex bio-samples for mass spectrometry and spectroscopy, respectively; 123,144 and (3) accessibility in the clinical laboratory. To address these limitations, first, quality control of reagents is necessary and the stability of the instrument should be maintained to improve data reproducibility; second, optimized instrument parameters, well-designed analytical strategies, and optimized sample pretreatment are crucial to the sensitivity of mass spectrometry and spectroscopy; third, small, low-cost, and highly automated equipment should be developed to promote large-scale clinical application.
Besides, mass spectrometry and spectroscopy also have their specific limitations. For mass spectrometry, the structural libraries of mass spectrometry need to be further enriched. Information on isomeric molecules needs to be supplemented and annotated. 112 To overcome this challenge, mass spectrometry can be combined with other techniques like ion mobility to provide multidimensional information for identifying substances. 145 For spectroscopy, overlapping spectral peaks make the analysis of complex biological fluids difficult. The resolution of the instrument should be improved to reduce the number of overlapping spectral peaks and thus improve the selectivity of the analysis.
In addition, sample preparation and data pre-processing are challenges in multi-modal analysis. For sample preparation, performing multiple analysis methods on the same sample is favored to avoid heterogeneity among samples in multi-modal detection. For data pre-processing, MS and spectroscopy usually need different types of pre-processing because of the different nature of their datasets. Therefore, standardized data processing workflows need to be established.
As for ML, the most challenging problem is the lack of universality of algorithms. Different algorithms with their own characteristics correspond to different data types. It is tough to develop an algorithm that can be perfectly adapted to all models. For example, in DR analysis, PCA focuses on the global structure (i.e., the associations of different categories), 146 while t-SNE focuses on the local structure (i.e., the associations within a category). 147 Also, the size and structure of the dataset affect the choice of algorithms. Therefore, it is essential to choose the appropriate algorithms and optimize their parameters in the IVD application for better performance.
To sum up, as an emerging combination strategy, mass spectrometry/spectroscopy coupled with ML is gradually being acknowledged and applied in the field of IVD. With the development of mass spectrometry and spectroscopy instruments, more data of high quality and of more modalities can be collected. These data are expected to be processed by ML with optimized algorithms that can yield more accurate information for IVD. We envision that spectroscopic and mass spectrometric methods coupled with ML will become solid tools in IVD and will accelerate the development of the IVD industry.

ACKNOWLEDGMENTS
We gratefully acknowledge financial support from Projects 22074044 and 22122404 of the National Natural Science Foundation of China (NSFC) and Project KF2105 of the State Key Laboratory of Oncogenes and Related Genes.

CONFLICT OF INTEREST
The authors declare no conflict of interest.