Metabolomics and Machine Learning Approaches Combined in Pursuit for More Accurate Paracoccidioidomycosis Diagnoses

Paracoccidioidomycosis (PCM) is a fungal infection typically found in Latin American countries, especially in Brazil. The identification of this disease is based on techniques that may fail sometimes. Intending to improve PCM detection in patient samples, this study combined two recent technologies, artificial intelligence and metabolomics. This combination allowed PCM detection, independently of disease form, through identification of a set of molecules present in patients’ blood. The main advance of this research was the ability to detect the disease with better confidence than the routine methods employed today. Another important point is that, among the molecules, it was possible to identify some indicators of contamination and other infections that might worsen patients’ condition. Thus, the present work shows great potential to improve PCM diagnosis and even disease management, considering the possibility of identifying concomitant harmful factors.


RESULTS
Selection of potential biomarkers through machine learning. The machine learning method for biomarker determination described in Materials and Methods was applied over the spectrum data as follows.
The collection of 1,708 spectrum vectors of m/z intensities, resulting from the quintuplicate spectrometry measurements of biological samples from 343 individuals, was normalized by dividing each intensity by the highest absolute intensity in the vector (so that the maximum equals 1), and patients' samples were randomly split into a fit partition (Pfit) and a test partition (Ptest) in the proportions of 80% and 20%, respectively. Classifiers were trained and validated in all steps of the method using 10 experiments in which Pfit was randomly shuffled and divided into a training partition (Ptrain) and a validation partition (Pval) in the proportions of 80% and 20%, respectively. Figure 1 depicts the evolution of the metrics as the vector shrinks by discarding the less important features. Statistical metric definitions are shown in Table 1. The best results were achieved with a length of 28 features (Table 2). Table 3 shows the metrics for the most-discriminant feature point and also for the marker-selected ones. Even though 28 features were identified by the classifier as responsible for maximizing the prediction result, some of them were not considered actual PCM markers (Table 2) by the ΔJ criterion, by which a marker should have a higher probability of presenting higher intensities in the PCM-infected patients. Using the ΔJ criterion, 19 PCM candidate biomarkers were selected. Although the highest values of accuracy, sensitivity, and specificity were achieved during validation testing with the 28 best-length features (Table 3), there was no statistically significant difference in the same metrics when only the 19 PCM candidate biomarkers were evaluated in the final test. We therefore focused on elucidating these 19 features, intending to understand the PCM pathophysiology and looking for a specific yeast biomarker. In Fig. 2 6. However, we were not able to elucidate these species using the available tools.
Considering that this work is nontargeted metabolomics, the lack of database matches for some of the selected markers is not uncommon (11). Therefore, these features were classified as unknown biomarkers, although they remain important metabolites in PCM detection for the machine learning method. After metabolomics analysis, three biomarkers were selected as the most compatible ones with the metabolites of Paracoccidioides brasiliensis. Therefore, we performed an oriented classifier training with the features at m/z 756.6 and 758.6 (Table 5), and these two biomarkers alone were able to predict paracoccidioidomycosis with most of the metrics above 85%.

DISCUSSION
The present research allowed use of a machine learning method to select potential PCM biomarkers aimed at achieving better accuracy, sensitivity, and specificity metrics than the routine available methods. This approach enabled the identification of diverse biomarkers which are discussed below.
Among them, a mycotoxin known as fumonisin was selected in PCM patients' serum samples. Fumonisin is a toxin typically produced by Fusarium fungal species, which are frequently found in maize kernels (15). Economically, fumonisin B1 is considered the most harmful mycotoxin among fumonisins in Brazil, the third largest maize producer worldwide (16). Due to climatic characteristics, especially in Brazil's central and southern regions, Fusarium spp. are commonly found in maize fields, or even after harvest or during storage (17,18). Interestingly, the areas of prevalence of PCM cases coincide with the main agricultural area for grains, especially maize (5); therefore, infection by P. brasiliensis concomitant with fumonisin contamination has been proposed (19,20). Besides, fumonisins are known as modulators of mammals' immune responses, downregulating phagocytic activity and increasing antibody specificity against Paracoccidioides brasiliensis (20,21). Consequently, the cooccurrence of PCM and fumonisin contamination might negatively influence cellular immune response and worsen patients' clinical manifestations.
Among the identified PCM metabolites, cerebroside D and a glucosylceramide (GlcCer) were selected as important glycosphingolipids (GSLs) in patients' serum. Different studies have shown that glucosylceramide backbones, present in these biomarkers, are involved in host-pathogen interaction and may be associated with P. brasiliensis antigenicity (22). Some characteristics observed in both the sphingolipids cerebroside D and GlcCer are typical of glucosylceramides from fungi, such as a methyl group linked to the sphingosine chain and an (E)-Δ8-unsaturation. Cerebroside D presents another fungal characteristic, which is the presence of an additional Δ4-unsaturation in the ceramide moiety (23,24). In addition, cerebrosides are neutral glycosphingolipids widely found in pathogenic fungi and are involved in many cellular processes, such as signaling, differentiation, growth (25), and antigenic activity (23,26). Different immunological and analytical methods have also evaluated Paracoccidioides brasiliensis glucosylceramides and observed that cerebroside D was identified not only in mycelium but also in yeast samples, the pathogenic form (23,27). Therefore, these data corroborate our findings and indicate that cerebroside D and GlcCer are P. brasiliensis biomarkers present in patients' serum samples. Still in the sphingolipid class, phosphoethanolamine-ceramide (PE-Cer) was also selected as an important marker for detection of paracoccidioidomycosis. PE-Cer is a byproduct of sphingolipid metabolism through the sphingomyelinase pathway, where sphingomyelin, hydrolyzed into N-acylsphingosine (ceramide), is a precursor of PE-Cer (28,29). Sphingomyelin represents one of the most abundant sphingolipids that assemble the cellular membrane (30,31). When its metabolites, especially ceramide derivatives, are significantly present in biological samples, this indicates that cellular membranes are seriously damaged.
Some studies have shown that conversion of sphingomyelin into ceramide and its derivatives is associated with apoptosis (32,33), especially C16 ceramide, which corresponds to our selected PE-Cer (34). Besides, healthy lungs present lower ceramide levels than lungs with chronic obstructive pulmonary disease (35), which corroborates our findings that paracoccidioidomycosis affects sphingolipid metabolism and increases ceramide derivatives which, through cell death, reduce lung function and induce pulmonary manifestations.
Apart from sphingolipids, polyprenyl lipids and phosphorylated derivatives represent a small portion of glycerophospholipids in cellular membranes mainly found in bacteria, fungi, and plants. The selected phosphorylated polyprenyl lipid (m/z 1273.7) corresponds to a lipid carrier, undecaprenol, which is a 55-carbon-chain isoprenol (C55-P) used by prokaryotes in sugar carrier processes to build polysaccharide structures such as, for example, peptidoglycan (36). PCM patients may present genetic disorders that predispose to coinfections, for example, infections caused by mycobacteria, known as Mendelian susceptibility for mycobacterial diseases (MSMD) (5,37). Facing some genetic polymorphisms, PCM patients may present decreased levels of some cytokines, such as interferon gamma (IFN-γ) (38,39). Taking that into account, it is expected that the individual's immune system might be impaired and therefore be susceptible to coinfections. Therefore, it is plausible to find bacterial biomarkers in PCM patients, as coinfection is a possible condition in PCM since IFN-γ has an essential role in resistance to bacteria and resolution of infections as a macrophage activator (40,41). Although C55-PP is typically found in bacteria, polyprenyl-phosphate lipids are involved in protein glycosylation in all kingdoms of life, even eukaryotes (36,42). Further studies are necessary to evaluate the possibility of considering undecaprenol as a glycan lipid carrier in yeasts, which may be associated with glycosylation of Paracoccidioides brasiliensis' main antigen, a 43,000-Da glycoprotein (GP43).
Last, but not least, triacylglycerides were also considered relevant biomarkers in our study. Paracoccidioides brasiliensis is known to reprogram its metabolic pathways to improve the energetic supply. During mouse lung infection, the yeast cells showed upregulation of enzymes involved in lipid oxidation. For example, acetyl coenzyme A (acetyl-CoA) and propionyl-CoA (both derived from lipid catabolism) are used during P. brasiliensis infection to fuel the glyoxylate cycle and provide a supply for synthesis of biomolecules (43). Besides, proteomic studies have also shown that yeast cells, once internalized by macrophages, present fatty acid degradation and its usage as fuel for survival inside phagocytic cells (44). Therefore, PCM might increase the need for systemic triacylglycerides, probably induced by the energetic yeast demand.
Analyzing all the results, it was possible to identify a set of 28 features from which the applied method could select 19 PCM biomarkers, among them 13 molecules and their respective isotopes. Together, these markers are reliable indicators of PCM with 100% specificity, 94.1% sensitivity, and 97.1% accuracy. The 19 PCM markers were then elucidated according to metabolomics analysis. Next, it was observed that, among the 19 features, three biochemical markers were the most significant ones in our screening, and their specificity and accuracy were greater than 95%. These data show that, independently of fungal disease form and according to the predetermined set of discriminant biomarkers, it is possible to reach metrics such as 100% specificity, 94.1% sensitivity, and 97.1% accuracy, higher indexes than traditional microbiological and serological methods. In addition, some of the 19 features may be essential indicators of cooccurrence of infection or contamination, which opens a new alternative for applying metabolomics analysis to improve diagnosis and therapeutic approaches and ensure treatment confidence.
The present work consists of a biomarker screening test for PCM diagnosis through the association of different techniques: machine learning and mass spectrometry. The next step for further test refinement is the inclusion of more patients, especially with different systemic mycoses and lung infections (e.g., tuberculosis) to strengthen the validation for laboratory diagnosis. Microbiologists and other health professionals will be able to use this method easily and cheaply. We intend to make this possible through a software solution that will be combined with mass spectrometers and, together with them, will classify samples as positive or negative for PCM. Therefore, we propose to complete the sample set, increasing diversity and quantity and intending to magnify test confidence; retrain the mathematical model; and validate it for further application in clinical laboratories. In this way, this neglected disease might have a chance to be rapidly identified and recognized, enabling patients to receive proper medical care and reducing the sequelae and its social impact, mainly in individual work capacity and quality of life.

MATERIALS AND METHODS
Ethics statement. This work was approved under number 1.850.251 by the Research Ethics Committee of the University of Campinas. The patients were informed about the study through an approved consent form, and this study was conducted according to the principles expressed in the Declaration of Helsinki.
Research participants and specimen collection. In total, 343 individuals were included in this study, regardless of age and gender, in two main groups: the test group, consisting of PCM patients (n = 85), and the control group (n = 258). Aiming to increase the diversity in the control group, the latter was formed of healthy volunteers (n = 47) and patients with different infectious diseases, namely candidemia (n = 36), dengue (n = 47), Zika virus (ZIKV) infection (n = 65), and finally a group of people with fever symptoms but not dengue or Zika virus infection (n = 63), comprising a total of 258 samples, all negative for Paracoccidioides brasiliensis. The PCM group was established according to patients' previous serological tests with a positive reaction, done by the Adolfo Lutz Institute-Laboratory of Mycosis Immunodiagnosis, independently of paracoccidioidomycosis form. All the other infections were diagnosed according to the gold standard methods recommended for each one, including real-time PCR and microbiological culture tests.
PCM detection by AGID. Paracoccidioidomycosis was previously detected by a serological test based on agar gel immunodiffusion (AGID), according to the method of Kamikawa (45).
HRMS preparation and analysis. Starting from 20 µl of serum samples, prepared according to the method of Melo et al. (46,47), patients and control groups were evaluated in quintuplicates through direct injection into a high-resolution mass spectrometer (HRMS; ESI-LTQ-XL Orbitrap Discovery instrument [Thermo Scientific, Bremen, Germany]). Instrumentation parameters were set as follows: sample flow of 10 µl/min, sheath gas at 10 arbitrary units, source voltage of 5 kV, and capillary temperature of 280°C. Analytical quintuplicates were prepared and analyzed for each sample, from which metabolic fingerprints were captured in the mass range of 750 to 1,700 m/z in the positive ion mode.
Intending to confirm our findings, tandem mass spectrometry (MS/MS) was applied in the same instrument mentioned above. The collision gas used was helium, with collision-induced dissociation energy ranging from 30 to 60 (arbitrary units). The obtained experimental mass fragmentation spectra were collected and compared to in silico mass fragmentation profiles of each marker, simulated with Mass Frontier software (v. 6.0; Thermo Scientific, San Jose, CA).
Database search. The selected metabolic features were elucidated through a search on METLIN (Scripps Center for Metabolomics, La Jolla, CA), on the Lipid Maps database, and in literature.
Machine learning method. Forests of decision trees are among the best prediction algorithms in different areas of knowledge (48,49). They were proposed by Breiman (50), who developed and trademarked them as Random Forests. The method consists of combining results of many trained decision trees (bagging strategy) (51) using a subset of the data space (bootstrap strategy). The data's subspace is selected for training each decision tree through a random subset of the variables (dimensional subset) and a random subset of the data vectors (points subset). Each node in the decision trees tests one variable against a cutting decision value. The cutting value determines a hyperplane in the feature space, which is orthogonal to the variable's dimension and splits the space into two subspaces. The algorithm searches for the cutting values that increase the information gain on each decision node, arriving at a prediction value when a leaf is reached. A complete review of decision tree classifiers can be found in reference 52, and a probabilistic (Bayesian) explanation of them can be found in reference 53.
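As an illustration of the node-level search described above, the following minimal sketch looks for the cutting value of a single variable that maximizes an entropy-based information gain. It is not the authors' implementation: the function names, the choice of midpoints between consecutive values as candidate cuts, and the toy intensities are all assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut, gain) for the threshold on one variable with the highest gain.

    Candidate cuts are midpoints between consecutive distinct sorted values
    (an assumption of this sketch; other candidate sets are possible).
    """
    parent = entropy(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue
        cut = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = parent - children
        if gain > best[1]:
            best = (cut, gain)
    return best

# Made-up normalized intensities of one m/z variable; 1 = PCM, 0 = control.
intensities = [0.05, 0.10, 0.12, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 1, 1]
cut, gain = best_cut(intensities, labels)
```

In this toy case the two classes are perfectly separable on the variable, so the selected cut falls between 0.12 and 0.60 and the gain equals the full parent entropy of 1 bit.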
The Random Forest algorithm deals with multivariate nonlinear problems with simple parameterization. Its parameters can be adjusted to enhance the prediction performance and computation footprint, such as the number of trained trees in the forest, the size of the variable subspace used in each training, depth of each tree, and pruning strategy, among others (54).
With the classifier trained, the classification is a simple sequence of tests traversing the decisions in the forest and combining the results by majority voting (mode of the classification results) or by another aggregation method. With this fusion strategy, Random Forest classifiers yield a robust-to-noise performance in the prediction for new data.
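The majority-voting fusion described above can be sketched as follows. The three "trees" are hypothetical one-node stand-ins (simple threshold tests) rather than trained models, and the sample values are invented for the example.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most frequent label among the per-tree predictions (the mode)."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_predict(trees, x):
    """Classify x with every tree (any callable x -> label) and fuse by majority vote."""
    return majority_vote([tree(x) for tree in trees])

# Hypothetical stumps, each testing one normalized intensity against a cut value.
trees = [
    lambda x: int(x[0] > 0.36),
    lambda x: int(x[1] > 0.50),
    lambda x: int(x[2] > 0.20),
]
sample = [0.70, 0.10, 0.45]
prediction = forest_predict(trees, sample)  # votes are 1, 0, 1
```

Using an odd number of trees avoids ties in a two-class problem; with an even number, this sketch would return whichever tied label was seen first.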
Another advantage of using the Random Forest algorithm is the ability to identify which variables contribute more to the prediction results, i.e., which variables have a more determinant impact on the forecast performance statistics (accuracy, precision, and others). This property, known as variable importance (or feature importance) (55,56), is particularly essential in the metabolomics study conducted in this paper, as the starting point for the metabolomics analysis is to discover which molecules represented in the spectrum data drive the successful predictions of the algorithm.
The machine learning approach used in this paper is similar to the method applied and already described in a paper on ZIKV detection (57), which successfully identified, in blood serum, molecules associated with the viral metabolic process.
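One common way to estimate variable importance is by permutation, the variant this paper reports using in its selection procedure: a variable is important if shuffling its values degrades the prediction. The sketch below is a generic illustration under assumptions, not the authors' code; `predict` stands in for any trained classifier, accuracy is used as the score, and the data are invented.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the classifier output matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: feature 0 separates the classes, feature 1 is ignored by the model.
X = [[0.10, 0.5], [0.20, 0.9], [0.15, 0.3], [0.80, 0.4], [0.90, 0.7], [0.70, 0.2]]
y = [0, 0, 0, 1, 1, 1]
predict = lambda row: int(row[0] > 0.5)  # stand-in for a trained classifier
importances = permutation_importance(predict, X, y)
```

Here the second feature receives an importance of exactly zero because the stand-in classifier never reads it, while shuffling the first feature lowers the accuracy.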
The analysis method consists of training a Random Forest classifier using labeled data: the already diagnosed condition of PCM infection (positive samples) and noninfection (control, negative samples), refining the process and selecting m/z variables until the best prediction performance is achieved.
The variable selection process uses the Random Forest feature importance to discard the less discriminant features and consequently to identify the most important ones, which drive the method to the best classification results. The optimization algorithm seeks to maximize the objective metric (e.g., the F1 score measure [Table 1]), discarding, in each iteration, the 10% least discriminant features. The feature importance calculation used in this process employs the variable permutation algorithm, which is considered the best way (56) to compute the contribution of each feature to the classification result.
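The iterative shrinking of the feature vector can be sketched as a simple loop. This is an illustration under assumptions, not the authors' code: `fit_and_score` is a hypothetical callback that would train the classifier on the given features and return a validation score (for instance, F1) together with per-feature importances, and the toy callback below merely simulates that behavior.

```python
import math

def select_features(features, fit_and_score, drop_fraction=0.10):
    """Repeatedly drop the least important features, keeping the best-scoring subset.

    fit_and_score(features) is a hypothetical callback returning
    (validation_score, {feature: importance}) for the given feature list.
    """
    current = list(features)
    best_score, best_subset = -math.inf, list(current)
    while current:
        score, importance = fit_and_score(current)
        if score > best_score:
            best_score, best_subset = score, list(current)
        n_drop = max(1, int(len(current) * drop_fraction))
        if n_drop >= len(current):
            break  # nothing would be left to evaluate
        ranked = sorted(current, key=lambda f: importance[f])  # least important first
        current = ranked[n_drop:]
    return best_subset, best_score

# Toy stand-in: only "f0" and "f1" carry signal; every extra feature lowers the score.
informative = {"f0", "f1"}

def toy_fit_and_score(current):
    noise = [f for f in current if f not in informative]
    score = 1.0 / (1 + len(noise)) if informative <= set(current) else 0.0
    importance = {f: 1.0 if f in informative else 0.01 * int(f[1:]) for f in current}
    return score, importance

features = [f"f{i}" for i in range(20)]
best_subset, best_score = select_features(features, toy_fit_and_score)
```

Tracking the score at every vector length, rather than stopping at the first decrease, mirrors the idea of inspecting how the metrics evolve as the vector shrinks and then choosing the best length.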
As the most discriminant m/z values are determined, a statistical distribution analysis over the intensity of the corresponding ions determines which ones are more frequently present in the infected patients than the control ones, identifying the possible biomarkers for the disease. This is an essential step in the preparation for the metabolomics analysis, as it narrows down the possible biomarkers to a small number of molecules that makes the biochemical analysis and metabolic process determination feasible and faster.
Possible biomarkers are the most discriminant features (determined by the learning process) for which the cumulative distribution function (CDF) of the intensity values of the negative patients, computed at the median intensity of the positive patients, exceeds the CDF of the positive patients at the same point by more than ΔJ%. This indicates that the probability of finding a higher intensity of the m/z in the spectra of positive patients is much higher than in those of the negative ones, which we consider evidence of a possible biomarker, to be validated by the subsequent metabolomics investigation.
$F_j$ is a marker feature if $Q(m_j) - P(m_j) > \beta$, where $y_j$ is an $F_j$ value for a patient (a positive patient under $p$ and $P$, a negative patient under $q$ and $Q$), $m_j$ is the median of the $F_j$ values of all positive patients, $p(y_j)$ is the probability distribution function of positive patients, $q(y_j)$ is the probability distribution function of negative patients, $P(y_j)$ is the cumulative distribution function (CDF) of the positive patients' values, $Q(y_j)$ is the CDF of the negative patients' values, and $0 < \beta < 0.5$ is the CDF difference at the median of feature $j$ for the positive patients (e.g., $\beta = 30\%$).
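A minimal sketch of this marker test with empirical CDFs is given below. It reads the criterion as Q(m_j) - P(m_j) > β, which is an interpretation of the verbal description (the CDF of the negative patients at the positive median exceeding that of the positive patients by more than the threshold) and not a formula quoted from the source; the function names and intensity values are invented for the example.

```python
from statistics import median

def ecdf(values, x):
    """Empirical CDF: fraction of values less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

def is_marker(positive, negative, beta=0.30):
    """Assumed criterion: Q(m) - P(m) > beta, with m the median of the positive values."""
    m = median(positive)
    return ecdf(negative, m) - ecdf(positive, m) > beta

# Made-up normalized intensities of one m/z feature in each group.
positive = [0.40, 0.55, 0.60, 0.70, 0.80, 0.90]
negative = [0.05, 0.10, 0.10, 0.20, 0.30, 0.35]

marker = is_marker(positive, negative)        # negatives lie below the positive median
not_marker = is_marker(negative, positive)    # roles swapped: criterion fails
```

Because P evaluated at the positive median is about 0.5 by construction, the test amounts to asking whether clearly more than half of the negative patients fall below the typical positive intensity.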
For the robustness and stability evaluation in each machine learning step, and in the whole process, the data are divided into two primary partitions, one for the fitting process (determination of the variables and parameters for the best result), called the fit partition (Pfit), and a second one reserved for the final evaluation of the model, called the test partition (Ptest). It is important that the test partition is kept apart from the entire process, so that it will reflect how the algorithms will deal with entirely new data. It is also important that the same patient's data are not spread over different partitions, to avoid the learning process being contaminated with information from patients set aside for the final test. In other words, the algorithms learn the whole process with patients in the training set who will never be present in the test set.
During the fitting process, the patients' data in the fit partition are randomly shuffled and sliced into two new partitions, the training partition (Ptrain), which is used for the training of the classifier, and the validation partition (Pval), which serves to measure the classifier prediction performance. The fitting process is repeated 10 times, reshuffling the training and validation partitions each time, so that across experiments the same patient's data may fall on either side. It is also important to note that, within each experiment, all replicates from the same patient are always inside the same partition to avoid data cross-contamination between training and validation, or the final test. A low variance between the classification results in the 10 experiments shows that the trained model is stable and generalizes over the fit partition data.
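The patient-level partitioning described above can be sketched as follows. The record layout, helper names, and the 10-patient toy data set are assumptions made for the example; the point illustrated is that splitting is done on patient identifiers, so that all replicates of a patient land in the same partition.

```python
import random

def split_patients(patient_ids, small_fraction=0.20, rng=None):
    """Shuffle unique patient IDs and return (larger_set, smaller_set)."""
    if rng is None:
        rng = random.Random()
    ids = sorted(set(patient_ids))
    rng.shuffle(ids)
    n_small = round(len(ids) * small_fraction)
    return set(ids[n_small:]), set(ids[:n_small])

def rows_for(patients, spectra):
    """Select every replicate spectrum whose patient belongs to the given set."""
    return [s for s in spectra if s["patient"] in patients]

# Hypothetical records: 10 patients with 5 replicate spectra each.
spectra = [{"patient": f"p{i}", "replicate": r} for i in range(10) for r in range(5)]
rng = random.Random(42)

# Primary split: fit partition versus held-out test partition (80%/20% of patients).
fit_ids, test_ids = split_patients([s["patient"] for s in spectra], 0.20, rng)

# Ten experiments: reshuffle the fit partition into training and validation sets.
experiments = []
for _ in range(10):
    train_ids, val_ids = split_patients(fit_ids, 0.20, rng)
    experiments.append((rows_for(train_ids, spectra), rows_for(val_ids, spectra)))
```

A patient may serve for training in one experiment and for validation in another, but never for both within the same experiment, and the test patients are never touched during fitting.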
Data availability. The data used in this paper can be divided into two sets: mass spectrometry data from PCM patients and healthy volunteers (here referred to as raw data) and the machine-learning-derived data calculated on top of the former. The raw mass spectrometry data from PCM patients and healthy volunteers that support the findings of this study are available upon request to the corresponding authors, A.R.R. and R.R.C. These data are anonymized due to participants' privacy restrictions and are available free of charge, but due to constraints in the acquisition protocol, they can be made available only upon request. As the machine-learning-derived data do not involve any sensitive information, they are available directly through the Zenodo open-access repository at https://doi.org/10.5281/zenodo.3763768. The institutional review board (IRB) authorization for the data acquisition was registered under the number CAAE ZIKA 053407/2016 at the University of Campinas, Brazil (58).