Limits and Prospects of Molecular Fingerprinting for Phenotyping Biological Systems Revealed through In Silico Modeling

Molecular fingerprinting via vibrational spectroscopy characterizes the chemical composition of molecularly complex media, which enables the classification of phenotypes associated with biological systems. However, the interplay between factors such as biological variability, measurement noise, chemical complexity, and cohort size makes it challenging to investigate their impact on classification performance. Considering these factors, we developed an in silico model which generates realistic, yet configurable, molecular fingerprints. Using experimental blood-based infrared spectra from two cancer-detection applications, we validated the model and subsequently adjusted model parameters to simulate diverse experimental settings, thereby yielding insights into the framework of molecular fingerprinting. Intriguingly, the model revealed substantial improvements in classifying clinically relevant phenotypes when the biological variability was reduced from a between-person to a within-person level and when the chemical complexity of the spectra was reduced. These findings quantitatively demonstrate the potential benefits of personalized molecular fingerprinting and biochemical fractionation for applications in health diagnostics.

The table below provides sample size, age, sex, and body mass index (BMI) descriptions of the experimental lung and prostate cancer cohorts. These cohorts were used both for calibrating the biological variability of the simulated samples and for validating the statistical properties of the simulated cohorts. Details about the study design, sample collection and handling, and the measurements of the samples are described in previously published work.1

[Table: cohort characteristics of the lung cancer cohort and the prostate cancer cohort; values not recovered]

Analysis code
To facilitate the use of the model, the Python code used for generating the artificial spectra is deposited in a GitHub repository.2 There, we further provide Python scripts (as Jupyter notebooks) to perform and reproduce all simulations shown in the results of the main text of the article.

Data access
Making use of the proposed model, we provide a dataset of generated lung and prostate cancer cohorts of FTIR spectra which model the data used within this work (n = 100,000 for each cancer entity) in a GitHub repository.2 The dataset was created using our experimental blood serum measurements for each cancer entity and the matched non-symptomatic controls. The artificial cohorts were created by following the described procedure for calibrating the biological variability using experimental measurements and the measurement noise using repeated water measurements. The water measurements (n = 400) used for calibrating the measurement noise are deposited alongside the generated artificial cohorts. These data can serve as an input for generating additional artificial cohorts for in silico investigations, and as a basis for gaining insights when only limited numbers of serum-based measurements are available.
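As a rough illustration of the noise-calibration step described above, the following NumPy sketch computes a per-feature noise level from repeated reference measurements and injects it into simulated spectra. All array sizes and magnitudes are synthetic placeholders (the actual water spectra are in the repository), and the L2-normalization rescaling used in the paper is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 400 repeated water absorbance spectra
# (rows: repeated measurements, columns: spectral features / wavenumbers).
water_spectra = rng.normal(loc=1.0, scale=0.01, size=(400, 1000))

# Calibrate the noise level: standard deviation across the repeated
# measurements, computed separately for each spectral feature.
noise_std = water_spectra.std(axis=0)

def add_measurement_noise(spectra, noise_std, rng):
    """Add zero-mean Gaussian noise with the calibrated per-feature std."""
    return spectra + rng.normal(0.0, noise_std, size=spectra.shape)

# Example: perturb a batch of simulated spectra with the calibrated noise.
simulated = rng.normal(size=(10, 1000))
noisy = add_measurement_noise(simulated, noise_std, rng)
```

The same `noise_std` vector can be rescaled to simulate stronger or weaker instruments, which is the knob varied in the in silico experiments below.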

Machine learning analysis
Classification performance was evaluated by calculating the area under the receiver operating characteristic curve (ROC-AUC) of trained L2-penalized logistic regression models. To estimate the ROC-AUC score on unseen test samples, the data was first split according to a 10-fold stratified cross-validation. On the training splits of each fold, an exhaustive grid search was carried out using an inner 3-fold stratified cross-validation to determine an optimal value for the cost parameter C of the logistic regression,3 i.e., the value that maximizes the ROC-AUC estimated on the inner validation sets. Once an optimal value was determined, the logistic regression was re-fit on all samples of a given training split and tested on the corresponding outer set of testing samples to provide an unbiased estimate of performance on unseen data. For some classification problems, the outer 10-fold cross-validation was repeated; this is noted in the corresponding sections of the results. The ROC-AUC scores across all folds were averaged and reported along with their standard deviation. When investigating the effect of the cost parameter C (Section 11 in the Supporting Information), the inner cross-validation and grid search steps were skipped and the results for all C values were shown. Data standardization (subtracting the mean and dividing by the standard deviation of each spectral feature) was applied as a preprocessing step before training and testing the logistic regression models. All mean and standard deviation values were calculated from the training splits and applied to their corresponding testing splits. The logistic regression, cross-validation, and grid search algorithms were used as implemented in the Scikit-Learn open-source Python package (version 0.24.1).4

To determine a realistic estimate for the measurement noise of the FTIR measurement device, a total of 400 water samples were repeatedly measured with the FTIR spectrometer.
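The nested cross-validation scheme described above can be sketched with scikit-learn as follows. The arrays X and y are placeholders for the spectra and case/control labels, and the C grid shown is illustrative rather than the authors' exact grid; the pipeline ensures that standardization statistics are computed on training splits only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))        # placeholder for the spectral features
y = rng.integers(0, 2, size=120)      # placeholder case/control labels

# Inner loop: 3-fold stratified grid search over the cost parameter C of an
# L2-penalized logistic regression, maximizing the inner-validation ROC-AUC.
inner = GridSearchCV(
    make_pipeline(StandardScaler(),
                  LogisticRegression(penalty="l2", max_iter=1000)),
    param_grid={"logisticregression__C": [2.0**k for k in (-12, -4, 0, 4, 12)]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)

# Outer loop: 10-fold stratified cross-validation; the ROC-AUC on each
# held-out fold estimates performance on unseen samples.
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(inner, X, y, scoring="roc_auc", cv=outer)
print(scores.mean(), scores.std())
```

Because the grid search is wrapped inside `cross_val_score`, the winning C is re-fit on each full training split and evaluated only on its outer test split, matching the procedure in the text.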
The standard deviation across the repeated measurements was subsequently calculated for each spectral feature. Fig. S1A depicts the water measurements performed on the FTIR spectrometer. These measurements were used to calibrate the additive measurement noise introduced into the model (illustrated in Fig. S1B).

Figure S1: Calibration of the measurement noise using repeated water FTIR measurements. (A) Water reference absorbance spectra (n = 400) with the mean of all water measurements subtracted. (B) A set of random vectors generated based on the standard deviation of the repeated water measurements across the wavenumbers, shown on an arbitrary y-axis. The vectors in (B) illustrate the calibrated measurement-noise coefficient in the simulation model and were scaled by a factor of 9.15 to account for the L2 vector normalization in the preprocessing of the sera measurements.

Fig. S6 revisits the validation shown in Fig. 2D of the main text and Fig. S2D. Here, larger simulated cohorts were created to validate a model trained on simulated data and tested on experimental data. In comparison to Fig. 2D and S2D, the ROC-AUC values could be fully recovered using the larger sample sizes for both the lung (Fig. S6A) and prostate cancer (Fig. S6B) applications.

Regularization plays a pivotal role in the predictive modeling of ill-posed problems such as the classifications explored within this study. Regularization techniques help predictive models mitigate the effects of multicollinearity and noise present in the training data.6,7 By introducing an L2 penalty, a supervised machine learning model solves the following generalized optimization problem, given a set of instance-label pairs (x_i, y_i):

    min_w  (1/2)‖w‖² + C Σ_i ξ(w; x_i, y_i)

where ξ(w; x_i, y_i) is a defined error, or loss, function used in a predictive model. In other words, the optimization finds the weight vector w that minimizes the error between predicted outcomes and the known outcomes. To avoid a solution where w becomes unduly large in magnitude, overfitting to peculiarities in the training data, the L2 penalty term is introduced, and the extent of its involvement is controlled by the free parameter C > 0. The closer the value of C is to 0, the sparser the solution, i.e., the entries of w are pushed closer to 0. The optimal value for C can be determined by cross-validation, finding the value that strikes a balance between allowing the model to learn the best possible w and not overfitting.
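This is the same parameterization exposed by scikit-learn's LogisticRegression through its C argument. A small sketch on synthetic data (the C values are illustrative) shows the shrinkage behavior described above, with the weight norm approaching zero as C approaches 0:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
# Labels driven mostly by the first feature, with label noise so the
# classes overlap and the unregularized solution stays finite.
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

norms = {}
for C in (2.0**-12, 2.0**0, 2.0**12):
    w = LogisticRegression(penalty="l2", C=C, max_iter=1000).fit(X, y).coef_
    norms[C] = np.linalg.norm(w)

# Smaller C -> heavier penalty -> weight vector closer to zero.
print(norms)
```

Because the solution norm grows monotonically with C, heavily regularized models (small C) trade fitting capacity for robustness, which is exactly the trade-off explored in Fig. S7.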
We used our simulation model to investigate the effects of regularization on the performance of detecting lung and prostate cancer using L2-regularized logistic regression models at varying measurement noise and biological variability levels (Fig. S7A-D). We simulated artificial cohorts at increments of increasing measurement noise, holding the biological variability at the calibrated level, and subsequently cross-validated on each simulated dataset with different values for the parameter C. A similar evaluation was repeated to examine the effect of increasing biological variability while keeping the measurement noise at the calibrated level.
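The sweep described above can be sketched as follows. Synthetic data replaces the simulated cohorts, and the noise levels and C values are illustrative, not the calibrated ones.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = (X[:, 0] > 0).astype(int)        # placeholder class structure

noise_levels = [0.0, 0.5, 1.0, 2.0]  # increments of additive measurement noise
C_values = [2.0**-12, 2.0**0, 2.0**12]

# results[i, j]: mean cross-validated ROC-AUC at noise level i and C value j.
results = np.empty((len(noise_levels), len(C_values)))
for i, s in enumerate(noise_levels):
    X_noisy = X + rng.normal(0.0, s, size=X.shape)   # corrupt the cohort
    for j, C in enumerate(C_values):
        clf = make_pipeline(StandardScaler(),
                            LogisticRegression(penalty="l2", C=C, max_iter=1000))
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        results[i, j] = cross_val_score(clf, X_noisy, y,
                                        scoring="roc_auc", cv=cv).mean()
```

Plotting `results` against the noise levels, one curve per C value, reproduces the kind of comparison shown in the first row of Fig. S7.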
We found that with heavy classifier regularization (e.g., C = 2^-12), the classification performance on unseen data samples was relatively robust to increasing levels of measurement noise (Fig. S7A and C). Such heavily regularized models, however, did not provide optimal classification performance at the explored levels of measurement noise and led to "underfit" models that failed to capture the complexity of the problem. With little-to-no measurement noise, classifiers with weaker regularization (e.g., C = 2^12) consistently achieved the best testing performance. Since such weakly regularized models are very susceptible to overfitting, increasing measurement noise quickly led to a loss of generalization: the classifiers under-performed on the testing sets while perfectly separating the training samples.
Comparatively, the optimal regularization parameter was less sensitive to varying levels of biological variability (Fig. S7B and D). The optimal regularization strength (C = 2^6) remained optimal across the explored biological variability domain. Moreover, for this type of noise, all explored values of C suffered in a similar way from increasing noise. This revealed that the optimal penalty does not depend on the biological variability, and thus the effects of increased biological variability cannot be mitigated by tuning a better classifier.

Figure S7: Influence of measurement noise and biological variability on classifier regularization. Binary classifiers with varying regularization strengths (parameter C) were fit to detect cancer on spectral cohorts simulated at different levels of measurement noise (first row) and biological variability (second row). (A-B) depict the effect on detecting lung cancer, while (C-D) depict the effect on detecting prostate cancer. Performance, as measured by the ROC-AUC, was estimated by cross-validating and averaging the score on each cohort created.

Fig. S8 repeats the validation of the numerical simulation shown in Fig. 2 in the main text and supporting Fig. S2, but introduces a single discriminant feature vector to describe the case measurements and considers only the biological variability of the non-symptomatic references to calibrate the model (equation 7 in the main text).

Figure S8: Validation of the calibration procedure for the model variant that uses a single discriminant vector to distinguish between cases and controls. In this investigation, the biological variability was calibrated based on the control measurements for each cancer cohort. (A-B) The differential fingerprint, defined as the difference between the mean of the case and control spectral features, for the experimental cohort and the simulated cohorts as averaged across 10 simulation repetitions.
This differential fingerprint was introduced to each simulated case measurement to model the difference between the classes using a single vector for the lung (A) and prostate (B) cancer applications. (C-D) The standard deviation of the control spectral features for the experimental cohort and the simulated cohorts as averaged across 10 simulation repetitions for both the lung cancer (C) and prostate cancer (D) applications. (E-F) The standard deviation of the case spectral features for the experimental cohort and the simulated cohorts as averaged across 10 simulation repetitions for both the lung cancer (E) and prostate cancer (F) applications. For the simulated case measurements (E-F), the biological variability was scaled to minimize the squared error between the simulated standard deviation spectrum and the experimentally derived one. (G-H) The receiver operating characteristic curves (ROCs) for binary case-control classifications of lung cancer (G) and prostate cancer (H) when training and testing on experimental samples (blue), training and testing on simulated samples (cyan), and training on simulated samples and testing on experimental samples (gray). The areas under the curves are listed in the figure legends along with their standard deviations across the cross-validation splits.
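The single-discriminant-vector model variant described in this caption can be sketched roughly as follows. All calibration inputs here are synthetic placeholders rather than the calibrated values from the experimental cohorts, and the generator is a simplified stand-in for equation 7 of the main text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 500

# Hypothetical calibration inputs: mean control spectrum, per-feature
# biological variability (std of the controls), and a differential
# fingerprint (mean case spectrum minus mean control spectrum).
mean_control = rng.normal(1.0, 0.1, size=n_features)
bio_std = np.abs(rng.normal(0.05, 0.01, size=n_features))
diff_fingerprint = rng.normal(0.0, 0.02, size=n_features)

def simulate_cohort(n_controls, n_cases, scale=1.0):
    """Controls vary around the mean spectrum with the calibrated biological
    variability; cases additionally receive the single discriminant vector."""
    controls = (mean_control
                + scale * bio_std * rng.normal(size=(n_controls, n_features)))
    cases = (mean_control
             + scale * bio_std * rng.normal(size=(n_cases, n_features))
             + diff_fingerprint)
    return controls, cases

controls, cases = simulate_cohort(100, 100)
```

The `scale` argument mirrors the scaling step in the caption, where the biological variability of the simulated cases is adjusted to match the experimentally derived standard deviation spectrum.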