Smart-Plexer: a breakthrough workflow for hybrid development of multiplex PCR assays

Developing multiplex PCR assays requires extensive experimental testing, the number of which exponentially increases by the number of multiplexed targets. Dedicated efforts must be devoted to the design of optimal multiplex assays ensuring specific and sensitive identification of multiple analytes in a single well reaction. Inspired by data-driven approaches, we reinvent the process of developing and designing multiplex assays using a hybrid, simple workflow, named Smart-Plexer, which couples empirical testing of singleplex assays and computer simulation to develop optimised multiplex combinations. The Smart-Plexer analyses kinetic inter-target distances between amplification curves to generate optimal multiplex PCR primer sets for accurate multi-pathogen identification. In this study, the Smart-Plexer method is applied and evaluated for seven respiratory infection target detection using an optimised multiplexed PCR assay. Single-channel multiplex assays, together with the recently published data-driven methodology, Amplification Curve Analysis (ACA), were demonstrated to be capable of classifying the presence of desired targets in a single test for seven common respiratory infection pathogens.

the wet laboratory 14,37 .For  " multiplexed targets, if  #$ candidate primer sets are designed for each of them (which is trivial progress for well-designed singleplex assays), the total number of possible multiplex assay combinations is  % =  #$ ' ( (e.g. % = 16,384 when  #$ = 4 and  " = 7).The  % increases exponentially with  " , making it impractical to find the optimal combination by wet-lab experiments in highlevel multiplexing.Therefore, an in-silico simulation method is required for fast screening and for narrowing down selections of multiplex assays.
To address this problem, here we present the Smart-Plexer, a mathematical algorithm capable of simulating thousands of possible multiplex assay combinations based on singleplex real-time digital PCR (qdPCR) data.We aim to demonstrate the use of this new methodology by developing a TaqMan-based multiplex assay, in a single fluorescent channel, for the specific and sensitive detection of seven common respiratory tract infection (RTI) pathogens.This work is two-folded: First, we validated the Smart-Plexer by comparing the performance of all possible simulated and empirical combinations in 3-plex, showing a strong correlation between in-silico and lab-tested multiplexes; Second, we assessed the proposed pipeline in highlevel multiplex (7-plex) by evaluating the ACA classification performance on synthetic DNA and clinical samples.We demonstrated that, out of 4,608 simulated combinations, an optimal multiplex assay could be developed using this novel framework to detect seven common respiratory pathogens accurately in qdPCR.

Results
Smart-Plexer design framework.We developed the Smart-Plexer, a framework that uses singleplex PCR reactions as a 'card deck' to generate a 'winning combination' of the multiplex assay.After deciding the number of targets intended to multiplex, the Smart-Plexer takes as input a dataset generated from real-time PCR reactions with a single primer set (or singleplex assay) and a single target.Given the desired number of targets to multiplex in a single channel PCR, sigmoidal curves generated from all the singleplex/target interactions can be combined to simulate curves from a multiplex assay (Fig. 1).These simulations of assay combinations are then empirically tested in wet-lab multiplex tests for each target to evaluate changes in the curve shape of the amplification reaction during the transition from singleplex to multiplex environment (empirical multiplex).Moreover, to identify multiple targets with empirical multiplexes, this framework was coupled and evaluated with the ACA methodology.
As the ACA is a classifier recognising clusters from different amplification shapes (which, in our case, represent different targets), it is crucial to maintain differences among sigmoidal trends in silico.Therefore, those differences across targets can be computed using the Smart-Plexer method through distance measurements (such as Euclidian distance).This novel framework is capable of distance calculation from either the entire amplification curve or its sigmoidal features.The average of computed distances among all the targets is used to rank each combination of singleplex (or simulated multiplex) from high to low intercurve similarity values.Moreover, the ranking system takes the minimum distance between the two closest targets to ensure that simulated multiplex with high average values is not dependent on the high difference of only a group of curves.When two amplification curves have high similarity, hence a small distance value, the ACA classifier will not work efficiently to identify either target.Therefore, the rank of the combination depends on both average and minimum distance scores.A set of singleplex assays from the top ranks were selected as simulated multiplex for the empirical validation in the laboratory, and the ACA performance was assessed.
To compute distances between amplification curves, the Smart-Plexer requires a filtering process where the amplification data generated undergo the following steps: (i) subtraction of curve background to remove the fluorescence signal noise at the starting cycles, (ii) removal of late amplification curves to exclude nonplateau reactions, (iii) removal of noisy curves to exclude non-sigmoidal shapes resulted by operator or instrumentation faults 38 .The following step comprised of a fitting equation using the 5-parameter model proposed by Spiess et al. 39  Secondly, the framework was evaluated using curves normalised with the final fluorescence intensity (FFI) as input to assess performance changes by removing the absolute fluorescence information.To further investigate changes related to different curve representations and different levels of data abstractions (feature dimensions) provided to the Smart-Plexer, sigmoidal parameters generated from a fitting model were also used as input to assess the influence on this framework.
To evaluate the best fitting model, primary efforts have been focused on the selection of an appropriate equation.Several methods have been proposed to efficiently model the real-time PCR sigmoid, such as four, five, and six-parametric functions 40,39,41 .As a case study, we retrieved the amplification curve data previously reported by Moniri et al., 2020 30 .Using raw curves as input, after sigmoidal fitting, we calculated the Mean Square Error (MSE) between the raw and the fitted curves for the entire dataset.As shown in Supplementary  where  is the amplification time (or PCR cycle), () is the fluorescence at time ,  is the maximum fluorescence,  is the baseline of the sigmoid,  is related to the slope of the curve,  is the fractional cycle of the inflection point, and  allows for an asymmetric shape (Richard's coefficient).
The three different curve representations (raw curves, FFI normalised curves and fitted parameters) were further used to evaluate the transferability from singleplex to multiplex reactions in the Smart-Plexer.
Average Distance Score (ADS) and Minimum Distance Scores (MDS) based on curve distances to rank multiplex assays.We developed two distance metrics to measure transferability from simulated to empirical multiplexes, since it is hypothesised that distances between amplification curves should be maintained during the transition from singleplex to multiplex environments.
It is possible to calculate distances between two distinct curves by considering them as two data points in the multidimensional space and quantify their distances using various metrics (i.e., Euclidian, Cosine and Manhattan).In a single channel multiplex assay, the number of primer sets present in the reaction equals the number of targets ( " ), therefore the number of distances (  ) among curves of different targets is represented by the following formula: The average of all the distances is used to assign a score to the multiplex assay called Average Distance Score (ADS).The ADS provides information on the overall distances across targets, and the higher its values are, the more distant the curves are, and better ACA performance is expected (as distances are related to data point clusters).A high ADS does not guarantee a large distance between every two targets of the multiplex.To overcome this limitation, we considered a second metric called Minimum Distance Score (MDS), the distance value of the two closest curves (minimum value of the given  5 distances).
The ADS and MDS narrow down the selection of empirical testing for the highest performing multiplexes using a ranking system.Moreover, they are used to validate that inter-curve distance information is maintained during the transition from simulated to empirical multiplexes, and they can be used to develop assays in silico more suitable for ACA, skipping costly and timely laboratory testing.
Smart-Plexer validation using a 3-plex assay.To assess the performance of the Smart-Plexer for both in-silico multiplex development and ACA classification accuracy, we designed three primer sets for three selected targets using synthetic DNA and tested them in real-time digital PCR (qdPCR): Adenovirus (HAdV), Human coronavirus HKU1 (HCoV-HKU1) and Middle East respiratory syndrome-related coronavirus (MERS-CoV).As shown in Fig. 1, the number of combinations to test using  " targets ( " = 3) and  #$ assays for each target ( #$ = 3) is 27 ( % =  #$ ' ( = 27 combinations, listed in Supplementary Table 2).Three targets were chosen to validate the Smart-Plexer because a complete comparison of all the 27 simulated and empirical multiplex assays can be experimentally conducted as the number of wet-lab experiments is achievable ( % ×  " = 81 tests).
The wet-lab testing of each primer set (or singleplex assay) was conducted, and the resulting raw data were combined in a total of 27 simulated multiplexes as explained before.Similarly, experiments were carried out on combinations of primer sets (or empirical multiplex assays) in a single channel reaction.A group of amplification curves, which can be considered as data points in multidimensional spaces, were generated from a unique interaction between each assay and its specific target.The median of these data points was calculated to represent each group of curves.Furthermore, distances among all the curve medians were used to generate the ADS and MDS of all the possible combinations Fig. 2a-b visually represent the correlation between the in-silico and wet-lab tested assays using ADS and MDS in simulated and empirical multiplexes.Pearson coefficients were reported for both ADS as 0.301, 0.972 and 0.607, and MDS as 0.092, 0.761 and 0.686, for raw curve, normalised curve and fitted parameters, respectively (visual representations of each curve type/parameters are depicted in Fig. 2c, and ADS and MDS for all the curve types/combinations are reported in Supplementary Table 3).
It can be observed that normalised curve correlations scored higher than the rest in both ADS and MDS, showing that simulated and empirical multiplex are correlated if FFI is discarded.It is also important to note that the use of all the five curve parameters worsens the correlation as the bimodal distribution of parameter "e" negatively influences the correlation, as discussed by Miglietta et al., 2022 38 .Moreover, the correlation from singleplex to multiplex might be affected by the fact that the "d" parameter is related to the cycle threshold (Ct) of the amplification curve.Target concentration can be influenced by instrumentation, operator, and experimental errors; therefore, variabilities of Ct can easily mislead the correlation of the five parameters using "d".Moreover, the scope of conducting this correlation is to compare purely sigmoidal shapes, and concentrations of the nucleic acid targets should not affect the distance values of two curves.
In addition, the use of parameter "a" and "b" is redundant as: (i) "a" is related to the FFI, and as shown in the middle plot of Fig. 2a-b, FFI is not relevant to the distance correlation and (ii) all curves present in this dataset were processed with a background removal (baseline correction) and all "b" parameters were levelled to almost zero.
These discoveries on the correlation between simulated and empirical multiplex distances inspired us to seek a more representative feature which would maintain the information of distances during the translation from a singleplex to a multiplex environment.As mentioned before, the parameter "a", "b", "d" and "e" can negatively influence the correlation for both ADS and MDS; therefore, we focus on the "c" parameter.(after data processing), normalised curve (computed based on the FFI) and fitted curve/parameters are presented from left to right.The fitted curve is computed with a 5-parameter Sigmoid function using raw curves.As a result of this, we can obtain both fitted parameters ("a", "b", "c", "d", "e") and fitted curve (predicted fluorescence values corresponding to each cycle from the 5-parameter Sigmoid model with fitted parameters).
The key parameter for curve distances correlation in multiplex assays: the "slope".The previous section reported all the correlation coefficients for ADS and MDS between simulated and empirical multiplexes, in concomitance with different curve representations: raw curves, normalised curves, and fitting parameters.
Both ADS and MDS showed the maximum correlation values when considering normalised curves.Those results, along with our discussion on the fitted parameters in the previous section, indicate that reducing the information contained in the amplification curve is beneficial.This section explores how the "c" parameter preserves distance information from singleplex to multiple environments of each primer set/target reaction.
In the 3-plex validation, each singleplex assay was tested against its specific target (N=9), resulting in 27 different combinations of simulated multiplexes.Moreover, the "c" parameters were fitted and extracted from 27 empirically tested multiplex assays (81 tests).Supplementary Fig. 1 shows the correlation between simulated and empirical ADS and MDS calculated from "c" parameters with correlation coefficients of 0.973 and 0.774, respectively.To further evaluate whether "c" parameter distributions were maintained in the translation to empirical multiplexes, their three distributions (where three is equal to the number of multiplexed targets) from the singleplex reaction were compared with their corresponding distributions in empirical multiplex reactions.As illustrated in Fig. 3a-c, distributions of three different multiplex assays are visualised with their relative mean values represented by the dashed/dotted lines.The figures show the capabilities of the "c" parameter to maintain distance information going from simulation to empirical test.
It can be observed that in most cases, the location of the parameter distribution for each target is maintained.In other situations, the distribution may be shifted from the singleplex events; however, the relative distance relationship of "c" values is kept.Fig. 3a illustrates the "c" parameter distribution of a lowrank ADS/MDS multiplex, showing overlaps for all the three singleplex assays in both simulated and empirical multiplexes.As distances among amplification curve shapes can significantly affect the ACA classifier, reduced performance is expected for multi-target identification.Another distribution trend among multiplex assays is represented in Fig. 3b, where the selected Primer Mix (PM3.01) has a high simulated ADS value (0.117) but low MDS (0.003).Moreover, we reported that the ADS value for distributions in Fig. 3c equals 0.138, which differs only 0.21 from the combination PM3.01.However, PM3.12 has an MDS value of 0.075, representing an increase of 0.072 compared to PM3.01.This highlights the importance of considering minimum distances between "c" parameter distributions of the two closest targets: a small MDS value indicates a less separable group of target clusters, resulting in low ACA accuracies for multi-pathogen identification in a single fluorescent channel reaction.To numerically report how distributions are related in the translation from simulated to empirical multiplexes, we calculated the Rooted Mean Squared Error (RMSE) as follows: where   and   are vectors for distances among targets in singleplex and multiplex, respectively.RMSE values of all the 3-plex combinations range from 0.003 to 0.050, which are negligible considering the range of the "c" parameters.The ADS, MDS and RMSE values for all the 3-plex combinations are reported in Supplementary Table 4.These results emphasise that distances between simulated and empirical multiplex share high similarity across different ranks, ensuring that our scoring system (based on ADS and MDS) is not affected whether in singleplex or multiplex environments.Accuracy of all the possible combinations in 3-plex assays.One of the aims of the Smart-Plexer is to improve the classification of multiplex assays, in our case, related to the ACA method.As demonstrated in the previous section, distances among amplification curves of empirical multiplex assays are similar to those generated in simulated multiplexes.Therefore, leveraging ADS and MDS, simulated multiplexes can be used to rank each combination and find the optimal assays with the largest inter-target distances for the ACA classifier.To further demonstrate that the ADS and MDS are crucial to improving multi-target identification in single well PCR reactions, we assess the classification performance of the ACA method by using 10-fold cross-validation and the k-Nearest Neighbors (k-NN) algorithm.Fig. 4a shows a 3-D graph where both ADS and MDS of the "c" parameters are correlated to the ACA accuracy.Accuracy percentages ranged from 98.63% to 100% for each multiplex.The rainbow plane, which is fitted with linear regression on all the visualised data points, represents the gradient of the classification accuracy, showing an upward trend as ADS and MDS increase, which is consistent with our hypothesis that the ACA classification performs better with larger inter-target distances.Moreover, the plane on the left of Fig. 4a has a grey highlight zone called Vacuumed Area, where data points cannot fall inside as it is mathematically impossible to have an average distance value smaller than the minimum distance.We also defined another area called Forbidden Area, as visualised in the rotated 3-D plot on the right of Fig. 4a, where it is expected that no point will be founded, provided high values for ADS and MDS.  an increase of 1.84% is observed for the top ADS/MDS data point compared to the bottom one.
Furthermore, as depicted in Fig. 4b-e, by applying 2-D t-distributed stochastic neighbor embedding (t-SNE) 42 visualisation on curves generated by the top and bottom-ranked primer combinations, more condensed target clusters and better separated inter-target boundaries can be seen for top-ranked assays.This results in more distinguishable curve shapes and larger curve distances among targets, which benefits the ACA classification.Numerical analysis of the visualised clusters was assessed using the Mean Silhouette Scores (MSS).As reported by Kaufman et al. 2009, Silhouette scores between 0.51-0.70 are considered more effective in cluster separation than values below 0.50 43 .The reported MMS scores show significantly larger inter-cluster distances for the top combinations, with values higher than 0.61 as opposed to the bottom ones of less than 0.27 (in Supplementary Table 4, we also report ADS, MDS, MSS and ACA accuracies for each combination of the 3-plex experiment).This finding proves that the ADS and MDS metrics are valid indicators for predicting optimal primer set combinations for the ACA classifier.Relying on the Smart-Plexer for selecting multiplex assays from singleplexes, the likelihood of accurate multi-target identification in a single fluorescent channel reaction is significantly increased using the ACA methodology.
As mentioned above, Fig. 4a highlights the presence of outlier combinations where small ADS/MDS with high ACA accuracy are reported (instead, low accuracy for the ACA classifier is expected).However, the existence of such data points does not deny the effectiveness of the proposed method.It is important to emphasise that the overall ACA accuracy for 3-plexes is inherently high because of the low levels of multiplexing.Classifying three different curve shapes does not represent a major challenge for this Machine Learning method, and targets with minor curve-shape differences can be easily separated in the feature space.Considering this, along with the prevalent randomness that exists in the ACA method for 3-plex, accuracies higher and lower than expected may occur in the given dataset.In fact, in the area with low ADS/MDS, we can observe a large standard deviation for accuracies among data points which fall beneath and above the fitted plane.Regardless of the accidentally high accuracies and low ADS/MDS caused by randomness, Figure 4f-g evidence that these outlier combinations will face more challenges when used for multi-target identification in larger scale multiplexes (or high-level multiplexing).In the outliers, the mapped target clusters are largely overlapped with unclear boundaries and small MSS even in 3-plex assays.
Therefore, we will demonstrate in the next section that the higher the level of multiplexing is, the more difficult the target separations are in the feature space when using these outliers.
Although low ADS/MDS combinations may occasionally show good performances, the proposed method ensures that all predicted optimal multiplex assays with high ADS/MDS show high accuracies in ACA and never the opposite.As illustrated in the 3-D plots of Figure 4a, the forbidden area (the red triangular prism) has no data point falling in, which highlights the effectiveness of the ADS/MDS ranking system.This is a first ever demonstration that multiplex assays tailored to the ACA method can be in-silico developed starting from singleplex PCR reactions.This not only increases the likelihood of accurate multi-pathogen identification, but also allows for a higher level of multiplexing in a single fluorescent channel.To demonstrate the capabilities of the Smart-Plexer in developing optimal high-level multiplex assays for datadriven approaches, in the following section, we assess its performance with seven different targets.Smart-Plexer for development of 7-plex assays.In the previous section, our focus was on using a small number of targets to demonstrate that the developed ADS and MDS used to correlate distances between curves in both simulated and empirical multiplex assays were maintained.Moreover, accuracies among all the different combinations were evaluated using the ACA methodology, where high ADS/MDS multiplex assays show the highest likelihood of correct multi-target classification.These previous results indicate that the Smart-Plexer is a promising technique for optimal selection of primer set combinations in data-driven multiplexing.
Next, we challenged the Smart-Plexer to develop an optimal 7-plex assay, which through the ACA method, .We designed at least two different assays for each target, for a total of 24 singleplexes across the seven pathogens, as shown in Supplementary Table 5.Each primer set was tested using synthetic DNA of its correspondent pathogenic target.Following the previous 3-plex experimental workflow, the resulting raw curves were processed, fitted, and passed to the Smart-Plexer to calculate all possible 7-plex combinations (N=4608) and compute their ADS/MDS.Based on "c" parameter distances from fitted simulated multiplexes, Figure 5a shows how the ADS and MDS can be visualised in a two-dimensional space.By considering the mean and standard deviation of the two scores, we set up boundaries to the ADS/MDS distribution for all the combinations and divided the space into four separate regions, with the purpose of showing how empirical multiplexes would perform for the ACA method depending on their ADS/MDS.The black horizontal segmented line in Figure 5a divides high and low MDS, and the vertical one separates the two ADS regions, resulting in four distinct areas.By testing different multiplexes from each of these regions, we are aiming to further demonstrate that chance of developing a reliable multiplex can vary based on the selected regions or selection criteria.Therefore, we chose multiplex assays from different areas and categorised them into five classes, which were empirically tested with synthetic DNA in qdPCR: BOT (N=6), MID (N=6), BEST (N=6), TOP-ADS and TOP-MDS (N=6) values (detailed selection criteria are reported in the methodology section).
After the empirical testing, the distances of the "c" parameters of each selected multiplex were compared to the simulated one, resulting in a correlation coefficient of 0.99, as shown in the middle graph of Figure 5b.Moreover, empirical multiplex amplification events were visualised using 3-D t-SNE, and distances across target clusters were calculated with the MSS.As shown in the left plot of Figure 5b, clusters of the selected BOT combination have an MSS of 0.12, whereas for the BEST one the score is 0.67.It can be observed that there is a clear difference in clustering between the two selected multiplex assays, where the BEST one shows clear separation among different targets (in line with the 3-plex results), and is expected to converge in better ACA classification.The opposite scenario is shown in the BOT combination.
We validated that in higher level multiplexing, distance distributions of the "c" parameters were still maintained from simulated to empirical testing; therefore, we computed the RMSE of the chosen tested combinations.Figure 5c-d illustrate side-by-side "c" parameter distributions for each target in both simulated (left) and empirical (right) multiplexes, showing a small RMSE for both BOT and BEST assays (0.012 and 0.031), and confirming the distance-maintaining hypothesis validated in the 3-plex experiments.
Moreover, we evaluate the ACA accuracy using training and testing datasets obtained in different experimental settings (different days, operators, and reagents) to ensure the reproducibility of the methodology.As expected, the performance of the BEST combination was significantly higher than the BOT one, with a 39.42% increase in accuracy.Furthermore, in Supplementary Table 6, we reported the ADS, MDS and accuracy values for the 24 selected multiplex assays.In Supplementary Fig. 2, we also visualised the standard curve for each target using the BEST 7-plex assay to evaluate primer sensitivity and specificity.The chosen multiplex reached a limit of quantification equal to 10 2 for all the respiratory pathogens using synthetic DNA in real-time PCR.
As described before, ACA performances were evaluated using training and testing datasets from different experimental settings with the same sample size.All the selected 24 multiplexes were empirically tested, and their multi-target identification performances were assessed.In Figure 5e, accuracies and standard deviations of each group of multiplexes were reported and visualised as box plots.The best-combination group scored an average (± standard deviation) classification performance of 95% (± 0.04%) using a k-NN classifier, which is the highest average and the lowest standard deviation among all the groups.There is a decreasing trend in the average accuracy, and an increasing trend in the standard deviation as the ADS/MDS values become smaller.Previously, the 3-plex validation showed the presence of outliers in low ADS/MDS rank with high ACA classification accuracy, which is also observed in these 7-plex tests.However, the standard deviation indicates that the Smart-Plexer does provide a robust and solid solution (even at highlevel multiplexing) to significantly increase the likelihood of choosing an optimal multiplex for data-driven multiplexing (i.e.ACA methodology).Clinical validation results.The final step was to validate that the Smart-Plexer is capable of easing the laboratory workload in developing multiplex assays.After testing six potential best combinations based on ADS/MDS, we selected the one with the highest ACA classification accuracy on synthetic DNA (PM7.2151).
To clinically validate the selected 7-plex for multi-pathogen identification, inactivated clinical samples were purchased from Randox Laboratories (UK) and extracted using a gold standard kit (QIAGEN mini amp).The extracted samples were used as the testing dataset (7,638 positive amplification reactions), while curves resulting from synthetic DNA amplification reactions (5,207 positive amplification reactions) were the training.The classifier used was a k-NN with the number of neighbours equal to 10.As shown in Table 1, a total of 14 positive samples were classified in qdPCR using the ACA methodology.The predicated label of a sample is given by selecting the most predictable label within all the in-sample curves.The confidence level was given as the percentage of the amplification curves with the most predicted label.We correctly identified all pathogens using the Smart-Plexer selected candidate assay, where most of the targets were recognised with high confidence (median = 95.46%).
It is important to note that in this study we faced a seven-class classification problem, where the accuracy of a "random guess" (or a random classifier as convention) equals 14.3% under a balanced dataset.All the confidence levels were much higher than the random guess accuracy, indicating solid and robust predictions with the selected optimal multiplex assay.Although the number of clinical samples was limited by the number of pathogens provided by the manufacturer, the proposed framework, in combination with the ACA methodology, achieved a highly accurate identification of multiple pathogens by using an optimal multiplex assay in a single fluorescent channel reaction.The Smart-Plexer can leverage the capability of the datadriven multiplexing to an easy-to-develop, robust, and cost-effective molecular diagnostic solution.

Discussion
In this work, we developed the Smart-Plexer, an innovative framework which combines wet-lab experiments and computational algorithms to generate optimal multiplex assays for data-driven approaches using real-time PCR data.The method leverages mathematical metrics to construct an advanced ranking system to increase the throughput of conventional molecular tests by optimising their chemical peculiarities.
To reveal the potential of this powerful approach, we demonstrated it with a recently reported machine learning method, named Amplification Curve Analysis (ACA), which is capable of identifying multiple nucleic acid targets in a single fluorescent channel with conventional PCR instruments.As the ACA leverages kinetic information encoded in the amplification curve, multiple targets can be classified based on the unique interaction with their assigned primer sets.However, constructing different amplification curve shapes for each multiplexed target is one of the major challenges for the ACA approach.The Smart-Plexer solves this problem by providing an easy-to-use framework for multiplex assay development, enabling high-level and highly accurate data-driven multiplexing.
This study shows the progression of the Smart-Plexer starting from a simple three-target classification problem.From the wet-lab testing of three singleplex assays for each of the three targets, a total of 27 combinations (in our case 3-plex assays) can be generated in silico (simulated multiplex) and ranked based on the mathematical curve-shape distances.Using synthetic DNA in qdPCR and a single fluorescent channel, the assays were empirically tested (empirical multiplex), and the ACA classification accuracies were evaluated for all the possible combinations.The distance scores computed from the Smart-Plexer for multiplex assay ranking are linearly correlated between simulated and empirical multiplexes.Moreover, we showed a further correlation between high-rank multiplexes and a high probability of increasing the ACA accuracies, confirming that the metrics used in this novel framework are theoretically connected to the distance measurement of the machine learning classifier.
As the complexity of developing multiplex assays exponentially increases with the number of targets, we further challenged the Smart-Plexer by designing a 7-plex assay to identify common respiratory tract infection (RTI) pathogens.Consistent with the 3-plex validation, the correlation between simulated and empirical multiplex is also maintained in 7-plex.Regarding the ACA classification, it is logical that higher similarities among curves exist in a scenario with a higher number of targets, making it harder to develop multiplex assays.Nevertheless, the Smart-Plexer brilliantly generated an optimal multiplex assay, which correctly identified pathogens presented in 14 commercial clinical samples.It was further demonstrated that, since ACA is a clustering method, it requires a large minimum distance between the two closest clusters and a large average distance among all clusters in the multiplex.Therefore, the Smart-Plexer ranking system enables the development of optimal multiplex assays for data-driven multiplexing.
Apart from the scalability of multiplexing that the Smart-Plexer can provide to the ACA method, we demonstrated for the first time that machine learning approaches can be applied to probe-based multiplexes, in our case, TaqMan.Probe-based assays, together with the use of intercalating dyes and isothermal chemistries, are expanding the boundaries of data-driven multiplexing and opening new windows for its application in commercial, research and clinical fields.The Smart-Plexer eases the development of any novel multiplex panel or molecular assays, enabling the use of the ACA as an emerging diagnostic tool.
Through this hybrid method, it is possible to select the highest rank combination in silico with wet-lab tested singleplexes, avoiding performing expensive and time-consuming multiplex assay development phases.
While this novel framework is validated with high-level multiplexing (7-plex), it is essential to highlight that distances between amplification curves can be a limiting factor in single fluorescent channel multiplexing.
This affects the Smart-Plexer since the inter-target differences of fitting parameters considered for the distance measurement become smaller as the target number increases.In this work, we use linear distance measurements, but more advanced metrics (e.g.Minkowski, Chebyshev or Cosine) can be adopted to improve the ranking performance.Moreover, when a higher level of multiplexing is required, the use of probe-based chemistries such as TaqMan comes handy.By leveraging the optical capability of real-time PCR instruments, a multiplex assay using multiple-channel detection can double or triple the number of targets in a single reaction.All these strategies aim to improve the ACA classification through a more innovative development from the chemistry perspective, while from the machine learning view, the current classifiers rely on state-of-the-art algorithms which shine for their robustness but are limited for tailoring to specific datasets.We previously demonstrated that more advanced classifiers such as convolutional neural networks (CNN) could extend the ACA capability to classify targets for higher-level multiplex assays.However, as a novel technique, data-driven multiplexing requires more optimisation and development of algorithms.
The Smart-Plexer represents a solution for developing multiplex assays by utilising both empirical testing and in-silico computation.The hybrid nature of this framework still requires wet-lab experiments; therefore, certain limitations exist in terms of staff training and time requirements.However, our future work will focus on the full automation of developing such assays.Novel methodologies to predict amplification curve behaviours will be developed.One example is the brand-new algorithm for designing multiplex PCR primers using Dimer Likelihood Estimation by Xie et al. 2021 21 .Another future aspect of this research is to further increase inter-target curve shape differences.An example would be in probe-based chemistries, where modifying amplification curve shapes can be achieved by changing the concentration levels of the fluorescent probe.In this way, we can enlarge inter-target distances of amplification curves to ease the ACA classification with better clustering performance.All the above-mentioned future works will inspire the use of the ACA method for a broad range of applications and significantly increase its flexibility and scalability.
In this work, we present for the first time a complete pipeline for developing optimal multiplex assays and open the usage of the ACA method to the broad scientific community.We finally unlock the development of novel molecular tests that will not only enable simpler molecular diagnostics but also make them affordable and available to everyone.
Limit-of-quantification. We used real-time PCR from QIAGEN (QIAquanta96) to evaluate the Limit-ofquantification (LoQ) of the selected 7-plex assay.Standard curves were generated with synthetic DNA ranging from 10 7 to 10 1 , apart from SARS-CoV-2 whose concentration was from 10 5 to 10 1 because of limitations due to pandemic suppliers (IDT).PCR data were extracted and processed according to the data processing step.Standard curve plots and statistical values are reported in Supplementary Fig. where [•] is the sign function and  j% is the given threshold value ( j% = 9 in our research).
Five-parametric sigmoidal fitting.Since amplification curves contain information such as background, plateau phase, and slope, we derived the most representative features of it using the sigmoidal equation.
The chosen model in this study for curve fitting is the five-parametric sigmoid function, whose equation is  To reduce optimisation iterations and unsuccessful fitting, we applied a pivot fitting on a subset of data (  ) to evaluate the optimal initial parameters    for the equation before searching on the entire dataset ().
First, we defined a non-linear Least Square function (), whose equation is shown below: To apply the pivot fitting, we first initialised   = [0,0,0,0,0] n .Then, for the  th curve  KL [ within the dataset   , the following optimisation problem was solved to find the fitted parameter vector: where the lower bound   and the upper bound   for all the parameters are -100 and 100, respectively.After all the curves were fitted, the mean vector of all the   was used as the optimal    .
With the outcome from the pivot fitting, we fitted all curves in  starting from    .In addition, to get better fitting performance, we increased the maximum number of fitting iterations (maxfev) to a sufficiently large value (1,000,000 in our case).The same   and   were used for the pivot fitting.
Calculating Average Distance Score (ADS) and Minimum Distance Score (MDS) for multiplex assays.There are four curve representations for calculating ADS and MDS, which are: raw curves (45-D), normalised curves (45-D), fitted parameters (5-D) and  parameter (1-D).Two steps were taken before the score calculation: (i) Extract the median feature vectors of each target for 45-D, 5-D and 1-D feature arrays.The median value was taken on each dimension, and the median feature vector with the same dimension was generated.It is assumed that the distribution of each target is Gaussian.However, outliers can affect the distribution unexpectedly.Therefore, the median value is a more robust representative compared to the average value, and  " median vectors corresponding to  " targets were constructed.(ii) Calculate Euclidean distance between each pair of targets, where given  " targets, the total number of distances  5 is: The vector of distances for each pair of targets is defined as: where  [‚ represents the Euclidean distance between extracted median vectors of target  and target .
With the constructed distance set, the ADS and MDS were calculated as the average and the minimum value of all elements in   , respectively: ACA methodology.The Amplification Curve Analysis (ACA) methodology was developed by Moniri et al. in 2020 30 .For the first time, shapes of amplification curves from real-time PCR data were used for multiple target identification in a single fluorescent channel reaction, utilising data-driven algorithms.The ACA takes the entire amplification time series as input and uses machine learning to classify curves into different categories of targets.This approach highlights the significance of the kinetic information embedded in amplification curves.As previously reported, several classical machine learning methods (e.g.k-NN, Random Forest, Support Vector Machine) as well as deep-learning based approaches (e.g.Convolutional Neural Networks) can be applied to the time series 31 .In this article, a k-NN classifier with 10 neighbours was used for the ACA performance evaluation.
Ranking system.The inputs of the ranking system are simulated ADS and MDS.To increase the likelihood of choosing an optimal assay for data-driven multiplexing approaches, we considered assays with the highest ADS and MDS ( Š‹OEn ) selected from the entire combination set ( end for The proposed Algorithm 1 is used to pick the best simulated multiplexes based on the developed metrics ADS and MDS, and these assays are further tested empirically to select the optimal one for the diagnostic use.Moreover, to verify the correlation of the Smart-Plexer ranking with the ACA performance, the algorithm was used to select the bottom multiplexes with the lowest ADS and MDS, by modifying step 3 and 4, so that the smallest instead of the largest ADS and MDS are applied.
The Smart-Plexer Workflow.The complete workflow of utilising the Smart-Plexer in a real laboratory setting is illustrated in Figure 1 and depicted as follows: given a number of target genes to be identified, several candidate primer sets are first in-silico designed and tested in singleplex format for each target, resulting in real-time PCR amplification curves for all the assays.The obtained data are further processed using the background, late curve, and noisy curve removal techniques mentioned in the Data Processing section.The processed curves are then fitted with the sigmoidal function from which the "c" parameters are extracted.
For each potential combination of primer sets, inter-target distances of "c" parameters from singleplex curves are calculated and function as simulated alternatives for empirical multiplex curve distances.In this way, the best candidates for multiplex assays can be selected by choosing the combinations with the most distant target clusters (represented by "c") in the simulation.This progress is achieved by calculating the "c" parameter-based ADS and MDS of each combination and finding the best ones using the ranking system mentioned above.The best candidate assays shortlisted from simulated multiplexes further go through wetlab tests on synthetic DNA templates, and the ACA-based target identification is applied to the empirical multiplex data.The final winner assay with the highest ACA classification performance on synthetic DNA is labelled as the optimal assay, which is the final output of the entire Smart-Plexer workflow.
3-plex validation.Synthetic DNA of Adenovirus (HAdV), Human coronavirus HKU1 (HCoV-HKU1) and Middle East respiratory syndrome-related coronavirus (MERS-CoV) targets were selected for a 3-plex validation, and all the data were generated in real-time digital PCR (qdPCR).Three primer sets were designed as candidates for each target, resulting in 27 potential combinations of multiplex assays in total.Because of the relatively small number of candidate assays, it is possible to perform wet-lab experiments for all combinations and analyse the relationship between simulated and empirical multiplex curve distances.Simulated ADS and MDS were calculated on different levels of curve representations (raw curves, FFI-normalised curves, and fitted parameters), and their correlations with the same metrics derived from empirical multiplex data were analysed.Furthermore, the ADS and MDS of "c" parameters, which are more concise indicators for intertarget curve distances, were generated and compared between simulated and empirical multiplexes.ACA performance against simulated ADS and MDS was depicted, and the t-SNE of the selected assays' results were illustrated.
7-plex validation.Following the 3-plex validation, seven targets were used to further validate the Smart-Plexer performance, where each target had at least two different assays, resulting in a total of 24 singleplexes and 4,608 candidate combinations.Unlike for 3-plex, the mass number of combinations makes it impossible to empirically test all the assays in multiplex settings.Instead, representative groups of assays

Fig. 1 |
Fig. 1 | Smart-Plexer workflow.a.Given a dataset of singleplex real-time PCR reactions (real-time amplification curves), a processing step is applied (a.i-a.iii).The processed curves are fitted following the equation depicted in step a.iv.An example is given in b., where each curve resulting from singleplex reactions is used in a simulation of multiplex assays.Three targets are considered, and each of them has three unique singleplex assays (a total of 27 simulated combinations).c.The simulated multiplex scores are calculated from the Smart-Plexer according to the Scoring Criteria.d.Distances within curves from different targets are calculated based on mathematical algorithms (such as Euclidean), and as shown in the confusion matrices, resulting values are used to rank multiplex assays from high (high distances within targets) to low (low distances within targets).e. High-rank multiplex assays are chosen for empirical testing, and the ACA method is used to evaluate the classification performance on target identification of each selected multiplex.f.Cluster visualisation with 2-D t-SNE represents the difference in inter-target distances between a High-Rank and a Low-Rank multiplex, resulting in high and low ACA classification accuracy, respectively.Selection of representative amplification curve.The ACA method uses the entire amplification curve as a time series where fluorescence values change as the number of cycles increases.Firstly, we chose the entire raw amplification curve generated from the real-time PCR reaction as the input of the Smart-Plexer.

3
Targets and 3 Assays = 27 Combina$on Wet -Lab tes$ng of 9 singleplex assays using real -$me PCR instruments Singleplex Assay Development: Data Processing & Fi=ng i. Background Removal ii.Late Curve Removal iii.Noisy Curve Filtering iv.Sigmoid FiOng Mul$plex Assay: ACA Classifica@on -Empirical tes$ng of TOP ranked Simulated Mul$plex Assays -ACA valida$on on Synthe$c DNA in a single channel reac$on Selec$on of Simulated Mul$plex: Ranking System Scoring Criteria : large distances among amplifica$on shapes of different targets v Scoring criteria metrics : ADS and MDS v Calcula$on of singleplex scores for all the possible combina$ons   =  (1 +  − − )  +

Fig. 2 |
Fig. 2 | Representative features investigation based on the 3-plex assay.a.The correlations of the Average distance score (ADS) between simulated and empirical multiplexes for the three types of curves/parameters (Raw curve, normalised curve and fitted parameters) are presented (from left to right in the same order).For each plot, each point with unique colour and shape corresponds to combination 1 to 27.The blue dashed lines are computed using linear regression.The Pearson

Fig. 3 |
Fig. 3 | Relative "c" parameter distributions of three different multiplex assays.a. Primer Mix 3.07 (PM3.07)illustrates the "c" parameter distribution of a low-rank ADS/MDS multiplex; b.PM3.01 as an example of high ADS but low MDS multiplexes.c.Multiplex assay with high ADS and MDS with clearly separated distributions.For each subplot, the left graph shows the distributions of "c" parameters for the Simulated Multiplex.The right plot represents the corresponding distributions according to the empirical multiplex data.The vertical dashed lines correspond to the mean of the distribution computed for different targets.To quantitatively verify that the distances are maintained in transition from simulated to empirical multiplexes, the RMSE of distances is calculated and displayed on the graph title.

Fig. 4 |
Fig. 4 | The influence of ADS/MDS on the ACA performance for all possible 3-plex combinations.a. 3-D plot of ACA classificationaccuracy for each combination versus simulated ADS and simulated MDS computed based on the "c" parameter.The rainbow plane is calculated using linear regression.In the left 3-D figure, the grey highlighted area is called Vacuumed Area, where simulated MDS is larger than simulated ADS (combinations in this area are mathematically impossible to be found).The right 3-D figure is a rotation of the left one, where a red is highlighted named Forbidden Area.In this region, high ADS/MDS combinations possess low ACA accuracies; however, no combinations were found.b-g.For the combination circled (TOP, BOT MDS, BOT ADS and OUTLIER) in a., 2-D t-SNE was applied on raw curves.In addition, for quantitative verification, the Mean Silhouette Scores (MSS) of target clusters were reported in the subplot title.

Fig. 5 |
Fig. 5 | Validation of Smart-Plexer based on 7-plex assays.a. 2-D ranking results for all 4608 combinations in 7-plex based on simulated ADS and simulated MDS.The plot is divided into four regions to explore the relationship between ACA performance and ADS/MDS.Combinations from five different classes (BOT, MID, BEST, TOP-ADS, TOP-MDS) were selected for multiplex empirical testing.b.The 2-D plot in the middle depicts the relationship between empirical and simulated scores based on "c" parameters.Enlarged data points for one of the BOT (PM7.1593) and BEST (PM7.2151)combinations are visualised with 3-D t-SNE on raw curves, and the corresponding Silhouette scores are calculated.c-d.Simulated and Empirical "c" distributions of the selected combinations (PM7.1593 and PM7.2151) are plotted (RMSE values in subplot titles).The vertical dashed lines correspond to the mean of the distribution computed for different targets.On the right, the confusion matrixes of ACA performance for both cases are presented, and overall accuracy using k-NN is reported in the title.True labels are on the yaxis and ACA predicted labels are on the x-axis (each target sensitivity is also reported in percentage).e.The box plot of ACA classification accuracy for each selected group.The mean and standard deviation of ACA accuracy on empirical multiplexes are calculated and shown on each box bar.
2. The Absence of amplification signals was detected in Negative Template Control (NTC) Data processing.The processing of raw amplification curves is comprised of three parts.Firstly, to ensure all curves start from approximately zero fluorescence value and to normalise the starting cycles of the curve across the entire time series, the background information was removed, which can be expressed as: KL () = () −  KO%Pwhere  KL () represents a curve with the background removed and () is the raw fluorescence values for each cycle  = 1, 2, ⋯ , .Here  indicates the total number of cycles for each amplification curve (45 in our case), and  KO%P is the average background value.In order to avoid instrumental noise commonly found at the beginning of the PCR reaction, the  KO%P value was estimated as the average value of the first several cycles' fluorescence, excluding the initial ones.In our case, five cycles were considered for the flat phase and the first three cycles were skipped.Secondly, late amplification filtering was applied to select curves that reached the plateau phase.The basic idea is to estimate the cycle threshold value ( V" ) for each curve, which can be represented as: V" =   .. KL () −  Z[\  ZO] −  Z[\ ≥  "_where  ∈ {1, 2, ⋯ , }, and  ZO] and  Z[\ represent maximum and minimum fluorescence values of the entire reaction respectively for each curve. "_ is the fluorescence threshold and curves whose  V" are above the cycle threshold (Ct = 30 as suggested by the manufacturer) were removed.Lastly, a filter was applied to remove non-sigmoidal curves with excessive noisy signals.The sigmoidal trend of a noisy curve may contain certain notches.Based on this feature, we estimated the first derivative of each curve using: KL c () =  KL () −  KL ( − 1),  = 2, ⋯ , The number of zero-crossing points in  de c () is related to the number of notches in the curve.Therefore, noisy curves should have significantly more zero-crossing points in their first derivatives compared with smooth sigmoidal curves.The curves that satisfied the following condition were regarded as noisy and removed: 4%("45) )6 +   = [, , , , ] n where  is the amplification cycle,  is the parameter vector, (, ) is the fluorescence at cycle .The mathematical function of these parameters and their corresponding representations in amplification curves are shown in curves  Amplitude of the function in the yaxis Affect the maximum fluorescence that the amplification curves can reach  Vertical shift of the function along the y-axis Affect the maximum fluorescence together with parameter   Maximum slope of the sigmoid function Related to the efficiency of PCR reactions  Horizontal shift of the function x-axis Fractional cycle of the inflection point (related to Ct values)  Richard's coefficient Asymmetry of the sigmoidal trend

Table 1 ,
6he lowest MSE is achieved with the five-parametric model (MSE=0.0036).The rising MSE in sixparameter sigmoid fitting is caused by unsuccessful optimisation resulting from a larger searching dimension.Based on the lowest MSE value, it is determined to utilise the five-parameter sigmoid function to extract features, and the equation is given below: () =  (1 +  4%("45) )6+

Table 2
below: •ŽŽ ).Provided the number of the best combinations to be selected as  Š‹OEn and the number of total combinations as  % , the following steps were

Table 2 |
five sigmoidal fitted parameters