Deriving reference values for nerve conduction studies from existing data using mixture model clustering

OBJECTIVE
to obtain locally valid reference values (RVs) from existing nerve conduction study (NCS) data.


METHODS
we used age, sex, height and limb temperature-based mixture model clustering (MMC) to identify normal and abnormal measurements on NCS data from two university hospitals. We compared MMC-derived RVs to published data; examined the effect of using different variables; validated MMC-derived RVs using independent data from 26 healthy control subjects and investigated their clinical applicability for the diagnosis of polyneuropathy.


RESULTS
MMC-derived RVs were similar to published RVs. Clustering can be achieved using only sex and age as variables. MMC is likely to yield reliable results with fewer abnormal than normal measurements and when the total number of measurements is at least 300. Measurements from healthy controls fell within the 95% MMC-derived prediction interval in 97.4% of cases.


CONCLUSIONS
MMC can be used to obtain RVs from existing data, providing a locally valid, accurate reflection of the (ab)normality of an NCS result.


SIGNIFICANCE
MMC can be used to generate locally valid RVs for any test for which sufficient data are available.1.


Introduction
Reference values (RVs) are of major importance for proper interpretation of clinical neurophysiological measurements, especially nerve conduction studies (NCS). Considerable debate remains about the use of RVs available in the literature and RVs collected in individual clinical neurophysiology departments. Publicly available RVs for NCS are rare: a recent systematic review only found one set of RVs of sufficient quality for a small number of measurements . In addition, universal application of publicly available RVs is hampered by the fact that they are likely to differ between clinical neurophysiology departments (Litchy et al., 2014;Brown et al., 2017), as they are influenced by factors such as electrode placement and size, filter settings and temperature. Therefore, several guidelines recommend that clinics develop their own RVs (Dillingham et al., 2016;Stålberg et al., 2019). However, as RVs are influenced by patient-related factors including age, body height and sex (Salerno et al., 1998;Dinesh Kumbhare et al., 2015), such an approach requires an exhaustive battery of NCSs on a large number of healthy subjects. To our knowledge, very few clinical neurophysiology departments actually create locally valid RVs.
Of note, a large number of measurements taken during routine clinical practice are likely to be normal and could be used to develop RVs, if they can be identified as such. Here, we assumed that any large enough set of NCS measurements consists of a mixture of two populations: a population of ''normal" measurements belonging to healthy nerves and a population of ''abnormal" measurements. These populations can be separated by a statistical technique called mixture model clustering (MMC). Here we used supervised MMC with statistical models, which combines clustering and linear regression models. MMC separates the data and creates a model for both populations that is based on factors that are known to affect the measured value, such as age, sex, height and temperature. The use of linear regression models in MMC has three advantages: 1) it improves clustering precision, 2) the model representing ''normal" measurements can be used to calculate individual RVs (or ''MMC predicted values") that are corrected for all relevant factors affecting the measurement and 3) the actual measured value can be compared to the MMC-predicted value based on these specific characteristics, which allows for the calculation of a z-score giving a precise quantification of the (ab)normality of a particular test result.

Patient data acquisition
All NCS were performed at Leiden University Medical Center (LUMC) and the Amsterdam University Medical Center (AUMC), location AMC. Both are tertiary referral centers for neuromuscular diseases. Details on stimulus and filter settings and electrode placement are provided as supplementary data (Appendix A). At the LUMC, all NCS and patient data from 1/1/2011-2/31//2017 were retrieved from reports. At the AUMC, all NCS data from 7/23/2008-11/18/2019 were retrieved directly from the EMG software using a custom-made script and anonymized. Recorded patient data included: patient age, sex, height and limb temperature. For motor conduction studies, we extracted distal motor latency (DML), negative peak compound muscle action potential (CMAP) amplitude, negative peak CMAP area, negative peak CMAP duration, nerve conduction velocity (NCV) of the most distal nerve segment in the arm and leg and minimal F-wave latency. In addition, we recorded the reason for referral and the conclusion from the report from LUMC data. At the AUMC, patient height was routinely recorded, but at the LUMC, this was usually only done when F-waves of leg nerves were recorded, as these were corrected for height (Buschbacher, 1999c(Buschbacher, , 1999b. According to Dutch law, formal approval by a local ethics committee and individual informed consent is not required for retrospective research on anonymized data gathered exclusively for patient care. We focused on a set of commonly used motor nerve conduction measurements, for which widely accepted RVs have been published: DML, CMAP amplitude, CMAP area, CMAP duration, NCV and minimal F-wave latency of median, ulnar, tibial and peroneal nerves (Dinesh Kumbhare et al., 2015). Results for the peroneal nerve will be shown throughout the manuscript as an example.

Log transformation of NCS data
Prior to clustering analysis, CMAP amplitude and area data were transformed to a logarithmic scale, as histograms were not consistent with two normal distributions (Fig. 1A). Logarithmic transformation is not possible on measurements with a value of 0. To determine the effects of leaving out these measurements, we performed the clustering analysis twice on log-transformed LUMC data: once after removing all measurements with value 0, and once after replacing 0 with 0.01.

Mixture model clustering (MMC)
A mixture model is a probabilistic model to determine the presence of a predefined number of clusters. MMC has previously been used successfully to classify Alzheimer Disease patients based on cerebrospinal fluid biomarkers (De Meyer et al., 2010;Toledo et al., 2015). Model-based mixture models allow for the simultaneous probabilistic classification into clusters and estimation of regression models in each cluster. Here, we used the flexmix package (Leisch, 2004;Grün and Leisch, 2008) for the programming language R (R Core Team, 2019), using a linear iterative expectation-maximization model (Dempster et al., 1977). In the expectation step, cluster memberships are estimated using regression models resulting in each measurement having a cluster-specific probability. In a subsequent maximization step, unknown model parameters (intercept, slope, and variance) for each of the clusters are estimated by maximizing the clusterspecific log-likelihood and using the probabilities calculated in the expectation step as weights. This process is iterated, until the values converge (the video in Appendix B provides an animation visualizing a simulation of expectation and maximization steps performed during mixture model clustering). The result of the iterative process is a matrix where each column represents the number of clusters and the rows represent each of the measurements. This matrix specifies the probability that the measurement of a subject belongs to each of the clusters. As the predefined number of clusters was two, a probability of more than fifty percent assigned the measurement to that cluster.
Based on the literature and availability of data, we used the following variables as independent variables (Buschbacher, 1999b(Buschbacher, , 1999c: for measurements from the AUMC and measurements on leg nerves from the LUMC, we used sex, age, body height, temperature, interaction between sex and age and interaction between sex and body height. For measurements of arm nerves from the LUMC, we used sex, age, temperature and the interaction between sex and age, because height data were missing relatively frequently from these measurements.
After clustering, we designated clusters with higher values for CMAP amplitude, CMAP area and NCV and lower values for DML, CMAP duration and minimal F-wave latency as the normal cluster.

Comparison of MMC results with published RVs
For comparison with published RVs, we stratified normal clusters according to the sex, height, and age categories used by Buschbacher (Dinesh Kumbhare et al., 2015). We then calculated mean, third percentile, standard deviation (SD) and mean ± 2SD (MN2SD) cut-off values of the normal cluster. Groups with fewer than 10 measurements were excluded from analysis.

Exploring robustness of MMC
We explored the effects of using multiple combinations of sex, age, height, and temperature and all possible interactions between these parameters in the clustering step of MMC for peroneal nerve CMAP amplitude data using AUMC data. Several linear models were tested. The most complicated model had main effect for sex, age, body height and temperature and interaction affect for sex and age and sex and body height. For each combination, we calculated the mean, SD, adjusted R 2 of the normal cluster and the percentage of abnormal measurements.
Next, we determined the robustness of MMC by adding increasing number of simulated abnormal data to existing peroneal nerve CMAP amplitude data from the LUMC. Simulated data were gener-ated by creating a linear mixed model of the abnormal values and generating and adding values by running a simulation of this model using the simstudy package (Goldfeld, 2019), increasing the abnormal population two-and fourfold and calculating the effect on size, average, and standard deviation of the normal cluster.
Finally, we aimed to explore the effect of sample size on reliability of the obtained RV by performing a second simulation of 1500 normal and abnormal CMAP amplitude measurements of the peroneal nerve in a 4:1 ratio. We then calculated the MN2SD cut-off value at the lower end of the normal cluster. By performing this simulation 30 times, we calculated the mean and the cut-off value for a sample size of 1500 measurements. We then iteratively reduced the number of measurements by 100 and recalculated mean and SD of the MN2SD each time.

External validation
The AUMC possesses NCS RVs, derived from 26 healthy individuals as described previously (Verhamme et al., 2009). These measurements are not part of the larger database used for MMC. We used this independent control dataset to validate MMC-derived RVs. For this aim, MMC was used to generate statistical models for each measurement. We modelled the interaction between sex and age and between sex and body height. MMC-based predictions with a 90% prediction interval were made of each measurement and compared with the actual measured values. Because measurements are expected to only have a lower or upper bound (i.e. all data are expected to be abnormal in only one direction: lower than normal for CMAP amplitudes, CMAP areas and NCVs, and higher than normal for CMAP duration, DMLs, minimal F-wave latency), 95% of measured values from healthy controls are expected to fall in the MMC-predicted range.

Data availability
Data are available in a public, open access repository. We have placed all R code and a test dataset online as open source data (gitlab.com/lumc/clinicalneurophysiology/ReferenceValues).

Demographics
Baseline data can be found in Table 1. In general, age and sex data were complete, temperature data were missing in 30% (LUMC) and 45% (AUMC). Height data was missing in 49% of NCSs at the LUMC.

Clustering and creating RVs
For all measurements, MMC resulted in two clear clusters (Fig. 1). In all cases, the normal clusters contained the highest number of values, showed a normal distribution and had a higher adjusted R 2 , suggesting that these were indeed representative of ''normal" measurements. In the normal cluster, significant effects of age, height, sex and temperature were present as expected from literature: e.g., CMAP decreased with age (data not shown).
There was a small mean difference of 0.6 mV between distributions of the normal clusters of peroneal nerve CMAP amplitudes of the LUMC and Amsterdam UMC (p < 0.001; Kolgomorov-Smirnov test) (Fig. 2).
After stratification according to the age and height classes used by Buschbacher (Dinesh Kumbhare et al., 2015), MMC-derived data from both hospitals were similar (Tables 2 and 3). Of note, MMC derived mean and cut-off value (MN2SD) for CMAP amplitudes were generally lower than those published by Buschbacher. MMC-derived 3rd percentile and MN2SD values for CMAP amplitudes were closer to each other than those published by Buschbacher. Replacing measurements with value 0 with 0.01 prior to log transformation did not affect the number, mean and SD of the nor-mal cluster as all low measurements were grouped in the abnormal cluster (data not shown).
Appendix C provides estimates of the independent variables and other statistical results of the linear models for the normal clusters after MMC.

Effect of different variables and changing cluster sizes
MMC of the peroneal nerve CMAP using only sex and age as variables resulted in two clear clusters in the AUMC data (Fig. 3). The addition of temperature and height as independent variables had limited effects on size, mean and fit of the normal cluster (e.g. adjusted R 2 was 0.17 using main effect for sex and age and interaction effect for sex and age, and 0.18 when data were clustered using main effect for sex, age, body height and temperature and interaction effect for sex and age and sex and body height).
We explored the effect of the number of abnormal measurements on the normal cluster by performing MMC on the peroneal nerve CMAP amplitude after adding simulated abnormal data to LUMC data (Fig. 4). Doubling the number of abnormal measurements had limited effects on size, mean and MN2SD of the normal cluster: MN2SD was 1.2 mV using the actual data set, 1.2 mV after doubling the abnormal cluster. However, increasing the abnormal cluster fourfold (resulting in clusters of approximately equal size) led to a change in the MN2SD of the normal cluster to 0.9 mV.
Finally, we explored the effect of sample size on reliability of the obtained value by iteratively reducing the number of measurements in a simulation from 1500 measurement to 200 measure- ments (Fig. 5). In this analysis, the mean cut-off was 1.3 mV and the SD was less than 0.1 mV. As expected, the SD of the obtained cut-off value gradually increases as the number of measurements used for MMC decreases, with a further increase in SD when the total number of measurements was lower than 300.

External validation
We calculated MMC-derived predicted values and their 95% prediction intervals for 21 measurements of an external, previously published set of 26 healthy subjects from the Amsterdam UMC (Verhamme et al., 2009). Of all measurements, 97.4% fell within the MMC derived 95% prediction interval.

Discussion
Here, we show that normal measurements can be extracted from existing NCSs using MMC. The derived cluster of normal measurements can be used to calculate a department-specific predicted value and z-score based on age, height, sex and temperature. This provides a more precise, individual estimate of the (ab)normality of a test result than the use of published RVs, because: 1) the estimation is based on very large data sets, 2) MMC-derived values incorporate all relevant patient characteristics and 3) values have been obtained using local equipment, settings and patient populations. In addition, this approach allows the user to attach a z-score to each measurement relative to the expected value of that patient, rather than merely stating that a measurement is within normal limits or not, as is currently common practice in most clinical neurophysiology departments.

External validation of MMC
MMC can be used in the absence of a formal gold standard, but we used several methods to show that the obtained values are indeed representative of normal measurements. First, MMC resulted in two clear clusters for each measurement, with the normal cluster following all expectations with regard to size, normal distribution and fit of the model. Second, MMC performed on two large independent datasets from different hospitals yielded values similar to each other and previously published RVs. Third, 97.4% of all measurements from an independent dataset of healthy controls fell within MMC-derived expected values, close to the expected value of 95%.

Comparison with published RVs
MMC-derived RVs were highly similar to previously published values (Dinesh Kumbhare et al., 2015). However, MMC-derived mean values for CMAP amplitudes and areas, while highly similar between LUMC and AUMC, were generally lower than those obtained by Buschbacher. This is probably because we calculated these values on the logarithmic scale, e.g., the mean tibial nerve CMAP amplitude would have been 7.2 mV instead of 6.3 mV when Table 2 Buschbacher's RVs and MMC-derived RVs from the Amsterdam UMC and LUMC for arm nerves. Bb: Buschbacher, SD; standard deviation, DML: distal motor latency, NCV: nerve conduction velocity, AUMC: Amsterdam University Medical Center. LUMC: Leiden University Medical Center, N: number of measurements. Amplitude data expressed in mV, Area in mVÁms, DML, duration and F-wave latency in ms. Details with regards to settings and recording sites can be found in Appendix A. Mean ± 2SD and 3rd percentile columns: mean-2SD and 3rd percentile for CMAP amplitudes, CMAP areas and NCVs, mean + 2SD and the 97th percentile for CMAP duration, DML and minimal F-wave latency. Empty fields: insufficient data for MMC.

Mean
Mean ± 2SD 3rd percentile N calculated on the natural scale. However, mean and SD should not be derived from skewed data (Fig. 1). This was previously noted by Buschbacher (Dinesh Kumbhare et al., 2015), although it appears that his published RVs were not based on log-transformed data. This is supported by the observation that the MN2SD and 3rd percentile values were highly similar in MMC-derived data, but not in Buschbacher's. In addition, differences may have been caused by differences in population, methods of data acquisition, such as filter settings, electrode size and electrode placement. Although equipment, electrode placement and filter settings are similar in our centers we observed small but significant differences between the datasets from both hospitals. We did not formally test whether all measurements differed significantly, but this is likely the case, underlining the importance of generating locally valid RVs (Tjon-A -Tsien et al., 1996;Van Dijk et al., 1999) We stratified data to create RV tables in accordance with groups used by Buschbacher to enable a comparison. However, we recommend using individual predicted values as this will yield more precise results than stratified data.

Comparison with other methods
Several other methods to derive RVs from existing data have been published. The Extrapolated Norms (E-Norms) procedure is based on the assumption that a cumulative distribution function (CDF) of values derived from biological measurements, will show an S-shaped curve, with a middle, linear segment representing the values of normal measurements (Jabre et al., 2015). RVs are derived by drawing a straight line through this middle segment and visually determining where it diverges from the S-curve. This method has recently been refined using the extrapolated reference method (E-Ref), which calculates the inflection point by measuring the angle between adjacent points on the CDF plot (Nandedkar et al., 2018).
A drawback of both methods is that they require a relatively homogeneous data set as they do not allow correction for patient-related factors within the model (Stålberg et al., 2019). Multiple models are therefore needed based on different subsets of factors (e.g. age, height, sex), meaning that very large numbers of patients are required to obtain accurate RVs. Whereas MMC will result in a precise, tailored RV for each individual measurement, Enorms derived RVs are based on stratified group data, where large strata will lead to less precise results and smaller strata will require increasing large data sets. Furthermore, additional potentially relevant variables (e.g., height or body weight) can easily be added to MMC to increase precision. Finally, MMC enables the calculation of a z-score (in other words, a continuous value on the degree of abnormality), whereas E-norms will only yield a dichotomous outcome (i.e. normal or abnormal) that cannot be used for further statistical analysis. Table 3 Buschbacher's RVs and MMC-derived RVs from the Amsterdam UMC and LUMC for leg nerves. Bb: Buschbacher, SD; standard deviation, DML: distal motor latency, NCV: nerve conduction velocity, AUMC: Amsterdam University Medical Center. LUMC: Leiden University Medical Center, N: number of measurements. Amplitude data expressed in mV, Area in mVÁms, DML, duration and F-wave latency in ms. Details with regards to settings and recording sites can be found in Appendix A. Mean ± 2SD and 3rd percentile columns: mean-2SD and 3rd percentile for CMAP amplitudes, CMAP areas and NCVs, mean + 2SD and the 97th percentile for CMAP duration, DML and minimal F-wave latency. Empty field: insufficient data for MMC.

Robustness of MMC
We show that clustering can be achieved with MMC for many measurements using only age and sex as independent variables. This is an advantage for common use, as these data will always be available. Nonetheless, we recommend including temperature and height data when performing MMC if possible. These variables have significant effects in the obtained regression model, and models including these variables will therefore yield more precise RVs.
Doubling the size of the abnormal cluster with simulated values based on existing peroneal nerve CMAP amplitude data had a limited effect on the final cut-off value of the normal value, but a further simulated increase of the abnormal cluster appeared to affect the SD and 3rd percentile cut-off of the normal cluster. Although this simulation is useful to show the limits of MMC, it should be noted that the abnormal cluster was far smaller for every actual measurement on which we performed MMC. An additional simulation reducing the amount of measurements showed that the relia-   bility of MMC decreases noticeably when fewer than 300 measurements are used. These data are in line with the observation that a minimum of 300 measurements appears to be necessary to obtain relatively consistent results.

Limitations
Height and temperature data were frequently missing. In addition, precision of MMC can probably be improved by adding additional independent variables. Ethnicity (Shivji et al., 2019) and Body mass index (Salerno et al., 1998;Buschbacher, 1999aBuschbacher, , 1999d have been reported to affect NCS values. We found indications that the side of the measurement affected modelling for NCS measurements of the arms (data not shown). Indeed, NCS values may differ between the dominant and non-dominant arm (Werner and Franzblau, 1996;Kommalage and Gunawardena, 2013). As we did not record which limb was the dominant side, we decided not to use side as a factor in the current analysis.
MMC assumes that both clusters have a normal distribution, which was usually not the case for the abnormal cluster. We do not expect that this affected the properties of the normal clusters, as these always showed a normal distribution.

Conclusions
We provide a method that enables all clinical neurophysiology departments to generate locally valid RVs. To facilitate adoption of this method we have placed all R code and a test dataset online (gitlab.com/lumc/clinicalneurophysiology/ReferenceValues).
Future studies comparing data sets from more neurophysiology departments could shed light on the effects of local variations in NCS practice and allow the creation of RVs for NCS measurements for which reference data currently do not exist. MMC may also be useful to obtain optimal cut-off data for entrapment neuropathies, jitter values from single fiber EMG or NCS data from children (Pitt and Jabre, 2018). In addition, MMC is likely to be useful for any kind of neurophysiological data, including measurements such as nerve cross sectional areas obtained with ultrasound, quantitative EEG parameters and evoked potentials. Simulation of 1500 normal and abnormal peroneal nerve Compound Muscle Action Potential (CMAP) amplitude measurements in a 4:1 ratio. The Y-axis shows the mean-2SD cut-off and standard deviation (SD) based on 30 simulations. Reducing the number of measurements in steps of 100 shows that the SD increases with fewer measurements, and becomes noticeably larger when the total number measurements falls below 300.