Machine learning classification reveals robust morphometric biomarker of glial and neuronal arbors

Abstract Neurons and glia are the two main cell classes in the nervous systems of most animals. Although functionally distinct, neurons and glia are both characterized by multiple branching arbors stemming from the cell bodies. Glial processes are generally known to form smaller trees than neuronal dendrites. However, the full extent of morphological differences between neurons and glia in multiple species and brain regions has not yet been characterized, nor is it known whether these cells can be reliably distinguished based on geometric features alone. Here, we show that multiple supervised learning algorithms deployed on a large database of morphological reconstructions can systematically classify neuronal and glial arbors with nearly perfect accuracy and precision. Moreover, we report multiple morphometric properties, both size related and size independent, that differ substantially between these cell types. In particular, we newly identify an individual morphometric measurement, Average Branch Euclidean Length that can robustly separate neurons from glia across multiple animal models, a broad diversity of experimental conditions, and anatomical areas, with the notable exception of the cerebellum. We discuss the practical utility and physiological interpretation of this discovery.


Thus, it remains an open question whether suitable morphometric
biomarkers exist that can robustly and systematically discriminate between neuronal and glial arbors.
Hundreds of laboratories worldwide continuously contribute their digital reconstructions of neurons and glia to the public online database NeuroMorpho.Org (Akram et al., 2018). This repository associates every cell entry with metadata (Bijari et al., 2020) describing the animal subject (species, strain, sex, age, and weight), anatomy (brain region, sub-region, cell type, and sub-type), experimental details (protocol, condition, histology, microscopy, and tracing), and provenance (authors, source publication, original version, and processing logs). Moreover, the detailed 3D representation of arbor geometry is accompanied by a battery of morphometric parameters extracted with L-Measure (Scorcioni et al., 2008), such as total length, number of branches, arbor height, and tortuosity. Glial cells were introduced to NeuroMorpho.Org in version 7.1 (2017) and now constitute 12.1% of over 185,000 tracings. The unrestricted availability of these data provides an unprecedented opportunity for scientific exploration, statistical analysis, and computational modeling .
This study leveraged supervised machine learning algorithms to distinguish between neurons and glia. Supervised learning has been successfully used in neurobiological data analysis such as automatic tracing of neurons and glia (Peng et al., 2017) and their quantification (Bijari et al., 2021) as well as healthcare applications such as automatic diagnoses and treatment planning (Kohli & Arora, 2018). We were interested in examining whether the same algorithms utilized in disease predictions and tracing automation are able to differentiate the two main types of cells of the nervous system using morphological structures and independent of their metadata.

| Dataset selection and preprocessing
The morphological reconstructions of glial processes and neuronal dendrites utilized in this work were contributed to NeuroMorpho.
Org by over 250 independent laboratories and reflect the distribution of published arbor tracings in neuroscience ( Figure 1). These digital tracings of neurons and glia were downloaded from the database using the Summary Reporting web-based functionality (Akram et al., 2022). This tool collates for every digital tracings the annotation of 35 distinct metadata fields, providing a detailed qualitative description of the cell , as well as 21 morphometric measurements which capture the quantitative structural features of the arbor (Scorcioni et al., 2008). In terms of metadata, the dataset spans a broad diversity of experimental methodologies, including over 20 different staining methods (e.g., genetic green fluorescent protein labeling, intracellular biocytin injection, immunostaining, and rapid Golgi), 15 digital reconstruction software (Neurolucida, Imaris, Amira, NeuronJ, Simple Neurite Tracer, Vaa3D, Knossos, NeuTube, Knossos, FarSight, Neuromantic, NeuronStudio, Arbor, Eutectic, and Raveler), and a continuum of ages across development, from embryo through juvenile to old adults. Moreover, both neuronal and glial data came from both mammalian and non-mammalian species and a large variety of anatomical regions (Figure 2). The full breakdown of all metadata categories annotated in NeuroMorpho.Org is provided in Supporting Information.
First, we downloaded all glial cells available at the time we began data analysis (Fall 2020), corresponding to 10 consecutive releases of NeuroMorpho.Org (versions 7.1 to 8.0 inclusive). We then analyzed the distributions of their metadata with respect to animal species, developmental stage, anatomical region, and other experimental Significance Neurons and glia, the two main cell categories in the nervous system, have never been subjected before to multivariate statistical classification using morphological measurements. We demonstrate that traditional machine learning classifiers do a near-perfect job of separating neurons and glia. Furthermore, this study identified a new morphometric biomarker capable of distinguishing these nervous system cells. In addition, we discovered that only five branches of a brain cell can discriminate between glia and neurons without tracing the whole cell for classification.
F I G U R E 1 Representative diversity of morphological reconstructions of glia and neurons from NeuroMorpho.Org with labels indicating animal species, anatomical region, and cell type. Blue: Glial processes; green: Neuronal dendrites; red: Cell bodies.
conditions, and queried the database to identify the same number of neurons with the most similar metadata characteristics. Specifically, we first generated a list of candidate neurons grouped by metadata using the Summary Reporting functionality. To do so, we selected all those metadata entries from the dropdown menus of Summary Reporting which were represented in the glia dataset. For example, under the "Species" menu, we selected all species represented in the glia dataset; under the "Anatomical region" menu, we selected all anatomical regions represented in the glia dataset; and we did the same for all other menus in the Summary Reporting interface. Since we were interested in comparing glial processes specifically to neuronal dendrites (as opposed to axons), only neurons with dendritic tracings available were selected. This process yielded a list of 37,931 neurons, which was a 10:3 excess relative to the number of glial cells. Therefore, we randomly down-sampled this neuron collection with an inclusion probability of 0.3. Moreover, we solely included neurons and glia with complete or moderately complete reconstructions, thus excluding those annotated as incomplete dendrites or incomplete glial processes by the original contributors. The resulting dataset of 22,792 cells comprised 11,398 neurons and 11,394 glia.
The morphometric quantification of neural trees supplied by NeuroMorpho.Org provides a detailed 3D representation of branch F I G U R E 2 Distribution of (a) animal species and (b) brain regions for the analyzed glia and neuron datasets. geometry ( Figure 3). The extracted features include parameters characterizing both the overall arborization size and scale-invariant properties. The formers include total cable length and surface area, spanning height and width, maximum Euclidean and path (geodesic) distance from the root (soma), and average branch diameter among others. The latter measurements capture bush complexity (e.g., number of branches and tree stems), branch angles (local and remote bifurcation amplitude), topological imbalance (partition asymmetry and maximum branch order), and spatial meandering (contraction and fractal dimension), among others. Altogether, this set of morphometric parameters is well suited to characterize the structure of neuronal dendrites and glial processes alike, and thus to quantify their similarities and differences. Of the 21 morphometrics extracted for each cell from NeuroMorpho.Org, we excluded Soma Surface and Depth from the analysis. Soma Surface is not an arbor morphometric, and 4.7% of the tracings in our dataset did not include soma reconstruction. Depth was similarly not reported for 8.6% of the cells as the accuracy of the tracing is reduced in certain cases by light diffraction and tissue shrinkage in the direction perpendicular to the imaging plane. The remaining 19 morphometric features were used in the analysis: number of stems, number of bifurcations, number of branches, overall width, overall height, average diameter, total length, total surface, total volume, maximum Euclidean distance, maximum path distance, maximum branch order, F I G U R E 3 Schematic of selected morphometric features. (a) Illustration of width, depth, and maximum Euclidean distance (left) in a monkey neocortical pyramidal cell (NMO_00002) from the Wearne_Hof archive (Duan et al., 2002); and of height and fragmentation (right) in a hippocampal granule cell (NMO_73103) from the Diaz archive (Sebastián-Serrano et al., 2016). (b) Diameter and local or remote bifurcation amplitude (left) in a rat neocortical microglia (NMO_95641) from the Roysam archive (Megjhani et al., 2015); and maximum path distance, length, and number of branches, bifurcations, and stems in a rat cortical oligodendrocyte (NMO_131081) from the Sato_Bigbee archive (Mohamed et al., 2020). average contraction, total fragmentation, partition asymmetry, average Rall's ratio, average bifurcation angle local, average bifurcation angle remote, and fractal dimension (Table 1). More details about these metrics are also available on the Frequently Asked Questions of NeuroMorpho.Org (http://neuro morpho.org/myfaq.jsp) and on the online manual of L-Measure (http://cng.gmu.edu:8080/Lm/ help/index.htm). Although the above-described parameters are intuitively interpretable, they may not be completely independent of each other. For example, total tree length, surface area, and average branch diameter are expected to be interrelated. Such information redundancy can unduly bias the objective characterization of the structural differences between glia and neurons, complicating subsequent interpretations.
Most reconstructions in NeuroMorpho.Org have coordinates reported in microns. In a subset of reconstructions, however, the coordinates are expressed in pixels. In these cases, the nominal measurements listed in the morphometric tables must be converted by an appropriate scaling factor. Therefore, we manually calculated the height of at least one cell in each archive from the figures of the corresponding original publications (and relative scale bar) and compared the resulting value to the height reported by NeuroMorpho.
Org. If the values did not match, we computed a conversion factor and applied it to size-related morphometric features, including width, height, total length, total surface, total volume, maximum Euclidean distance, and maximum path distance. A similar number of neurons and glia required scaling: 1580 glia (10.1% of glial total) from 10 archives and 1342 neurons (8.6% of neuronal total) from 9 archives.
We followed the exact same procedure for both glia and neurons.

| Dimensionality reduction
We computed the coefficient of determination (R 2 ) to quantify the pairwise correlation (Di Bucchianico, 2008) among the 19 morphometric parameters across neurons and glia using the rcorr function in

Morphometric parameter Description
Surface Average over all bifurcations of the ratio between the absolute difference between the numbers of terminations in the two subtrees and their sum

Fractal_Dim
Average over all branches of the slope of the regression line between the log10 of the path distance and the log10 of the Euclidean distance of each tracing point from the beginning of the branch Pk_classic Average over all bifurcations of ratio between the sum of the diameters of the two daughters, each elevated to 1.5, and the diameter of the parent, elevated to 1.5

Bif_ampl_local
Average over all bifurcations of the angle between the first two daughter compartments Bif_ampl_remote Average over all bifurcations of the angle between the ends of the two daughter branches the R package Hmisc (Harrel & Dupont, 2021). We then used Principal Component Analysis (PCA) to reduce the feature redundancy. PCA transforms the data into a set of new orthogonal variables by identifying the directions (principal components) along which the variation in the data is maximal (Abdi & Williams, 2010). By discarding the least informative components, each sample can be represented by fewer, linearly independent features instead of more, mutually dependent variables (Ringnér, 2008). Thus, PCA reduces the dimensionality while retaining most of the variation in the data. PCA was performed with the R package stats (R Core Team, 2021) by using the function prcomp() and setting scale = TRUE. Along with PCA, all parameters were standardized by first subtracting the mean of the entire feature vector from each element and then dividing by the standard deviation.

| Supervised learning
In supervised machine learning, an algorithm is trained with the known class labels and identifies the most informative combination of features that are associated with those labels (Kotsiantis et al., 2006). The resultant classifiers can then be applied for predicting the labels of unknown data based on their feature values. In this study, the training data consisted of the normalized principal components of the morphometric features as input, and the known class labels (the cell identity as 'neuron' or 'glia') as output. We used three distinct supervised learning algorithms and R (Core Team, 2021) v.4.1.1 for Windows as the computing platform. These classifiers were used to distinguish between neurons and glia, and no subset classifiers were trained and used to distinguish between different metadata categories of these cells. In all cases, we calculated sensitivity, specificity, and accuracy, respectively, defined as the fractions of true positives, true negatives, and correctly classified (true positives plus true negatives) cells, using the caret (Kuhn, 2021) package in R (Core Team, 2021).

K-nearest neighbor (KNN) is a supervised learning algorithm that
can be used both for classification (discrete value output), as applied here, and regression (continuous value output) problems. In KNN, the training instances are stored with their labels and each new instance is compared with the labeled ones using similarity measures.
Each new instance is classified based on labels of the K most similar neighboring instances. For example, if K is set to 5, five nearest neighbors are identified from the training data and the class label with the highest frequency is assigned to the new instance (Aha et al., 1991). The default Euclidean distance was used here to compute similarity between two data points. The built-in caret package (Kuhn, 2021) was utilized for the KNN implementation by using the train() function and setting method = "knn" with tuneLength = 10 and K = 5.

Support vector machine (SVM) is a binary classification algorithm
based on finding the maximum margin hyperplane that gives the greatest separation between the data points of different classes in multidimensional space. Those data points closest to the hyperplane are called the support vectors. If the data are not linearly separable, different kernels can be selected for nonlinear classification. This classifier is robust to large number of variables and small sample sizes (Cortes & Vapnik, 1995). We implemented SVM using the caret package (Kuhn, 2021) using the train() function with tuneLength = 10, and method = "radial" kernel, which is a common choice for classification tasks (Luts et al., 2010).
Random forest (RF) consists of a large number of individual decision trees. Each individual tree in the forest splits out a class prediction and the most frequent class becomes the model prediction. This is one of the most popular machine learning algorithms and is capable of both classification (as used here) and regression (Breiman, 2001;Sarica et al., 2017). We applied the randomForest package (Liaw & Wiener, 2002) using function train(), method = "rf", and ntree = 500.
The rationale for this choice is that a relatively high number of decision trees ensures that every input row is predicted multiple times.
Parameter mtry determines the number of variables randomly sampled as candidates at each split and was set to the default value of 5.

| K-fold cross-validation (K-fold CV)
It is customary in supervised learning to train the model on the majority of the data, leaving the remaining for testing. To rigorously examine the classification performance on our data, we performed k-fold cross-validation. This process divides the dataset into k equal parts. A classifier is first trained on k−1 parts for each fold. The accuracy of the trained model is then assessed by using the part of data excluded from the k−1 parts in training (Bouckaert, 2003). We performed 10-fold CV repeated 10 times using the caret (Kuhn, 2021) package by using the function trainControl(), method = "repeatedcv", number = 10, and repeats = 10.
PSwarm is a global optimization solver for bound and linearly constrained problems (Vaz & Vicente, 2009). This algorithm is based on a pattern search and particle swarm method, which guarantees the convergence to stationary points from arbitrary starting points.
We used the PSwarm Solver (v.1.5, June 2020, www.norg.uminho. pt/aivaz/ pswar m/) implementation in R to find the linear discriminant of neurons and glia based on two morphometric parameters.
We set the lower and upper bounds to 75 and 175, respectively, for intercept and to −10 and 75 for slope, and the number of iterations (maxit) to 2·10 9 . One limitation of PSwarm optimizer is that it can only be implemented on a Windows machine.
All analyses were carried out on a 64-bit machine equipped with

Average branch Euclidean length (ABEL)
is the average over all branches in a cell of the straight-line distance between the beginning and ending points of each branch. This quantity was calculated from three of the morphometric parameters provided for each cell by NeuroMorpho.Org: branch path (geodesic) length, the number of branches, and contraction, which is the ratio between Euclidean branch length and branch path length (its inverse is tortuosity).
Specifically, ABEL was derived by summing the product of contraction by branch path (geodesic) length and then dividing the result by the total number of branches in each cell: where NB is the total number of branches. We also calculated ABEL values were used to compute ABEL within each set, and the average and standard deviation were then computed over the 100 sets.
Finally, classification was carried out using the mean ABEL value.

| RE SULTS
We began our quantitative analysis by inspecting the morphometric features extracted for neurons and glia (Figure 4). The pairwise coefficients of determination (R 2 ) for glial ( Figure 4a) and neuronal ( Figure 4b) morphometrics revealed substantial correlations between specific features. For example, surface is highly correlated to volume, the number of bifurcations to the number of branches, length, fragmentation, and branch order (and the latter four to one another), path distance to Euclidean distance, and contraction to fractal dimension. Although the coefficients of determination tended to be higher in neurons than in glia, most visibly between maximum path distance and total surface area, and between overall height and maximum Euclidean distance, the majority of correlations were highly consistent between the two cell types. In order to remove the interdependency among features, we performed PCA jointly on the full dataset to orthogonalize the morphometric parameters ( Figure 4c). The first 11 principal components captured 95.70% of the variance and we thus decided to exclude the last eight components from machine learning. The 11 principal components considered in subsequent analysis constitute a combined transformation of all 19 morphometric parameters described above but are guaranteed by PCA to be independent.
The first two components (PC1 and PC2) alone capture more than 50% of the overall morphological variance in neural cells. cases, we performed 10-fold cross-validation: the dataset was randomly split into 10 folds without replacement, with 90% of the data used to train the classifier and the remaining 10% used for testing.
The process was repeated 10 times for more reliable assessment.
The total runtime for 10 repeats of 10-fold cross-validation was 15 min for KNN with 5 nearest neighbors (k = 5), 2.2 h for SVM, and 5 h for RF. All three classifiers performed remarkably well in sepa- The optimal ABEL threshold of 14.33 μm results in overall classification accuracy of 97.6%, with fewer than 2.4% of glia and 2.5% of neurons misclassified. Notably, the misclassification rate dropped steeply around the threshold ABEL value, with >90% of the misclassified cells found in a narrow ABEL range of 7 (12-19) μm (Figure 8a).
These results were robust across multiple species (such as mouse, rat, drosophila melanogaster, rabbit, and monkey), strains (e.g., The optimal linear boundary, which is the line best separating neurons and glia in two-dimensional space, followed the equation Multiple studies reported that certain neuron types have longer terminal branches than internal (bifurcating) branches (Duan et al., 2002;Kawaguchi et al., 2006;Li et al., 2005)  of neurons). When considering male and female animals separately, ABEL was found to be equally capable of separating neurons and glia (accuracy for males: 98.1%, and for females: 98.3%). For both neurons and glia, ABEL was slightly but significantly smaller in females than in males (female/male ABEL ratio: 0.74 for glia, 0.84 for neurons). Accordingly, the ABEL threshold to discriminate between glia and neurons was smaller for females (12.18 μm) than for males (15.21 μm). These differences between sexes were small compared to the main effect between cell types, where ABEL is significantly smaller for glia than for neurons (glia/neuron ratio: 0.13 for males, 0.12 for females). The complete ABEL results by sex and cell types are reported in the Supporting Information.

Sprague
Next, we tested the robustness of ABEL as a morphological bio- The ABEL classification accuracy for this new dataset was the same at 97.6% (using the unaltered 14.33 μm threshold). We then tried to assess whether the few outliers were due to systematic patterns or random noise. Classification accuracy was largely consistent across almost all of the metadata investigated, with only notable exceptions when portioned by brain region (Figure 10a). Specifically, the high misclassification rate in the cerebellum prompted a deeper evalua- Here, we found that the single culprit was a specific subtype of invertebrate sensory neuron: dendritic arborization (da) Class III cells

| DISCUSS ION
Open sharing of digitally reconstructed neuronal morphology from labs across the world has made it possible for researchers to carry out statistical analysis, classification, and computational modeling of their interest (Bota & Swanson, 2007;Halavi et al., 2012;. Far fewer morphological classification studies have also included glia, and they typically did not focus on directly com- Our results showed that supervised learning methods are able to distinguish neurons from glia with exceptionally high (>99%) accuracy.
We, thus, set out to determine which specific differences could explain such striking separation. While neurons were confirmed to have larger arbors than glia, we also discovered that glial trees tend to bifurcate more than neurons, and that glial branches are slightly more tortuous than their neuronal counterparts. These morphometric features have already proved useful in the separate investigation of neurons (Kawaguchi et al., 2006;Polavaram et al., 2014), and glia (Khakh & Deneen, 2019;Verkhratsky et al., 2019), but to our best knowledge never in their comparison. Combining these measurements, we de- In situ hybridization of the corresponding genes is useful to study the somatic distribution of these neurons and glia but does not label their dendrites and processes. Immunostaining can, in some cases, visualize cell-type-specific neural arbors, and multicolor combinations of antibodies may allow co-labeling of distinct cell types in the same preparation. In contrast, relatively simpler but non-selective staining such as Golgi (Ghosh, 2020) impregnates a broad spectrum of neurons and glia. In these cases, ABEL can provide a practical way to quickly recognize neurons from glia.
It is important to note in this regard that measuring ABEL does not necessarily require the detailed tracing of the full arbors. Euclidean length is simply defined as the straight-line distance between the start and end points of a branch, which can be computed directly from the microscopic image in any common software. Moreover, we showed that as few as five branches are sufficient to provide an ABEL approximation that distinguishes glia from neurons with >95% accuracy. Even for complex arbors with hundreds of bifurcations and terminations, it is thus possible to estimate ABEL with minimum effort.
Besides the practical utility, it is tempting to speculate about the possible scientific interpretation of our main finding. The systematically small ABEL values of glia suggest a tendency to optimize spatial occupancy, consistent with extensively reported tiling properties for these cells (Barber et al., 2021;Pogodalla et al., 2022). In con- Aside from the sparse exceptions, the robustness of the results reported in this study is underscored by the very large dataset, distributed provenance of the reconstructions, and broad diversity of metadata. At the same time, it is also essential to recognize that this study is intrinsically limited by data availability. For example, although the included species span primates, rodents, fish, and invertebrates, the majority of reconstructions for both neurons and glia come from rats and mice. Furthermore, while many anatomical regions are represented in the study, the list is far from complete. And albeit several classes of glia and of neurons were analyzed, their distribution was far from uniform.
These factors reflect the state of the research in this field rather than a flawed analysis design. Nevertheless, the conclusions must be considered tentative until further validated as more data continue to accumulate.
This work also illustrates the usefulness of subjecting very large datasets to exploratory analysis via machine learning, followed by a targeted investigation of the most promising phenomena or patterns revealed. This "breadth-then-depth" approach may help shed light on otherwise elusive mechanisms. In particular, tracing glial morphology has become progressively more common and, thanks to increased awareness of data sharing, ever larger amounts of glial reconstructions are being deposited to NeuroMorpho.Org.
This increment in data availability in a public repository opens new doors for scientific discovery, especially when applying different analysis and modeling techniques for glia that have been productively applied to neurons since the early days of computational neuroscience.

ACK N OWLED G M ENTS
This work was supported, in part, by NIH grants R01NS39600, U01MH114829, and R01NS086082 to GAA, and R01EY029715 to QW. We are thankful to the NeuroMorpho.Org team and all data providers whose generous resource sharing made this study possible.

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.

PE E R R E V I E W
The peer review history for this article is available at https://publo ns.com/publo n/10.1002/jnr.25131.

DATA AVA I L A B I L I T Y S TAT E M E N T
All morphological reconstructions utilized in this work are available at NeuroMorpho.Org. The individual archive listing, the metadata,

and the scaling adjustments are all included in the Supplementary
Materials. The source code for all analysis is publicly available on github.com.