Quantitative neuronal morphometry by supervised and unsupervised learning

Summary We present a protocol to characterize the morphological properties of individual neurons reconstructed from microscopic imaging. We first describe a simple procedure to extract relevant morphological features from digital tracings of neural arbors. Then, we provide detailed steps on classification, clustering, and statistical analysis of the traced cells based on morphological features. We illustrate the pipeline design using specific examples from zebrafish anatomy. Our approach can be readily applied and generalized to the characterization of axonal, dendritic, or glial geometry. For complete context and scientific motivation for the studies and datasets used here, refer to Valera et al. (2021).


Installation of analysis tools and downloading datasets and custom codes
Timing: 1 h 1. Install L-Measure v5.3 or v5.2 software (depending on the available version for your operating system). a. Download the version relevant to your platform. More information in this regard is available on the L-Measure website (see key resources table for the research resource identifier (Bandrowski et al., 2016)). 2. Install Python 3.8 on your computer system (see key resources table for link) or on a virtual environment (see Note below). a. Install the following packages on your installed python: i. matplotlib version 3.3.4 ii. NumPy version 1.20.1 iii. Pandas version 1.2.2 iv. scikit-learn version 0.24.1 v. SciPy version 1.6.0 b. Alternative 1: After activating your virtual environment from the 'scripts' directory, use 'pip install -r requirements.txt' to automatically install all required packages. c. Alternative 2: We have provided '.yaml' file located in the 'scripts' directory to install both python environment and the required packages. To do so, open your 'conda' terminal and run 'conda env create -name star-protocol -f star-protocol.yaml'. This will create the python environment named 'star-protocol' along with the right packages. For more information about 'conda' and its installation, please visit: https://conda.io/projects/conda/en/latest/ user-guide/index.html Note: After installing Python, all scripts can be run either via command line using 'python [name-of-script].py [input parameters]' or through any preferred python environment interface providing optimized utilities for code modifications.
3. Download from the GitLab repository all python scripts and representative data pertaining to this protocol and associated use case scenario. a. URL: https://gitlab.orc.gmu.edu/kbijari/zebrafish-analysis-protocol 4. For simplicity purposes, all reconstruction files used in the examples illustrated here are also available in this GitLab repository. The reconstruction files from this and thousands of additional datasets can also be obtained from NeuroMorpho.Org (Ascoli et al., 2007).

STEP-BY-STEP METHOD DETAILS
We begin by illustrating how to extract quantitative morphological attributes from digital reconstructions of neurons represented in the standard SWC file format (Nanda et al., 2018). Then, we demonstrate the procedure to run unsupervised, supervised, and statistical analysis on these attributes.

Data quantification
Timing: 2 h These steps describe how to process and quantify neural morphologies stored in the SWC file format (Figure 1). These are most typically representing neurons and glia (Abdellah et (Wright et al., 2013). The output is a set of quantitative morphological attributes for every neuron, which could be used for supervised, unsupervised, or any other quantitative analysis.
1. Run L-Measure in your operating system. a. In Linux you need to run java -jar Lm.jar in the command line from the folder where the L-Measure executable scripts are located. b. Check the troubleshooting section of the L-Measure website if you run into errors. c. Figure 2 shows the main view of L-Measure software being executed and related sample output. 2. In the 'function' tab select the morphological attributes that you wish to extract from the SWC files. The selected morphological attributes utilized here (referred to henceforth as ''core metric functions'') are: 'N_bifs', 'N_branch', 'Width', 'Height', 'Depth', 'Length', 'EucDistance', 'PathDistance', 'Branch_Order', 'Contraction', 'Fragmentation', 'Partition_asymmetry', 'Bif_ampl_local', 'Bif_ampl_remote', 'Fractal_Dim' (see also Note below).
Note: The 'included' and 'excluded' points in step 6 are not user-defined, but rather automatically determined by L-Measure and displayed for information only. Specifically, these values refer to the number of rows in the SWC file used to calculate the desired morphometric. For instance, the functions measuring the bifurcation angles (Bif_ampl_local and Bif_ampl_remote) only operate on the arbor bifurcation points, and not on the stems, continuations, and terminations. Thus, the included points would be in this example the counts of bifurcations and the excluded points would be the rest of the arbor tracing points. For exact definitions and more information about core metric functions and the measurements extracted in Step 2, visit the L-Measure website (http://cng.gmu.edu:8080/Lm/help/index.htm).
3. In the 'input' tab select the SWC files corresponding to one or more neurons from the local drive. 4. In the 'output' tab specify a file name and the location where you wish to store the extracted values. 5. Finally, in the 'go' tab press 'go' for L-Measure to run the analysis and calculate the numerical attributes for each SWC file. 6. In the output file, L-Measure produces neuron name, name of the core metric function, total sum of the function over all tracing points (Total_Sum), number of points included in the analysis (Count_considered_compartments), number of points excluded (Count_discarded_compartments), minimum value (Min), average (Ave), maximum (Max), and standard deviation (S.D.) of each metric function (Scorcioni et al., 2008). 7. Edit these quantifications by removing irrelevant or redundant statistics in each row and save the resultant file in comma separated value (CSV) format (see Note below for more details). Every line in this CSV file should represent a single neuron. File 'neuron_features.csv' in the scripts/data folder was obtained in that way from the output of L-Measure for this experimental dataset. Alternatively, you can run 'python convert.py -i /location/of/input-file -o /location/to/save/output-file.csv'; this script reads the results yielded fromL-Measure from the user's specified location and creates a CSV file (output-file.csv) that can be used for both the supervised and unsupervised subsequent steps.

Note:
Step 7 is required because not all elements of the L-Measure default statistical summary are appropriate for every feature. Therefore, selecting the most suitable statistic for the

OPEN ACCESS
analysis is important. For more information on the used parameters and their relevance see Table 1 below. It is important to mention that L-Measure does not calculate median values and mean-based statistics may not be appropriate if the data is not normally distributed. A detailed description of the values L-Measure can extract from the SWC files and their applications is provided by the reference paper (Scorcioni et al., 2008).

Unsupervised clustering
Timing: 2 h The following steps describe the unsupervised analysis, a process that groups neurons based on their morphological features independent of any a priori knowledge about the cells. We start with the fundamental K-means algorithm for data clustering and then use graphical mixture models to find innate distributions within the dataset.
8. From your Python environment or command line, run 'python unsupervised-elbow-curve.py -i ./Data/neuron_features.csv'. a. This script reads the neuronal data from the data folder and normalizes them. b. Then it plots the elbow curve, which shows the trade-off between residual variance and the number of clusters ( Figure 3A). c. Elbow curve is used to determine the optimal number of clusters. This heuristic method calculates the sum of squared distances of all points from the center of their respective cluster as a function of the number of clusters. The point(s) of maximum inflection (visible bends) are appropriate choices for the number of clusters. In this case, the curve has two high-inflection points (3 and 5) and we selected 3 as the number of clusters for our analysis. 9. From your Python environment or command line, run 'python unsupervised-kmeans.py -i ./Data/neuron_features -k 3'. a. This script reads the neuronal data from the data folder and groups them into 3 clusters utilizing the K-means algorithm. b. The script prints the cluster label for each neuron as well as the centers of the three clusters. c. The program also performs a principal component analysis to optimize the spatial display of the feature distribution along the main directions of their variance. d. The program then plots (i) a scattered visualization of the data based on the first two principal components and (ii) a bar plot depicting the count of neurons in each cluster ( Figures 3B and 3C). e. You can alter the number of clusters by changing the input parameter ''k'' before running the script. 10. From your Python environment or command line, run 'python unsupervised-BIC-curve.py -i ./Data/neuron_features.csv'. a. This script reads the neuronal data from the data folder and normalizes them. b. Then it plots the Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) curves, which quantify the grouping distinctiveness as a function of the number of clusters ( Figure 4A). c. Bayesian Information Criterion (BIC) is a likelihood-based method for model selection among a finite set of possibilities. This is a heuristic score to determine the optimal number of clusters. d. Akaike Information Criterion (AIC) measures the relative amount of information lost by a given model for a certain number of clusters. e. Generally, lower values of AIC and BIC are more desirable. For our analysis, we selected 3, where both curves have relatively low value compared to neighboring points. a. This script reads the neuronal data from the data folder and groups them into 3 clusters using Gaussian mixture distributions. b. The script prints the group label for each neuron. c. The program also performs a principal component analysis to optimize the spatial display of the feature distribution along the main directions of their variance. d. The program then plots (i) a scattered visualization of the data based on the first two principal components and (ii) a bar plot depicting the label distribution ( Figures 4B and 4C). e. You can alter the number of clusters by changing the input parameter ''k'' before running the script.
Optional: Scikit-learn includes various algorithms to perform grouping and clustering datasets. Based on the nature and inherent characteristics of the data, different clustering models might achieve more interpretable outcomes (Bijari et al., 2018). To explore different clustering techniques please visit https://scikit-learn.org/stable/modules/clustering.html for an overview of suitable alternatives and code snippets to execute them.

Supervised clustering
Timing: 2 h Many datasets of interest, including the examples employed here, have specific labels assigned to each neuron, representing knowledge that the researcher has about these cells that is in principle unrelated to the arbor morphology. Examples of such labels (often referred to as ''true classes'') include electrophysiological characteristics, anatomical location of the soma, expression of particular genes, functional specificity, and many more (Ascoli and Wheeler, 2016). Having the data labels in hand, the next series of steps describes the process of supervised analysis. This process assesses the morphological features that best separate among the chosen labels. We begin with visualizing the dataset to gain a spatial perception of the different data points based on their principal components. Then we add the labels to the results of the previous unsupervised methods to check whether the intrinsic morphological characteristics of the cells, as discovered by unsupervised learning, correspond to any independent functional property (distinct ground truth classes) and ascertain how adequately the supervised algorithms perform relative to the known information. At the end, we train and test a battery of supervised classification methods on the labeled data to quantify how accurately the morphological features can discern among the labels.
12. For this step, neural data should have explicit class labels. There are typically no limits to the number of these labels (ground truth information) and they are typically determined by the experimentalist while preparing the dataset based on biologically relevant knowledge. To prepare your labels, please create or format your data labels as described below. a. Use 'labels (template).xlsx' in the 'Scripts' folder and label each neuron with the proper label based on your experiment. b. If you have more than one class, use additional columns. c. Save the template file for your records. Also, export a CSV file to provide input for the next steps in the analysis. 13. From your Python environment or command line, run 'python supervised-visualize.py -i ./Data/ neuron_features.csv -l ./Data/neuron_labels.csv'. a. This script reads the neuronal data and their corresponding labels from the data folder. b. The program next performs a principal component analysis to optimize the spatial display of the feature distribution along the main directions of their variance. c. The program then outputs a series of scatter plots, one for each set of labels, with each neuron data point associated with its label symbol ( Figure 5). 14. From your Python environment or command line, run 'python labeled-kmeans.py -i ./Data/neu-ron_features.csv -l ./Data/neuron_labels.csv -k 3'. a. This script reads the neuronal data and groups them into 3 clusters. b. The program then outputs the homogeneity score and a series of scatter plots, each displaying the color-coded K-means clusters with symbols (shapes) corresponding to the true class labels in each label category ( Figure 6). c. Homogeneity is a clustering metric based on the ground truth labels. It checks if clusters contain only samples belonging to a single class. A clustering result satisfies homogeneity if all of its clusters contain only data points that are members of their ground truth class. This metric is bounded between 0 and 1, with lower values indicating low homogeneity (Rosenberg and Hirschberg, 2007). d. You can alter the input parameters before running the python script. 15. From your Python environment or command line, run 'python labeled-GMM.py -i ./Data/neuro-n_features.csv -l ./Data/neuron_labels.csv -k 3'. a. This script reads the neuronal data and groups them into 3 clusters, just as in steps 9a-d above. b. The program then outputs the homogeneity score (as described in 14.c) and a series of scatter plots, each displaying the color-coded distributions with symbols (shapes) corresponding to the true class labels in each label category (Figure 7). c. You can alter the input parameters before running the python script.
16. From your Python environment or command line, run 'python supervised-classification.py -i ./Data/neuron_features.csv -l ./Data/neuron_labels.csv' and follow the command line instructions. a. This script reads and pre-processes the neuronal data and their corresponding labels from the data folder. b. The script asks which class label you want to select for the task of classification. This could vary based on your choice of dataset. c. The program then calculates the feature importance using the 'ExtraTreesClassifier' algorithm (see Note below) and plots those values in a bar graph.
Note: The ExtraTreesClassifier algorithm used in step 16 is an approach to feature selection based on the Random Forest method. Random Forest creates several tree-based models using the features and attempts to classify the data via those decision trees. The importance of ll OPEN ACCESS the features depends on the number of times that feature is used to split a node and on the number of samples it splits. Specifically, this algorithm weighs the total decrease in ''node impurity'' by the proportion of samples reaching that node. A Mean Decrease in Impurity (MDI) is then calculated as the average of this measure over all trees of the ensemble. In other words, the fewer splits a feature needs to classify the data, and the larger the proportion it classifies, the ''purer'', hence more important, it is (Geurts et al., 2006).  a. This script reads and pre-processes the neuronal data and their corresponding labels from the data folder. b. The script then asks which feature and which class label you want to select for the density analysis. This varies based on your choice of dataset. c. In this example, we have selected the morphological features 'Contraction' and 'EucDistance' to run density analysis. Nevertheless, the analysis could be repeated on any other dimension of the data. To do so, follow the command line instructions. d. Afterward, from the command line instructions we select 'lateral line (LL)' class labels to perform density analysis within this class for different class labels, namely, 'ALL' and 'PLL' for anterior and posterior LL, respectively.. e. The program plots the density curve for the 'Contraction' and 'EucDistance' data considering different 'lateral line' class labels (Figure 9).
18. From your Python environment or command line, run 'python supervised-pdv-comparison.py -p ./Data/PDVs -l ./Data/neuron_labels.csv' and follow the command line instructions. a. This script reads the persistent diagram vectors and the data labels from the data folder (see Note below).
Note: The persistence diagram vectors (PDVs) utilized in step 18 are based on the concept of topological persistence in the field of computational topology (Kanari et al., 2018). PDVs are Figure 6. Visualization of the neurons based on K-means results (color-coded clusters) and their ground truth labels (shapes) For label meanings, see Figure 5 legend.

OPEN ACCESS
derived from a simplified representation of the neural arbor that only considers the stems, bifurcations, and terminations while ignoring all continuation points along the branches. In other words, this process creates a straight segment between any two topological nodes of the tree representing the Euclidean length of each branch. The descriptor function is the total length of the unique path (i.e., the sum of all branch lengths) from the tree stem to any point in the tree. The stem is the tree root, typically corresponding to its point of emergence from the soma (Li et al., 2017). Starting from this simplified representation, persistence analysis sweeps the neuron tree in decreasing function values, i.e., beginning from the farthest terminal tip, while tracking the appearance (end point) and disappearance (start point) of each neurite branch. The persistence diagram summarizes all resulting appearances and disappearances into a set of 2D points whose (x, y) coordinates represent the distance from the soma of the end and start points of each branch. Mathematically, the set of points in the persistence diagram captures a nested branch decomposition of the neuron tree with respect to the simplified Euclidean branch length description. Finally, the persistence diagram summary is converted into a 100-dimensional vector representing the function values at 100 positions sampled uniformly between the minimum and maximum values, corresponding respectively to the beginning of the tree and the farthest terminal tip. Intuitively, persistence diagram vectors capture the morphological information similar to that represented in Sholl diagrams (Garcia-Segura and Perez-Marquez, 2014;Sholl, 1953). The advantage of this approach, however, is that it produces a proper metric space which allows quantitative applications such as those b. The script then asks which feature, class label, and values you want to select for the density analysis. This varies based on your choice of dataset. c. In this example, we illustrated the statistical analysis on lateral line (LL) to compare class labels 'ALL' and 'PLL'. Nevertheless, the comparison could be repeated for any other class labels. To do so, follow the command line instructions and provide your desired input values. d. The program calculates pairwise arccosine distances of different vectors based on their labels and groups them in 'within' or 'across' populations for same labels (ALL/ALL or PLL/ PLL) and different labels (ALL/PLL), respectively. e. Afterward, the software runs student's t-test and displays a statement indicating whether the two groups ('within' vs 'across') are statistically different based on the resulting p-value. f. The program also outputs a bar plot of the average value of 'within' and 'across' populations with error bars indicating S.D. (Figure 10).
Note: Model-specific parameters (e.g., number of clusters) will be inserted by the user through the command prompt and users will be guided with appropriate help messages from the command prompt in the clustering and classification algorithms. Furthermore, users should be advised that we have accepted the default python package parameters (e.g., number of iterations, penalties, and learning rates) offered by the Scikit-learn toolkit. Please refer to the Scikit-learn package manuals to check if the default settings are suitable for your data.
Optional: To disregard some of the data points from the visualization or analysis based on their class or relevance, simply remove the corresponding entries from both the

EXPECTED OUTCOMES
In this protocol, neuronal reconstructions are quantified into measurable attributes expressed as numerical values. Moreover, the digitally reconstructed neurons are compared through unsupervised and supervised analysis, manifesting the most distinguishing features conducive to neuronal classification.

LIMITATIONS
The protocol only includes morphological attributes of the neural reconstructions into account. Future variations could consider additional features in the analysis, such as genetic expression or time-lapse dynamics (Nanda et al., 2020).

TROUBLESHOOTING
Use specified version of the tools or packages for proper functioning. Re-install tools if errors persist.
Problem 1 Different Python packages overlap causing code to malfunction.

Potential solution
Install Python in a virtual environment and install a fresh copy of the mentioned packages. For further details on how to install a virtual environment for different operating systems please visit https:// packaging.python.org/guides/installing-using-pip-and-virtual-environments/ (Step 2 in before you begin) Problem 2 L-Measure does not produce the desired outputs (Step 1). Potential solution Make sure the software is executable and has appropriate privileges. In Linux, for example, this is achieved using the ''chmod u+x Lm.*'' command.

Problem 3
Depending on your data, during the classification phase you might receive a warning that some of the algorithms did not converge yet. For example, after running the supervised_classification.py script you might see a warning like: (ConvergenceWarning: Stochastic Optimizer: Maximum iterations (90) reached and the optimization hasn't converged yet. warnings.warn) (Step 16)

Potential solution
In such cases, you need to increase the number of iterations for that algorithm. For example, in the case above, in the multi-layer-perceptron algorithm, increase the 'max_iter' parameter on line 143 of the supervised_classification.py script with a text editor of your choice.

Problem 4
Issue running python scripts while providing input parameters (any step).

Potential solution
All python scripts are enhanced with help command inside. Run 'python [Script].py -h' to see more information about the input parameters of the script.

Problem 5
Python runs into error while reading files from the provided paths (any step)

Potential solution
Check path names and file names to make sure they exist in the appropriate location and their name does not include any special characters.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Giorgio A. Ascoli (ascoli@gmu.edu). The script first calculates the pairwise arccosine distances between PDVs of 'within' and 'across' populations with respect to class labels (ALL and PLL), and then performs their statistical comparison. Bar plot shows the average of 'within' and 'across' distances with error bars indicating standard deviations.

Materials availability
This study did not generate any reagents.
Data and code availability All image stacks, data, and scripts of this protocol are publicly available on GitLab at https://gitlab. orc.gmu.edu/kbijari/zebrafish-analysis-protocol/tree/master/. Neuronal reconstructions are available at NeuroMorpho.Org (Lopez-Schier archive)