Joint regression-classification deep learning framework for analyzing fluorescence lifetime images using NADH and FAD

In this paper, we develop a deep neural network based joint classification-regression approach to identify microglia, a resident central nervous system macrophage, in the brain using fluorescence lifetime imaging microscopy (FLIM) data. Microglia are responsible for several key aspects of brain development and neurodegenerative diseases. Accurate detection of microglia is key to understanding their role and function in the CNS, and has been studied extensively in recent years. We propose a joint classification-regression scheme that can incorporate fluorescence lifetime data from two different autofluorescent metabolic coenzymes, FAD and NADH, in the same model. This approach not only represents the lifetime data more accurately but also provides the classification engine a more diverse data source. Furthermore, the two components of the model can be trained jointly, which combines the strengths of the regression and classification methods. We demonstrate the efficacy of our method using datasets generated from mouse brain tissue, which show that our joint learning model outperforms results on the coenzymes taken independently, providing an efficient way to distinguish microglia from other cells.


Introduction
Fluorescence lifetime imaging microscopy (FLIM) has shown great promise for the visualization of physiological parameters in fixed and living cells in biology and medicine [1][2][3]. FLIM measures the average time a molecule (excited by a photon) remains in its excited state before returning to its ground state and emitting a photon. This lifetime estimation of the molecule is based on an exponential decay function and is used to localize specific fluorophores. One advantage of fluorescence lifetime imaging is that it can probe local cellular microenvironments around the fluorophores [4]. FLIM can also be used to identify local metabolic signatures, leading to improved understanding of specific cell types [5]. These metabolic signatures can be intrinsically detected because of FLIM's ability to probe autofluorescent metabolic coenzymes such as NADH (Nicotinamide Adenine Dinucleotide) and FAD (Flavin Adenine Dinucleotide). By quantitatively identifying lifetime and binding status, FLIM analysis of NADH and FAD can reveal metabolic alterations and indicate whether a cell is in a more glycolytic or oxidative state [6][7][8][9]. FLIM can be combined with a wide range of imaging techniques such as confocal microscopy, two-photon excitation microscopy, as well as multiphoton tomography [10][11][12].
In this study, we focus on microglia, the resident macrophages of the Central Nervous System (CNS), which play a role in virtually all neuropathologies [13][14][15][16]. They influence brain development and maintain the neural environment, as well as respond to injury and infection. Due to their involvement in CNS processes and neurodegenerative diseases, understanding the morphology and function of microglia is critical in a number of scenarios. In this respect, accurate imaging and computational tools are needed to identify metabolic signatures specific to microglia and their activation status. FLIM is used as an imaging method for this purpose, given its ability to probe the cellular microenvironment in a label-free manner that does not alter the cellular signature. Furthermore, utilizing recent advances in machine learning, it has been shown that FLIM based monitoring of microglial metabolic alterations can be used both to differentiate microglia from other CNS cell types using an artificial neural network based approach [17] and to identify their activation status [18]. These studies show promising results on the utility of computational tools for analyzing FLIM to understand the functional role of microglia in the CNS.
In this paper, we develop a neural network based approach to analyze FLIM data obtained from two metabolic coenzymes, NADH and FAD, jointly for the purpose of studying microglia. Most previous studies of microglia used FLIM obtained from NADH alone [18] or have focused on network architectures for estimating and visualizing the lifetimes [17,19]. To the best of our knowledge, FLIM data obtained from the two metabolic coenzymes NADH and FAD have not been used together in the same machine learning framework. But such an idea could potentially have value, since NADH and FAD sometimes offer unique information in terms of relative change with metabolic alterations. For example, enzyme-bound NADH has a higher lifetime than free NADH, while this trend is reversed for FAD [20]. Based on the state of metabolism and relative shift, FAD lifetimes and their relative bound/free status can follow different trends than NADH. Furthermore, the fluorescence lifetime of protein-bound NADH/FAD is sensitive to changes in binding status. Therefore, studying the effect of these two coenzymes together can lead to a heterogeneous data source that provides complementary information on how to characterize the microglial footprint and distinguish it from other CNS cell types.
Contributions We propose an efficient approach to solve the above problem in a computational setting and address the key challenges that arise in this context. We formulate a customized deep-learning framework specifically designed to work with FLIM data, which inputs the bimodal fluorescence lifetime decays corresponding to each coenzyme and learns both regression and classification to separate microglia from other cell types (Fig. 1). The regression component represents the time-resolved fluorescence decay data using a non-parametric regression approach, expanding it in a set of discrete Laguerre bases, which is solved as a layer of the neural network. This is followed by a set of layers that performs the classification on the intensity image based on the similarity of the corresponding fitted exponential decay curves of the pixels. This part of the deep network is used to learn the parameters of Conditional Random Fields (CRFs), which are widely used for semantic segmentation. The primary advantage of CRFs is three-fold: a) the objective/maximum-likelihood loss function can be explicitly constructed in a CRF and is therefore interpretable; b) as noted by [21], deep semantic segmentation neural networks are usually not perfect, especially at object borders, since they have convolutional filters with large receptive fields and hence produce coarse outputs [22]; in this scenario, a CRF is often used as a post-processing step to improve the final classification; c) CNN based segmentation models such as U-net [23] lack smoothness constraints that encourage label agreement between similar pixels, and spatial and appearance consistency of the labelling output, a property which is explicitly captured by CRFs. By utilizing the regression and CRF components together, our deep learning architecture provides a joint regression-classification framework for learning to distinguish microglia using bimodal FLIM decay data.
To the best of our knowledge, such a joint regression-classification deep learning framework for FLIM has not been demonstrated, though there have been some efforts to estimate lifetime decays using neural networks [24][25][26] (see next section for details).

Previous work
Our review of previous work covers three main subareas, which are directly related to the methodology in this paper. These are as follows. Computational approaches in FLIM: The starting point of analysis of time-resolved fluorescence lifetime data involves the determination of the fluorescence intensity decay (Y) by identifying a set of fitting parameters that best describe the characteristics of the fluorescence decay, followed by classification of the fluorescent system based on those parameters. Mathematically, the measured fluorescence intensity decay data (Y) is given by the convolution of the instrument response (F) from the excitation light pulse with the fluorescence response of the tissue (I). Thus, to estimate the fluorescence of a compound, the excitation light pulse must be deconvolved from the measured fluorescence intensity pulse [27]. Several deconvolution approaches are used in this regard, the most popular being nonlinear least-squares iterative reconstructions, which assume the functional form of Y. However, there are methods which generate Y directly without any assumption, such as the Fourier [28] and Laplace transform methods [29] and the exponential series method [30], among others. Among least-squares approaches, the most popular method is the nonlinear least-squares iterative reconvolution (LSIR) approach [30,31]. Laguerre based deconvolution has also been used in this regard [27,32], in which the fluorescence signal Y is expressed as an expansion on the discrete-time Laguerre basis instead of a weighted sum of exponential functions. These papers and others [33,34] have shown that the Laguerre expansion coefficients are highly correlated with the lifetime value, suggesting such an approach may be very useful in direct characterization of biochemical compounds in terms of their fluorescence emission temporal properties. In this regard, Smith et al. [24] recently proposed a fit-free approach to fluorescence lifetime image formation using deep learning (DL), and Walsh et al. [35] performed autofluorescence imaging of NAD(P)H and FAD followed by (individual) classification of T cell activation.
Joint Regression-Classification methods in Machine Learning Joint regression-classification approaches have been used frequently in the biomedical literature, though not in the context of FLIM. Zhu et al. [36] used this type of joint training for clinical score regression and clinical label classification in the context of Alzheimer's disease (AD). Wang et al. [37] adopted a joint regression-classification approach for identifying imaging biomarkers that are associated with both cognitive measures and AD. A deep multi-task multi-channel learning framework for simultaneous classification and regression for brain disease diagnosis, using MRI data and personal information, was proposed by Liu et al. [38]. In addition to the biomedical literature, several papers in the fields of vision and machine learning use such joint learning approaches for a variety of problems, such as time series classification [39] and age prediction [40].
Learning CRFs Conditional random fields (CRFs) are a machine learning approach adapted from statistical modelling, where the inference problem is formulated on a graphical model that implements dependencies between the different nodes of the graph. As a result, they give a classifier the context of "neighboring samples" while classifying a given sample. CRFs are widely used in image segmentation; however, here we focus only on those methods which learn the parameters of a CRF. Barbu et al. [41] proposed joint training of a CRF together with an inference algorithm in their Active Random Field approach. Domke et al. [42] formulated a back-propagation based parameter optimization scheme for graphical models using approximate inference. In the same vein, it has been shown [43,44] how back-propagation through belief propagation can be used to optimize model parameters in CRFs. Krahenbuhl et al. [45] showed how to learn the parameters of a dense CRF using a modified mean-field algorithm for inference. Zheng et al. [22] formulated a dense CRF as a Recurrent Neural Network (RNN), which led to an end-to-end trainable system for semantic image segmentation.

Methods
Our deep learning architecture for the joint regression-classification problem on bimodal lifetime decay data consists of two main stages. In the first stage, the lifetime decay curve (one for each pixel) is expressed as a linear combination of predefined Laguerre bases subject to additional smoothness constraints. This is solved by a quadratic programming layer, recently proposed by [46]. The coefficients obtained at this step provide an indication of similar activation patterns among pixels. This information, in conjunction with spatial similarity, can be used to distinguish similar tissue types. Therefore, in the second stage, we combine the coefficients from each of the modalities (NADH and FAD) and pass them on as features to the CRF subnetwork for classification. A schematic diagram of our network can be seen in Fig. 2. We describe the details of each stage next.

Representation and modeling
We first describe the mathematical representation of the fluorescence decay curve. As mentioned earlier, the measured fluorescence intensity decay data (Y) is given by the convolution of the tissue fluorescence response signal (I) with the excitation light pulse, which is part of the instrument response (F), plus additive noise (ϵ). Each of these variables is a function of time. This relationship can be written as

Y(t) = F(t) ⊗ I(t) + ϵ(t),   (1)

where ⊗ represents the convolution of the two signals. The function I(t) is best approximated by a multi-exponential decay function. Here we use a bi-exponential decay function to represent the signal, written as

I(t) = a_1 e^(−t/τ_1) + a_2 e^(−t/τ_2),   (2)

where a_1, a_2 and τ_1, τ_2 represent the amplitudes and the lifetimes of each exponential component. Such exponential functions can be expanded in a number of ways, but it has been shown that using Laguerre polynomials or other orthogonal polynomials can lead to better approximations than a Taylor series expansion. Using this expansion, the signal I(t) can be expressed as

I(t) = ∑_{j=1}^{d} c_j b_j^α(t),   (3)

where b_j^α(t) is the discrete set of Laguerre bases (of scale α), d is the number of bases and c_j are the unknown coefficients of the combination. Plugging this into Eq. (1), it can be rewritten in matrix form as

Y = VC + ϵ,   (4)

where V is the t × d matrix derived from the convolution of F with the basis matrix B = {b_1^α · · · b_d^α}, which is also t × d (see [32] for details). Here we assume t is the number of time points, n is the number of pixels and d is the number of Laguerre bases. The matrix Y is of size t × n whereas the matrix C is of size d × n. In the above model (4), if the number of bases is large, it can tend to overfit and not generalize well to unseen data. To avoid this, a constraint is added on the third derivative of BC to ensure a smooth reconstruction [47]. This can be written in discrete form as DBC ≤ 0, where D is the third-order forward finite difference matrix.
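As a concrete illustration, the forward model in Eqs. (1)-(2) can be simulated in a few lines. The lifetimes, amplitudes, and the Gaussian IRF width below are illustrative values, not the parameters of the paper's instrument:

```python
import numpy as np

# Forward model for one pixel: Y(t) = F(t) * I(t) + eps(t).
t = np.linspace(0.0, 12.5, 256)          # time axis in ns
dt = t[1] - t[0]

# Bi-exponential fluorescence response I(t) = a1*exp(-t/tau1) + a2*exp(-t/tau2)
a1, tau1, a2, tau2 = 0.7, 0.4, 0.3, 2.5  # illustrative amplitudes/lifetimes
I = a1 * np.exp(-t / tau1) + a2 * np.exp(-t / tau2)

# Gaussian instrument response F(t), centered at 1 ns (assumed shape)
F = np.exp(-(t - 1.0) ** 2 / (2 * 0.1 ** 2))
F /= F.sum()

# Measured decay: convolution with the IRF plus additive Gaussian noise
rng = np.random.default_rng(0)
Y = np.convolve(F, I)[: len(t)] + rng.normal(0, 1e-3, len(t))
```

The simulated decay peaks near the IRF center and then decays with the mixture of the two lifetimes, which is the signal the Laguerre expansion is asked to represent.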
Taken together, the optimization problem can be written as

min_C ‖Y − VC‖²   s.t.  DBC ≤ 0.   (5)

The objective can be further expanded as Tr(C^T V^T VC − 2Y^T VC) (dropping the constant ‖Y‖² term). Typically the above problem is solved as a dual, formulated as a non-linear least squares problem, where the coefficient matrix C (whose column vectors are denoted C_i) is recovered after the model is solved.
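A minimal sketch of the pieces of model (5), with randomly generated stand-ins for B, V and Y (in practice V comes from convolving the instrument response with the Laguerre basis), builds the third-order difference operator D and checks that the expanded trace objective matches the least-squares residual up to the constant ‖Y‖² term:

```python
import numpy as np

t, d, n = 64, 10, 5                       # time points, bases, pixels (toy sizes)
rng = np.random.default_rng(1)
B = rng.standard_normal((t, d))           # stand-in for the Laguerre basis matrix
V = rng.standard_normal((t, d))           # stand-in for conv(F, B)
Y = rng.standard_normal((t, n))

# Third-order forward finite-difference operator: D @ x = third differences of x
D = np.diff(np.eye(t), n=3, axis=0)       # shape (t-3, t)

# Unconstrained least-squares fit (the smoothness constraint DBC <= 0 would
# be enforced by the QP layer in the actual model)
C = np.linalg.lstsq(V, Y, rcond=None)[0]

# Expanded objective Tr(C^T V^T V C - 2 Y^T V C) equals ||Y - VC||^2 - ||Y||^2
obj = np.trace(C.T @ V.T @ V @ C - 2 * Y.T @ V @ C)
resid = np.linalg.norm(Y - V @ C) ** 2 - np.linalg.norm(Y) ** 2
```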
In our problem, we extend the model (5) to include an additional regularization term, to ensure that pixels with similar decay profiles are classified similarly. The details of this extension are discussed in Section 3.1.2; we give a brief overview here. We use an energy term, commonly known as a Conditional Random Field (CRF), of the form

E = ∑_i ψ(i) + ∑_{i≠j} ψ(i, j),

where the unary energy components ψ(i) measure the inverse likelihood (and therefore, the cost) of the pixel i taking a given label, and the pairwise energy components ψ(i, j) measure the cost of assigning different labels to pixels i and j. We assume the probability of two pixels belonging to the same label is given a priori (such as in the presence of training data). This is denoted by µ(i, j), which is set to 1 if a nearby pair of pixels is assigned different labels and 0 otherwise. In addition, a Gaussian energy function is used to represent the pairwise term, written as K(i, j) = exp(−‖f_i − f_j‖²/(2σ²)), where f_i denotes the feature vector of pixel i. This CRF energy term, when solved in conjunction with the objective in (5), can be written as follows:

min_{C, X}  ‖Y − VC‖² + Ψ^T X + X^T (µ ⊙ K)(1 − X) + (1 − X)^T (µ ⊙ K) X   s.t.  DBC ≤ 0.   (6)

Here, we use a variable X of size n × 1 as the output of classification, where the values are in {0, 1}, indicating background and foreground respectively. Also, the symbol ⊙ represents element-wise multiplication and the matrix Ψ represents the unary costs. We show the details of this derivation in Section 3.1.2. It turns out that the above model (6) is a bilevel optimization problem and can no longer be written as a linear least squares problem. Therefore, standard methods for solving the Laguerre regression problem [32,47] no longer work here.

Regression as a neural network layer
The above model (6) is computationally challenging to solve, particularly for the CRF energy term, since it leads to a dense CRF formulation (all pairs of pixels considered) [22,45]. Therefore, we split our model into two parts. In the first step, we solve the model (5) as a layer of the neural network. This layer is repeated twice (and can be solved in parallel): once to learn C_NADH, pertaining to the NADH lifetimes, and once to learn C_FAD, corresponding to the FAD lifetime data. One can think of C_NADH and C_FAD as two sets of features describing the same underlying phenomenon. In the second step (the classification part), these are concatenated as Ĉ (we also add the pixel RGB values and position as additional features) and used as input to the next set of layers. In these layers, the network learns the parameters of a dense CRF, using an efficient approach proposed by [48], which uses permutohedral filtering for a significant computational speedup. The whole neural network is trained jointly using gradient descent.
Here we describe the approach used to solve the regression objective. We use the OptNet (QPTH) method proposed by Amos et al. [46], who show how neural network layers can be designed to solve quadratic objectives and encode constraints. Briefly, given a quadratic program of the form

min_z (1/2) z^T Q z + q^T z   s.t.  Az = b, Gz ≤ h,   (7)

this method first constructs the Lagrangian. This can be written as

L(z, ν, λ) = (1/2) z^T Q z + q^T z + ν^T (Az − b) + λ^T (Gz − h),

where ν and λ are the dual variables introduced for the equality and inequality constraints respectively. This is used to derive the Karush-Kuhn-Tucker (KKT) conditions [49] by differentiating the Lagrangian with respect to the variables z, ν and λ and setting the resultant derivatives to 0. This leads to the following system:

Q z* + q + A^T ν* + G^T λ* = 0,
A z* − b = 0,
D(λ*)(G z* − h) = 0,
where D(·) creates a diagonal matrix from a vector and z*, ν* and λ* are the optimal primal and dual variables. Since these equations form a linear system and their differentials involve only matrix multiplication, they can be easily handled by the automatic differentiation module within deep learning software, such as Autograd in PyTorch [50]. The reader is encouraged to see [46] for how the gradients with respect to the data parameters are computed in the backward pass, as the details are outside the scope of this paper.
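For the equality-constrained case, the KKT conditions reduce to a single linear system that can be solved directly. The following toy example (random Q, A, b of hypothetical sizes) verifies stationarity and primal feasibility of the recovered solution:

```python
import numpy as np

# Solve min_z 0.5 z^T Q z + q^T z  s.t.  A z = b  via its KKT linear system:
#   [Q  A^T] [z ]   [-q]
#   [A   0 ] [nu] = [ b]
rng = np.random.default_rng(2)
m, k = 6, 2
M = rng.standard_normal((m, m))
Q = M @ M.T + m * np.eye(m)              # symmetric positive definite
q = rng.standard_normal(m)
A = rng.standard_normal((k, m))
b = rng.standard_normal(k)

KKT = np.block([[Q, A.T], [A, np.zeros((k, k))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
z, nu = sol[:m], sol[m:]                 # optimal primal and dual variables
```

Because the system is linear in (z, ν), its solve is differentiable with respect to Q, q, A and b, which is what makes the QP usable as a network layer.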
The advantage of this approach is that it can be used in batch mode and outperforms standard quadratic programming solvers like CPLEX [51] and Gurobi [52]. Since this layer optimizes for z in the form of a vector, we perform some basic transformations on our inputs to make them compatible with the inputs required by our model. In particular, we take the Kronecker product [53] of V^T V with an identity matrix of size m, I_m, that is, Q = I_m ⊗ V^T V, and vectorize Y^T V to generate q. The constraints are handled similarly.
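The vectorization step can be checked numerically: applying I_m ⊗ V^T V to the column-stacked vec(C) is the same as applying V^T V to each column of C separately. A small sketch with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
t, d, m = 32, 6, 4
V = rng.standard_normal((t, d))
Y = rng.standard_normal((t, m))
C = rng.standard_normal((d, m))

# Batch the per-pixel quadratic objectives into one big QP:
# (I_m kron V^T V) acts on vec(C) (columns stacked) exactly as
# V^T V acts on each column of C.
Q = np.kron(np.eye(m), V.T @ V)
vecC = C.flatten(order="F")              # column-major vectorization
lhs = Q @ vecC
rhs = (V.T @ V @ C).flatten(order="F")

# Linear term from the expanded objective -2 Tr(Y^T V C)
q = -2 * (V.T @ Y).flatten(order="F")
```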

Classification of microglia as learning CRF
In this section, we describe the details of our CRF model. We define a dense CRF (all pairwise potentials considered) on a set of n random variables corresponding to each of the pixels, X = {x_1, . . . , x_n}, each of which is represented by the feature vectors Ĉ_1, . . . , Ĉ_n respectively. Each random variable x_i takes one label from a set of binary labels L = {0, 1}, corresponding to background and microglia respectively. As described earlier, the CRF energy function can be described as a sum of the unary and pairwise terms, written as E = ∑_i ψ(i) + ∑_{i≠j} ψ(i, j), where the unary term describes the potential for the random variable x_i taking the label 0 or 1 and the pairwise term describes the potential for random variables x_i and x_j taking different labels. Specifically, the pairwise term can be further expanded by using a Gaussian function to express the similarity of x_i and x_j using the corresponding feature vectors Ĉ_i and Ĉ_j. This can be written as

ψ(i, j) = µ(i, j) ∑_m w_m K_m(Ĉ_i^m, Ĉ_j^m),   K_m(Ĉ_i^m, Ĉ_j^m) = exp(−‖Ĉ_i^m − Ĉ_j^m‖²/(2σ²)),

with µ(i, j) as a label compatibility function between the labels. We use the Potts model to define µ, where µ(i, j) = 1 if the labels are different and 0 otherwise. σ is the width of the kernel. The kernel K itself is a mixture of Gaussian kernels K_m (created from different features, where Ĉ_i^m represents the subset of features chosen) with weights w_m. We can rewrite the CRF problem as an optimization problem, where the goal is to find a variable X_i ∈ {0, 1} for each pixel i. Let Ψ be the n × 1 matrix that represents the unary costs. Then, the unary potential can simply be represented as Ψ. The pairwise terms can be written as X^T (µ ⊙ K)(1 − X) + (1 − X)^T (µ ⊙ K) X. This objective counts the pairwise terms when X_i = 0 and X_j = 1 (and vice versa). Taken together, the CRF problem can be written as the following optimization problem:

min_X  Ψ^T X + X^T (µ ⊙ K)(1 − X) + (1 − X)^T (µ ⊙ K) X.

Here the ⊙ operation stands for element-wise multiplication. Since the number of classes in our application is two, we do not require any additional constraints on X.
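The energy above can be evaluated directly for a candidate labeling. The sketch below uses a single Gaussian kernel (i.e. one K_m with w_m = 1) and uniform a-priori pair weights µ, both simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
feats = rng.standard_normal((n, 3))      # stand-ins for the feature vectors C_hat
psi = rng.random(n)                      # unary costs Psi
X = (rng.random(n) > 0.5).astype(float)  # a candidate binary labeling

# Single Gaussian kernel on feature differences (one K_m, w_m = 1)
sigma = 1.0
sq = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))
mu = np.ones((n, n)) - np.eye(n)         # uniform pair weights, self-pairs excluded

def crf_energy(X):
    # Psi^T X + X^T (mu . K)(1-X) + (1-X)^T (mu . K) X
    pair = X @ (mu * K) @ (1 - X) + (1 - X) @ (mu * K) @ X
    return psi @ X + pair

E = crf_energy(X)
```

Note that a uniform labeling incurs zero pairwise cost, since every X_i(1 − X_j) product vanishes; only label disagreements between similar pixels are penalized.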
This model, taken together with the regression model (5), gives us the non-convex problem shown in (6). There are many ways to solve the above model, but since the QPTH framework is implemented as a neural network layer, we opt for a similar approach for the CRF, since we can then connect the two components and optimize the neural network as a whole. Here, we briefly outline the Recurrent Neural Network (RNN) framework to solve the CRF problem. This approach was initially proposed by Krahenbuhl et al. [45] and later implemented as an RNN by Zheng et al. [22]. First, initialization is done by applying a softmax function over the unary potentials at each pixel. This is followed by a message passing layer, applying m Gaussian filters to the previously initialized values. Since the CRF is potentially fully connected, each filter's receptive field spans the whole image, making the filters computationally challenging to calculate. However, in the RNN extension of the CRF, this is done using a permutohedral lattice implementation [48], which can compute the filter response in O(n) time, where n is the number of pixels of the image. Next, the weighted sum of the m filter outputs from the previous step is computed as a convolutional layer. In the next layer, the label compatibility function µ is incorporated as another convolution layer where the spatial receptive field of the filter is 1 × 1, and the number of input and output channels are both set to 2. After this, the output of the compatibility layer is subtracted from the unary potentials, and another softmax is applied for normalization. At this point, the output is connected back to the input and the model is trained over a few iterations. Finally, a rectified linear unit (RELU), where RELU(x) = max(0, x) (for input x), is used to threshold the output into two classes and obtain the output X.
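A brute-force sketch of these mean-field iterations (dense kernel products in place of the permutohedral lattice, a toy one-dimensional "image", and made-up unaries) may help make the update order concrete:

```python
import numpy as np

def softmax(a, axis=0):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mean_field(unary, K, compat, iters=5):
    """Mean-field inference for a dense binary CRF.
    unary  : (2, n) per-label costs
    K      : (n, n) pairwise Gaussian kernel (brute force; the permutohedral
             lattice replaces this dense product in the real implementation)
    compat : (2, 2) label compatibility (Potts)
    """
    Qd = softmax(-unary, axis=0)                 # initialization
    for _ in range(iters):
        msg = Qd @ K.T                           # message passing, shape (2, n)
        pairwise = compat @ msg                  # compatibility transform
        Qd = softmax(-unary - pairwise, axis=0)  # add unaries, renormalize
    return Qd

# Toy problem: two spatial blobs with noisy unaries
n = 40
pos = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
rng = np.random.default_rng(5)
unary = np.stack([pos + 0.3 * rng.standard_normal(n),
                  1 - pos + 0.3 * rng.standard_normal(n)])
K = np.exp(-(pos[:, None] - pos[None, :]) ** 2 / 0.5) - np.eye(n)
compat = np.array([[0.0, 1.0], [1.0, 0.0]])      # Potts model
Qd = mean_field(unary, K, compat)
labels = Qd.argmax(axis=0)
```

After a few iterations the marginals sharpen toward the spatially consistent labeling, which is the smoothing effect the CRF layers contribute on top of the unaries.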

Optimization and training
We propose an end-to-end deep learning method that combines the QPTH and CRF methods into a joint regression-classification model for identifying microglia from FLIM data. The CRF is incorporated as the last set of layers, which takes as its input the Ĉ from the QPTH layer. We then train the whole integrated model together by sharing the representative weights. The shared weights are further re-trained to fine-tune the performance of the merged model. The model is trained via the Adam optimizer, which calculates gradients to update the shared weight parameters. The loss function minimized for the merged model is the pixel-wise binary cross entropy loss between the output X from the CRF and a ground truth segmentation Z for each image. This can be written as

L(X, Z) = −(1/n) ∑_{i=1}^{n} [ Z_i log X_i + (1 − Z_i) log(1 − X_i) ].
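The loss can be sketched as follows, where X stands for the per-pixel foreground probabilities produced by the network and Z is the binary ground truth mask:

```python
import numpy as np

def bce_loss(X, Z, eps=1e-7):
    """Pixel-wise binary cross entropy between predicted probabilities X
    (clipped away from 0/1 for numerical stability) and a 0/1 mask Z,
    averaged over pixels."""
    X = np.clip(X, eps, 1 - eps)
    return -np.mean(Z * np.log(X) + (1 - Z) * np.log(1 - X))

Z = np.array([1.0, 0.0, 1.0, 0.0])
perfect = bce_loss(np.array([1.0, 0.0, 1.0, 0.0]), Z)  # near-zero loss
chance = bce_loss(np.full(4, 0.5), Z)                  # log(2) for a 0.5 guess
```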

Experiments
We performed our experiments using data collected from mouse brain tissue samples (n = 103).
For each sample, we used the TCSPC (time-correlated single photon counting) software to output various parameters, including the lifetime parameters and intensity images. For each sample, we obtain both the NADH and FAD lifetime data. In addition, we use a separate antibody (Iba1), which binds only to microglia in the CNS, to generate the ground truth images. Note that this is independent of the lifetime data. We briefly elaborate on the acquisition details of the dataset in the next section.

Animals
All animals were maintained in an AAALAC-accredited animal facility with a 12-hr light/dark cycle and had access to food and water ad libitum. All experiments were performed in accordance with University of Wisconsin-Madison Institutional Animal Care and Use Committee guidelines. For FLIM imaging, 100 µm thick coronal slices were prepared from the fixed brains of young adult C57BL/6J and CX3CR1-GFP mice (Jackson Labs), aged 6-8 weeks. Animals were euthanized by isoflurane overdose and transcardially perfused with 30 ml of ice-cold PBS, followed by a second perfusion with an ice-cold solution of 4% PFA in PBS. Brains were then dissected, post-fixed for 24 hrs in a solution of 4% PFA in PBS, and then moved to HBSS (all performed at 4 °C and protected from light).

Immunohistochemistry
100 µm thick coronal sections were prepared from the midbrain region of each brain using a Leica Vibratome. Two slices from each animal were used for immunohistochemical staining. Briefly, slices were washed at room temperature with 0.3% TritonX-100 in PBS before incubating in blocking buffer (1% BSA, 0.3% TritonX-100/PBS) for 2 hours at room temperature. Slices were then incubated with anti-Iba1 antibodies (1:1000; Wako Catalog No. 019-19741) in blocking buffer in the dark at 4 °C overnight. Slices were washed again at room temperature with 0.3% TritonX-100 in PBS, followed by incubation in the dark for 2 hours with AlexaFluor 594 anti-rabbit IgG antibodies (1:200) in blocking buffer at room temperature. Slices were washed with 0.3% TritonX-100 in PBS and mounted on 1 mm slides using Cytoseal 60 mounting medium. Mounted sections were stored at room temperature, protected from light, until they could be imaged.

Multiphoton lifetime imaging
The lifetime and intensity imaging [54] was performed on a custom multiphoton laser scanning system built around an inverted Nikon Eclipse TE2000U at the Laboratory for Optical and Computational Instrumentation [55]. For each sample, around 20 neighboring FOVs were randomly selected, and the average value of the lifetime and the free NADH ratio was calculated based on masking. The instrument response function of the optical system was measured during each imaging session. A separate antibody that binds only to microglia was used to generate the images used as ground truth. This is the Iba1 antibody, which is widely used for microglia imaging. Using it, we are able to see the microglia accurately, which is why it is ideal for the generation of ground truth images.

Evaluations
We lay out our evaluations in the following order: 1) First, we use a variety of classification metrics to quantitatively and qualitatively evaluate how well our model performs the task of distinguishing microglia from other cell types. 2) Second, we study the contribution of each component (regression and classification) of the bimodal scheme for the given task. To do this, we evaluate each component separately: first we quantify the quality of the fit using the regression approach alone; next, we provide the output of the regression to a generic clustering approach to see how it performs compared to our scheme. 3) Finally, we test the utility of the bimodal approach by comparing the results to those using only one modality at a time.

Evaluation of efficacy of bimodal model via segmentation metrics
Here, we train our model with 40% of the data and use the remaining data for testing. We use 10 different classification/segmentation metrics to evaluate the quality of the classification with respect to the ground truth. These include several indices computed based on the overlap of the ground truth and the obtained segmentation, such as Classification Accuracy (Acc), Precision, Recall, Fvalue, Dice and Jaccard coefficients, and Area Under the ROC Curve (AUC), as well as information-theoretic ones such as Mutual Information (MI) and Normalized Variation of Information (NVI), along with the Rand Index (RI) and Adjusted Rand Index (ARI). The details of these metrics and their properties are explained next and can also be found in [56]. Metrics: For a given dataset, let Y denote the output of a binary classifier in R^n and let Ŷ, also in R^n, be the ground truth. We use the notation Y_i to indicate the i-th element of the vector Y (similarly for Ŷ). These can be used to calculate True Positives (TP = Y^T × Ŷ, which counts the number of data points for which both Y and Ŷ are 1, i.e., in the positive class), False Positives (FP = Y^T × (1 − Ŷ), which counts the number of points where Y is 1 but the ground truth is 0, i.e., in the negative class), True Negatives (TN = (1 − Y)^T × (1 − Ŷ), which counts the number of points where both are 0) and False Negatives (FN = (1 − Y)^T × Ŷ, which counts the number of points where Y is 0 but the ground truth is 1). Here the operator × is the vector dot-product.
Based on these, we can define the following metrics. Accuracy computes the percentage of points correctly classified and is defined as

Acc = (TP + TN) / (TP + TN + FP + FN).

Precision P computes the proportion/percentage of identified positive-class points that are actually correct and is defined as

P = TP / (TP + FP).

Recall R calculates what proportion of positive-class points is identified correctly and is given as

R = TP / (TP + FN).

Combining Precision and Recall, Fvalue (also called F-measure) gives an intuition of how precise a classifier is (how many points it classifies correctly), as well as how robust it is, and is defined as

F = 2PR / (P + R).

The Jaccard Coefficient, also known as Intersection over Union, is a statistic used for gauging the similarity and diversity of two clusterings/sets and is defined as

JC(Y, Ŷ) = TP / (TP + FP + FN).

The Dice Coefficient is similar to Jaccard and captures the percentage of overlap between the two clusterings. It is defined as

DSC(Y, Ŷ) = 2TP / (2TP + FP + FN).

Next, we define the Area Under the Curve (AUC). Note that AUC calculates the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Since AUC is a probability, its value varies between 0 and 1, where 1 represents all positives being ranked above the negatives. AUC is often estimated using the Wilcoxon-Mann-Whitney (WMW) statistic, which is computed as

AUC = (1 / (|T||N|)) ∑_{t∈T} ∑_{u∈N} 1[ f(t) > f(u) ].

Here, we assume T is the set of points in the positive class (Ŷ_i = 1 : i ∈ [1, n]), N is the set of points in the negative class (Ŷ_i = 0 : i ∈ [1, n]) and f(y) is the classifier function assigning a score to each element y_i ∈ Y : i ∈ [1, n], where |T| + |N| = n.
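These overlap-based metrics follow mechanically from the four counts; a small self-contained sketch:

```python
import numpy as np

def segmentation_metrics(Y, Yhat):
    """Overlap-based metrics from a binary prediction Y and ground truth Yhat."""
    Y, Yhat = np.asarray(Y, float), np.asarray(Yhat, float)
    TP = Y @ Yhat                  # both 1
    FP = Y @ (1 - Yhat)            # predicted 1, truth 0
    TN = (1 - Y) @ (1 - Yhat)      # both 0
    FN = (1 - Y) @ Yhat            # predicted 0, truth 1
    P = TP / (TP + FP)
    R = TP / (TP + FN)
    return {
        "Acc": (TP + TN) / (TP + TN + FP + FN),
        "Precision": P,
        "Recall": R,
        "Fvalue": 2 * P * R / (P + R),
        "Jaccard": TP / (TP + FP + FN),
        "Dice": 2 * TP / (2 * TP + FP + FN),
    }

m = segmentation_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```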
To define the information-theoretic measures, we define two probability distributions P_Y and P_Ŷ for the discrete variables Y and Ŷ respectively. Let P_YŶ denote their joint distribution. Then Mutual Information can be defined as

MI(Y, Ŷ) = ∑_{y,ŷ} P_YŶ(y, ŷ) log [ P_YŶ(y, ŷ) / (P_Y(y) P_Ŷ(ŷ)) ].

Based on the probability measures, one can also define the entropy H(Y) = −∑_y P_Y(y) log P_Y(y) (and analogously H(Ŷ) and the joint entropy H(Y, Ŷ)). These are then used to define Normalized Variation of Information (NVI), a metric used to calculate the amount of normalized information lost and gained in changing from one clustering to another, defined as

NVI(Y, Ŷ) = 1 − MI(Y, Ŷ) / H(Y, Ŷ),

which normalizes the variation of information VI(Y, Ŷ) = H(Y) + H(Ŷ) − 2 MI(Y, Ŷ) by the joint entropy. To define the Rand index [57], we first need to define a few other terms. Comparing two clusterings Y and Ŷ, let a be the number of pairs of data points placed in the same group in Y and in the same group in Ŷ; b the number of pairs placed in the same group in Y and in different groups in Ŷ; c the number of pairs placed in the same group in Ŷ and in different groups in Y; and d the number of pairs placed in different groups in both Y and Ŷ. Using these, the Rand index, which computes the similarity of two clusterings in terms of pairs of items, is defined as

RI = (a + d) / (a + b + c + d).

The Adjusted Rand index is also defined in terms of the above values, as

ARI = 2(ad − bc) / [ (a + b)(b + d) + (a + c)(c + d) ].

The rationale for this definition is somewhat involved and beyond the scope of this paper; the reader is referred to [57] for more details. The range of these metrics is [0, 1], and higher values indicate a more accurate segmentation, except for NVI, where exactly identical clusterings have an NVI value of 0. As shown in Table 1, our method performs very well overall, with only a small decrease in values on the test set compared to training. Qualitative results are shown in Fig. 3, which shows that the obtained segmentation is close to the ground truth in most cases.
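The pair-count quantities a, b, c, d and the two Rand indices can be computed by brute force over all pairs (fine for small n; the quadratic loop is purely illustrative):

```python
from itertools import combinations

def pair_counts(Y, Yhat):
    """Count pairs: same/same (a), same/diff (b), diff/same (c), diff/diff (d)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(Y)), 2):
        sameY, sameYhat = Y[i] == Y[j], Yhat[i] == Yhat[j]
        if sameY and sameYhat:
            a += 1
        elif sameY and not sameYhat:
            b += 1
        elif sameYhat and not sameY:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_index(Y, Yhat):
    a, b, c, d = pair_counts(Y, Yhat)
    return (a + d) / (a + b + c + d)

def adjusted_rand_index(Y, Yhat):
    a, b, c, d = pair_counts(Y, Yhat)
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
```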

Evaluation of the regression approach
To study the effect of regression, we first investigate how well the Laguerre basis approximation can estimate the input fluorescence decay curve. To do this, we use the bi-exponential decay function as an approximation of I(t) (which is then convolved with the instrument response) to construct the corresponding signal Y(t) in Eq. (1). Observe that the Laguerre basis expansion is exact only when I(t) is a bi-exponential function. Figure 4(a) shows the input signal (in red) along with the signal reconstructed by the Optnet (QPTH) layer when the number of bases is set to 10. We see that overall the reconstructed signal approximates Y(t) closely. An even better fit could be attained by increasing the number of bases used in the Laguerre expansion. However, as Fig. 4(b) shows, increasing the number of bases from 5 to 10 yields a sizable improvement in reconstruction error, while from 10 to 20 the decrease in error is not as significant. This makes intuitive sense: the contribution of the higher-order terms to reducing the error tapers off as more terms are included. Therefore, in all our experiments, we set d to 10.
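The diminishing return from adding bases can be reproduced in a simplified setting. The sketch below fits a synthetic bi-exponential decay by ordinary least squares over continuous Laguerre functions l_k(t) = exp(−t/2) L_k(t); it ignores the instrument-response convolution and the Optnet/QPTH constraints of the actual model, and the lifetimes are hypothetical values chosen for illustration:

```python
import numpy as np
from numpy.polynomial import laguerre

def laguerre_basis(t, d):
    """Columns are the first d Laguerre functions l_k(t) = exp(-t/2) * L_k(t)."""
    B = np.empty((len(t), d))
    for k in range(d):
        coeffs = np.zeros(k + 1)
        coeffs[k] = 1.0                       # selects the k-th Laguerre polynomial
        B[:, k] = np.exp(-t / 2) * laguerre.lagval(t, coeffs)
    return B

def fit_decay(t, y, d):
    """Least-squares Laguerre expansion; returns coefficients and RMS error."""
    B = laguerre_basis(t, d)
    c, *_ = np.linalg.lstsq(B, y, rcond=None)
    err = np.sqrt(np.mean((B @ c - y) ** 2))
    return c, err

# Synthetic bi-exponential decay (hypothetical lifetimes, arbitrary time units).
t = np.linspace(0.0, 20.0, 256)
y = 0.7 * np.exp(-t / 0.4) + 0.3 * np.exp(-t / 2.5)

errs = {d: fit_decay(t, y, d)[1] for d in (5, 10, 20)}
```

Because the fitted subspaces are nested, the reconstruction error is non-increasing in d; in this toy example the drop from 5 to 10 bases is much larger than from 10 to 20, mirroring the behavior reported in Fig. 4(b).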

Evaluation of CRF classification
To study the utility of the CRF segmentation module, we replace the CRF layer with an implementation of K-means as a neural network. K-means is one of the most popular methods for clustering tasks. We use the KmeansNet implementation proposed by Marshland [58]. All other parameters, such as the training and test sets and the number of epochs, remain the same. Results can be seen in Table 2 and Fig. 5, demonstrating that our bimodal model with CRF as the last set of layers works much better than the model using KmeansNet. Generally, when KmeansNet is included, the data is over-segmented, which adversely affects the quality metrics. Notably, however, KmeansNet is faster than the model using CRF, and may be considered as an alternative when quicker computation times are desirable.
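For intuition about this baseline, the clustering step underlying a KmeansNet-style head is Lloyd's algorithm; the following is a minimal NumPy sketch (not the implementation of [58]) operating on per-pixel feature vectors:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means on an (n, f) feature array X
    (e.g. per-pixel lifetime features). Returns labels and centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster empties.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

Unlike the CRF, this assignment ignores spatial smoothness between neighboring pixels, which is consistent with the over-segmentation observed in Table 2 and Fig. 5.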

Evaluation of bimodal data for purposes of classification
To evaluate whether the bimodal model has any advantage over using a single modality (either FAD or NADH), we run our model in the unimodal setting, where there is only one Optnet layer (instead of two), whose output is then forwarded to the CRF layer. We repeat this experiment for NADH and FAD separately and report the same quality metrics as before on test data. Note that the training and test data are the same for each of these experiments. Table 3 shows the quantitative results of this experiment. As we can see, the bimodal model improves over each of the coenzymes used separately, especially NADH, where the improvement is substantial across all metrics. The qualitative segmentation results in Fig. 6 show that the bimodal approach leads to more accurate identification of microglia than either NADH or FAD used alone.
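The difference between the two settings amounts to what the classifier sees per pixel; a schematic sketch (function and array names are hypothetical, and the coefficient maps here are random placeholders) of the unimodal versus bimodal feature construction:

```python
import numpy as np

def build_features(coeffs_nadh, coeffs_fad=None):
    """Per-pixel feature vectors for the classification module.
    Unimodal: one (n_pixels, d) coefficient map is used as-is.
    Bimodal: the NADH and FAD coefficient maps are concatenated per pixel,
    giving the classifier a (n_pixels, 2d) feature array."""
    if coeffs_fad is None:
        return coeffs_nadh
    return np.concatenate([coeffs_nadh, coeffs_fad], axis=1)

# Placeholder maps: 100 pixels, d = 10 Laguerre coefficients per channel.
nadh = np.random.rand(100, 10)
fad = np.random.rand(100, 10)
unimodal = build_features(nadh)          # shape (100, 10)
bimodal = build_features(nadh, fad)      # shape (100, 20)
```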

Model parameters and time complexity
We set the number of iterations in each component to 10, though the output usually converges in fewer, particularly for the RNN part. The number of epochs is set to 5. We found that it took 30-40 seconds to analyze each sample in each layer (this value depends on the type of GPU used; in our case, an NVIDIA GTX 1080). With these settings, training the model took about 3 hours on average. We found that increasing the size of the training set improved the evaluation metrics slightly, but, as expected, the computational cost and memory demands increased as well. As future work, we will study how to distribute the processing across a cluster of GPUs (the Optnet implementation as provided by the authors does not currently allow this).

Conclusion
We propose a bimodal regression-classification neural network for classifying microglia using fluorescence lifetime images. Our model takes as input the lifetime data from two different coenzymes, NADH and FAD, and learns regression coefficients for an accurate representation, which are then used as features for the CRF-based classification module. The bimodal data provides a diverse data source for the classification model, which is evident in this approach outperforming the models using unimodal data. In future work, we plan to extend our model to include other forms of information about microglia, such as cell morphology and other shape features.