Analysing cerebrospinal fluid with explainable deep learning: From diagnostics to insights

Analysis of cerebrospinal fluid (CSF) is essential for diagnostic workup of patients with neurological diseases and includes differential cell typing. The current gold standard is based on microscopic examination by specialised technicians and neuropathologists, which is time‐consuming, labour‐intensive and subjective.


INTRODUCTION
The analysis of cerebrospinal fluid (CSF) is a key procedure for the diagnosis of central nervous system diseases. An indispensable element of CSF analysis remains the microscopic analysis of cytocentrifuged CSF samples to obtain differential cell counts [1,2]. Cell type determination and quantification are still crucial for the differential diagnosis of acute and chronic infections (e.g., the prevalence of granulocytes, activated lymphocytes and plasma cells) and subarachnoid haemorrhage invisible on CT and MRI imaging (e.g., detection of erythrophages, haemosiderophages and haematoidin crystals) as well as neoplastic meningitis [1,3-6].
Gold-standard morphological profiling of CSF cells relies on visual inspection by highly specialised technicians and/or board-certified neuropathologists, and manual counting is only feasible for high-quality samples of intermediate or low cell density [1,2]. While the results are highly dependent on experience and expertise and prone to observer bias and oversights, the procedures are labour-intensive and time-consuming, with limitations for quality control and economic scalability. Furthermore, specialised personnel are usually not available 24/7 for time-critical decisions in emergency situations [2].
Absolute quantification of CSF cells and erythrocytes by automated cytometry is applied in many laboratories to obtain a rough estimate of cell density. However, current commercial cytometers developed for blood cell analysis are unable to reliably count low-cellularity CSF samples (<30 cells/μl; normal: <5 cells/μl), to differentiate CSF cell types or to provide differential cell counts [2]. Deep learning algorithms have been developed for cell detection and differentiation in blood and bone marrow specimens. Recently, the detection of epithelial tumour cells in CSF samples by a neural network has been demonstrated, but no holistic approach addressing all relevant diagnostic questions in CSF diagnostics is currently available [7-9].
Although some algorithms for specific diagnostic tasks, such as the identification of blast cells in acute myeloid leukaemia in bone marrow samples, have reached human-level performance [10], very few have been clinically implemented. Besides the lack of generalisability of the algorithms to external datasets, one essential issue is that training data may not be representative of samples encountered in daily clinical work, which show higher variability in quality and specimen preservation [11]. Moreover, pathology and medicine in general are characterised by "long tail" distributions of diagnoses; that is, only a few disease entities make up the majority of cases, while a plethora of very rare differential diagnoses exists. Most current allegedly "clinical grade" AI approaches perform well for frequent diseases but are unable to properly classify rare cases [12]. Furthermore, the clinical value of algorithms is limited by oversimplification, that is, by reducing the complexity of the pathologists' task, which incorporates several layers of information into the decision-making and diagnostic labelling of samples [13].
We, therefore, set out to compile a real-world CSF dataset containing several sources of variation (e.g., different laboratory sample pre-processing, staining protocols, scanners, sample qualities, cell preservation states, object densities and common artefacts) and all diagnostically relevant cell types/objects to train a robust algorithm for cell type differentiation with the potential to solve complex diagnostic tasks. We developed a convolutional neural network (CNN), which reliably recognises 15 categories of diagnostically relevant CSF cell types and objects. To assess its realistic usefulness in diagnostic practice, we validated the CNN-based approach by comparing its performance to that of seven board-certified neuropathologists from different academic institutions. By using explainable AI methods, we identified morphological features learned by the model to discriminate between specific cell types, which allowed us to further validate the model and explain its current limitations.

CSF dataset collection, processing and annotation
We selected 128 CSF specimens from 78 patients, which were diagnostically evaluated at the Department of Neuropathology, Charité - Universitätsmedizin Berlin, between 2008 and 2020 (Table S1). Diagnoses included pleocytosis (n = 8), inflammation (n = 6), chronic haemorrhage (n = 20) and neoplastic meningitis (n = 44; Figure 1A). Slides were stained in two different laboratories according to standard procedures with May-Grünwald-Giemsa. The quality of each slide was assessed prior to digitisation, allowing for a mix of high, intermediate and low quality depending on the extent of autolytic and artificial changes (Table S1). Slides were scanned at a magnification of 40×.

Data partitioning for model selection
The dataset was partitioned into three subsets: a training set for learning of the DNN weights, a validation set for model selection and a test set for evaluation of the generalisation performance on unseen data. The partitioning was performed on a case level; that is, samples of one case belong to one partition (either training, validation or test) only, so that generalisation performance across patients can reliably be estimated (Appendix).
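To make the case-level grouping concrete, the following is a minimal sketch using scikit-learn's GroupShuffleSplit with toy stand-in data; the partitioning strategy actually used additionally balances rare classes across partitions (see Appendix and Figure S1), which a plain grouped split does not guarantee, and all variable names are illustrative.

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins: in the real dataset these are the cell annotations, their
# cell-type labels and the case (patient sample) each annotation belongs to.
annotations = list(range(20))
labels = [i % 3 for i in annotations]
case_ids = [i // 4 for i in annotations]   # four annotations per case

splitter = GroupShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
for fold, (train_idx, test_idx) in enumerate(splitter.split(annotations, labels, groups=case_ids)):
    # All annotations of a given case land on the same side of the split,
    # so generalisation across patients can be estimated without leakage.
    train_cases = {case_ids[i] for i in train_idx}
    test_cases = {case_ids[i] for i in test_idx}
    assert train_cases.isdisjoint(test_cases)
```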

Pre-processing and data augmentation
Around each cell annotation, a 128 × 128 px image was cropped from the tile. For annotations close to the boundary, mirror padding was applied to yield crops of the same size. From these crops, the training images were extracted by cropping 112 × 112 px patches at random locations and applying random blur with a Gaussian kernel, random rotation, mirroring and random colour variation (brightness and contrast). Following Tellez et al. [14], a rather strong augmentation was used. For evaluation purposes, patches were centred around the annotation, and no random transformation was applied. The patches were standardised by subtracting the mean and dividing by the standard deviation of each colour channel in the training set.
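As an illustration, the augmentation pipeline described above could be expressed with torchvision transforms roughly as follows; the transform types follow the description, but the magnitudes, probabilities and normalisation statistics are assumptions.

```python
import torchvision.transforms as T

# Placeholder per-channel statistics; in practice these are computed on the training set.
TRAIN_MEAN, TRAIN_STD = (0.8, 0.7, 0.8), (0.15, 0.18, 0.12)

train_transform = T.Compose([
    T.RandomCrop(112),                                       # 112 x 112 patch at a random location
    T.RandomApply([T.GaussianBlur(kernel_size=5)], p=0.5),   # random Gaussian blur
    T.RandomRotation(degrees=180),                           # arbitrary rotation
    T.RandomHorizontalFlip(),                                # mirroring
    T.RandomVerticalFlip(),
    T.ColorJitter(brightness=0.3, contrast=0.3),             # colour variation
    T.ToTensor(),
    T.Normalize(mean=TRAIN_MEAN, std=TRAIN_STD),
])

# Evaluation: centre the patch on the annotation, no random transformations.
eval_transform = T.Compose([
    T.CenterCrop(112),
    T.ToTensor(),
    T.Normalize(mean=TRAIN_MEAN, std=TRAIN_STD),
])
```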

Model architecture and training
As a backbone for our model, we used a VGG16 model pre-trained on ImageNet and fine-tuned it on our data [15]. The main training objective was a cross-entropy loss function, where the contribution of each class was weighted by the inverse of its relative frequency in the training set. For early stopping based on the validation set performance, macro-averaged sensitivity (i.e., balanced accuracy) was chosen to account for the imbalanced class distribution. Complementary to the cross-entropy, a consistency term was added to the loss function, similar to Bortsova et al. [16]: each patch was presented to the network in two differently augmented versions, and the consistency term penalised discrepancies between the corresponding predictions.
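A minimal sketch of this training objective in PyTorch, assuming per-class training counts are available; the squared-difference form of the consistency term and its weighting factor are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F
from torchvision import models

NUM_CLASSES = 15

# VGG16 backbone pre-trained on ImageNet with a new 15-class output layer.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = torch.nn.Linear(4096, NUM_CLASSES)

# Class weights proportional to the inverse relative class frequency.
class_counts = torch.ones(NUM_CLASSES)          # replace with actual per-class counts
class_weights = class_counts.sum() / (NUM_CLASSES * class_counts)

def loss_fn(logits_a, logits_b, targets, lambda_consistency=1.0):
    """Weighted cross-entropy plus a consistency term between two differently
    augmented views of the same patches (cf. Bortsova et al. [16])."""
    ce = F.cross_entropy(logits_a, targets, weight=class_weights)
    consistency = F.mse_loss(F.softmax(logits_a, dim=1), F.softmax(logits_b, dim=1))
    return ce + lambda_consistency * consistency
```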

Model evaluation, visualisation and explainability
Four individual models (CNN1-4, Figure 1D) were trained, each using one partitioning of the data. To evaluate the performance of the models on unseen data, a joint confusion table of the predictions of all four models on their respective test sets was computed.
The confusion table was then normalised such that each row sums to 1. This yields the sensitivity for each class on the diagonal. The average of these class-wise sensitivities is defined as the mean sensitivity (macro-averaged sensitivity). For the combined categories activated monocyte/macrophage (E) and tumour cell (J), we additionally report the sensitivity for each subgroup (i.e., activated monocyte and macrophage for E; carcinoma, melanoma, leukaemia and lymphoma for J), as well as the precision and the F1 score for each class (see Table S2). Note that precision and F1 score are not defined for the subgroups of E and J since the subclasses are not predicted by the model.
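The evaluation described above can be reproduced with a few lines of NumPy/scikit-learn; this is a generic sketch, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def macro_sensitivity(y_true, y_pred, labels):
    """Row-normalise the confusion matrix so that each row sums to 1; the
    diagonal then holds the per-class sensitivities (recall), and their mean
    is the macro-averaged sensitivity. To pool several folds, sum the raw
    confusion matrices of all folds before normalising."""
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    cm_norm = cm / cm.sum(axis=1, keepdims=True)
    per_class = np.diag(cm_norm)
    return cm_norm, per_class.mean()
```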
In order to verify that the models have learned meaningful representations, we projected the 4096-dimensional feature vectors (output of the penultimate layer of the CNNs) of the test samples into 2D by Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [18]. To complement the evaluation, explanation heatmaps were computed using Layer-wise Relevance Propagation (LRP) applying Zennit [19][20][21]. The explanation heatmaps were computed on each of the four models (CNN1-4) individually (for details, see Appendix).
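A sketch of the 2D projection step, assuming the torchvision VGG16 layout so that the 4096-dimensional penultimate-layer activations can be read off before the final classification layer; function and variable names are illustrative.

```python
import numpy as np
import torch
import umap  # umap-learn

def embed_penultimate_features(model, loader, device="cpu"):
    """Collect 4096-dimensional penultimate-layer activations of a VGG16
    and project them to 2D with UMAP."""
    model.eval()
    feats = []
    with torch.no_grad():
        for images, _ in loader:
            x = model.features(images.to(device))
            x = torch.flatten(model.avgpool(x), 1)
            x = model.classifier[:-1](x)    # stop before the final class layer
            feats.append(x.cpu().numpy())
    return umap.UMAP(n_components=2).fit_transform(np.concatenate(feats))
```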

Validation set (CSF500) for performance comparison to neuropathologists
To compare the performance of the CNN head-to-head with neuropathologists, we created a completely independent evaluation dataset (i.e., CSF500) consisting of 12 different patients with non-specific, inflammatory, haemorrhagic and neoplastic diagnoses (for original diagnoses, see Table S2). As for the training set samples, the CSF500 samples were scanned with a Hamamatsu NanoZoomer scanner (for details, see Table S3).
Raters were also asked to assign a diagnostic label to each CSF sample (i.e., inflammatory, haemorrhagic, neoplastic or a combination of two labels). For diagnostic labelling by the CNN, samples were determined as haemorrhagic (>8 h) in case an erythrophage, haemosiderophage or haematoidin crystal was detected and as inflammatory upon the presence of either plasma cells or granulocytes. In case both haemorrhagic and inflammatory cells were present, the most abundant cell types determined the diagnosis; in case of equality, "haemorrhagic" was chosen, because granulocytes may be derived from peripheral blood. The presence of a single tumour cell resulted in the additional label "neoplastic." In cases where tumour cells were the most abundant cell types, the sample was diagnosed as "neoplastic" only.
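These labelling rules amount to a small decision procedure over the per-sample cell counts; a sketch follows, with illustrative category names (the exact set of granulocyte categories counted as inflammatory, and the fallback label for samples without relevant cells, are assumptions).

```python
HAEMORRHAGIC = {"erythrophage", "haemosiderophage", "haematoidin_crystal"}
INFLAMMATORY = {"plasma_cell", "neutrophilic_granulocyte", "eosinophilic_granulocyte"}

def diagnostic_label(counts: dict) -> list:
    """Derive diagnostic labels from per-category object counts of one sample."""
    h = sum(counts.get(c, 0) for c in HAEMORRHAGIC)
    i = sum(counts.get(c, 0) for c in INFLAMMATORY)
    t = counts.get("tumour_cell", 0)

    labels = []
    if h or i:
        # The more abundant category decides; ties default to haemorrhagic,
        # because granulocytes may derive from peripheral blood.
        labels.append("haemorrhagic" if h >= i else "inflammatory")
    if t:
        if t >= max(h, i):
            labels = ["neoplastic"]       # tumour cells most abundant: neoplastic only
        else:
            labels.append("neoplastic")   # even a single tumour cell adds the label
    return labels or ["non-specific"]

# Example: a sample with three haemosiderophages and one tumour cell
print(diagnostic_label({"haemosiderophage": 3, "tumour_cell": 1}))
# ['haemorrhagic', 'neoplastic']
```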

Performance evaluation and statistics
Statistical analysis was conducted using R (version 3.6.3). For performance evaluation, precision (true positives/(true positives + false positives)) and sensitivity (true positives/(true positives + false negatives)) were calculated. Overall inter-observer reliability was measured using Krippendorff's alpha coefficient, as it is recommended in case of missing observations (label P: not applicable) [22]. Calculation of Fleiss' kappa coefficients additionally allowed the comparison of inter-rater reliability for individual cell type categories. Both coefficients were calculated using the R package irr (v0.84.1) and yielded highly similar results (mean difference 3.05 × 10⁻⁵). Coefficient matrices were visualised using corrplot v0.92 [23]. Class-wise receiver operating characteristic (ROC) curves were computed in a one-vs-rest fashion. The average AUC was calculated using the one-vs-one approach by Hand et al. [24]. ROC analysis was performed using scikit-learn (version 1.0.2; http://scikit-learn.sourceforge.net).
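For the ROC part, the corresponding scikit-learn calls are sketched below; `roc_auc_score` with `multi_class="ovo"` implements the Hand and Till one-vs-one average. Input shapes and function names are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import label_binarize

def class_wise_roc(y_true, y_score, classes):
    """Class-wise one-vs-rest ROC curves plus the one-vs-one average AUC
    (Hand & Till). `y_score` is an (n_samples x n_classes) array of
    predicted class probabilities."""
    y_bin = label_binarize(y_true, classes=classes)
    curves = {cls: roc_curve(y_bin[:, k], y_score[:, k])[:2]   # (fpr, tpr) per class
              for k, cls in enumerate(classes)}
    mean_auc = roc_auc_score(y_true, y_score, multi_class="ovo", average="macro")
    return curves, mean_auc
```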

Multiclass prediction
After training the neural network, we first used the test set to estimate the performance on unseen data. We repeated the training four times on different partitions of the dataset (four-fold cross-validation, i.e., CNN1-4, see Figure 1D). Confusion matrices of the four CNNs were summarised and are shown in Figure 3A (see also Table S4).
Very good performance was noted for physiological cell types (sensitivity of 87% for lymphocytes and 82% for monocytes) as well as for key cell classes relevant for diagnostic labelling (e.g., neutrophilic granulocytes 96%, tumour cells 79% and haemosiderophages 78%). The confusion matrix shows that misclassification mainly occurred between related morphological classes of the lymphoid or myeloid lineage, which also represent a challenge to human evaluators. For example, a substantial overlap was noted between plasma cells and activated lymphocytes, which belong to a morphological continuum of B-cells with signs of cellular activation.
When applying a more tolerant classification scheme allowing for confusion between these two categories, sensitivity increased to 90% for activated lymphocytes and to 96% for plasma cells (Table S2). Overall sensitivity was significantly better in samples of low cellularity (e.g., distinctly localised cells without overlapping or abutting cell borders) and of high or intermediate quality (Figure 3B). We also saw a high correlation of sensitivity with the number of training samples available for each category (Figure 3C).

Morphological class features learned by neural networks form distinct clusters
To analyse whether the models learned meaningful representations, the high-dimensional feature vectors of CNN1 were visualised in 2D by applying UMAP, which shows that morphologically defined cell types were projected into discrete clusters in the 2D representation (Figure 4A, see also Figure S3).

Cell type recognition by the CNN is based on comprehensible and visualisable cytological features
To further elucidate the most relevant image regions for the CNN predictions, we used layer-wise relevance propagation (LRP) to identify image regions (down to pixel resolution) and features with high informative value for classification. Figure 5 shows the LRP heatmaps for examples of correctly and incorrectly classified objects.

FIGURE 5  Pixel-level image regions relevant for classification. Shown is the original image (with object ID, ground truth label and ensemble prediction) as well as a feature heatmap calculated for each of the four individual CNNs based on layer-wise relevance propagation (LRP). Pixels highlighted in red are evidence in favour of the predicted class, whereas blue pixels are evidence against it (i.e., counter-evidence). The degree of confidence is given as a probability (p, range 0-1), with high numbers representing high confidence. Examples of correctly (✔) and incorrectly (✖) classified objects are selected, and the respective cytological feature relevant to human recognition is specified. Misclassified cells represent common difficulties with segmentation (e.g., defining borders in a complex environment with abutting or overlapping objects; 379, 404 and 293), feature similarity (379), rare and blurred objects with misfocus on isolated sharp structures ignoring context (404) and too small image patches (293).

Lower inter-rater agreement was mainly observed for those cell types belonging to a morphological continuum (e.g., of the lymphoid or myeloid lineage; Figure 6B). Strikingly, there was a very low agreement between raters 1-7 and the GT as well as the CNN for the categories activated lymphocyte (B) and shrunken cell (L), which may be attributable to institutionally defined and non-standardised morphological cut-offs and consensus in continuous cellular categories, especially relevant for shrunken cells (i.e., delineation of artificially condensed cells from naked-nucleic lymphocytes).

Most CSF samples were correctly diagnosed by the ensemble CNN
CSF samples were correctly labelled based on the CNN predictions in 10/11 cases (Figures 7A and S4). The CNN showed a high capability of correctly classifying diagnostically relevant cell types and deriving diagnostic labels from them.

DISCUSSION
Machine learning algorithms have been developed for a wide range of image-based diagnostic tasks in pathology [11,25]. However, only a few of these have been translated into clinical applications and are currently used in diagnostic settings [11,26,27]. In this study, we aimed to develop an AI-based automated cytological/CSF image analysis approach, for which we compiled a large dataset with 123,181 annotations of digitised CSF samples from 78 patients and trained a deep neural network capable of multiclass recognition of diagnostically relevant CSF cells and diagnostic labelling of CSF specimens. Our aim was to create a routine diagnostic-grade AI tool and a real-world dataset that can serve as a reference to design and improve machine learning solutions for CSF as well as other cytological diagnostics, but also to consider and address obstacles in the dataset design and partitioning strategy that may hamper clinical translation.
Several machine learning approaches have been developed for the analysis of peripheral blood smears and bone marrow samples and were capable of differentiating different categories of cells, for example, leukocyte subtypes, as well as normal and neoplastic cells [7,28]. Furthermore, deep learning models even proved feasible in discriminating between different haematopoietic cancer subtypes [29].
Recently, Yu et al. demonstrated the proficiency of a CNN in identifying epithelial cancer cells in CSF samples [9]. The authors showed that the accuracy of the CNN was similar to that of experts and that the CNN was superior to humans in predicting the carcinoma origin while reducing the working time by 90%. However, none of these works addressed the full spectrum of diagnostically relevant CSF objects or provided an in-depth explanation of the model behaviour, as presented in this study.
One of the reasons for limited utility in clinical settings is that models are developed on high-quality samples that differ from those encountered in daily diagnostic routine and that do not consider the numerous sources of variance influencing morphological representation in images: laboratory-specific sample handling, staining protocols, scanner settings and especially the degree of cell deterioration over time. Instead of compiling a dataset consisting of optimal-quality samples, which may achieve high performance metrics but result in low generalisability of the model, we aimed to design a real-world dataset reflecting the realistic and full spectrum of samples encountered in daily work, including samples of different quality and object density (see Figure 3). We paid special attention to representing the full variability of the data, including rare cell types. Hence, we overrepresented tumour samples in our training dataset, since these feature a high degree of variability, e.g., due to different tumour origins and degrees of differentiation, such that a large number of samples is required to train a deep neural network. For this reason, the prevalence of cell types in the training set does not reflect a real-world situation in daily diagnostics; e.g., tumour cells are usually recognised in only 5-10% of CSF samples [30-32] but were over-represented in our dataset with 56% of CSF samples (see Figure 1). Even though this led to a slight over-sensitivity to tumour cells (see Figure 7), we argue that in a clinical setting, false positive tumour predictions are less severe than false negative ones.
In clinical settings with a wide prevalence distribution of common and very rare conditions (i.e., long-tail distribution) [11], datasets are usually highly imbalanced, which represents a potential further limitation to external validity. In imbalanced datasets, the allocation of a limited number of clinical samples with multiple categories into training, test and validation set is complicated as the assignment of one patient to more than one set may result in overfitting. Compared with previous work [7], we allocated samples of individual patients exclusively to one set (e.g., training, validation or testing set), which is crucial for an accurate performance estimation because samples of one case cannot be considered statistically independent and partitioning per sample rather than per case will optimistically bias performance metrics [33]. However, due to this restriction, it is complicated to divide the dataset in such a way that classes are balanced. We address this issue by proposing a novel data partitioning strategy (see Appendix and Figure S1).
Because of the naturally limited number of rare objects in clinical datasets, there is a need to develop not only suitable partitioning strategies but also intuitive ways to leverage the data-derived representations of learning systems to infer general concepts [11]. Identifying the underlying substructures in a comprehensive image context that resulted in the correct classification of an object by using explanation heatmaps may help to approach new solutions to study algorithmic understanding and transferable conceptualisation. To achieve this, the field of explainable AI has recently advanced with methods such as GradCAM, SmoothGrad and Layer-wise Relevance Propagation [34,35]. These methods have already been applied for deep learning in pathology, for instance, in tumour classification [36], cytology [7] and morphological biomarker discovery [37].

FIGURE 7  Performance of the ensemble CNN on the CSF500 dataset in comparison to neuropathologists. (A) Comparison of diagnostic labels for the 12 CSF samples assigned by the raters and the ensemble CNN. Sample 1* was not rated because the selection of four eosinophilic granulocytes misleadingly created an inflammatory impression in a slightly sanguineous sediment with unspecific changes. For neoplastic CSFs, the number of tumour cells detected compared with the ground truth is provided. Of note, clinical information was not provided to the raters before the evaluation. The ensemble CNN proved capable of correctly classifying diagnostically relevant cell types and assigning the correct diagnostic label in 10/11 cases. Misclassification of two inflammatory samples was based on the identification of one and six tumour cells in cases 3 and 9, respectively. (B) The single plasma cell misclassified as a tumour cell by the ensemble CNN is shown, as well as the individual predictions and explanation heatmaps of all four CNNs, which highlight a plasma cell with unusual cytoplasmic blebs. (C) An image example of case 9 is depicted with the individual votes of human raters, the ensemble CNN and the ground truth (GT). The CNN misclassified fewer cells as tumour cells compared with the human raters. Correctly identified cells are highlighted in green in the table, yellow indicates cells that belong to a morphological continuum and may be considered as correctly labelled, and red corresponds to incorrectly classified cells. Abbreviations: A, lymphocyte; B, activated lymphocyte; C, plasma cell; E, activated monocyte/macrophage; J, tumour cell; M, mitosis.
Using explanation heatmaps allowed us to visualise cellular substructures in correctly and incorrectly classified cells and align them with features relevant to human object recognition in cytological specimens. Furthermore, we identified specific features that provided counter-evidence against a certain prediction (e.g., an extracellular deposit reminiscent of haemosiderin in a mitosis example, see Figure 5, 404).
The visualisations make the otherwise latent classification rules transparent: they seem to be based not only on the recognition of features in favour of a certain category but also on information learnt to be highly specific for other categories, resembling in some respects the way human experts weigh evidence. Tumour cells, for instance, are recognised by features such as cellular heterogeneity, a skewed nuclear-to-cytoplasmic ratio and increased cytoplasmic basophilia [5,39].
Among neuropathologists, it is well known that lymphocytic activation in inflammatory conditions, such as viral meningoencephalitis or neuroborreliosis, may be so pronounced that it mimics malignant lymphoma [1]. Especially in cases where clinical information is not provided, it represents an ongoing dilemma to distinguish between neoplastic and inflammatory lymphocytes [5,39], as exemplified by case 9 of the CSF500 set (viral meningoencephalitis), which was labelled as neoplastic by three of seven neuropathologists and the CNN. Compared with the CNN, neuropathologists were able to consider context information of the 2000 × 2000 pixel images containing several other cell types, as well as information on preservation status, overall staining intensity and cellularity of a sample, which is not available to the CNN that classifies images only based on small patches containing a single item.
Implementing dynamic patch sizes similar to Hashimoto et al. [40], i.e., a varying instead of a fixed input size and resolution, in the design of the CNN may increase classification accuracy not only for large cell types, which are currently insufficiently enclosed in the fixed patches (see Figure 5, 293), but also by incorporating context information available to human raters.
To obtain differential cell counts, determination of the exact cell type is necessary; however, neuropathological assessment of CSF is subject to inter-observer variability, as shown above. We also noted a varying capability of the individual CNNs, each trained on a different selection of the training dataset, to identify specific classes more or less correctly, which suggests that the variance of the data is very high and that the model would benefit from even larger and more diverse datasets. Furthermore, providing confidence values together with the class predictions to diagnosticians will help overcome algorithmic aversion and increase the safety of using CNNs as screening tools in CSF diagnostics. A valuable extension of the current model will be the implementation of a calibration model with class-wise thresholds and labelling of unclassifiable objects, which could be directly delegated to human evaluation, similar to what has been introduced with the concept of calibrated classifier scores in the DKFZ brain tumour classifier [26].
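One possible form such a calibration step could take is sketched below with hypothetical class-wise thresholds; the actual calibration model is proposed here as future work and is not part of the current system.

```python
import numpy as np

def predict_with_thresholds(probs, class_names, thresholds):
    """Accept a prediction only if its probability exceeds a class-wise
    threshold; otherwise flag the object for human evaluation.
    `probs` is an (n_objects x n_classes) array of softmax outputs and
    `thresholds` a per-class vector; both are placeholders here."""
    labels = []
    for p in probs:
        k = int(np.argmax(p))
        labels.append(class_names[k] if p[k] >= thresholds[k]
                      else "unclassifiable - refer to human evaluation")
    return labels
```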
In this work, we validated the capability of our model to classify CSF cells based on an independent dataset of 511 cells. Although the performance was already good enough to derive broad diagnostic labels for 11 clinical cases, a thorough and detailed prospective clinical validation period cannot be replaced and needs to be conducted in parallel with routine diagnostics in the future. Our algorithm provides the basis for developing an automated, transparent and validated diagnostic assist system, which may be available 24/7, also in emergency situations, e.g., at night or on the weekend, when neuropathological diagnostic service is usually not provided. Furthermore, whole slide image classification in a fully automated workflow is expected to take only a few minutes (e.g., scan time: 1-3 min, whole slide image tiling: 10 s, CNN-based cell classification: 2 ms per cell), depending mainly on the cellularity of the sample, and to reduce the turnaround time from lumbar puncture to diagnosis, as has recently been shown by Yu et al. [9]. The integration of our system with further improvements in handling images from portable microscopes or even smartphones may translate cytological CSF diagnostics from laboratories to clinical bedside applications.
Besides automation and improvement of diagnostic routine processes, machine learning algorithms have already demonstrated proficiency in refining tumour classification and determining cancer origin based on methylation profiles [26,27]. Analysing epigenetic patterns in large cohorts of various brain cancers even proved capable of identifying new cancer subtypes and new tumour entities that went unrecognised in small datasets and previous histology-based tumour diagnostics [42,43]. Similarly, extending the reference dataset of the CSF classifier and applying unsupervised machine learning approaches could result in the identification of new, clinically and prognostically meaningful cell types or delineate novel distinct cell states along a common lineage and differentiation trajectory in neoplastic and inflammatory CSF samples, which could have been missed by human microscopic analysis so far.

SUPPORTING INFORMATION
Additional supporting information can be found online in the Supporting Information section at the end of this article.

APPENDIX

Data partitioning

The case-level partitioning strategy is illustrated in Figure S1. Note that this partitioning strategy does not necessarily yield disjoint train, validation and test sets across partitionings (e.g., a case could be used in the train set of both partitions 1 and 2), in contrast to standard stratified cross-validation without grouping (e.g., used in [7]).

Explanation heatmaps
For the computation of LRP heatmaps, we followed the recommendations by Montavon et al. [21]: the ϵ-rule is used for dense layers and the (α = 2, β = −1)-rule for convolutional layers. The input to the LRP backpropagation is a vector that is 1 for the predicted class and 0 for all other classes. Similar to Matek et al. [7], we experimented with GradCAM heatmaps but found that their resolution is too coarse to capture the fine details that are important to understand the models [44]. For instance, for the 112 × 112 input images used in this work, the GradCAM heatmap would be only 7 × 7 pixels wide (16× coarser resolution than the input). In comparison, our LRP heatmaps have the same resolution as the input image and allow us to examine the predictions at the pixel level. Moreover, pixel-level heatmaps of the common ResNet architecture (such as the SmoothGrad heatmaps used by Matek et al. [7]) have a peculiar grid pattern; that is, the heat is concentrated on pixels that lie on an equally spaced grid. This issue has been discussed previously by Hui et al. [45] and is not an artefact of heat mapping but can be directly linked to the architecture. Thus, the grid artefact is visible both in LRP and SmoothGrad heatmaps [7]. Since this artefact impedes the interpretation of the heatmaps, we opted to use the more conservative VGG16 architecture, which does not show this artefact.
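For reference, a minimal sketch of how such heatmaps can be obtained with Zennit is given below, using its EpsilonAlpha2Beta1 composite (ε-rule for dense layers, an α/β rule for convolutional layers); the exact rule parameters and the helper names are assumptions, and the model is assumed to follow the torchvision VGG16 layout.

```python
import torch
from zennit.attribution import Gradient
from zennit.composites import EpsilonAlpha2Beta1

def lrp_heatmap(model, image, target_class):
    """Pixel-level LRP heatmap for a single 112 x 112 input patch."""
    composite = EpsilonAlpha2Beta1()
    # Relevance at the output is initialised as a one-hot vector at the target class.
    num_classes = model.classifier[-1].out_features
    one_hot = torch.eye(num_classes)[target_class].unsqueeze(0)
    with Gradient(model=model, composite=composite) as attributor:
        _, relevance = attributor(image.unsqueeze(0), one_hot)
    # Sum over the colour channels to obtain a per-pixel relevance map.
    return relevance.squeeze(0).sum(dim=0)
```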