Development of deep learning models for microglia analyses in brain tissue using DeePathology™ STUDIO

BACKGROUND
Interest in artificial intelligence-driven analysis of medical images has increased steeply in recent years. This paper therefore aims to promote and facilitate the use of this state-of-the-art technology among fellow researchers and clinicians.


NEW METHOD
We present custom deep learning models generated in DeePathology™ STUDIO without the need for background knowledge in deep learning or computer science, accompanied by practical suggestions.


RESULTS
We describe the general workflow in this commercially available software and present three real-world examples of how to detect microglia on IBA1-stained mouse brain sections, including the differences between the models, their validation results and the analysis of a sample slide.


COMPARISON WITH EXISTING METHODS
Deep learning-assisted analysis of histological images is faster than classical analysis methods and offers a wide variety of detection possibilities that are not available with methods based on staining intensity.


CONCLUSIONS
Reduced researcher bias, increased speed and extended possibilities make deep learning-assisted analysis superior to traditional methods for evaluating histological images.


Introduction
Evaluation of histological sections is a core practice in clinical and research routine. In recent years, interest in deep learning-assisted histological analysis has spiked from around 20-50 publications per year to 300-600 per year (2018-2020). This reflects, among other factors, changes in availability and technological advances (i.e. faster computers) that make deep learning more accessible (Shen et al., 2017).
Deep learning has clear advantages over conventional methods. It can save 90% of analysis time (Bascunana et al., 2021) and improves detection sensitivity (Klein et al., 2020). However, developing deep learning algorithms often involves computer scientists with expertise in artificial intelligence (AI). Here, we used DeePathology™ STUDIO (further referred to as STUDIO), a commercially available do-it-yourself platform for developing custom deep learning-based models without the need to write computer code. Slides are stored locally and do not need to be uploaded to an external third-party server.
We have recently shown that STUDIO is an appropriate tool to produce results with high quality and improved reproducibility between individual researchers. The number and size of β-amyloid plaques and activated microglia determined with deep learning models correlated closely with values obtained from our classic method of object segmentation with AxioVision software (Carl Zeiss Microscopy Deutschland GmbH, Jena, Germany) (Bascunana et al., 2021). Additionally, deep learning opens the door for refined object classification (Bascunana et al., 2021). While our previous publication focusses on the comparison of two different analysis methods, it does not go into detail about how the models were trained.
In this follow-up report, we provide a comprehensive guide on how to successfully develop custom models, particularly using the versatile Objects mode of STUDIO. Despite being easy to use, this new tool may seem intimidating to users, especially those not familiar with AI-assisted analysis. We give practical suggestions on how to start and what to be aware of during the process. Moreover, we demonstrate the vast possibilities of deep learning models by comparing three different models developed with the same staining and tissue type.

Hardware and software
We used DeePathology™ STUDIO (DeePathology Ltd., Ra'anana, Israel) in the April 2021 version on a virtual machine with Windows 10 Enterprise N, two Intel® Xeon® Gold 6154 processors, an NVIDIA GRID RTX6000P-24Q graphics card and 32 GB random-access memory (RAM).

Project initiation
Currently, STUDIO features four different modes that use different neural networks as a basis:
Regions: Ideal for regional segmentation, i.e. classification of large areas into different categories based on their cellular pattern.
Cells: Best suited to detect and classify cells with a typical round/oval shape and smaller size. Works with isolated cells (e.g. blood smears or cytospins) and single cells in tissue structures, e.g. detecting NeuN-positive neurons. Less suitable for detecting non-round cells like microglia or astrocytes.
Objects: Versatile approach that uses instance segmentation neural networks and excels when identifying different cells within a tissue or other objects such as plaques. Objects can have any shape or size. Less suitable to detect many distinct objects close to each other in dense areas (in this case, "cells" may be better suited).
Tiles: Used for analysis of larger fields for pattern/tissue recognition. The given area is divided into squares and each square is classified into a predetermined category.
We used the "Objects" mode to develop our microglia-detecting algorithms.

Establishing categories and settings
After the project had been initiated, we added one or more slides to the project. These slides contained representative examples of all categories of interest and were similar to the slides to be analysed.
First, we added all categories that should be included in the model.
Category 1 is always background, i.e. everything that is not supposed to be detected as an object.
Note: Adding or removing categories later on will reset the model and training will start from zero.
The user can choose the most suitable settings for the algorithm now (see Table 1), but all settings (besides categories) can also be changed and adapted later on if needed. Next, we chose 5-10 perfect examples of objects belonging to each category (including background), marked them and assigned them to the correct category. When annotating an object, we did not include any other objects in the selection, even if they were overlapping with the desired object.
Note: When thinking about categories, users should have a precise picture in mind of what objects in each category look like. Training an algorithm requires clear and unambiguous input: if you are confused, the algorithm will give confusing results.
Choosing the background was particularly important: we selected different areas containing objects that were not supposed to be detected, not only empty areas commonly referred to as "background".

Initiating learning and monitoring progress
As soon as we had selected enough objects per category (at least 5), STUDIO started generating an algorithm from the examples. The ongoing learning process was indicated by "training in action" in the menu bar.
We monitored the learning process under "Graphs". In Objects mode, the graph showed the loss function used for training the neural network, i.e. "epochs" on the X-axis and "training loss" on the Y-axis (Fig. 1). One epoch means that each annotated sample has been included in forming the algorithm. The time required for each epoch depends largely on the computational power of the machine and the number of objects included in the training set. Training loss, the other parameter in the chart, is an indicator of the accuracy of the algorithm's predictions. Zero represents perfect predictions; the higher the training loss, the less accurate the prediction. After we corrected the algorithm (i.e. changed categories of detected objects or added new objects), the training loss increased until the algorithm had adapted to include the new data.
As time passed, the training dataset was analysed repeatedly. The epoch number increased and the training loss generally decreased (as long as we did not give new input), indicating improvement of the algorithm.
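For readers curious what "epoch" and "training loss" mean under the hood, the idea can be illustrated with a deliberately simplified, generic training loop. This is a conceptual sketch of supervised training in general, not of STUDIO's internal neural networks; the toy model and learning rate are invented for illustration.

```python
# Illustrative sketch (not STUDIO internals): one epoch = one pass over
# every annotated training sample; training loss = average prediction
# error over that pass, with 0 representing perfect predictions.

def train(samples, epochs=20, lr=0.1):
    w = 0.0  # single parameter of a toy linear model: prediction = w * x
    loss_per_epoch = []
    for _ in range(epochs):
        total = 0.0
        for x, y in samples:       # each annotated sample is seen once per epoch
            err = w * x - y
            total += err ** 2      # squared-error loss for this sample
            w -= lr * 2 * err * x  # gradient-descent update
        loss_per_epoch.append(total / len(samples))
    return w, loss_per_epoch

# Toy "annotations": points on the line y = 2x
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, losses = train(samples)
# The loss shrinks from epoch to epoch, mirroring the curve in Fig. 1
```

As in STUDIO's graph, the loss falls as long as no new, conflicting input arrives; adding corrected samples would temporarily push it back up.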

Increasing training dataset
When the basic object recognition worked, the training dataset needed to be increased to ensure broad applicability. At this point, images/scans with varying staining intensity were added to include objects from different slides in the training dataset.
We employed multiple ways to increase the training dataset: We added new objects manually, through test runs and inside the gallery.
Manually: We added new objects at any point by drawing the outline by hand. This was the slowest method, but at the same time perhaps the most accurate, as we had full control over the exact outline. This method was most relevant at the beginning of the training phase, when the algorithm's predictions were not yet accurate enough. However, it also became relevant later on. We noticed that while the algorithm was recognizing the correct objects, it did not trace them as accurately as we did by hand (Fig. 2). With the addition of less accurately outlined objects and the subsequent changes in the algorithm, the margin between the desired outline and the detected outline grew larger, especially when we detected objects with a complex (i.e. not round) shape. In this situation, we added manually outlined objects from time to time to improve detection.

Test runs (detection-evaluation process):
We used preliminary versions of the algorithm to run it on specific areas of the slide, where it outlined predicted objects in the current field of view. We then added one or more of these predicted objects to the training dataset by right-clicking on them. We used the flow chart (detection-evaluation process, Fig. 3) as a decision aid and made sure to review predicted objects at a relevant magnification. Initially, we chose smaller areas (i.e. higher magnification) for a test run. Once the model's predictions were more confident, we also analysed larger areas. Importantly, the detection-evaluation process was also performed on "complicated" areas, i.e. areas with different structure and areas of transition (tissue type borders, varying tissue structure, and edges of tissue), dust specks or staining artefacts, tissue folds, etc.

Fig. 2. Manual vs. algorithm-based object outline. The screenshot from DeePathology™ STUDIO shows a side-by-side comparison of a manually outlined object (right) and an object outlined by the algorithm (left). Both objects are part of the training dataset of approach 3 and belong to category 2 "microglia" (see Table 2). The algorithm outlines tended to include more background and to have a rounder shape than the manual outline.

Gallery: Inside the gallery, we added suggested objects to any of the categories. We filtered displayed suggestions e.g. by slide or category, or displayed only "confusing objects". We used this as a quick way to add many objects to the dataset. If "AI mode" was switched on, predicted objects were automatically classified and added to the respective category. When using "AI mode", it was crucial to monitor the training dataset for false predictions that might flaw the algorithm.
The larger the training dataset became, the longer each epoch lasted. To see the effects of newly added objects on the model, we needed to wait for a few epochs after adding the objects until the algorithm had adapted (i.e. the training loss had decreased again). Then, we tested the algorithm again.

Reviewing annotations
We used the "annotation review" feature to review the training dataset, i.e. all annotations or annotations belonging to a specific category. We eliminated incorrectly categorized objects that would confuse the model and may negatively influence its performance. The displayed objects were automatically sorted according to their fit into the assigned category with objects fitting worst (and thus, having the highest chance of being incorrect annotations) appearing on top of the list.

Version control
With most models, we reached a point where we were satisfied with the predictions but wondered whether the model could be improved further. We also let the model train over a longer period to see how accurate it could become. In either case, we kept a snapshot of the current status as a backup and potential restoration point. A snapshot always contains all information about the current model, including object and region annotations and the current algorithm.
Particularly when training for a longer period, we changed the setting "Model version to use" to "Best" instead of "Latest" to always keep a copy of the best model version up to that point.
We also used snapshots as templates for new experiments that used the same model (but different slides), and to share a model among colleagues.

Validation
To validate the model, we started by defining a test area. Within this area, we annotated all objects of all categories correctly by hand (except background). Everything that was not annotated was considered background. We ran the validation to obtain statistics on the accuracy of the model (precision and recall). Precision is the rate of true positives among all detected objects. Recall is the rate of true positives among all manually annotated objects and is also referred to as the model's sensitivity. STUDIO calculates true/false positives based on the overlap of the predicted objects with the manually annotated objects. The results range from 0 to 1, with higher values indicating better performance. Validation provides a useful tool to put a model's performance into numbers. We also used it while still training the model. A low precision value indicated that further training should focus on reducing false positive detections. On the other hand, a low recall value indicated that the algorithm was still missing many objects (false negatives). Frequent validation helped make the training annotations more goal-oriented.

Fig. 3. Detection-evaluation process. The flow chart depicts a possible decision tree for whether or not to add a predicted object to the training dataset (Section 3.4). The user has to decide from the beginning how much of the ramifications they want to include in the object (e.g. should the algorithm also bridge small gaps or be very conservative). Moreover, the algorithm will initially detect ramifications without a (clear) soma as objects. Indecisiveness of the user (when do you count a soma as a soma?) and the variability of microglia shapes and section planes will determine how fast the algorithm improves. Over time, the detected objects will appear more and more round (i.e. more background around the ramifications will be included in the object and more ramifications will be slightly cut at the end). To counteract this effect, manually trace out microglia as needed. Strictly speaking, "microglia" is a subcategory of "cluster", as every microglia cluster obviously consists of several microglia. However, this training mainly focused on the difference between resting or moderately activated microglia (category 1) and clusters of highly activated and/or dystrophic microglia (category 2).
However, we did not rely purely on the validation results to judge the algorithm. We compared the validation annotations to the predicted objects, and checked which objects were missed or falsely recognized and whether these mistakes were gross errors or within the margin of error that is to be expected under real-life conditions. Moreover, we decided individually whether we were more willing to accept false positives or false negatives, depending on the purpose of the analysis.
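To make the overlap-based definitions of precision and recall concrete, consider the following minimal sketch. This is our reading of the general principle, not DeePathology's actual matching implementation; the intersection-over-union threshold of 0.5 is an assumption for illustration.

```python
# A predicted object counts as a true positive when it overlaps a manual
# annotation strongly enough (intersection-over-union above a threshold).

def iou(a, b):
    """Intersection-over-union of two objects given as sets of pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def precision_recall(predicted, annotated, iou_threshold=0.5):
    matched = set()
    true_pos = 0
    for pred in predicted:
        for i, ann in enumerate(annotated):
            if i not in matched and iou(pred, ann) >= iou_threshold:
                matched.add(i)
                true_pos += 1
                break
    precision = true_pos / len(predicted) if predicted else 0.0  # TP / all detections
    recall = true_pos / len(annotated) if annotated else 0.0     # TP / all annotations
    return precision, recall

# Toy example: three annotated objects; the model finds two of them,
# plus one false positive (objects given as sets of (x, y) pixels).
ann = [{(0, 0), (0, 1)}, {(5, 5), (5, 6)}, {(9, 9)}]
pred = [{(0, 0), (0, 1)}, {(5, 5), (5, 6), (5, 7)}, {(20, 20)}]
p, r = precision_recall(pred, ann)
# p = 2/3 (one false positive), r = 2/3 (one missed object)
```

This also shows why both numbers are needed during training: adding detections raises recall but can drag down precision, and vice versa.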
Note that validation and training datasets must be kept separate. Thus, STUDIO never uses objects inside the validation area for training of the algorithm. To ensure separation, we always annotated validation objects after the area had been marked as a validation area. Additionally, we performed validation on a separate slide that was not used for training purposes. We also froze the algorithm and saved the project and a snapshot before validation (see Section 3.6). This allowed us to revert the model to the same status used during validation.

Fig. 4. Screenshots of the same slide analysed with three different deep learning algorithms. Cortical region of a coronal brain section stained against IBA1. (A) 2.5x magnification view. Objects with magenta outline are categorised as microglia somas (algorithm 1) or activated microglia (algorithms 2 and 3). Objects outlined in green are microglia clusters (algorithm 3). Black squares indicate the areas selected for higher magnification views. Scale bar represents 500 µm. (B-D) 20x magnification views of the selected areas, analysed with all three different algorithms as in A. All screenshots were taken directly in DeePathology™ STUDIO. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 3
DeePathology™ STUDIO settings used for the presented algorithms.

Analysing experiments
When we were satisfied with the recognition and categorization capabilities of the algorithm, we switched off training mode (i.e. switched the "Freeze status" from "running" to "frozen" in the settings). This preserved the status of the algorithm. The algorithm was then ready to analyse experimental data.
Be aware that as long as you keep the training running, the algorithm will change. This means that different slides are effectively analysed with (slightly) different algorithms. It is therefore highly recommended, if not mandatory, to switch off training during analysis. If needed, training mode can be enabled again at any point, for example to optimize the algorithm for a different staining intensity or tissue type. However, before resuming training the algorithm, it is recommended to save a snapshot for reproducibility reasons.
We analysed either one slide at a time or multiple slides in batch mode. To analyse one slide, we marked the region of interest (Viewers > Report > Mark) and then generated the report. To analyse multiple slides, we first marked all regions of interest (Add slides to project, then Viewers > Report > Mark), then selected all slides (Report > Start report).
In both cases, we specified in the settings (Table 1) to include raw data (i.e. category, size and X/Y-coordinates of all detected objects) and the size of automatically generated screenshots (to monitor model performance on each slide).
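As a hypothetical example of post-processing the exported raw data, per-category counts and mean object sizes can be tallied with a few lines of Python. The column names and values below are invented placeholders, not STUDIO's actual export format, which may differ between versions.

```python
# Tally per-category object counts and mean sizes from a raw-data export
# (category, size and X/Y coordinates per detected object).
import csv
import io
from collections import defaultdict

# Stand-in for the exported report file; headers are assumptions.
raw = io.StringIO(
    "category,size,x,y\n"
    "microglia,120.5,10,20\n"
    "microglia,98.0,30,40\n"
    "cluster,540.2,50,60\n"
)

counts = defaultdict(int)
sizes = defaultdict(list)
for row in csv.DictReader(raw):
    counts[row["category"]] += 1
    sizes[row["category"]].append(float(row["size"]))

for cat in counts:
    mean_size = sum(sizes[cat]) / counts[cat]
    print(f"{cat}: n={counts[cat]}, mean size={mean_size:.2f}")
```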

Further notes
In several cases, training had to be restarted from the beginning because the resulting algorithm did not perform as well as expected. This was usually caused by sub-optimal examples in the training dataset and was hard to correct. Thus, it proved best to go back to step one and make sure that the categories were clearly distinguishable.
If the algorithm missed an object that belonged to one of the categories, we annotated the object manually by drawing. When we ran the algorithm again after some time, it usually had improved to include this and similar objects.
Overfitting can occur, especially when the training dataset is too small. It means that the algorithm fits the given data perfectly, but makes mistakes in other areas or on other slides that look slightly different. The algorithm is overly specific for the training dataset, but cannot be applied to other datasets. The best way to overcome this problem is by increasing the training dataset by adding more and diverse object examples.
In some cases, the model started to recognize very small objects. When this happened, we used the area threshold setting to filter out the unwanted small objects. If we were unsure about the threshold value, we investigated the size of relevant detected objects by clicking on them.
Note: This setting does not affect the model itself, the filter is applied only for the detection.
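The effect of such a threshold can be pictured as a simple post-hoc filter. This is a conceptual sketch with invented numbers, not STUDIO's implementation; as noted above, the underlying model stays unchanged and only the reported detections are filtered.

```python
# Minimal sketch of an area threshold: detections below the threshold
# are dropped at reporting time; the trained model itself is untouched.

def apply_area_threshold(objects, min_area):
    """Keep only detected objects whose area reaches the threshold."""
    return [o for o in objects if o["area"] >= min_area]

detections = [
    {"category": "microglia", "area": 85.0},
    {"category": "microglia", "area": 12.0},  # likely debris / partial object
    {"category": "cluster", "area": 640.0},
]
kept = apply_area_threshold(detections, min_area=30.0)
# 2 objects remain; the 12-unit speck is filtered out
```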
While we generally recommend using perfect examples to train the algorithm, imperfect examples can also contribute to fine-tuning. Similarly, manually classifying "confusing" objects can greatly improve the outcome for difficult decisions.

Examples of different algorithms
While it is obvious that different stainings will give different results, a single staining can also be analysed in different ways to answer different questions. Thus, it is crucial that researchers are aware of their question before devising and training an algorithm. Below, we give three examples of algorithms, all developed on brain sections stained for IBA1 with a haematoxylin counterstain. IBA1 has been described as a marker for microglia (Ito et al., 1998), but can also be expressed on (morphologically different) monocytes (Jeong et al., 2013). We chose this staining as it is widely used when studying brain inflammation and neurodegenerative diseases such as Alzheimer's disease, Parkinson's disease or Huntington's disease. For example, we have previously described distinct spatiotemporal microglia activation using similar high-resolution whole-slide imaging in Alzheimer's disease models (Scheffler et al., 2011).
In Table 2, we compare three different approaches to detect microglia on IBA1 stained mouse brain sections. Fig. 4 shows sample images of the same region analysed with the different models. The main settings used to develop these models in STUDIO are summarized in Table 3.
In the first approach, we aimed to detect only microglial somas, while the second approach aimed to recognize microglia including their ramifications (Fig. 4A and B). Both approaches contain only one object category besides the background. In contrast, the third approach aimed to recognize two types of objects, ramified microglia and microglia clusters (Fig. 4A and B). We validated all three models and summarize the validation results in Table 2. Approach 1 (only microglia somas) has a high precision (0.91), indicating few false positives. The slightly lower recall (0.84) suggests a few missed objects (Fig. 4C). Approach 2 (ramified microglia) has an excellent recall value (0.99), but lower precision (0.61), indicating a higher rate of false positives. For approach 3 (ramified microglia + clusters), three values are given for each measure: overall performance and performance in each of the two categories. While none of the values are outstanding, given the similarities and overlap of the two object categories, the result is nevertheless satisfying. However, further training could focus on avoiding false positive detections, especially false detection of microglial processes without a corresponding soma (Fig. 4D). We further analysed the same cortical region of interest (area: 10.3 mm²) on the slide shown in Fig. 4. The different object counts reflect the differences in sensitivity and specificity already observed during validation (Fig. 5). The different average object sizes are a consequence of the selected target objects and their sizes, with somas being smaller than microglia including their ramifications, which are in turn smaller than clusters consisting of several microglia cells (Fig. 5).
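The normalization behind the reported densities is simple and can be sketched in one function. The object count below is an invented example, not the actual value measured on the analysed slide.

```python
# Convert a raw object count from a region of interest into the
# density unit used in Fig. 5 (objects per 10 mm²).

def density_per_10mm2(object_count, roi_area_mm2):
    return object_count / roi_area_mm2 * 10.0

# e.g. a hypothetical 1500 detected objects in the 10.3 mm² cortical region
d = density_per_10mm2(1500, 10.3)
# roughly 1456 objects per 10 mm²
```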

Fig. 5. Quantification results comparing three different algorithms. Microglia were detected within a cortical region of interest on a coronal brain section stained against IBA1 using three different algorithms. The results were quantified as objects per 10 mm² and average object size. Data are shown as mean ± standard deviation (where applicable).

Discussion
Artificial intelligence and deep learning have come a long way from an almost futuristic method accessible only to computer scientists to a common technique that is ready for widespread practical application.
One large area of application is image analysis in clinical diagnostics and biomedical research. If used correctly, deep learning algorithms can improve the accuracy and efficiency of otherwise lengthy image analysis (Bascunana et al., 2021; Klein et al., 2020). While the theoretical considerations have been discussed elsewhere (Le et al., 2020; Wiestler and Menze, 2020), we here present practical tips and suggestions on how to develop deep learning algorithms in STUDIO through supervised learning, without the need for programming. Supervised learning stands in contrast to unsupervised learning, e.g. discovering patient clusters by genetic analysis without manually labelled training data (Lopez et al., 2018). In the case of supervised learning, artificial intelligence requires human input: it is the responsibility of human users to select training data, and it is their decision when to stop training. The quality of these decisions greatly influences the applicability of any resulting algorithm.
We demonstrate our approach through analysis of medical images, specifically histological images, with three practical examples of algorithms analysing the same original slides, but with different aims and complexity. Each approach faced different challenges, but validation revealed satisfying performance. However, these algorithms may be further refined in the future, focussing on their individual weaknesses (false positives and/or false negatives) to improve detection. A recent article with appealing results describes how microglia can be classified into four categories (ramified, rod-like, activated and amoeboid) using a machine learning approach with a convolutional neural network (Leyh et al., 2021). This method requires complex image preparation (including contrast equalization, soma and process detection, cell reconstruction and separation) before the machine learning model can be applied to any region of interest (Leyh et al., 2021). Depending on the application and the experience of the researcher, this complexity may be advantageous or problematic.
Deep learning partially overcomes researcher bias. First, it does not require the researcher to select features before training, which is an advantage over earlier forms of machine learning (Erickson et al., 2017). Second, model performance is objective, i.e. independent of who uses the model once the training phase has concluded. However, the model can become inherently biased if trained with inappropriate data. Indeed, "researcher group" was the biggest contributor (31%) to model performance in a study comparing different models to predict software performance (Shepperd et al., 2014). To this end, the ability to easily share a model with other researchers, as included in STUDIO, is a first step towards developing powerful standardized models for histological image analysis.

Conclusions
We demonstrate the potential of deep learning-assisted image analysis for the evaluation of histological images, using the detection of microglia in mouse brain tissue as an example. Reduced researcher bias, increased speed and highly versatile detection options make this method superior to traditional analysis methods based on staining intensity. Taken together, our study facilitates the implementation of this cutting-edge technology into the everyday routine of pathologists and researchers.

Ethics approval and consent to participate
All animal tissue was obtained in accordance with the guidelines for animal experiments of the European Union Directive and regional laws.

Funding
This project was supported by the Norwegian Health Association.