CT-Based COVID-19 triage: Deep multitask learning improves joint identification and severity quantification

Graphical abstract

• A carefully selected independent external test set with other lung pathologies.
• Publicly released annotated data and code to ensure reproducibility.

Abstract
The current COVID-19 pandemic overloads healthcare systems, including radiology departments. Though several deep learning approaches were developed to assist in CT analysis, nobody considered study triage directly as a computer science problem. We describe two basic setups: Identification of COVID-19 to prioritize studies of potentially infected patients to isolate them as early as possible; Severity quantification to highlight patients with severe COVID-19, thus direct them to a hospital or provide emergency medical care. We formalize these tasks as binary classification and estimation of affected lung percentage. Though similar problems were well-studied separately, we show that existing methods could provide reasonable quality only for one of these setups. We employ a multitask approach to consolidate both triage approaches and propose a convolutional neural network to leverage all available labels within a single model. In contrast with the related multitask approaches, we show the benefit from applying the classification layers to the most spatially detailed feature map at the upper part of U-Net instead of the less detailed latent representation at the bottom. We train our model on approximately 1500 publicly available CT studies and test it on the holdout dataset that consists of 123 chest CT studies of patients drawn from the same healthcare system, specifically 32 COVID-19 and 30 bacterial pneumonia cases, 30 cases with cancerous nodules, and 31 healthy controls. The proposed multitask model outperforms the other approaches and achieves ROC AUC scores of 0.87 ± 0.01 vs. bacterial pneumonia, 0.93 ± 0.01 vs. cancerous nodules, and 0.97 ± 0.01 vs. healthy controls in Identification of COVID-19, and achieves 0.97 ± 0.01 Spearman Correlation in Severity quantifi-

Introduction
During the first months of 2020, COVID-19 infection spread worldwide and affected millions of people (Li et al., 2020b). Though a virus-specific reverse transcription-polymerase chain reaction (RT-PCR) testing remains the gold standard (World Health Organization et al., 2020), chest imaging, including 5 computed tomography (CT), is helpful in diagnosis and patient management (Bernheim et al., 2020;Akl et al., 2020;Rubin et al., 2020). Moreover, compared to RT-PCR, CT has higher sensitivity (98% compared to 71% at p ≤ 0.001) for some cohorts Fang et al. (2020). Fleischner Society has addressed the role of thoracic imaging in COVID-19, providing recommendations intended to guide 10 medical practitioners with one scenario including medical triage in moderateto-severe clinical features and a high pretest probability of disease (Rubin et al., 2020). Radiology departments can respond to the pandemic by division into four areas (contaminated, semi-contaminated, buffer, and clean), strict disinfection and management criteria (Huang et al., 2020b). The International Society 15 of Radiology surveyed current practices in managing patients with COVID-19 in 50 radiology departments representing 33 countries across all continents. In symptomatic patients with suspected COVID-19, imaging was performed in 89% of cases, in 34% of cases -chest CT. Faster results than molecular tests (51%) and easy access (39%) were the main reasons for imaging use (Blažić et al.,20 2020).  Figure 1: A schematic representation of the automatic triage process. Left: the chronological order of the studies. Center: re-prioritized order to highlight findings requiring radiologist's attention (P denotes COVID-19 Identification probability). Right: accompanying algorithmgenerated X-ray-like series to assist the radiologist in fast decision making (color bar from green to red denotes Severity of local COVID-19-related changes).
The pandemic dramatically increased the need for medical care and resulted in the overloading of healthcare systems (Tanne et al., 2020). Many classification and segmentation algorithms were developed to assist radiologists in COVID-19 identification and severity quantification, see Sec. 1.1.1. However, 25 little research has been conducted to investigate automatic image analysis for triage, i.e. ranking of CT studies. During an outbreak, many CT scans require rapid decision-making to sort patients into those who need care right now and those who will need scheduled care . Therefore, the study list triage is relevant and may shorten the report turnaround time by increasing 30 the priority of CT scans with suspected pathology for faster interpretation by a radiologist compared to other studies, see Fig. 1.
The triage differs from other medical image analysis tasks, as in this case, automatic programs provide the first reading. The radiologist then becomes the second reading. Technically, many of the developed methods may provide a 35 priority score for triage, e.g., output probability of a classifier or the total lesion volume extracted from a binary segmentation mask. However, these scores must be properly used. We assume that there are two different triage problems: 1. Identification. The first challenging task is to identify studies of patients with COVID-19 and prioritize them so the physician can isolate potentially 40 infected patients as early as possible . 2. Severity quantification. Second, within COVID-19 patients, a triage algorithm must prioritize those who will require emergency medical care (Kherad et al., 2020).
Binary classification provides a direct way to formalize Identification, but 45 the optimal computer science approach to estimate Severity is not as obvious.
It was shown that human-based quantitative analysis of chest CT helps assess the clinical severity of COVID-19. (Colombi et al., 2020) had quantified affected pulmonary tissue and established a high correlation between the healthy pulmonary tissue volume and the outcomes (transfer to an intensive care unit 50 or death). The threshold value for the volume of healthy pulmonary tissue was 73%. This result and similar ones motivate clinical recommendations in several countries: COVID-19 patients need to be sorted based on quantitative evaluation of lung lesions. In particular, the Russian Federation adopted the following approach (Moro-55 zov et al., 2020c): the volume ratio of lesions in each lung is calculated separately and the maximal ratio is treated as the overall severity score. However, manual binary segmentation of the affected lung tissue is extremely time-consuming and may take several hours (Shan et al., 2020). For this reason, a visual semiquantitative scale was implemented rather than a fully quantitative one. The 60 original continuous score is split up into five categories: from CT-0 to CT-4 with a 25% step so that CT-0 corresponds to normal cohort and CT-4 -to 75%-100% of damaged lung tissue. Patients with CT-3 (severe pneumonia) are hospitalized, and CT-4 (critical pneumonia) are admitted to an intensive care unit. The scale is based on a visual evaluation of approximate lesion volume in A retrospective study (Morozov et al., 2020b) analyzed the CT 0-4 scores and lethal outcomes in 13003 COVID-19 patients. The chance of a lethal outcome increased from CT-0 to CT-4 by 38% on the average (95% CI 17.1-62.6%). Another retrospective analysis (Petrikov et al., 2020) found a significant cor-70 relation between an increase of CT grade and clinical condition deterioration (r = 0.577).
These two triage strategies, Identification and Severity quantification, are not mutually exclusive, and their priority may change depending on the patient population structure and current epidemiological situation.

75
• An outpatient hospital in an area with a small number of infected patients may rely on Identification solely.
• An infectious diseases hospital may use Severity quantification to predict the need for artificial pulmonary ventilation and intensive care units.
• Finally, an outpatient hospital during an outbreak needs both systems to 80 identify and isolate COVID-19 patients as well as quantify disease severity and route severe cases accordingly.
This paper explores the automation of both Identification and Severity quantification intending to create a robust system for all scenarios, see Fig. 2.
1.1. Related work 85 1.1.1. CT analysis for COVID-19 Identification and Severity Estimation As briefly discussed above, we consider two problems: COVID-19 identification and severity quantification in chest CTs. In both cases, researchers Table 1: Overview of continuous output indices proposed in previous works. The Type column denotes score type: COVID-19 identification, COVID-19 severity or both. Type of the Identification is given in brackets COVID vs. : P -Pneumonia, NP -non-Pneumonia, HC -Healthy controls, N -Nodules, C -Cancer. The Metric column contains reported ROC AUC values unless otherwise indicated. Remarks. 1. Accuracy because ROC AUC was not reported. 2. The metric was provided for the identification problem only. 3. Pearson correlation. 4. The average volume error, measured in cm 3 . 5. The paper does not provide a score, Dice score for the output masks is reported. Below, we present only some of the existing CT-based algorithms for a more comprehensive review we refer to . The majority of reviewed works use a pre-trained network for lung extraction or bounding box estimation as a necessary preprocessing step. We will skip the description of this step below for all works.

95
Binary classification. Researchers usually treat the problem of identification as binary classification, e.g. COVID-19 versus all other studies. Likely, the most direct way to classify CT images with varying slice thicknesses is to train well established 2D convolutional neural networks. For example, authors of (Jin et al., 2020b) train ResNet-50 (He et al., 2016a) to classify images using the 100 obtained lung mask. An interesting and interpretable way to aggregate slice predictions into whole-study predictions is proposed in (Gozes et al., 2020a), where the number of affected slices is used as the final output of the model. Also, this work employs Grad-cam (Selvaraju et al., 2017) to visualize network attention. A custom slice-level predictions aggregation is proposed in (Jin et al.,105 2020a) to filter out false positives.
The need for a post-training aggregation of slice prediction can be avoided by using 3D convolutional networks, Wang et al., 2020b).  propose a two-headed architecture based on 3D ResNet. This approach is a way to obtain hierarchical classification as the first head is 110 trained to classify CTs with and without pneumonia. In contrast, the second one aims to distinguish COVID-19 from other types of pneumonia. Alternatively, slice aggregation may be inserted into network architectures to obtain an endto-end pipeline, as proposed in Bai et al., 2020). Within this setup, all slices are processed by a 2D backbone (ResNet-50 for (Li et al.,115 2020a), EfficientNet (Tan and Le, 2019) for (Bai et al., 2020)) while the final classification layers operate with a pooled version of feature maps from all slices.
Segmentation. The majority of papers for tackling severity estimation are segmentation based. For example, the total absolute volume of involved lung parenchyma can be used as a severity score (Shan et al., 2020). Relative volume 120 (i.e., normalized by the total lung volume) is a more robust approach taking into account the normal variation of lung sizes. Affected lung percentage is estimated in several ways including a non-trainable computer vision algorithm , 2D U-Net (Huang et al., 2020a), and 3D U-Net (Chaganti et al., 2020). Alternatively, an algorithm may predict the severity directly, e.g.,

125
with Random Forest based on a set of radiomics features (Tang et al., 2020) or a neural network.
Multitask approach. As discussed above, many papers address either COVID-19 identification or severity estimation. However, little research has been conducted to study both tasks simultaneously. (Gozes et al., 2020b) propose an original 130 Grad-cam-based approach to calculate a single attention-based score. Though the authors mention both identification and severity quantification in the papers, they do not provide direct quality metrics for the latter. Amine et al. (2020) propose a multi-head architecture to solve both segmentation and classification problems in an end-to-end manner. They use a 2D U-Net backbone with an 135 additional classification head after the encoder part, which takes a latent feature map from the bottom of U-Net as input. Even though they do not tackle the problem of severity identification, they demonstrate that solving two tasks jointly could benefit both. However, they report metrics only for classification and segmentation of 2D axial slices and do not propose an approach to applying 140 their method to the whole 3D CT series.

Deep learning for triage
As mentioned above, we define triage as a process of ordering studies to be examined by a radiologist. There are two major scenarios where such an approach could be useful:

145
• Studies with a high probability of dangerous findings must be prioritized.
The most important example is triage within emergency departments, where minutes of acceleration may save lives (Faita, 2020), but it may be useful for other departments as well. For example, the study (Annarumma et al., 2019) estimates the average reporting delay in chest radiographs as 150 11.2 days for critical imaging findings and 7.6 days for urgent imaging findings.
• The majority of studies do not contain critical findings. This is a common situation for screening programs, e.g., CT-based lung cancer screening (Team, 2011). In this scenario, triage systems aim to exclude studies 155 with the smallest probability of important findings to reduce radiologists' workload.
Medical imaging may provide detailed information useful for automatic patient triage, as shown in several studies. (Annarumma et al., 2019) propose a deep learning-based algorithm to estimate the urgency of imaging findings on 160 adult chest radiographs. The dataset includes 470388 studies annotated in an automated way via text report mining. The Inception v3 architecture (Szegedy et al., 2016) is used to model clinical priority as ordinal data via solving several binary classification problems as proposed in (Lin and Li, 2012). The average reporting delay is reduced to 2.7 and 4.1 days for critical and urgent imaging 165 findings correspondingly in a simulation on historical data.
A triage system for screening mammograms, another 2D image modality, has been developed in (Yala et al., 2019). The authors draw attention to reducing the radiologist's load by maximizing system recall. The underlying architecture is ResNet-18 (He et al., 2016a), which is trained on 223109 screening mammo-170 grams. The model achieves 0.82 ROC AUC on the whole test population and demonstrates the capability to reduce workload by 20% while preserving the same level of diagnostic accuracy.
Prior research confirms that deep learning may assist in triage of more complex images such as volumetric CT. A deep learning-based system for rapid 175 diagnosis of acute neurological conditions caused by stroke or traumatic brain injury is proposed in (Titano et al., 2018). A 3D adaption of ResNet-50 (Korolev et al., 2017) analyzes head CT images to predict critical findings. To train the model, the authors utilize 37236 studies; labels are also generated by text reports mining. The classifier's output probabilities serve as ranks for triage, 180 and the system achieves ROC AUC 0.73-0.88. Stronger supervision is investigated in (Chang et al., 2018), where authors use 3D masks of all hemorrhage subtypes of 10159 non-contrast CT. The detection and quantification of 5 subtypes of hemorrhages are based on a modified Mask R-CNN  extended by pyramid pooling to map 3D input to 2D feature maps (Lin et al.,185 2017). More detailed and informative labels combined with an accurately designed method provide reliable performance as ROC AUC varies from 0.85 to 0.99 depending on hemorrhage type and size. A similar finding is reported in (De Fauw et al., 2018) for optical coherence tomography (OCT). The authors employ a two-stage approach. First, 3D U-Net (Ç içek et al., 2016) is trained on 190 877 studies with dense 21-class segmentation masks. Then output maps for another 14884 cases are processed by a 3D version of DenseNet (Huang et al., 2017) to identify urgent cases. The obtained combination of two networks provided excellent performance achieving 0.99 ROC AUC.

195
First, we highlight the need for triage systems of two types: for COVID-19 identification and severity quantification. We study existing approaches and demonstrate that a system trained for one task shows low performance in the other. Second, we have developed a multitask learning-based approach to create a single neural network which achieves top results in both triage tasks. In 200 contrast to common multitask architectures, classification layers take the spatially detailed 3D feature map as input and return the single probability for the whole CT series. Finally, we provide a framework for reproducible comparison of various models (see the details below).

Reproducible research 205
A critical meta-review (Wynants et al., 2020) of machine learning models for COVID-19 diagnosis highlights low reliability and high risk of biased results for all 27 reviewed papers, mostly due to a non-representative selection of control patients and poor analysis of results, including possible model overfitting. The authors use (Wolff et al., 2019) PROBAST (Prediction model Risk Of Bias 210 Assessment Tool), a systematic approach to validate the performance of machine learning-based approaches in medicine and identified the following issues.
1. Poor patient structure of the validation set, including several studies where control studies were sampled from different populations. 2. Unreliable annotation protocol where only one rater assessed each study 215 without subsequent quality control or the model output influenced annotation. 3. Lack of comparison with other well-established methods for similar tasks. 4. Low reproducibility due to several factors such as unclear model description and incorrect validation approaches (e.g., slice-level prediction rather 220 than study-level prediction).
The authors conclude the paper with a call to share data and code to develop an established system for validating and comparing different models collaboratively. Though (Wynants et al., 2020) is an early review and does not include many properly peer-reviewed papers mentioned above, we agree that current COVID-225 19 algorithmic research lacks reproducibility. We aim to follow the best practices of reproducible research and address these issues in the following way.
1. We selected fully independent test dataset and retrieved all COVID-19 positive and COVID-19 negative cases from the same population and the same healthcare system, see details in Sec. 3.5.
4. We publicly released the code to share technical details of the compared 235 architectures 2 .
Finally, we use solely open CT images for training and testing. We also annotate and release the lesions masks for the COVID-19 positive cases from the test set, see details in the Sec. 3.5. Therefore, our experiments are reproducible as they rely on the open data.

Method
As discussed in Sec. 1 method should solve two tasks: identification of COVID-19 cases and ranking them in descending order of severity. Therefore, we organize Sec. 2 as follows.
• In Sec. 2.1 we describe lungs segmentation as a common preprocessing 245 step for all methods.
• In Sec. 2.2 we tackle the severity quantification task. We describe methods which predict segmentation mask of lesions caused by COVID-19 and provide a severity score based on that.
• In Sec. 2.3 we discuss two straightforward baselines for the identification 250 task. First is to use segmentation results and identify patients with nonempty lesions masks as COVID-19 positive. Second is to use separate neural network for classification of patients into COVID-19 positive or negative. However, as we show in Sec. 5 these methods yield poor identification quality, especially due to false positive alarms in patients with 255 bacterial pneumonia.
• In Sec. 2.4 we propose a multitask model which achieves better COVID-19 identification results than the baselines. In particular, as we show in Sec. 5, this model successfully distinguishes between COVID-19 and bacterial pneumonia cases.

260
• In Sec. 2.5 we introduce quality metrics for both identification and severity quantification tasks to formalize the comparison of the methods.

Lungs segmentation
We segment lungs in two steps. First, we predict single binary mask for both lungs including pathological findings, e.g. ground-glass opacity, consol-265 idation, nodules and pleural effusion. Then we split the obtained mask into separate left and right lungs' masks. Binary segmentation is performed via fully-convolutional neural network in a standard fashion. Details of the architecture and training setup are given in Section 4.2.
On the second step voxels within the lungs are clustered using k-means 270 algorithm (k = 2) with Euclidean distance as a metric between voxels. Then we treat resulting clusters as separate lungs.

COVID-19 severity quantification
To quantify COVID-19 severity we solve COVID-19-specific lesions segmentation task. Using predicted lungs' and lesions' masks, we calculate the lesions' 275 to lung's volume ratio for each lung and use the maximum of two ratios as a final severity score for triage, according to recommendations discussed in Sec. 1.
Threshold-based. As a baseline for lesions segmentation, we choose a thresholdingbased method. As pathological tissues are denser than healthy ones, corresponding CT voxels have greater intensities in Hounsfield Units. The method consists 280 of three steps. The first step implements thresholding: voxels with intensity value between HU min and HU max within the lung mask are assigned to the positive class. At the second step, we apply Gaussian blur with smoothing parameter σ to the resulting binary mask and reassign the positive class to voxels with values greater than 0.5. Finally, we remove 3D binary connected compo-285 nents with volumes smaller than V min . The hyperparameters HU min = −700, HU max = 300, σ = 4 and V min = 0.1% are chosen via a grid-search in order to maximize the average Dice score between predicted and ground truth lesions masks for series from training dataset.
U-Net. The de facto standard approach for medical image segmentation is the 290 U-Net model (Ronneberger et al., 2015). We trained two U-Net-based architectures for lung parenchyma involvement segmentation which we refer to as 2D U-Net and 3D U-Net. 2D U-Net independently processes the axial slices of the input 3D series. 3D U-Net processes 3D sub-patches of size 160 × 160 × 160 and then stacks predictions for individual sub-patches to obtain prediction for the 295 whole input 3D series. Thus, we do not need to downsample the input image under the GPU memory restrictions. For each model, we replace plain 2D and 3D convolutional layers with 2D and 3D residual convolutional blocks (He et al., 2016b), correspondingly. Both models were trained using the standard binary cross-entropy loss (see other details in Sec. 4.3).

COVID-19 Identification
We formalize COVID-19 identification task as a binary classification of 3D CT series. CT series of patients with verified COVID-19 are positive class. CT series of patients with other lung diseases, e.g. bacterial pneumonia, non-small cell lung cancer, etc., as well as normal patients are negative class. 305 Segmentation-based. One possible approach is to base the decision rule on the segmentation results: classify a series as positive if the segmentation-based severity score exceeds some threshold. We show that this leads to a tradeoff between severity quantification and identification qualities: models which yield the best ranking results perform worse in terms of classification, and vice ResNet-50. Another approach is to tackle the classification task separately from segmentation and explicitly predict the probability that a given series is COVID-315 19 positive. The advantage of this strategy is that we only need weak labels for model training, which are much more available than ground truth segmentations.
To assess the performance of this approach we follow the authors of Bai et al., 2020) and train the ResNet-50 (He et al., 2016b) which takes a 320 series of axial slices as input and independently extracts feature vectors for each slice. After that the feature vectors are combined via a pyramid max-pooling operation (He et al., 2014) along all the slices. The resulting vector is passed into two fully connected layers followed by sigmoid activation which predicts the final COVID-19 probability for the whole series. In our paper, we denote 325 this architecture as ResNet-50 (see other details in Section 4.4).

Multitask
Baselines for the identification task described in Sec 2.3 do not perform well, as we show in Sec. 5. Therefore, we propose to solve the identification task simultaneously with the segmentation task via a single two-headed convolutional 330 neural network.
The segmentation part of the architecture is slice-wise 2D U-Net model. As earlier, its output is used for the evaluation of the severity score.
The classification head shares a common intermediate feature map (per slice) with the segmentation part. These feature maps are stacked and aggregated into 335 a feature vector via a pyramid pooling layer (He et al., 2014). Finally, two fully connected layers followed by sigmoid activation transform the feature vector to the COVID-19 probability.
Following (Amine et al., 2020), the shared feature maps can be the outputs of the U-Net's encoder and have no explicit spatial structure in the axial plane. 340 We refer to this approach as Multitask-Latent. In contrast, we argue that the identification task is connected to the segmentation task and the classification model can benefit from the spatial structure of the input features. Therefore, we propose to share the feature map from the very end of the U-Net architecture, as shown in Fig. 3. We refer to the resulting architecture as Multitask-Spatial-1.

345
More generally, shared feature maps can be taken from the l-th upper level of the U-Net's decoder. Together they form a 3D spatial feature map, which is aligned with the input 3D series downsampled in the axial plane by a factor of 2 l−1 . We denote this approach as Multitask-Spatial-l. Since 2D U-Net architecture has 7 levels, l can vary from 1 to 7.

350
As a loss function we optimize a weighted combination of binary cross entropies for segmentation and classification (see other details in Section 4).

Metrics
To assess the quality of classification of patients into positive, i.e. infected by COVID-19, and negative, i.e. with other lung pathologies or normal, we use 355 areas under the ROC-curves (ROC AUC) calculated on several subsamples of the test sample described in Sec. 3.5.
• The first subsample contains only COVID-19 positive and healthy subjects, while studies with other pathological findings are excluded (ROC AUC COVID-19 vs. Normal).
• The third subsample contains COVID-19 positive patients and patients with lung nodules typical for non-small cell lung cancer (ROC AUC COVID-19 vs. Nodules).

365
• The last ROC AUC is calculated on the whole test sample (ROC AUC COVID-19 vs. All others).
ROC-curves are obtained by thresholding the predicted probabilities for ResNet-50 and multitask models, and by thresholding the predicted severity score for segmentation-based methods. 370 We evaluate the quality of ranking studies in order of descending COVID-19 severity on the test subsample, which contains only COVID-19 positive patients. As a quality metric, we use Spearman's rank correlation coefficient (Spearman's ρ) between the severity scores y true calculated for ground truth segmentations and the predicted severity scores y pred . It is defined as ρ(y true , y pred ) = cov(rg(y true ), rg(y pred )) σ(rg(y true )) · σ(rg(y pred )) , where cov(·, ·) is a sample covariance, σ(·) is a sample standard deviation and rg(y) is the vector of ranks, i.e. resulting indices of y elements after their sorting in the descending order.
To evaluate the COVID-19 lesions segmentation quality we use Dice score coefficient between the predicted and the ground truth segmentation masks.

375
Similar to Spearman's ρ, we evaluate the mean Dice score only for COVID-19 positive cases.

Data
We use several public datasets in our experiments: • NSCLC-Radiomics and LUNA16 to create a robust lung segmentation 380 model.
• Mosmed-Test as a hold-out test set for the final evaluation of all models.

Mosmed-1110
385 1110 CT scans from Moscow outpatient clinics were collected from 1st of March, 2020 to 25th of April, 2020, within the framework of outpatient computed tomography centers in Moscow, Russia (Morozov et al., 2020a).
Scans were performed on Canon (Toshiba) Aquilion 64 units in with standard scanner protocols and, particularly 0.8 mm inter-slice distance. However, 390 the public version of the dataset contains every 10th slice of the original study, so the effective inter-slice distance is 8mm.
Radiologists performed an initial reading of CT scans in clinics, after which 400 experts from the advisory department of the Center for Diagnostics and Telemedicine (CDT) independently conducted the second reading as a part of a total audit targeting all CT studies with suspected COVID-19. Additionally, 50 CT scans were annotated with binary masks depicting regions of interest (ground-glass opacity and consolidation).

MedSeg-29
MedSeg web-site 3 shares 2 publicly available datasets of annotated volumetric CT images. The first dataset consists of 9 volumetric CT scans from a web-site 4 that were converted from JPG to Nifti format. The annotations of this dataset include lung masks and COVID-19 masks segmented by a radiologist.

410
The second dataset consists of 20 volumetric CT scans shared by (Jun et al., 2020). The left and rights lungs, and infections are labeled by two radiologists and verified by an experienced radiologist.

NSCLC-Radiomics
NSCLC-Radiomics dataset (Kiser et al., 2020;Aerts et al., 2015) represents 415 a subset of The Cancer Imaging Archive NSCLC Radiomics collection (Clark et al., 2013). It contains left and right lungs segmentations annotated on 3D thoracic CT series of 402 patients with diseased lungs. Pathologies -lung cancerous nodules, atelectasis and pleural effusion -are included in the lung volume masks. Pleural effusion and cancerous nodules are also delineated sep-420 arately, when present.
Automatic approaches for lungs segmentation often perform inconsistently for patients with diseased lungs, while it is usually the main case of interest. Thus, we use NSCLC-Radiomics to create robust for pathological cases lungs segmentation algorithm. Other pathologies, e.g. pneumothorax, that are not 425 presented in NSCLC-Radiomics could also lead to poor performance of lungs segmentation. But the appearance of such pathologies among COVID-19 cases is extremely rare. For instance, it is less than 1% for pneumothorax (Zantah et al., 2020). Therefore, we ignore the possible presence of other pathology cases, while training and evaluating our algorithm.

LUNA16
LUNA16 (Jacobs et al., 2016) is a public dataset for cancerous lung nodules segmentation. It includes 888 annotated 3D thoracic CT scans from the LIDC/IDRI database (Armato III et al., 2011). Scans widely differ by scanner manufacturers (17 scanner models), slice thicknesses (from 0.6 to 5.0 mm), 435 in-plane pixel resolution (from 0.461 to 0.977 mm), and other parameters. Annotations for every image contain binary masks for the left and right lungs, the trachea and main stem bronchi, and the cancerous nodules. The lung and trachea masks were originally obtained using an automatic algorithm (van Rikxoort et al., 2009) and the lung nodules were annotated by 4 radiologists (Armato III 440 et al., 2011). We also exclude 7 cases with absent or completely broken lung masks and extremely noisy scans.

Mosmed-Test
We ensure the following properties of the test dataset: • All cases are full CT series without missing slices and/or lacking metadata 445 fields (e.g., information about original Hounsfield units).
• Data for all classes comes from the same population and the same healthcare system to avoid domain shifts within test data. February 2020, at the beginning of the Russian outbreak. We remove 5 cases 5 https://mosmed.ai/en/datasets/ct_lungcancer_500/ with artifacts related to patients' movements while scanning. The remaining 37 cases were independently assessed by two raters (radiologists with 2 and 5 years of experience) who have annotated regions of interest (ground-glass opacities and consolidation) via MedSeg 6 annotation tool for every of the 37 Mosmed-

455
Test series. During the annotation process, 5 out of 37 images were identified to have no radiomic signs of COVID-19, so we remove these images from the list of COVID-19 positives. Then, we iteratively verify annotations based on two factors: Dice Score between two rates, and missing large connected components of the mask by one of the raters. The discrepancy between the two raters has 460 been analyzed until the consensus is reached -0.87 ± 0.17 Dice Score over 32 COVID-19 infected cases. We publicly release 2 the final version of COVID-19 positive dataset including both images and annotated lesions masks. Note, that the Mosmed-20 was collected at inpatient clinics, whereas Mosmed-1110 is a subset of Moscow out-patient clinics database created from two to six 465 weeks later, which guarantees that studies are not duplicated.
Bacterial pneumonia. We use 30 randomly selected cases from a dataset (Korb et al., 2021) with 75 chest CT studies with radiological signs of communityacquired bacterial pneumonia in 2019.
Lung nodules. We use a subset of MoscowRadiology-CTLungCa-500 7 , a public 470 dataset containing 500 chests CT scans randomly selected from patients over 50 years of age. We selected 30 cases randomly among cases with radiologically verified lung nodules.
Normal controls. The dataset with healthy patients consists of two parts: 5 Mosmed20 cases mentioned above without radiomic signs of COVID-19, and 26 475 cases from MoscowRadiology-CTLungCa-500 without lung nodules larger than 5mm and other lung pathologies.

Experiments
We design our experiments in order to objectively compare all the triage models described in Sec. 2. As shown in the Tab. 2, all the methods are trained 480 on the same datasets and evaluated using the mean values and the standard deviations of the same quality metrics defined in Sec. 2.5 on the same hold-out test dataset described in Sec. 3.5. We believe, that the experimental design for training neural networks for triage described in Sec. 4.3 and 4.4 exclude overfitting. All computational experiments were conducted on Zhores supercomputer 485 (Zacharov et al., 2019). Table 2: Training, validation and test data splits for all triage models. For each method, we give the optimized training objectives in the corresponding table cells for the training datasets. Every column of Mosmed-Test dataset represents the metrics which are calculated using the corresponding test subset. Remarks. 1. pos. with mask /pos. mean COVID-19 positive cases with or without lesions mask, correspondingly, and neg. means COVID-19 negative cases. 2. DSC means Dice Score. 3. AUCs means ROC AUC COVID-19 vs. All, vs. Normal, vs. Bac. Pneum. and vs. Nodules. 4. Seg. BCE and class. BCE means segmentation and classification Binary Cross-Entropy correspondingly. 5. ρ means Spearman's ρ. 6. Multitask-Latent, Multitask-Spatial-4, Multitask-Spatial-1.

Preprocessing
In all our experiments we use the same preprocessing applied separately for each axial slice: rescaling to a pixel spacing of 2 × 2 mm and intensity normalization to the [0, 1] range.

490
In our COVID-19 identification and segmentation experiments we crop the input series to the bounding box of the lungs' mask predicted by our lungs segmentation network.
We further show (Sec. 5) that this preprocessing is sufficient for all the models. Despite the diversity of the training dataset, all the models successfully 495 adapt to the test dataset.

Lungs segmentation
For the lungs segmentation task we choose a basic U-Net (Ronneberger et al., 2015) architecture with 2D convolutional layers, individually apply to each axial slice of an incoming series. The model was trained on NSCLC-Radiomics and 500 LUNA16 datasets for 16k batches of size 30. We use Adam (Kingma and Ba, 2014) optimizer with default parameters and an initial learning rate of 0.001, which was decreased to 0.0001 after 8k batches.
We assess the model's performance using 3-fold cross-validation and additionally using MedSeg-29 dataset as hold-out set. Dice Score of cross-validation 505 is 0.976 ± 0.023 for both LUNA16 and NSCLC-Radiomics datasets, and 0.962 ± 0.023 only on NSCLC-Radiomics dataset. The latter result confirms our model to be robust to the cases with pleural effusion. Dice Score on MedSeg-29 dataset is 0.976±0.013, which shows the robustness of our model to the COVID-19 cases.

Lesions segmentation 510
We use all the available 79 images of COVID-19 positive patients with annotated lesions masks (50 images from Mosmed-1110 and 29 images from MedSeg-29) to train the threshold-based, 2D U-Net, 3D U-Net models.
Additionally, we train the 2D U-Net's architecture on the same 79 cases along with 402 images from the NSCLC-Radiomics dataset. These 402 images 515 were acquired long before the COVID-19 pandemic, therefore we assume that ground truth segmentations for them are zero masks. During training this model we resample series such that batches contain approximately equal numbers of COVID-19 positive and negative cases. We refer to this model as 2D U-Net+.
2D U-Net and 2D U-Net+ were trained for 15k batches using Adam (Kingma 520 and Ba, 2014) optimizer with default parameters and an initial learning rate of 0.0003. Each batch contains 5 series of axial slices. 3D U-Net was optimized via plain stochastic gradient descent for 10k batches using a learning rate of 0.01. Each batch consists of 16 3D patches. In order to estimate mean values and standard deviations of models' quality 525 metrics defined in Sec. 2.5 each segmentation network was trained 3 times with different random seeds. Resulting networks were evaluated on the hold-out test dataset, described in Sec. 3.5.

ResNet-50 and multitask models
The remaining 806 positive images without ground truth segmentations 530 and 254 negative images from the Mosmed-1110 and 402 negative images from NSCLC-Radiomics were split 5 times in a stratified manner into a training set and a validation set. Each of the 5 validation sets contains 30 random images. For each split we train the ResNet-50 and the classification heads of Multitask-Latent, Multitask-Spatial-1 and Multitask-Spatial-4 models on the defined train-535 ing set, while segmentation heads of the multitask models were trained on the same 79 images, as 2D U-Net (see Sec. 4.3).
For each network on each training epoch we evaluate the ROC AUC between the predicted COVID-19 probabilities and the ground truth labels on the validation set. We save the networks' weights which resulted in the highest val-540 idation ROC AUC during training. For all the multitask models as well as for ResNet-50 top validation ROC AUCs exceeded 0.9 for all splits.
We train all networks for 30k batches using Adam (Kingma and Ba, 2014) optimizer with the default parameters and an initial learning rate of 3 · 10 −4 reduced to 1·10 −4 after 24k batches. Each batch contains 5 series of axial slices.

545
During training the multitask models we resample examples such that batches contained an approximately equal number of examples which were used to penalize either classification or segmentation head. However, we multiplied by 0.1 the loss for the classification head, because it resulted in better validation ROC AUCs.

550
For each of 5 splits, we evaluated each trained network on the hold-out test dataset described in Sec. 3.5. We report the resulting mean values and standard deviations of the quality metrics in Sec. 5.

Results
In this section we report and discuss the results of the experiments described 555 in Sec. 4. In Tab. 3 we compare all the methods described in Sec. 2 using quality metrics defined in Sec. 2.5 and evaluated on the test dataset described in Sec. 3.5. Table 3: Quantitative comparison of all the methods discussed in Section 2. Trade-off between qualities of COVID-19 identification and ranking by severity is observed for segmentationbased methods. The proposed Multitask-Spatial-1 model yields the best identification results. Results are given as mean ± std.

Segmentation-based methods
In this subsection we discuss the performance of four methods: the thresholdbased baseline, 3D U-Net, 2D U-Net and 2D U-Net+. 560 We expect two major weaknesses of the threshold-based method: False Positive (FP) predictions on the vessels and bronchi, and inability to distinguish COVID-19 related lesions from other pathological findings. It is clearly seen from the extremely low ROC AUC scores (Tab. 3). One could also notice massive FP predictions even in healthy cases (Fig. 4, column B). However, the 565 method often provides a reasonable segmentation of the lesion area (Fig. 4, column A).
As one could expect, training on images with non-small cell lung cancer tumors from NSCLS-Radiomics dataset results in the enhancement of ROC AUC vs. Nodules (0.91 for 2D U-Net+ compared to 0.79 for 2D U-Net). Interestingly, in this experiment we observe a degradation in terms of Spearman's ρ for 580 ranking of COVID-19 positive cases (0.8 for 2D U-Net+ compared to 0.97 for 2D U-Net). We conclude that one should account for this trade-off and use an appropriate training setup depending on the task.
All the segmentation-based models perform poorly in terms of classification into COVID-19 and bacterial pneumonia (ROC AUC COVID-19 vs. Bac.

ResNet-50
Despite that validation ROC AUCs for all the trained ResNet-50 networks exceed 0.9, their performance on the test dataset is extremely unstable: ROC AUC COVID-19 vs. All varies from 0.43 to 0.85, see also high standard deviation 590 values for all tasks in Tab. 3.

Multitask models
In this subsection we discuss the performance of Multitask-Latent, Multitask-Spatial-4 and the proposed Multitask-Spatial-1 models on identification, segmentation and severity quantification tasks in comparison to each other, ResNet-595 50 and segmentation-based methods.
As seen from mean values and standard deviations of ROC AUC scores in Tab. 3, Multitask-Latent model yields better and more stable identification results than ResNet-50. Both these models classify the latent representations of the input images. We show that sharing these features with the segmentation 600 head, i.e. decoder of the U-Net architecture improves the classification quality. Moreover, one can see in Tab. 3 that this effect is enhanced by sharing the spatial feature maps from the upper levels of the U-Net's decoder. The proposed Multitask-Spatial-1 architecture (see Fig. 3) with shallow segmentation and classification heads directly sharing the same spatial feature map shows the top clas- As seen in Tab. 3 and Fig. 4 there is no significant difference in terms of segmentation and severity quantification qualities between the multitask models 610 and the neural networks for single segmentation task.
Therefore, the single proposed Multitask-Spatial-1 model can be applied for both triage problems: identification of COVID-19 patients followed by their ranking according to the severity. In Fig. 5 we visualize these two steps of triage pipeline for the test dataset, described in Sec. 3.5. One can see the 615 several false positive alarms in cases with non-COVID-19 pathological findings. We discuss the possible ways to resolve them in Sec. 6. The overall pipeline for triage, including preprocessing, lungs segmentation, and multitask inference takes 8s and 20s using nVidia V100 and GTX 980 GPUs respectively.

620
We have highlighted two important scores: COVID-19 Identification and Severity and discussed their priorities in different clinical scenarios. We have shown that these two scores aren't aligned well. Existing methods operate either with Identification or Severity and demonstrate deteriorated performance for the other task. We have presented a new method for joint estimation of COVID-625 19 Identification and Severity score and showed that the proposed multitask architecture achieves top quality metrics for both tasks simultaneously. Finally, we have released the code and used public data for training, so our results are fully reproducible. Besides classification between COVID-19 and healthy patients we evaluate 630 classification between COVID-19 and other lung abnormalities: bacterial pneumonia and cancerous nodules. As shown in Fig. 5, our method yields false positive alarms, mainly in patients with bacterial pneumonia (COVID-19 vs. bacterial pneumonia specificity 22/30 = 73.4%). However, we find this result promising, given the fact that we do not use any explicit training dataset with 635 bacterial pneumonia patients. The proposed multitask model can be trained with the addition of bacterial or/and viral (not COVID-19) pneumonia cases, which can partially reduce the classification error. However, there is also an irreducible classification error in cases when radiomic features are not allow to distinguish between COVID-19 and non-COVID-19 pneumonia. Fortunately, in 640 practice, usage of an automated triage system always implies second reading, so the model's false positives are assumed to be resolved by a radiologist, while the most controversial cases can be resolved by the RT-PCR testing. Thus, we conclude that the identification part of our triage system may be used as a highly sensitive first reading tool.

645
The role of the Severity Quantification part is more straightforward. As we mentioned in Section 1, radiologists perform the severity classification into groups from CT0 (no COVID-19 related lesions) and CT1 (up to 25% of lungs affected) to CT4 (more than 75%) in a visual semi-quantitative fashion. We believe that such estimation may be highly subjective and may contain severe 650 discrepancies. To validate this assumption, we additionally analyzed Mosmed-1110, which includes not only 50 segmentation masks but also 1110 multiclass labels CT0-CT4. Within our experiments, we binarized these labels and effec-tively removed information about COVID-19 severity. We examined mask predictions for the remaining 1050 cases, excluding healthy patients (CT0 group) 655 and grouped the predictions by these weak labels, as shown in Fig. 6. An expert radiologist validated analyzed the most extreme mismatches visualized in Fig. 6 and confirmed the correctness of our model's predictions. As we see, the severity of many studies was highly underestimated during the visual semi-quantitative analysis. This result implies that deep-learning-based medical image analysis 660 algorithms, including the proposed method, are great intelligent radiologists' assistants in a fast and reliable estimation of time-consuming biomarkers such as COVID-19 severity.