Value and prognostic impact of a deep learning segmentation model of COVID-19 lung lesions on low-dose chest CT

Highlights
• A deep learning (DL) pipeline based on supervised convolutional neural networks achieved a Dice coefficient of 0.75 ± 0.08 for overall COVID-19 lesions (ground-glass opacity and consolidation) on low-dose chest computed tomography.
• The pipeline computes clinical parameters: lesion volume (cm³) and extent (%). Automatic quantification of lesion extent had a mean absolute error of 2.1 ± 2.4% and correlated well with the manual ground-truth reference (r = 0.947; p<0.001).
• After stepwise selection and adjustment for the clinical characteristics of 1621 patients, DL-driven automatic quantification was a strong prognostic marker of adverse events during COVID-19 infection: the prognostic accuracy of the model rose from 0.82 without DL to 0.90 with DL-driven quantification (p<0.0001).


Introduction
In December 2019, an outbreak of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spread worldwide from Asia to Europe [1,2]. SARS-CoV-2 is responsible for coronavirus disease 2019 (COVID-19). It was declared a worldwide pandemic by the World Health Organization on March 11th, 2020. One of the main risks is the congestion of the health care system due to an unusually rapid inflow of patients, especially in the intensive care unit (ICU). Thus, there is a need for precise patient selection and risk stratification to focus on severe cases [3]. This stratification is based on clinical criteria, viral load on reverse-transcription polymerase chain reaction (RT−PCR) and pulmonary lesions on chest CT.
Abbreviations: ACE, angiotensin-converting enzyme; BMI, body mass index; Cons, consolidation; CNN, convolutional neural network; COVID-19, coronavirus disease 2019; CT-SS, chest tomography severity score; DL, deep learning; DSC, Dice similarity coefficient; GGO, ground-glass opacity; ICU, intensive care unit; LDCT, low-dose computed tomography; MAE, mean absolute error; MVSF, mean volume similarity fraction; ROC, receiver operating characteristic.
Institution from which the work originated: Department of Radiology, Hôpital de la Timone Adultes, AP-HM, 264, rue Saint-Pierre, 13385 Marseille Cedex 05, France.
Low-dose computed tomography (LDCT) is more effective than chest X-ray for depicting ground-glass opacity (GGO) and consolidation (Cons), with a lower dose of radiation than conventional chest CT [4−7]. Some investigators have shown that a semiquantitative clinical score reflecting the extent of lesions might be useful for patient risk stratification [8,9]. Nevertheless, the computation of semiquantitative scores remains a time-consuming process that is prone to intra- and interobserver variability. Hence, there is a need for a fast, reproducible and fully automated COVID-19 lung lesion segmentation method that can be applied to a large cohort as a predictive risk stratification tool in disease management.
Deep learning (DL) techniques, especially convolutional neural networks (CNNs), have shown promising results in the automation of medical imaging measures [10]. In thoracic imaging, these techniques have shown excellent performance in nodule detection, lesion segmentation and disease classification [11,12].
The main purpose of this study was to develop and evaluate a complete DL pipeline that allows fully automated segmentation of COVID-19 pulmonary lesions on LDCT and the computation of lesion volume and extent. Our secondary purpose was to investigate whether automatic lesion quantification was associated with adverse events among COVID-19 patients.

Study design
This single-center retrospective study was conducted from March 3rd to July 2nd, 2020, and approved by the local Institutional Review Board (N°: 2020-0012, RGPD/Ap-Hm: 2020-48). Training, validation and test datasets comprising LDCT examinations from 124, 20 and 30 patients, respectively, were used to build a CNN-based pipeline for automatic segmentation and quantification of COVID-19 lesions on LDCT, including computation of lesion volume and extent. A flow diagram of the procedure is shown in Fig. 1. We then evaluated the predictive value of DL-driven quantification of lung lesions for the occurrence of adverse events in a dataset of 1621 patients, excluding data from the training, validation and test datasets. Among those 1621 patients, 983 have been previously reported [13,14]. The authors did not receive any financial or material support from any industrial company in the execution of this study.

Population
All patients were enrolled from a single center (La TIMONE Hospital − Assistance Publique Hôpitaux de Marseille (APHM)). All patients who presented between March 30th and June 2nd, 2020 with a COVID-19 infection confirmed by SARS-CoV-2 RNA detection from a nasopharyngeal swab sample [13,15] and who were eligible for unenhanced LDCT were retrospectively included. LDCT was performed on all patients who were over 55 years old or had risk factors for adverse outcomes of COVID-19, such as hypertension, diabetes, obesity (BMI>30), dyspnea or abnormal lung auscultation. The exclusion criteria were refusal to participate in the protocol and an age below 18 years.

Clinical data
The following clinical parameters were recorded by infectiologists (M.M. and J-C.L., with 25 and 20 years of experience, respectively) on the same day as the LDCT: age, sex, date of the first symptoms, temperature, heart rate, systolic and diastolic blood pressures, respiratory rate, oxygen saturation, cough, rhinorrhea, dyspnea, diarrhea, myalgia, and lung auscultation abnormalities. The following medical history was recorded: heart disease, tobacco use, chronic obstructive pulmonary disease, asthma, diabetes, obesity, sleep apnea syndrome, oncological status and immunosuppression status. The time between the first symptoms and the LDCT was recorded. Patient follow-up lasted 10 days for patients with no adverse events, and the follow-up period was extended to cover the in-hospital stay for patients who required hospitalization. The primary endpoint of the second objective was a combined outcome consisting of a need for oxygen therapy, a need for transfer to the ICU, hospitalization ≥10 days and/or death.

Radiological data
All patients underwent unenhanced, deep-inspiration LDCT on the same system (Revolution EVO − GE Healthcare, WI, USA) with parameters detailed in Appendix A. To develop our pipeline, we used a training dataset composed of 124 LDCT examinations (68767 CT slices) and a validation dataset of 20 LDCT examinations (6317 CT slices) from consecutive patients in clinical care. To obtain a training dataset including all types of lesions and with a homogeneous distribution of lesion extent and severity, the chest tomography severity score (CT-SS) developed by Yang et al. was used on the whole cohort [16]. This score, ranging from 0 to 40, has been validated as a semiquantitative clinical method to quantify the extent and severity of lung abnormalities in COVID-19. All CT-SS evaluations were performed by two experienced chest radiologists (J-Y.G. and P.H., with 25 and 7 years of experience, respectively). Patients for the training and validation datasets were chosen depending on their CT-SS, resulting in 13/124 (10.5%) and 2/20 (10%) severe patients (CT-SS > 19.5) and 111/124 (89.5%) and 18/20 (90%) mild patients (CT-SS < 19.5), respectively. The test dataset was composed of 30 consecutive patients (15587 CT slices) from clinical care and did not overlap with the training or validation datasets.

Manual segmentation
Manual image segmentation was undertaken for the combined training, validation and test datasets by a single observer (Observer 1 (O1), A.B., with 5 years of experience). For each patient, all images from the lung window LDCT were anonymized. Images were imported in DICOM format into the validated postprocessing software 3D Slicer (https://www.slicer.org, 2014) [17]. Manual segmentation of the lung window CT was applied to the entire lung volume, including all slices, using thresholding, painting and erasing methods to obtain the segmentation masks of three distinct labels: GGO, Cons, and normal pixels within the lungs (LungN). GGO and Cons were distinguished using a threshold based on the attenuation values in HU compared to that of the pulmonary artery [18]. Distal vascular and bronchial trees were not extracted from the labels. The non-segmented part of the image was classified under a fourth label: background (BG). After being validated by one experienced chest radiologist (J-Y.G.), the obtained segmentation masks were considered the ground truth, especially for GGO and Cons. Clinical parameters were obtained from the ground-truth segmentations as follows. Lung volume (cm³) was the sum of the LungN, GGO, and Cons labels. The GGO and Cons volumes (cm³) were extracted from the respective labels. The GGO and Cons extents (%) were the ratios of the GGO and Cons volumes, respectively, to the total lung volume. Lesion extent (%) was the sum of the GGO and Cons extents. The user interaction time was recorded for all manual segmentations. All ground-truth manual segmentations and extracted clinical measures were labeled O1a.
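The derivation of the clinical parameters from a label mask can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the integer label codes, the function name `lesion_metrics` and the nested-list mask layout are hypothetical, and the voxel volume is assumed to be known from the image spacing.

```python
from collections import Counter
from itertools import chain

# Hypothetical label encoding; the paper names the labels but not their codes.
BG, LUNG_N, GGO, CONS = 0, 1, 2, 3


def lesion_metrics(mask, voxel_volume_mm3):
    """Compute volumes (cm^3) and extents (%) from a 3D label mask.

    mask: nested list indexed as mask[slice][row][col] holding label codes.
    voxel_volume_mm3: volume of one voxel in mm^3 (product of the spacings).
    """
    # Count voxels per label over the whole volume.
    counts = Counter(chain.from_iterable(chain.from_iterable(mask)))
    to_cm3 = voxel_volume_mm3 / 1000.0  # 1 cm^3 = 1000 mm^3

    ggo_vol = counts[GGO] * to_cm3
    cons_vol = counts[CONS] * to_cm3
    # Lung volume is the sum of LungN, GGO and Cons; background is excluded.
    lung_vol = counts[LUNG_N] * to_cm3 + ggo_vol + cons_vol
    if lung_vol == 0:
        raise ValueError("mask contains no lung voxels")

    ggo_extent = 100.0 * ggo_vol / lung_vol
    cons_extent = 100.0 * cons_vol / lung_vol
    return {
        "lung_volume_cm3": lung_vol,
        "ggo_volume_cm3": ggo_vol,
        "cons_volume_cm3": cons_vol,
        "ggo_extent_pct": ggo_extent,
        "cons_extent_pct": cons_extent,
        # Lesion extent is defined as the sum of the GGO and Cons extents.
        "lesion_extent_pct": ggo_extent + cons_extent,
    }


# Toy volume: 2 slices of 2x4 pixels, voxels of 1000 mm^3 (1 cm^3 each).
toy_mask = [
    [[BG, LUNG_N, LUNG_N, GGO], [BG, LUNG_N, LUNG_N, CONS]],
    [[BG, LUNG_N, LUNG_N, GGO], [BG, LUNG_N, LUNG_N, LUNG_N]],
]
metrics = lesion_metrics(toy_mask, voxel_volume_mm3=1000.0)
```

In the toy volume, 12 of the 16 voxels belong to the lungs and 3 of them are lesions, so the lesion extent is 25%.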

Network architecture
Our pipeline was composed of three 2D slice-based CNN models and aimed to produce automated segmentation of GGO and Cons on LDCT images with corresponding measures in terms of volume (cm³) and extent (%). All automated segmentations and extracted measures

Performance evaluation

Segmentation evaluation
To assess the segmentation accuracy of our model, we compared the manual ground-truth segmentations (O1a) to the automatically obtained segmentations (Auto) in terms of technical metrics and clinical parameters on the test dataset (n = 30). For the technical metrics, we evaluated the model performance with the Dice similarity coefficient (DSC) and mean volume similarity function (MVSF) [19]. Our DSC calculation method was identical to those from [20] and [21]. A 2D CNN is trained and then used to segment the volumetric (3D) image of a patient: slice-by-slice 2D inference is performed, and the resulting 2D segmentations are concatenated along the z-axis to produce the final 3D segmentation. The metrics (e.g., DSC) can then be computed at the 3D level. The O1a and Auto clinical parameters, lesion volume (cm³) and lesion extent (%), were compared using mean absolute error (MAE), bias and correlation. The significance of the bias was evaluated with the Wilcoxon signed-rank test. Efficiency, defined in terms of the user interaction time, was evaluated and compared.
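The 3D-level DSC computation described above can be illustrated with a minimal sketch. It assumes integer label codes (2 for GGO, 3 for Cons, chosen here only for illustration), per-slice predictions stored as nested lists, and the convention that two empty masks give a DSC of 1; the function names are hypothetical and do not come from the paper.

```python
def stack_slices(slices_2d):
    """Concatenate per-slice 2D label maps along the z-axis into a 3D volume."""
    return [[list(row) for row in label_map] for label_map in slices_2d]


def dice_3d(pred, truth, labels):
    """Dice similarity coefficient over a whole 3D volume.

    DSC = 2 * |P and T| / (|P| + |T|), where P and T are the sets of voxels
    whose label belongs to `labels` in the prediction and in the reference.
    """
    labels = set(labels)
    inter = n_pred = n_truth = 0
    for pred_slice, truth_slice in zip(pred, truth):
        for pred_row, truth_row in zip(pred_slice, truth_slice):
            for p, t in zip(pred_row, truth_row):
                in_p, in_t = p in labels, t in labels
                n_pred += in_p
                n_truth += in_t
                inter += in_p and in_t
    if n_pred + n_truth == 0:
        return 1.0  # assumed convention when both masks are empty
    return 2.0 * inter / (n_pred + n_truth)


GGO, CONS = 2, 3  # hypothetical label codes

# Slice-wise outputs of a 2D model (Auto) and the manual reference (O1a).
auto = stack_slices([
    [[0, 2, 2], [0, 3, 0]],
    [[0, 2, 0], [0, 0, 0]],
])
manual = stack_slices([
    [[0, 2, 3], [0, 3, 0]],
    [[0, 0, 0], [0, 2, 0]],
])

dsc_lesion = dice_3d(auto, manual, labels=(GGO, CONS))  # overall lesions
dsc_ggo = dice_3d(auto, manual, labels=(GGO,))          # GGO only
```

Computing the score once over the stacked volume, rather than averaging per-slice scores, keeps slices without any lesion from distorting the result.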

Reproducibility
The reproducibility of the Auto method was compared to the inter- and intraobserver segmentation performances. Observer 1 performed a second analysis, labeled O1b, 2 weeks after the ground-truth segmentation; the tasks within the second analysis were performed in randomized order to minimize bias. Two other independent observers (Observer 2 (O2), A.M., with 3 years of experience; Observer 3 (O3), B.M., with 3 years of experience) manually segmented the same test dataset; their segmentations were labeled O2 and O3, respectively. The observers were blinded to the subjects' characteristics and the segmentations made by the other observers.

Prognostic value
To assess the prognostic performance of the radiological quantification of lesion extent and type for adverse events among COVID-19 patients, we evaluated both forms of radiological quantification: the CT-SS and the automatic quantification, corresponding to disease extent (%) obtained with the presented DL pipeline. For automatic quantification, we evaluated GGO, Cons and lesion extent scores. Lesion extent was the sum of the GGO and Cons extents. To assess the predictive performance of these quantifications, we performed multivariate logistic regression on the combined outcomes. To this end, we used all the patients fulfilling our inclusion criteria and included in the study between March 3rd and July 2nd, excluding patients from the training and validation datasets.

Statistical analysis
Quantitative variables are expressed as the mean ± standard deviation and range or as the median with quartiles (Q1-median-Q3) and range. Categorical data are expressed as raw numbers, proportions and percentages. To assess the predictive performance of the DL-driven automatic lesion extent quantification on the prognostic value dataset, we performed multivariate logistic regression on the following outcome: "transfer to ICU and/or death and/or hospitalization ≥ 10 days and/or oxygen therapy". We randomly divided the prognostic value dataset (n=1621) into a training subset (70% of the initial sample size, n=1135) and a validation subset (30% of the initial sample size, n=486). Model parameters were estimated on the training subset, and prognostic performance was assessed on the validation subset. A reference model (A) was first tested and adjusted for the following covariates: age, sex, comorbidities (cancer, diabetes, coronary artery disease, hypertension, chronic respiratory diseases, obesity), and time from symptom onset to scan date. Next, we tested a second model (B) where we added CT-SS as an independent variable and a third model (C) where we added the automatic lesion extent quantification obtained by DL-driven segmentation. Second-order interaction terms between the scores and the covariates were tested in Models B and C. We used likelihood ratio tests for comparing models. To estimate the models' ability to discriminate individuals, we computed the C-statistic on the validation subset [22]. The optimal cutoff value for the automatic lesion extent quantification was selected by maximizing the Youden index (sensitivity + specificity − 1).
A two-sided p-value of less than 0.05 (α = 0.05) was considered statistically significant. All analyses were carried out using SAS 9.4 statistical software (SAS Institute, Cary, NC).
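The two discrimination measures used in this section, the C-statistic and the Youden index, can be made concrete with a short sketch. It is a plain-Python illustration of the definitions only, not the SAS analysis used in the study: the function names, the toy data and the rule "predict an event when the score is at or above the cutoff" are assumptions.

```python
def c_statistic(scores, outcomes):
    """Probability that a patient with the event has a higher score than a
    patient without it (ties count 0.5); equals the area under the ROC curve."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))


def youden_cutoff(scores, outcomes):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    An event is predicted when score >= cutoff (assumed convention).
    On ties in J, the lowest cutoff is kept.
    """
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one event and one non-event")
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, outcomes) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, outcomes) if s < cutoff and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy data: lesion extent (%) and combined outcome (1 = adverse event).
extent = [1.0, 2.0, 3.0, 8.0, 4.0, 12.0, 15.0, 20.0]
outcome = [0, 0, 0, 0, 1, 1, 1, 1]

auc = c_statistic(extent, outcome)
cutoff, j = youden_cutoff(extent, outcome)
```

In the toy data, 15 of the 16 event/non-event pairs are correctly ordered, and a cutoff of 4% yields a sensitivity of 1.0 and a specificity of 0.75.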

Results
A total of 1785 patients were included, and the clinical characteristics, CT-SS and pulmonary lesion distributions of the training, validation, test and prognostic value datasets are shown in Table 1. Pulmonary lesions evaluated on LDCT from the training, validation and test datasets were extracted from the manual ground-truth segmentations (O1a). Those from the prognostic value dataset were derived from the model segmentation. An example of the automated segmentation results is shown in Fig. 3. The overall test dataset of LDCT scans had a mean dose−length product of 38.75 ± 39.9 mGy·cm.

Segmentation evaluation
The results for the DSC and clinical parameters between the automatic and manual segmentations, as well as a comparison to the inter- and intraobserver performances, are shown in Table 2. The correlations between automatic and manual measures of lesion extent are presented in Table 2 and Fig. 4.

Segmentation accuracy
The DSC was 0.75 ± 0.08 for the overall lesion segmentations, 0.71 ± 0.10 for GGO segmentations, and 0.64 ± 0.09 for Cons segmentations. The MVSF results are presented in Appendix C.
The MAE was 70.3 ± 65.8 cm³ for the GGO volume, 29.5 ± 35.9 cm³ for the Cons volume and 71.4 ± 72.6 cm³ for the lesion volume. The biases were -18.3 ± 95.4 cm³ for the GGO volume, 14.4 ± 44.4 cm³ for the Cons volume and -3.9 ± 102.6 cm³ for the lesion volume, and none of these biases was found to be significant. In terms of disease extent, the MAE was 2.2 ± 2.1% for the GGO extent, 1.0 ± 1.3% for the Cons extent and 2.1 ± 2.4% for the lesion extent. The bias was not significant for the lesion extent quantification (-0.1 ± 3.2%; p = 0.59). Disease extent measures were highly correlated with the ground truth, with a lesion extent correlation of 0.947 (p<0.001).
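For readers less familiar with the two error measures, the sketch below shows how MAE and bias differ on paired measurements. It assumes that bias is the mean signed difference taken as automatic minus manual (the sign convention is not stated in the text), uses toy values rather than study data, and omits the Wilcoxon signed-rank test; the function name is hypothetical.

```python
from statistics import mean, stdev


def mae_and_bias(auto, manual):
    """Mean absolute error and bias (mean signed difference) with their SDs.

    Differences are computed as auto - manual (assumed sign convention).
    """
    diffs = [a - m for a, m in zip(auto, manual)]
    abs_diffs = [abs(d) for d in diffs]
    return {
        "mae": mean(abs_diffs),
        "mae_sd": stdev(abs_diffs),
        "bias": mean(diffs),
        "bias_sd": stdev(diffs),
    }


# Toy lesion extents (%) for four patients; not data from the study.
auto_extent = [10.0, 22.0, 5.0, 31.0]
manual_extent = [12.0, 20.0, 5.0, 30.0]

errors = mae_and_bias(auto_extent, manual_extent)
```

Because positive and negative differences cancel in the bias but not in the MAE, a method can show a bias close to zero while its MAE remains clearly above zero, which is the pattern reported here for lesion extent.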
Concerning segmentation efficiency, the mean interaction time was significantly different between manual and automated segmentation: 14.74 ± 2.9 min versus 19 seconds (p<0.001) for each patient.

Prognostic value
There were 227 patients (14%) in the prognostic value dataset who presented with the combined outcome (Table 3). After adjustment for baseline clinical characteristics, the global scores were significantly associated with outcome occurrence ("transfer to ICU and/or death and/or hospitalization ≥ 10 days and/or oxygen therapy"), and the addition of GGO or Cons did not modify the prognostic prediction for either the human or automatic radiological score. The adjusted odds ratios were 3.02 (95% CI: 2.44; 3.73) for the CT-SS and 3.86 (95% CI: 2.96; 5.05) for automatic quantification. The C-statistic was 0.82 (0.79−0.88) in Model A excluding all radiological scores, 0.89 (0.95−0.93) in Model B including CT-SS and 0.90 (0.86−0.94) in Model C including DL-driven quantification. The differences between Models A and B and between Models A and C were statistically significant (likelihood ratio tests: p<0.001). The ROC curve analysis for the DL-driven quantification of lesion extent is shown in Appendix E.

Discussion
The main finding of the study was that the proposed automatic quantification pipeline provides an accurate and reproducible segmentation of GGOs and consolidations in COVID-19 infection. With respect to the human ground-truth segmentation, the variability of the model was lower than the inter- or intraobserver variability. The presented model was computationally efficient, requiring less than 20 seconds for complete DL-driven segmentation. Its accuracy was similar regardless of the extent of the lesions. Furthermore, the presented data showed that the automatic quantification of lesion extent provides a strong prognostic marker of adverse events during COVID-19 infection.
During the COVID-19 pandemic, diagnostic imaging has multiple roles, including diagnosis, prognosis, and follow-up [23]. One potential method to obtain a precise evaluation of disease-related lesions and prognosis is to quantify the extent of the lesions. This study proposes a distinct segmentation of different COVID-19 lesions, differentiating GGO from consolidation. Most previously published works have focused on automated algorithms that help distinguish COVID-19 infection from other pulmonary infections [24,25]. One of the main strengths of the present paper was the use of LDCT as input data. COVID-19 patients might undergo multiple CT examinations for diagnosis, follow-up and evaluation of complications of SARS-CoV-2 infection. At times when LDCT is encouraged in pneumonia diagnosis, automated algorithms should be adapted to these technical modifications [26]. The training dataset had substantial variability in pulmonary lesion extent and disease severity (from 0% to 36%). Another strength of our study was that manual segmentation was conducted on all LDCT images in the training, validation and test datasets. Contrary to many segmentation models, the algorithm and obtained results were tested on all images in the test dataset (15587 images for the 30 patients) rather than on selected slices.
The literature contains a large number of CNN-based methods for automatic segmentation of lung abnormalities on CT scans. These works may be divided into three categories: those that base the training on CT scans fully annotated by experts [21,27], those that make use of weak/noisy labels to lower the annotation load [20,28,29] and those that use transfer learning to transfer knowledge from non-COVID-19 lesions [30]. Regarding the network architectures, 2D CNNs [20,21,27] and 3D CNNs [27,28,30] are both represented. Some researchers focus on the details of the architectures and advocate for additional modules, such as attention blocks [21,29]. Despite the vast number of papers proposing new architectures and modules, 8 out of the 10 finalists in the COVID-19 Lung CT Lesion Segmentation Challenge chose a U-Net architecture, as we propose [27]. In 2020, Belfiore et al. highlighted the need to quantify the percentage of ventilated lung parenchyma as distinguished from the affected lung parenchyma [31]. Here, we propose a segmentation tool that differentiates normal from affected lungs (GGO and Cons). Cons DSC, volume and extent measures were always lower than the GGO measures. Interestingly, this demonstrates the difficulty of producing Cons segmentations. This could be due to the anatomic presentation of COVID-19 consolidations, which mostly have a subpleural distribution and affect the lower segments [9]. Hence, consolidations are sometimes in continuity with the subpleural fat and the chest wall, which can lead to segmentation failure. Consolidations were the only measure whose correlation was lower for the automated measure (Auto vs. O1a) than for the interhuman measure (O1a vs. O3). This finding suggests that our model might fail partially in cases of peripheral and lower lobe consolidation.
Liu et al. proposed CT quantification of pneumonia lesions to predict the progression of severe disease and distinguished three labels: consolidation, semiconsolidation and ground glass [32]. They used a simple threshold to differentiate these labels. The authors obtained a DSC of 0.82 for COVID-19 pneumonia but did not publish the algorithmic details, biases or correlations. Chassagnon et al. presented a COVID-19 segmentation algorithm with a mean lesion DSC between automated and manual segmentation of 0.69 [25]. For GGO lesion segmentation, the DSC of 0.71 ± 0.10 in the present study was below that of Jung et al. for the automated segmentation of GGO (0.78 ± 0.07) [33]. This discordance can probably be explained by the difference in morphological patterns between parenchymal lesions and nodules.
Among all tested factors, age remained the best predictor of clinical outcome. However, the C-statistic was significantly improved when DL-driven quantification was added for the combined outcome, which confirms the benefit of adding the radiological score to evaluate the prognosis. DL-driven quantification was not superior to the CT-SS in predicting the occurrence of clinical outcomes but did not require any human input. Concerning sex, being male was no longer a risk factor after adjustment for the CT-SS and automatic CT scores. This was due to a significant difference in CT scores between men and women. The same statistical reason can explain the hypertension results. The present code is protected (IDDN.FR.001.220003.000.S.C.2020.000.31235) and can be shared upon the signing of a collaboration agreement.
Our study has some limitations. All CT images were acquired on the same CT scanner in one clinical center. Additionally, the presented algorithm cannot provide a segmentation of the distal vascular and bronchial trees. A future goal of our work should be to include arterial and bronchial segmentation in our algorithm for even more precise lesion segmentation.

Conclusion
A complete DL-driven pipeline for LDCT, which allows minimum radiation exposure, was developed to segment GGOs and consolidation due to COVID-19 lung involvement. The algorithm produces automatic lesion volume and extent measures that can be directly provided to physicians. DL-driven segmentation was more reproducible than human measures, achieving lower biases and mean absolute error than human inter- and intraobserver comparisons of lesion volume and extent. Lung involvement as quantified by our DL-driven pipeline was significantly associated with the occurrence of adverse events. This framework should be tested on multicenter datasets to evaluate disease severity at the time of the first LDCT evaluation.

Institutional review board statement
The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of ASSISTANCE-PUBLIQUE DES HOPITAUX DE MARSEILLE (AP-HM) (N°: 2020-0012, RGPD/Ap-Hm: 2020-48).

Data availability statement
None.

Note - a: Adjusted odds ratios with 95% confidence intervals. b: The C-statistic is a measure of goodness of fit for binary outcomes in a logistic regression model. It is equal to the area under the receiver operating characteristic (ROC) curve and ranges from 0.5 to 1. Models were based on the training set of the prognostic value dataset (n=1135), and the C-statistic was estimated on the validation set (n=486) of the prognostic value dataset. All scores were standardized (mean=0, standard deviation=1) prior to the analysis. LDCT: low-dose computed tomography; ICU: intensive care unit.

Fig. 2. Pipeline description in 3 steps. Note - Step 1: The algorithm selects all consecutive slices containing lung parenchyma. Step 2: The algorithm automatically segments all labels (ground-glass opacity, lung, and consolidation). Step 3: The algorithm computes clinical metrics derived from automatic segmentation.

Fig. 3. Examples of the obtained automatic segmentations (Auto) compared to the corresponding LDCT images and manual reference segmentations (Manual). Note - Normal lung (purple), consolidation (yellow), ground-glass opacity (green). A. Example 1: Mid-thoracic carina level. The second row shows higher-magnification views of the areas in the red rectangles. B. Example 2: Inferior mediastinum level. The second row shows higher-magnification views of the areas in the red rectangles.

Table 1
Characteristics of the different datasets.

Table 2
Model segmentation performances in comparison to human reproducibility on the test dataset (n=30).

Table 3
Multivariate logistic regressions for the primary endpoint.