
Medical Image Analysis

Volume 67, January 2021, 101857

Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes

https://doi.org/10.1016/j.media.2020.101857

Highlights

  • We create the RAD-ChestCT data set of 36,316 volumes from 19,993 unique patients.

  • We develop a method to automatically extract 83 abnormality labels from CT reports.

  • We develop a deep learning model for multi-abnormality classification of CT volumes.

  • The CT volume model achieves mean AUROC of 0.773 for all 83 abnormalities.

  • Training the CT volume model on more abnormality labels improves performance.

Abstract

Machine learning models for radiology benefit from large-scale data sets with high quality labels for abnormalities. We curated and analyzed a chest computed tomography (CT) data set of 36,316 volumes from 19,993 unique patients. This is the largest multiply-annotated volumetric medical imaging data set reported. To annotate this data set, we developed a rule-based method for automatically extracting abnormality labels from free-text radiology reports with an average F-score of 0.976 (min 0.941, max 1.0). We also developed a model for multi-organ, multi-disease classification of chest CT volumes that uses a deep convolutional neural network (CNN). This model reached a classification performance of AUROC >0.90 for 18 abnormalities, with an average AUROC of 0.773 for all 83 abnormalities, demonstrating the feasibility of learning from unfiltered whole volume CT data. We show that training on more labels improves performance significantly: for a subset of 9 labels – nodule, opacity, atelectasis, pleural effusion, consolidation, mass, pericardial effusion, cardiomegaly, and pneumothorax – the model's average AUROC increased by 10% when the number of training labels was increased from 9 to all 83. All code for volume preprocessing, automated label extraction, and the volume abnormality prediction model is publicly available. The 36,316 CT volumes and labels will also be made publicly available pending institutional approval.

Introduction

Automated interpretation of medical images using machine learning holds immense promise (Hosny et al., 2018; Kawooya, 2012; Schier, 2018). Machine learning models learn from data without being explicitly programmed and have demonstrated excellent performance across a variety of image interpretation tasks (Voulodimos et al., 2018). Possible applications of such models in radiology include human-computer interaction systems intended to further reduce the 3 – 5% real-time diagnostic error rate of radiologists (Lee et al., 2013) or automated triage systems that prioritize scans with urgent findings for earlier human assessment (Annarumma et al., 2019; Yates et al., 2018). Previous work applying machine learning to CT interpretation has focused on prediction of one abnormality at a time. Even when successful, such focused models have limited clinical applicability because radiologists are responsible for a multitude of findings in the images. To address this need, we investigate the simultaneous prediction of multiple abnormalities using a single model.

There has been substantial prior work on multiple-abnormality prediction in 2D projectional chest radiographs facilitated by the publicly available ChestX-ray14 (Wang et al., 2017), CheXpert (Irvin et al., 2019), and MIMIC-CXR (Johnson et al., 2019) datasets annotated with 14 abnormality labels. However, to the best of our knowledge, multilabel classification of whole 3D chest computed tomography (CT) volumes for a diverse range of abnormalities has not yet been reported. Prior work on CTs includes numerous models that evaluate one class of abnormalities at a time – e.g., lung nodules (Ardila et al., 2019; Armato et al., 2011; Pehrson et al., 2019; Shaukat et al., 2019; Zhang et al., 2018), pneumothorax (Li et al., 2019), emphysema (Humphries et al., 2019), interstitial lung disease (Anthimopoulos et al., 2016; Bermejo-Peláez et al., 2020; Christe et al., 2019; Christodoulidis et al., 2017; Depeursinge et al., 2012; Gao et al., 2018, 2016; Walsh et al., 2018; Wang et al., 2019), liver fibrosis (Choi et al., 2018), colon polyps (Nguyen et al., 2012), renal cancer (Linguraru et al., 2011), vertebral fractures (Burns et al., 2016), and intracranial hemorrhage (Kuo et al., 2019; Lee et al., 2019). The public DeepLesion dataset (Yan et al., 2018) has enabled multiple studies on detection of focal lesions (Khajuria et al., 2019; Shao et al., 2019). There are three obstacles to large-scale multilabel classification of whole CTs: acquiring sufficiently large datasets, preparing labels for each volume, and developing a large-scale multi-label machine learning model for the task. In this study, we address all of these challenges in order to present a fully automated algorithm for multi-organ and multi-disease diagnosis in chest CT.

Acquiring a large CT dataset appropriate for computational analysis is challenging. There is no standardized software for bulk downloading and preprocessing of CTs for machine learning purposes. Each CT scan is associated with multiple image sets (“series”), each comprising on the order of 100,000,000 voxels. These volumes need to be organized and undergo many pre-processing steps.
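
These pre-processing steps typically include clipping Hounsfield units (HU) to a fixed window, rescaling intensities, and cropping or padding each volume to a fixed grid. The sketch below illustrates this in NumPy; the window bounds and target shape are illustrative defaults, not the paper's actual settings:

```python
import numpy as np

def preprocess_volume(volume_hu: np.ndarray,
                      clip_min: float = -1000.0, clip_max: float = 800.0,
                      target_shape=(420, 420, 420)) -> np.ndarray:
    """Clip Hounsfield units to a window, rescale to [0, 1], and
    center-crop or zero-pad each axis to a fixed target shape."""
    vol = np.clip(volume_hu.astype(np.float32), clip_min, clip_max)
    vol = (vol - clip_min) / (clip_max - clip_min)  # rescale to [0, 1]
    out = np.zeros(target_shape, dtype=np.float32)
    src, dst = [], []
    for s, t in zip(vol.shape, target_shape):
        if s >= t:                       # crop the center of this axis
            start = (s - t) // 2
            src.append(slice(start, start + t))
            dst.append(slice(0, t))
        else:                            # zero-pad this axis symmetrically
            start = (t - s) // 2
            src.append(slice(0, s))
            dst.append(slice(start, start + s))
    out[tuple(dst)] = vol[tuple(src)]
    return out
```

In practice the raw HU array would first be decoded and stacked from the DICOM series (e.g., with pydicom) before this step.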

To train a multilabel classification model, each volume must be associated with structured labels indicating the presence or absence of abnormalities. Given the number of organs and diseases, manual abnormality labeling by radiologists for the thousands of cases required to train an accurate machine learning model is virtually impossible. Instead, methods that automatically extract accurate labels from radiology reports are necessary (Irvin et al., 2019; Johnson et al., 2019; Wang et al., 2017).

Prior work in automated label extraction from radiology reports can be divided into two primary categories: whole-report classifiers (Banerjee et al., 2017; Chen et al., 2018; Pham et al., 2014; Zech et al., 2018) that predict all labels of interest simultaneously from a numerical representation of the full text, and rule-based methods that rely on handcrafted rules to assign abnormality labels. Whole-report classifiers suffer two key drawbacks: they are typically uninterpretable and they require expensive, time-consuming manual labeling of training reports, where the number of manual labels scales linearly with the number of training reports and with the number of abnormalities. Rule-based systems (Chapman et al., 2001; Demner-Fushman et al., 2016; Irvin et al., 2019; Peng et al., 2018) are a surprisingly good alternative, as radiology language is rigid in subject matter, content, and spelling. We propose and validate a rule-based label extraction approach for chest CT reports designed to extract 83 abnormality labels.
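
To illustrate the flavor of such a rule-based approach (not SARLE's actual vocabulary or rules, which are far more extensive), here is a NegEx-style sketch with a hypothetical three-term vocabulary:

```python
import re

# Hypothetical mini-vocabulary for illustration only.
ABNORMALITY_TERMS = {
    "nodule": ["nodule", "nodules", "nodular"],
    "pleural_effusion": ["pleural effusion", "effusion"],
    "cardiomegaly": ["cardiomegaly", "enlarged heart"],
}
NEGATION_CUES = ["no ", "without ", "absence of ", "negative for "]

def extract_labels(report: str) -> dict:
    """Assign 0/1 labels per abnormality via sentence-level keyword
    matching with simple negation handling (NegEx-style)."""
    labels = {name: 0 for name in ABNORMALITY_TERMS}
    for sentence in re.split(r"[.\n]", report.lower()):
        negated = any(cue in sentence for cue in NEGATION_CUES)
        for name, terms in ABNORMALITY_TERMS.items():
            if not negated and any(t in sentence for t in terms):
                labels[name] = 1
    return labels
```

For example, `extract_labels("No pleural effusion. A 4 mm nodule is seen.")` marks nodule as present and pleural effusion as absent, because the negation cue applies only within its own sentence.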

Development of a multi-label classification model is challenging due to the complexity of multi-organ, multi-disease identification from CT scans. We will show that the frequency of particular abnormalities in CTs varies greatly, from nodules (78%) to hemothorax (<1%). There are hundreds of possible abnormalities; multiple abnormalities usually occur in the same scan (10±6); and the same abnormality can occur in multiple locations in one scan. Different abnormalities can appear visually similar, e.g., atelectasis and pneumonia (Edwards et al., 2016), and the same abnormality can look visually different depending on severity (e.g., pneumonia of one lobe vs. an entire lung) (Franquet, 2001), shape (e.g., smooth nodule vs. spiculated nodule), and texture (e.g., reticular vs. ground-glass) (Dhara et al., 2016). Variation in itself is not necessarily pathologic – even among “normal” scans the body's appearance differs based on age, gender, weight, and natural anatomical variants (Hansell, 2010; Terpenning and White, 2015). Furthermore, there are hardly any “normal” scans available to teach the model what “normality” is. We will show that <1% of chest CTs in our data are “normal” (i.e., lacking any of the 83 considered abnormalities). This low rate of normality is likely a reflection of requisite pre-test probabilities for disease that physicians consider before recommending CT and its associated exposure to ionizing radiation (Costello et al., 2013; Purysko et al., 2016; Smith-Bindman et al., 2009).
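
One standard way to cope with such label imbalance in a multi-label setting is binary cross-entropy over independent per-label logits, optionally with per-label positive weights to up-weight rare findings. The paper does not specify its exact loss configuration, so the following pure-Python sketch is generic, not the authors' implementation:

```python
import math

def multilabel_bce(logits, targets, pos_weights=None):
    """Mean binary cross-entropy over all labels, from raw logits.
    pos_weights (one value per label) can up-weight rare positives."""
    if pos_weights is None:
        pos_weights = [1.0] * len(logits)
    total = 0.0
    for z, y, w in zip(logits, targets, pos_weights):
        p = 1.0 / (1.0 + math.exp(-z))      # sigmoid probability
        p = min(max(p, 1e-7), 1.0 - 1e-7)   # clamp for numerical safety
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)
```

Because each label gets its own sigmoid, the model can predict any subset of the 83 abnormalities simultaneously rather than a single mutually exclusive class.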

Previous single-abnormality CT classification studies have relied on time-intensive manual labeling of CT pixels (Kuo et al., 2019; Li et al., 2019; Walsh et al., 2018), patches (Anthimopoulos et al., 2016; Bermejo-Peláez et al., 2020; Christodoulidis et al., 2017; Gao et al., 2018), or slices (Gao et al., 2016; Lee et al., 2019) that typically limits the size of the data set to <1000 CTs and restricts the total number of abnormalities that can be considered. Inspired by prior successes in the field of computer vision on identifying hundreds of classes in whole natural images (Deng et al., 2009; Rawat and Wang, 2017), we hypothesize that it should be possible to learn multi-organ, multi-disease diagnosis from whole CT data given sufficient training examples. We build a model that learns directly from whole CT volumes without any pixel, patch, or slice-level labels, and find that transfer learning and aggregation of features across the craniocaudal extent of the scan enables high performance on numerous abnormalities.
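
The aggregation idea can be sketched as follows: features are computed per group of axial slices and max-pooled across the craniocaudal axis before a multi-label classifier head. The feature extractor below is a trivial stand-in for a pretrained 2D CNN, and all names, shapes, and defaults are hypothetical:

```python
import numpy as np

def chunk_features(chunk: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained 2D CNN applied to a group of axial
    slices; simple pooled statistics are used purely for illustration."""
    return np.array([chunk.mean(), chunk.max(), chunk.min(), chunk.std()],
                    dtype=np.float32)

def predict_volume(volume: np.ndarray, weights: np.ndarray,
                   bias: np.ndarray, chunk_size: int = 3) -> np.ndarray:
    """Max-pool per-chunk features across the craniocaudal (first) axis,
    then apply a linear multi-label head with one sigmoid per abnormality."""
    feats = [chunk_features(volume[i:i + chunk_size])
             for i in range(0, volume.shape[0] - chunk_size + 1, chunk_size)]
    pooled = np.max(np.stack(feats), axis=0)   # aggregate over chunks
    logits = pooled @ weights + bias           # one logit per abnormality
    return 1.0 / (1.0 + np.exp(-logits))       # independent sigmoids
```

The key property is that no pixel-, patch-, or slice-level labels are needed: the pooling step lets volume-level supervision reach features computed anywhere along the scan.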

In this study we address the challenges of CT data preparation, automated label extraction from free-text radiology reports, and simultaneous multiple abnormality prediction from CT volumes using a deep convolutional neural network. We hope that this work will contribute to the long-term goal of automated radiology systems that assist radiologists, accelerate the medical workflow, and benefit patient care.

Section snippets

Methods

An overview of this study is shown in Fig. 1.

Automatic label extraction from free-text reports

The performance of SARLE for automatic extraction of nine labels is shown in Table 3. The SARLE-Hybrid approach achieves an average F-score of 0.930 while the SARLE-Rules approach achieves an average F-score of 0.976, indicating that the automatically extracted labels are of high quality using both approaches. For the common labels, the Hybrid and Rules approaches perform equally well, e.g., atelectasis where both SARLE-Hybrid and SARLE-Rules achieve an F-score of 1.0. For the rarer findings –
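
For reference, the F-scores quoted here are the standard F1 measure, the harmonic mean of precision and recall computed per label against manual annotations; a minimal implementation:

```python
def f_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```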

Discussion

The three main contributions of this work are the preparation of the Report-Annotated Duke Chest CT data set (RAD-ChestCT) of 36,316 unenhanced chest CT volumes, the SARLE framework for automatic extraction of 83 labels from free-text radiology reports, and a deep CNN model for multiple abnormality prediction from chest CT volumes.

The RAD-ChestCT data set is the largest reported data set of multiply annotated chest CT volumes, with 36,316 whole volumes from 19,993 unique patients. We plan to

CRediT authorship contribution statement

Rachel Lea Draelos: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization. David Dov: Conceptualization, Writing - review & editing, Software. Maciej A. Mazurowski: Conceptualization, Writing - review & editing. Joseph Y. Lo: Conceptualization, Writing - review & editing, Funding acquisition. Ricardo Henao: Conceptualization, Resources, Writing - review & editing, Funding

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We would like to thank Mark Martin, Justin Solomon, the Duke University Office of Information Technology (OIT), and the Duke Protected Analytics Computing Environment (PACE) team. We also thank the anonymous reviewers for providing insightful comments that improved the manuscript.

Funding Sources

This work was supported by NIH/NIBIB R01-EB025020, developmental funds of the Duke Cancer Institute from the NIH/NCI P30-CA014236 Cancer Center Support Grant, and GM-007171 the Duke Medical Scientist Training Program Training Grant.

References (80)

  • M. Annarumma et al.

    Automated Triaging of Adult Chest Radiographs with Deep Artificial Neural Networks

    Radiology

    (2019)
  • M. Anthimopoulos et al.

    Lung Pattern Classification for Interstitial Lung Diseases Using a Deep Convolutional Neural Network

    IEEE Trans. Med. Imaging

    (2016)
  • D. Ardila et al.

    End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography

    Nat. Med.

    (2019)
  • S.G. Armato et al.

    The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans

    Med. Phys.

    (2011)
  • I. Banerjee et al.

    Intelligent Word Embeddings of Free-Text Radiology Reports

    AMIA Annu. Symp. Proc.

    (2017)
  • Y. Benjamini et al.

    Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing

    J. R. Stat. Soc. Ser. B

    (1995)
  • D. Bermejo-Peláez et al.

    Classification of Interstitial Lung Abnormality Patterns with an Ensemble of Deep Convolutional Neural Networks

    Sci. Rep.

    (2020)
  • P. Bojanowski et al.

    Enriching Word Vectors with Subword Information

    Association for Computational Linguistics

    (2017)
  • J.E. Burns et al.

    Automated detection, localization, and classification of traumatic vertebral body fractures in the thoracic and lumbar spine at CT

    Radiology

    (2016)
  • M.C. Chen et al.

    Deep learning to classify radiology free-text reports

    Radiology

    (2018)
  • K.J. Choi et al.

    Development and validation of a deep learning system for staging liver fibrosis by using contrast agent–enhanced ct images in the liver

    Radiology

    (2018)
  • A. Christe et al.

    Computer-aided diagnosis of pulmonary fibrosis using deep learning and CT images

    Invest. Radiol.

    (2019)
  • S. Christodoulidis et al.

    Multisource transfer learning with convolutional neural networks for lung pattern analysis

    IEEE J. Biomed. Health Inform.

    (2017)
  • J.E. Costello et al.

    CT radiation dose: current controversies and dose reduction strategies

    Am. J. Roentgenol

    (2013)
  • B. de Hoop et al.

    Screening for lung cancer with digital chest radiography: sensitivity and number of secondary work-up CT examinations

    Radiology

    (2010)
  • E.R. DeLong et al.

    Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach

    Biometrics

    (1988)
  • D. Demner-Fushman et al.

    Preparing a collection of radiology examinations for distribution and retrieval

    J. Am. Med. Inform. Assoc.

    (2016)
  • Deng, J., Dong, W., Socher, R., Li, L.-.J., Li, K., Fei-Fei, L., 2009. ImageNet: a Large-Scale Hierarchical Image...
  • T.D. DenOtter et al.

    Hounsfield Unit

    (2019)
  • A.K. Dhara et al.

    A Combination of shape and texture features for classification of pulmonary nodules in Lung CT images

    J. Digit. Imaging

    (2016)
  • R.M. Edwards et al.

    A quantitative approach to distinguish pneumonia from atelectasis using computed tomography attenuation

    J. Comput. Assist. Tomogr.

    (2016)
  • T. Franquet

    Imaging of pneumonia: trends and algorithms

    Eur. Respir. J.

    (2001)
  • M. Gao et al.

    Holistic classification of CT attenuation patterns for interstitial lung diseases via deep convolutional neural networks

    Comput. Methods Biomech. Biomed. Eng. Imaging Vis.

    (2018)
  • M. Gao et al.

    Multi-label deep regression and unordered pooling for holistic interstitial lung disease pattern detection

    Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

    (2016)
  • C.A. Gatsonis et al.

    The national lung screening trial: overview and study design

    Radiology

    (2011)
  • J.M. Gibbs et al.

    Lines and stripes: where did they go? From conventional radiography to CT

    RadioGraphics

    (2007)
  • D.M. Hansell

    Thin-section CT of the lungs: the hinterland of normal

    Radiology

    (2010)
  • He, K., Zhang, X., Ren, S., Sun, J., 2015. Deep Residual Learning for Image...
  • A. Hosny et al.

    Artificial intelligence in radiology

    Nat. Rev. Cancer

    (2018)
  • N. Howarth et al.

    Missed lung lesions: side by side comparison of chest radiography with MDCT

    Diseases of the Chest and Heart 2015–2018

    (2015)