A machine learning model for detecting invasive ductal carcinoma with Google Cloud AutoML Vision
Introduction
Invasive ductal carcinoma (IDC) is a type of breast cancer that begins in the cells of the milk ducts and then invades the surrounding breast tissue. It is the most common type of breast cancer, accounting for roughly 70–80% of diagnoses according to various reports [1,2]. IDC is routinely diagnosed by a pathologist through visual examination of a patient's breast tissue sample under a microscope. To accurately delineate the region of IDC and assess its aggressiveness, a pathologist typically needs to scan large areas of the mounted whole slide at various levels of microscopic magnification (2.5x to 40x). This task is laborious and time-consuming, and it is often subject to inter-observer variability in diagnosis and interpretation [3–5]. It is especially challenging in clinical situations where a timely pathology report is expected.
With digital scanners, a microscope slide can be digitized to produce high-resolution whole slide images (WSI). Automatic image processing can then be applied to detect IDC in WSI through sophisticated image analysis and pattern recognition algorithms [6–9]. Nuclei detection and segmentation are common functional blocks in such algorithms. They are used to extract morphological features from WSI such as cell size, shape and nucleoli appearance [6,8]. Cancer nuclei exhibit distinct morphology in comparison with normal cells. They are typically larger and have coarse chromatin texture and irregular shapes [8]. These pattern recognition algorithms often combine the nuclear features with other features such as texture, topology and color for malignancy detection. One challenge for these algorithms is that their performance is very sensitive to the staining procedure and the quality of the stained slides used [6,8].
Machine learning (ML), in particular deep learning with convolutional neural networks (CNNs), is another approach that has achieved tremendous success in image classification in recent years [10–13]. A CNN-based image classifier uses layers of convolutional operations to extract important features from the pixels of input images. It then feeds the extracted features into fully-connected layers of neurons for classification. In 2014, a group of researchers from universities and hospitals published a pioneering work in the field of IDC identification with CNNs [14]. In their work, over 200,000 histopathology image patches were created from hundreds of WSIs collected from patients. Each WSI was carefully delineated by pathologists, resulting in a positive or negative IDC label for each image patch. This provided a valuable dataset from which a supervised ML model could learn IDC patterns. Using this dataset, they built a custom CNN-based ML model and validated their learning algorithm. The same dataset was later used by other researchers to validate their custom neural network models and algorithms [15]. The dataset is currently publicly available on Kaggle [16].
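The two stages described above (convolutional feature extraction followed by a fully-connected layer) can be illustrated with a minimal sketch in plain Python. This is not the architecture of Ref. [14] or of any model in this study. The 4 × 4 toy patch, the two 2 × 2 kernels, the weights and the bias are all hand-picked assumptions for illustration; in a real CNN they would be learned from labeled patches.

```python
import math


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and zero out negative ones."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def global_max_pool(feature_map):
    """Reduce a feature map to its single strongest response."""
    return max(max(row) for row in feature_map)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify_patch(image, kernels, weights, bias):
    """Convolution -> ReLU -> pooling -> one fully-connected neuron -> probability."""
    features = [global_max_pool(relu(conv2d_valid(image, k))) for k in kernels]
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(logit)


# Toy 4x4 grayscale "patch" containing a vertical dark-to-bright edge.
patch = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]

# Hand-picked kernels (assumptions, not learned filters).
kernels = [
    [[-1.0, 1.0], [-1.0, 1.0]],   # responds to vertical dark-to-bright edges
    [[1.0, 1.0], [-1.0, -1.0]],   # responds to horizontal bright-to-dark edges
]

probability = classify_patch(patch, kernels, weights=[1.5, 1.5], bias=-1.0)
print(round(probability, 4))  # 0.8808
```

Only the first kernel responds to the toy patch, so the extracted feature vector is [2.0, 0.0] and the fully-connected neuron maps it to a probability of about 0.88. A practical classifier stacks many such layers and learns the kernels and weights by gradient descent.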
While ML algorithms show great potential for IDC identification, building an effective ML model through conventional processes has been a daunting task. This is due not only to the limited availability of high-quality IDC image datasets, but also to the complexity of deep learning CNN algorithms and architectures [10–13]. Handcrafting a CNN-based ML model for IDC identification requires an experienced data scientist to carefully design, validate and tune the model [14,15,17]. This barrier, however, has recently been lowered by the rapid advancement of AutoML technology [18–20]. AutoML provides methods and processes to automatically select an appropriate model, optimize its hyperparameters and analyze the results. It can significantly simplify model creation while improving the model's accuracy through extensive search and optimization. AutoML is particularly attractive when combined at scale with cloud computing platforms such as AWS and Google Cloud Platform [21,22], where it can take full advantage of elastic cloud infrastructure and resources.
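To make the idea of automated hyperparameter optimization concrete, the following is a minimal sketch of a random search in Python. It is not how Google Cloud AutoML Vision or any other AutoML product works internally. The function `toy_validation_score` is a hypothetical stand-in for "train a model and measure its validation accuracy", and the search ranges for `learning_rate` and `num_layers` are assumptions chosen for illustration.

```python
import math
import random


def toy_validation_score(learning_rate, num_layers):
    """Hypothetical stand-in for training a model and scoring it on validation data.

    The score peaks at learning_rate=0.01 and num_layers=4. A real AutoML
    system would train and evaluate an actual model at this step.
    """
    lr_penalty = 0.1 * (math.log10(learning_rate) + 2.0) ** 2
    depth_penalty = 0.05 * (num_layers - 4) ** 2
    return 1.0 - lr_penalty - depth_penalty


def random_search(num_trials, seed=0):
    """Sample random configurations and keep the best-scoring one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    best = None
    for _ in range(num_trials):
        candidate = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform sampling
            "num_layers": rng.randint(2, 8),
        }
        score = toy_validation_score(**candidate)
        if best is None or score > best["score"]:
            best = {**candidate, "score": score}
    return best


best = random_search(num_trials=50)
print(best)
```

More trials can only match or improve the best score found for a given seed, which is the basic trade-off between search budget and model quality that cloud-scale AutoML exploits.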
This study aims to assess the feasibility of using current cutting-edge AutoML technology for IDC identification. It extends the earlier research in this area by building and evaluating an experimental ML model using Google Cloud AutoML Vision [22] instead of a custom handcrafted CNN architecture. The paper starts with a description of the public dataset used and outlines how the data is augmented and split. It then presents how the AutoML Vision model is built, together with the results of the evaluation and generalization tests. Finally, the paper concludes with remarks on the main objective of this study, the challenges we met and suggestions for further studies.
Section snippets
Original dataset
We select the IDC image dataset publicly available on Kaggle [16] as the original dataset for this study. The dataset originated from pioneering research published in Ref. [14]. It consists of 277,524 image patches of 50 × 50 px extracted from hundreds of IDC whole slide images. Each image patch was individually labeled with a positive or negative IDC class. This same dataset was also used in another published study in an effort to verify their custom neural network
AutoML Vision
Google Cloud AutoML Vision [22,24] is selected as the cloud ML service to build our experimental IDC model. AutoML Vision is Google's implementation of AutoML technology on Google Cloud Platform (GCP) for image classification and object detection. The main features of AutoML Vision include:
- Enable users with less experience to build high-quality custom ML models for their specific domains. In the spectrum of Google AI/ML offerings, AutoML Vision sits between the pretrained ready-to-use generic
Conclusions
In this study, we build and evaluate an experimental AutoML model with Google Cloud AutoML Vision for IDC identification. From the results of this study, the following conclusions can be drawn:
- The current cutting-edge AutoML technology is mature and feasible for IDC identification
- With 91.6% average accuracy (AuPRC) from the model evaluation and 84.6% balanced accuracy from the held-out test, our experimental AutoML Vision model outperforms the ones reported in the earlier studies
- We can take
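For reference, the balanced accuracy quoted in the conclusions is commonly defined for a binary classifier as the mean of sensitivity (recall on the positive class) and specificity (recall on the negative class). The sketch below, in Python, assumes that standard definition; the label lists are made-up toy values, not data from this study.

```python
def balanced_accuracy(true_labels, predicted_labels):
    """Mean of sensitivity and specificity for binary labels (1 = IDC positive)."""
    tp = fn = tn = fp = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == 1 and prediction == 1:
            tp += 1
        elif truth == 1 and prediction == 0:
            fn += 1
        elif truth == 0 and prediction == 0:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn)  # fraction of positive patches found
    specificity = tn / (tn + fp)  # fraction of negative patches found
    return (sensitivity + specificity) / 2


# Toy imbalanced example (illustrative values only): 2 positives, 8 negatives.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
preds = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

plain_accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
print(plain_accuracy)                   # 0.8
print(balanced_accuracy(truth, preds))  # 0.6875
```

In this toy case plain accuracy (0.8) looks better than balanced accuracy (0.6875) because the majority negative class dominates, which is why balanced accuracy is the more informative figure for imbalanced patch datasets.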
Declaration of competing interest
Yan Zeng and Jinmiao Zhang declare that they have no conflict of interest. Their research was partially sponsored by Cardinal Health, Inc. and Guanganmen Hospital through the use of work computers and equipment. No other types of financial support were received for this research. They do not own Google stock or have any other types of financial investment in Google.
Acknowledgements
This work was partially sponsored by Cardinal Health, Inc. and Guanganmen Hospital of China Academy of Chinese Medical Sciences. The new histopathology image samples in this study were collected from a patient in Guanganmen Hospital.
References (27)
Interobserver agreement and reproducibility in classification of invasive breast carcinoma: an NCI breast cancer family registry study. Mod. Pathol. (2006)
et al. Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. J. Pathol. Inf. (2016)
Invasive Ductal Carcinoma (IDC) (2020)
Breastcancer.org, “Invasive Ductal Carcinoma (IDC)” (2020)
Interobserver reproducibility of the Nottingham modification of the Bloom and Richardson histologic grading scheme for infiltrating ductal carcinoma. Am. J. Clin. Pathol. (1995)
Inter-observer variability between general pathologists and a specialist in breast pathology in the diagnosis of lobular neoplasia, columnar cell lesions, atypical ductal hyperplasia and ductal carcinoma in situ of the breast. Diagn. Pathol. (2014)
et al. Computerized classification of intraductal breast lesions using histopathological images. IEEE Trans. Biomed. Eng. (2011)
et al. “Log-gabor Wavelets Based Breast Carcinoma Classification Using Least Square Support Vector Machine,” 2011 IEEE International Conference on Imaging Systems and Techniques (2011)
et al. Breast cancer histopathology image analysis: a review. IEEE Trans. Biomed. Eng. (2014)
et al. A review of emerging themes in image informatics and molecular analysis for digital pathology. Annu. Rev. Biomed. Eng. (2016)
Deep learning. Nature
Deep Learning
Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput.