A fully automatic deep learning system for L3 slice selection and body composition assessment on abdominal computed tomography.

Background and aims: As sarcopenia research has been gaining emphasis, the need for quantification of abdominal muscle on computed tomography (CT) is increasing. Thus, a fully automated system to select the L3 slice and segment muscle in an end-to-end manner is demanded. We aimed to develop a deep learning model (DLM) to select the L3 slice with consideration of anatomic variations and to segment cross-sectional areas (CSAs) of abdominal muscle and fat.
Methods: Our DLM, named L3SEG-net, was composed of a YOLOv3-based algorithm for selecting the L3 slice and a fully convolutional network (FCN)-based algorithm for segmentation. The YOLOv3-based algorithm was developed via supervised learning using a training dataset (n=922), and the FCN-based algorithm was transferred from prior work. Our L3SEG-net was validated with internal (n=496) and external validation (n=586) datasets. L3 slice selection accuracy was evaluated by the distance difference between ground truths and DLM-derived results. Technical success for L3 slice selection was defined when the distance difference was <10 mm. Overall segmentation accuracy was evaluated by CSA error. The influence of anatomic variations on DLM performance was evaluated.
Results: In the internal and external validation datasets, the accuracy of automatic L3 slice selection was high, with mean distance differences of 3.7±8.4 mm and 4.1±8.3 mm, respectively, and with technical success rates of 93.1% and 92.3%, respectively. However, in the subgroup analysis of anatomic variations, the L3 slice selection accuracy decreased, with distance differences of 12.4±15.4 mm and 12.1±14.6 mm, respectively, and with technical success rates of 67.2% and 67.9%, respectively. The overall segmentation accuracy of abdominal muscle areas was excellent regardless of anatomic variation, with the CSA errors of 1.38–3.10 cm².
Conclusions: A fully automatic system was developed for the selection of an exact axial CT slice at the L3 vertebral level and the segmentation of abdominal muscle areas.


Introduction
The segmentation of muscle and fat areas on abdominal computed tomography (CT) has gained huge emphasis in the last decade, as sarcopenia research has been growing rapidly. According to the revised European Working Group on Sarcopenia in Older People (EWGSOP2) 1, the muscle area on CT measured at the third lumbar vertebral level is used as a representative value because it can reflect the whole-body muscle mass [2][3][4][5].
Recently, the need to measure muscle and fat areas on CT has increased rapidly 6, increasing the demand for automatic muscle and fat measurement technologies such as deep learning models (DLMs).
Accordingly, several previous studies have developed automatic segmentation for body composition analysis using DLMs [7][8][9][10][11][12][13], and some of these methods are commercially available 14. These new automatic segmentation methods can reduce the time to measure abdominal muscle and fat areas to some degree. Still, these techniques have required manual selection of L3 slice CT images, which might be the greatest hurdle to achieving fully automatic body composition measurements. In general, it takes several minutes (around three minutes) even for experts to find the L3 slice level on abdominal CT 15.
So far, only a few studies have attempted to develop a fully automatic technique for L3 slice selection and muscle segmentation 16,17. However, these studies have not been well validated clinically; in particular, it is unclear whether their automatic L3 slice selection techniques accounted for thoracolumbar/lumbosacral variations. Such variations may occur in around 20% (4-30%) of the normal population 18,19. Therefore, developing a DLM-based automatic L3 slice selection technique requires training data with full consideration of anatomic variations.
The primary objective of this study was to develop a DLM to automatically select L3 slices on abdominal CT scans and then automatically segment areas of the abdominal muscle, visceral fat, and subcutaneous fat. The secondary objective was to validate the accuracy of DLM to select L3 slices with consideration of anatomic variations. The third objective was to validate the segmentation accuracy of DLM to measure muscle and fat areas at the L3 level. This article reports on and complies with the methods and terms described in the most recently published guidance on reading literature about machine learning for medical applications 20 .

Data acquisition: study subjects
The datasets used for this study were as follows: (1) development dataset used for developing the DLM, which was further split into the training set and tuning set; (2) validation dataset for independent testing of model performance, including an internal validation set and an external validation set. An overview of dataset composition is described in Fig. 1.
The development dataset was composed of 922 patients (560 men and 362 women; mean age, 54.4 ± 14.0 years), with 1496 abdominal CT images. The development dataset was used in our previous study 7.
In the prior article, this same dataset of 922 patients was used for the development and internal validation of the DLM that segments body composition areas. In this manuscript, the dataset was used as the development dataset for the DLM that selects the L3 slice. The development dataset included patients with various diseases and healthy subjects who underwent CT scanning for potential kidney donation. To identify anatomic variations accurately, we also obtained chest CT scans in 910 patients.
The internal validation set was composed of 500 healthy subjects who had both chest CT and abdominal CT scans acquired in our institution from March through December 2012. Four subjects who underwent interbody lumbar vertebra fusion surgery were excluded, and a total of 496 subjects with 496 CT scans were used for validation (301 men and 195 women; mean age, 53.7 ± 8.7 years). The external validation dataset included 600 patients who had both chest and abdominal CT scans, acquired between September 2011 and March 2019 from three other institutions (KHUH, AUH, and UUH). A total of 586 patients were included after excluding 14 subjects who underwent lumbar interbody fusion surgery (347 men and 239 women; mean age, 58.5 ± 12.3 years). The clinical characteristics of subjects included in the validation dataset are summarized in Table 1.
Healthy subjects underwent CT for evaluation as potential organ donors (liver or kidney); abdominal CT is part of routine clinical management for potential liver or kidney donors. Healthy subjects who underwent CT for a benign lesion were also included in the internal and external validation groups. Ultrasonography is widely used as a screening method because it carries no radiation hazard; if a focal lesion is detected on ultrasonography, CT is usually performed for further characterization of the lesion in clinical practice.
CT scanners from various manufacturers were used. Detailed specifications of the abdominal CT acquisition are summarized in Supplementary Table 1.

Generation of the ground truth
For each CT scan, the axial CT slice number of the third lumbar vertebra inferior endplate was annotated, and the lumbar vertebral anatomic variant was identified by a board-certified radiologist (J.H.) and double-checked by another radiologist (K.W.K.). In most cases, we counted the number of thoracolumbar spines and ribs in chest CT and abdominal CT scans to identify the anatomic variations accurately.
Disagreement was resolved by reaching a consensus through discussion.

Deep learning model development
Our DLM was composed of two algorithms, as follows: (1) a YOLOv3-based algorithm for selecting the L3 slice and (2) a fully convolutional network (FCN)-based algorithm for segmentation. These two algorithms were packaged in a DLM toolkit, named L3SEG-net.
Several preprocessing steps were used to generate input data, with the aim of increasing the effective dataset size, reducing overfitting, and improving accuracy. Data augmentation was performed to generate 15,226 maximum intensity projection (MIP) images from 1496 CT scans. Of these, 12,180 MIP images were used as a training set, and 3,046 images were used as a tuning set.
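As a rough illustration of how a CT volume can be collapsed into a MIP image, the sketch below keeps the voxel-wise maximum along one axis. The projection axis, the nested-list volume layout (`volume[z][y][x]`, with x running left to right and y running front to back), and the function name `max_intensity_projection` are assumptions made for illustration; the augmentation that expanded 1496 scans into 15,226 MIP images is not reproduced here.

```python
from typing import List

Volume = List[List[List[float]]]  # assumed layout: volume[z][y][x]


def max_intensity_projection(volume: Volume, axis: int = 2) -> List[List[float]]:
    """Collapse a 3D volume to a 2D image by keeping the maximum along `axis`.

    axis=0 projects along z (axial view), axis=1 along y (coronal view),
    axis=2 along x (sagittal view), under the assumed axis layout.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    if axis == 0:
        return [[max(volume[z][y][x] for z in range(nz)) for x in range(nx)]
                for y in range(ny)]
    if axis == 1:
        return [[max(volume[z][y][x] for y in range(ny)) for x in range(nx)]
                for z in range(nz)]
    if axis == 2:
        return [[max(volume[z][y][x] for x in range(nx)) for y in range(ny)]
                for z in range(nz)]
    raise ValueError("axis must be 0, 1, or 2")
```

In a sagittal or coronal projection each image row corresponds to one axial slice, so a vertical position located on the MIP image can be mapped back to a slice number.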

YOLOv3-based L3 slice selection algorithm
A YOLOv3-based algorithm was adopted because YOLOv3, which performs object detection and classification, can detect objects and extract features more efficiently than conventional convolutional neural networks 25. Our YOLOv3-based algorithm generated multiple bounding boxes to extract features from MIP images using a concept similar to feature pyramid networks 26. The L3 endplate was localized in a MIP image using the extracted features of multiple bounding boxes and their relative coordinates. The network architecture and an example of bounding boxes are shown in Fig. 3.
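
For readers unfamiliar with how YOLOv3 turns raw network outputs into a bounding box, the sketch below follows the general YOLOv3 formulation: center offsets pass through a sigmoid and are added to the grid-cell index, while width and height scale an anchor box exponentially. It is not taken from the L3SEG-net code; the function name `decode_box`, the argument names, and the pixel-unit anchors are assumptions for illustration.

```python
import math
from typing import Tuple


def sigmoid(x: float) -> float:
    """Logistic function used for the center offsets."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx: float, ty: float, tw: float, th: float,
               cell_x: int, cell_y: int,
               anchor_w: float, anchor_h: float,
               stride: int) -> Tuple[float, float, float, float]:
    """Convert raw predictions for one grid cell into a box in image pixels.

    Returns (center_x, center_y, width, height).
    """
    center_x = (sigmoid(tx) + cell_x) * stride  # offset stays inside the cell
    center_y = (sigmoid(ty) + cell_y) * stride
    width = anchor_w * math.exp(tw)             # anchor scaled by exp(tw)
    height = anchor_h * math.exp(th)
    return center_x, center_y, width, height
```

With all raw outputs at zero, the box sits at the center of its grid cell and has exactly the anchor size.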

FCN-based segmentation algorithm
Our FCN-based algorithm for automatic segmentation is described elsewhere 7. We added several postprocessing steps to separate the intramuscular adipose tissue from the skeletal muscle area (SMA) based on Hounsfield units.
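A Hounsfield-unit-based separation of this kind can be sketched as a simple window test on the pixels inside the segmented muscle region. The HU windows below (−190 to −30 for adipose tissue, −29 to 150 for muscle) are values commonly used in the body composition literature and are an assumption here, since the thresholds applied in L3SEG-net are not stated; the function name `split_muscle_region` is likewise hypothetical.

```python
from typing import Iterable, Tuple

# Assumed HU windows; the article does not state the thresholds it used.
FAT_HU = (-190, -30)
MUSCLE_HU = (-29, 150)


def split_muscle_region(hu_values: Iterable[float],
                        pixel_area_mm2: float) -> Tuple[float, float]:
    """Return (muscle_cm2, intramuscular_fat_cm2) for pixels in a muscle mask."""
    muscle_px = 0
    fat_px = 0
    for hu in hu_values:
        if FAT_HU[0] <= hu <= FAT_HU[1]:
            fat_px += 1
        elif MUSCLE_HU[0] <= hu <= MUSCLE_HU[1]:
            muscle_px += 1
        # Pixels outside both windows are counted in neither compartment.
    to_cm2 = pixel_area_mm2 / 100.0  # 1 cm^2 equals 100 mm^2
    return muscle_px * to_cm2, fat_px * to_cm2
```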
The network architecture of our FCN-based algorithm is illustrated in Supplementary Fig. 1. Our FCN-based segmentation algorithm yielded cross-sectional areas (CSAs) of SMA, visceral fat area (Vfat), and subcutaneous fat area (Sfat) in cm² at the selected L3 slice CT images. Currently, the FCN-based segmentation algorithm is available as a web-based iAID toolkit 27.

Validation of deep learning model
Accuracy of automatic L3 slice selection
In both internal and external validation cohorts, the L3 slice selection accuracy of the YOLOv3-based algorithm was evaluated by the absolute distance difference between the ground truth (GT) and the DLM-derived CT slice. The differences in CT slice numbers between the GT and the DLM-derived results were calculated and multiplied by slice thickness to generate the actual distance difference in millimeters. Technical success was defined as a distance difference between the GT and the DLM-derived results of less than 10 mm (Supplementary Fig. 2). The distance difference and technical success were separately evaluated in the normal anatomy group and the anatomic variant group.
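
The distance metric and the technical success criterion can be expressed in a few lines. The sketch below assumes contiguous slices whose spacing equals the slice thickness, as implied by the description; the function names and the tuple layout of `cases` are illustrative choices.

```python
from typing import Iterable, Tuple


def distance_difference_mm(gt_slice: int, dlm_slice: int,
                           slice_thickness_mm: float) -> float:
    """Absolute slice-number difference converted to millimeters."""
    return abs(gt_slice - dlm_slice) * slice_thickness_mm


def is_technical_success(distance_mm: float, threshold_mm: float = 10.0) -> bool:
    """Technical success requires a distance strictly below the threshold."""
    return distance_mm < threshold_mm


def technical_success_rate(cases: Iterable[Tuple[int, int, float]],
                           threshold_mm: float = 10.0) -> float:
    """Percentage of cases with technical success.

    Each case is (gt_slice, dlm_slice, slice_thickness_mm).
    """
    results = [
        is_technical_success(distance_difference_mm(gt, dlm, thick), threshold_mm)
        for gt, dlm, thick in cases
    ]
    if not results:
        raise ValueError("at least one case is required")
    return 100.0 * sum(results) / len(results)
```

Because the criterion is a strict inequality, a prediction two slices away on a 5 mm series (exactly 10 mm) counts as a technical failure.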

Segmentation accuracy of the DLM
In both internal and external validation datasets, the CSA error was calculated to evaluate the accuracy of the DLM-derived segmentation, which is a result of a combination of the YOLOv3-based L3 slice selection and the FCN-based segmentation of abdominal muscle and fat. The CSA error is a standardized percentage difference in measured areas of muscle and fat between the GT values and the DLM-derived values. Thus, a low CSA error implies a high segmentation accuracy. The CSA error was calculated using the following equation:
CSA error (%) = |CSA_GT − CSA_DLM| / CSA_GT × 100
where CSA_GT and CSA_DLM are the areas derived from the GT and the DLM, respectively. In subjects with concordant L3 levels, i.e., identical CT slice numbers from both the GT and the DLM-derived results, the Dice similarity coefficient (DSC) was also used to evaluate the segmentation accuracy of our DLM.
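
Both segmentation metrics are straightforward to compute. The sketch below assumes that the CSA error is the absolute area difference normalized by the GT area, which matches the description of a standardized percentage difference, and represents each mask as a collection of pixel coordinates; the function names are illustrative.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int]


def csa_error_percent(gt_area_cm2: float, dlm_area_cm2: float) -> float:
    """Absolute percentage difference relative to the GT area (assumed form)."""
    if gt_area_cm2 <= 0:
        raise ValueError("GT area must be positive")
    return abs(gt_area_cm2 - dlm_area_cm2) / gt_area_cm2 * 100.0


def dice_similarity(gt_mask: Iterable[Pixel], dlm_mask: Iterable[Pixel]) -> float:
    """Dice similarity coefficient between two masks given as pixel coordinates."""
    gt, dlm = set(gt_mask), set(dlm_mask)
    if not gt and not dlm:
        return 1.0  # two empty masks are treated as identical
    return 2.0 * len(gt & dlm) / (len(gt) + len(dlm))
```

The DSC needs pixel-wise overlap and is therefore only meaningful when both masks come from the same slice, whereas the CSA error compares areas and can be computed even when the selected slices differ.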

Subgroup analysis according to anatomic variation
The influence of anatomic variation on the performance of the DLM when selecting the L3 slice and segmenting muscle and fat areas was explored by subgroup analysis. The whole validation cohort, i.e., combined internal and external validation cohorts, was divided according to spinal anatomic variations. The accuracy of L3 slice selection and the segmentation accuracy of the DLM were compared between these subgroups.

Statistical analysis
The average values of distance differences between the GT and the DLM-derived L3 slices were compared between the normal anatomy group and the anatomic variant group using Student's t-test. The technical success rate, i.e., the percentage of subjects with technical success among all subjects, was compared between the normal anatomy group and the anatomic variant group using the chi-square test.
The average CSA values of SMA, Sfat, and Vfat were compared between the GT and the DLM-derived results using paired t-tests. The CSA errors were compared between subjects with technical success and subjects with technical failure in the internal and external validation datasets.
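A chi-square comparison of technical success rates between two groups reduces to a 2×2 table of success and failure counts. The sketch below computes the Pearson statistic without continuity correction and its p-value for one degree of freedom using only the standard library; whether a continuity correction was applied in this study is not stated, and the function name `chi_square_2x2` is an illustrative choice.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test for the table [[a, b], [c, d]].

    Rows are groups (e.g. normal anatomy, anatomic variant) and columns are
    outcomes (technical success, technical failure). Returns (statistic, p).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column must have a nonzero total")
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

A call such as `chi_square_2x2(20, 10, 10, 20)` uses made-up counts purely to show the interface; they are not data from this study.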

Accuracy of automatic L3 slice selection
The accuracy of the YOLOv3-based algorithm for automatic L3 slice selection in the internal and external validation datasets is summarized in Fig. 4. The mean distance differences between the GT and the DLM-derived L3 slices were 3.7 ± 8.4 mm and 4.1 ± 8.3 mm, respectively.

Segmentation accuracy of DLM-derived abdominal muscle and fat areas
The average CSAs of SMA, Sfat, and Vfat derived from GT and DLM are presented in Table 2. In all subjects of internal and external validation datasets, there were no significant differences in CSAs between the GT and DLM-derived measurements in SMA, Sfat, and Vfat (p > 0.05 for all comparisons). Even for subjects with technical failure of L3 slice selection, the CSAs did not differ significantly between the GT and the DLM-derived measurements (p > 0.05 for all comparisons).
The average CSA errors of SMA, Sfat, and Vfat in all subjects of the internal and external validation datasets ranged from 1.38-4.54% (Table 2), indicative of excellent segmentation accuracy of the DLM. When we divided them into subjects with technical success and subjects with technical failure in terms of L3 slice selection, the average CSA errors of subjects with technical failure were higher than those with technical success in both the internal and external validation groups (p < 0.05 for all comparisons).
However, such CSA errors in subjects with technical failure were relatively small in the SMA, compared with the Sfat and Vfat in both the internal and external validation datasets.
In both internal and external validation cohorts, the Bland-Altman plots also demonstrated that agreement of CSAs between the GT and DLM was higher for subjects with technical success than for subjects with technical failure (Fig. 5 and Supplementary Fig. 3).
On the Bland-Altman plots, the mean difference of SMA between the GT and DLM-derived results ranged from 0.2% to 3.0% regardless of technical success. The mean difference of Sfat ranged from −5.6% to 2.2%, and that of Vfat ranged from −3.5% to 1.9%. The mean differences between GT and DLM-derived results were probably within an acceptable range of measurement variability.
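The quantities read from a Bland-Altman plot, namely the mean difference and the 95% limits of agreement, can be computed as sketched below. The sketch assumes that each difference is taken as DLM minus GT and expressed as a percentage of the pairwise mean, which is one common convention; the convention actually used for the plots in this study is not stated, and the function name is illustrative.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman_percent(gt: Sequence[float],
                         dlm: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean_difference, lower_limit, upper_limit) in percent.

    Each difference is (DLM - GT) divided by the mean of the pair (assumed
    convention); limits of agreement are the mean difference +/- 1.96 SD.
    """
    if len(gt) != len(dlm) or len(gt) < 2:
        raise ValueError("need two equally long sequences with at least 2 pairs")
    diffs = [(d - g) / ((d + g) / 2.0) * 100.0 for g, d in zip(gt, dlm)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation
    return bias, bias - spread, bias + spread
```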
The DSC values in subjects with concordant L3 level between the GT and DLM-derived results were very high. The DSC values of SMA, Sfat, and Vfat were 0.98, 0.98, and 0.98, respectively, in the internal validation dataset and were 0.96, 0.97, and 0.97, respectively, in the external validation dataset.
Regarding the CSA errors, anatomic variation significantly influenced Sfat and Vfat measurement, with CSA errors higher than 5%, whereas it influenced SMA measurement less, with CSA errors less than 5% (Table 3). Specifically, the average CSA errors between GT and DLM-derived results were 2.22% in the normal anatomy subgroup and ranged from 2.37% to 4.06% in subgroups with anatomic variations.
Note. CSA = cross-sectional area, Sfat = subcutaneous fat area, SMA = skeletal muscle area, Vfat = visceral fat area.

Discussion
We developed the L3SEG-net, a fully automatic DLM for selecting the axial CT slice at the L3 vertebral level and segmenting the abdominal muscle area in an end-to-end manner. L3 slice selection was accurate, with mean distance differences of less than 5 mm between GT and DLM-derived results. The overall segmentation accuracy of abdominal muscle areas was also excellent, with the average CSA errors of 1.38-3.10 cm² between GT and DLM-derived results.
There are several unique characteristics in the L3SEG-net. First, the L3SEG-net is composed of two algorithms running sequentially as one process: a YOLOv3-based L3 slice selection algorithm and an FCN-based segmentation algorithm. When we upload one or multiple series of full abdominal CT images to the L3SEG-net, it automatically selects L3 slice CT images, segments muscle and fat areas, and provides color maps with measurement values. The L3SEG-net can process approximately 1,000 abdominal CT scans per day in a setting of Intel® Core™ i7-7700K CPU (8M Cache, 4.20 GHz, Santa Clara, CA, USA).
Thus, the L3SEG-net can be helpful for performing large-scale research 28. 29. In the near future, we will keep training the L3SEG-net for automatic spine labelling using further data.
Third, we demonstrated that the L3SEG-net's overall segmentation of muscle areas is accurate regardless of anatomic variation in both internal and external validation cohorts. We used CSA error as a representative value of segmentation accuracy, instead of DSC. DSC evaluation was limited to the group in which the GT and the L3SEG-net selected the same CT slice, so the DSC value can represent only the accuracy of the segmentation algorithm. Thus, with regard to clinical impact, we suggested CSA error as an indicator reflecting the accuracies of both the L3 selection algorithm and the segmentation algorithm. The average CSA errors between the GT and DLM-derived results were 2.22% in the normal anatomy subgroup and ranged from 2.37% to 4.06% in subgroups with anatomic variations. These results may be attributable to the distance difference between GT and DLM being less than the height of a vertebral body, as the maximum distance difference was 40 mm. According to a recent study, the muscle area measurements were similar between the L2 inferior endplate level and the L4 inferior endplate level 15.
Overall segmentation accuracy of SMA was consistent regardless of CT parameters or machine, as reported in a prior study 30. Various CT machines and parameters from four hospitals were used in this study, but only portal phase abdominal CT scans were used for the analysis. The segmentation accuracy was consistent across SMA, Vfat, and Sfat measurements.
There have been two prior studies which reported the performance of automatic L3 level slice selection models. However, these studies did not consider the anatomic variations in the training and validation process. Belharbi et al. 17

Fig. 3. Example of multiple bounding boxes for training of the YOLOv3-based model and architecture of our YOLOv3-based network. Multiple bounding boxes were generated in the maximum intensity projection images based on the following prerequisites as illustrated in (A): (1) the L4 vertebra was located at the iliac crest level, (2) the L3 vertebra was located superior to the L4 vertebra, (3) the morphologies of the lumbar vertebrae were the same. The YOLOv3-based model used an objectness score for each bounding box obtained from logistic regression to predict the width and height of the box as well as its location relative to the grid cell. The sum of the squared error loss was used to train the model for minimizing differences between the ground-truth object and the bounding box. Any error between the bounding box and the ground-truth object was incurred in both the classification and detection losses. Our model extracted features of the bounding boxes using the network architecture illustrated in (B). Our network architecture used successive 3×3 and 1×1 convolution layers and a set of residual blocks with shortcut connections. A total of 53 convolutional layers were arranged as in Darknet-53. YOLOv3 predicted boxes at three different scales to support detection on varying scales.