Bone age assessment method based on fine-grained image classification using multiple regions of interest

Bone age assessment is commonly used to determine the growth status and growth potential of children. Because bone age assessment is usually performed on radiographs of the left hand, this paper treats it as a fine-grained image classification problem and proposes an end-to-end bone age assessment model. The model is composed of four parts: a feature extractor, a Region of Interest (ROI) selection subnet, a guidance subnet, and an assessment subnet. The feature extractor is based on Convolutional Neural Networks (CNNs); ResNet50 is used to extract image features. The ROI selection subnet selects multiple informative ROIs that contain representative image features in the radiograph. The guidance subnet guides the ROI selection subnet to select ROIs more appropriately. The assessment subnet performs bone age assessment using the extracted image features. The proposed model can extract the most informative ROIs in radiographs and use them to improve the accuracy of bone age assessment. The model is evaluated on a public data set. The experimental results show that the proposed model achieves the highest accuracy, with a Mean Absolute Error (MAE) of 6.65 months.


Introduction
Bone age reflects the maturity of an individual's skeleton. Compared with chronological age, bone age better reflects the growth status and growth potential of an individual, especially for children in critical growth spurts. The final height of an individual can be predicted by combining the current height and bone age. Therefore, bone age is widely used in the selection of athletes. In addition, bone age is often used as one of the diagnostic criteria for some endocrine and genetic diseases, as these diseases can lead to abnormal bone age. Given its high application value, bone age assessment has been widely studied.
Bone age assessment is usually performed on radiographs of the left hand. Traditional bone age assessment methods are performed manually: experts assess bone age by analyzing the shape, size, texture, and other characteristics of the bones on the radiograph. Traditional methods are usually divided into three categories: counting methods, atlas (mapping) methods, and scoring methods. The counting method simply assesses bone age based on the number of ossification centers present in the left hand. The atlas method takes the whole radiograph as the comparison object, compares it with the radiographs in a standard library, selects the most similar standard radiograph, and uses its bone age as the assessment result. The GP (Greulich-Pyle) atlas method (Bayer, 1959) is the most representative atlas method. The scoring method first designates several bones of the left hand as reference bones, then assesses the developmental stage of each reference bone according to its shape, size, and texture. Each developmental stage corresponds to a certain maturity score. When using the scoring method, experts first assess the developmental stage of all reference bones, then integrate the maturity scores of all reference bones and convert the total into the final bone age. The TW (Tanner-Whitehouse) series (Morris, 1986) is the most representative scoring method, owing to its high accuracy. Long assessment time and low assessment accuracy are two major drawbacks of traditional methods. In addition, both expertise and practical experience are necessary to perform an accurate bone age assessment. Traditional methods are also affected by intra- and inter-observer variability (Berst et al., 2001): the results may differ from one expert to another, and even the same expert may produce different results for the same radiograph at different times.
Thanks to the development of deep learning techniques, deep learning-based bone age assessment methods have been extensively studied. With these methods, bone age assessment can be done in seconds, and their accuracy is comparable to that of experts. Convolutional neural networks are commonly used in these methods for their powerful feature extraction and abstraction capabilities. Some methods directly extract image features from whole radiographs and use these features to perform bone age assessment. These methods usually use an end-to-end model that directly outputs the bone age assessment result for an input radiograph. However, radiographs contain many irrelevant areas that can harm the accuracy of bone age assessment, so other methods were proposed that first segment the radiographs to extract representative ROIs and then use these ROIs for bone age assessment. Such methods are highly interpretable, but selecting the regions of interest requires a certain level of expertise. In addition, ROI extraction requires a large amount of extra annotation, such as key-point coordinates and bounding box sizes.
In this paper, a novel end-to-end bone age assessment model is proposed that can extract multiple informative ROIs from radiographs and use them to improve the accuracy of bone age assessment. During model training, the only required annotation is the bone age corresponding to each radiograph. Fine-grained classification distinguishes subcategories within the same category, yielding more precise classification results; its most prominent characteristics are large intra-class variation and small inter-class variation. Bone age assessment is considered a fine-grained classification problem in this paper for the following reasons: (1) Large intra-class variation. Bone age assessment combines the shape, size, and texture information of several reference bones; two radiographs with the same bone age may be similar in only a small portion of the reference bones while differing considerably elsewhere. (2) Small inter-class variation. Two radiographs with different bone ages may differ significantly in only one reference bone, as some bones play an outsized role in bone age assessment.
The main contributions of this paper are as follows: (1) A novel end-to-end bone age assessment model is proposed. For an input radiograph, multiple informative ROIs can be selected and used to make an accurate bone age assessment. (2) The image feature fusion method in the proposed model is improved. With the improved fusion method, the spatial and semantic information in different image feature maps can be fused more effectively, so the model can extract ROIs more accurately. (3) The classification loss function is improved. With the improved classification loss function, the model also produces good assessment results when the samples are imbalanced. (4) The proposed model is validated on a public dataset. The experimental results demonstrate that it is well suited to the bone age assessment task and outperforms other methods, achieving an MAE of 6.65 months.
The remainder of this paper is organized as follows: Section 2 of this paper summarizes the related work in the field of bone age assessment in China and abroad; Section 3 provides a detailed description of the methods used in this paper; Section 4 describes the experimental setup and results of experiments; Section 5 presents the conclusions and future directions.

Related work
In the past few years, various bone age assessment methods have been proposed, among which deep learning-based methods have been widely studied. Spampinato et al. (2017) used the OverFeat network (Sermanet et al., 2013) to extract image features from radiographs and a regression network to assess bone age. In their method, a deformation layer was added to the network to increase its robustness, making it adaptable to different hand postures. Tong et al. (2018) designed a bone age assessment model combining heterogeneous feature learning. In their method, a convolutional neural network based on VGG (Simonyan & Zisserman, 2014) extracts image features from the radiographs, and a support vector regression machine with multiple kernel learning then assesses bone age, combining the extracted features with information such as race and gender. Wu et al. (2019) designed a bone age assessment model based on the TW scoring method, using a Mask R-CNN subnet to extract the ROIs of the reference bones and a residual attention network to perform the assessment. Pan et al. (2020) also improved bone age assessment from the perspective of image preprocessing: they combined UNet (Ronneberger et al., 2015) with reinforcement learning to extract ROIs of reference bones from radiographs using only a small number of annotated images. Lee et al. (2020) manually labeled key points on the radiographs and used them to select several ROIs from large to small; these ROIs were classified with convolutional neural networks to assess bone age, and ultimately the smallest ROI yielded the best results. To train a bone age assessment model on small-scale radiograph datasets, Miao et al. (2020) designed a light convolutional neural network in which a SuperPoint network extracts robust feature maps; these feature maps are then downscaled and fed into an optimized classification network to produce the bone age assessment result. Marouf et al. (2020) considered gender information a fundamental aspect of bone age assessment. Their model contains two parts, both based on convolutional neural networks: the first performs gender classification on the input image, and the second performs the bone age assessment, taking as input the gender information from the first part together with the image information from the radiograph.
Most of the above studies performed bone age assessment on entire radiographs. These methods use features extracted by CNNs, or manually designed features, so the accuracy of the assessment depends on the quality of the extracted features. Other methods first extract different ROIs from the radiographs and then use these ROIs for bone age assessment, which requires a large amount of extra annotation: in addition to the bone age value of each radiograph, localizing the ROIs often requires information such as bounding box sizes and the coordinates of key points. Moreover, most of the models proposed in these approaches are implemented in multiple stages and cannot be trained end-to-end.
Our proposed model is an end-to-end trainable bone age assessment model. It is able to extract multiple informative ROIs from a radiograph and use them to improve the accuracy of bone age assessment. During model training, no extra annotation is required beyond the image-level annotation of the true bone age value.

Method
In this paper, a novel bone age assessment model is proposed. The model uses the fine-grained classification model proposed by Yang et al. (2018) as its baseline. The proposed bone age assessment model consists of four parts: a feature extractor, an ROI selection subnet, a guidance subnet, and an assessment subnet. The overall structure of the model is shown in Figure 1.
The input of the model is a radiograph of the left hand. The feature extractor extracts image features from the radiograph. The ROI selection subnet then selects the K ROIs with the most representative image features. The guidance subnet evaluates the ROIs selected by the ROI selection subnet and calculates the classification confidence of each ROI; these confidences are fed back to the ROI selection subnet to guide it toward better ROI choices. Finally, the assessment subnet combines the image features of the whole radiograph with those of the K selected ROIs to perform bone age assessment and output the result. In this paper, K is set to 4.
The entire bone age assessment network can be trained end-to-end. The core idea of the model is to select the most informative ROIs to help improve the accuracy of bone age assessment. During the model training, the ROI selection subnet is continuously optimized based on the feedback from the guidance subnet, so that the most informative ROIs can be selected. Ultimately, these extracted ROIs can effectively improve the accuracy of bone age assessment.

Feature extractor
In this paper, ResNet50 (He et al., 2016) is used as the base network of the feature extractor. The fully connected layer and the Softmax layer of the ResNet50 network are removed, and only the feature extraction part of the network is used. Image features in the radiographs are extracted by the feature extractor, producing feature maps that are then input to the ROI selection subnet. The residual connection structure, shown in Figure 2, is the most important feature of ResNet.
For an input feature map X, an ordinary convolutional layer has to fit a desired underlying mapping H(X) directly. In the residual connection structure, a shortcut link is introduced (the dashed line in Figure 2), which passes the input X directly to the next layer of the network. The function to be fitted becomes H(X) = F(X) + X, so the objective of the convolutional layer changes to learning the residual function F(X) = H(X) − X. In this way, the output of the mapping function becomes more sensitive to changes in the input. During backpropagation, small loss gradients can reach the shallow layers directly through the shortcut. The residual connection structure thus alleviates the degradation problem of deep networks, allowing deeper network structures to be trained.
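The residual mapping y = F(X) + X described above can be sketched in a few lines. This is an illustrative toy block, not the paper's code: small linear maps stand in for the convolutions of a real ResNet block, and the point is the identity shortcut that adds the input back to the learned residual.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ResidualBlock:
    """Toy residual block: y = relu(F(x) + x), with F a small two-layer map.

    Linear layers stand in for the convolutions of a real ResNet block;
    the identity shortcut lets gradients reach shallower layers directly.
    """
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Tiny weights, so F(x) starts near zero and the block is near-identity.
        self.w1 = rng.standard_normal((dim, dim)) * 0.01
        self.w2 = rng.standard_normal((dim, dim)) * 0.01

    def forward(self, x):
        residual = self.w2 @ relu(self.w1 @ x)  # F(x): the learned residual
        return relu(residual + x)               # shortcut adds the input back

block = ResidualBlock(dim=8)
x = np.ones(8)
y = block.forward(x)  # with near-zero F, y stays close to x
```

Because F starts near zero, the block initially behaves like the identity, which is exactly why residual networks avoid the degradation problem as depth grows.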

ROI selection subnet
The role of the ROI selection subnet is to select the most informative ROIs in the radiograph based on the image features extracted by the feature extractor.
Image features extracted by CNNs carry two kinds of information: semantic information and spatial information. Semantic information plays a key role in image classification, while spatial information plays a key role in localizing key features. The shallow layers of a network usually extract simple image features such as size, texture, and edges; these shallow features carry weak semantic information but strong spatial information. As the depth of the network increases, more complex and abstract features are extracted; these deep features carry strong semantic information but lack spatial information. To make full use of both the semantic and the spatial information in the radiograph, the FPN (Feature Pyramid Network) (Lin et al., 2017a) + PAN (Path Aggregation Network) (Liu et al., 2018) structure is used in this paper for feature fusion. The structure of FPN + PAN is shown in Figure 3. M1, M2, M3 are the feature maps extracted by the feature extractor, ranging from shallow to deep. These feature maps are first fused using the FPN structure, a top-down pyramid in which different feature maps are fused through lateral connections, producing a new set of fused feature maps. Figure 4 shows the structure of the lateral connection; we use the feature maps M1 and M2 as an example. The deeper feature map M2 is upsampled by bilinear interpolation to obtain a new feature map M2', which has the same size as M1. A 1 × 1 convolution is then applied to M2', after which M2' is fused with M1 by element-wise addition. Since deep features tend to carry more semantic information, the semantic information of every feature map is enhanced after fusion through the FPN structure.
The PAN structure, a bottom-up pyramid, is applied after the FPN structure: the shallow feature maps, which are rich in spatial information, are fused with the deep feature maps to enhance their spatial information. In the PAN structure, average pooling is used to downsample the shallow features, and the downsampled features are fused with the deep features by element-wise addition. The final output is a set of feature maps with both rich spatial information and strong semantic information.
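The two fusion passes above can be sketched as follows. This is an illustrative NumPy sketch under simplifying assumptions, not the paper's implementation: all maps share one channel count (so the channel-matching 1 × 1 convolutions are omitted), and nearest-neighbour upsampling stands in for bilinear interpolation.

```python
import numpy as np

def upsample2x(m):
    # Nearest-neighbour upsampling (stand-in for bilinear interpolation).
    return m.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(m):
    # 2x2 average pooling, as used by PAN to shrink shallow maps.
    c, h, w = m.shape
    return m.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def fpn_pan(m1, m2, m3):
    """Fuse shallow-to-deep maps of shape (c, h, w):
    FPN top-down pass, then PAN bottom-up pass, both via element-wise addition."""
    # FPN: semantic information flows from deep to shallow.
    p3 = m3
    p2 = m2 + upsample2x(p3)
    p1 = m1 + upsample2x(p2)
    # PAN: spatial information flows from shallow back to deep.
    n1 = p1
    n2 = p2 + downsample2x(n1)
    n3 = p3 + downsample2x(n2)
    return n1, n2, n3

rng = np.random.default_rng(0)
m1 = rng.standard_normal((8, 16, 16))  # shallow: large, spatially detailed
m2 = rng.standard_normal((8, 8, 8))
m3 = rng.standard_normal((8, 4, 4))    # deep: small, semantically rich
n1, n2, n3 = fpn_pan(m1, m2, m3)       # same shapes, both kinds of information fused
```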
Using the fused feature maps, the ROI selection subnet calculates the amount of information of each candidate ROI in the original image, defined as I(R), which measures how informative the ROI is. The purpose of the ROI selection subnet is to select the K most informative ROIs in descending order of informativeness, denoted R1, R2, ..., RK, where K is a configurable parameter. The K sorted ROIs satisfy the condition I(R1) > I(R2) > ... > I(RK).
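The ordering requirement above amounts to a top-K selection over candidate informativeness scores. A minimal sketch (the candidate scores and ROI payloads here are hypothetical, not from the paper):

```python
def select_top_k_rois(candidates, k=4):
    """Pick the k candidate ROIs with the highest informativeness I(R),
    sorted so that I(R1) > I(R2) > ... > I(Rk).

    `candidates` is a list of (informativeness, roi) pairs; the roi payload
    (e.g. box coordinates) is opaque here.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    return ranked[:k]

# Hypothetical candidate regions with informativeness scores.
candidates = [(0.2, "box_a"), (0.9, "box_b"), (0.5, "box_c"),
              (0.7, "box_d"), (0.1, "box_e")]
rois = select_top_k_rois(candidates, k=4)
# rois[0] is the most informative candidate ("box_b").
```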

Guidance subnet
The role of the guidance subnet is to optimize the ROI selection subnet. For the K ROIs selected by the ROI selection subnet, the guidance subnet calculates their classification confidences, defined as C(R). The calculated confidences C(R1), C(R2), ..., C(RK) are fed back to the ROI selection subnet. The classification confidence is the probability of classifying an ROI as its corresponding true label. A greater classification confidence means that the corresponding ROI is more helpful for classifying the whole radiograph, and therefore the ROI should be more informative. The ROI selection subnet is continuously optimized during model training based on the feedback from the guidance subnet, so that the K finally selected ROIs satisfy formula (1):

C(R1) > C(R2) > ... > C(RK)    (1)

That is, the more informative an ROI selected by the ROI selection subnet is, the higher its classification confidence should be.

Assessment subnet
The role of the assessment subnet is to perform bone age assessment. For the input radiograph, the feature extractor extracts a feature from the entire image, denoted FA. The ROI selection subnet selects the K most informative ROIs, R1, R2, ..., RK, which are resized to the same size as the whole radiograph. The feature extractor is then used to extract the image features of these K ROIs, denoted F1, F2, ..., FK. These K + 1 image features, the whole-radiograph feature together with the K ROI features, are concatenated to form a new feature FALL. FALL is passed through two fully connected layers to obtain the bone age assessment result. To increase the nonlinearity of the model, ReLU is used as the activation function between the two fully connected layers.
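The concatenate-then-classify step can be sketched as follows. The dimensions and random weights are illustrative only; the 229 output classes are an assumption based on the 0 to 228 month label range described later in the paper.

```python
import numpy as np

def assess(f_whole, f_rois, w1, b1, w2, b2):
    """Assessment head: concatenate the whole-image feature with the K ROI
    features, then apply two fully connected layers with a ReLU between
    them. Returns one logit per bone-age class (month)."""
    f_all = np.concatenate([f_whole] + f_rois)     # FALL, length (K + 1) * d
    hidden = np.maximum(w1 @ f_all + b1, 0.0)      # first FC layer + ReLU
    return w2 @ hidden + b2                        # second FC layer -> logits

rng = np.random.default_rng(0)
d, k, n_classes = 32, 4, 229            # 229 classes assumed for 0..228 months
f_whole = rng.standard_normal(d)                    # FA
f_rois = [rng.standard_normal(d) for _ in range(k)] # F1..FK
w1 = rng.standard_normal((64, (k + 1) * d)) * 0.1
b1 = np.zeros(64)
w2 = rng.standard_normal((n_classes, 64)) * 0.1
b2 = np.zeros(n_classes)
logits = assess(f_whole, f_rois, w1, b1, w2, b2)
```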

Loss function
The overall loss function of the proposed model consists of three components: the ROI selection loss, the guidance loss, and the assessment loss, denoted LR, LG, and LA, respectively.

The ROI selection loss uses a pairwise ranking loss to align the informativeness ranking of the K ROIs with their confidence ranking. The ROI selection loss LR is defined as follows:

LR = Σ_{(i,s): C(Ri) < C(Rs)} f(I(Rs) − I(Ri)),  i, s ∈ [1, K]    (2)

Here, the hinge loss f(x) = max{1 − x, 0} is used as the loss function f. Whenever the informativeness ranking and the confidence ranking of a pair of ROIs are inconsistent, the pair is penalized, which increases the loss. After training, the network is eventually able to select K ROIs whose informativeness ranking and confidence ranking have the same order.

The guidance loss LG is defined as follows:

LG = Σ_{i=1}^{K} Lcls(Ri) + Lcls(RX)    (3)

Here, the first term is the sum of the classification losses of the K selected ROIs, and the second term is the classification loss of the whole radiograph, where RX represents the whole radiograph. In this paper, Focal Loss (Lin et al., 2017b) is used to calculate the classification loss Lcls, defined as follows:

FL(pt) = −(1 − pt)^γ log(pt)    (4)

Here, pt is the confidence of classifying the image as its true label and γ is a settable parameter. Focal Loss forces the network to focus on the difficult and misclassified samples during training: when an image is classified correctly, the loss is small, whereas when it is classified incorrectly, the loss is larger. In this paper, γ is set to 2 throughout.

The assessment loss LA is shown in formula (5):

LA = Lcls(RALL)    (5)

The assessment loss is the classification loss obtained when the assessment subnet performs bone age assessment, where RALL represents the image features of the whole radiograph combined with those of the K selected ROIs.

The loss of the whole proposed model LAll is defined as follows:

LAll = LR + LG + LA    (6)
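A minimal sketch of the focal loss and of a hinge-based pairwise ranking loss of the kind described in the text (an illustrative implementation, not the authors' code; the exact pair-weighting in the paper may differ):

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss FL(p_t) = -(1 - p_t)**gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def hinge(x):
    # f(x) = max{1 - x, 0}
    return max(1.0 - x, 0.0)

def roi_selection_loss(informativeness, confidence):
    """Pairwise ranking loss: for every pair where ROI i has higher
    confidence than ROI s, require its informativeness to exceed that of
    ROI s by a margin; inconsistent pairs increase the loss."""
    k = len(informativeness)
    loss = 0.0
    for i in range(k):
        for s in range(k):
            if confidence[i] > confidence[s]:
                loss += hinge(informativeness[i] - informativeness[s])
    return loss
```

For a pair whose informativeness order agrees with its confidence order by a margin of at least 1 the hinge term vanishes, so a perfectly consistent ranking contributes zero loss.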

Data overview
The dataset used in this paper is a publicly available dataset for bone age assessment, published by the Radiological Society of North America (RSNA) in 2017 (Halabi et al., 2018), namely the RSNA dataset. The dataset contains 14,236 radiographs of the left hand from two children's hospitals, with corresponding reports. The whole dataset is divided into three parts: a training set of 12,611 samples, a validation set of 1,423 samples, and a test set of 200 samples. For each image, the average of the bone age value from the corresponding report and the bone age assessments of 3 additional experts was used as its reference bone age value, which served as the training label. The bone ages of all images range from 0 to 228 months, and their distribution is shown in Table 1.

Evaluation metric
For bone age assessment tasks, the commonly used evaluation metric is the MAE, which is defined as follows:

MAE = (1/m) Σ_{i=1}^{m} |BA_i^p − BA_i^gt|

where m is the total number of samples assessed, BA_i^p is the bone age assessment result of the i-th sample, and BA_i^gt is the label value of the i-th sample. In clinical practice, one year is commonly used as a threshold to judge whether an adolescent is developing early or late, and MAE alone does not fully reflect the merits of a model. Therefore, in addition to the MAE, this paper uses two further metrics: the accuracy of absolute error within 1 year and the accuracy of absolute error within 0.5 years.
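These metrics are straightforward to compute; a minimal sketch with made-up predictions (bone ages in months, with 12 and 6 months as the 1-year and 0.5-year thresholds):

```python
def evaluate(pred_months, true_months):
    """MAE in months, plus accuracy within 12 months and within 6 months."""
    m = len(pred_months)
    abs_err = [abs(p - t) for p, t in zip(pred_months, true_months)]
    mae = sum(abs_err) / m
    acc_1y = sum(e <= 12 for e in abs_err) / m    # within 1 year
    acc_half = sum(e <= 6 for e in abs_err) / m   # within 0.5 years
    return mae, acc_1y, acc_half

# Made-up predictions and labels for illustration only.
preds = [100, 55, 130, 24]
truth = [95, 60, 150, 25]
mae, acc_1y, acc_half = evaluate(preds, truth)
# absolute errors: 5, 5, 20, 1 -> MAE 7.75; 3 of 4 within both thresholds
```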

Experimental environment configuration and experimental parameters
The configuration of the experimental environment is shown in Table 2.
In the model training phase, the batch size is set to 4 and the number of epochs is set to 120; an epoch is completed when all images in the training set have been processed once.
The Stochastic Gradient Descent (SGD) optimizer was used for model training, with the learning rate set to 0.0001, the weight decay set to 0.01, and the momentum set to 0.9.
In addition, the MultiStep LR strategy was used in the experiments. It is a learning rate decay strategy that dynamically adjusts the learning rate during model training: the learning rate drops to 10% of its current value after 60 and again after 100 completed epochs.
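The resulting schedule follows PyTorch's MultiStepLR semantics and can be reproduced in a few lines (a sketch assuming milestones at 60 and 100 epochs and a decay factor of 0.1, as stated above):

```python
def multistep_lr(base_lr, epoch, milestones=(60, 100), gamma=0.1):
    """Learning rate after `epoch` completed epochs: multiplied by `gamma`
    each time a milestone is passed."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Learning rate at a few points in the 120-epoch run (base LR 0.0001):
schedule = [multistep_lr(1e-4, e) for e in (0, 59, 60, 99, 100, 119)]
# epochs 0-59: 1e-4; epochs 60-99: 1e-5; epochs 100-119: 1e-6
```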

Base network selection
In the proposed bone age assessment model, all three subnets share one feature extractor. Therefore, in order to obtain better accuracy of bone age assessment, it is especially important to choose a convolutional neural network with powerful feature extraction ability as the base network.
We conducted experiments to compare several popular convolutional neural networks: ResNet50, Inception-v3 (Szegedy et al., 2016b), Inception-ResNet-v2 (Szegedy et al., 2016a), and DenseNet (Huang et al., 2017). In all experiments in this paper, bone age labels were discretized by month, with each month of bone age treated as a separate class. The evaluation results obtained by training the four above-mentioned CNNs on the dataset are shown in Table 3. The experiment using ResNet50 obtained the best results, with an MAE of 9.40 months, outperforming the other three networks.

Feature fusion methods
The ROI selection subnet uses the feature maps extracted by the feature extractor to select the most representative ROIs, thereby improving the bone age assessment accuracy of the model. The key factor here is the quality of the feature maps. It is therefore essential to adopt a feature fusion method that fuses the shallow feature maps with the deep feature maps. After feature fusion, both the semantic and the spatial information in the feature maps are enhanced, which benefits the ROI selection subnet.
In the baseline model, the FPN structure is used for feature fusion. However, the FPN is a top-down pyramid: it fuses semantic information from the deep feature maps into the shallow feature maps well, but it does not propagate spatial information from the shallow feature maps into the deep feature maps. The PAN structure solves this problem by fusing spatial information from the shallow feature maps into the deep feature maps. Therefore, in this paper, a PAN structure is added after the FPN structure to form an FPN + PAN structure for feature fusion. A set of experiments was designed to compare the feature fusion methods, and the results are shown in Table 4. When the FPN + PAN structure was used for feature fusion, the bone age assessment accuracy of the whole model was higher, with the MAE decreasing by 0.18 months.

Determination of parameters in focal loss
In this paper, Focal Loss was used as the classification loss function, and γ is a settable parameter in the Focal Loss function. When γ is greater than 0, the model focuses more on hard-to-classify samples during training, because the loss contributed by easy-to-classify samples is reduced. We set γ to 0.25, 0.5, 2, and 5 and conducted experiments separately; the results are shown in Table 5. When γ is set to 2, our model obtains the best results. Table 6 shows the performance of several bone age assessment methods, including the model proposed in this paper and those proposed in other papers. The first three rows of the table report methods proposed in recent years that were evaluated on the RSNA dataset. The last four rows report the experiments designed in this paper. Since radiographs of different genders with the same bone age differ considerably, the dataset was separated by gender and the experiments were conducted on each gender separately. These experiments verify the improvement of the baseline model by the different feature fusion structures and classification loss functions. The results show that using Focal Loss as the loss function reduced the MAE by 0.22 months, and using the FPN + PAN structure for feature fusion reduced the MAE by 0.18 months. These results demonstrate that the change in the feature fusion method and the improvement of the classification loss function both contribute to the accuracy of the baseline model.
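The effect of γ described above can be seen numerically: as γ grows, the loss of an easy, well-classified sample shrinks far faster than that of a hard one, so hard samples dominate training. The p_t values below are illustrative, not from the paper's data.

```python
import math

def focal_loss(p_t, gamma):
    # FL(p_t) = -(1 - p_t)**gamma * log(p_t)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Ratio of a hard sample's loss (p_t = 0.3) to an easy sample's loss
# (p_t = 0.9), at the gamma values tried in the experiments.
ratios = {g: focal_loss(0.3, g) / focal_loss(0.9, g)
          for g in (0.25, 0.5, 2.0, 5.0)}
# The ratio grows monotonically with gamma: larger gamma down-weights
# easy samples more aggressively.
```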

Performance comparison
The proposed bone age assessment model finally reached the highest accuracy of bone age assessment with an MAE of 6.65 months, which outperforms other bone age assessment methods.
In addition, we report the accuracy of each method's assessment error within 1 year and within 0.5 years in the last two columns of Table 6. The results show that our proposed model obtains the highest accuracy rates: 88.53% within 1 year and 62.87% within 0.5 years.

Conclusion and future work
In this paper, bone age assessment is treated as a fine-grained classification problem. Based on a fine-grained classification model, a novel bone age assessment model that can be trained end-to-end is designed and implemented. The model is able to select multiple informative ROIs from radiographs and use them to improve the accuracy of bone age assessment. The experimental results demonstrate that the model can successfully extract the most informative ROIs and performs bone age assessment with an MAE of 6.65 months, which is better than other current studies. Two improvements are planned for future work. First, the morphology of the bones of the left hand differs greatly between age stages, so training separate models on radiograph samples from different stages may yield better classification accuracy. Second, when analyzing the experimental results, we found that some very poor bone age assessments were caused by abnormal hand posture. Data augmentation methods will therefore be considered to improve the robustness of the model, so that it can adapt to radiographs with different hand postures.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Data availability statement
The data that support the findings of this study are available on Kaggle at https://www.kaggle.com/kmader/rsnabone-age