Pediatric age estimation from thoracic and abdominal CT scout views using deep learning

Age assessment is regularly used in clinical routine by pediatric endocrinologists to determine the physical development or maturity of children and adolescents. Our study investigates whether age assessment can be performed using CT scout views from thoracic and abdominal CT scans with a deep neural network. Hence, we retrospectively collected 1949 CT scout views from pediatric patients (acquired between January 2013 and December 2018) to train a deep neural network to predict the chronological age from CT scout views. The network was then evaluated on an independent test set of 502 CT scout views (acquired between January 2019 and July 2020). The trained model showed a mean absolute error of 1.18 ± 1.14 years on the test data set. A one-sided t-test showed that the difference between the predicted and actual chronological age was significantly smaller than 2.0 years (p < 0.001). In addition, the correlation coefficient was very high (R = 0.97). In conclusion, the chronological age of pediatric patients can be assessed with high accuracy from CT scout views using a deep neural network.

For the three methods using ordinal classification, a direct conversion of age in years to a class label would not allow for a prediction accuracy below one year; accordingly, ages were converted into classes by multiplying the age in years by 4, i.e., each class spans a quarter year. This conversion was reversed after prediction so that the MAE was not affected by it.
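The label conversion can be sketched as follows; `age_to_class` and `class_to_age` are illustrative helper names, assuming quarter-year bins as described:

```python
def age_to_class(age_years):
    # Multiplying by 4 turns ages into quarter-year ordinal classes,
    # allowing a prediction granularity below one year.
    return int(round(age_years * 4))

def class_to_age(label):
    # Reverse the conversion after prediction so that the MAE
    # is still computed in years.
    return label / 4.0

# A 6.5-year-old maps to class 26 and back to 6.5 years.
age_to_class(6.5)   # → 26
class_to_age(26)    # → 6.5
```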

Augmentations
Neural networks benefit strongly from small image transformations called augmentations, which change the appearance of an image without destroying the information it contains. For example, rotating a CT scout view will generally result in a different appearance but does not change the patient's age. Therefore, only small rotations (between -10 and 10 degrees), small changes in brightness and contrast (up to 10%), and random paddings (up to 32 pixels for images of 224x224 pixels and up to 64 pixels for images of 512x512 pixels) with subsequent cropping to the original image size were applied. The latter transformation amounts to randomly shifting the image in horizontal and vertical directions. In addition, since most CT scout views are centered in the upper half of the image, padding was applied anisotropically: no padding was added to the bottom of the image, since the subsequent cropping could otherwise cut off a large part of the patient's body.

Hyperparameter Optimization
The neural network architecture and its training depend on several hyperparameters that must be chosen appropriately. To optimize these parameters, "Optuna", a hyperparameter optimization library based on Tree-structured Parzen Estimators, was chosen for efficient tuning 1 . Among the parameters optimized by Optuna was the learning rate schedule: scheduling was performed by multiplying the learning rate with the chosen gamma after a given number of epochs (called step-size scheduling). Finally, the freeze parameter was optimized, which determined how many parameters of the backbone were trainable or fixed during training.
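In PyTorch, this step-size scheduling corresponds to `StepLR`; a minimal sketch, where the gamma and step-size values are merely example values from the search:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the actual backbone
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Multiply the learning rate by `gamma` after every `step_size` epochs;
# both values were among the hyperparameters tuned by Optuna.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=22, gamma=0.9)

for epoch in range(3):
    # ... one training epoch ...
    scheduler.step()
```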
For this, the layers of each network architecture were roughly divided into four parts, using the naming conventions of the respective studies by He et al. 2 . The freeze parameter determined how many of these parts were fixed during training; at its highest setting, all weights of the backbone are frozen and only the network head with fully connected layers remains trainable. Weights of fully connected layers were initialized using the method by He et al. 4 and were always trainable. A full overview of the number of trainable and non-trainable parameters can be found in Table S1.
The total number of parameter sets to be searched by Optuna was fixed to 100. In each optimization round, Optuna selected the hyperparameters of the network, which was then trained on the training data set and evaluated on the validation set by computing the MAE. After optimization, the hyperparameters of the best-performing model in terms of MAE were chosen for the final model.
After the best network structure was determined, the network was retrained using both the training and validation data sets, since, in general, neural networks benefit from more data. This retraining was conducted for the same number of epochs as the training of the best model. The retrained model was considered the final model, and its performance was then evaluated on the independent test data set. This final evaluation took place only once to avoid introducing any bias by repeatedly optimizing for the test set, which would lead to severe overfitting.

Loss curves
To judge the overall quality of the training of the best-performing models and whether overfitting occurred, their loss curves were plotted (Fig. S1).

Figure S1
Loss curves for the best-performing model. Note that since the methods use different loss functions, their absolute value cannot be compared directly (except for the two methods using L1 loss).

Benefit of DICOM tags
Since the acquisition of scout views depends on several imaging parameters like exposure time and kilovoltage peak, we tested whether such information could help improve the network predictions. For this, we extracted the following DICOM tags from each scout view: kilovoltage peak (KVP), x-ray tube current, exposure (in mAs), the computed tomography dose index (CTDIvol), and the maximum of the pixel spacing in both directions. In addition, the sex of the patient was extracted from the DICOM tags, as it is known to be an important factor in child and adolescent maturity and growth. Missing DICOM tags mainly affected the CTDIvol tag (N = 170 in the training set, N = 7 and N = 1 in the test and validation sets, respectively) and only marginally the Exposure tag (N = 5 in the training set and N = 2 in the test set). All missing DICOM tags were replaced by the mean of the corresponding tag in the training set, i.e., CTDIvol was replaced by 0.11 while Exposure was replaced with 184.9.
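The mean imputation step can be sketched as follows; the helper name and array layout are illustrative, with columns standing for tags such as KVP and CTDIvol:

```python
import numpy as np

def impute_missing(train_tags, other_tags):
    # Replace missing (NaN) entries with the per-tag mean computed on
    # the training set only, so no information leaks from the
    # validation or test data into the imputation.
    train_means = np.nanmean(train_tags, axis=0)
    filled = []
    for arr in (train_tags, other_tags):
        arr = arr.copy()
        rows, cols = np.where(np.isnan(arr))
        arr[rows, cols] = train_means[cols]
        filled.append(arr)
    return filled

# Two tag columns, e.g. KVP and CTDIvol; one CTDIvol value is missing.
train = np.array([[120.0, 0.10], [100.0, np.nan], [120.0, 0.12]])
test = np.array([[110.0, np.nan]])
train_filled, test_filled = impute_missing(train, test)
# Both missing CTDIvol entries are filled with the training mean 0.11.
```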
These tags were then fed into a small network with three fully connected layers and ReLU activations. The sizes of these layers were treated as hyperparameters and subjected to tuning. The output of this network was concatenated with the encoded feature vector from the backbone and then processed by several fully connected layers. Apart from this change, the very same procedure was performed, i.e., the same hyperparameter tuning was used.
The optimized network used a ResNet-50 backbone with pretrained weights and processed the CT scout views at full resolution; all layers were trainable. The fully connected layers processing the DICOM tags had sizes of [4, 4]; the head layers processing the concatenated feature vector had sizes of [8, 1024, 1024], the learning rate was 1.5e-4, and it was multiplied by 0.9 every 22 steps.
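A sketch of this fusion architecture in PyTorch; the class name, the exact mapping of the reported sizes onto layers, and the 2048-dimensional ResNet-50 feature vector are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class TagFusionHead(nn.Module):
    # A small MLP encodes the DICOM tag vector; its output is
    # concatenated with the backbone feature vector and processed
    # by fully connected head layers to predict the age.
    def __init__(self, n_tags=6, feat_dim=2048):
        super().__init__()
        self.tag_mlp = nn.Sequential(  # three FC layers with ReLU
            nn.Linear(n_tags, 4), nn.ReLU(),
            nn.Linear(4, 4), nn.ReLU(),
            nn.Linear(4, 4), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(feat_dim + 4, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1),  # single age output
        )

    def forward(self, features, tags):
        fused = torch.cat([features, self.tag_mlp(tags)], dim=1)
        return self.head(fused)

# Forward pass with ResNet-50-sized features (2048-d) and 6 tags.
pred = TagFusionHead()(torch.randn(2, 2048), torch.randn(2, 6))
# pred has shape (2, 1): one predicted age per sample.
```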
This network showed an MAE of 1.43 ± 1.45 years on the validation set, which was slightly higher than the 1.39 ± 1.44 years of the same procedure without DICOM tags. Therefore, it seemed that adding DICOM tags to the network did not improve predictive performance.

Software
The neural network was developed using Python 3.8 and PyTorch 1.13 5 . For reproducibility, the code for training and evaluating the neural network will be made available on GitHub (https://github.com/aydindemircioglu/scout.view.age).