A Deep Convolutional Architectural Framework for Radiograph Image Processing at Bit Plane Level for Gender & Age Assessment

Abstract: Assessing the age of an individual via bones serves as a foolproof method in the true determination of individual skills. Several attempts have been reported in the past to assess the chronological age of an individual based on a variety of discriminative features found in wrist radiograph images. Permutations and combinations of these features achieved satisfactory accuracies only for a limited set of groups. In this paper, gender assessment for individuals of chronological age between 1 and 17 years is performed using left hand wrist radiograph images. A fully automated approach is proposed for removing the noise that persists due to non-uniform illumination during radiograph acquisition. Subsequently, a computational technique for extraction of the wrist region is proposed using operations on specific bit planes of the image. A deep convolutional neural network framework called GeNet is applied for classification of the extracted wrist regions into male and female. The experiments are conducted on the datasets of the Radiological Society of North America (RSNA), comprising about 12442 images. The preprocessing and segmentation techniques resulted in a correlation of about 99.09%. The performance of GeNet is evaluated on the extracted wrist regions, resulting in an accuracy of 82.18%.


Introduction
Age is a reflection of the level of development of bones. In this regard, the bones present in the left hand wrist region play a dominant role, because the left hand is the non-dominant hand for most individuals. Assessment of gender based on bone development in wrist images is one of the challenging problems in pattern recognition. Several factors of bone development are employed in predicting an individual's ultimate height and gender, in approximating the pubertal age of a child and in diagnosing various growth related disorders. Assessing the gender of an individual via bones helps in archaeological investigations and in the forensic identification process of various crime investigations [Santosh (2019)].
Giordano et al. [Giordano, Spampinato, Scarciofalo et al. (2010)] addressed prediction of the bone maturity state for ages 0-10 years for males and 0-7 years for females. Fuzzy theory with principal component analysis is applied on carpal bones to estimate the maturity of skeletons. Hsieh et al. [Hsieh, Liu, Jong et al. (2010); Villar, Alemán, Castillo et al. (2017)] proposed a fuzzy-based growth model for bone age prediction and for age estimation from the pubic bone. A bag of features method along with a random forest classifier is used for classification of features on phalangeal bones by Simu et al. [Simu and Lal (2017)]. In a different work, Lee et al. [Lee, Tajmir, Lee et al. (2017)] proposed a method to extract the region of interest using a deep CNN pre-trained on ImageNet. A study on carpal and phalangeal bone analysis by Gertych et al. [Gertych, Zhang, Sayre et al. (2007)] showed that carpal bones are efficient for bone age assessment for boys of age 1-7 and girls of age 1-5, and that phalangeal regions can be used for both males and females above age 13. Bull et al. [Bull, Edwards, Kemp et al. (1999)] carried out a comparative study on bone age assessment based on the Greulich and Pyle and TW2 methods. In that study, 362 radiograph images are employed and the TW2 method is found to be more accurate than the Greulich and Pyle method. Thangam et al. [Thangam, Mahendiran and Thanushkodi (2012)] described various alternative ways of developing a model for bone age assessment, including fuzzy sets, dynamic thresholding, region based techniques, bone labeling, and Bayesian and regression techniques. Mansourvar et al. [Mansourvar, Ismail, Herawan et al. (2013)] surveyed various challenges involved in bone age assessment related to birth documents and legal issues. Pietka et al. [Pietka, Kaabi, Kuo et al. (1993)] proposed a method for bone age assessment based on carpal bone features using thresholding techniques. Tanner et al. [Tanner and Gibbons (1994)] proposed a method using physical changes and their rate of change as a measure to estimate the rate of maturation of bones.
From the analysis above, some of the critical observations from the literature in the proposed area of study are: i) poor efficiency of segmentation due to lack of image preprocessing protocols, ii) region of interest extraction is not fully automated, which poses the main challenge for image segmentation, iii) the preprocessing steps found lacking include hand-to-background ratio and hand orientation correction, with limited success due to faint bone edges, and iv) enhanced feature engineering to identify an optimal set of bones is lacking. A few challenges from the non-technical point of view are: i) poor visibility of carpal bones in radiographs of very young children, ii) lack of an established sequence of appearance of some of the carpal bones as the child grows, iii) difficulty in discriminating the different levels of merging between these age groups and iv) overlap of phalangeal bones, which introduces challenges in the development of computer vision algorithms.
From this perspective, wrist radiographs may provide a reliable source for gender classification across the age range of 1-17 years. In the remainder of the paper, Section 2 presents the proposed methodology, a subsequent section presents the results of the experimental analysis and finally Section 5 concludes the work.

Proposed methodology
In this work, as part of the initial pre-processing procedure, a radiograph image is subjected to histogram equalization, leading to an even illumination effect on the image $I$. Let $r$ indicate the intensity in a non-uniformly illuminated image and $s$ denote the transformed gray level of the uniformly illuminated image obtained after application of the transformation $H$, as given in Eq. (1).

$$s = H(r) \quad (1)$$

Specifically, the transformation with respect to a gray level is given by (2).

$$I(s) = H(I(r)) \quad (2)$$

where $s$ is subject to the integral function of the probability density $p_r$ of gray levels in an image, as given in (3).

$$s = \int_0^r p_r(w)\,dw \quad (3)$$

The equalized image is then smoothed with a Gaussian filter, as given in (4).

$$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} \quad (4)$$

where $x$ is the distance from the origin along the horizontal axis, $y$ is the distance from the origin along the vertical axis, $\sigma$ is the standard deviation of the Gaussian distribution and $g(x, y)$ is the smoothened image obtained using the Gaussian smoothing filter. Subsequently, the smoothened image $g(x, y)$ is manipulated in terms of its bit planes to extract the region of interest from the enhanced wrist radiographs.
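The two pre-processing steps described above can be illustrated with a minimal NumPy sketch. It assumes an 8-bit grayscale input and uses the discrete cumulative histogram as a stand-in for the integral of the gray-level density; the function names, the kernel radius and the default value of sigma are hypothetical choices for illustration, not values reported in this work.

```python
import numpy as np

def equalize_histogram(image: np.ndarray, levels: int = 256) -> np.ndarray:
    """Map gray levels through the normalized cumulative histogram."""
    hist = np.bincount(image.ravel(), minlength=levels)
    cdf = np.cumsum(hist) / image.size            # discrete integral of the density
    lut = np.round((levels - 1) * cdf).astype(np.uint8)
    return lut[image]                             # look up the new level per pixel

def gaussian_kernel(sigma: float, radius: int) -> np.ndarray:
    """Sampled 2-D Gaussian, normalized so that its weights sum to 1."""
    ax = np.arange(-radius, radius + 1)
    x, y = np.meshgrid(ax, ax)
    kernel = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_smooth(image: np.ndarray, sigma: float = 1.0, radius: int = 2) -> np.ndarray:
    """Smooth an 8-bit image with the sampled Gaussian (edge pixels replicated)."""
    kernel = gaussian_kernel(sigma, radius)
    rows, cols = image.shape
    padded = np.pad(image.astype(np.float64), radius, mode="edge")
    out = np.zeros((rows, cols), dtype=np.float64)
    size = 2 * radius + 1
    for dy in range(size):                        # accumulate shifted, weighted copies
        for dx in range(size):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return np.clip(np.round(out), 0, 255).astype(np.uint8)
```

A typical call chain under these assumptions would be `g = gaussian_smooth(equalize_histogram(radiograph))`, where `radiograph` is a 2-D `uint8` array.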

Region of interest extraction-bit plane slicing
Restricting the analysis to the specific regions of interest in the problem domain increases the reliability of a computerized diagnosis system. Performing manipulations at the bit level further reduces the computational burden involved in processing the images. In this work, the region of interest is extracted from the smoothened image using bit plane slicing. The number of bit planes in an image depends on the bit depth of its pixels, and the planes are categorized into higher order and lower order bit planes. High frequency content is defined in the higher order bit planes and low frequency content is defined in the lower order bit planes. The information in the bit planes is employed for extraction of the wrist region in the images.
In this work, 8-bit images are subjected to bit plane slicing, and masks are generated by considering different combinations of higher order bit planes. The masks $M_1$ to $M_4$ are generated on the basis of (5) and (8).
The masks $M_1$ to $M_4$ are employed as reference images for extracting the region of interest from the grayscale counterparts of the radiograph images. The optimal masks within $M_1$ to $M_4$ that lead to efficient performance of the segmentation technique are analyzed subsequently in the experimental analysis section.
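As an illustration of bit plane slicing on an 8-bit image, the sketch below extracts individual planes and merges selected higher order planes into a binary mask. The logical OR used to combine planes, the choice of planes 7 and 8 and all function names are assumptions made for this example; the exact combinations that define the masks in this work are not reproduced here.

```python
import numpy as np

def bit_plane(image: np.ndarray, plane: int) -> np.ndarray:
    """Return bit plane `plane` as 0/1 values (1 = least, 8 = most significant)."""
    return (image >> (plane - 1)) & 1

def combine_planes(image: np.ndarray, planes=(7, 8)) -> np.ndarray:
    """Assumed combination rule: a pixel is kept if any selected plane is set."""
    mask = np.zeros(image.shape, dtype=np.uint8)
    for p in planes:
        mask |= bit_plane(image, p).astype(np.uint8)
    return mask

def apply_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep the gray levels inside the mask and zero out everything else."""
    return image * mask

image = np.array([[0, 63], [64, 200]], dtype=np.uint8)
mask = combine_planes(image, planes=(7, 8))
roi = apply_mask(image, mask)
```

With this rule, any pixel whose gray level is 64 or higher falls inside the mask, so the mask acts as a threshold expressed through bit planes.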

Feature extraction and classification-convolutional neural networks (CNN)
Masks of the wrist region generated from the preprocessed radiographs are used as inputs to the CNN, which automates feature extraction and classification of the wrist radiographs in two levels, as indicated in Fig. 1.

Prediction of Gender-GeNet
GeNet is a pre-trained model proposed in this work, trained with masks generated from the pre-processed radiographs using the bit plane based segmentation technique. The masks used to train GeNet comprise about 12559 images of both males (6833) and females (5726), covering age groups ranging from 1 to 17 years. Fig. 2(a) and Fig. 2(b) depict instances of the masks adopted for training GeNet for automatic feature extraction and recognition.

Architecture of GeNet
A convolutional neural network is a deep feed forward neural network in which conventional classifier models such as a support vector machine, a regression network or any other fully connected network are integrated with multiple levels of convolution, with max pooling as one of the core operations. A typical CNN architecture is built of an input layer, multiple convolution layers each followed by a RELU + max pooling layer, hidden layers of feature vectors extracted through the convolution process, a fully connected layer where conventional classifiers are integrated, and an output layer, as shown in Fig. 3. A CNN with numerous layers performs better than other perceptron models of artificial neural networks, which are computationally intensive due to the dense interconnections from layer to layer. Unlike perceptron models, CNNs can provide better performance due to their feed forward design and smaller number of layer to layer connections, which reduces the computation of weights in the hidden layers.

Figure 3: Architecture of GeNet
CNNs, which are widely used for image and data analysis tasks, are a form of artificial neural network composed of a set of hidden layers called convolution layers and of non-hidden layers. A convolution layer accepts input from the input layer, transforms it at the nodes of the current layer and forwards the output to the next layer. Patterns are detected in the process of convolution, which requires specifying the number of filters with which the convolution is carried out along with the size of the filters. The number of filters is usually designated as the net width, and the number of channels to be analyzed with respect to every pixel is the net depth. Filters are instrumental in detecting a variety of patterns including edges, corners, circles, texture and other geometrical structures that are inherent in the image. In GeNet, two hidden layers of convolution are integrated, with 32 filters in level 1 and 64 filters in level 2 of convolution. The deeper the network is, the wider the variety of patterns the CNN is able to detect in the images, leading to higher degrees of robustness of the deep learning system. Each filter plays a crucial role in detecting a particular type of pattern or object available in the image. The convolution layer of GeNet is defined with 32 filters, each of size 5×5 and initialized with random numbers, as indicated in Fig. 4. The origin of the filter is overlaid on each pixel, encompassing its 5×5 neighborhood in the image. Each filter is convolved with the 5×5 neighborhood of every pixel in the image until all pixels are exhausted. The convolution process provides an abstract representation of the image patterns in the form of dot products: every 5×5 neighborhood of the image is replaced by one pixel in the output. The output of the convolution layer is thus the dot product representation of the input image.
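The dot product view of convolution described above can be sketched in NumPy as follows. The example assumes a single-channel input, no padding and a stride of 1, and it computes the sliding dot product without flipping the filter, as deep learning libraries usually do; the input size of 12×12 and the random seed are arbitrary choices for illustration.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Slide each k x k filter over the image and store one dot product per position.

    image:   (H, W) single-channel input
    filters: (F, k, k) bank of F filters
    returns: (F, H - k + 1, W - k + 1) feature maps
    """
    n_filters, k, _ = filters.shape
    h, w = image.shape
    out = np.zeros((n_filters, h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = image[i:i + k, j:j + k]
            # Dot product of the patch with every filter at once.
            out[:, i, j] = np.tensordot(filters, patch, axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(0)
filters = rng.standard_normal((32, 5, 5))   # 32 randomly initialized 5x5 filters
image = rng.random((12, 12))                # hypothetical input patch
feature_maps = conv2d_valid(image, filters)
print(feature_maps.shape)                   # (32, 8, 8)
```

Each of the 32 output maps holds one value per 5×5 neighborhood, which is why the spatial size shrinks from 12 to 8 when no padding is applied.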
The feature visualization of the dot product representation of the input image after level 1 convolution is depicted in Fig. 6. Each filter emphasizes specific gradient details in the input image, which together comprise the complete gradient patterns. All the detected patterns are forwarded to the max pooling layer for the subsequent level of abstraction. Before the data is fed to the max pool layer, the output of the convolution layer is passed through the RELU (REctified Linear Unit), which is an integral part of the max pool layer. RELU greatly improves the speed of training compared with saturating nonlinearities such as the tanh function, because its gradient is trivial to compute. Let $x$ be the input to RELU; the gradient is rectified to 0 or 1 depending on the sign of $x$, and the response produced by RELU is bound by (9).

$$f(x) = \max(0, x) \quad (9)$$
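The rectification and its gradient can be written in a few lines of NumPy. This is a minimal sketch; the function names are illustrative, and the gradient at exactly zero is set to 0 by convention.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Pass positive responses through unchanged and clamp the rest to zero."""
    return np.maximum(0.0, x)

def relu_grad(x: np.ndarray) -> np.ndarray:
    """Gradient of RELU: 1 where the input is positive, 0 elsewhere."""
    return (x > 0).astype(np.float64)

responses = np.array([-2.0, 0.0, 3.0])
print(relu(responses))       # [0. 0. 3.]
print(relu_grad(responses))  # [0. 0. 1.]
```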
The response produced by RELU may increase the nonlinearity in the output generated by the convolution layer, and its overall effect on the output image appears as abruptly varying gray level discontinuities. Subsequently, the response produced by RELU is fed to the max pool layer, which down samples the data to provide higher level feature maps of the image, leading to dimensionality reduction. In GeNet, the max pool layer adopts a kernel of dimension 4×4 with a stride length of 2, indicating the step taken over the input in the x and y directions.
Consider an image $f$ of dimension $m \times n$ with a net depth of $d$; after down sampling using the max pool function it is reduced to $\frac{m}{2} \times \frac{n}{2} \times d$, as represented in Fig. 5.
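The down sampling step can be sketched in NumPy as below. The 4×4 window and stride of 2 follow the text; the one-pixel border padding is an assumption added here so that the output is exactly half the input size in each spatial direction, since an unpadded 4×4 window with stride 2 would yield one row and one column fewer. With these parameters, adjacent windows share two rows or columns.

```python
import numpy as np

def max_pool(x: np.ndarray, size: int = 4, stride: int = 2, pad: int = 1) -> np.ndarray:
    """Max pooling over the two spatial axes of an (H, W, D) feature volume."""
    x = np.asarray(x, dtype=np.float64)
    # Pad with -inf so that border padding can never win the maximum.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)),
                mode="constant", constant_values=-np.inf)
    h, w, d = xp.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w, d))
    for i in range(out_h):
        for j in range(out_w):
            window = xp[i * stride:i * stride + size,
                        j * stride:j * stride + size, :]
            out[i, j, :] = window.max(axis=(0, 1))   # per-channel maximum
    return out

rng = np.random.default_rng(0)
volume = rng.random((8, 8, 3))      # hypothetical m x n x d input
pooled = max_pool(volume)
print(pooled.shape)                 # (4, 4, 3)
```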
Figure 5: Protocol of max pool function (down sampling from $m \times n \times d$ to $\frac{m}{2} \times \frac{n}{2} \times d$)
The higher level feature map obtained from the max pool layer undergoes another level of abstraction through the hidden layer at convolution level 2, followed by the RELU and max pooling functions. Finally, the fully connected layer employs a conventional classifier to classify the feature maps generated from the hidden layers. The probabilities of the predicted outcomes with respect to the number of classes are normalized with the help of the softmax function before they are passed to the output layer.
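The softmax normalization mentioned above can be sketched as follows. The two scores are hypothetical outputs of the fully connected layer for the two classes; subtracting the maximum before exponentiation is a standard numerical safeguard and does not change the result.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # avoid overflow
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 0.5])       # hypothetical scores for the two classes
probabilities = softmax(scores)
predicted_class = int(np.argmax(probabilities))
```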

Experimental analysis
In this work, the performance of the implemented algorithms is analyzed individually for the bit plane slicing based segmentation and for GeNet. The performance analysis is conducted on the datasets used in the Pediatric Bone Age Challenge, contributed by Stanford University, the University of Colorado and the University of California-Los Angeles, and made available by the Radiological Society of North America (RSNA) Radiology Informatics Committee (RIC) for pediatric bone age prediction. The bone age is represented in months in the datasets considered, which include male and female children of age groups from one year to 19 years. The results of radiograph enhancement and the outcome of the bit plane slicing based segmentation technique, along with the outcome of mask generation for the combination of mask 1 and mask 2 clusters, are presented in Fig. 6. The performance of the bit plane slicing segmentation technique is evaluated using the Jaccard similarity index $Jacc_{Ind}$, the Dice coefficient $Dice_{coe}$, the false positive ratio $fp_{rate}$ and the false negative ratio $fn_{rate}$.
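The four segmentation metrics can be computed from a predicted mask and its ground truth as sketched below. The Jaccard index and Dice coefficient follow their standard definitions; the false positive and false negative ratios are taken here as FP / (FP + TN) and FN / (FN + TP), which is an assumption because the exact definitions are not stated. The sketch also assumes that both masks contain foreground and background pixels, so that no denominator is zero.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Compare a binary predicted mask against a binary ground-truth mask."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)       # foreground in both masks
    fp = np.sum(pred & ~truth)      # predicted foreground, actually background
    fn = np.sum(~pred & truth)      # missed foreground
    tn = np.sum(~pred & ~truth)     # background in both masks
    return {
        "jaccard": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "fp_rate": fp / (fp + tn),  # assumed definition
        "fn_rate": fn / (fn + tp),  # assumed definition
    }

truth = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [1, 0]])
metrics = segmentation_metrics(pred, truth)
```

For this toy pair of masks there is one pixel in each of the four categories, so the Jaccard index is 1/3 and the other three values are 0.5.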
Tab. 1 presents the performance metrics of the bit plane slicing segmentation technique for a few instances of masks extracted using combinations of the higher order bit planes 5, 6, 7 and 8, with their ground truths as depicted in Fig. 7. In Fig. 8(a) and Fig. 8(b), the counterparts of the presented images indicate the ground truth and the corresponding mask image generated using the bit plane slicing segmentation algorithm. Ground truth masks are generated for over 1000 images, and the corresponding masks are generated using the segmentation algorithm, for which the average Jaccard index is found to be 0.9255. The performance of GeNet is depicted in Fig. 9(a) and Fig. 9(b), indicating the prediction rate and the convergence of loss between actual and mapped test instances over about 312 epochs.

Conclusions
In a nutshell, this work performs gender assessment along with chronological age prediction for individuals of age groups between one and 18 years. The algorithmic contributions are made in terms of pre-processing and mask generation for wrist radiographs based on bit plane slices of the radiograph image. A fully automated technique, the GeNet architecture, is devised for classification of radiographs by gender using deep learning networks. The overall accuracy of the developed method appears to fall above 85%. The challenging part of the work lies in classifying the wrist radiographs into a large number of classes, compared with other works in the same area that focus on only 2 to 5 classes. The proposed algorithms are robust even under inconsistent illumination conditions of the images, indicating the uniqueness of the proposed work.