BE-FNet: 3D Bounding Box Estimation Feature Pyramid Network for Accurate and Efficient Maxillary Sinus Segmentation



Introduction
The maxillary sinus is an important part of the body with multiple functions, including olfaction and filtering, heating, and humidifying the inhaled air. People who suffer from nasal function impairment may have a reduced quality of life [1].
In the last few years, functional endoscopic sinus surgery (FESS) has been established as the state-of-the-art technique for the treatment of endonasal pathologies. Recently, robot-assisted FESS has begun to replace the traditional procedure, which is inconvenient for the surgeon. To define the workspace exactly, knowledge of the anatomical structure of the maxillary sinus is required. Manual segmentation takes about 900 minutes for one patient's CT scans, which is infeasible for daily practice [2]. Consequently, automatic segmentation approaches with high accuracy are needed. However, there are some difficulties in practice. First, the maxillary sinus shows a high rate of structural variation in location, size, and shape. In addition, lesions frequently appear in the cavity with different intensities, scales, and positions, which leads to extremely heterogeneous textures in the volume of interest (VoI). Figure 1 illustrates some distinct cases that indicate the challenges a general segmentation method encounters.
Medical image analysis has long played a crucial role in clinical practice, and related techniques, including partial differential equation (PDE), machine learning, and deep learning models, have developed rapidly to promote computer-aided applications in clinics. For example, Wei et al. [3] proposed an adaptive variational PDE model for image reconstruction. Compared with state-of-the-art models, their work obtained more accurate reconstruction results and increased the probability of a correct diagnosis, avoiding follow-on examinations. Since the large scale and complexity of magnetic resonance imaging (MRI) data require effective preprocessing approaches, Ke et al. [4] presented an adaptive independent subspace analysis (AISA) method to discover meaningful electroencephalogram activity in MRI scans to support diagnostics. In experiments, their model achieved an accuracy of 94.7% and an F-score of 0.9356 on a real autism spectrum disorder dataset. Połap and Woźniak [5] proposed a model of bacteria recognition based on a composition of region covariance with convolutional neural networks (CNNs). The recognition process is divided into two stages. In the first stage, an input microscopy image is segmented using the region covariance model. Then, these segments are forwarded to a CNN for recognition of visible bacteria strains. Experimental results show the high potential of the proposed methodology. Połap and Woźniak [6] presented a segmentation technique based on medical image processing methods and a swarm algorithm for lung segmentation in X-ray images for subsequent diagnosis. The swarm methodology was used to extract the portions of interest, with a convolutional neural network as a detector. Khan et al. [7] proposed a novel deep learning framework for the detection and classification of breast cancer in breast cytology images using transfer learning. This technique facilitates detection and classification of breast cancer in the early stages of its development, which may allow patients to receive proper treatment. With the progress of these computer-aided techniques, people enjoy increasingly efficient health services and protection that improve their quality of life.
3D convolutional kernels are competitive at extracting discriminative features along the X, Y, and Z directions for 3D classification, detection, and segmentation. However, 3D FCNs generally incur an explosion in the number of trainable parameters and the related computational resources. Previous works discuss technical limitations when employing a 3D CNN on medical imaging data [31-33]. In order to incorporate 3D contextual information, multiple works optimize the FCN baseline with 2D [18], 2.5D [20,33], or small 3D patches [34,35]. Although they relieve the burden of 3D kernels, these methods pay more attention to irrelevant background, waste computation resources, and cause a large number of undersegmentations or oversegmentations. To resolve these problems, region of interest (RoI) localization modules have been designed as a separate part of the workflow, such as atlas-based image registration with prior knowledge [36,37]. However, for maxillary sinus segmentation, the inhomogeneous intensity distribution makes registration performance poor and the workflow slow. RoI detection based on deep learning then showed great potential [25,38-42]. Some works [38,39,42] extract region proposals using external modules such as selective search or multiscale combinatorial grouping (MCG) [43], whose drawbacks include a time-consuming search for the best candidates with limited features considered. Later works introduce an additional well-trained low-dimensional segmentation model for slice-by-slice RoI localization [25,44]. Although these approaches reduce computation costs dramatically, they are limited in extracting features between adjacent slices and tend to produce more false positives when localizing the 3D target.
Multitask models that combine bounding box (bbox) localization and segmentation have emerged as a promising development, such as Mask R-CNN [34] and Multitask Cascaded Convolutional Networks (MTCNN) [45]. These methods analyze significant features within the bbox and save computation resources to achieve better speed and accuracy. For accurate localization, the region proposal network (RPN) and RoIAlign are introduced to detect and refine the bbox. Then, multitask networks can perform classification or segmentation within the bbox to obtain better outcomes. Among state-of-the-art methods, the feature pyramid network (FPN) [46] is competitive in multilevel object detection because it fuses different levels of feature maps to preserve finer details. In Mask R-CNN [34], the authors extend the model with FPN. By extracting RoIs on each feature map level with the RPN, segmentation with an intra-RoI FCN achieves more accurate results. However, directly extending Mask R-CNN with FPN to 3D encounters some difficulties. First, a 3D FPN with a symmetric encoder-decoder construction consumes excessive GPU memory, especially for high-resolution datasets. Meanwhile, the 3D RPN produces many 3D anchors of different sizes, which also incurs a great cost in computation resources. In addition, the distributions of ground truth and background are always imbalanced in medical images. The many different 3D anchors overfit the object, so the 3D bbox cannot be localized stably. Moreover, in Mask R-CNN, the RoIAlign module runs bilinear interpolation to resample the feature tensors in the anchors to fixed dimensions. This mechanism loses fine details, which is challenging for low-resolution medical images.
To address these issues, and inspired by Mask R-CNN with ResNet-FPN [34] and the residual attention network (RAN) [47], we propose a novel multitask framework for segmentation with 3D bounding box estimation, named the 3D bounding box estimation feature pyramid network (BE-FNet), which is designed to extract the 3D volumetric maxillary sinus from CT scans in an end-to-end manner. Compared with traditional 3D segmentation models, our proposed model offers better accuracy and computational efficiency as a result of crucial components in its deep convolutional neural network architecture. Ablation studies on 50 collected CT scans demonstrated the superiority of our proposed model, with the following main contributions:
(1) We propose a deep multitask neural network with 3D bounding box estimation and in-region segmentation branches. BE-FNet has a symmetric encoder-decoder architecture with shared parameters. The image encoder is responsible for bounding box estimation and the decoder for in-region 3D segmentation, similar to U-Net. Because it explores the target in a significantly reduced space, our proposed model remarkably reduces the computation cost compared with traditional 3D semantic segmentation neural networks.
(2) To avoid the overfitting problems of state-of-the-art multitask models, we design an overestimation strategy that generates a reasonable 3D bounding box without any redundant memory cost. In addition, we introduce residual dense blocks as the backbone to enhance the flow of residuals, substantially increasing the depth of the neural network. Moreover, we design an attention excitation mechanism to improve salient detection in the bounding box estimation process, which adds no computational burden to the 3D deep neural network. In particular, the multilevel feature fusion structure of the pyramid network strengthens the identification of global and local discriminative features in foreground and background, achieving more advanced segmentation results in space.
(3) To resolve the problems of class imbalance and the blurred boundary of the sinus cavity, we define a hybrid loss function that combines Dice loss and contour-aware loss. Besides, a multiresolution model ensemble strategy is introduced to boost segmentation robustness, generating more reliable results and suppressing false positives.
(4) Our model does not depend on any pretrained model or on commonly used postprocessing techniques such as a 3D conditional random field (CRF). The generalization of the proposed approach is demonstrated through extensive experiments. Our model not only extracts an accurate maxillary sinus volume but also achieves competitive performance compared with state-of-the-art methods in related research areas, so it can be generalized to other applications and is a promising technique for the future.
Mathematical Problems in Engineering
To the best of our knowledge, this is the first use of an attention excitation mechanism to locate and estimate the 3D bounding box for maxillary sinus segmentation with a multitask neural network, yielding a more general segmentation solution than the methods available to date. The entire model is built on an FPN backbone with residual dense blocks and hierarchical feature fusion. These innovations ensure that our model attends to the more significant VoI, reducing computing resource costs and providing more advanced segmentation results.
Our paper is organized as follows. In Section 2, we describe our model in detail. In Section 3, we report the experimental results compared with state-of-the-art methods. Section 4 further discusses insights into and issues of the proposed method. The conclusions are drawn in Section 5.

Overview of Our Proposed Architecture.
Our multitask neural network architecture for segmentation is depicted in Figure 2. The proposed architecture consists of three main stages, which are responsible for preprocessing, bounding box estimation, and in-region segmentation, respectively. First, in order to reduce the computational cost spent on redundant CT image background, we adopt the Otsu segmentation algorithm to coarsely extract RoIs based on connectivity analysis [48] in each slice. The backbone of the network is then divided into image encoder and decoder branches. The image encoder focuses on salient detection to estimate a 3D bounding box with a fixed size. Through cropping feature fusion between the encoder and the decoder, the image decoder branch performs in-region segmentation of the maxillary sinus in an end-to-end manner.

Data Preprocessing.
For a medical image, Hounsfield units (HU) are a measurement of relative densities determined by CT. Normally, the HU values range from −1000 to 1000. No smoothing method is adopted in our work. Since the inhomogeneous texture of a maxillary sinus cavity with lesions is a significant characteristic for 3D segmentation, noise suppression is unnecessary and could destroy patterns that are useful for segmentation. In particular, deep neural networks are capable of learning discriminative features in the background or foreground of the original medical images, which makes them adaptive to complicated conditions. Therefore, for the generalization of our proposed model, we kept the original intensity range without any intensity preprocessing to avoid possible artifacts from image resampling, preserving the original details for segmentation. In CT scans, some image content belongs to the background, wastes computation resources, and can be ignored. Therefore, we use Otsu segmentation with connectivity analysis to quickly extract the foreground and its coarse bounding box. Then, the cropped images are augmented by scaling, flipping, intensity jittering, and translation and fed into BE-FNet. A few examples comparing original and cropped images are illustrated in Figure 3.
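The coarse cropping step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the helper names `otsu_threshold` and `coarse_bbox` are hypothetical, the histogram uses an assumed 256 bins, and the connectivity analysis mentioned in the text is omitted, so the box simply spans every above-threshold pixel in the slice.

```python
import numpy as np

def otsu_threshold(image, bins=256):
    """Threshold that maximizes the between-class variance (Otsu's method)."""
    hist, edges = np.histogram(image, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    weight_bg = np.cumsum(hist).astype(float)      # pixels at or below each bin
    weight_fg = weight_bg[-1] - weight_bg          # pixels above each bin
    cum_sum = np.cumsum(hist * centers)
    mean_bg = cum_sum / np.maximum(weight_bg, 1)
    mean_fg = (cum_sum[-1] - cum_sum) / np.maximum(weight_fg, 1)
    between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
    return centers[np.argmax(between)]

def coarse_bbox(slice_hu):
    """Half-open (row_start, row_stop, col_start, col_stop) of the foreground."""
    mask = slice_hu > otsu_threshold(slice_hu)
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1

# Synthetic slice (assumption): air at -1000 HU with one soft-tissue block.
slice_hu = np.full((64, 64), -1000.0)
slice_hu[10:40, 20:50] = 40.0
r0, r1, c0, c1 = coarse_bbox(slice_hu)
cropped = slice_hu[r0:r1, c0:c1]
```

Applying the same box to every slice of a scan, or taking the union of per-slice boxes, would give a coarse 3D crop; which of the two the original pipeline uses is not stated in the text.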

BE-FNet Architecture-Building Deeper Network.
To implement fully automatic segmentation of the maxillary sinus in CT scans, we design a hybrid neural network that unifies 3D bounding box estimation and in-region segmentation, with features and weights shared across the tasks. The baseline of BE-FNet borrows from FPN [46], where feature maps of different levels are fused to promote discriminative feature extraction. The image encoder branch is responsible for attending to the target in order to estimate the 3D bounding box. Within the significant VoI, the decoder performs in-region segmentation of the maxillary sinus, benefiting from shared features from different levels of the pyramid network. Deeper networks have greater discriminative power due to the additional nonlinearities and the better quality of local optima [49]. However, convolutions with 3D kernels are computationally expensive in comparison to their 2D variants, which hampers the addition of more layers. Moreover, 3D architectures have a much larger number of trainable parameters, with each layer adding $C_l \, C_{l-1} \prod_{i \in \{x,y,z\}} k_l^{(i)}$ weights to the model, where $C_l$ is the number of feature maps in layer $l$ and $k_l^{(i)}$, $i \in \{x,y,z\}$, is the size of its kernel in the respective spatial dimension. Overall, this makes the network increasingly prone to overfitting and dramatically increases GPU memory consumption.
In order to set up a deeper 3D architecture, we use only small $3^3$ kernels, which are faster to convolve with and contain fewer parameters. This design approach was previously found beneficial for classification of natural images, but its effect is even more drastic on 3D networks [22]. Compared with other size choices such as $5^3$, the $3^3$ kernels greatly reduce the number of element-wise operations and trainable parameters. Therefore, deeper 3D networks that use smaller kernels handle 3D medical image segmentation more efficiently while maintaining acceptable accuracy. However, deeper networks are more difficult to train, because the forward or backward propagated signal may explode or vanish if care is not taken to retain its variance [50]. This phenomenon is especially pronounced in 3D neural networks. Consequently, we introduce a variant of the residual dense block [51] to build the backbone of the BE-FNet 3D image encoder-decoder for extracting significant feature maps. The residual dense block has more depth concatenations and fewer parameters, which provides strong discriminative capability and a more efficient end-to-end training process. Besides, to avoid the problem of "internal covariate shift" [52], which is more serious in 3D networks, we apply Batch Normalization (BN) to all hidden layers [52], which normalizes the feature map (FM) activations at every optimization step in order to better preserve the training signal. The structure of the residual dense block is described in Figure 4.

Figure 2: Overview of the proposed multitask network for maxillary sinus segmentation. The network consists of three parts: preprocessing, image encoder for bounding box estimation, and decoder for in-region segmentation. Preprocessing extracts coarse RoIs that are passed to the image encoder to estimate the 3D bounding box attributes. Then, cropped bounding boxes are fused to the image decoder following the FPN architecture for sinus segmentation, with a softmax function at the end.
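To make the kernel-size argument concrete, the per-layer weight count given earlier in this section (output channels times input channels times the product of the kernel sizes, biases ignored) can be evaluated directly. The channel counts below are illustrative assumptions, not values taken from Table 1.

```python
from math import prod

def conv3d_weights(c_in, c_out, kernel):
    """Weights in one 3D conv layer: C_l * C_{l-1} * k_x * k_y * k_z (no biases)."""
    return c_out * c_in * prod(kernel)

# Assumed channel counts, chosen only for illustration.
small = conv3d_weights(32, 64, (3, 3, 3))   # 3^3 kernel
large = conv3d_weights(32, 64, (5, 5, 5))   # 5^3 kernel
ratio = large / small                        # 125 / 27, independent of channels
```

For any channel configuration, a $5^3$ layer carries 125/27 (about 4.6) times as many weights as a $3^3$ layer, which is the reduction the paragraph above refers to.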

3D Bounding Box Estimation.
This strategy includes salient object detection and size estimation of the 3D bounding box within the image encoder branch. In semantic segmentation, Share-net is used to set up an attention probability map that assists evaluation of the foreground, achieving excellent performance [53]. More recently, many works in medical image analysis have adopted this idea to localize the target effectively [54,55]. The residual attention network (RAN) integrates residual blocks into the architecture of the attention branch to enhance saliency in backgrounds [47]. However, directly converting RAN to a 3D version costs massive computation resources in the soft mask branch (SMB). Inspired by SE-Net [56], instead of element-wise operations between the SMB and the trunk branch (TB), we apply an attention excitation (AE) block to strengthen significant 3D feature expression, which forms the attention probability map for 3D bounding box localization. Figure 5 describes the structure of the AE block in detail. First, the AE block receives the z × h × w × c output of the residual dense block. In the upper branch, the 3D feature maps are squeezed to z × c numbers with global pooling. For example, feature maps of size z × h × w × c produce z × c values, which are concatenated into a vector and fed to a fully connected network, where the hidden layer first squeezes the input size to (1/8) × z and the output layer restores it to z. The final sigmoid function ensures that the trainable weights fall in [0, 1]. At the end of both branches, the features are fused by element-wise multiplication. With the AE block, the trainable attention weights enhance the salient features of the foreground, which facilitates effective localization of the 3D bounding box and remarkably reduces false positives in maxillary sinus segmentation.
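A rough NumPy sketch of the squeeze, excitation, and multiplication steps of the AE block follows. Several details are assumptions rather than facts from the paper: average pooling is used for the global pooling, a ReLU is placed in the hidden layer (as in SE-Net), the same fully connected weights are applied to every channel, and random matrices stand in for trained weights. The function name `attention_excitation` is hypothetical.

```python
import numpy as np

def attention_excitation(features, w1, w2):
    """features: (z, h, w, c); w1: (z, z // 8); w2: (z // 8, z)."""
    squeezed = features.mean(axis=(1, 2))              # global pooling -> (z, c)
    hidden = np.maximum(squeezed.T @ w1, 0.0)          # squeeze to z/8 -> (c, z/8)
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))     # sigmoid, restore -> (c, z)
    # Broadcast the per-slice, per-channel weights over h and w.
    return features * weights.T[:, None, None, :]

rng = np.random.default_rng(0)
z, h, w, c = 16, 4, 4, 3
features = rng.normal(size=(z, h, w, c))
w1 = rng.normal(size=(z, z // 8))      # stand-in for trained hidden-layer weights
w2 = rng.normal(size=(z // 8, z))      # stand-in for trained output-layer weights
excited = attention_excitation(features, w1, w2)
```

Because the sigmoid output lies in (0, 1), the block rescales features without changing their shape, which matches the statement that the AE block leaves the scales of the feature maps unchanged.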
Traditional methods for size estimation, such as RPN, generally set up a trainable regression network to predict the geometric attributes of the bounding box. However, this approach leads to an overwhelming computing cost for automatic anchor generation and to overfitting, both of which are unacceptable in a deeper 3D neural network. In this paper, we extract the largest salient area with connectivity analysis. Based on prior knowledge of the maxillary sinus volume, we estimate a fixed and overdesigned size d × h × w, such as 100 × 150 × 150, in the original cube. In the other layers of the image encoder, the corresponding sizes are calculated easily according to the upsampling or downsampling ratios.
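The fixed-size box and its per-level sizes can be illustrated with a few lines of plain Python. The sketch assumes the box is centered on the salient region and shifted to stay inside the volume, and that sizes are scaled by integer division; neither detail is specified in the text, and the helper names `clamp_box` and `scale_size` are hypothetical.

```python
def clamp_box(center, size, volume_shape):
    """Per-axis half-open (start, stop) of a fixed-size box kept inside the volume."""
    box = []
    for c, s, n in zip(center, size, volume_shape):
        s = min(s, n)                                   # box cannot exceed the volume
        start = min(max(c - s // 2, 0), n - s)          # shift back inside if needed
        box.append((start, start + s))
    return box

def scale_size(size, factor):
    """Box size at a pyramid level downsampled by `factor` along every axis."""
    return tuple(max(1, s // factor) for s in size)

box_size = (100, 150, 150)                  # d x h x w example from the text
volume_shape = (120, 512, 512)              # assumed scan shape
box = clamp_box((10, 256, 480), box_size, volume_shape)
half_size = scale_size(box_size, 2)
```

Since the size is fixed in advance, only the location has to be predicted, which is what removes the anchor generation and box regression cost discussed above.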
The image encoder branch handles the entire training process for salient object detection. To overcome the class imbalance problem in medical images, we introduce the Dice loss and a weighted cross-entropy loss over pixels for RoI attention. Its advantages include being free of hyperparameters and supporting the detection of weak saliency. The Dice loss, equation (1), is computed from the predicted mask p and the ground truth g, with sums taken over the N voxels of the predicted volume; ϵ is a smoothness term that prevents division by zero. In the optimization stage, the Dice loss is minimized by gradient descent using its derivative, equation (2). Equation (3) gives the hybrid loss for 3D salient detection,

$$L_{roi} = \lambda_1 L_{wce} + \lambda_2 L_d + \lambda_3 \|W\|_2^2, \quad (3)$$

where $L_{wce}$ denotes the weighted cross-entropy loss and $L_d$ the Dice loss, $\|W\|_2^2$ is the squared $\ell_2$ norm of the subnetwork weights, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ weight the three terms in $L_{roi}$.
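For readers who want something executable, the sketch below combines a soft Dice loss, a weighted cross-entropy, and an L2 penalty into one RoI loss. It is written under explicit assumptions: the Dice form used here (unsquared sums in the denominator, with the smoothness term added to numerator and denominator) is one common variant and may differ from the authors' equation (1), and the foreground weight and the three coefficients are placeholder values, not those used in the paper.

```python
import numpy as np

def dice_loss(p, g, eps=1e-5):
    """Soft Dice loss over all voxels (assumed variant, see lead-in)."""
    inter = np.sum(p * g)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(g) + eps)

def weighted_cross_entropy(p, g, w_fg=10.0, w_bg=1.0, eps=1e-7):
    """Binary cross-entropy with a larger weight on the rare foreground class."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(w_fg * g * np.log(p) + w_bg * (1.0 - g) * np.log(1.0 - p))

def roi_loss(p, g, weights, lambdas=(1.0, 1.0, 1e-4)):
    """lambda_1 * L_wce + lambda_2 * L_d + lambda_3 * ||W||_2^2 (placeholder lambdas)."""
    l2 = sum(float(np.sum(w ** 2)) for w in weights)
    return (lambdas[0] * weighted_cross_entropy(p, g)
            + lambdas[1] * dice_loss(p, g)
            + lambdas[2] * l2)

g = np.array([0.0, 0.0, 1.0, 1.0])               # toy ground truth
p_good = np.array([0.1, 0.1, 0.9, 0.9])          # prediction close to g
p_bad = np.array([0.9, 0.9, 0.1, 0.1])           # prediction far from g
net_weights = [np.ones((2, 2))]                  # stand-in for network parameters
```

In a real training loop these functions would be written with the autograd tensors of the chosen framework so that the gradient is obtained automatically.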

Cropping Fusion Layer.
Feature fusion with skip connections at different levels has helped convolutional neural networks acquire more advanced learning ability [18,46,57]. For better 3D maxillary sinus segmentation, we design the network with multilevel feature fusion layers as in FPN, where the estimated bounding box is cropped directly, without any resampling, and fused into the decoder branch. At every fusion node, the higher-level features are upsampled by a factor of two through the residual dense block and concatenated with the cropped lower-level features. Related details are illustrated in Figure 2. This mechanism reduces the size of the data that is fed forward, improving the efficiency of training and inference in BE-FNet and relieving the computing resource limitations that constrain deeper 3D neural networks.

In-Region Segmentation with Hybrid Loss.
Relying on the 3D bounding box estimation and the shared feature fusion, the decoder branch is constructed for in-region segmentation with two trainable tasks. One is responsible for the overall segmentation performance. The other pays more attention to identifying the blurred boundary. Consequently, at the end of BE-FNet, a hybrid loss function is designed to impose more constraints on the network:

$$L = \lambda_1 L_d + \lambda_2 L_c + \lambda_3 \|W\|_2^2, \quad (4)$$

where $L_d$ and $L_c$ denote the Dice loss and the contour-aware loss, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting coefficients, and $\|W\|_2^2$ is the regularization term of equation (4). The Dice loss discussed before maintains performance even though the foreground accounts for a relatively small portion of the volume. In addition, the object boundary plays a critical role in the segmentation task. In the maxillary sinus cavity in particular, interference caused by lesions makes some parts of the sinus boundary ambiguous and lacking in the information needed for feature extraction. Some examples are illustrated in Figure 6. To address this problem, we adopt the contour-aware loss of the deep contour-aware network (DCAN) to learn and predict the blurred boundary in medical images [26]. Both tasks are trained together in parallel and end to end.

Multiscale Pathway.
The final version of the proposed BE-FNet is built by extending the primary model with multiple resolution pathways whose architectures are completely identical: high, normal, and low resolution branches, named H-BE-FNet, N-BE-FNet, and L-BE-FNet, respectively. At the end of each network, we resample the 3D images to the original resolution and vote on the final segmentation based on the three predicted probability maps. The spacing between pixels along the z, y, and x axes of the acquired CT scans ranges from 0.5 × 0.35 × 0.35 mm to 0.625 × 0.39 × 0.39 mm in our dataset. We then resample the input images from the original spacing to 1.0 × 1.0 × 1.0 mm, 1.5 × 1.5 × 1.5 mm, and 2.0 × 2.0 × 2.0 mm for H-BE-FNet, N-BE-FNet, and L-BE-FNet, respectively.
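The multiresolution pathway can be sketched in two small steps: computing the grid size after resampling to a target spacing, and fusing the three probability maps once they are back at the original resolution. The voting rule is not specified in the text, so the sketch assumes a simple average of the three maps followed by a 0.5 threshold; the function names and the toy probabilities are hypothetical.

```python
import numpy as np

def resampled_shape(shape, spacing, target_spacing):
    """Voxel grid size after resampling from `spacing` to `target_spacing` (mm)."""
    return tuple(int(round(n * s / t))
                 for n, s, t in zip(shape, spacing, target_spacing))

def ensemble_vote(prob_maps, threshold=0.5):
    """Average the probability maps and threshold (assumed voting rule)."""
    return np.mean(np.stack(prob_maps), axis=0) >= threshold

# A 200-slice scan at the finest spacing reported for the dataset (z, y, x).
shape, spacing = (200, 512, 512), (0.5, 0.35, 0.35)
high = resampled_shape(shape, spacing, (1.0, 1.0, 1.0))
low = resampled_shape(shape, spacing, (2.0, 2.0, 2.0))

# Toy foreground probabilities for two voxels from the three branches.
p_high = np.array([0.9, 0.2])
p_normal = np.array([0.8, 0.4])
p_low = np.array([0.4, 0.6])
mask = ensemble_vote([p_high, p_normal, p_low])
```

Majority voting on the three thresholded masks would be an equally plausible reading of the text; averaging is used here only because it keeps the example short.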

Implementation Details.
The BE-FNet architecture was implemented using the PyTorch [58] and TensorFlow [59] libraries. All models were trained from scratch. The parameters of the network were initialized with random values and trained with backpropagation based on Adam [60], using an initial learning rate (LR) of 0.001, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. The learning rate was reduced by a factor of 0.1 if the network reached a plateau after 20 epochs. Five-fold cross-validation was conducted on the 50 scans. The detailed architecture of the BE-FNet network is shown in Table 1. In general, based on prior experience with the anatomical structure of the maxillary sinus, we preferred 3 × 3 × 3 and 1/2 × 1/2 × 1/2 for the 3D convolution kernel and max pooling, respectively. Preprocessing is responsible for extracting the coarse VoI and dropping redundant background. With the prediction of the well-trained LocR, we localized the target and estimated an overdesigned 3D bounding box in the RDBlock4 layer. For the bounding box sizes in RDBlock3 and RDBlock2, we upsampled them along the x, y, and z axes by a factor of two or four with trilinear interpolation in order to match the dimensions of the different pyramid levels for feature fusion. SegR and SegC represent two different tasks with the hybrid loss. The AE block denotes a mechanism that excites attention on the object as much as possible without changing the scales of the feature maps. In this multitask network, we first train the LocR branch for 3D bounding box estimation and then train the SegR and SegC branches for maxillary sinus segmentation in parallel.

Figure 4: Example of the residual dense block as the building module for our proposed network, which has the characteristics of both residual and dense blocks.
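The training protocol above (five-fold cross-validation on 50 scans and a learning rate cut on plateau) can be mimicked with a framework-independent sketch. It assumes contiguous, unshuffled folds and interprets the plateau rule as "multiply the learning rate by 0.1 after 20 consecutive epochs without improvement in validation loss"; both are assumptions, and `five_fold_splits` and `PlateauSchedule` are hypothetical names rather than library APIs.

```python
def five_fold_splits(n_scans=50, k=5):
    """Yield (train_ids, val_ids) for k contiguous folds (no shuffling assumed)."""
    ids = list(range(n_scans))
    fold = n_scans // k
    for i in range(k):
        val = ids[i * fold:(i + 1) * fold]
        train = ids[:i * fold] + ids[(i + 1) * fold:]
        yield train, val

class PlateauSchedule:
    """Cut the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=1e-3, factor=0.1, patience=20):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr

splits = list(five_fold_splits())
schedule = PlateauSchedule()
schedule.step(1.0)                                   # first epoch sets the best loss
history = [schedule.step(1.0) for _ in range(20)]    # 20 epochs with no improvement
```

Deep learning frameworks ship their own plateau schedulers and Adam implementations; the class above only mirrors the rule stated in the text so the numbers can be checked by hand.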

Dataset.
In our study, which was approved by an institutional review board for the restricted domain of our project, we used 50 CT volume scans (12.13 GB) containing the maxillary sinus, acquired with a Siemens SOMATOM Definition AS+ scanner, to evaluate the proposed multitask network BE-FNet. All of them have the same 512 × 512 in-plane resolution but different numbers of axial slices. The spacing between pixels along the z, y, and x axes of the acquired dataset ranges from 0.5 × 0.35 × 0.35 mm to 0.625 × 0.39 × 0.39 mm. The corresponding ground truth was provided manually by two experienced radiologists. Training and inference of our proposed model were run on two NVIDIA GTX 1080 Ti 11 GB GPUs, 32 GB of RAM, and an Intel i7-7700K CPU with 8 cores at 4.20 GHz. Notably, we did not apply any noise-suppression preprocessing to our dataset, in order to preserve original details and avoid possible artifacts in the training and inference process.

Evaluation Metrics.
Our segmentation method was evaluated using four quantitative metrics: Dice Similarity Coefficient (Dice) [61], Volumetric Overlap Error (VOE), Average Symmetric Surface Distance (ASD), and inference time on the GPU. We treat the maxillary sinus as the foreground and everything else as the background. The ground truth and predicted regions of a maxillary sinus are denoted as A and B, respectively. Dice is an important indicator of segmentation quality that measures the agreement between the segmentation result and the ground truth, with a higher value indicating a better result. Dice ∈ [0, 1], and for a perfect segmentation Dice = 1:

$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}.$$

The VOE is the nonoverlapping ratio of the segmentation result and the ground truth. It is also used to evaluate the precision of the results, with a lower value indicating a better result, as shown in the following equation:

$$\mathrm{VOE}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}.$$

ASD (in millimeters) evaluates the distance between the border voxels of the segmentation and the ground truth. For ASD, a lower value denotes a better segmentation. The metric is defined as follows:

$$\mathrm{ASD}(A, B) = \frac{1}{|S(A)| + |S(B)|} \left( \sum_{s_A \in S(A)} d(s_A, S(B)) + \sum_{s_B \in S(B)} d(s_B, S(A)) \right).$$

ASD is calculated based on the surface voxels. S(A) denotes the set of surface voxels of A. The shortest distance from a voxel v to the set S(A) is defined as

$$d(v, S(A)) = \min_{s_A \in S(A)} \|v - s_A\|,$$

where ‖·‖ is the Euclidean norm. For a perfect segmentation, this quantity is zero. Inference time is used to evaluate the computational cost and complexity of the BE-FNet model in the inference process.
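The three overlap and distance metrics can be checked on toy data with a short pure-Python sketch that uses the standard definitions of Dice, VOE, and ASD. Regions are represented as sets of (z, y, x) voxel indices, surface voxels are taken to be those with at least one missing 6-connected neighbor (an assumption; the paper does not state its surface definition), and the brute-force nearest-neighbor search is only suitable for small examples.

```python
from math import dist

NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def voe(a, b):
    return 1 - len(a & b) / len(a | b)

def surface(voxels):
    """Voxels with at least one 6-connected neighbor outside the region."""
    return {v for v in voxels
            if any((v[0] + dz, v[1] + dy, v[2] + dx) not in voxels
                   for dz, dy, dx in NEIGHBORS)}

def asd(a, b, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance in the units of `spacing` (mm)."""
    sa, sb = surface(a), surface(b)

    def to_mm(v):
        return tuple(c * s for c, s in zip(v, spacing))

    def nearest(v, others):
        return min(dist(to_mm(v), to_mm(u)) for u in others)

    total = sum(nearest(v, sb) for v in sa) + sum(nearest(v, sa) for v in sb)
    return total / (len(sa) + len(sb))

# Toy regions: a 4 x 4 x 4 cube and the same cube shifted by one voxel along x.
truth = {(z, y, x) for z in range(4) for y in range(4) for x in range(4)}
pred = {(z, y, x + 1) for (z, y, x) in truth}
```

In practice the surface distances would be computed with a distance transform on the full volume rather than by exhaustive search.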

Ablation Studies.
The challenges of fully automatic maxillary sinus segmentation include (1) the abnormal and ambiguous anatomical structure of the maxillary sinus in CT scans, (2) ubiquitous lesions that interfere with accurate segmentation, and (3) the demands that the overwhelming scale of the data places on deeper 3D neural networks. In experiments, we compared the maxillary sinus segmentation performance of our proposed model with that of other state-of-the-art methods. Examples of predictions versus ground truth for the compared methods are shown in Figure 7. Eight volume predictions are illustrated in Figure 8.
Our proposed BE-FNet multitask network is divided into two subnetworks: bounding box estimation and in-region segmentation. Through bounding box estimation, the backbone of the entire network is accelerated by working on the effective VoI, which notably eliminates false positives. To highlight the improved speed and quality of our model, we compared it with the state-of-the-art methods 3D U-Net [21], V-Net [29], HL-FCN [62], 3D CNN + CRF [22], 2D FCN + RNN [63], 3D CNN + Level Set [64], and 3D Deep Nested Level Set [65] to demonstrate the advantage of BE-FNet's efficient bounding box estimation and in-region segmentation strategy. For the sake of fairness, all models were evaluated on the same receptive field of 92 × 92 × 92. Specifically, in the design of 3D U-Net or 3D FCN, the baseline of the image encoder and decoder was built according to that of BE-FNet. The results are listed in Table 2. 3D U-Net [21] is the 3D version of 2D U-Net with multilevel feature map concatenation. In experiments, we evaluated the 3D U-Net model at different resolutions with Dice loss over pixels. Without the significant VoI, the encoder-decoder branches suffered from more false positives in the same receptive field and obtained a relatively low Dice of 0.816 ± 0.084 with the three-resolution ensemble strategy. V-Net [29] optimizes 3D U-Net using Dice loss and a novel training set augmentation strategy with random nonlinear transformations and histogram matching. Consequently, it gave an obvious improvement: Dice 0.883 ± 0.053, VOE 13.87 ± 8.06, and ASD 3.95 ± 3.73. HL-FCN [62] presents a hybrid loss function designed under a multitask learning framework to tackle the class imbalance issue and improve discrimination capability, providing a remarkable Dice of 0.905 ± 0.059. 3D CNN + CRF [22] focuses on how to run a 3D segmentation model efficiently on CT scans with a small 3D convolutional kernel. Meanwhile, a CRF is attached to the end of the network to refine the outputs. 2D FCN + RNN [63] is derived from the 2D model and adopts an RNN to extract features between slices. Both methods lack sufficient context utilization, resulting in excessive oversegmentation, with Dice 0.828 ± 0.087 and 0.835 ± 0.073, respectively. The level set model relies on curve evolution, which is competent at segmenting objects with complicated shapes but is sensitive to noise when used on its own. To address this problem, 3D CNN + Level Set [64] uses a deep learning network to predict the foreground or background probabilities of subgrid areas and allocates weights to the energy, which prevents the level set functional from being trapped in local minima. 3D Deep Nested Level Set [65] relies on a 3D CNN to generate proper initial contours so that the evolution happens in the target regions. However, the lesions in the maxillary sinus cavity appear at random locations, and the outside is filled with organs of different densities. Neither model can fit these complicated conditions, showing lower Dice values of 0.719 ± 0.140 and 0.783 ± 0.106. In contrast, our proposed model combines advanced bounding box estimation with more accurate in-region segmentation, achieving Dice 0.929 ± 0.035, VOE 10.89 ± 5.67, and ASD 3.04 ± 2.48 on average over five-fold cross-validation. Specifically, with the same encoder-decoder baseline based on FCN or U-Net, BE-FNet shows a clear advantage that could be generalized to related research fields.
In addition, we compared the inference cost of BE-FNet with that of the aforementioned methods to discuss the time complexity of our proposed model. 3D CNN + Level Set and 3D Deep Nested Level Set were not included, since they are semiautomatic algorithms that require user interaction and their results were not satisfactory. Under the same training configuration, V-Net and HL-FCN are fundamentally derived from U-Net, and their time complexities were similar, with inference times of 11.127 s, 11.173 s, and 11.125 s, respectively. Thanks to the bounding box estimation of the encoder branch, a proper VoI can be extracted, which tremendously scales down the input fed into the segmentor and helps our network restrain false positives, achieving a faster inference time of 0.511 s and more accurate results.

Table 1: Architecture of the proposed BE-FNet, consisting of preprocessing, 3D image encoder and decoder, bounding box estimation, cropping fusion, and logits parts. A placeholder symbol denotes no information about an item. The first and second columns indicate the descriptions of modules and their sublayers, respectively. The fourth and fifth columns give the size of the kernels and their output channels, respectively.

Comparison to Other State-of-the-Art Multitask Networks
Furthermore, we also compared BE-FNet with multitask networks that rely on a separate VoI localization step. In detail, 3D Mask R-CNN [34], RA-UNet [66], 2D FCN + 3D FCN [67], and 3D + 2D FCN [68] were considered, and the results are listed in Table 3. We provide these results for reference and to emphasize the benefits of our optimized 3D bounding box estimation strategy for maxillary sinus segmentation, which may generalize to similar tasks. Among these approaches, 3D Mask R-CNN [34] uses the RPN to produce a large number of anchors for fitting foreground objects. In practice, objects in medical images have low contrast against the background and abnormal anatomical structure, which causes the RPN to overestimate or underestimate the 3D bounding box and leads to failures of bounding box detection and regression, with 0.765 ± 0.121, 30.08 ± 12.39, and 9.39 ± 10.72 for Dice, VOE, and ASD, respectively. 2D FCN + 3D FCN [67] employs a 2D FCN to localize the candidate object on each 2D slice from the predicted probability map, but a 2D FCN serving as an RoI locator produces more false positive candidates because it is weak at extracting features along the z-axis. 3D + 2D FCN [68] estimates the 3D bounding box with a 3D FCN. To reduce GPU memory cost, it uses a 2D FCN to produce the final result, so it likewise cannot exploit volumetric data in the final stage. RA-UNet [66] designs a novel multitask network for attention localization and in-region segmentation. The attention mechanism builds on the Residual Attention Network (RAN) [47] to improve VoI accuracy. However, the proposed RAN architecture only focuses on 2D slices and has relatively low accuracy, and the probability map predicted by the RAN is used as a coarse segmentation, which leads to underestimated bounding box sizes. In this section, BE-FNet + HL refers to BE-FNet with hybrid loss, to stress the advantages of the contour-aware loss.
BE-FNet + HL adopts the AE block to explore 3D target attention effectively and to estimate an overestimated but proper size for the 3D bounding box, which prevents additional false positives. In addition, similar to the skip connections of the FPN architecture, cropping fusion connections share features from different levels of the pyramid hierarchy, which enables our proposed model to achieve the best performance, with Dice 0.947 ± 0.031, VOE 10.23 ± 5.29, and ASD 2.86 ± 2.11. This underlines the advantages of joint task training and cross-module feature sharing. Moreover, the fact that BE-FNet + HL outperforms BE-FNet indicates the benefit of the hybrid loss.
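The idea of cropping a deliberately enlarged VoI can be illustrated with the following minimal sketch. It is not the estimation procedure of BE-FNet, which relies on the AE block and a geometric rule not reproduced in this section; here a coarse binary mask stands in for the salient region, and the function names, the fixed `margin`, and the toy volume size are all hypothetical.

```python
import numpy as np


def overestimated_bbox(mask, margin):
    """Tight box around the nonzero voxels of `mask`, enlarged by `margin`
    voxels on every side and clipped to the volume. Returns per-axis
    (start, stop) pairs."""
    coords = np.argwhere(mask)
    if coords.size == 0:
        raise ValueError("mask is empty; no bounding box can be estimated")
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    return tuple((int(a), int(b)) for a, b in zip(lo, hi))


def crop(volume, bbox):
    """Extract the sub-volume described by per-axis (start, stop) pairs."""
    return volume[tuple(slice(a, b) for a, b in bbox)]


# Toy CT-like volume and a coarse salient region (illustrative sizes only).
volume = np.random.rand(64, 128, 128).astype(np.float32)
salient = np.zeros(volume.shape, dtype=bool)
salient[20:30, 40:70, 50:90] = True

bbox = overestimated_bbox(salient, margin=4)
voi = crop(volume, bbox)
print(bbox, voi.shape, voi.size / volume.size)
```

The enlarged box trades a slightly bigger crop for a lower risk of cutting off part of the target, while the crop still contains only a small fraction of the original voxels, which is the source of the reduced segmentation cost discussed above.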
Besides, we compared the time complexity of our proposed BE-FNet with that of the state-of-the-art algorithms to verify the benefit of our multitask design. Because it generates multiple anchors at each pixel of the feature maps, the RPN must search for a proper bounding box among them, which incurs a high computational cost: 1.020 s for 3D Mask R-CNN + Ensemble. Although RA-UNet + Ensemble relies on salient detection to explore candidate bounding boxes, which saves considerable time, it lacks an effective estimation strategy and therefore produces excessive false positives in prediction. 2D FCN + 3D FCN and 3D + 2D FCN use 2D convolutions to learn discriminative contextual features, which are inadequate for volumetric data and lead to unsatisfactory results, even though they run faster, at 0.068 s and 0.067 s, respectively. Since BE-FNet + Ensemble adopts an efficient and effective bounding box estimation mechanism, it provides a better Dice of 0.929 ± 0.035 at a time cost of 0.511 s. BE-FNet + HL + Ensemble, with hybrid loss, addresses class imbalance and blurred boundaries and achieves the best Dice of 0.947 ± 0.031, at a slightly higher inference cost of 0.539 s. These statistics demonstrate that our proposed BE-FNet outperforms the state-of-the-art models in segmentation accuracy while keeping a low time complexity, which could facilitate its use in computer-aided diagnosis.

Discussion
For accurate and robust maxillary sinus segmentation in clinical diagnosis, we propose a novel multitask neural network that implements an end-to-end training and inference process. Traditional methods face several difficulties, including inhomogeneous intensity, numerous lesions, abnormal anatomical structure, blurred boundaries of the sinus cavity, and excessive computing costs in deeper 3D neural networks. To obtain a fully automatic segmentation method that generalizes, we provide a novel model, BE-FNet, adapted to the maxillary sinus in low-contrast CT scans. The main advantages of the proposed approach are as follows: (1) To facilitate 3D segmentation of large data in a CNN, we design an efficient and effective deeper neural network with the two tasks of 3D bounding box estimation and in-region segmentation. The 3D bounding box estimation greatly reduces the computing cost and eliminates false positives, which enhances the generalization capability of our proposed network. (2) To prevent the overfitting problems that occur in many studies, an overestimation strategy is devised to generate a proper 3D bounding box that extracts the most significant space for in-region segmentation. Besides, to increase the depth of the deep neural network, we design residual dense blocks as the backbone of the model to improve its learning capability. (3) We supply an attention excitation mechanism to improve the salient detection used in the bounding box estimation process, without adding computational burden to the 3D deep neural network. In particular, the multilevel feature fusion structure of the pyramid network strengthens the identification of global and local discriminative features in the foreground and background, which yields better volumetric segmentation results. (4) To resolve the problem of blurred boundaries in the sinus cavity, we design a hybrid loss function that combines Dice and contour-aware loss.
Moreover, a multiresolution model ensemble strategy is introduced to boost segmentation robustness, which generates more reliable results and strongly suppresses false positives.
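As an illustration of how predictions made at different resolutions can be combined, the following sketch averages probability maps after bringing them to a common grid and then thresholds the mean. The fusion rule of the actual ensemble is not described in this section, so plain averaging, nearest-neighbour upsampling by an integer factor, and the 0.5 threshold are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def upsample_nearest(prob, factor):
    """Nearest-neighbour upsampling of a 3D map by an integer factor per axis."""
    for axis in range(prob.ndim):
        prob = np.repeat(prob, factor, axis=axis)
    return prob


def ensemble_average(prob_maps, threshold=0.5):
    """Average same-shaped probability maps and threshold into a binary mask."""
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return mean_prob >= threshold


# Toy predictions: one at full resolution, one at half resolution.
full_res = np.full((4, 4, 4), 0.4)
full_res[:2] = 0.9
half_res = np.full((2, 2, 2), 0.2)
half_res[:1] = 0.8

mask = ensemble_average([full_res, upsample_nearest(half_res, 2)])
print(mask.shape, int(mask.sum()))
```

Voxels are kept only where the averaged evidence is strong, so an isolated false positive produced at a single resolution tends to be voted down by the other maps.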
In addition, the whole pipeline of our proposed model is fully automatic. First, we train the image encoder branch for effective bounding box estimation. Then, with its prediction, the entire network completes an end-to-end process for in-region segmentation.
This novel automatic framework, which combines hybrid tasks and loss functions, provides more accurate maxillary sinus segmentation, especially in low-contrast and noisy CT scans. To show the generalization capability of our method in clinical practice, we evaluated our model on the dataset with five-fold cross-validation. First of all, we compared BE-FNet with the state-of-the-art frameworks to stress the importance of 3D bounding box estimation. Figure 7 illustrates that our proposed model can deal with cases that have low contrast and heterogeneous, noisy backgrounds, and that it outperforms commonly used deep learning frameworks. Table 2 and Figure 7 demonstrate that BE-FNet is more accurate and robust regardless of possible lesions, with an average Dice of 0.929 ± 0.035, VOE of 10.89 ± 5.67, and ASD of 3.04 ± 2.48, clearly ahead of the other approaches. Meanwhile, we also evaluated the GPU inference time of the different models. Thanks to effective VoI extraction, the amount of data to be processed is greatly reduced, which makes deeper 3D networks feasible in our research. Furthermore, to emphasize the efficacy of our proposed bounding box estimation strategy, the localization-based multitask networks 3D Mask R-CNN [34], RA-UNet [66], 2D FCN + 3D FCN [67], and 3D + 2D FCN [68] were included in the comparison, and BE-FNet achieved the state-of-the-art results on maxillary sinus segmentation.
These findings indicate three key points. First, the AE block mechanism benefits accurate salient object localization. Second, setting an empirically overestimated size for the bounding box eliminates false positives as far as possible. Finally, the hybrid loss function balances the extraction of blurred boundaries against the segmentation of small objects in noisy textures. Consequently, our proposed network, which combines efficient 3D bounding box estimation with in-region segmentation, overcomes the aforementioned issues of popular methods and delivers significantly better results.
The presented work has some limitations. First, in salient detection with multiple objects that overlap or lie close together, our proposed mechanism tends to make identification mistakes, which affects the performance of both 3D bounding box estimation and in-region segmentation. Consequently, how to effectively estimate multiple bounding boxes and segment multiple objects in practice requires further study. In addition, in some cases the salient detection areas account for only a low percentage of the ground truth, which causes the estimated center of the 3D bounding box to deviate from the expected position and produces more false positives in segmentation. A possible solution is to incorporate dilated convolutions to enlarge the local receptive fields so that the complete attention region is captured.
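The effect of dilation on the receptive field can be checked with a short calculation. The sketch below assumes a stack of stride-1 convolutions, for which the receptive field along one axis is 1 plus the sum of (kernel size - 1) × dilation over the layers; the layer configurations are hypothetical and are not taken from BE-FNet.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field along one axis of stacked stride-1 convolutions."""
    if len(kernel_sizes) != len(dilations):
        raise ValueError("one dilation rate is required per layer")
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        # Each layer widens the field by (k - 1) * d input positions.
        rf += (k - 1) * d
    return rf


# Three 3x3x3 layers: plain convolutions versus dilation rates 1, 2, 4.
plain = receptive_field([3, 3, 3], [1, 1, 1])
dilated = receptive_field([3, 3, 3], [1, 2, 4])
print(plain, dilated)
```

With the same number of layers and parameters, the dilated stack covers roughly twice the extent per axis, which is why dilation is a candidate remedy when the attention region covers too little of the target.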

Conclusion
In this paper, we present a multitask neural network for 3D maxillary sinus segmentation from CT scans, which consists of 3D bounding box estimation and in-region 3D segmentation. With the AE block mechanism, the proposed model is able to detect the maxillary sinus effectively. Then, based on geometry, an overestimated size of the 3D bounding box is determined. Compared with the state-of-the-art methods, this strategy avoids inappropriate VoIs that would result in oversegmentation or undersegmentation. Moreover, through the cropping fusion layers, the features shared across different levels of the pyramid network remarkably improve the in-region segmentation results. Finally, to address the issue of the blurred cavity boundary, the hybrid loss function improves the extraction of candidate boundaries and the segmentation of small objects in the noisy backgrounds of medical images. Compared with the state-of-the-art methods, our BE-FNet benefits from bounding box localization, which saves computing resources and improves the performance of in-region segmentation. To further evaluate the proposed estimation mechanism, we compared it with popular models such as 3D Mask R-CNN. Extensive experiments yielded competitive results. The limitations discussed above remain to be addressed in future work.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.