Vision Detection Based on Modified UPerNet with Component Analysis Modules

Semantic segmentation with convolutional neural networks under a complex background using the encoder-decoder network increases the overall performance of online machine vision detection and identification. To maximize the accuracy of semantic segmentation under a complex background, it is necessary to consider the semantic response values of objects and components and their mutually exclusive relationship. In this study, we attempt to improve the low accuracy of component segmentation. The basic network of the encoder is selected for the semantic segmentation, and the UPerNet is modified based on the component analysis module. The experimental results show that the accuracy of the proposed method improves from 48.89% to 55.62% while the segmentation time decreases from 721 to 496 ms. The method also shows good performance in vision-based detection of 2019 Chinese Yuan features.


Introduction
As one of the primary tasks of machine vision, semantic segmentation differs from image classification and object detection. Image classification recognizes the type of an object but cannot provide position information [1], whereas object detection can detect the bounding box and type of the object but cannot provide the actual boundary information [2]. On the other hand, semantic segmentation can recognize the type of the object and delineate its actual area at the pixel level, as well as implement certain machine vision detection functions, such as positioning and recognition [3]. As we move from image classification to object detection and finally to semantic segmentation, the accuracy of the output range and position information improves [4]. In the same manner, the recognition precision increases from the image level to the pixel level. Semantic segmentation achieves the best recognition accuracy; therefore, it is useful for (1) distinguishing the entity from the background, (2) obtaining clearly physically defined position information (e.g., the centroid) by indirect calculation, and (3) performing machine vision detection and identification tasks that require high spatial resolution and reliability [5, 6]. Online semantic segmentation with convolutional neural networks (CNNs) under a complex background is effective for improving the overall performance of online machine vision detection and identification [7]: the architecture of the encoder-decoder network and its convolutional and pooling layers is maintained, and the fully connected layer is equivalently transformed, yielding broad generalization. In recent years, ResNet has been used to replace shallow CNNs to significantly optimize semantic segmentation results [8].
For machine vision detection and identification under a random-texture complex background, it is necessary to eliminate the background to extract the object without affecting the original features of the object [9]. The difficulty lies in the randomness of the textured background, which makes it difficult to employ typical periodic-texture elimination techniques, such as frequency-domain filtering and image matrix methods [10, 11]. In contrast, the encoder-decoder semantic segmentation network ultimately retains the classification components in the network backbone, thus exhibiting larger receptive fields and better pixel recognition ability [12, 13], as depicted in Figure 1. An unreasonably selected and consequently incorrectly used component analysis module will lead to an excessively small foreground range, resulting in the misjudgment of component pixels. If the component analysis module is too sensitive, the foreground range will be too broad, and it will be difficult to remove misjudged pixels [14]. Therefore, in the process of semantic segmentation under a complex background, it is necessary to consider objects, the contradiction between the component semantic response values, and their mutual exclusion relationship, while maximizing the accuracy of the semantic segmentation under the complex background using the encoder-decoder network. Figure 2 shows a flowchart of semantic segmentation under a complex background using the encoder-decoder network.
The process can be described as follows: the component classifier of the encoder-decoder network recognizes the pixel-level component semantics and response of the pixels in the image; the object classifier recognizes the pixel-level object semantics and response and extracts misjudged pixels of the foreground object in the semantic segmentation; finally, the mutually exclusive relationship between component semantics and object semantics is considered, and non-background-independent semantics are determined to achieve effective semantic segmentation under a complex background and improve the model accuracy [15].
In this study, we focus on online semantic segmentation under a complex background using the encoder-decoder network to solve the above-described mutual exclusion problem between component semantics and object semantics. The main contributions of this study are threefold: (i) We attempted to improve the low accuracy of component segmentation and selected the superior basic encoder-decoder network according to its performance. (ii) We modified the UPerNet based on the component analysis module to maximize the accuracy of semantic segmentation under a complex background using the encoder-decoder network while maintaining an appropriate segmentation time.
(iii) We show that the proposed method is superior to previous encoder-decoder networks and has satisfactory accuracy and segmentation time. We also show an application of the proposed method in banknote anticounterfeiting identification. The rest of this paper is organized as follows. In Section 1, we outline related works. In Section 2, we introduce a method for semantic segmentation under a complex background using the encoder-decoder network. In Section 3, we verify the proposed method. In Section 4, we present the conclusions.

Evaluation of the Semantic Segmentation Performance.
We can generally evaluate CNN semantic segmentation performance in terms of accuracy and running speed. The accuracy indicators usually include the pixel accuracy [16], mean intersection over union [16], and mean average precision [17]. The pixel accuracy PA is defined as the ratio of the number of correctly segmented pixels to the total number of image pixels; the mean intersection over union IoU is defined as the degree of coincidence between the segmentation results and their ground truth; the mean average precision AP_{IoU_T} is the mean over all classes of the average precision scores for segmentation results whose intersection over union is no less than the threshold IoU_T.
If the object detected by machine vision has k categories, the semantic segmentation model requires labels of k + 1 categories, denoted as L = {l_0, l_1, ..., l_k}, including the background. Denoting the number of pixels of l_i mis-recognized as pixels of l_j (i ≠ j) and correctly recognized as l_i as p_ij and p_ii, respectively, and the numbers of detected objects of l_i mis-recognized as l_j (i ≠ j) and correctly recognized as l_i as N_ij and N_ii, respectively, the pixel accuracy can be calculated as follows:

PA = (Σ_{i=0}^{k} p_ii) / (Σ_{i=0}^{k} Σ_{j=0}^{k} p_ij). (2)

The running speed of CNN semantic segmentation can be measured by indicators such as the segmentation time T_seg [18], defined as the time needed by the algorithm to segment an image. The theoretically shortest possible time required to segment the image is labeled the theoretical segmentation time T_seg−t, and the time required for the algorithm to actually segment the image is the actual segmentation time T_seg−a. Unless otherwise specified, T_seg denotes T_seg−a.
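The metrics above can be computed from a class confusion matrix. The following is a minimal NumPy sketch (the function names and toy labels are ours, not from the paper): PA is the diagonal sum over the total, as in equation (2), and the mean IoU averages p_ii / (row_i + col_i − p_ii) over classes.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Accumulate a (num_classes x num_classes) confusion matrix.

    Entry [i, j] counts pixels with ground-truth label i and predicted
    label j (p_ij in the text; p_ii on the diagonal).
    """
    idx = num_classes * gt.ravel() + pred.ravel()
    return np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    # PA = sum_i p_ii / sum_i sum_j p_ij  (equation (2))
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    # Per-class IoU: p_ii / (row_i + col_i - p_ii); averaged over classes.
    inter = np.diag(cm).astype(float)
    union = cm.sum(axis=1) + cm.sum(axis=0) - inter
    return np.nanmean(inter / np.where(union == 0, np.nan, union))

gt   = np.array([[0, 0, 1], [1, 1, 2], [2, 2, 2]])   # toy ground truth
pred = np.array([[0, 1, 1], [1, 1, 2], [2, 0, 2]])   # toy prediction
cm = confusion_matrix(gt, pred, num_classes=3)
print(pixel_accuracy(cm))  # 7 of 9 pixels correct, ≈ 0.778
```

The same confusion matrix also yields the per-class precision/recall needed for AP-style metrics, which is why segmentation toolkits typically accumulate it once per evaluation pass.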

End-To-End Encoder-Decoder Semantic Segmentation
Framework. Although CNN semantic segmentation performs as a single-step end-to-end process that is not further divided into separate modules to be handled individually, the connection of its numerous internal modules directly affects the CNN.
The end-to-end semantic segmentation framework using the encoder-decoder enables the CNN to process images of any resolution and output prediction maps of constant relative resolution. Typical networks include fully convolutional networks (FCN) [19], SegNet [20], and U-Net [21]. Figure 3 shows a schematic of the FCN model. The FCN is an end-to-end semantic segmentation framework proposed by Jonathan Long et al. (University of California, Berkeley) in 2014. The main idea is as follows: the operation of a fully connected layer is equivalent to the convolution of a feature map with a kernel of identical size. The fully connected layer is therefore converted into a convolutional layer, which turns the CNN into a fully convolutional network consisting entirely of convolutional and pooling layers and able to process images of any resolution. In this manner, the limitation of the fully connected layer is overcome, i.e., images with different resolutions can be processed. The original resolution is restored after eight-times bilinear upsampling by taking the pooling layers as an encoder and designing a cross-layer superimposed architecture as a decoder: the final output feature map of the network is upsampled and added to the output feature map of each pooling layer (namely, the encoder) to obtain a feature map with higher resolution. The CNN can perform end-to-end semantic segmentation through this fully convolutional, cross-layer superimposed architecture; therefore, various CNNs are capable of achieving end-to-end semantic segmentation. Using the described framework, the IoU reached 62.2% on the VOC2012 semantic segmentation test set, which is 10.6% higher than classic methods and 12.2% higher than SDS [22] (whose IoU is 50.0%), which further segments CNN object detections with classical methods.
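The FC-to-convolution equivalence underlying the FCN can be checked numerically: a fully connected layer applied to a flattened feature map produces exactly the same values as a "convolution" whose kernel has the same spatial size as the map (a single valid position). A small NumPy sketch, with illustrative shapes and names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, num_out = 4, 4, 3, 5          # feature map 4x4x3, 5 output units

fmap = rng.standard_normal((H, W, C))
weights = rng.standard_normal((num_out, H, W, C))  # one HxWxC kernel per output

# Fully connected layer: flatten the map, multiply by a weight matrix.
fc_out = weights.reshape(num_out, -1) @ fmap.ravel()

# "Convolution" with a kernel the same size as the feature map: each output
# channel is the sum of the elementwise product, i.e. a valid convolution
# that yields a 1x1 spatial output per channel.
conv_out = np.array([(k * fmap).sum() for k in weights])

print(np.allclose(fc_out, conv_out))  # True
```

Because the second form is a convolution, it slides naturally over larger inputs, which is exactly how the FCN processes images of arbitrary resolution.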
ResNet has been used by the Amazon Artificial Intelligence Laboratory as a basic network for constructing FCNs for semantic segmentation; the IoU in VOC2012 reaches 8.6% [23]. The prediction results of the FCN are obtained by eight-fold bilinear interpolation of the feature map, which causes detail loss, smoothing of complex boundaries, and poor detection sensitivity for small objects. The results also ignore the global scale of the image, possibly exhibiting regional discontinuity for large objects that exceed the receptive field. Incorporating full connection and upsampling increases the size of the network and introduces a large number of parameters to be learned. Figure 4 shows a schematic of the SegNet model, an efficient, real-time end-to-end semantic segmentation network proposed by Alex Kendall et al. (University of Cambridge) in 2015. The idea is that the encoder and the decoder have a one-to-one correspondence: the network applies the pooling indices from the encoder's max pooling to perform nonlinear upsampling, thus forming a sparse feature map, and then performs convolution to generate a dense feature map. SegNet defines the basic network of the encoder-decoder and deletes the fully connected layer used to generate global semantic information. The decoder utilizes the encoder information without training, and the required number of training parameters is 21.7% of that of the FCN. For prediction, SegNet and FCN occupy GPU memories of 1052 and 1806 MB, respectively; on a GTX 980 GPU (video memory 4096 MB), the occupancies are 25.68% and 44.09%, respectively. Therefore, the occupancy of SegNet is 18.41% lower than that of FCN. In [20], the design of SegNet on ResNet was described, and the IoU in VOC2012 reached 80.4% [24]. The IoU of SegNet tested in VOC2012 was reported to be 59.9%, and the

(Figure 2 flowchart: the encoder-decoder network transforms images into feature maps, and the component classifier recognizes the part response map and generates part segmentation results; the object classifier recognizes the object response map and generates object segmentation results with the pixels of the foreground object extracted; the relationship between the component and object is then analyzed to improve the accuracy of part segmentation.)

efficiency was found to be 2.3% lower than that of FCN; furthermore, there was the problem of false detection at the boundary. Figure 5 shows a schematic of the U-Net model, which was proposed by Olaf Ronneberger (University of Freiburg, Germany) in 2015. The idea was to design a basic network that can be trained on semantic segmentation images and to modify the FCN cross-layer overlay architecture so that the high-resolution feature map channels are retained in the upsampling section and then concatenated with the decoder output feature map along the channel dimension. Furthermore, a tiling strategy not limited by GPU memory was proposed; with this strategy, seamless semantic segmentation of arbitrarily high-resolution images was achieved. With U-Net, IoUs of 92.0% and 77.6% were achieved on the grayscale-image semantic segmentation datasets PhC-U373 and DIC-HeLa, respectively. The skip connection was used in the ResNet framework to improve U-Net, and an IoU of 82.7% was achieved on VOC2012 [25]. There are two key problems with the application of U-Net: the basic network needs to be trained, and it can only be applied to a specific task, i.e., it has poor universality. Figure 6 shows a schematic of the UPerNet model, which was proposed by Tete Xiao (Peking University, China) in 2018. In the UPerNet framework, a pyramid pooling module (PPM) is appended to the last layer of the backbone network before it is fed into the top-down branch of a feature pyramid network (FPN). Object and part heads are attached to feature maps fused from all the layers output by the FPN.
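The pyramid pooling idea used in UPerNet's PPM can be illustrated with a minimal NumPy sketch: the feature map is average-pooled at several bin sizes, each pooled map is upsampled back to the input resolution, and the results are stacked with the original map. This is only a structural illustration (single channel, nearest-neighbor upsampling, no learned convolutions), not the actual PPM implementation:

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool a square feature map (H, H) down to (bins, bins)."""
    h = x.shape[0]
    edges = np.linspace(0, h, bins + 1).astype(int)
    out = np.empty((bins, bins))
    for i in range(bins):
        for j in range(bins):
            out[i, j] = x[edges[i]:edges[i+1], edges[j]:edges[j+1]].mean()
    return out

def nearest_upsample(x, size):
    """Nearest-neighbor upsampling of a square map back to (size, size)."""
    idx = np.arange(size) * x.shape[0] // size
    return x[np.ix_(idx, idx)]

def ppm(fmap, bin_sizes=(1, 2, 4)):
    """Pyramid pooling: pool at several scales, upsample, stack with input."""
    h = fmap.shape[0]
    pyramids = [nearest_upsample(adaptive_avg_pool(fmap, b), h) for b in bin_sizes]
    return np.stack([fmap] + pyramids)   # (1 + len(bin_sizes), H, H)

fmap = np.arange(64, dtype=float).reshape(8, 8)
out = ppm(fmap)
print(out.shape)  # (4, 8, 8)
```

The bins=1 branch carries a global average of the whole map, which is how the PPM injects image-level context that a limited receptive field would otherwise miss.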

Material and Methods
Semantic segmentation under a complex background based on the encoder-decoder network establishes an optimized mathematical model relating the minimal segmentation time T_seg−min, the segmentation time T_seg, and the accuracy PA. Under the encoder-decoder network, the backbone network η_main, its depth d_main, and the decoder η_decoder are obtained to form the network. By selecting the relatively better η_main and η_decoder for the basic network, a component analysis module to improve the optimized architecture is proposed, and an encoder-decoder network with optimized PA for semantic segmentation under a complex background is obtained. In the encoder-decoder network, the encoder transforms color images (three 2D arrays) into 2048 2D arrays. The encoder is composed of convolutional and pooling layers, and it can be trained on large-scale classification datasets, such as ImageNet, to gain greater feature extraction capability.
Modeling of semantic segmentation under a complex background using the encoder-decoder network and selection of η_main and η_decoder. The encoder-decoder network is determined by the backbone network η_main, its depth d_main, and the decoder η_decoder. The segmentation time T_seg and accuracy PA depend on η_decoder, η_main, and d_main, and can be expressed as PA(η_decoder, η_main, d_main) and T_seg(η_decoder, η_main, d_main).
Denoting the minimal segmentation time as T_seg−min (the recommended value is 600 ms), the mathematical model of the optimization for semantic segmentation under a complex background based on the encoder-decoder network is as follows:

max PA(η_decoder, η_main, d_main) subject to T_seg(η_decoder, η_main, d_main) ≤ T_seg−min.

The parameters of the model to be optimized are d_main, η_main, and η_decoder.
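The network selection in Table 1 then amounts to maximizing PA subject to the segmentation-time constraint. A sketch of this selection rule, with placeholder PA and T_seg values (they are not the paper's measurements):

```python
# Hypothetical candidate networks; the PA and T_seg numbers below are
# placeholders for illustration only.
candidates = [
    {"decoder": "FCN",     "backbone": "ResNet", "depth": 50,  "PA": 0.47, "T_seg": 530},
    {"decoder": "PPM",     "backbone": "ResNet", "depth": 50,  "PA": 0.48, "T_seg": 560},
    {"decoder": "PPM+FPN", "backbone": "ResNet", "depth": 50,  "PA": 0.49, "T_seg": 496},
    {"decoder": "PPM+FPN", "backbone": "ResNet", "depth": 101, "PA": 0.50, "T_seg": 726},
]

T_SEG_MIN = 600  # ms, the recommended upper bound from the text

# Maximize PA(eta_decoder, eta_main, d_main) subject to T_seg <= T_seg-min.
feasible = [c for c in candidates if c["T_seg"] <= T_SEG_MIN]
best = max(feasible, key=lambda c: c["PA"])
print(best["decoder"], best["depth"])  # PPM+FPN 50
```

Note how the deeper network is excluded by the time constraint despite its higher PA, mirroring the trade-off that led to d_main = 50.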
First, d_main, η_main, and η_decoder are combined. Then, the object segmentation accuracy PA_obj, the component segmentation accuracy PA_comp, and T_seg are compared to select the relatively better η_main and η_decoder for the basic network. The ADE20K dataset, which has diverse annotations of scenes, objects, parts of objects, and parts of parts [26], is selected. In this paper, we refer to parts of objects as components. Using a GeForce GTX 1080Ti GPU and the training method described in [27], we obtained PA_obj and PA_comp for the improved FCN [19], PSPNet [28], UPerNet [29], and other major encoder-decoder networks for semantic segmentation on the ADE20K [26] object/component segmentation dataset. We evaluated the T_seg of the different networks on the ADE20K test set, which consists of 3000 images of different resolutions with an average size of 1.3 million pixels. Table 1 displays the pixel accuracy and segmentation time of the main network architectures on the ADE20K object/component segmentation tasks, where the relatively better indices are indicated by a rectangular contour.
From Table 1, the following observations can be made. ① In all networks, PA_comp is less than PA_obj by about 30%. ② η_main and d_main are equal in networks 1, 2, and 3; PA_comp and PA_obj are better for η_decoder = FPN + PPM than for η_decoder = FCN or η_decoder = PPM. ③ η_main and η_decoder are equal in networks 3 and 4; when d_main is doubled, PA_comp improves only slightly while T_seg increases significantly. After comprehensive consideration, we selected the UPerNet [29] encoder-decoder network, where η_main = ResNet, d_main = 50, and η_decoder = PPM + FPN. Figure 7 shows the architecture of semantic segmentation under a complex background implemented by UPerNet [29]. The encoder ResNet halves the feature map resolution at each stage; the resolutions of the output feature maps of the five stages are reduced to 1/2, 1/4, 1/8, 1/16, and 1/32, respectively. The decoder is PPM + FPN. Through pooling layers with different strides, the feature maps are analyzed in a multiscale manner within the PPM. Through three transposed convolution layers, the resolution of the feature maps is successively doubled to 1/16, 1/8, and 1/4. The upsampling then restores the feature map resolution to 1/1. The component analysis module recognizes the feature map and outputs both the object and component segmentation results. Figure 8 shows the architecture of the component analysis module.

Improvements of UPerNet for Semantic Segmentation under a Complex Background Based on the Component Analysis Module.
In this subsection, we describe the derivation of the component analysis module, the optimization of the function expression of the module, and the construction of the architecture of the component analysis module.
As shown in Figure 8, the component classifier recognizes N_comp component semantics and outputs the component label C^{u,v}_Comp of the pixel at image position (u, v) and the probability vector p^{u,v}_Comp over the component labels. The relationship between C^{u,v}_Comp and p^{u,v}_Comp [31] is

C^{u,v}_Comp = argmax_k p_Comp−k. (4)

From equations (4) and (5), we obtain equation (6), where p_Obj−j is the probability of the object label to which C^{u,v}_Comp belongs. Weighting p_Comp−k by p_Obj−j to obtain p̃_Comp−k, instead of using 1 × argmax_k p_Comp−k, reduces the weight of components belonging to low-probability object labels and increases PA_comp. For mutually exclusive component-object pairs Comp−Obj, letting p̃_Comp−k = 0 increases the detection rate of background pixels. Therefore, the module can be expressed as

C^{u,v}_Comp = argmax_k p̃_Comp−k, where p̃_Comp−k = p_Obj−j × p_Comp−k (and p̃_Comp−k = 0 if Comp−Obj are mutually exclusive), (7)

which is the component analysis module yielded by replacing 1 × argmax_k p_Comp−k with argmax_k p̃_Comp−k and considering the Comp−Obj exclusion relationship. The optimized architecture of the UPerNet component analysis module is proposed based on equation (7). Figures 9(a)-9(c) show the optimized architectures obtained by replacing 1 × argmax_k p_Comp−k with argmax_k p̃_Comp−k, by considering the Comp−Obj exclusion, and by applying both modifications to the component analysis module, respectively.
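The behavior of the component analysis module can be sketched per pixel: each component probability is weighted by the probability of its parent object, components mutually exclusive with the predicted object are zeroed, and the argmax is taken over the weighted vector. The toy probabilities, parent mapping, and exclusion table below are hypothetical; the actual module operates on full response maps:

```python
import numpy as np

def component_analysis(p_comp, p_obj, comp_parent, exclusion):
    """Per-pixel sketch of the idea behind equation (7).

    p_comp      : (N_comp,) component-label probabilities p_Comp-k
    p_obj       : (N_obj,)  object-label probabilities p_Obj-j
    comp_parent : comp_parent[k] = index of the object component k belongs to
    exclusion   : (N_comp, N_obj) boolean; True where component k is
                  mutually exclusive with object j
    """
    obj = int(np.argmax(p_obj))                # predicted object label
    weighted = p_comp * p_obj[comp_parent]     # p~_Comp-k = p_Obj-j * p_Comp-k
    weighted[exclusion[:, obj]] = 0.0          # exclusive pairs -> background
    return int(np.argmax(weighted)), weighted

# Toy setup: 3 components, 2 objects; components 0 and 1 belong to object 0,
# component 2 belongs to object 1 and cannot occur on object 0.
p_comp = np.array([0.30, 0.30, 0.40])
p_obj = np.array([0.90, 0.10])
comp_parent = np.array([0, 0, 1])
exclusion = np.array([[False, False],
                      [False, False],
                      [True,  False]])

label, w = component_analysis(p_comp, p_obj, comp_parent, exclusion)
print(label)  # 0: plain argmax over p_comp alone would have picked component 2
```

The weighting suppresses the high-scoring but implausible component 2 because its parent object is unlikely, which is exactly the mechanism claimed to raise PA_comp.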

ADE20K Component Segmentation
Task. For the UPerNet model, the backbone network of the encoder was ResNet with d_main = 50, and the decoders were PPM + FPN plus the component analysis modules (before/after modification). We trained each network on the object/component segmentation dataset ADE20K [26] and report the pixel accuracy PA_Part and segmentation time T_seg. The experiments were run on a GeForce GTX 1080Ti GPU. The best results were obtained with the modified component analysis module of Figure 9(c).

CITYSCAPES Instance-Level Semantic Labeling Task.
We trained each UPerNet (with/without the component analysis module) on the instance-level semantic labeling task of the CITYSCAPES dataset [32]. To assess the instance-level performance, CITYSCAPES uses the mean average precision AP and the average precision AP_0.5 [32]. We also report the segmentation time of each network run on a GeForce GTX 1080Ti GPU and an Intel i7-5960X CPU. Table 3 presents the performances of different methods on the CITYSCAPES instance-level semantic labeling task. Table 4 presents the class-level mean average precision AP of the UPerNet with/without the component analysis module on the CITYSCAPES instance-level semantic labeling task. From the tables, it can be seen that the modified component analysis module effectively improved the performance of the UPerNet. With the component analysis module, both AP and AP_0.5 improved, while the segmentation time T_seg increased only slightly from 447 to 451 ms. Most of the class-level AP values of the UPerNet improved. Figure 10 shows some CITYSCAPES instance-level semantic labeling results obtained with the UPerNet with/without the component analysis module.
Taking banknote detection as an example, we set up semantic segmentation models with the component analysis modules (before/after modification) for vision-based detection of 2019 Chinese Yuan (CNY) features under backlight to demonstrate the segmentation performance of the proposed method.
The vision-based detection system consisted of an MV-CA013-10 GC industrial camera, an MVL-HF2528M-6MP lens, and an LED strip light. The field of view was 18.33°, and the resolution was 1280 × 1024. Under the backlight, we collected 25 CNY images of the fronts and backs of various denominations at random angles. Then, we marked four types of light-transmitting anticounterfeiting features, namely, security lines, pattern watermarks, denomination watermarks, and Yin-Yang denominations. All four features were detected in the CNY images to generate our dataset (200 images). We trained the model with the different component analysis modules on our dataset and report PA_Part and T_seg. Table 5 presents the pixel accuracy and segmentation time of the UPerNet with the different component analysis modules for vision-based detection of CNY anticounterfeiting features, and Figure 11 shows the segmentation results of the anticounterfeiting features detected by the UPerNet with/without the component analysis module.
From Table 5, it can be seen that the proposed method improved PA_Part from 90.38% to 95.29%, while T_seg increased only slightly from 490 to 496 ms. Moreover, AP_{IoU_T = 0.5} increased from 96.1% to 100%, detecting all the light-transmission anticounterfeiting features without false detection, missed detection, or repeated detection.
(Figure 9 caption fragments: (b) analyze the relation between component and object, Comp−Obj, to optimize the module; (c) replace 1 × argmax_k p_Comp−k with argmax_k p̃_Comp−k and analyze Comp−Obj to optimize the module.)
Figure 10: CITYSCAPES instance-level semantic labeling by UPerNet.

Conclusions
In this study, we performed semantic segmentation under a complex background using the encoder-decoder network to solve the issue of the mutually exclusive relationship between the semantic response value and the semantics of the object/component in semantic segmentation under a complex background for online machine vision detection. The following conclusions can be drawn from this study.
(i) Considering the mutually exclusive relationship between the semantic response value and the semantics of the object/component, we selected the mathematical model of semantic segmentation under a complex background based on the encoder-decoder network for optimization. It was found that the model in which 1 × argmax_k p_Part−k was replaced with argmax_k p̃_Part−k, together with the corresponding component analysis module, improved the performance of the UPerNet encoder-decoder network.
(ii) In vision-based detection of the 2019 CNY features, we showed that the proposed method improved PA_Part from 90.38% to 95.29% while T_seg increased only slightly from 490 to 496 ms; AP_{IoU_T = 0.1} also increased from 96.1% to 100%, detecting all the light-transmission anticounterfeiting features without false detection, missed detection, or repeated detection. However, the efficiency improvement is affected by the accuracy of object segmentation. In our next study, we will investigate the applicability of machine learning to the component analysis module to achieve higher performance in different applications.

Data Availability
The ADE20K dataset used to support the findings of this study is available at http://groups.csail.mit.edu/vision/datasets/.
The CITYSCAPES dataset used to support the findings of this study is available at https://www.cityscapes-dataset.com. Pretrained models and code are released at https://github.com/CSAILVision/semantic-segmentation-pytorch.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.