Recognition of multi-modal fusion images with irregular interference

Recognizing tomato fruits in color images faces two problems: tomato plants have a long fruit-bearing period, so fruits on the same plant differ in color, and growing tomato plants commonly suffer from occlusion. In this article, we propose a neural network classification technique that detects maturity (green, orange, red) and occlusion degree for automatic picking. Depth images (geometric boundary information) were fused with the original color images (visual boundary information), combining the RGB and depth information into an integrated set of compact features. The resulting model, named RD-SSD, reached a mAP of 0.9147 on the combined maturity and occlusion-degree classification task.


INTRODUCTION
In daily life, vegetables are intricately linked to people's lifestyle and health, so their yield and quality are closely tied to human life. In the natural environment, overlap and obscuration are common, making the dynamics of fruit growth difficult to estimate and forecast, and fruit phenotypes hard to evaluate.
Recognizing tomatoes in color images faces two problems: first, tomato plants have a long fruit-bearing period, so fruits on the same plant differ in color and green fruits resemble the background color of the plant; second, growing tomato plants commonly suffer from occlusion. To reduce the influence of the environment and growth stage on recognition accuracy, we propose a color and depth image fusion method that improves the recognition accuracy for tomato fruits.

RELATED WORK
The tomato plant has a long growth cycle, and green immature tomatoes are similar in color to the plant background. The dense growth of tomato plants brings occlusion, overlap and insufficient light, which makes recognition challenging (Arefi et al., 2011; Baltazar, Aranda & Aguilar, 2008). Color thresholds are often used to segment tomatoes; for example, Khoshroo, Arefi & Khodaei (2014) detected tomatoes based on the R-G component with a bounding-selection method. In this work, one classification standard is the occlusion state of the fruit (occluded or non-occluded) and the other is the maturity degree (green, orange and red), which combine into six object classes: i1 represents non-occluded immature fruits, i2 occluded immature fruits, i3 non-occluded semi-mature fruits, i4 occluded semi-mature fruits, i5 non-occluded mature fruits, and i6 occluded mature fruits.
Each image in the training set is augmented with eight transformed versions by the ToSensor, PhotometricDistort, Expand, RandomSampleCrop, RandomMirror, ToPercentCoords, Resize and SubtractMeans methods in Fig. 2, which perform random translation, scaling, rotation and color transformation.
In tomato growing environments, image acquisition is susceptible to illumination, which affects identification accuracy. To address this problem, we propose HRGAN, a novel method that learns to remove highlights from unpaired training data and learns the latent relationship between the highlight image domain H and the highlight-free image domain F. We train a generator network G, which takes highlight images I_h ∈ H as input and generates highlight-free images I_f ∈ F, and use a discriminator D to judge the generated images, as shown in Fig. 3. In the generator network G, we use Long Short-Term Memory (LSTM) units to preserve valuable features and keep the detected highlight regions realistic. The LSTM contains an input gate i_t, an output gate o_t, a forget gate f_t and a cell state C_t, where t is the time step (Ding et al., 2019). The input of the LSTM is given in Eq. (1), where X_t is the output feature of the residual module, W and b are the weight and bias respectively, C_t is the cell state passed to the next step, and H_t is the output feature of the LSTM unit. We initialize the highlight intensity to 0.5. At each time step, the current attention map is concatenated with the input image and fed into the next recursive block of the recurrent network. The loss of HRGAN combines the highlight detector and discriminator losses, as in Eq. (2). The highlight detector compares the generated intensity masks {M_i} from blocks 1 to N with the ground truth T; the detector loss over the N recursive blocks is given in Eq. (3), where M_i is the mask output of the i-th block, T_i is the ground truth resized to the i-th block, and β_i is the weight of the Mean Square Error (MSE) loss for the i-th iteration, designed as β_i = 0.5^(N−i+1). L_P is a perceptual loss that measures the global difference between the ground-truth image and the highlight-removal result (Qian et al., 2018). We extract image features with VGG16 (Simonyan & Zisserman, 2014) pretrained on ImageNet; the perceptual loss is given in Eq. (4), where VGG(L_i) and VGG(T) are the VGG16 features of images L_i and T. The discriminator network judges whether the image produced by the generator looks real; the generative adversarial loss L_Adv is defined in Eq. (5).
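As a minimal sketch of the detector loss in Eq. (3), the following pure-Python snippet computes the block weights β_i = 0.5^(N−i+1) and the weighted sum of per-block MSE terms; the mask and ground-truth values are illustrative stand-ins, not the paper's data.

```python
# Sketch of the multi-scale detector loss of Eq. (3): a weighted sum of
# per-block MSE terms, with beta_i = 0.5 ** (N - i + 1), so that later
# (finer) recursive blocks contribute more to the total loss.

def block_weights(n_blocks):
    """Weight beta_i = 0.5^(N - i + 1) for block i in 1..N."""
    return [0.5 ** (n_blocks - i + 1) for i in range(1, n_blocks + 1)]

def mse(pred, target):
    """Mean squared error between two flat lists of values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def detector_loss(masks, truths):
    """Weighted sum of MSE(M_i, T_i) over the N recursive blocks."""
    betas = block_weights(len(masks))
    return sum(b * mse(m, t) for b, m, t in zip(betas, masks, truths))

# With N = 4 blocks the weights are 0.0625, 0.125, 0.25, 0.5:
print(block_weights(4))
```

Note how the geometric weighting makes the last block's mask dominate, which matches the intent that the final refinement stage should match the ground truth most closely.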

Refinement backbone based on Inception module
Since AlexNet (Krizhevsky, Sutskever & Hinton, 2012), increasingly deep networks have been proposed to solve more complex problems, such as VGG16, VGG19 and GoogLeNet (Szegedy et al., 2014). In this study, a modified multi-scale Inception block was designed to address the limitations of GoogLeNet and the characteristics of the recognition task. The network draws on the main architecture of the Inception v1 block; the multi-scale structure is shown in Fig. 4, and improvements are made in two aspects. First, the Inception structure decomposes the convolution kernels: 1×3 and 3×1 convolution kernels are used on the feature maps instead of a 3×3 kernel, which keeps the receptive field of the decomposed kernels unchanged.
Second, each branch corresponds to a different receptive field (RF) size (Liu, Huang & Wang, 2018); dilated convolutions control the eccentricity of each branch, and the dimensions are adjusted to generate the final feature map. For the same 7×7 target, two stacked conventional 3×3 convolution kernels only reach a 5×5 receptive field, whereas following a 3×3 kernel with a 3×3 kernel of dilation rate 2 reaches a 7×7 receptive field.
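The receptive-field arithmetic above can be checked with a short sketch: the effective kernel size of a dilated convolution is d·(k−1)+1, and the receptive field of a stride-1 stack grows by (k_eff − 1) per layer.

```python
# Receptive-field arithmetic for stacked stride-1 convolutions, as used in
# the multi-scale Inception branches described above.

def effective_kernel(k, dilation=1):
    """Effective kernel size of a dilated convolution: d*(k-1) + 1."""
    return dilation * (k - 1) + 1

def receptive_field(layers):
    """RF of a stack of stride-1 convs: 1 + sum of (k_eff - 1) per layer.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf

# Two plain 3x3 convs reach a 5x5 RF; replacing the second with a
# dilation-2 3x3 conv reaches 7x7 without extra parameters:
print(receptive_field([(3, 1), (3, 1)]))  # 5
print(receptive_field([(3, 1), (3, 2)]))  # 7
```

This is why the dilated branch widens the receptive field at no extra parameter cost, which suits detecting fruits at multiple scales.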

Development recognition model based on SSD algorithm
The multi-modal images used in this article, color images and depth images, are denoted I_rgb and I_depth respectively. The shape of I_rgb is 3×512×424 and the shape of I_depth is 1×512×424. After data augmentation, the images are scaled to 3×300×300 and 1×300×300 as input layers, providing the basis for feature generation. Separate feature-map generation networks are designed for the color and depth images to extract feature maps at different stages. The multi-scale feature maps are the results of convolutions and express different levels of image information, such as local features, edge features and texture features. The feature maps generated by the color and depth images are denoted C_r(n) and C_d(n), as shown in Eq. (6), where r denotes the color image, d the depth image, n the feature layer, and f the convolution and pooling operations on the feature layers. The six feature maps for the color and depth images have sizes 38×38, 19×19, 10×10, 5×5, 3×3 and 1×1. High-level features have larger receptive fields and capture more semantic information, while low-level features have higher resolution and contain accurate localization details, complementary to the abstract features. The feature maps of each layer are combined to obtain color and depth feature-map sets covering the fruit characteristics at multiple receptive-field scales, denoted F_rgb and F_depth, as shown in Eq. (7). The RD-SSD architecture is composed of two parallel subnetworks, RGB-Network and Depth-Network, which together form one neural network.
Figure 5 shows the feature network diagram of the RD-SSD model, including six color feature maps (conv4_3-r, conv7(FC7)-r, conv8-r, conv9-r, conv10-r, conv11-r) and six depth feature maps (conv4_3-d, conv7(FC7)-d, conv8-d, conv9-d, conv10-d, conv11-d). The prior boxes generated on the feature maps are fed to the detection network to produce results on the reference layers. At the detection layer, each image generates six feature maps with n² center points, and each center point generates k prior boxes. For the color and depth fusion, the conv4_3-r, conv7(FC7)-r, conv8-r, conv9-r, conv10-r, conv11-r, conv4_3-d, conv7(FC7)-d, conv8-d, conv9-d, conv10-d and conv11-d feature layers are assigned (4, 6, 6, 6, 4, 4, 4, 6, 6, 6, 4, 4) prior boxes per feature point, respectively. For each feature point of a feature map, the corresponding prior boxes are assigned to the feature-map layers. B_rgb and B_depth are fully connected to produce the total set of prior boxes B_all; the calculation is shown in Eq. (8). The size of an SSD prior box is related to the shape of the identified object and typically includes squares and rectangles of different proportions and sizes. Multiple prior-box dimensions should be provided in the occluded-fruit recognition model to improve its generalization ability for fruit identification. The prior-box settings include scale and aspect ratio, calculated as in Eq. (9), where m is the number of feature maps, s_k is the scale of the prior box relative to the feature map, s_min is 0.2 and s_max is 0.9.
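The prior-box totals quoted later in the paper (8,732 per branch, 17,464 after fusion) follow directly from the grid sizes and per-cell box counts above; a small sketch of the bookkeeping:

```python
# Prior-box bookkeeping for the SSD feature layers described above. Each of
# the RGB and depth branches uses grids of 38, 19, 10, 5, 3 and 1 cells per
# side, with (4, 6, 6, 6, 4, 4) prior boxes per cell.

GRID_SIZES = [38, 19, 10, 5, 3, 1]
BOXES_PER_CELL = [4, 6, 6, 6, 4, 4]

def prior_box_count(grids, per_cell):
    """Total prior boxes: sum over layers of (cells per side)^2 * boxes."""
    return sum(g * g * k for g, k in zip(grids, per_cell))

single = prior_box_count(GRID_SIZES, BOXES_PER_CELL)
print(single)      # 8732 boxes for one branch (color-SSD or depth-SSD)
print(2 * single)  # 17464 boxes after fusing both branches in RD-SSD
```

The conv4_3 layer alone contributes 38 × 38 × 4 = 5,776 of the 8,732 boxes, which is why low-level layers dominate the matching stage.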
For the position of a prior box, its center is set to ((a+0.5)/|f_k|, (b+0.5)/|f_k|), where |f_k| is the size of the k-th feature map and a, b ∈ {0, 1, 2, …, |f_k|−1}; the prior-box coordinates are normalized to lie within [0, 1]. The mapping between the prior-box coordinates on the feature map and the original image coordinates is given in Eq. (10), where w_feature and h_feature are the width and height of the feature layer, w_img and h_img are the width and height of the original image, and the resulting (x_min, y_min, x_max, y_max) are the coordinates mapped to the original image by the prior box with center ((a+0.5)/|f_k|, (b+0.5)/|f_k|) and size (w_k, h_k) on the feature map of layer k.
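A minimal sketch of the Eq. (10) mapping, assuming the prior-box size (w_k, h_k) is expressed as a fraction of the image (the function name and example values are illustrative, not from the paper):

```python
# Sketch of Eq. (10): map a prior box centred at ((a+0.5)/f_k, (b+0.5)/f_k)
# with normalized size (w_k, h_k) back to pixel coordinates on the image.

def prior_to_image_coords(a, b, f_k, w_k, h_k, w_img, h_img):
    """Return (x_min, y_min, x_max, y_max) in original-image pixels."""
    cx = (a + 0.5) / f_k  # normalized centre x on the feature map
    cy = (b + 0.5) / f_k  # normalized centre y on the feature map
    x_min = (cx - w_k / 2) * w_img
    y_min = (cy - h_k / 2) * h_img
    x_max = (cx + w_k / 2) * w_img
    y_max = (cy + h_k / 2) * h_img
    return x_min, y_min, x_max, y_max

# A box of relative size 0.2 x 0.2 centred on cell (4, 4) of a 10x10 map,
# mapped onto a 300x300 input image (roughly pixels 105..165 on each axis).
print(prior_to_image_coords(4, 4, 10, 0.2, 0.2, 300, 300))
```

Normalizing first and scaling last keeps the same prior-box definition valid for both the 38×38 and the 1×1 feature maps.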
In the natural growth of tomatoes, fruits are occluded by leaves, stems and other fruits, and the scale setting of the boxes is related to occlusion, with the aspect ratio a_r ∈ {1, 2, 3, 1/2, 1/3}: 1 means an aspect ratio of 1:1, 2 means 2:1, 3 means 3:1, 1/2 means 1:2 and 1/3 means 1:3. As shown in Table 1, different scale and aspect-ratio parameters are used for the feature maps. In addition, as the feature maps become deeper, the receptive field becomes larger.
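A sketch of the scale schedule in Eq. (9), assuming the standard SSD linear spacing of s_k between s_min and s_max over the m feature maps (the rounding is only for display):

```python
# Sketch of the prior-box scale schedule of Eq. (9): scales spaced linearly
# between s_min = 0.2 and s_max = 0.9 across the m feature maps.

def prior_scales(m, s_min=0.2, s_max=0.9):
    """Relative scale s_k of the prior box for feature maps k = 1..m."""
    return [round(s_min + (s_max - s_min) * (k - 1) / (m - 1), 4)
            for k in range(1, m + 1)]

# Six feature maps give scales from 0.2 (finest grid) up to 0.9 (coarsest):
print(prior_scales(6))  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Small scales on the 38×38 map catch distant or partially hidden fruits, while large scales on the coarse maps cover close-up fruits.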
RGB-D target recognition methods based on early fusion can exploit the correlation between features from different modalities at an early stage, which helps to complete the task better (Snoek, Worring & Smeulders, 2005). However, decision-level representations are usually homogeneous, which makes decision fusion easier (Gao et al., 2019).
RD-SSD matches the prior boxes of the feature maps to the real targets P by maximizing overlap, which measures the overlap between ground-truth and predicted boundaries. The IoU formula for all prior boxes is given in Eq. (11), where i indexes the prior boxes and j indexes the ground-truth fruit objects. In this article, a prior box is considered a true match only if the IoU of the prior box in B_all with the ground-truth bounding box is greater than 0.5. A large number of default bounding boxes can be generated after sampling and grouping on the same feature points. In the post-processing stage of fruit detection, non-maximum suppression (NMS) is used to filter the generated boxes; finally, one optimal bounding box is retained per fruit to eliminate overlapping prior boxes. The IoU bounding-box filtering process is shown in Table 2. The loss function of the RD-SSD model consists of the position loss L_loss and the classification confidence loss C_loss (Liu et al., 2016). The comprehensive loss F_loss is the weighted sum of the position loss and the confidence loss, calculated as in Eq. (12), where l is the target position and c the classification.
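The matching and filtering steps above can be sketched with a minimal IoU and greedy NMS implementation; box coordinates and scores here are illustrative.

```python
# Minimal IoU and NMS sketch for the matching/filtering step: a prior box
# counts as a positive match when its IoU with a ground-truth box exceeds
# 0.5, and greedy NMS keeps one best box per fruit.
# Boxes are (x_min, y_min, x_max, y_max) tuples.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the two overlapping boxes collapse to one
```

Here the first two boxes overlap with IoU ≈ 0.68 > 0.5, so the lower-scoring one is suppressed, mirroring how one optimal box is kept per fruit.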

EXPERIMENTS Data and experiment setup
The PyTorch deep learning framework is used in this article. The computing resources for the deep learning experiments are two 2678 v3 CPUs (24 cores, 48 threads), 16 GB of memory, a GTX 1080 Ti 11 GB graphics card, and the Ubuntu 18.04 operating system. The dataset contains color and depth images, divided into 64% training data, 16% validation data and 20% test data. The dataset covers six categories from two setups: the first corresponds to non-occluded and occluded scenes, and the second to fruit maturity. Non-occluded immature tomatoes are labeled tomato1, occluded immature tomatoes tomato2, non-occluded semi-mature tomatoes tomato3, occluded semi-mature tomatoes tomato4, non-occluded mature tomatoes tomato5, and occluded mature tomatoes tomato6. Statistics for each dataset are shown in Table 3.

Experiment based on color image
The color feature maps are extracted from the specified convolution layers, and 8,732 prior boxes are obtained. The localization and classification of tomato fruits in natural scenes are performed by the basic network architecture. During model training, batch_size is set to 8, the number of iterations to 120,000 and the learning rate (lr) to 1e−3, and the test set is evaluated every 500 iterations.
Figure 6 shows the loss plots and mAP (six categories) during the training procedure. The model at the 26,000th iteration is the best: mAP reaches 0.8914 and the loss value is 1.688, at the same level as the minimum loss. Therefore, the model produced after 26,000 iterations is used as the color-SSD tomato fruit classification model.

Experiment based on depth image
An SSD model is constructed on the depth images to verify its recognition effect. Similar to the SSD pipeline for color images, the SSD network extracts features from the depth maps and obtains 8,732 prior boxes. Figure 7 shows the loss value and mAP during training of the depth-SSD model. At the 112,380th iteration, the loss reaches its minimum of 1.584, and more iterations are needed for the loss to stabilize than in the color-SSD model. The model at the 93,500th iteration is the best, with mAP reaching 0.7876. The mAP of the depth-SSD model is lower than that of the color-SSD model. The depth image reflects the position information of the fruit; during feature learning it is sensitive to the fruit's edge information and can identify occluded fruit.

Experiment based on RD-SSD
The RD-SSD model performs tomato fruit recognition, maturity classification and occlusion classification. The neural network has two branches for feature extraction, corresponding to the color and depth images. The number of prior boxes N_prior of the fused feature maps is 17,464. The maximum number of iterations is 120,000, the test set is verified every 500 iterations, the learning rate is 1e−3, the batch_size is 8, and the optimizer uses the Adaptive Moment Estimation (Adam) method. Figure 8 shows the loss value and mAP during training of the RD-SSD model. The analysis shows that the RD-SSD model reaches a stable state in fewer iterations than the color-SSD and depth-SSD models. Its loss value is also lower, indicating a smaller recognition deviation on the validation set and a better recognition effect. The model is optimal at the 92,500th iteration, with mAP reaching 0.9147 and a loss value of 0.72, at the same level as the minimum loss. The classification accuracy (AP) is 0.9141 for tomato1, 0.9031 for tomato2, 0.9243 for tomato3, 0.9173 for tomato4, 0.9207 for tomato5 and 0.9082 for tomato6.
To compare the overall effect of the model experiments, the results of the three methods, color-SSD, depth-SSD and RD-SSD, are compared and analyzed as a whole. Table 4 compares the tomato fruit recognition and classification results, including the overall recognition effect of each model and the recognition effect for each class.
In the comparison (Tables 5 and 6), the RD-SSD model was significantly more accurate, owing to its use of decisions from multiple feature maps and its matching strategy. By combining color and depth feature information, RD-SSD improves the recognition rate of occluded tomatoes: color image features mainly contribute to the classification of fruit maturity, and depth image features mainly contribute to occlusion recognition. The fusion of visible and depth images improves the perception ability of the tomato fruit system in maturity classification and occlusion recognition.

DISCUSSION
To express more intuitively the recognition effects of the optimal color-SSD, depth-SSD and RD-SSD models, this section analyzes and compares their recognition results on the test images. Figure 9 shows a comparison of the recognition results of the three models. The image has a complex background with many tomato fruits and dense leaves, a typical complex-background object recognition scene. The color-SSD model identified 14 tomatoes, including immature, semi-mature, mature, occluded and non-occluded fruits; the depth-SSD model identified nine tomatoes, mainly immature ones; the RD-SSD model identified 16 tomatoes, two more than color-SSD: one a ripe tomato obscured by leaves and the other an immature tomato overlapping adjacent fruits. The results show that the RD-SSD model learned the edge information of the fruit in the depth image, improving the recognition effect compared with the other two models.

CONCLUSIONS
In this article, to effectively integrate multi-modal features and generate accurate feature maps, a multi-modal deep aggregation module, RD-SSD, was designed to facilitate the efficient fusion of texture and depth features. Plant images with different maturity and occlusion degrees were selected to construct the dataset, and data augmentation was used to improve the generalization ability of the model and the distinguishability of features. For the classification of tomato fruit maturity and occlusion, the recognition rates (AP) of the RD-SSD model for the six fruit classes reached 0.9141, 0.9031, 0.9243, 0.9173, 0.9207 and 0.9082. Adding the depth image to color-image recognition improves the classification of fruit occlusion. The multi-modal fusion method provides a new direction for plant fruit identification and classification and has research value for the study of fruit phenotypes during the fruit-setting and fruiting periods.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This project was funded by the Research and Development of Greenhouse Cluster Control System (s20163081109). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors: Research and Development of Greenhouse Cluster Control System: s20163081109.