Counting Crowded Soybean Pods Based on Deformable Attention Recursive Feature Pyramid

Abstract: Counting soybean pods automatically is one of the key steps toward intelligent soybean breeding in modern smart agriculture. However, pod counting accuracy for whole soybean plants is still limited by the crowding and uneven distribution of pods. In this paper, based on the VFNet detector, we propose a deformable attention recursive feature pyramid network for soybean pod counting (DARFP-SD), which aims to count soybean pods accurately. Specifically, to improve feature quality, DARFP-SD first introduces deformable convolutional networks (DCN) and an attention recursive feature pyramid (ARFP) to reduce noise interference during feature learning. DARFP-SD further incorporates the Repulsion Loss to correct errors in predicted bounding boxes caused by mutual interference between dense pods. DARFP-SD also designs a density prediction branch in the post-processing stage, which learns an adaptive soft distance IoU to assign a suitable NMS threshold to counting scenes with uneven soybean pod distributions. The model is trained on a dense soybean dataset with more than 5300 pods from three different shapes and two classes, which consists of a training set of 138 images, a validation set of 46 images and a test set of 46 images. Extensive experiments verify the performance of the proposed DARFP-SD. The final training loss is 1.281, and the model achieves an average accuracy of 90.35%, an average recall of 85.59% and an F1 score of 87.90%, outperforming the baseline method VFNet by 8.36%, 4.55% and 7.81%, respectively. We also validate the method on plants with different numbers of pods and different plant shapes. All the results show the effectiveness of DARFP-SD, which can provide new insight into the soybean pod counting task.


Introduction
Soybean is an important crop, rich in protein and fat, whose safe production contributes to economic development and social stability. As the most effective way to improve yield, cultivating high-quality soybean varieties has attracted the research interest of many breeders. In actual breeding, the number of pods per plant is one of the most important indicators for evaluating the quality and yield of soybean varieties. However, the number of pods per plant is mainly obtained through manual counting, which is time-consuming and laborious and limits the development of large-scale, high-throughput soybean breeding. A fast and efficient method for automatic pod counting is therefore urgently needed.
Thanks to the rapid development of image acquisition equipment and artificial intelligence algorithms, counting harvest organs with object detection models has proved to be a promising alternative to manual counting, and it has been applied to various field objects such as soybeans [1], wheat ears [2][3][4], rice panicles [5,6], fruits [7][8][9][10][11][12], etc. For example, Gulzar et al. [11,12] carry out a series of works on fruit classification based on deep learning. Lyu et al. [10] replace the convolution layer of YOLO V5 with an attention convolution module to extract more spatial and semantic information about green oranges, which effectively reduces the missed detections caused by confusion between oranges and the environment. Sun et al. [13] introduce AugFPN to narrow the semantic gap between features of different scales and reduce the information loss of the feature map, which improves the detection of small wheat ears. For soybean pod counting, Uzal et al. [14] train the pod detection model after manually disassembling the pods. Although this reduces the workload of manual counting to a certain extent, destructive sampling means the sample cannot retain its original structural information, which is not conducive to obtaining the phenotypic information of the whole plant. To solve this problem, Guo et al. [15] directly detect pods on the whole soybean plant and achieve a speed of 240 plants/hour, greatly improving detection efficiency. Li et al. [16] further propose the SPM-IS, which acquires a material soul type based on instance segmentation.
Existing soybean pod detection works have advanced both counting accuracy and speed, but their performance still cannot meet the demands of practical pod counting, especially in the following scenes: (1) Crowded pods. A large number of pods are often crowded into a certain area of a single soybean image because soybean pods grow in clusters. Pods with uncertain posture are occluded by the stems and surrounding pods. When the convolutional neural network extracts pod features, they are inevitably mixed with noise, which affects the final detection accuracy. (2) Uneven distribution of pods. The multi-branched structure of soybean makes the pod density change markedly between local areas. The detection model generates anchors of a fixed size and number uniformly for each spatial position when generating candidate proposals, and it also fuses predicted bounding boxes with a fixed threshold. Both aspects make it impossible to adapt detection to the pod density, so some pods are missing from the count.
To improve the soybean pod counting accuracy, in this paper, we first adopt the dense anchor-free detection algorithm VarifocalNet [17] as the baseline and qualitatively analyze its advantages for soybean pod detection. Different from previous efforts that mainly follow the Faster-RCNN or YOLO series, VarifocalNet includes an IoU-aware Classification Score (IACS) in the classification branch and a star-shaped bounding box representation in the regression branch. IACS multiplies the original classification score by the IoU between the predicted bounding box and its ground truth, and the output is used as the class label value to improve the reliability of the predicted box ranking. Compared with the classification scores used by existing algorithms such as YOLO, IACS integrates the spatial information of the bounding box into the classification score, so that it evaluates the classification confidence and the localization quality of the box at the same time. The star-shaped box representation selects eight fixed points around each sampling point on the feature map. For pods with variable shapes, this representation captures the geometry of the bounding box and the local context better than the diagonal corner coordinate representation used by other algorithms. At the same time, to alleviate the issues of crowded and uneven pods, we further propose a deformable attention recursive feature pyramid network (DARFP-SD) for soybean pod counting based on VarifocalNet. DARFP-SD first introduces deformable convolutional networks (DCN) and an attention recursive feature pyramid (ARFP), which aim to reduce noise interference during feature learning. DARFP-SD further incorporates the Repulsion Loss to correct errors in predicted boxes caused by mutual interference between dense pods. DARFP-SD finally designs a density prediction branch in the post-processing stage, which learns an adaptive soft distance IoU to assign a suitable NMS threshold to counting scenes with uneven soybean pod distributions.
In summary, our contributions are as follows:
(1) A detailed review is conducted of the most notable deep learning work on soybean pods, and the challenges that crowding and uneven distribution pose for practical pod counting are summarized.
(2) A deformable attention recursive feature pyramid network is specifically designed, which adaptively extracts fine-grained soybean features and assigns a suitable NMS threshold to improve the counting performance for crowded and unevenly distributed soybean pods.
(3) Extensive experiments are conducted on the constructed soybean pod dataset. Quantitative and qualitative results validate the effectiveness of the proposed method, which significantly outperforms baseline methods in different scenarios and achieves state-of-the-art performance compared with previous counting methods.
The rest of this paper is organized as follows: Section 1 (this section) introduces the background of the research and highlights the problem statement. The main contribution of this paper is presented in Section 2, where the principles and design of the DARFP-SD algorithm are described. Section 3 discusses the results, and Section 4 draws the conclusions of this work and proposes future work.

Image Acquisition and Annotation
The soybean plants are placed on non-reflective black suede (1.5 × 1.5 m) and positioned in the middle of the camera's field of view in their basic growth shape. We use a tripod to keep the camera about 1 m above the ground, at an angle of about 75 degrees to the horizontal. The shooting scene is shown in Figure 1a. For each soybean plant, we collect 3 images and manually select the clearest one as the representative image of the plant. According to the opening angle among the main stem, branch and petiole, the plants are grouped into three shapes (convergent, semi-open and open), as shown in Figure 1b. The final soybean dataset contains 230 images with a resolution of 3456 × 5184. The pods are yellowish brown and vary in shape, and each image contains 10 to 70 pods. We then manually label the images with the Labelme tool, marking each pod with a rectangular box and recording the coordinates of the upper left vertex ($X_{min}$, $Y_{min}$) and the lower right vertex ($X_{max}$, $Y_{max}$) of the box. For individual pods, the minimum circumscribed rectangle is marked. For pods covered by stems, the area where the stems are located is regarded as part of the pod and marked. For crowded pods, the pods visually in the upper layer are labeled as individual pods, while each pod visually in the lower layer is labeled as a whole, including the parts separated by the upper pods. We split the original images into training, validation, and test sets in a ratio of 3:1:1, and the number of images in each set is shown in Table 1.
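As a quick check of the split arithmetic, the 3:1:1 ratio applied to 230 images gives the 138/46/46 partition. The sketch below is only illustrative; the helper name `split_counts` and the rule of giving any remainder to the first part are assumptions, not part of the original pipeline.

```python
def split_counts(total, ratio):
    """Split `total` items by an integer `ratio`.

    Any remainder left by integer division is given to the first part
    (an assumption made for this sketch).
    """
    unit = total // sum(ratio)
    counts = [unit * r for r in ratio]
    counts[0] += total - sum(counts)
    return counts


# 230 images split 3:1:1 into training, validation and test sets
print(split_counts(230, (3, 1, 1)))  # [138, 46, 46]
```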

Design of DARFP-SD
The DARFP-SD algorithm mainly includes the Deformable Attention Recursive Feature Pyramid (DARFP) and the Bounding box Refinement (BR). The DARFP module first extracts features through a ResNet-50 backbone with deformable convolution kernels [18], which enlarges the effective receptive field so that the sampling points of the convolution can avoid interference from stems to a certain extent, improving the quality of the features learned for sheltered pods. To select appropriate feature maps for constructing the recursive pyramid, DARFP then quantifies the relationship between pod size and receptive field, and adds recursive feedback connections to the feature learning network. BR designs an adaptive SDIoU-NMS branch, in which the local density is predicted to help assign the NMS threshold adaptively. BR is supervised with the Repulsion Loss [19] and GIoU loss, which constrain each predicted box to stay close to its ground truth and away from the labeled boxes of other targets, improving the position accuracy of the candidate proposals. We describe each module in detail in the following subsections.

Deformable Attention Recursive Feature Pyramid
Feature extraction based on deformable convolution. A traditional convolution learns features by sliding a window. Once the size and stride of the convolution kernel are determined, the receptive field is fixed, and the kernel weights are determined during network training. Taking a 3 × 3 convolution kernel and input image $X$ as an example, the value at pixel $p_0$ of the feature map $F$ can be calculated as Equation (1):

$$F(p_0) = \sum_{p_n \in R} w(p_n) \cdot X(p_0 + p_n) \quad (1)$$

where $w(p_n)$ represents the weight of the convolution kernel at position $p_n$. $p_n$ enumerates the 8 neighborhood positions of $p_0$ and can be formulated as

$$R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$

Because the growth direction of pods on a single soybean plant is uncertain, as shown in the blue area of the local pod feature map in Figure 2, the fixed receptive field places a large number of sampling points outside the pods during feature learning. This amplifies the interference of background noise (such as stems) on pod features and limits the quality of the feature map and of the candidate regions generated from it.
To this end, we add an additional deformable convolution layer to predict the horizontal and vertical offsets for each pixel in the feature map. The whole feature extraction process of the deformable convolution for pods is shown in Equation (2):

$$F(p_0) = \sum_{p_n \in R} w(p_n) \cdot X(p_0 + p_n + \Delta p_n) \quad (2)$$

where $\Delta p_n$ is the predicted offset of position $p_n$. For each pixel, the final offset is the superposition of the offset components in the horizontal and vertical directions. $X(p_0 + p_n + \Delta p_n)$ is obtained through bilinear interpolation. As shown in the green area of the local pod feature map in Figure 2, for pods with an uncertain posture, deformable convolution can adaptively capture the shape and scale information of pods, effectively reducing noise interference.

Feature enhancement based on attention. For the feature map $F \in R^{C \times H \times W}$, before constructing the DARFP, we introduce channel attention and spatial attention based on the convolutional block attention module (CBAM) to enhance the feature quality, as in Equation (3):

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F' \quad (3)$$

where $\otimes$ denotes element-wise multiplication, and $M_c$ and $M_s$ are the channel attention and spatial attention. For the original channel-wise feature $F$, channel attention helps to capture the discriminative information of the object by learning the response relationship between channel features and the category label. For the crowded pods in our research scene, the semantic dependency between different channels guides the features to pay more attention to the pod areas rather than the complex background. The feature enhancement process based on channel attention is modeled through a max pooling MaxPool() and an average pooling AvgPool(), as in Equation (4):

$$M_c(F) = \sigma\big(W_1(W_0(\mathrm{AvgPool}(F))) + W_1(W_0(\mathrm{MaxPool}(F)))\big) \quad (4)$$

Here, $W_0$ and $W_1$ are the learned weights of a shared multilayer perceptron, and $\sigma$ is the Sigmoid activation function. For unevenly distributed pods, fully embedding their spatial position information into the features helps to improve the accuracy of detection and counting.
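The deformable sampling of Equation (2) can be illustrated at a single output position. The following is a minimal pure-Python sketch, not the authors' implementation: the function names, the zero value outside the map, and the use of a full 3 × 3 grid (centre included, as in the original DCN formulation) are assumptions, and in a real network the offsets would be predicted by an extra convolution layer rather than passed in by hand.

```python
import math


def bilinear(X, y, x):
    """Bilinearly interpolate the 2-D list X at real-valued (y, x).

    Positions outside the map contribute 0 (an assumption of this sketch).
    """
    h, w = len(X), len(X[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * X[yy][xx]
    return value


# 3 x 3 sampling grid: the centre plus its 8 neighbours
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def deform_conv_at(X, weights, offsets, p0):
    """One output value: sum over n of w(p_n) * X(p0 + p_n + dp_n)."""
    y0, x0 = p0
    total = 0.0
    for (dy, dx), w, (oy, ox) in zip(GRID, weights, offsets):
        total += w * bilinear(X, y0 + dy + oy, x0 + dx + ox)
    return total


# Toy 5 x 5 input whose value grows linearly with the column index
X = [[float(5 * r + c) for c in range(5)] for r in range(5)]
w = [1.0 / 9.0] * 9
plain = deform_conv_at(X, w, [(0.0, 0.0)] * 9, (2, 2))    # ordinary convolution
shifted = deform_conv_at(X, w, [(0.0, 0.5)] * 9, (2, 2))  # every tap moved half a pixel right
```

With all offsets set to zero the function reduces to the ordinary convolution of Equation (1); non-zero offsets move each sampling point individually, which is what lets the kernel follow a tilted pod instead of the stem next to it.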
Different from the channel attention mechanism, we further utilize the spatial dependency between features to generate the spatial attention map, which can complementarily mine the spatial location information of pods ignored by the channel attention module. We calculate the spatial attention based on the feature maps enhanced with the channel attention. Similar to the channel attention, a max pooling and an average pooling operation are applied to output $F^S_{Max} \in R^{1 \times H \times W}$ and $F^S_{Avg} \in R^{1 \times H \times W}$. Then, the two feature maps are fused through a convolution operation $f^{7 \times 7}$ with a 7 × 7 kernel, as in Equation (5):

$$M_s(F) = \sigma\big(f^{7 \times 7}([F^S_{Avg}; F^S_{Max}])\big) \quad (5)$$

To make the most of the semantic and spatial dependencies captured by $M_c(F)$ and $M_s(F)$, we add the channel attention and spatial attention to each layer of the recursive feature pyramid.
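The channel-then-spatial attention described above can be sketched on a tiny feature map in pure Python. This is an illustrative sketch of a CBAM-style block under stated assumptions: the weights `W0`, `W1` and `kernel` are hypothetical placeholders (learned in practice), the shared MLP uses a ReLU hidden layer as in the original CBAM design, and zero padding is assumed for the spatial convolution.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def channel_attention(F, W0, W1):
    """M_c(F): shared MLP over average- and max-pooled channel descriptors."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in F]
    mx = [max(map(max, ch)) for ch in F]

    def mlp(v):
        hidden = [max(0.0, h) for h in matvec(W0, v)]  # ReLU hidden layer
        return matvec(W1, hidden)

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(F, kernel):
    """M_s(F): k x k convolution over the channel-wise [avg, max] maps."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    avg = [[sum(F[c][y][x] for c in range(C)) / C for x in range(W)] for y in range(H)]
    mx = [[max(F[c][y][x] for c in range(C)) for x in range(W)] for y in range(H)]
    k = len(kernel[0])
    r = k // 2
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            s = 0.0
            for ch, plane in enumerate((avg, mx)):
                for i in range(k):
                    for j in range(k):
                        yy, xx = y + i - r, x + j - r
                        if 0 <= yy < H and 0 <= xx < W:  # zero padding
                            s += kernel[ch][i][j] * plane[yy][xx]
            out[y][x] = sigmoid(s)
    return out


def cbam(F, W0, W1, kernel):
    """Apply channel attention first, then spatial attention."""
    Mc = channel_attention(F, W0, W1)
    F1 = [[[Mc[c] * v for v in row] for row in F[c]] for c in range(len(F))]
    Ms = spatial_attention(F1, kernel)
    return [[[Ms[y][x] * F1[c][y][x] for x in range(len(Ms[0]))]
             for y in range(len(Ms))] for c in range(len(F1))]


# Toy input: 2 channels of size 2 x 2, with all-zero (untrained) weights
F = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [0.0, 1.0]]]
W0 = [[0.0, 0.0]]                                       # 2 channels -> 1 hidden unit
W1 = [[0.0], [0.0]]                                     # 1 hidden unit -> 2 channels
kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
out = cbam(F, W0, W1, kernel)
```

With all-zero weights both attention maps equal sigmoid(0) = 0.5, so the output is the input scaled by 0.25; trained weights would instead emphasise pod channels and pod locations.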
Selection of feature maps. The area of the input image corresponding to any pixel on the feature map is called the receptive field. The image information inside the receptive field directly affects the quality of the features learned by the network. The receptive field of each layer is calculated as shown in Equation (6):

$$S_{RF}(t) = (S_{RF}(t-1) - 1) \times N_s(t) + S_f(t) \quad (6)$$

$S_{RF}(t)$ is the size of the receptive field of convolution layer $t$, and $N_s(t)$ and $S_f(t)$ are the stride and kernel size of layer $t$, respectively. For soybean pods occluded by leaves or branches, to suppress the interference of the background, we would like the receptive field to be comparable to the pod size. According to Equation (6), the receptive field sizes of the C2, C3, C4 and C5 layers of ResNet50 are 35 × 35, 91 × 91, 267 × 267 and 427 × 427. For the images of single soybean plants collected in this study, the average size (length × width) of a single pod is about 100 × 53 pixels, measured manually on 50 randomly selected images. To make the sheltered pod feature learning network applicable to pods of different sizes without adding additional convolution layers, we select the outputs of the C3, C4 and C5 layers, so that the original receptive field of the shallowest feature map is close to the average pod size. Similar to DCNV2 [20], our DARFP introduces deformable convolution with a 3 × 3 kernel to conv2, conv3, conv4 and conv5 of ResNet50, so that feature extraction gains noise immunity at different scales.
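To illustrate how a receptive field is accumulated from strides and kernel sizes, the sketch below uses one common form of the recurrence, walking back from a single output pixel to the input. The function name and the layer list are assumptions; the example covers only a ResNet-style stem and does not attempt to reproduce the C2 to C5 values quoted above, which depend on how the layers of each stage are counted.

```python
def receptive_field(layers):
    """Receptive field of one output pixel.

    `layers` is a list of (kernel_size, stride) pairs ordered from input
    to output. Walking back from a single output pixel, each layer maps a
    window of size rf to a window of size (rf - 1) * stride + kernel.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# A ResNet-style stem: 7 x 7 stride-2 convolution, then 3 x 3 stride-2 max pooling
stem = [(7, 2), (3, 2)]
print(receptive_field(stem))  # 11
```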
Feature fusion based on recursive feature pyramid. The feature maps output by different convolution layers contain different information. To fully exploit the limited pod features, the classical FPN [21] fuses features of different scales along the top-down direction. However, information only propagates forward from the backbone to the pyramid structure, which means the gradient optimization information obtained while constructing the pyramid cannot be fed back to the backbone to help parameter learning. Motivated by DetectoRS [22], we add cross-layer feedback links between successive feature pyramids. The feature map output by the previous recursive pyramid first passes through a convolution operation. Then, the original feature and the output feature are stacked together as the feature layer of the next recursive pyramid. The transmission and calculation between the feature layers of the recursive feature pyramid are shown in Equation (7):

$$f_i^l = F_i^l\big(f_{i+1}^l, x_i^l\big), \qquad x_i^l = B_i^l\big(x_{i-1}^l, R_i^l(f_i^{l-1})\big) \quad (7)$$

$R_i^l$ represents the feature transformation operation with a 1 × 1 convolution kernel. For any layer $i = 1, 2, \ldots, S$, $B_i^l$ and $F_i^l$ represent the $i$-th stage of the backbone and the $i$-th top-down operation of the FPN in recursion step $l$, with outputs $x_i^l$ and $f_i^l$, following the notation of DetectoRS [22]. After introducing the recursion parameter $l$, the recursive FPN can be unrolled into a sequential network that extracts and fuses features repeatedly, which effectively improves the utilization of feature information. The feedback also lets the parameter updates optimize feature extraction. To balance feature quality and model training speed, the maximum number of recursions is set to 2.
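The feedback loop between pyramid and backbone can be sketched as a plain unrolled loop. The code below only illustrates the data flow of a DetectoRS-style recursive pyramid; the three callbacks and the scalar "features" are toy stand-ins chosen for this sketch, not the convolutional stages of the actual network.

```python
def recursive_fpn(x0, backbone, fpn, feedback, num_levels, steps=2):
    """Unroll a feature pyramid `steps` times.

    backbone(i, below, fb) -> bottom-up feature of level i
    fpn(i, above, lateral) -> top-down feature of level i
    feedback(i, f)         -> transformed pyramid feature fed back to level i
                              (a 1 x 1 convolution in the paper)
    """
    f = [None] * num_levels                    # pyramid output of the previous step
    for _ in range(steps):
        xs, below = [], x0
        for i in range(num_levels):            # bottom-up pass with feedback
            fb = feedback(i, f[i]) if f[i] is not None else 0.0
            below = backbone(i, below, fb)
            xs.append(below)
        new_f, above = [None] * num_levels, None
        for i in reversed(range(num_levels)):  # top-down pass
            above = fpn(i, above, xs[i])
            new_f[i] = above
        f = new_f
    return f


# Toy stand-ins operating on scalars (hypothetical, for illustration only)
def toy_backbone(i, below, fb):
    return below + 1.0 + fb


def toy_fpn(i, above, lateral):
    return lateral if above is None else lateral + above


def toy_feedback(i, f):
    return 0.5 * f


one_pass = recursive_fpn(0.0, toy_backbone, toy_fpn, toy_feedback, num_levels=3, steps=1)
two_pass = recursive_fpn(0.0, toy_backbone, toy_fpn, toy_feedback, num_levels=3, steps=2)
```

With `steps=1` the loop is an ordinary FPN; with `steps=2`, the setting used in the paper, each backbone level also receives the transformed pyramid output of the first pass.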

Bounding Box Refinement
Non-maximum suppression is a common post-processing for object detection, which aims to suppress redundant predicted boxes in the detection results. However, limited by the cluster growth habit of the pod, only part of the pods can be successfully detected among the crowded multiple pods. Intuitively, the correct predicted bounding box belonging to one pod may be regarded as the offset predicted bounding box of another adjacent pod, which will be suppressed as a redundant predicted bounding box by the NMS algorithm [23]. Increasing the NMS threshold can reduce the missed detection rate of pods theoretically, while it is challenging to manually set an appropriate threshold to handle the uneven pods with different densities at different locations. To this end, we design the adaptive SDIoU-NMS and Repulsion Loss to refine the bounding box.
Adaptive SDIoU-NMS. Adaptive SDIoU-NMS first introduces the DIoU [24] into the Soft-NMS algorithm, which measures the similarity and overlap between two predicted boxes better. Compared with classical Soft-NMS, the adaptive SDIoU-NMS also considers the distance $R_{DIoU}$ between the center points of the two boxes. The suppression function in SDIoU-NMS can be calculated as Equations (8)-(10).
For the $i$-th object, $S_i$ is the classification score of each predicted box. $M$ and $B_i$ are the box with the highest score and the other predicted boxes. $b$ and $b^{gt}$ represent the center points of the predicted box and the ground truth box, and $\rho$ is the Euclidean distance between these two center points. $c$ is the diagonal length of the minimum enclosing area that contains both the predicted box and the ground truth box. $T$ is the threshold indicating the maximum IoU with all ground truth boxes.
For pods with an uneven distribution, we expect a small threshold for sparse pods, to remove more redundant boxes, and a large threshold for dense pods, to improve recall. To this end, based on SDIoU-NMS, the adaptive SDIoU-NMS further designs an independent density prediction branch to estimate the pod density, so that the threshold $T$ can be adjusted dynamically according to the pod density. The density prediction branch adopts VGG16 as the backbone, and its network structure is shown in Figure 3. Note that, in order to take more context information around the objects into account, a 5 × 5 convolution kernel is used in the final convolution layer to increase the receptive field. The density at the $i$-th target is defined as Equation (11):

$$d_i = \max_{b_j \in G,\, j \neq i} \mathrm{IoU}(b_i, b_j) \quad (11)$$

$b_i$ and $b_j$ are the generated bounding box and the ground truth, and $G$ is the set of ground truth boxes. At the inference stage, the density prediction network outputs the object density at each position. Substituting the predicted density values back into Equations (8) and (11), the adaptive SDIoU-NMS finally completes the non-maximum suppression.
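A density-adaptive soft NMS based on DIoU can be sketched as follows. This is a hedged illustration rather than the paper's exact Equations (8)-(10), which are not reproduced in this text: the linear score decay is borrowed from Soft-NMS, the rule `max(base_thr, density)` is borrowed from Adaptive NMS, and the function names, `base_thr` and `score_min` are assumptions. Densities are passed in directly here, whereas the paper predicts them with a separate branch.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def diou(a, b):
    """IoU minus the normalised squared centre distance rho^2 / c^2."""
    rho2 = ((a[0] + a[2] - b[0] - b[2]) / 2) ** 2 + ((a[1] + a[3] - b[1] - b[3]) / 2) ** 2
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c2 = cw ** 2 + ch ** 2
    return iou(a, b) - (rho2 / c2 if c2 > 0 else 0.0)


def adaptive_sdiou_nms(boxes, scores, densities, base_thr=0.5, score_min=0.05):
    """Return (kept indices in selection order, decayed scores)."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        m = max(remaining, key=lambda i: scores[i])
        remaining.remove(m)
        if scores[m] < score_min:
            break
        keep.append(m)
        thr = max(base_thr, densities[m])  # denser neighbourhood -> higher threshold
        for i in remaining:
            d = diou(boxes[m], boxes[i])
            if d >= thr:
                scores[i] *= 1.0 - d       # linear soft decay instead of removal
    return keep, scores


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
_, sparse_scores = adaptive_sdiou_nms(boxes, scores, [0.0, 0.0, 0.0])
_, dense_scores = adaptive_sdiou_nms(boxes, scores, [0.85, 0.85, 0.0])
```

In the sparse setting the second box overlaps the first strongly and its score is decayed; when the predicted density raises the threshold above their DIoU, the same box keeps its score, which is the behaviour wanted for two genuinely adjacent pods.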

Loss Function
Bounding box refinement further introduces the Repulsion Loss [19] to optimize the regression of the bounding box. For predicted pod bounding boxes that are close to each other, the Repulsion Loss $L_{Rep}$ constrains each predicted box to stay away from the surrounding ground truth boxes of other objects while staying close to its own ground truth box. The overall loss of DARFP-SD is then defined as Equation (12):

$$L = L_{cls} + \alpha L_{GIoU} + \beta L_{Rep} \quad (12)$$

$\alpha$ and $\beta$ adjust the proportions of the GIoU loss $L_{GIoU}$ and the Repulsion Loss $L_{Rep}$. Here, we set both of them to 0.5. The GIoU loss reflects the overlap between the predicted box and the ground truth box while retaining all the properties of the IoU:

$$L_{GIoU} = 1 - GIoU, \qquad GIoU = IoU(A, B) - \frac{|C \setminus (A \cup B)|}{|C|}$$

$C$ represents the smallest rectangular area that encloses the two boxes $A$ and $B$. The classification loss $L_{cls}$ is based on the Varifocal Loss, which can significantly improve the quality of candidate regions and the pod recognition accuracy in crowded regions. $p$ is the predicted IoU-aware classification score. For positive samples, $q$ is the IoU between the predicted bounding box and the ground truth box; for negative samples, the value of $q$ is 0.
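The GIoU and Varifocal terms can be sketched for a single box pair and a single prediction. This is a minimal sketch under stated assumptions: the Varifocal form and its defaults `alpha=0.75`, `gamma=2.0` follow the VarifocalNet paper rather than anything stated here, the weighted-sum form of `total_loss` is assumed from the description of α and β, boxes are assumed to have positive area, and the Repulsion term is passed in as a precomputed number.

```python
import math


def giou(a, b):
    """Generalised IoU of two positive-area boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest rectangle C enclosing both boxes
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c_area - union) / c_area


def giou_loss(a, b):
    return 1.0 - giou(a, b)


def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """Varifocal loss for one prediction p with target score q (0 for negatives)."""
    p = min(max(p, eps), 1.0 - eps)
    if q > 0:
        # Positive sample: binary cross-entropy weighted by the target IoU q
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative sample: down-weighted by alpha * p^gamma
    return -alpha * p ** gamma * math.log(1.0 - p)


def total_loss(l_cls, l_giou, l_rep, alpha=0.5, beta=0.5):
    """Assumed weighted sum: L = L_cls + alpha * L_GIoU + beta * L_Rep."""
    return l_cls + alpha * l_giou + beta * l_rep
```

For identical boxes the GIoU loss is 0, and for disjoint boxes it exceeds 1, so the loss still provides a training signal when the predicted box does not overlap the target at all.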

Counting Pods Based on DARFP-SD
Based on the proposed DARFP-SD, we further train the pod counting model, whose framework is illustrated in Figure 4. To improve the generalization ability of the model, an adaptive training sample selection strategy is adopted, and topk = 9 is set to keep the balance of positive and negative samples. The parameters of the backbone are initialized from a model pre-trained on the ImageNet dataset. The model is trained for 200 epochs with a batch size of 4 and an initial learning rate of 0.00125. The learning rate is adjusted with cosine annealing and warmup. The density estimation module of the adaptive SDIoU-NMS is initialized with random network parameters. The other training strategies are consistent with those used to train pod object detection networks.
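The learning-rate schedule can be sketched per epoch as a linear warmup followed by cosine annealing. The base rate of 0.00125 and the 200 epochs come from the text; the warmup length of 5 epochs, the linear warmup shape, the final rate `min_lr` and the function name are assumptions, since the text does not specify them.

```python
import math


def learning_rate(epoch, base_lr=0.00125, total_epochs=200, warmup_epochs=5, min_lr=0.0):
    """Linear warmup followed by cosine annealing (epoch is 0-based).

    warmup_epochs and min_lr are hypothetical values chosen for this sketch.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(200)]
```

The rate rises to 0.00125 by the end of warmup and then decays smoothly toward `min_lr` over the remaining epochs.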

Evaluation Index
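The accuracy, recall and $F_1$ are selected as the evaluation indicators to measure model performance, and are calculated as:

$$\mathrm{Accuracy} = \frac{N_{cor}}{N_{cor} + N_{err}} \times 100\%, \qquad \mathrm{Recall} = \frac{N_{cor}}{N_{real}} \times 100\%, \qquad F_1 = \frac{2 \times \mathrm{Accuracy} \times \mathrm{Recall}}{\mathrm{Accuracy} + \mathrm{Recall}}$$

$N_{cor}$ and $N_{err}$ are the numbers of pods detected by the model correctly and wrongly, respectively. $N_{real}$ is the actual number of pods contained in the test image, and $N_{dect}$ is the number of detected pods.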
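The relation between the three indicators can be checked numerically. The sketch below assumes that accuracy is the fraction of detections that are correct, N_cor / (N_cor + N_err), that recall is N_cor / N_real, and that F1 is their harmonic mean; the function names and the toy counts are illustrative only.

```python
def counting_metrics(n_cor, n_err, n_real):
    """Accuracy (fraction of detections that are correct), recall and F1."""
    accuracy = n_cor / (n_cor + n_err)
    recall = n_cor / n_real
    f1 = 2 * accuracy * recall / (accuracy + recall)
    return accuracy, recall, f1


def f1_from(accuracy, recall):
    """Harmonic mean of accuracy and recall."""
    return 2 * accuracy * recall / (accuracy + recall)


# Harmonic mean of the reported average accuracy and recall (in percent)
reported_f1 = f1_from(90.35, 85.59)

# Toy counts: 90 correct and 10 wrong detections on a plant with 100 real pods
toy = counting_metrics(90, 10, 100)
```

The harmonic mean of the reported 90.35% accuracy and 85.59% recall comes out at about 87.9, which agrees with the reported F1 of 87.90% up to rounding.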

Comparison with SOTA Methods
To verify the effectiveness of the proposed DARFP-SD for soybean pod detection, we first compare the results of DARFP-SD with other representative detection algorithms, whose accuracy, recall and F1 are shown in Table 2. For pods with different postures, DARFP-SD achieves the best performance, with an average accuracy of 90.35%, recall of 85.59% and F1 of 87.90%, which are 8.36%, 4.55% and 7.81% higher than the baseline method VFNet, respectively. These results first validate the effectiveness of the deformable attention recursive pyramid. By capturing multi-scale pod information, the deformable recursive pyramid significantly enhances the model's ability to represent pods with different poses, and further improves feature quality and classification accuracy. As shown in Figure 5d,e, for individual pods or dense pods missed by VFNet because of stalk interference, DARFP-SD improves the quality of candidate bounding boxes after introducing the repulsion loss. We also conduct a set of experiments to study the counting performance with various backbones in Table 2, such as VGG16, AlexNet, DarkNet53 and MobilenetV3. Compared with the well-designed ResNet, the counting results with these backbones drop slightly, to varying degrees. Compared with Faster R-CNN, DARFP-SD achieves a similar detection accuracy while improving the recall by more than 10%. As shown in Figure 5a, although Faster R-CNN performs better in areas where pods are sparsely distributed, it misses more of the smaller or occluded pods, which seriously limits its recall. Compared with RetinaNet, which also incorporates a feature pyramid, DARFP-SD improves the average accuracy, recall and F1 by 6.42%, 11.19% and 8.99%, respectively. With the deformable recursive attention pyramid and box optimization working together, DARFP-SD can cope with changing pod posture and uneven pod distribution, which meets the practical requirements of per-plant soybean pod counting.

Effectiveness Analysis of ARFP
To quantify how deformable convolution and the attention recursive feature pyramid (ARFP) improve feature quality and counting accuracy, we trained the pod counting model with the feature extraction module based on deformable convolution and the feature fusion module based on the attention recursive feature pyramid, and the experimental results are shown in Table 3. After introducing both the deformable convolution and the attention recursive feature pyramid, the average accuracy, recall and F1 increased to 87.57%, 85.06% and 86.30%, a far larger improvement than using only the deformable convolution or only the recursive pyramid. The results verify the effectiveness of the deformable attention recursive feature pyramid in this study: the features of individual soybean pods with indeterminate posture are represented better with ARFP.

Effectiveness of deformable convolution for feature extraction. To verify the improvement brought by deformable convolution, we replace the conv2, conv3, conv4 and conv5 layers of ResNet50 with deformable convolution. Compared with the baseline method using the traditional convolution module, introducing the deformable convolution module increases the average accuracy, recall and F1 by 0.80%, 2.70% and 1.94%, respectively. As shown in Figure 6a,b, the receptive field can avoid the stalk area. By adaptively capturing the shape and scale information of pods, the deformable convolution effectively suppresses noise interference and improves both the feature quality of pods with uncertain poses and the recall of positive samples.
Effectiveness of attention module for feature enhancement. After combining CBAM with the recursive feature pyramid, the average accuracy, recall and F1 increase by 1.42%, 0.8% and 1.1%. As shown in Figure 6c,d, adding the CBAM effectively improves the detection of pods blocked by stalks. In addition, the accuracy for crowded small-sized pods is also significantly improved. We suppose the improvement comes from the interaction between multi-scale features captured by CBAM, which helps to dynamically assign the optimal weights for the feature fusion of different layers.

Effectiveness of attention recursive feature pyramid for feature fusion. To verify the effectiveness of the recursive feature pyramid, we construct a traditional feature pyramid and a recursive feature pyramid (with the number of recursions set to 2) based on the features output by the C3, C4, and C5 layers of ResNet50. It can be seen from Table 3 that the recursive feature pyramid improves the average accuracy, recall and F1 by 4.63%, 2.34% and 3.48%. The visualization results in Figure 7 demonstrate that the recursive feature pyramid benefits small-sized pods. We also visualize the feature maps obtained by different backbone networks, where the color indicates the weight of the feature in the region. After adding the RFP structure, the feedback information acting on the backbone network improves the utilization of feature information. Specifically, areas of pods and stems are evenly covered without differences in Figure 7a, while more areas are activated in Figure 7c.

Effectiveness Analysis of BR
Effectiveness of repulsion loss. To verify the effectiveness of the bounding box refinement, we conduct a set of experiments based on the repulsion loss and the adaptive SDIoU-NMS, whose results are reported in Table 4. Compared with VFNet, the accuracy increases by only 1.33% when the DIoU loss is introduced to optimize the bounding box. After using the repulsion loss, the average accuracy for pods increases significantly, to 86.69%. A similar performance increase also occurs with other baselines, such as Faster R-CNN or RetinaNet. As shown in Figure 8, the repulsion loss guides the model to effectively eliminate the interference of other similar candidate regions during box regression, although its benefit may reach a limit for groups of more than four pods.

Effectiveness of adaptive SDIoU-NMS. From the results in Table 5, the average recall of the model increased by 0.3% and 1.57% after the introduction of Soft-NMS and SDIoU-NMS, respectively. With our adaptive SDIoU-NMS, the best F1 of 83.57% is obtained based on VFNet. The results first verify the effectiveness of our SDIoU-NMS: for pods with large differences in distribution, setting the threshold according to DIoU evaluates the quality of the predicted boxes more finely. We also visualize the detection results of different NMS strategies in Figure 9. For local scenes with clustered growth, relatively scattered pods and many stalks, the side-by-side comparison demonstrates that SDIoU-NMS can not only recover the detection boxes that were wrongly removed by NMS, but also correct the errors of Soft-NMS. When the adaptive SDIoU-NMS strategy is used for non-maximum suppression, the average accuracy, recall and F1 are 82.83%, 84.33% and 83.57%, respectively, which are further improved by 0.39%, 1.23% and 0.80% compared with SDIoU-NMS. This shows that setting the threshold from the adaptively learned density can improve the recall.
As can be observed in Figure 10, the adaptive SDIoU-NMS produces fewer duplicate detections, maintaining a reasonable evaluation and screening of pods in dense areas. Thanks to the dynamic adjustment of the threshold, no detections are missed in sparser areas because of a higher threshold, as happens with SDIoU-NMS (yellow box). In addition, the accuracy remains stable for different plant shapes such as semi-open and open plants.

Discussion
For the soybean pod counting task, it is a common phenomenon to deal with varieties of soybeans. The soybeans with different plant shapes and numbers of pods result in a different complexity of detection scenarios. To analyze the robustness of our DARFP-SD, we discuss the deviations in the counting accuracy for the different number of pods and different plant shapes.
Robustness for different number of pods. We divided the test images according to the number of pods per plant with a stride of 10. We counted the detection results of the model in each range and compared them with other algorithms. The results are shown in Figures 11 and 12. For sparse scenes with fewer than 30 pods per plant, DARFP-SD is comparable to the baseline VFNet and far exceeds the results of the Faster-RCNN and RetinaNet methods. For scenarios where the number of pods per plant is 30-60, the average accuracy and recall of DARFP-SD are 90.35% and 85.59%, which are 8.36% and 4.55% higher than VFNet, respectively. For dense or overlapping scenes with more than 60 pods, the proposed DARFP-SD has an average accuracy of 90.33%, which is similar to sub-dense scenes with 50-60 pods, showing better stability. The average recall and F1 of DARFP-SD are significantly improved compared with other counting methods. These results demonstrate that DARFP-SD can more effectively handle the counting task for single soybean plants with variable pod density.

Robustness for different shapes. To better meet the high-throughput pod counting requirements of different soybean varieties, we further discuss the difference in the counting accuracy of DARFP-SD for soybeans of different plant shapes. We divided the test images into convergent, semi-open and open, and the counting performance is shown in Table 6 and Figure 13. Taking the convergent soybean plant as an example, the average accuracy, recall and F1 of DARFP-SD are 88.87%, 86.37% and 87.60%, and all three evaluation indicators are improved by more than 5% compared with the baseline method VFNet. The improvement is also evident for the semi-open and open soybean plants.
All the results demonstrate that DARFP-SD can better handle soybean counting scenarios with different plant shapes, indicating that it can be applied to single-plant pod counting regardless of variation in plant type.
Combination with more fine-grained pod phenomics. Although pod counting is important for both breeding and cultivation tasks, combining it with other fine-grained pod phenomics measurements is even more promising. The method proposed in this paper can accurately identify and locate dense small-sized pods, which provides instance-level research objects for the further detailed analysis of pod length, thickness, shape, color, maturity, and disease conditions. In addition, the pod counting task can be extended to predicting the number of seeds per pod [25] or even the amount of pod fluff, which is of great significance for the breeding of high-yield and disease-resistant soybean varieties. However, we also note the difficulty of constructing such multi-task models, especially in terms of imaging quality and algorithm performance. For example, for dense small-sized pods, a pixel-level deviation in the predicted foreground area leads to large fluctuations in the estimated pod length. Measuring pod thickness requires additional spatial information from 3D point clouds, RGB-D images or multi-view RGB images. From the perspective of algorithm design, a conventional solution is to directly construct a regression model from the measured data of specific traits, which is cost-effective for pod length and thickness because they are easy to measure manually. Meanwhile, the design of a multi-phenomics algorithm can draw on ensemble learning, contrastive learning, weakly supervised learning and multi-modal learning. To improve the accuracy and generalization of the model, it is also worth trying to embed agricultural expert knowledge into the learning of vision tasks. Moreover, in combination with recent related research [26], mining the relationship between phenotypic traits and gene sequences can also be considered.

Conclusions
Counting soybean pods efficiently and accurately is a challenging task, especially for crowded small-sized pods that are unevenly distributed. In this paper, we propose a novel method termed DARFP-SD to count the pods of a whole soybean plant. The main contribution of this work is the design of a deformable attention recursive feature pyramid network with an additional bounding box refinement module. Based on the experimental design and the analysis of the results, the conclusions can be summarized as follows: (1) The proposed DARFP-SD significantly improves the counting accuracy for scenes containing crowded small-sized pods in a single image, achieving an average accuracy of 90.35%, a recall of 85.59% and an F1 of 87.90%. (2) The attention recursive feature pyramid constructed in DARFP-SD benefits the feature quality, while the bounding box refinement module alleviates the missed detection issue for dense pods. With the attention recursive feature pyramid and the bounding box refinement working together, the counting accuracy of DARFP-SD remains more stable as the number of soybean pods increases. (3) DARFP-SD shows strong robustness across scenarios with different pod numbers per plant, different plant shapes and different density levels, and can therefore be applied in high-throughput soybean breeding. We believe the proposed DARFP-SD can offer new insights into the automatic counting of crop organs and relieve the manual workload of measuring the number of pods per plant during soybean breeding. In follow-up work, we will build a counting model that integrates more fine-grained phenotypic traits and mine the potential genetic relationship between these traits and gene sequences.