Development and evaluation of automated localisation and reconstruction of all fruits on tomato plants in a greenhouse based on multi-view perception and 3D multi-object tracking

The ability to accurately represent and localise relevant objects is essential for robots to carry out tasks effectively. Traditional approaches, where robots simply capture an image, process that image to take an action, and then forget the information, have been shown to struggle in the presence of occlusions. Methods using multi-view perception, which have the potential to address some of these problems, require a world model that guides the collection, integration and extraction of information from multiple viewpoints. Furthermore, constructing a generic representation that can be applied in various environments and tasks is a difficult challenge. In this paper, a novel approach for building generic representations in occluded agro-food environments using multi-view perception and 3D multi-object tracking is introduced. The method is based on a detection algorithm that generates partial point clouds for each detected object, followed by a 3D multi-object tracking algorithm that updates the representation over time. The accuracy of the representation was evaluated in a real-world environment, where tomatoes on tomato plants were successfully represented and localised despite high levels of occlusion, with the total count of tomatoes estimated with a maximum error of 5.08% and the tomatoes tracked with an accuracy up to 71.47%. Novel tracking metrics were introduced, and their use was shown to provide valuable insight into the errors made when localising and representing the fruits. This approach presents a novel solution for building representations in occluded agro-food environments, demonstrating potential to enable robots to perform tasks effectively in these challenging environments.


Introduction
An increasing food demand combined with existing labor shortages has become one of the most important challenges in the agro-food industry (Ince Yenilmez, 2015). Robotic systems are recognized as one of the main technologies that will play a key role in solving these issues (Bogue, 2016). Consequently, agro-food robotics has continued to grow as a field for more than three decades. However, most agro-food robots are still far from being an economically feasible alternative due to the challenging nature of agro-food environments (Roldán et al., 2017; Kootstra et al., 2020). These challenges include the presence of heavy occlusions (Roldán et al., 2017) and the wide variation in environment conditions, cultivation systems, and object and plant positions, sizes, shapes and colors (Bac et al., 2014; Zhao et al., 2016). These challenges form a limiting factor in the performance of most agro-food robotic systems (Montoya-Cavero et al., 2022).
For many tasks, robots need to accurately locate and represent all the objects in the environment that are relevant for the given task. This representation is often called a world model (Crowley, 1985; Elfring et al., 2013). However, localizing the relevant objects is a challenging task in environments with high levels of occlusion. In the agro-food environment, it is known that occlusions have a great impact on the performance of robots and their perception systems (Arad et al., 2020). Furthermore, the objects of interest can differ between tasks and environments. For instance, depending on whether the task is to monitor the plant status or to harvest ripe fruits, different parts of the plant are of importance for the robot to complete the task. Moreover, different environments can have different objects of interest, an example being different fruits for different crops. Consequently, the agro-food industry is in need of systems that can adapt to different conditions and environments; however, it is very common to see robotic platforms that are developed to solve one problem or task under a unique set of environment conditions.
Most current agro-food robotics research aims to improve the perception capabilities of the systems, which increases the robots' understanding of a scene. In recent years, most of the research has focused on increasing detection and localization performance, especially through deep learning (Blok et al., 2022; Ruigrok et al., 2020). Although these methods have shown to be able to deal with variation to some extent, the challenge of occlusion is not adequately addressed (Montoya-Cavero et al., 2022; Kootstra et al., 2021). Barth et al. (2016) and Lehnert et al. (2019) showed that multi-view perception can be successfully used to reduce occlusions. Barth et al. (2016) developed an eye-in-hand architecture which was able to create a 3D geometrical reconstruction in the form of a point cloud using visual simultaneous localization and mapping (SLAM). However, SLAM algorithms are limited in the description of the objects that they can generate, since they only use geometrical features. This limits the possible interactions of the robot with those objects (Nicholson et al., 2019). Lehnert et al. (2019) developed a visual servoing method to find the next-best-view and remove occlusions in agro-food environments. They showed how their method, when implemented on a robot with 6 degrees of freedom (DoF), can potentially reduce occlusions generated by plants. However, in their experiments they used only a single object, which considerably simplifies the representation needs.
Several authors have shown how a world model based on objects with attributes that are updated over time can be used to deal with the challenges of complex and occluded environments (Elfring et al., 2013; Persson et al., 2020; Wong et al., 2015a). These approaches have in common the use of multi-object tracking (MOT) to link incoming noisy and uncertain measurements from the sensors of the robot with the representation of the objects that generated those measurements. Unlike SLAM-based representations, MOT-based representations are not only geometrical, as they are based on objects which can have different semantic properties and attributes. Furthermore, MOT has the potential to be useful when dealing with occlusions, since it allows for merging sensor information from different viewpoints while the robot moves, even in complex scenes where many objects are present. This is important, as new viewpoints can give new information about areas that previously appeared occluded to the robot. This means that MOT can couple well with active perception algorithms, such as next-best-view (NBV) methods (Lehnert et al., 2019; Wu et al., 2019; Burusa et al., 2022), to offer efficient mechanisms to explore and represent highly occluded agro-food environments.
MOT methods have recently been used in agro-food robotic systems (Halstead et al., 2018; Kirk et al., 2021; Smitt et al., 2021; Halstead et al., 2021). Nevertheless, their objectives were to perform specific tasks like plant phenotyping and monitoring; they did not aim for a generic representation to improve the capabilities of an autonomous robotic system in tasks like harvesting and crop maintenance. Halstead et al. (2018) developed a deep learning-based detection system to detect and assess the ripeness of peppers, and used it to feed an image-based Intersection over Union (IoU) tracking system. They used the IoU between the bounding boxes of tracks and detections to generate a cost matrix, based on which the data association was performed using the Hungarian algorithm. Their goal was fruit counting and ripeness estimation. The work of Halstead et al. (2018) was extended by Smitt et al. (2021) to autonomously count fruits over a greenhouse plant row. They made use of the 3D data obtained from the camera mounted on the robotic system to re-project the bounding boxes of tracked objects onto the following frames; the tracking was then performed by the previously mentioned IoU-based system. Halstead et al. (2021) further extended the work of Smitt et al. (2021) to different cropping systems, like arable farmland, and evaluated the performance on sugar beet plant counting. Kirk et al. (2021) also developed an IoU-based tracking system assisted by deep learning, inspired by the work done by Wojke et al. (2017). They used their method to count strawberries in different ripeness stages.
The above-mentioned studies of Halstead et al. (2018), Kirk et al. (2021), Smitt et al. (2021) and Halstead et al. (2021) have in common the use of an autonomous system with only one DoF that moves and collects data in a line parallel to the plant row or soil. This can be enough for monitoring applications in cropping systems where the objects of interest grow relatively widely spaced, as is the case for strawberries or already pruned bell peppers. However, it might not be optimal in cropping systems where fruits or plants grow very close to each other, like tomatoes in a truss. Furthermore, the 1-DoF motion provides only limited capabilities to deal with occlusions, as the feasible motion can be far from the next-best-view. In addition, all the aforementioned research used tracking algorithms to count fruits or plants, yet evaluated the performance through a total count error without studying the actual tracking performance of the algorithms. We consider this a key aspect of the evaluation of a representation algorithm: even though the total count of objects might seem correct, there might be many unseen errors, like tracks switching between objects or false positives and negatives that cancel each other out, which cannot be assessed using only the count error. This means that good counting of objects does not directly correspond with perfect tracking and, therefore, with a good representation of the objects in the environment. If left unevaluated, the representation algorithm might perform worse on new data or in new environments. These errors might not be an issue for tasks like yield prediction; however, they will be an important issue for more advanced tasks that require a more accurate representation, such as harvesting or monitoring of crop development.
In this paper, we present a generic method to build a world representation for robots that can be used for different tasks and in highly occluded environments. Our contributions are as follows:
• A method to build 3D representations based on individual objects. The representation is maintained and updated from multiple viewpoints over several time steps using a 3D MOT algorithm based on a Kalman filter and the Hungarian algorithm.
• A real-world test and evaluation of the proposed method in a tomato greenhouse using a 6DoF eye-in-hand robotic arm.
• A novel analysis of the performance of our method using state-of-the-art MOT metrics, which allows a better understanding of the different types of errors that affect MOT algorithms in agro-food environments.
We present the data collection and labeling process, tracking algorithm and evaluation metrics in Section 2. The results of our experiments are shown in Section 3 and discussed in Section 4. Section 5 contains the conclusions of our research and recommendations for future work.

Robotic set-up
We used an ABB IRB1200 robotic arm with an Intel RealSense L515 LiDAR camera attached to the end-effector, as shown in Figure 1. Hand-eye calibration was performed to get a correct estimation of the transformation between the camera frame and the robot frame. The whole system was controlled through the Robot Operating System (ROS) (Quigley et al., 2009). The robot arm was mounted on a cart designed to move on the rails between the rows of a Dutch tomato greenhouse, allowing the collection of data of multiple plants.

Data collection and annotation
Two datasets were used for two different purposes. First, an RGB image dataset was used to train and evaluate a deep learning detection algorithm. The tomatoes in the color images were labeled with a pixel-level mask and a bounding box. This dataset consisted of 1180 images: the 123 images collected by Afonso et al. (2020), and 1057 new images obtained for this research from greenhouses at Wageningen University & Research from the varieties Merlice and Campari. It therefore contained images with tomatoes from different varieties, greenhouses and light conditions.

Figure 1: The robot standing in front of a tomato plant in a greenhouse. The robot was attached to a platform that was able to move over the rails between plant rows. An Intel RealSense L515 camera was attached to the end-effector of the ABB IRB1200 robot arm.

A second dataset was collected using the robot in order to evaluate our world model algorithm. The data was collected in a tomato greenhouse of Wageningen University & Research. The dataset consisted of 700 images from 100 viewpoints of seven tomato plants, recorded following a semi-cylindrical path as shown in Figure 2. The path was divided into ten different height levels, and each level corresponded to a semicircular path with ten viewpoints distributed evenly over the path. The stem was considered the center of the semi-cylinder and was assumed to be approximately 60 cm in front of the robot origin of coordinates. The radius of the semi-cylinder was 30 cm. Seven different plants, of the variety Santiana (not present in the detection training dataset), were recorded. For each viewpoint, the color image, structured point cloud, and camera pose with respect to the robot base were recorded. Figure 3 shows an example of two viewpoints of the same plant. The tomatoes in every frame's color image were labeled with a bounding box and tracking ID, as shown in Figure 4. This type of labeling allowed us to know which tomatoes are present and how many tomatoes in total have been seen in each frame.

Figure 2: Example of the semi-cylinder path used to collect viewpoints for the dataset used to evaluate the accuracy of the representation. The semi-cylinder was built using ten different heights, each of which contained ten evenly distributed viewpoints, represented by the gray dots. The path started in the lower left area of the tomato plant, identified by the orange dot. The light blue dot corresponds to the end of the sequence. At every viewpoint, the camera was pointing towards the center of the semicircle of that height level, which is represented by the blue dashed line.

Object detection

Tomatoes were detected in every color image using a trained Mask R-CNN network, which outputs a bounding box, a pixel-level mask $m^j_t$ and a confidence score for every detection $d^j_t$. Detections whose confidence was lower than the threshold were rejected. To test the influence of this parameter, we used two commonly used confidence thresholds: 0.5 and 0.7.

Figure 5: Object detection algorithm. Tomatoes are detected in a color image using a trained Mask R-CNN algorithm. Then the detection mask is used to filter the corresponding points of every tomato in the structured point cloud. A sphere fitting algorithm is used to estimate the 3D center of every tomato from its partial point cloud. Next, the center of every tomato is transformed using the known camera pose provided by the robot arm. Last, tomatoes whose center falls outside of the space occupied by the plant of interest are filtered out.
To transform the 2D detections into 3D detections, we filtered the 3D points of each detected object from the whole point cloud using the mask $m^j_t$. However, the point clouds retrieved from the camera were noisy and at times contained no valid points at all for a detection. These cases were removed from the detection list because they could not be used in the data association step. The resulting partial point cloud was passed to a sphere fitting algorithm to estimate the center of every tomato $\mu^j_{cam,t}$ with respect to the camera coordinate system. This sphere fitting algorithm was based on the least-squares approach. Due to the noise existing in the point cloud and the small number of points that some detections had, we filtered out tomatoes whose center $\mu^j_{cam,t}$ or whose radius $r^j_t$ fell outside of reasonable limits. A center was considered out of reasonable limits if it was predicted between the camera and the closest point of the tomato; this means that the sphere was predicted in an inverted way, i.e. as if the robot was looking at the back side of the tomato. Additionally, we required the radius to satisfy $\rho_{min} \leq r^j_t \leq \rho_{max}$, where $\rho_{min} = 1\,$cm and $\rho_{max} = 5\,$cm. If a radius fell outside of these limits, it was also considered the result of a faulty sphere fitting. Unlike the tomatoes with no 3D points at all, these two cases with wrong sphere fitting were not discarded; their center $\mu^j_{cam,t}$ was estimated as the average of all the 3D points assigned to the detection. The covariance $\Sigma^j_{cam,t}$ was empirically estimated from the calibrated camera specifications.
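As an illustration, the least-squares fit and the plausibility checks described above can be sketched in Python with NumPy as follows. The function names, the camera-at-origin convention and the centroid fallback are our own; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def fit_sphere(points: np.ndarray):
    """Least-squares sphere fit to an (N, 3) point array.

    ||p - c||^2 = r^2 is linear in (c, k) with k = r^2 - ||c||^2:
        2 p . c + k = ||p||^2
    """
    A = np.hstack([2.0 * points, np.ones((points.shape[0], 1))])
    b = np.sum(points ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    centre = sol[:3]
    radius = float(np.sqrt(sol[3] + centre @ centre))
    return centre, radius

RHO_MIN, RHO_MAX = 0.01, 0.05  # radius limits in metres (1 cm and 5 cm)

def tomato_centre(points: np.ndarray) -> np.ndarray:
    """Sphere centre if the fit is plausible, else the cloud centroid."""
    centre, radius = fit_sphere(points)
    # An "inverted" fit places the centre between the camera (origin of
    # the camera frame) and the closest point of the tomato.
    inverted = np.linalg.norm(centre) < np.linalg.norm(points, axis=1).min()
    if inverted or not (RHO_MIN <= radius <= RHO_MAX):
        return points.mean(axis=0)
    return centre
```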
Next, the known pose of the camera with respect to the robot base was used to calculate the transformation matrix $T^{rob}_{cam}$, which was used to transform the tomato position $p^j_{cam,t}$ to the robot base coordinate frame $p^j_{rob,t}$. Mask R-CNN detected tomatoes in the whole image, including tomatoes from other rows or plants. To make sure we only selected the tomatoes of the plant of interest, we filtered them based on their transformed position $p^j_{rob,t}$ using thresholds measured at data recording time. Based on the spacing between the plants in the row, we restricted each axis of $\mu^j_{rob,t}$ to fixed limits, where $\mu^j_{rob,t}(x)$ corresponds to the axis parallel to the plant row, $\mu^j_{rob,t}(y)$ to the axis perpendicular to the plant row, and $\mu^j_{rob,t}(z)$ to the height with respect to the ground. Values are presented in meters. Since further steps are only performed in the robot coordinate system, we abbreviate $p^j_{rob,t}$ as $p^j_t$.
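A sketch of the frame transformation and workspace filtering, assuming a 4x4 homogeneous transformation matrix `T_rob_cam` obtained from the hand-eye calibration; the axis limit parameters stand in for the thresholds measured at recording time, which are not reproduced here.

```python
import numpy as np

def to_robot_frame(mu_cam: np.ndarray, T_rob_cam: np.ndarray) -> np.ndarray:
    """Transform a 3D tomato centre from camera to robot base coordinates."""
    return (T_rob_cam @ np.append(mu_cam, 1.0))[:3]

def in_workspace(mu_rob, x_lim, y_lim, z_lim) -> bool:
    """Keep only detections inside the volume of the plant of interest.
    x: parallel to the plant row, y: perpendicular, z: height (metres)."""
    return (x_lim[0] <= mu_rob[0] <= x_lim[1] and
            y_lim[0] <= mu_rob[1] <= y_lim[1] and
            z_lim[0] <= mu_rob[2] <= z_lim[1])
```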

World model
The representation, or world model, of the robot's environment consisted of a set of $N$ object representations $o^i_t$, each of which corresponded to a specific track with position $p^i_t$, class label $c^i_t$, and bounding box $b^i_t$ at time $t$. From a tracking perspective, objects in the world model are also referred to as tracks. The position was represented as a multivariate Gaussian distribution, where the mean, $\mu^i_t \in \mathbb{R}^3$, represented the most likely position of an object, and the covariance, $\Sigma^i_t \in \mathbb{R}^{3 \times 3}$, the uncertainty on the position. The object position was represented in the world model coordinate system, which in our experiments corresponds to the robot coordinate system. The bounding box $b^i_t$ of an object was represented as the 2D bounding box of the last detection $d^j_t$ associated with that object. This was needed for some evaluation metrics explained in Section 2.5. Figure 6 shows how the world model was maintained and updated over time. At every frame, detections arrived from the object detection algorithm. They were passed on to the data association algorithm, which associated existing tracks with the detections. These associations were then used to update the objects (or tracks) in the world model. A prediction was made to project the tracks into the next time step (frame), which was then used in the next cycle by the data association algorithm together with the detections from that time step.
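For illustration, an object (track) in such a world model can be represented by a structure like the one below; the field names are ours, chosen to match the attributes described above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Track:
    track_id: int
    mean: np.ndarray             # mu: most likely 3D position (robot frame)
    cov: np.ndarray              # Sigma: 3x3 positional uncertainty
    class_label: str = "tomato"  # object class c
    bbox: tuple | None = None    # 2D bbox of the last associated detection
    hits: int = 0                # consecutive associations, compared to n_init
    tentative: bool = True       # confirmed once seen in enough frames

# The world model is simply the set of tracks; objects are never removed,
# following the static-world assumption described below.
world_model: list[Track] = []
```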

Data association
For every frame, detected objects $D_t$ were assigned to the existing tracks using their predicted state from the previous time step, $O_{t|t-1}$. This was done using the Hungarian algorithm. The method used a cost matrix, $C(i, j)$, where the cell $i, j$ contained the cost of assigning the detection $j$ to the track $i$, and outputted the combinations of $i, j$ which minimized the total cost. In order to take uncertainty into account, these costs were calculated as the squared Mahalanobis distance
$$c(i, j) = \left(\mu^j_t - \mu^i_{t|t-1}\right)^{\top} \left(\Sigma^i_{t|t-1}\right)^{-1} \left(\mu^j_t - \mu^i_{t|t-1}\right),$$
where $\mu^i_{t|t-1}$ and $\Sigma^i_{t|t-1}$ defined the predicted Gaussian distribution that represents the position of the track $o^i$, and $\mu^j_t$ corresponded to the mean position of the detection $d^j_t$. We discarded unlikely associations by setting a threshold on this distance metric. Since the squared Mahalanobis distance follows a chi-square ($\chi^2$) distribution, this threshold was set to 7.82, the value which corresponds to the 0.95 quantile of the chi-square distribution with 3 degrees of freedom (Wojke et al., 2017). Associations whose $c(i, j)$ was larger than this threshold were discarded.
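A sketch of this gated association step, using SciPy's implementation of the Hungarian algorithm; the track and detection containers are our own assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

GATE = 7.82  # 0.95 quantile of the chi-square distribution with 3 DoF

def associate(tracks, detections):
    """tracks: list of (mu, Sigma) predicted states; detections: list of mu.
    Returns matched (track, detection) index pairs and the unmatched rest."""
    C = np.zeros((len(tracks), len(detections)))
    for i, (mu_i, cov_i) in enumerate(tracks):
        inv_cov = np.linalg.inv(cov_i)
        for j, mu_j in enumerate(detections):
            d = mu_j - mu_i
            C[i, j] = d @ inv_cov @ d  # squared Mahalanobis distance
    if C.size == 0:
        return [], list(range(len(tracks))), list(range(len(detections)))
    rows, cols = linear_sum_assignment(C)
    # Gate: discard assignments whose cost exceeds the chi-square threshold.
    matches = [(i, j) for i, j in zip(rows, cols) if C[i, j] <= GATE]
    matched_t = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```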

World model update
Once a track $o^i_t$ was associated with a detection $d^j_t$, the attributes of the detection were used to update the object attributes. Object class $c^i_t$ and bounding box $b^i_t$ were updated with the detection class $c^j_t$ and bounding box $b^j_t$ respectively. The updated object position $p^i_t$ was calculated in a Kalman filter update step using the detection position $p^j_t$ and the predicted position of the object from the previous time step $p^i_{t|t-1}$. In this process, first, the innovation was calculated as
$$y^i_t = \mu^j_t - \mu^i_{t|t-1}.$$
Then, the innovation covariance was calculated as
$$S^i_t = \Sigma^i_{t|t-1} + \Sigma^j_t,$$
which was used to compute the Kalman gain
$$K^i_t = \Sigma^i_{t|t-1} \left(S^i_t\right)^{-1}.$$
The Kalman gain was then used to calculate the updated position $p^i_t$:
$$\mu^i_t = \mu^i_{t|t-1} + K^i_t y^i_t, \qquad \Sigma^i_t = \left(I - K^i_t\right)\Sigma^i_{t|t-1}.$$
New objects were initialized in the world model when a detection could not be associated with any existing object. In order to prevent initializing objects from false positive detections, new objects in the world model could be considered tentative during a certain number of frames, $n_{init}$. During our experiments, we assessed the effect of different values for $n_{init}$. New objects $o^i$ were initialized from a detection $d^j$. Under the assumption that objects do not disappear from the robot environment, objects in the world model were never removed.
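In code, the update for one associated track reduces to a few lines, since the measurement model is the identity (detections measure position directly); a minimal sketch:

```python
import numpy as np

def kf_update(mu_pred, cov_pred, mu_det, cov_det):
    """Kalman update of a track position with an associated detection."""
    y = mu_det - mu_pred             # innovation
    S = cov_pred + cov_det           # innovation covariance
    K = cov_pred @ np.linalg.inv(S)  # Kalman gain
    mu_new = mu_pred + K @ y
    cov_new = (np.eye(3) - K) @ cov_pred
    return mu_new, cov_new
```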

Prediction
In the prediction step, objects in the world model were propagated from one time step to the next. In this step, object class $c^i_t$ and bounding box $b^i_t$ were considered constant, and the object position $p^i_{t+1|t}$ was predicted using a Kalman filter prediction step. Under the assumption of static objects, the mean of the predicted position stayed constant, and the covariance matrix was updated as
$$\Sigma^i_{t+1|t} = \Sigma^i_t + Q_t,$$
where $Q_t$ was a diagonal matrix whose non-zero entries were set to 0.005. It corresponded to the covariance of the process noise, which was used to increase the uncertainty in the object position based on the uncertainty in the change of camera pose. It was assumed constant and estimated experimentally. In our experiments, this parameter had a negligible impact on the performance of our algorithm due to the accuracy of the robot pose.
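The corresponding prediction step is equally small under the static-object assumption; a sketch:

```python
import numpy as np

Q = np.eye(3) * 0.005  # process noise covariance, constant in our setting

def kf_predict(mu, cov):
    """Static objects: the mean is unchanged, the uncertainty grows by Q."""
    return mu, cov + Q
```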

Experiments
To analyse the performance of our world model representation algorithm, we evaluated two parts:
1. Pre-processing. As shown in Figure 5, everything started with the data retrieved from the sensor. Due to the challenging light conditions of the environment, the point cloud was sometimes noisy and incomplete. This resulted in some tomato detections being rejected, or in larger inaccuracies in their estimated position when a sphere could not be fitted properly. To study the camera performance on the dataset, we calculated the percentage of detections which were rejected due to a lack of points and the percentage of detections whose predicted sphere fell outside of the established limits. The first step in the object detection algorithm was 2D object detection and instance segmentation through Mask R-CNN (He et al., 2017). Therefore, its performance played an important role in the performance of the system. We evaluated the precision $P = \frac{T_p}{T_p + F_p}$ and the recall $R = \frac{T_p}{T_p + F_n}$, where true positives $T_p$, false positives $F_p$, and false negatives $F_n$ were defined by the IoU between the predicted and ground truth bounding boxes, using the standard IoU threshold of 0.5. This analysis was done for the two confidence thresholds used, 0.5 and 0.7.

2. World model representation.
To evaluate the performance of our world model algorithm, different values for the initialization threshold $n_{init}$ were explored in our experiments. In addition, two different analyses were performed: counting and data association. The counting performance of our world model was evaluated similarly to the evaluation made in recent papers that used data association for monitoring and yield prediction on crops (Kirk et al., 2021; Smitt et al., 2021). The evaluation of the actual data association performance was performed using a state-of-the-art multi-object tracking metric, Higher Order Tracking Accuracy (HOTA) (Luiten et al., 2021).

The metrics used to evaluate the counting performance were the mean percentage error
$$\mathrm{MPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{A_i - P_i}{A_i}$$
and the mean absolute percentage error
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{A_i - P_i}{A_i} \right|,$$
where $A_i$ is the actual value, $P_i$ is the predicted value, and $n$ is the number of frames. MPE gives an estimation of the counting accuracy which reflects under- or over-counting, while MAPE gives the relative absolute error on the counting.

Evaluating the accuracy of the object count over a series of sequences is, on its own, not enough to assess the performance of a data association algorithm. A good counting performance can still hide plenty of association errors, like ID switching, or false positives that counter false negatives. Therefore, we evaluated the tracking performance of our algorithm using HOTA, a standard multi-object tracking metric (Luiten et al., 2021). It is a metric that can be easily decomposed into the three main sources of errors for tracking algorithms: localization, detection and association. Everything starts with the IoU threshold, $\alpha$, that is used to define true positives (TP), false positives (FP), and false negatives (FN), similarly to how a detection algorithm is evaluated. From it, the localization accuracy can be measured as
$$\mathrm{LocA}(\alpha) = \frac{1}{|\mathrm{TP}|} \sum_{c \in \mathrm{TP}} \mathrm{IoU}(c),$$
where $\mathrm{IoU}(c)$ is the intersection over union score between the predicted bounding box and the ground truth bounding box which represent the true positive $c$. The detection accuracy at a threshold $\alpha$ is then calculated as
$$\mathrm{DetA}(\alpha) = \frac{|\mathrm{TP}|}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|}.$$
The association score for a single TP match $c$ between a detection and a ground truth (at threshold $\alpha$) is calculated as
$$A(c) = \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)| + |\mathrm{FNA}(c)| + |\mathrm{FPA}(c)|},$$
where TPA corresponds to the number of true positive associations, FNA to false negative associations and FPA to false positive associations. As shown in Figure 7, TPA, FNA and FPA are calculated (for each TP match $c$) over a predicted track and a ground truth track sequence. The association accuracy is then calculated as the average association score over all TP matches in the dataset:
$$\mathrm{AssA}(\alpha) = \frac{1}{|\mathrm{TP}|} \sum_{c \in \mathrm{TP}} A(c).$$
Lastly, HOTA, for a given threshold $\alpha$, can be defined as a combination of the above presented accuracies:
$$\mathrm{HOTA}_\alpha = \sqrt{\mathrm{DetA}(\alpha) \cdot \mathrm{AssA}(\alpha)}.$$

Figure 7: Example of predicted and ground truth tracks used to calculate $A(c)$ over a single TP match $c$ (Luiten et al., 2021). The red square indicates the TP match $c$ between a detection bounding box and a ground truth, for which we calculate the association score $A(c)$. The goal is to assess how good the association between the predicted track and the ground truth track is over time. TPAs (in green) represent all the matches between predicted and ground truth tracks. FPAs (in yellow) correspond to all the detections of the predicted track that are either matched with a different ground truth track or not matched at all. FNAs (in brown) are all the bounding boxes of the selected ground truth track that are not matched with the selected predicted track.

This illustration assumes an object whose trajectory can be defined by its position in a 2D image over time. Even though our data association is done fully in 3D, we had to use image space to evaluate HOTA. Therefore, the trajectory of a tomato track is defined as the positions in the image plane of the detected bounding boxes assigned to that tomato over a sequence of frames. This means that the relative position of the tomato with respect to the camera over the sequence is converted to a set of 2D image coordinates, which can be thought of as the projections of the 3D points of the tomato onto the image plane over the sequence.

As shown in Figure 3, the environment is more challenging in the upper areas of the plant compared to the lower ones. Since the representation dataset was recorded in a semi-cylinder path with ten different height steps, we evaluated the performance for all the metrics over these ten steps. When the metric did not depend on previous viewpoints of the sequence, the performance of each step was calculated over the ten viewpoints of that specific height step. This is the case for the detection performance and the camera error analysis. When the performance referred to the output of the world model, where previous frames from both the same and different height steps were needed, we evaluated the performance using data from the first frame until the last frame of the specific height step. For every height step, the tracking results of the seven plants were averaged.
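For concreteness, the counting metrics and the per-threshold HOTA score can be computed as sketched below, assuming per-frame actual and predicted counts and the TP/FN/FP counts and per-match $A(c)$ scores defined above; the sign convention of MPE follows the equation given earlier.

```python
import numpy as np

def mpe(actual, predicted):
    """Mean percentage error; positive values indicate under-counting."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return 100.0 * np.mean((a - p) / a)

def mape(actual, predicted):
    """Mean absolute percentage error over the frames of a sequence."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return 100.0 * np.mean(np.abs(a - p) / a)

def hota_alpha(tp, fn, fp, assoc_scores):
    """HOTA at a single IoU threshold alpha from the decomposed quantities.
    assoc_scores holds A(c) for every TP match c at this threshold."""
    det_a = tp / (tp + fn + fp)
    ass_a = float(np.mean(assoc_scores)) if len(assoc_scores) else 0.0
    return float(np.sqrt(det_a * ass_a)), det_a, ass_a
```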

Pre-processing
As shown in Figure 6, the first step of our world modeling algorithm is to detect the objects of interest in an image. As shown in Table 1, our trained Mask R-CNN is able to achieve a recall of 0.86 and a precision of 0.71 when a confidence threshold of 0.5 is used. When the larger threshold is used, the recall decreases to 0.66 while the precision increases to 0.88. A larger recall was expected when a lower threshold was used, as more detections would be made. However, the chance of false positives also increases, reducing the precision. Figure 8 shows the precision and recall results per height step in the semi-cylinder path. On the one hand, it can be seen how the precision does not change significantly over different heights, although it slightly drops from the first half of the height steps to the second half. This means that there are a few more false positives in the higher areas of the plant. On the other hand, recall drops steadily from the first until the last height, meaning that it is harder for the detection network to detect tomatoes in the upper parts of the plant. Table 2 shows the percentage of detections that are rejected due to a lack of 3D points in the structured point cloud (RD), and the percentage of detections whose sphere fitting results in a non-valid center or radius (NSF). It can be seen how both RD and NSF increase with the camera height, independently of the used confidence threshold. From Figure 8 and Table 2, it can be seen how the performance of both camera and detection worsens in the higher parts of the plant. This was expected, as the higher parts of a tomato plant contain more leaves, creating more clutter and occlusions, and smaller and less ripe tomatoes, as illustrated in Figure 3.

Figure 9 shows the resulting world model representation over a plant point cloud. Figure 9a shows the resulting point cloud after merging several viewpoints taken from the data, and Figure 9b shows the 3D world model representation, where tomatoes are represented as individual spheres, over the plant point cloud. The representation is not perfect, as there are two false positives in the bottom left, and a missing tomato in the middle-top part of the figure.

World model representation
As Table 3 shows, the MAPE and MPE columns present on average better values when a confidence threshold of 0.7 is used. This suggests that a larger precision is more important than larger recall values. Due to the multi-view nature of our algorithm, if a fruit is not detected in one viewpoint (frame), it can still be detected in one of the following viewpoints. Therefore, low recall values can be overcome with multi-view perception. On the contrary, a larger false positive rate, which corresponds with lower precision values, has the potential to inject non-existing fruits into the tracking system that are not removed further on in the current algorithm. Regarding $n_{init}$, the rows that indicate a threshold $n_{init} = 1$ have MAPE and MPE values closer to zero than their counterpart rows with $n_{init} = 0$. Similarly, this can be explained by the multi-view nature of our algorithm. True positive detections that are not seen consecutively over two frames and, therefore, do not pass the threshold $n_{init} = 1$, can still appear consecutively in later frames and be added to the world model, while false positives are filtered out. When $n_{init} = 0$, both TPs and FPs that are seen only once are added to the world model. Consequently, the configuration that allows for the most false positives, a confidence threshold of 0.5 and $n_{init}$ of 0, has higher MAPE values and presents the largest overestimation of tomatoes. HOTA, DetA and AssA show lower performance in the higher parts of the plant, probably due to the more challenging conditions. The same happens with MAPE when $n_{init}$ is set to 0, as false positives are more likely to appear in the top parts of the plant. Even though recall values decline noticeably in the higher parts of the plant, when $n_{init}$ is set to 1, MAPE values do not increase as a result. This again can be explained by multi-view perception: although the chance of detecting a tomato in a single frame is lower, there are still multiple frames in which the tomato can be detected. MPE performance does not decrease over height either, because under- and over-counting results between the different plants can cancel each other out, producing values close to zero.

Table 3: Results of the total counting and tracking experiments. Two confidence thresholds for the detection algorithm were evaluated, as well as two initialization thresholds for the tracking algorithm. The performance is shown per height step in a cumulative manner; this means that each height step contains the data from all the previous steps. The height step marked with * corresponds to the whole sequence of frames. For each cell, the results show the average over the seven tomato plants. Numbers in bold correspond to the best values per metric. MAPE stands for Mean Absolute Percentage Error, MPE for Mean Percentage Error, HOTA for Higher Order Tracking Accuracy, DetA for Detection Accuracy, and AssA for Association Accuracy.
Comparing MAPE and HOTA performance over different parameters and height steps, it can be seen how similar MAPE values can have very different HOTA results. For instance, the MAPE results of 11.35 at step 2 ($n_{init} = 0$, confidence threshold of 0.5) and 11.36 at step 5 ($n_{init} = 1$, confidence threshold of 0.7) have HOTA values of 66.68 and 56.09 respectively. This is due to how HOTA is calculated: with more frames of the recorded sequences, the number of predicted and ground truth tracks increases, which consequently increases the chances of association and detection errors. This is especially true in the used data, as the later frames are more complex than the frames from the bottom part of the plant. Furthermore, the best MAPE results are found with parameters that reduce the number of false positives in spite of a lower detection rate (confidence threshold of 0.7 and $n_{init}$ of 1). However, HOTA and its constituent metrics yield better results when the opposite is true, i.e. a detection threshold of 0.5 and $n_{init} = 0$, which increases the detection rate and the number of false positives. This can only be explained if a higher detection rate generated more TPs than FPs, a circumstance that increases DetA, AssA and HOTA. Given the equations used to calculate DetA and $A(c)$, a 1:1 increase ratio in TPs and FPs would result in a lower accuracy.

Discussion
The concept of a generic representation of the environment that can be used by robots across many tasks with few modifications, like changing the objects of interest, has been proposed by several authors (Elfring et al., 2013; Persson et al., 2020; Wong et al., 2015b; Inceoglu et al., 2019). However, their work was limited to controlled lab environments and did not evaluate the performance of their data association algorithms in depth with metrics like the ones proposed by Luiten et al. (2021). In this work, we showed how a generic representation, based on data association, can accurately represent and locate most objects in a real-world tomato greenhouse environment. Nonetheless, the performance of our world model can potentially be improved in several steps of the pipeline.
The performance of our detection algorithm varied greatly between different height steps. In the lower parts of the plants, our performance is similar to what other research has found on datasets that target the de-leafed lower parts of tomato plants (Afonso et al., 2020). However, detection in the upper parts of the plant was more challenging, due to more leaves and smaller and less ripe tomatoes, leading to a lower performance. Using training data containing the tomato variety that is used in the counting and tracking experiment could potentially improve the detection performance. Furthermore, the quality of the point cloud affected the performance of both the detection and tracking algorithms. Up to 4.58% of the detections were rejected due to a lack of 3D points, which affected precision and recall values. For up to 55.08% of the detected tomatoes, the fitted sphere was non-valid, mainly due to a partial tomato point cloud with few points or with high noise levels.
The performance of the detection algorithm greatly affects the accuracy of our 3D tracking algorithm, something that Bewley et al. (2016) also pointed out for Hungarian-based tracking algorithms. Concretely, our experiments showed that configurations which increase the recall and reduce the precision of the detection system decreased the counting performance while increasing the tracking performance. A cause for this is the inability of the world model to remove existing objects, which harms its ability to recover from false positive detections. Consequently, false positives stay in the world model, increasing the counting error. However, removing false positive tracks from the world model in an environment where occlusions are common is not an easy task. If an object is in the field of view of the camera but is not detected, it does not necessarily mean that it is a false positive: it could be occluded behind another object, like a leaf.
Localizing and counting fruits in agro-food environments is a challenging task due to the complexity of the environment, yet it is key for robotic applications. Our algorithm achieved a MAPE as low as 5.06%. Recently, Kirk et al. (2021) have shown the potential of tracking systems similar to ours. They were able to count strawberries with a MAPE of 3%, although their conditions were more favorable, with fewer occlusions and a larger distance between individual fruits. In environment conditions similar to ours, Halstead et al. (2021) developed and evaluated another similar tracking system to count bell peppers on individual plants, and achieved a MAPE of 4.1%. Their data also contained clutter and de-leafed areas of plants, but, since bell peppers do not grow in clusters like tomatoes do, the distance between individual fruits is more favorable for tracking. Smitt et al. (2021) extended the work of Halstead et al. (2018) to build an autonomous system that could count bell peppers along a row of plants in the greenhouse. In addition, they added bounding box re-projection between different frames using wheel odometry and the point cloud from the camera. They achieved a MAPE of 21% on tomato and sweet pepper plants. However, the tracking algorithm of all these works was 2D-based, and they used a fixed camera array which moved parallel to the plant row. On the one hand, fixed camera arrays provide information faster than a camera attached to a robot arm. On the other hand, they are limited in the information they can offer, as there are viewpoints that cannot be reached. Consequently, they are less capable of overcoming certain occlusions. Furthermore, a 2D-based tracking system is at a clear disadvantage against a 3D-based one, as an object that appears in front of a previously seen object and occludes it will have very similar 2D camera coordinates to the occluded object.
In our experiments, we evaluated the counting error in order to assess the performance of our representation built by a 3D tracking algorithm. This is a common practice in the field (Halstead et al., 2021; Kirk et al., 2021). However, from our experiments, it can be seen how counting and tracking performance do not always produce similar results. A tracking algorithm can carry other types of errors that are not covered by counting performance metrics. For instance, during tracking, two or more objects can switch IDs during one or multiple frames. This might not affect the final count prediction and is, therefore, not visible in the counting metrics. Furthermore, good counting results can originate from a set of detections in which missed objects are compensated by false positives. These types of errors harm the accuracy of the plant representation and might not be acceptable for some robotics and monitoring applications. Multi-object tracking (MOT) metrics, such as the ones proposed by Luiten et al. (2021), give more insight into the actual tracking performance and should be used to properly evaluate the performance of any tracking algorithm.
The HOTA and AssA values at the higher viewpoints of the plant suggest that there are still numerous association errors. This is probably due to the simplicity of the tracking algorithm, where only the estimated Cartesian center of the objects is used for tracking. Several multi-object tracking algorithms have shown how using additional features in the data association step (Wojke et al., 2017) or an end-to-end approach (Meinhardt et al., 2022) performs better in scenarios with many occlusions.
All the data used was recorded following a semi-cylinder path around a plant; therefore, our viewpoints are not optimized for information gain, and several images do not contain any valuable information at all, like Figure 3a. An NBV algorithm like the ones proposed by Wu et al. (2019) and Burusa et al. (2022) has the potential to improve the efficiency and quality of our world model representation, as the viewpoints selected by the NBV algorithm might contain fewer occlusions of the relevant objects in the scene.

Conclusions and future work
In this work, we presented a generic method for building 3D representations for robots, based on individual objects with dynamic properties, that can produce accurate representations in highly occluded environments. We also developed and evaluated a 3D MOT algorithm that is used to create, maintain and update the objects in the representation. We showed how our world model can be used to find and represent tomatoes in a challenging and highly cluttered tomato greenhouse with a high level of occlusion, using a multi-view perception approach with a 6DoF robotic arm. Our algorithm achieved a tracking accuracy, represented by the HOTA metric, of up to 71.47, and a counting error as low as 5.06%. This would allow the use of our method in practical applications like fruit harvesting. Additionally, we showed how the total count of objects at the end of a sequence does not fully describe the performance of a MOT algorithm and fails to detect certain types of errors, especially false positives and false negatives canceling each other out. We also identified the main sources of error in our MOT algorithm, namely the detection and association performance in the most cluttered areas of the plant, by performing a novel analysis of the different steps of the pipeline using adequate and up-to-date metrics.
To further improve the world model in future work, we will explore deep learning methods, such as those of Wojke et al. (2017) and Meinhardt et al. (2022), to improve the association performance. Some of these methods, being end-to-end deep learning approaches, might also be able to remove false positive objects from the world model in the presence of occlusions. To increase the detection performance, NBV algorithms can select viewpoints that contain fewer occlusions, providing our detection algorithm with images and point clouds that are easier to process. Therefore, we will evaluate the performance of our world model representation when it is coupled with an NBV algorithm.

Declaration of competing interest
We declare that there are no personal and/or financial relationships that have inappropriately affected or influenced the work presented in this paper.