Robotics and Autonomous Systems

Local-HDP: Interactive open-ended 3D object category recognition in real-time robotic scenarios ✩

We introduce a non-parametric hierarchical Bayesian approach for open-ended 3D object categorization, named the Local Hierarchical Dirichlet Process (Local-HDP). This method allows an agent to learn independent topics for each category incrementally and to adapt to the environment over time. Each topic is a distribution of visual words over a predefined dictionary. Using an inference algorithm, these latent variables are inferred from the dataset. Subsequently, the category of an object is determined based on the likelihood of generating the 3D object from the model. Hierarchical Bayesian approaches like Latent Dirichlet Allocation (LDA) can transform low-level features into high-level conceptual topics for 3D object categorization. However, the efficiency and accuracy of LDA-based approaches depend on the number of topics, which is chosen manually. Moreover, fixing the number of topics for all categories can lead to overfitting or underfitting of the model. In contrast, the proposed Local-HDP can autonomously determine the number of topics for each category. Furthermore, the online variational inference method has been adapted for fast posterior approximation in the Local-HDP model. Experiments show that the proposed Local-HDP method outperforms other state-of-the-art approaches in terms of accuracy, scalability, and memory efficiency by a large margin. Moreover, two robotic experiments have been conducted to show the applicability of the proposed approach in real-time applications.

© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Most recent object recognition/detection techniques are based on deep neural networks [1][2][3][4][5][6][7]. These methods typically need a large labeled dataset and a long training process, and the number of object categories (class labels) has to be defined in advance. However, in real-life robotic scenarios, a robot can always face new object categories while operating in its environment, and it is required to learn from a small number of observations. Therefore, the model should be updated in an open-ended manner without completely retraining it [8][9][10]. In this paper, open-ended learning means that the number of categories (class labels) is not fixed and predefined for the model, and that it can grow at runtime. Furthermore, object category recognition is not a well-defined problem because of the large inter-category variation (Fig. 1 (top)), the multiple object views for each object (Fig. 1 (bottom)), and concept drift in dynamic environments [11].

✩ This work was conducted at the Center of Data Science and Systems Complexity (DSSC) and sponsored by a Marie Skłodowska-Curie COFUND grant, agreement no. 754315. The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
Object recognition in humans is a complex hierarchical multistage process of streaming visual data in the cortical regions [12]. The hierarchical structure of the brain for the object recognition task has motivated us to choose hierarchical Bayesian models like Latent Dirichlet Allocation (LDA) [13] and Hierarchical Dirichlet Process (HDP) [14] for object category recognition.
In this paper, we suggest that 3D visual streaming data should be processed continuously, and that object category learning and recognition should be performed simultaneously in an open-ended manner. We propose the Local Hierarchical Dirichlet Process (Local-HDP), an extension of the Hierarchical Dirichlet Process [14], which can incrementally learn new topics for each category of objects independently. In contrast to notable recent works [11,15,16] that use a predefined number of topics, Local-HDP is more flexible since it is a non-parametric Bayesian model that can autonomously determine the number of topics for each category at run-time. The first layer of the proposed architecture extracts a set of local shape features using the spin-image descriptor [17].
The computed features are represented as Bag of visual Words (BoWs). The obtained representation is then sent to the topics layer, where a set of topics is inferred autonomously for the given object using the proposed Local-HDP method. Each topic is a distribution of visual words over a dictionary. In other words, the topic layer provides an unsupervised mapping of the BoW representation to the topics space, which can fill the conceptual gap between low-level features and high-level concepts. As shown in the object views layer, the appearance of an object may vary from different perspectives ( Fig. 1 (bottom)). Therefore, it is necessary to infer topics using different object views. There might be different instances in an object category as well (see Fig. 1 (top)). This point is addressed in the categories layer. Moreover, a simulated teacher has been developed to interact with the model and evaluate its performance in an open-ended manner.
This work extends two approaches, namely Local-LDA [11] and HDP [14], in four aspects. First, our approach can autonomously detect the number of required topics to independently represent the objects in each category, avoiding the limitation of Local-LDA for determining the number of topics in advance. This feature prevents underfitting or overfitting of the model. Second, our research adapts the online variational inference technique [18], which significantly reduces inference time. Third, the proposed local online variational inference method leads to memory optimization since it needs to store a smaller average number of instances per object category in memory. Fourth, our work extends the hierarchical Dirichlet process [14] by learning and updating local topics for each object category independently in an incremental and open-ended fashion.

Related work
Object representation is one of the main building blocks of object recognition approaches. The underlying reason is that the output of the object representation module is used in both learning and recognition. Object representation techniques can be categorized into three groups, namely, global and local object descriptors and machine learning approaches [19]. Notable global object descriptors are Global Orthographic Object Descriptor (GOOD) [20,21], Ensemble of Shape Functions (ESF) [22] and Viewpoint Feature Histogram (VFH) [23]. Examples of local 3D shape descriptors include Spin-Images (SI) [17], Intrinsic Shape Signature (ISS) [24], and Fast Point Feature Histogram (FPFH) [25]. Local descriptors are more robust to occlusions and clutter. However, comparing pure local descriptors is a computationally expensive task [26]. To alleviate this problem, machine learning techniques like Bag of Words (BoW) [27], Latent Dirichlet Allocation (LDA) [13,28] and deep learning [29,30] methods can be used for representing objects in a compact and uniform format.
Kasaei et al. [11] extended Latent Dirichlet Allocation (LDA) and proposed Local-LDA, demonstrating its application in the context of open-ended 3D object category learning and recognition. Similar to our approach, Local-LDA learns a set of topics for each object category incrementally and independently. Unlike our approach, in Local-LDA the same number of topics is chosen in advance, based on trial and error, for all object categories. A good choice for the number of topics for an object category is correlated with its intra-category variation. Therefore, choosing the same number of topics for all object categories with different intra-category variations might not be reasonable. Moreover, in open-ended scenarios it is not feasible to anticipate the variation of the 3D objects that the model might see in the future and to choose a fixed number of topics in advance for all categories. To solve these issues, our approach can autonomously choose the number of topics for each object category on the fly, without the need for trial and error in advance. This makes our approach more robust for recognizing object categories with various inter-category and intra-category variations, and applicable in real-world open-ended scenarios. Local-LDA uses collapsed Gibbs sampling for approximating the posterior probability, whereas we adapt the online variational inference technique [18] for Local-HDP.
Our approach builds on the Hierarchical Dirichlet Process (HDP) [14], which is based on the Dirichlet process (DP) [31] and mixtures of DPs [32]. Posterior inference is intractable for HDP, and much research has been done to find a proper approximate inference algorithm [14,33,34]. The Markov Chain Monte Carlo (MCMC) sampling method for DP mixture models has been proposed for approximate inference in HDPs [35]. Blei et al. proposed variational inference for DP mixtures [36]. Teh et al. [14] proposed the Chinese Restaurant Franchise metaphor for HDP and used the Gibbs sampling method for inference. The online variational inference approach was proposed by Wang et al. [18] for HDP; it can be used in online incremental learning scenarios and for large corpora. Our method differs from HDP in that the proposed Local-HDP only shares topics within the local model of each category and not across different categories. This is especially needed in the case of 3D object categorization for open-ended scenarios [11]. The use of local topics avoids underfitting of the model by considering intra-category variations. HDP has further extensions that construct tree-structured representations for text data with nested structure [37]. Similar to the supervised hierarchical Dirichlet Process (sHDP) [38], we use the category label of each object. Unlike sHDP, we learn object categories in an open-ended fashion, while in sHDP the number of object categories to be learned should be defined in advance.
Deep learning-based approaches [39][40][41] try to learn a sparse representation for 3D objects. Unlike our approach, such methods typically need a large labeled dataset and require a long training time. In contrast, our proposed approach does not require a large labeled dataset and can incrementally update the model when facing an unforeseen object category in an open-ended manner. Moreover, the number of categories is not fixed in open-ended approaches like ours.

Method
We assume that an object has already been segmented from the point cloud of the scene, and we hence mainly focus on detailing the Local Hierarchical Dirichlet Process (Local-HDP) approach.

Pre-processing layers
In Fig. 2, the first two layers, the feature layer and the BoWs layer, are the pre-processing layers. In the feature layer, we first select key-points for the given object and then compute a local shape feature for each key-point. Towards this goal, we first voxelize the object (Fig. 3 (b)), and then the nearest point to each voxel center is selected as a key-point. Afterwards, the spin-image descriptor [17] is used to encode the surrounding shape at each key-point using the original point cloud (Fig. 3 (c)). This way, each object view is described by a set of spin-images in the first layer, O_s = {s_1, ..., s_N}, where N is the number of key-points.
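The key-point selection step can be sketched as follows; this is a simplified stand-in for the PCL VoxelGrid filter, with a toy cloud and a hypothetical voxel size rather than the parameter values used in the paper:

```python
import numpy as np

def voxel_grid_keypoints(points, voxel_size):
    """Select one key-point per occupied voxel: the point of the cloud
    nearest to that voxel's center (simplified VoxelGrid-style sketch)."""
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    keypoints = []
    # Group points by voxel index and keep the point closest to the center.
    for vox in np.unique(idx, axis=0):
        mask = np.all(idx == vox, axis=1)
        center = origin + (vox + 0.5) * voxel_size
        pts = points[mask]
        keypoints.append(pts[np.argmin(np.linalg.norm(pts - center, axis=1))])
    return np.array(keypoints)

# Toy cloud: two point clusters more than one voxel apart.
cloud = np.array([[0.0, 0.0, 0.0], [0.01, 0.0, 0.0],
                  [0.5, 0.5, 0.5], [0.52, 0.5, 0.5]])
keypoints = voxel_grid_keypoints(cloud, voxel_size=0.1)
```

Each key-point would then be described by a spin-image computed on the original (non-voxelized) cloud, a step omitted here.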
The obtained representation is then sent to the BoWs layer. Since HDP-based models make the bag-of-words assumption, i.e., the order of words (visual words) in a document (3D object view) can be neglected, the BoWs layer transforms the computed spin-images into a BoW format (Fig. 3 (d)). Towards this end, the BoWs layer requires a dictionary with V visual words (spin-images). In this work, we have created a dictionary of visual words using the same methodology as Local-LDA [11]. The obtained BoW representation is fed to the topic layer.
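The quantization performed by the BoWs layer can be sketched as follows; the two-dimensional "descriptors" and the two-word dictionary are toy stand-ins for real spin-images:

```python
import numpy as np

def bag_of_words(descriptors, dictionary):
    """Quantize each local descriptor (e.g. a spin-image) to its nearest
    visual word and return the word-count histogram over the dictionary."""
    # Pairwise squared distances between descriptors and dictionary words.
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(dictionary))

dictionary = np.array([[0.0, 0.0], [1.0, 1.0]])          # V = 2 toy words
descriptors = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]])
bow = bag_of_words(descriptors, dictionary)              # histogram over V words
```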

Local hierarchical Dirichlet process
After synthesizing the point cloud of a 3D object into a set of visual words in BoW format, the data is ready to be inserted into the topic layer, where the proposed Local-HDP method is employed. In this layer, the model transforms the low-level features in BoW format into conceptual high-level topics. In other words, each object is represented as a distribution over topics, where each topic is a distribution of visual words over a dictionary. To this end, we use an incremental inference approach in which the number of categories is not known beforehand and the agent does not know which additional object categories will be available at run-time. The plate notation of Local-HDP is shown in Fig. 4. In this graph, C is the number of categories and |c| is the number of objects in each category. Each object j is represented by a set of N visual words, W_{j,n}, where the pair (j, n) denotes the n'th visual word of the j'th object. Each visual word is an element from a vocabulary of V predefined visual words, that is, W_{j,n} ∈ {1, ..., V}.

1 www.pointclouds.org/documentation/classpcl_1_1_voxel_grid.html
2 www.pointclouds.org/documentation/classpcl_1_1_spin_image_estimation.
Using a Coffee Mug as an example, a distribution over the topics of the Coffee Mug category should be used to generate the visual words of the object. Accordingly, a particular topic is selected out of the mixture of possible topics of the Coffee Mug category to generate the visual words. For instance, coffee mugs typically have a ''handle'', which is represented as a distribution of visual words that repeatedly occur together. This can be interpreted as the ''handle'' topic, which is inferred from the co-occurrence of the visual words in several objects of the same category. The process of choosing a topic and then drawing visual words from that topic is repeated several times to generate all the visual words of the Coffee Mug. It is worth mentioning that the generative process is not used in the experiments; instead, the local online variational inference technique performs the reverse procedure of inferring the topics, i.e., the latent variables, from the 3D object views. Using the inferred topics, we then compare the log-likelihood of generating the visual words of a 3D object under each local model. The category of the local model with the highest log-likelihood is selected as the predicted category of the 3D object.
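The recognition rule described above (pick the local model that assigns the highest log-likelihood to the object's visual words) can be sketched as follows; here each local model is reduced to a single word distribution, whereas in Local-HDP it would be a mixture over the category's inferred topics:

```python
import numpy as np

def predict_category(bow, local_models):
    """Return the category whose local model gives the highest
    log-likelihood to the bag-of-words counts of a test view.
    Each 'model' here is just a word distribution over the dictionary."""
    scores = {c: float(bow @ np.log(p)) for c, p in local_models.items()}
    return max(scores, key=scores.get), scores

models = {"mug":  np.array([0.7, 0.2, 0.1]),
          "fork": np.array([0.1, 0.2, 0.7])}
bow = np.array([5, 1, 0])            # visual-word counts of a test view
label, scores = predict_category(bow, models)
```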

Dictionary of visual words
In this paper, we have used the method of [11] to construct the dictionary of visual words. This means that a dictionary with V visual words is constructed by clustering a random subset of 50% of the training data. We have utilized the k-means method for clustering the local shape descriptors (spin-images) of the randomly selected objects into V clusters. Finally, the nearest spin-images to the cluster centers are selected as the dictionary's visual words.
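This dictionary-construction step can be sketched as follows; a plain Lloyd's k-means stands in for the clustering, and random vectors stand in for spin-image descriptors:

```python
import numpy as np

def build_dictionary(descriptors, V, iters=20):
    """Cluster local shape descriptors into V groups with plain Lloyd's
    k-means, then return, as visual words, the actual descriptors
    nearest to the final cluster centers."""
    init = np.linspace(0, len(descriptors) - 1, V).astype(int)
    centers = descriptors[init].astype(float).copy()
    for _ in range(iters):
        d2 = ((descriptors[:, None] - centers[None]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for k in range(V):
            if np.any(assign == k):
                centers[k] = descriptors[assign == k].mean(0)
    # A visual word is the nearest real descriptor to each center.
    d2 = ((descriptors[:, None] - centers[None]) ** 2).sum(-1)
    return descriptors[d2.argmin(0)]

rng = np.random.default_rng(1)
descs = np.vstack([rng.normal(0.0, 0.1, (20, 4)),   # toy cluster A
                   rng.normal(3.0, 0.1, (20, 4))])  # toy cluster B
words = build_dictionary(descs, V=2)
```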

Local online variational inference
The inference method is responsible for inferring the latent variables in the model using a dataset [42]. In this section, we adapt the online variational inference method [18] for Local-HDP. This method can be used in open-ended applications since it can handle streaming data in an online and incremental manner. Moreover, it is faster than traditional approximate inference techniques, e.g., Chinese restaurant franchise [14] and variational inference [36], and it can be used to infer the latent variables of differently scaled datasets [18].
Online variational inference for HDP is inspired by the online variational Bayes [43] method for LDA. This method optimizes a variational objective function [44] by exploiting stochastic optimization [45]. HDP is a collection of DPs G_j that share the same base distribution G_0, which is itself drawn from a DP. These DPs share the same set of atoms and only the atom weights differ. Mathematically, a two-level HDP is defined as follows:

G_0 | γ, H ∼ DP(γ, H),
G_j | α_0, G_0 ∼ DP(α_0, G_0),   j = 1, 2, ...,

where α_0 > 0 is the scaling parameter and γ is the concentration parameter of a DP. Sethuraman's stick-breaking construction technique [46] is responsible for determining the number of topics in the model. Using the same approach as [47] for HDP, the variational distribution for local online variational inference takes the following factorized form:

q(β′, π′, c, z, φ) = q(β′) q(π′) q(c) q(z) q(φ).

In the terminology of variational inference techniques, q is called the variational approximation to the posterior p. Variational techniques solve an optimization problem over a class of tractable distributions Q in order to find a q ∈ Q that is most similar to p and can be used as its approximation. Here, β′ and π′ are the corpus-level and document-level stick proportions, c is the vector of indicators, φ is the inferred topic distribution, and z_{j,n} is the topic index for the n'th visual word in the j'th 3D object. The infinity notion (∞) in this factorization shows the open-ended nature of the number of parameters.
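Sethuraman's stick-breaking construction mentioned above can be sketched numerically; the truncation at 20 sticks and the parameter values are illustrative only:

```python
import numpy as np

def stick_breaking(beta_prime):
    """Sethuraman's construction: turn Beta-distributed stick fractions
    beta'_k into topic weights beta_k = beta'_k * prod_{l<k}(1 - beta'_l)."""
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - beta_prime[:-1])))
    return beta_prime * leftover

rng = np.random.default_rng(0)
gamma = 1.0                                  # concentration parameter
beta_prime = rng.beta(1.0, gamma, size=20)   # truncated at 20 sticks
weights = stick_breaking(beta_prime)
```

The weights sum to strictly less than one; the leftover mass corresponds to the infinitely many sticks beyond the truncation, which is what lets the model grow the number of topics as needed.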
The factorized forms of q(c), q(z), q(φ), q(β′) and q(π′) are the same as in online variational inference for HDP [47]. Assuming that we have |c| objects in each category for Local-HDP, the variational lower bound for object j in category C is calculated as follows:

L_j = E_q[ log p(w_j | c_j, z_j, φ) + log p(c_j | β′) + log p(z_j | π′_j) + log p(π′_j | α_0) ] + H(q(c_j)) + H(q(z_j)) + H(q(π′_j)),

where H(·) is the entropy term of the variational distribution. Therefore, the lower bound for each category is calculated in the following way:

L^(C) = Σ_{j∈C} L_j + E_q[ log p(β′ | γ) + log p(φ | η) ] + H(q(β′)) + H(q(φ)).

Using coordinate ascent equations in the same way as online variational inference, the object-level parameters (a_j, b_j, ϕ_j, ζ_j) are estimated. To be more specific, a_j and b_j are the parameters of the beta distributions for the bottom-level stick proportions π′_j, ϕ_j is the variational parameter for the vector of indicators c_j, and ζ_j is the variational parameter for the topic assignments z_j. These variables are defined in the same way as in [47]. Then, for the category-level parameters (λ^(C), u^(C), v^(C)), we take a gradient step with respect to a learning rate, using the per-object estimates:

λ̂_{k,w} = η + |c| Σ_t ϕ_{j,t,k} Σ_n ζ_{j,n,t} I[w_{j,n} = w],
û_k = 1 + |c| Σ_t ϕ_{j,t,k},
v̂_k = γ + |c| Σ_t Σ_{l=k+1}^{K} ϕ_{j,t,l}.

Here, T and K are the document (3D object view) and corpus (category) level truncation parameters, respectively. Moreover, ϕ (multinomial), ζ (multinomial) and λ (Dirichlet) are the variational parameters, which have the same form for all categories. Using an appropriate learning rate ρ_t = (t_0 + t)^{-κ} for online inference, the updates for λ^(C), u^(C) and v^(C) become:

λ^(C) ← (1 − ρ_t) λ^(C) + ρ_t λ̂,
u^(C) ← (1 − ρ_t) u^(C) + ρ_t û,
v^(C) ← (1 − ρ_t) v^(C) + ρ_t v̂.

Algorithm 1 shows the pseudo-code of the proposed inference technique for the Local-HDP approach.

Algorithm 1: Local Online Variational Inference
initialization: initialize the category-level parameters (λ^(C), u^(C), v^(C)) for all the learned categories; set t_0 = 1
for each category C do
    while the stopping criterion is not met do
        - Use the object view j for updating the parameters.
        - Compute the document-level parameters a_j, b_j, ϕ_j, ζ_j using the same methodology as [18].
        - Update the category-level parameters λ^(C), u^(C), v^(C) using the learning rate ρ_t.
    end
end
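The category-level stochastic update at the heart of this inference scheme can be sketched numerically; the Robbins-Monro learning-rate schedule and the toy parameter values below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def learning_rate(t, t0=1.0, kappa=0.6):
    """Robbins-Monro schedule: with 0.5 < kappa <= 1 the step sizes sum
    to infinity while their squares stay finite, as stochastic
    optimization requires."""
    return (t0 + t) ** (-kappa)

def online_update(param, param_hat, rho):
    """One stochastic step on a category-level parameter, e.g. the
    topic Dirichlet parameter lambda^(C)."""
    return (1.0 - rho) * param + rho * param_hat

lam = np.ones(5)                                 # current lambda^(C)
lam_hat = np.array([1.0, 4.0, 1.0, 1.0, 1.0])    # per-object estimate
lam = online_update(lam, lam_hat, learning_rate(t=1))
```

Each incoming object view pulls the category's parameters toward its own estimate, with ever smaller steps as t grows, so the local model stabilizes while remaining updatable.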

Object category learning and recognition
In this subsection, the mechanism of interactive open-ended learning is explained in more detail. Classical object recognition methods do not support open-ended learning. In contrast, our method is open-ended, and the number of categories can be incrementally extended through time. The system can interact with a human user to learn about new categories or to update existing category models by receiving corrective feedback when a misclassification occurs. We follow the same methodology as [48] for this purpose. The user can interact with the system through one of the following actions:

• Teach: introducing the category of a target object to the agent.
• Ask: inquiring the agent about the category of a target object.
• Correct: sending corrective feedback to the agent in case of a wrong categorization.
Whenever the agent receives a teach command, it incrementally updates the local model corresponding to the category of the target object using the aforementioned online variational inference technique. In case of an ask command, the log-likelihood is used to determine the category of an object. The log-likelihood is computed in the same way as in [18]. The local model with the highest log-likelihood is then selected as the predicted category for the object.
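The teach/ask/correct protocol can be sketched as follows; the scoring function and the "model" (a plain list of views) are toy stand-ins for the per-category log-likelihood and the incrementally updated local models:

```python
class OpenEndedLearner:
    """Minimal sketch of the teach/ask/correct protocol. Each category
    keeps a local model; here a 'model' is just the list of taught
    views, standing in for a Local-HDP topic model."""
    def __init__(self):
        self.models = {}

    def teach(self, label, view):
        # Incrementally update (here: extend) the category's local model.
        self.models.setdefault(label, []).append(view)

    def ask(self, view, score):
        # 'score' plays the role of the per-category log-likelihood.
        return max(self.models, key=lambda c: score(view, self.models[c]))

    def correct(self, label, view):
        self.teach(label, view)   # corrective feedback updates the model

learner = OpenEndedLearner()
learner.teach("mug", 1.0)
learner.teach("fork", 9.0)
# Toy score: negative distance to the category's mean view.
score = lambda v, views: -abs(v - sum(views) / len(views))
prediction = learner.ask(2.0, score)
```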

Experimental results
Following the same protocol as Local-LDA [11] for interacting with a simulated teacher, two sets of experiments, namely, offline experiments and open-ended experiments, have been conducted to evaluate the performance of the proposed method. The offline experiments use the k-fold cross-validation technique for evaluating the performance of the model in offline scenarios with a small number of training instances. The open-ended experiments are focused on evaluating the proposed approach for the scenarios in which the number of object categories (class labels) is not fixed and can grow over time. In open-ended scenarios, the model is updated in an incremental manner. However, in the offline evaluations, the model is trained once with a training set and then evaluated using a testing set from the dataset.

Datasets and baselines for comparison
For offline evaluation of the proposed Local-HDP and the other state-of-the-art approaches, we have used the restaurant RGB-D object dataset [48]. This dataset has 10 categories of objects and each category has a significant intra-category variation. It consists of 306 different object views for 10 household objects. Therefore, it is a suitable dataset to perform extensive sets of experiments.
The Washington RGB-D dataset [49] is used for online open-ended evaluation of the method, since it is one of the largest 3D object datasets. It has 250,000 views of 300 common household objects, categorized into 51 categories. Fig. 5 shows some of the object categories present in the Washington RGB-D dataset. In all experiments, only the depth data has been used for determining the category of 3D objects. Note that, as one can see in Fig. 5, detecting the category of an object based solely on depth data is a hard task even for humans.

Offline evaluation
Similar to Local-LDA, our approach has several parameters that should be well selected to provide an appropriate balance between recognition performance, memory usage and computation time. In order to fine-tune the parameters of our method for offline evaluation, 440 experiments have been conducted with different parameter values. The voxel grid approach has been used for down-sampling and finding the key-points for the local descriptor. The voxel grid has a Voxel Size (VS) parameter, which determines the size of each voxel. Moreover, the spin-image local descriptor has two parameters, namely Image Width (IW) and Support Length (SL).
In all experiments, the first-level and second-level concentration parameters are set to 1, the chunk size for offline evaluation is set to 1, and the maximum number of topics is set to 100. All other parameters are set to the default values proposed in [47]. Moreover, in all experiments the LDA parameters are set to the same values as described in [11]. Since online variational inference is a stochastic inference technique, for each experiment the order of the data instances has been permuted 10 times, 10-fold cross-validation has been applied to each permutation, and the results have been averaged.

Table 2: Comparison of the different approaches using the best parameter values. The average run-time of each experiment is reported for all approaches.
Approach                    Accuracy (%)   Run-time (s)
RACE [50]                   87.09          1757
BoW [27]                    89.00          195
LDA (shared topics) [13]    88.32          227
Local-LDA [11]              91.30          348
HDP (shared topics) [18]    90.33          233
Local-HDP (our approach)    97.11          352

Table 3: Comparison of the proposed approach with several deep learning approaches for 3D object classification.

Table 1 shows the comparison of Local-HDP and Local-LDA with different parameter values. As one can see in this table, the proposed Local-HDP method outperforms Local-LDA, which is the best among the other methods (see [11]). Using the best parameter values based on Table 1 and the corresponding tables in [11], the accuracy of all approaches is shown in Table 2, which shows that Local-HDP outperforms the other state-of-the-art methods in terms of accuracy by a large margin. In particular, the accuracy of Local-HDP is 97.11%, which is 5.81 percentage points (p.p.) better than Local-LDA, and 6.78, 8.79, 8.11 and 10.02 p.p. better than the HDP, LDA, BoW and RACE approaches, respectively. Moreover, Local-HDP has almost the same run-time as Local-LDA. Table 3 shows the comparison of the proposed Local-HDP approach with several deep learning architectures, namely PointNet [51], PointNet++ [52], and PointCNN [53], for offline evaluation. Since the number of training instances per category is limited in the restaurant RGB-D object dataset [48] (8 training instances for the fork category in offline 10-fold cross-validation, and 27 training instances per category on average), the deep learning approaches tend to overfit and do not generalize well. To resolve this issue for the deep learning approaches, the dataset was augmented 20 times by randomly rotating the point clouds around different axes. Table 3 also compares the accuracy of the deep learning approaches with the proposed Local-HDP after augmentation.
To uniformly sample 2048 points from each point cloud for the deep learning approaches, a mesh is first constructed using the ball-pivoting surface reconstruction algorithm [54]. The point clouds are then normalized to a unit sphere (the same approach is used in PointNet [51]) and 2048 points are uniformly sampled from the constructed meshes.

Open-ended evaluation
In order to evaluate our model in an open-ended learning scenario, we used the Washington RGB-D dataset [49] and followed the same methodology as discussed in [11]. In particular, we have developed a simulated teacher which can interact with the model by either teaching it a new category or asking it to categorize an unforeseen object view. In case of a wrong categorization by the model, corrective feedback is sent to the model by the simulated teacher. In order to teach a new category, the simulated teacher presents three randomly selected object views of the corresponding category to the model. After teaching a new category, all of the previously learned categories are tested using a set of randomly selected unforeseen object views. Subsequently, the accuracy of category prediction is computed. In the open-ended evaluation, the model observes the 3D objects one by one, and the history of the latest 3n predictions of the model is considered for calculating the accuracy, where n is the number of learned categories. If the corresponding accuracy is higher than a certain threshold τ = 0.66 (which means that the number of true positives is at least twice the number of wrong predictions), the simulated teacher will teach a new category to the model. If the learning accuracy does not exceed the threshold τ after a certain number of iterations (100 in our experiments), the teacher infers that the agent is not able to learn more categories and the experiment stops. More details on the online evaluation protocol used in our experiments can be found in [15].
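The simulated teacher's thresholding rule (windowed accuracy over the last 3n predictions against τ = 0.66) can be sketched as follows; the toy prediction history is an assumption for illustration:

```python
from collections import deque

def ready_for_new_category(history, n_learned, tau=0.66):
    """Simulated-teacher rule: inspect the last 3n predictions (n =
    number of learned categories) and introduce a new category only
    when the windowed accuracy exceeds the threshold tau."""
    window = list(history)[-3 * n_learned:]
    if len(window) < 3 * n_learned:
        return False              # not enough evidence yet
    return sum(window) / len(window) > tau

history = deque(maxlen=1000)      # 1 = correct prediction, 0 = wrong
history.extend([1, 1, 0, 1, 1, 1])
ok = ready_for_new_category(history, n_learned=2)       # 5/6 > 0.66
pending = ready_for_new_category(history, n_learned=3)  # window too short
```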
Since the performance of open-ended evaluation may depend on the order of introducing categories and object views (randomly selected at the beginning of each experiment), 10 independent experiments have been carried out for each approach. Several performance measures have been used to evaluate the open-ended learning capabilities of the methods, namely: (i) the number of Learned Categories (#LC); (ii) the number of Question/Correction Iterations (#QCI) by the simulated user; (iii) the Average number of stored Instances per Category (AIC); (iv) the Global Categorization Accuracy (GCA), which represents the overall accuracy in each experiment. These performance measures have the following interpretations. #LC shows the open-ended learning capability of the model, answering the question: how capable is the model of learning new categories? #QCI shows the length of the experiment (in iterations). AIC represents the memory efficiency of the method: a lower average number of stored instances per category means a higher memory efficiency. AIC is also related to the learning speed; a smaller AIC means that the method requires fewer observations to correctly recognize each category. GCA shows the accuracy of the model in predicting the right category for each object.
In order to compare the methods fairly, the simulated teacher shuffles the data at the beginning of each round of experiments and uses the same order of object categories and instances for training and testing all the methods (Table 4). One important observation is that shuffling the order of introducing categories by the simulated user does not have a serious effect on the performance of Local-HDP, while it affects the performance of the other methods significantly. The longest experiment, on average, continued for 1411.20 ± 212.75 iterations with Local-LDA, and the agent was able to learn 40.60 ± 4.98 categories. Fig. 6 (right) plots the global categorization accuracy versus the number of learned categories. It was observed that the agent with Local-HDP not only achieved higher accuracy than the other methods in all experiments but also learned all the categories. It is worth mentioning that the Local-HDP experiments concluded prematurely due to the ''lack of data'' condition, i.e., no more categories were available in the dataset. This means that the agent with Local-HDP has the potential of learning more categories in an open-ended fashion. According to Table 4, the average GCA for Local-HDP is 85.23%, while it is 69.44%, 66.14% and 51.00% for Local-LDA, HDP and LDA, respectively. Fig. 7 represents the absolute number of stored instances per category in one round of the open-ended experiments. It shows that the agent with Local-HDP stored a lower or equal number of instances for all of the categories. On closer review using Fig. 6 (left), one can see that Local-HDP on average stored 6.85 instances per category to learn 51 categories, while Local-LDA stored 13.75 instances per category to learn 40.6 categories. HDP achieved third place by storing 12.76 instances per category to learn 27.20 categories, and LDA was the worst among the evaluated approaches, i.e., on average it stored 16.74 instances per category to learn 9.10 categories.
According to this evaluation, Local-HDP is competent for robotic applications with strict limits on the computation time and memory requirements.

Real-time robotic application
To demonstrate the applicability of the proposed 3D object categorization method in real-time robotic applications, we have performed two object-manipulation experiments, as shown in Fig. 8. In both robotic applications, the model is trained in an open-ended manner from scratch and the models are not pretrained.
In both demonstrations, a UR5e robotic arm is used to manipulate the objects located on a table. Moreover, a Kinect camera is fixed in front of the table to acquire the visual data for perceptual analysis. The system detects table-top objects, draws a bounding box around each of them, and assigns a tracking ID (TID) to each object (Figs. 8.b-8.d). To compute the orientation of the bounding boxes, the Principal Component Analysis (PCA) [55] algorithm has been used. First, a local reference frame is constructed by applying PCA to the normalized covariance matrix Σ of the point cloud, i.e., ΣV = VE, where E = diag(e_1, e_2, e_3) contains the eigenvalues in descending order and V = [v_1, v_2, v_3] contains the corresponding eigenvectors. Therefore, v_1 is the eigenvector along which the points of the object have the largest variance. We consider v_1 and v_2 as the X and Y axes, respectively, and define the Z axis as the cross product v_1 × v_2. The minimum and maximum values along each axis are then used for computing the oriented bounding boxes. The model does not initially have any knowledge about the category of the objects located on the table. In both scenarios, we involved a human user in the learning loop, as is necessary for human-robot interaction. In the first scenario, a user can interact with the system through the RViz [56] 3D visualization environment and assign a category label to each of the detected objects on the table. After the object category labels are introduced to the model, it can detect the category of the objects even if they have been placed in a different location on the table, which might change the object view partially due to the perspective or occlusion by other objects. Finally, the clearing task is initiated, in which, for each individual object, the end-effector of the robotic arm moves to the pre-grasp position of the target object, then grasps the object and puts it into a trash box located on the table (Fig. 8.a).
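The PCA-based oriented-bounding-box computation described above can be sketched as follows; the elongated toy point cloud is an assumption for illustration:

```python
import numpy as np

def oriented_bbox(points):
    """Oriented bounding box via PCA: the two leading eigenvectors of
    the covariance matrix give the local X and Y axes, Z is their cross
    product, and the extent is the min/max range in that local frame."""
    centered = points - points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    order = np.argsort(eigvals)[::-1]           # descending variance
    v1, v2 = eigvecs[:, order[0]], eigvecs[:, order[1]]
    axes = np.stack([v1, v2, np.cross(v1, v2)])
    local = centered @ axes.T                   # points in the local frame
    return axes, local.max(axis=0) - local.min(axis=0)

# Elongated toy "object": widest along x, flattest along z.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
axes, extent = oriented_bbox(points)
```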
This demonstration showed that the system was able to detect different object categories and learned about new object categories on-site using very few examples. Furthermore, we observed that the proposed approach was able to distinguish geometrically very similar objects from each other (e.g., Cup vs. CokeCan). A video of this robotic demonstration is available at: https://youtu.be/YPsrBpqXWU4
The second robotic demonstration places more emphasis on category recognition of unforeseen objects and on performing a category-specific robotic task. In this demonstration, a user interacts with the system through voice commands and introduces the objects initially located on the table. The segmented point clouds of these table-top objects are used to train the model. Subsequently, three new objects are spawned on the table in the Gazebo simulator [57]. After detecting each new object, the system tells the predicted category to the user and asks for corrective feedback in case of a wrong prediction. This way, the system learns about new object categories incrementally and updates the category models whenever a misclassification occurs.
After recognizing all object categories, the user commands the robot to clear all the coke cans from the table and put them into the trash box located on the table. To accomplish this task, the robot must detect the pose as well as the label of all objects. The robot then grasps and removes all the coke cans from the table while leaving the objects of the other categories in place (Fig. 8.c). A video of this robotic demonstration is available at: https://youtu.be/otxd8D8yYLc

Conclusion
We proposed a non-parametric hierarchical Bayesian model called Local Hierarchical Dirichlet Process (Local-HDP) for interactive open-ended 3D object category learning and recognition. Each object is initially represented as a bag of visual words and then transformed into a high-level representation of conceptual topics.
We have conducted an extensive set of experiments in both offline and open-ended scenarios to validate our approach and compare its performance with state-of-the-art methods. For the offline evaluations, we mainly used 10-fold cross-validation (train-then-test). Local-HDP outperformed the selected state-of-the-art approaches (i.e., RACE, BoW, LDA, Local-LDA, and HDP) by a large margin, while achieving suitable computation time and object recognition accuracy. For the open-ended evaluation, we developed a simulated teacher to assess the performance of all approaches using a recently proposed test-then-train protocol. The results show that the overall performance of Local-HDP is better than the best results obtained with the other state-of-the-art approaches.
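A simulated-teacher loop of the test-then-train kind can be sketched roughly as follows. This is an illustrative simplification, not the exact protocol used in the experiments; the `model` interface (`learn`/`predict`), the threshold `tau`, and the sliding-window size are hypothetical placeholders.

```python
import random

def simulated_teacher(model, dataset, tau=0.67, window=30):
    """Illustrative sketch of an open-ended test-then-train protocol:
    the teacher introduces categories one by one, always tests the model
    before giving the correct label, and moves to the next category once
    the recent accuracy over a sliding window exceeds tau."""
    categories = list(dataset.keys())
    random.shuffle(categories)
    known, history = [], []

    for cat in categories:
        known.append(cat)
        # Teach: present a first labeled example of the new category.
        model.learn(dataset[cat][0], label=cat)

        # Test-then-train on random instances of the known categories.
        while True:
            c = random.choice(known)
            instance = random.choice(dataset[c])
            prediction = model.predict(instance)   # test first
            history.append(prediction == c)
            if prediction != c:
                model.learn(instance, label=c)     # corrective feedback
            recent = history[-window:]
            if len(recent) == window and sum(recent) / window >= tau:
                break  # known set mastered; introduce the next category
    return len(known)  # number of categories learned
```

The defining property of test-then-train is that every instance contributes to the accuracy estimate before its label is revealed, so the protocol measures generalization to unseen instances rather than memorization.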
Local-HDP can autonomously determine the number of topics, whereas finding a good number of topics is a non-trivial task in LDA-based approaches. In Local-LDA, for instance, the number of topics must be defined in advance and is the same for all object categories, which may lead to overfitting or underfitting of the model. Local-HDP resolves this issue by determining the number of topics for each category based on the intra-category variation of its objects. Furthermore, adapting online variational inference to the proposed approach enables Local-HDP to rapidly approximate the posterior for large datasets.
In order to demonstrate the applicability of the proposed approach in real-time robotic applications, two robotic demonstrations have been conducted using a UR5e robotic arm. These experiments showed that the robot was able to learn new object categories using very few examples over time by interacting with non-expert human users.
As a continuation of this work, we would like to investigate the use of the proposed method for graspable part segmentation of 3D objects. This way, we can address the problems of 3D object recognition and affordance detection (i.e., detecting graspable parts) simultaneously.