Visual Navigation Based on Language Assistance and Memory

In order to remove outdoor mobile robots' dependence on geographic information systems (GIS) and to achieve automatic navigation in complex and changeable scenes, we propose a method that selects landmarks and adds prompt guidance so that the mobile robot can navigate relying on visual-language cues and memory. Visual-language cues guide the direction of the mobile robot's movement, following human annotations and the robot's memory of the scene, where memory refers to the strategy of selecting passed-by landmarks along the route and remembering their scene features. When passing a landmark, the agent can determine its position, match the landmark, and carry out the corresponding action. Experiments showed that the proposed method achieves autonomous navigation without GIS and outperforms existing methods.


I. INTRODUCTION
At present, most outdoor mobile robots move on the road with the help of a geographic information system (GIS) [1], as in self-driving [2]. However, GIS cannot cover all areas, such as complex urban roads, downtown streets, mountain trails and park sites. In real life, human beings can reach the places they want to go without the help of GIS. In the process of determining a route, they do not need to remember every object that they pass by. They often choose to remember some key landmarks as reference nodes, and then connect these nodes in sequence as the guide route for the next travel direction. When humans cannot find the direction to move in, they also seek help from experienced people (experts) for a feasible route to guide them forward. Inspired by this, we propose a novel navigation method in which a visual-language description and memory guide the action route of the mobile robot.
Visual navigation based on language assistance and memory (VNLM) mainly includes two aspects: visual language navigation and visual relationship detection. Visual language navigation removes the dependence on map navigation while the mobile robot is moving and guides its movement by integrating the language description with the observed scenes. Visual relationship detection (VRD) [3] is applied to extract the scene features of objects in order to determine landmarks and select how to execute an action. At present, visual language navigation systems are based on a single language description, which is suitable for short-distance navigation; over long distances the language description becomes very long and navigation performance degrades. To solve these problems, this paper proposes to divide the description language into several segments. In this way, connecting the route segment by segment can lengthen the reachable distance while maintaining the success rate. In addition, we select key nodes on the road as VRD features to make the route accurate. Just like humans, who, when uncertain about the right route or their own location, use obvious landmark objects to confirm both, the agent uses landmarks to re-establish its position. The two methods combined enable the robot to reach long-distance target locations.

A. CHARACTERISTIC OF LANDMARKS
In the process of traveling, people always remember some landmarks, such as buildings, parks and cinemas. The characteristics of landmarks include the following: 1) Stability: a landmark is little affected by the natural environment and maintains a relatively stable location and appearance, such as the electric pole in Figure 1. 2) Combination with the surrounding environment: a single object often cannot guarantee its uniqueness over a large area. As Figure 1 shows, it is difficult to distinguish the electric pole beside the road from other electric poles by its own appearance alone; we need to associate the pole with the surrounding objects to remember the scene features. For example, there is a call box near the pole and a trash can between the call box and the pole in Figure 1. 3) Fault tolerance: the landmark is still there even if some objects around it have changed. For example, the trash can near the electric pole may be moved to the other side or taken away. Although the scene features have changed, this does not affect the final judgment.

B. ROUTE GUIDANCE TABLE
The route guidance table is generated by describing the actual scenes along the road. When the robot arrives at a landmark, the scene features of the landmark are extracted by a feature extractor and saved with an ID in memory. As shown in Table 1, the ID corresponds to the ID of the landmark feature, and the label is a flag that determines which style of action to execute. The mobile robot calls the corresponding algorithm according to the label while moving.
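To make this concrete, the following is a minimal sketch of how such a route guidance table could be held in memory. The field names, dimensions and example entries are illustrative assumptions, not the paper's actual data layout.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GuidanceEntry:
    landmark_id: int      # ID of the stored landmark feature (Table 1)
    label: int            # 0: predict the action with VNLM, 1: execute the stored direction
    feature: np.ndarray   # landmark feature vector extracted by the VRD module
    direction: str        # stored direction, e.g. "turn_left", used when label == 1

# The table is traversed in order as the robot moves along the route.
route_guidance_table = [
    GuidanceEntry(0, 1, np.zeros(256), "turn_left"),
    GuidanceEntry(1, 0, np.zeros(256), ""),  # label 0: hand over to VNLM prediction
]
```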

C. RUNNING PROCESS OF MOBILE ROBOT
The mobile robot obtains guidance information and executes actions as follows: 1) It shoots images of the landmark object to form a scene graph and uses VRD as the feature extractor to generate the landmark feature vector. 2) While moving, it recognizes the language prompts and the target objects in order and matches them to the operation instructions of the landmark objects. 3) If the label of the matched item is 0, VNLM is used to predict the action according to the action description; if the label is 1, the stored direction is executed directly.

D. CONTRIBUTIONS
We propose a new method, VNLM, which performs better than current state-of-the-art methods. The contributions of our work are summarized below: 1) We design a novel module for extracting landmark features, which effectively filters out irrelevant objects to address the disturbance caused by similar objects during landmark matching. 2) We propose a new framework that fuses visual semantics, language descriptions and landmark features to predict actions. 3) We design a route guidance table integrated into the VLN model that divides a road into multiple sections, each described by its own action description, which effectively addresses the drop in success rate caused by overly long action descriptions.
4) Experiments on the lookscene dataset show that our method performs better than current state-of-the-art methods, especially for long routes.

II. RELATED WORK
This paper mainly studies two tasks: visual language navigation (VLN), with which robots understand scenes and action descriptions to determine the direction of movement, and visual relationship detection (VRD), which extracts the features of objects and scenes.

A. VISUAL LANGUAGE NAVIGATION
Visual language navigation tasks have attracted widespread attention, since they are both widely applicable and challenging [4]. An agent is required to follow the navigation directions by understanding both the guiding language and the visual input.
Recently, researchers have proposed numerous tasks in this area [5], [6], [7], [8]. Earlier work [9] combined model-free [10] and model-based [11], [12] reinforcement learning to solve VLN. Anderson et al. [7] proposed the room-to-room dataset, the first VLN benchmark based on real imagery. Angel et al. [13] proposed a novel method for VLN tasks, called speaker-follower, which converts panoramic images into datasets and visual states, augmenting data and reasoning in supervised learning. Ma et al. [12] showed that additional training signals can be gained by explicitly estimating the progress toward the goal (referred to as self-monitoring). These methods use fixed interpreted languages and the moving process cannot be changed, which limits them to familiar scenes over short distances. Chen et al. [6] studied visual language navigation in Google Street View panoramas and proposed the TouchDown method. Herman et al. [7] introduced a new idea of generating instructions from Google Maps directions to guide agents toward a destination. Both of them employed a nav-graph to obtain the source data of panoramic images, constraining the scope of the agent's activities.

B. VISUAL RELATIONSHIP DETECTION
Visual relationship detection [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27] is also considered a scene graph generation task. With the development of deep learning, VRD methods have been introduced to detect the relationship between pairs of objects in images with neural network architectures [26], [27], [28]. Lu et al. [26] introduced a landmark dataset and a VRD method that established the basic template for many follow-up works, combining the appearance features of subjects and objects with a language prior to determine the occurrence of relationship triplets. Zhang et al. [27] introduced a network model that translates visual features into embedding vectors, interpreting relationships in subject-object pairs. Dai et al. [28] designed a deep relational network specifically for exploiting the statistical dependencies between objects and their relationships.

III. METHOD
A. VISUAL RELATIONSHIP DETECTION
As shown in Figure 2, the mobile robot selects a landmark at a certain distance so that it can find the landmark accurately next time, and the surrounding related target objects are used as the characteristic scene of the landmark. After shooting a scene picture of a landmark through the camera, our method uses yolov5 to obtain the bounding-box coordinates of the landmark and nearby objects, denoted as box_{i∈n}. The feature scene is described in three parts, as follows: 1) Regional features of the visual scene. 2) Spatial relationship features between objects. 3) Semantic features of the landmark. In addition, some of the most probable subject-object relationships are extracted from the common knowledge graph to complete the scene graph.

1) REGIONAL FEATURE OF VISUAL SCENE
A scene image is input into ResNet-50 [29], which is employed as the feature extractor; we use the layer_3 feature map of ResNet-50 as feat. The bounding boxes of the landmark and of the surrounding target objects, denoted box_{i∈n}, together with feat are input into ROI_Pooling to calculate the corresponding image-region feature vectors {v_1, v_2, ..., v_n}, as in (1).

The fused regional feature between the landmark and the surrounding objects is the key to obtaining the predicate. We take the upper-left and lower-right corner coordinates of the two boxes to compute the shared coordinate box of the two objects, box_union, which is taken as the input of (1) to obtain the shared-region feature vector union_v_ij. After a linear transformation, the feature map of each region is flattened to obtain the feature vector of each object. The resulting triplet of region feature vectors (f_i, f_ij, f_j) is then fused into the visual region feature v_so of the triplet.

2) SPATIAL RELATIONSHIP FEATURE BETWEEN OBJECTS
The spatial relationship feature between objects is essential, as it expresses the actual degree of proximity, which is the positional embodiment of the relationship predicate in visual relationship detection. We obtain the upper-left corner coordinates of the subject (x_s, y_s) and of the object (x_o, y_o). We also encode the degree of overlap of the two bounding boxes and the ratios of their widths and heights, yielding the spatial feature vector loc_so, where loc_so ∈ R^300.
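The region and union-box computation could be sketched as follows, assuming a PyTorch/torchvision setup; the box format, the pooled output size and the stride value (16 for ResNet-50 layer_3) are assumptions rather than the paper's exact configuration.

```python
import torch
from torchvision.ops import roi_pool

def region_features(feat, boxes, out_size=7):
    """feat: layer_3 feature map of ResNet-50, shape (1, C, H, W).
    boxes: (n, 4) tensor of [x1, y1, x2, y2] in image coordinates."""
    # spatial_scale maps image coordinates onto the layer_3 feature map (stride 16)
    pooled = roi_pool(feat, [boxes], output_size=(out_size, out_size), spatial_scale=1.0 / 16)
    return pooled  # (n, C, out_size, out_size), one region feature per box

def union_box(box_i, box_j):
    """Shared (union) box of subject and object from their corner coordinates."""
    x1 = torch.min(box_i[0], box_j[0]); y1 = torch.min(box_i[1], box_j[1])
    x2 = torch.max(box_i[2], box_j[2]); y2 = torch.max(box_i[3], box_j[3])
    return torch.stack([x1, y1, x2, y2])
```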

3) SEMANTIC FEATURE OF THE LANDMARK
Semantic label information plays an important role in relationship reasoning. The embedding word vectors of the subject and object labels are generated with word2vec [30]. The word embeddings of the subject and object are defined as sem_s and sem_o, respectively, with sem_s ∈ R^300 and sem_o ∈ R^300. They are combined to predict the relational predicate embedding sem_so, where sem_so ∈ R^256.
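A minimal sketch of this mapping, assuming a single fully connected layer that turns the two 300-d label embeddings into the 256-d predicate embedding (the dimensions follow the text; the exact architecture is an assumption):

```python
import torch
import torch.nn as nn

class SemanticPredicate(nn.Module):
    """Maps the word2vec embeddings of the subject and object labels to sem_so."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(300 * 2, 256)

    def forward(self, sem_s, sem_o):
        # concatenate subject and object label embeddings, then project to 256-d
        return torch.relu(self.fc(torch.cat([sem_s, sem_o], dim=-1)))
```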

4) FEATURE FUSION
The visual-region feature vector v_so, the spatial feature vector loc_so, and the semantic feature vector sem_so are concatenated and taken as the input of a linear function for feature fusion, obtaining the fused feature vector through a two-level fully connected transformation. The formula is defined as:

φ(I, (s, p, o)) = Linear(ReLU(Linear([v_so; loc_so; sem_so])))   (7)
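A sketch of Eq. (7) as a two-level fully connected fusion module; the hidden and output sizes are assumptions.

```python
import torch
import torch.nn as nn

class RelationFusion(nn.Module):
    """Two-level fully connected fusion of visual, spatial and semantic features (Eq. 7)."""
    def __init__(self, v_dim=256, loc_dim=300, sem_dim=256, hidden=512, out_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(v_dim + loc_dim + sem_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, v_so, loc_so, sem_so):
        x = torch.cat([v_so, loc_so, sem_so], dim=-1)   # [v_so; loc_so; sem_so]
        return self.fc2(torch.relu(self.fc1(x)))        # Linear(ReLU(Linear(.)))
```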

5) TRAINING LOSS
To generate multiple predicate relationships between two objects, a ranking loss is taken as the loss function.
We denote the set of subject-object pairs as π_pair and define all of the annotated predicates for a pair as π_r. In a given image, not all visual relationship triplets are labeled; there are many unannotated predicates between a pair of objects. We denote the set of unannotated relationship triplets as R̃ = {(s, p̃, o) | (s, o) ∈ π_pair, p̃ ∉ π_r}. To address this incomplete-annotation problem, we use the ranking loss function proposed in [31], which ranks annotated predicates above unannotated ones; c_s denotes the label of the subject and c_o the label of the object.
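As a rough illustration only, a generic margin-based ranking loss over annotated versus unannotated predicate scores could look like the sketch below; the exact formulation used in [31] may differ.

```python
import torch

def ranking_loss(scores_pos, scores_neg, margin=1.0):
    """Scores of annotated predicates (scores_pos) should exceed scores of
    unannotated predicates (scores_neg) for the same subject-object pair."""
    # pairwise hinge between every annotated and every unannotated predicate score
    diff = margin - scores_pos.unsqueeze(1) + scores_neg.unsqueeze(0)
    return torch.clamp(diff, min=0).mean()
```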

6) LANDMARK FEATURE
Finally, the relationship graph G(V, E) is generated. V represents the entity nodes in the graph, i.e., the objects in the scene, and E represents the edges, i.e., the spatial-semantic relationships between pairs of objects. A message-passing graph neural network is employed to extract the node feature vectors of the graph: node_i^k is computed from node_i^{k-1}, the feature of node i in layer k-1, and the edge features e_{i,j} ∈ R^256 from node i to node j.
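One round of message passing over G(V, E) could be sketched as follows. The node/edge dimension of 256 follows the text; the aggregation and update operators (sum of messages, GRU cell) are generic assumptions, not the paper's exact rule.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One message-passing round over the landmark scene graph G(V, E)."""
    def __init__(self, dim=256):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # message from (neighbour node, edge) pair
        self.upd = nn.GRUCell(dim, dim)      # node_i^k from node_i^{k-1} and aggregated messages

    def forward(self, nodes, edges, adj):
        # nodes: (N, dim); edges: (N, N, dim); adj: (N, N) 0/1 adjacency matrix
        neigh = nodes.unsqueeze(0).expand(len(nodes), -1, -1)      # neigh[i, j] = node_j
        msgs = self.msg(torch.cat([neigh, edges], dim=-1))          # (N, N, dim)
        agg = (msgs * adj.unsqueeze(-1)).sum(dim=1)                 # aggregate over neighbours
        return self.upd(agg, nodes)                                 # updated node features
```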

7) FEATURE VERIFICATION
While traveling, the agent constantly detects the surrounding objects and compares them with the ordered landmarks in Table 1. To reduce the amount of computation, in the first step we check whether the detected object label exists in Table 1; in the second step, we compare the feature vector of the detected object with the stored target-object feature matrix. To avoid a wrong choice, the judgment is made by calculating the distance between the feature vectors, where x_p^k denotes the k-th feature component of the detected object and x_T^k denotes the k-th feature component of the stored target landmark.
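The two-step verification could be sketched as below, using a Euclidean distance in the spirit of the distance function in the text; the table entry layout and the threshold value are assumptions.

```python
import numpy as np

def feature_distance(x_p, x_t):
    """Euclidean distance between the detected feature x_p and a stored landmark feature x_t."""
    return np.sqrt(np.sum((np.asarray(x_p) - np.asarray(x_t)) ** 2))

def verify_landmark(det_label, det_feat, table, threshold=0.5):
    """Two-step check against the route guidance table: cheap label test first,
    then feature distance. Entries are assumed to be (label, feature, action) tuples."""
    for label, feature, action in table:
        if det_label == label and feature_distance(det_feat, feature) < threshold:
            return action
    return None
```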

B. VISUAL LANGUAGE NAVIGATION
1) LANDMARK
We represent a walking route in the scene as a connection of multiple nodes and intercept the scene image corresponding to each node as a local landmark feature. In addition, we apply a method that uses local nodes, which effectively improves the model's ability to recognize languages and scenes.

2) ACTION DECODING
We record the action taken at each time step t as a_t, and the history action encoding is denoted as A_t = {a_0, a_1, ..., a_{t-1}}, with a <start> token placed before a_0. Attention is applied in our model. The attention mechanism involves three feature values: Q, K and V. Q queries the relationship between an input and the other inputs; K is provided to the others so that they can find their relationship with this input; V represents the input features, which are linearly combined with the weights generated by Q and K.
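For reference, the standard scaled dot-product attention that this Q/K/V description corresponds to can be written as a short sketch:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Scaled dot-product attention: Q queries the other inputs, K exposes each
    input to the queries, and V carries the features that are linearly combined
    with the weights produced by Q and K."""
    scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ V
```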

3) FEED-FORWARD MULTI-LAYER PERCEPTRON
Multi-head attention is used to create temporal dependencies over the instructions; the index i denotes the i-th head of the multi-head attention module.

4) SCENE FEATURE EXTRACTION
For collecting semantic visual features, we pre-train ResNet-50 to encode the scene; the ResNet-50 scene encoder is denoted as (12).
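A minimal sketch of such a ResNet-50 scene encoder, assuming PyTorch/torchvision and dropping the classification head; the exact truncation point and image size are assumptions.

```python
import torch
import torchvision

# Pre-trained ResNet-50 used as the scene encoder; keep the convolutional trunk only.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
scene_encoder = torch.nn.Sequential(*list(resnet.children())[:-2])
scene_encoder.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)   # one scene image (placeholder input)
    feat = scene_encoder(image)           # (1, 2048, 7, 7) scene feature map
```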

5) LANDMARKS FEATURE EXTRACTION
As shown by the yellow branch in Figure 3, we use VRD to extract the landmark features of the surrounding environment. We then use the current landmark feature to look up Table 1, applying (11). When the distance between the feature vectors is lower than the threshold and the label is 1, the direction stored in Table 1 is executed. If the above condition is not met, the landmark feature X′ is input into F_{s-a}, where F_{s-a} = Atten(Q, K, V), to obtain the hidden feature L^v_{sa}. The self-attention layer is followed by a feed-forward multi-layer perceptron.

6) LANGUAGE EMBEDDING
We introduce an LSTM to extract the hidden language semantic embedding L_f, which is then passed through the self-attention layer (18) to obtain the hidden embedding L^l_{sa}. As shown in (19), L^l_{sa} is taken as the input of a feed-forward multi-layer perceptron, and the resulting hidden semantic embedding is added to L^l_{sa}.

7) MULTIMODAL SCENE FEATURES
As mentioned above, we input the scene images into (12) to extract the scene features. A self-attention layer (20) then produces the hidden feature F^s_{sa}, which is taken as input to the cross-attention in (21). We apply a feed-forward multi-layer perceptron to obtain the feature matrix F^s_f, and use (23) to fuse F^s_f.
We take the above features, concatenated in series, as the input feature matrix of a recurrent network to predict the action to execute. The recurrent network takes a concatenation of these features (including the action encoding, the scene and landmark features, and the language embedding mentioned above) as input and predicts an action a_t from the recurrent hidden state through the transformation matrix W_a and the bias b_a.
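A sketch of such a recurrent action decoder; the fused feature size, hidden size and number of actions are assumptions, and the softmax head with W_a and b_a follows the description above.

```python
import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    """GRU-based action decoder over the concatenated multimodal features."""
    def __init__(self, feat_dim=1024, hidden_dim=512, num_actions=4):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_actions)   # W_a and b_a

    def forward(self, fused_feat, h_prev):
        h_t = self.gru(fused_feat, h_prev)                     # track spatial movement over time
        return torch.softmax(self.head(h_t), dim=-1), h_t     # action distribution a_t
```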

IV. ANNOTATION AND DATASET STATISTICS
In the literature, several datasets have been created for similar tasks, such as the R2R dataset [5], the TouchDown dataset [6] and Talk2Nav [32]. We adopt the lookscene dataset; as shown in Table 2, lookscene is compared to the prior datasets.

A. R2R
In the R2R dataset, the language description is annotated for a route by asking the user to navigate the entire path; the dataset contains the route of each step described in language together with the corresponding real-time scene images that change during movement.

B. TOUCHDOWN
In the TouchDown dataset, the agent must first navigate according to the annotations in a real-life visual urban environment, and then identify a location described in natural language to find a hidden object at the goal position. The data contain 9,326 examples of English instructions and spatial descriptions paired with demonstrations [6].

C. TALK2NAV
Talk2Nav gathers 43,630 location nodes, which include GPS coordinates (latitude and longitude), bearing angles, and 80,000 edges, covering a 10 × 10 km area of New York City (NYC). From all those locations, 21,233 outdoor street-view images were compiled, one for each road node [32].

D. LOOKSCENE
To retrieve city route data, lookscene uses Baidu's APIs to obtain the metadata of locations and street-view images. Lookscene includes 28,036 landmarks, each consisting of scene features and latitude/longitude coordinates, and 86,567 navigation routes that describe the travel process in language together with the corresponding scene image data. As shown in Table 2, lookscene contains more panoramic views than Talk2Nav, and its vocabulary size of 5,007 is larger than TouchDown's.

V. IMPLEMENTATION DETAILS
A. LANGUAGE DESCRIPTIONS
For word classification, we first train word2vec on the available vocabulary, which is used to convert language annotations into word embeddings. We use the alignment between a given language instruction X and its corresponding set of landmark and directional instruction segments in the training set of the lookscene dataset. Word2vec is used to classify each word token of a navigation instruction as either a landmark description or a local directional instruction. At the inference stage, a binary label is predicted for each word token. We convert the sequence of word-token classes into segments T(X) by simply grouping adjacent tokens that have the same class. A few word-token segmentation results are shown in Table 3.
Language instructions thus consist of landmark descriptions and local directional instructions, from which word embeddings are extracted.
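The grouping of adjacent tokens with the same predicted class into segments T(X) is straightforward; a minimal sketch (class names and the example are illustrative only):

```python
from itertools import groupby

def group_segments(tokens, classes):
    """Group adjacent word tokens that share the same predicted class
    (landmark description vs. local directional instruction) into segments T(X)."""
    segments, i = [], 0
    for cls, run in groupby(classes):
        run = list(run)
        segments.append((cls, tokens[i:i + len(run)]))
        i += len(run)
    return segments

# group_segments(["turn", "left", "at", "the", "red", "pole"],
#                ["dir", "dir", "dir", "lm", "lm", "lm"])
# -> [("dir", ["turn", "left", "at"]), ("lm", ["the", "red", "pole"])]
```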

B. SCENE FEATURES
We experimented with ResNet-50 pretrained on ImageNet to extract image features and a VRD architecture pretrained on the Visual Genome dataset to obtain the object-relationship features of the scene. We define a pretext task using the street-view images from the lookscene dataset to learn the weights of ResNet-50. The four direction images of each point are fused into a comprehensive scene map. Given two street-view images with an overlap in their visible view, the task is to predict the difference between their bearing angles and the projection of the line joining the two locations onto the bearing angle of the second location. We frame the problem as a regression task of predicting these two angles, which encourages ResNet-50 to learn the semantics of the scene. The training set for this pre-training task was compiled from our own training split of the dataset.

VI. EXPERIMENTS
We conducted several experiments, showing that our algorithm outperforms existing state-of-the-art methods. Our model was trained and evaluated on the lookscene dataset, which we split into three parts: 70% for training, 20% for testing, and 10% for validation. Following Anderson et al. [33], we use three metrics to evaluate our method: SPL, the success rate weighted by path length; Navigation Error, the distance between the point of arrival and the goal at the end of the path; and Total Steps, the total number of steps needed to reach the goal successfully.
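For reference, SPL as defined by Anderson et al. can be computed as follows; the variable names are ours.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length:
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    where S_i is the success indicator, l_i the shortest-path length and
    p_i the length of the path actually taken."""
    n = len(successes)
    return sum(s * l / max(p, l)
               for s, l, p in zip(successes, shortest_lengths, path_lengths)) / n
```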
To evaluate the methods, we also compared the performance of our model with that of current methods at different difficulty levels, with one to five landmarks on paths of different lengths; the difficulty level of a route depends on its length and the number of landmarks. We used five-landmark routes for the primary training and testing experiments.

1) COMPARISON TO PRIOR WORKS
To make a fair comparison with prior works, we used the same image and language features in all cases, employing our model to extract them. For self-monitoring and speaker-follower, the panoramic view of the environment is discretized into eight view angles at 45-degree intervals.
The city graph in the lookscene dataset defines the navigable directions at each location. Experiments on lookscene show that our method outperforms the prior methods: our approach improves SPL from 11.09% to 12.16%, reduces the navigation error from 624.82 m to 589.74 m, and reduces the total steps from 40.1 to 39.4, as shown in Table 4. Because our model integrates an attention mechanism that can effectively remember the actions taken in the scene, it outperforms current methods.
We also studied how long-range navigation impacts performance, using four difficulty levels with one to four landmarks. As shown in Table 5, our method outperforms the prior approaches under SPL: it exceeds self-monitoring by 1.1% for routes with one landmark, by more than 1.2% for routes with two landmarks, and by more than 0.5% for routes with three landmarks. Because our model uses VRD to extract landmark features and eliminate the interference of similar objects, it performs better than the other methods.

2) ANALYSIS
There are several reasons for the good performance of our method compared to the other methods. We decompose the navigation instructions into landmark descriptions, and the route is completed step by step. The attention map is defined over language segments instead of individual English words, and the two purpose-built matching modules make our method suitable for long-range visual language navigation. Because the instructions and the design work together, the agent can attend to the right places without getting lost easily as it moves. To predict an action, previous methods focus on fusing the image and language at each step, which is suboptimal: the navigation instructions refer both to the view of the scene and to the corresponding action. In our work, the extracted image and language features are fused with the attention mechanism, and the resulting vector and action features are input into a GRU to keep track of the spatial movements.

3) MEMORY
We studied the results with implicit memory, no memory, and explicit memory in the form of a trajectory of GPS coordinates. The features were extracted according to (x_i, y_i), i ∈ {1, 2, ..., n}, by our method, and the results are listed in Table 6. As a sequence-to-sequence model, our model has implicit memory of the history of visual observations and actions.
Encoding explicit memory as a trajectory of GPS coordinates improved SPL from 7.13% to 9.54% compared with our method without memory.
Adding an explicit memory encoder to our method improves SPL by more than 2%; as shown in Table 6, GPS memory performs better than implicit memory in both SPL and navigation error. In addition, experiments showed even better performance when using a top-view trajectory map representation as the explicit memory. Furthermore, we evaluated the effect of memory type on navigation ability under different trajectory lengths and thresholds. When GPS information is encoded as a coordinate sequence, navigation performance in SPL decreases as the trajectory becomes longer; because the memory module draws a long traversal path in the external memory, route repetition occurs, which lowers the recognition rate and increases the navigation error rate.

4) NAVIGATION AT DIFFERENT DIFFICULTY LEVELS
Routes of different lengths contain different numbers of landmarks for training and testing our model. As can be seen from Table 7, our method performs better than the other methods when a single landmark is set, and performance decreases as the number of landmarks increases. The last line shows that, after mixed-route training across all difficulty levels, navigation achieves good results under all conditions and still performs better than the other methods.

5) SEGMENTS
Compared with a single sentence that describes the whole path, we express the description language as a list of segments, dividing it into several parts. Connecting the route segment by segment lengthens the reachable distance while maintaining the success rate: as more content is packed into a single sentence, the recognition error rate of the model increases. Long-distance navigation with better efficiency can thus be achieved by dividing the sentence.
We express the same description statement in four ways, dividing the long description into different numbers of small segments. As can be seen from Figure 4, when the sentence is divided into three short description sentences, our method achieves the highest success rate. In addition, we select key nodes on the road as VRD features to make the route accurate; just like humans, who, when uncertain about the right route or their own location, use obvious landmark objects to confirm both, the agent uses landmarks to re-establish its position. Thus, the distance can be lengthened while the success rate is maintained.

VII. CONCLUSION
In this paper, we proposed a novel method, VNLM, for automatic navigation, which fuses language descriptions and scene features. In the model, we integrated an explicit memory framework for reaching the destination, as well as a soft attention model over the language segments controlled by an adaptive computational GRU. In addition, we divide a long description sentence into several segments to extend the reachable distance while maintaining the success rate. To resolve the confusion that arises when the mobile robot encounters objects similar to a landmark while moving, we added a visual relationship detection model with common-sense reasoning to extract the features of the landmark and its surrounding environment for assistance. Compared to the current state-of-the-art approaches on lookscene, our method performed better in the experiments. In future work, we will attempt to extend our method to autonomous vehicles, hoping that visual navigation can play a greater role in the real world.