Distributed stereo vision‐based 6D localization and mapping for multi‐robot teams

Joint simultaneous localization and mapping (SLAM) constitutes the basis for cooperative action in multi‐robot teams. We designed a stereo vision‐based 6D SLAM system combining local and global methods to benefit from their particular advantages: (1) Decoupled local reference filters on each robot for real‐time, long‐term stable state estimation required for stabilization, control and fast obstacle avoidance; (2) Online graph optimization with a novel graph topology and intra‐ as well as inter‐robot loop closures through an improved submap matching method to provide global multi‐robot pose and map estimates; (3) Distribution of the processing of high‐frequency and high‐bandwidth measurements enabling the exchange of aggregated and thus compacted map data. As a result, we gain robustness with respect to communication losses between robots. We evaluated our improved map matcher on simulated and real‐world datasets and present our full system in five real‐world multi‐robot experiments in areas of up 3,000 m2 (bounding box), including visual robot detections and submap matches as loop‐closure constraints. Further, we demonstrate its application to autonomous multi‐robot exploration in a challenging rough‐terrain environment at a Moon‐analogue site located on a volcano.

To tackle these challenges, we designed a framework for 6D local and global simultaneous localization and mapping (SLAM) for heterogeneous multi-robot teams. We therein combine keyframebased local reference filters (Schmid, Ruess, & Burschka, 2014b) and multi-robot graph SLAM with incremental optimization (Schuster, Brand, Hirschmüller, Suppa, & Beetz, 2015). A decoupled integration of these local and global methods allows us to benefit from their particular advantages: Local reference filters on each robot provide real-time, long-term stable state estimates that are required for stabilization, control and fast obstacle avoidance, whereas online graph optimization provides global multi-robot pose and map estimates needed for cooperative planning. Furthermore, it enables a distributed integration of high-frequency and high-bandwidth measurements and allows each robot to independently estimate its own pose and map on-board and online at all times. We thereby can distribute significant parts of the computational workload, avoid single points of failure and gain robustness with respect to interrupted communication and failure of individual robots. A novel SLAM graph topology, which we first introduced in Schuster et al. (2015), allows a better integration of the filter results according to their estimated uncertainty and independence assumptions, leading to more accurate estimates.
We equip all of our robots with stereo cameras as spacequalifiable vision sensors and employ semi-global matching (SGM; Hirschmüller, 2008) to compute dense 3D data under varying light conditions. On each robot, we aggregate the 3D data along the trajectories estimated by our local reference filters into local submaps of limited size and uncertainty. We then apply online graph SLAM to create and optimize a dense joint 3D map as well as to compute 6D pose estimates for all participating robots. In Figure 1, we present a joint 3D probabilistic voxel-grid map created by two lightweight rover units (LRU; Schuster et al., 2017) and give an impression of our multi-robot experimental setup. We exchange 3D data between robots only in the aggregated format of submaps to reduce bandwidth requirements and distribute the computational workload. Sensor data association for loop-closure generation is particularly challenging for stereo vision-based systems due to the typically narrow angle of view of their cameras compared to laser scanners. The integration of multiple measurements into local submaps with limited drift allows us to tackle this challenge: In Brand, Schuster, Hirschmüller, and Suppa (2015), we first presented a novel approach to select and match submaps, thereby computing an estimate for their relative transformation as well as for its uncertainty. For global pose and map optimization, we integrate marker-based visual robot detections as well as these submap matches as intra-and inter-robot loop-closure constraints into our SLAM graph.
In this study, we combine and summarize methods first presented in conference papers on • Local reference filters for long-term stable real-time state estimation (Schmid et al., 2014b) • Multi-robot graph SLAM for global joint localization and mapping  • Submap matching for loop-closure generation  and describe central aspects in more detail. In addition, we present novel contributions going beyond the three conference papers: • Improved map matching method to maximize the number of loop closures: We leverage the properties of our local reference filter to separate observable and unobservable system states and can thereby reduce the dimensionality of the matching problem from 6D to 4D. We evaluated it in 40 simulated and 13 real experiments and achieved an average increase in the number of map matches of 40%.
• Evaluation of our full localization and mapping system in five novel multi-robot experiments and discussion of the resources required by its individual components. Compared to our previous work , the experiments feature different robots and larger areas of up to 57 m × 53 m.
• Demonstration of the application of our localization and mapping system to autonomous multi-robot exploration in a novel experiment, during which our two LRU rovers mapped an area of approx. 650 m 2 in a rough-terrain outdoor environment at a Moonanalogue site on the volcano Mt. Etna.
We start with a survey of related work in Section 2 and give an overview over our modular system architecture in Section 3. We summarize our local reference filter for real-time vision-based F I G U R E 1 Left: Multi-robot 3D probabilistic voxel-grid map (height-colored, resolution: 10 cm, grid size: 5 m) and SLAM graph with pose covariance ellipsoids, created by our two lightweight rover units LRU1 (blue) and LRU2 (red). We provide a more detailed description in Section 7.3 and Figure 10. Right: Experimental setup with our two rovers inertial navigation in Section 4. In the subsequent Section 5, we introduce our 3D mapping system, describing its individual components from submap creation to our novel 4D map matching in the respective subsections. In Section 6, we present in detail our incremental graph SLAM with its multi-robot graph topology to connect local reference filter estimates. Afterwards, in Section 7, we discuss both the evaluation of our novel 4D map matching method as well as five extended multi-robot experiments. In the subsequent Section 8, we present an application of our system to multi-robot autonomous exploration in a novel experiment at a Moon-analogue site and discuss the lessons learned in this challenging environment.
In the final Section 9, we summarize our contributions on distributed 6D multi-robot localization and mapping and provide an outlook on topics for future work.

| RELATED WORK
Within the large body of related work on SLAM, three fundamental techniques can be identified: Kalman Filter-based methods such as Extended Kalman Filters (EKF), Rao-Blackwellized particle filters (RBPF), and graph optimization. Recent overviews on the challenges and approaches for SLAM systems in general and for multi-robot SLAM in particular are given by Cadena et al. (2016) and Saeedi, Trentini, Seto and Li (2016), respectively.
We employ an EKF in our keyframe-based local reference inertial navigation filter (LR-VINS) to guarantee bounded computation times for local state estimation, making it suitable for real-time applications such as control. EKF-based vision-aided inertial navigation systems (VINS) exist in different coupling configurations: Weiss (2012) demonstrated a loosely coupled system, fusing poses from a keyframe-based mono-SLAM with IMU measurements. Tightly coupled systems as the MSCKF (Mourikis & Roumeliotis, 2007) directly include feature measurements into an EKF. In the work by Hesch, Kottas, Bowman, and Roumeliotis (2014), consistency of the MSCKF is further improved by modifications of the propagation and measurement matrices according to nonobservability properties. To remove initial conditions for VINS systems and to reduce the effect of nonlinearities in the estimation process, Forster, Carlone, Dellaert and Scaramuzza (2017) and Lupton and Sukkarieh (2012) use IMU data preintegration. We will further discuss these works in the context of our LR-VINS in Section 4.
For the global optimization problem, we regard the technique of EKFs as less suited. As EKF SLAM models landmark-based maps as multivariate Gaussians, they typically imply a computational effort that grows quadratically with the number of landmarks (Durrant-Whyte & Bailey, 2006). While RBPFs (Grisetti, Stachniss, & Burgard, 2007) have been widely used for LIDAR-based planar localization and mapping, they are unsuitable for multi-robot 6D SLAM as the required number of particles grows exponentially with the size of the state space (Quang, Musso, & Le Gland, 2010). Graph SLAM started out with batch optimization methods (Kümmerle, Grisetti, Strasdat, Konolige, & Burgard, 2011) for offline least-squares error minimization. Incremental approaches like iSAM2 (Kaess et al., 2012) enabled its application for online robot localization and mapping.
Robot and landmark poses are modeled as nodes in a graph that reflects their, typically sparse, dependencies. They are connected via measurement constraints, represented as edges weighted according to their respective Gaussian uncertainty estimates. The worst-case computational effort on loop closures typically grows with the number of measurements and thus with traveled distance, a challenge that can be approached by constraining the optimization to local regions (Mei, Sibley, Cummins, Newman, & Reid, 2011) or removing nodes through marginalization (Williams et al., 2014). For 6D multi-robot joint localization and mapping, we consider graph SLAM to be the most promising technique. It allows a straightforward integration of inter-robot measurement constraints between intrarobot subgraphs while keeping the computational complexity manageable (Ahmad, Tipaldi, Lima, & Burgard, 2013).
While graph optimization constitutes the back-end of a SLAM system, the front-end is concerned with data association. It can, for example, be approached by using image features as identifiable landmarks (Endres et al., 2012), by matching visual keyframes (Leishman, McLain, & Beard, 2013) or through the registration of depth data (Newcombe et al., 2011). In this study, we employ visual detections of other robots as well as submap matching to generate loop-closure constraints. The creation of submaps is a technique to locally aggregate sensor data into maps of limited size (Reid & Bräunl, 2011;Vidal-Calleja, Berger, Sola, & Lacroix, 2011;Williams, Dissanayake, & Durrant-Whyte, 2002). Their origins can be attached as nodes to the slam graph, the graph optimization thereby affecting their relative transformations. This leads to a sparse graph, while allowing us to keep more information through aggregation into submaps than in keyframe-based mapping approaches (Leishman et al., 2013;Mohanarajah, Usenko, Singh, D'Andrea, & Waibel, 2015), which apply sparse temporal sampling on high-bandwidth sensor data.
Submap matching for multi-robot systems has been proposed by Williams et al. (2002) to match sparse feature-based local maps against a global model. In dense mapping, submaps typically are matched against each other as a global model becomes computationally challenging to handle. Forster, Pizzoli, and Scaramuzza (2013) and Reid and Bräunl (2011) successfully used submap matching using a brute-force correlation search on 2D and 2.5D elevation maps, respectively. This, however, requires the computational resources of a GPU even when limiting the search space to a discretization of  3 or  2 , respectively. Nagatani et al. (2011) use an iterative closest point (ICP) algorithm to match LIDAR-based 2.5D maps created by multiple robots. Similar to our approach, they match submaps of limited size, restrict the matching to neighboring submaps and use the resulting transformations as constraints for graph optimization. For 3D mapping, Labbé and Michaud (2014) and Mohanarajah et al. (2015) employ graph SLAM with SURF features for loop-closure generation. Such image-based features, however, are not robust to changes in viewpoint and illumination and thereby not well suited for heterogeneous multi-robot teams (Forster SCHUSTER ET AL. | 307 et al., 2013). On 3D geometry, ICP algorithms have been shown to work well for frame-to-frame registration (Newcombe et al., 2011;Nüchter, Lingemann, Hertzberg & Surmann, 2007). They can be used for map matching when a close initial estimate is available (Mendes, Koch, & Lacroix, 2016), for example, through known relative start positions in a multi-robot scenario (Michael et al., 2012). As ICP optimization easily becomes trapped in local minima, it is less suitable as a first step for long-range global loop-closure detection, in particular on noisy stereo data. It, however, can improve precision as a refinement step if provided with a good initial alignment. For this, 3D feature descriptors (Alexandre, 2012;Li & Guskov, 2005;Tombari, Salti, & DiStefano, 2011) have become popular to find correspondences between point clouds, a central challenge being the selection and description of robust geometric features (Yousif, Bab-Hadiashar, & Hoseinnezhad, 2014). The estimation of initial interrobot pose constraints is approached by Dong, Nelson, Indelman, Michael, and Dellaert (2015) for 2D LIDAR maps by clustering a large number of transformation hypotheses from scan matches. While this could complement our visual robot detections, the computational effort to generate transformation hypotheses would be significantly higher for 3D maps. In addition, our aggregation into submaps leads to a lower number of potential hypotheses, which might not be sufficient for a clustering-based method.
We integrate filter and graph SLAM methods to fulfill both local real-time requirements as well as to compute global multi-robot estimates. While Leishman et al. (2013) and Mohanarajah et al. (2015) also combine keyframe-based approaches with a pose graph for global localization, in contrast to our work, they explicitly represent all RGBD keyframes as nodes in their graph. We decouple the graph from such low-level states and trigger the creation of local reference frames as new graph nodes from our higher-level mapping modules. Thereby, we keep the size of the graph independent from the number of keyframes, which typically is orders of magnitude larger than the number of submaps required in our system. Williams et al. (2014) combine filter and graph SLAM by splitting a graph containing all measurements into a real-time filter part for the most recent data and a slower smoother part for past states, both running in parallel. While they are able to recover the solution of full batch optimization, they require a tight coupling between filter and smoother, working on the same types of sensor data and exchanging state information in both directions. In contrast, we propose to process sensor information at different levels of abstraction. By solely adding aggregated pose information to the graph, we keep it small for fast global optimization. In addition, we do not feed back loop-closure results into our filter, allowing it to generate smooth estimates required for stabilization and control of highly dynamic systems.
We considered several existing 3D multi-robot graph SLAM approaches, however, each of them has its own limitations, either restricting the enforcement of loop closures (Vidal-Calleja et al., 2011), assuming unlimited communication (Kim et al., 2010) or having been evaluated in simulation under simplifying assumptions (Cunningham, Indelman, & Dellaert, 2013). The latter two connect pose graphs created by multiple robots through frame-of-reference constraints represented as nodes in the graph. Cunningham et al. (2013) and Lázaro, Paz, Piniés, Castellanos, and Grisetti (2013) exchange condensed graphs between robots, the latter explicitly removing double-counted information by introducing antifactors.
Double-counting cannot occur in our proposed system as each robot adds all estimates and measurements to their own graph only once. We do not expect significant benefits from an exchange of optimized partial graphs as our combination of local reference filters with a submapping approach leads to a small joint graph in the first place.

| MULTI-ROBOT LOCALIZATION AND MAPPING ARCHITECTURE
We present an overview of our software architecture in Figure 2. It shows in detail our localization and mapping layer and indicates its connections to the robots' sensors and lower-level perception as well as to the robot control and higher-level planning components. To establish the data flow between our components, we employ three different middlewares, the first two being developed at our institute: Links and nodes to satisfy the real-time communication requirements for control, SensorNet to distribute high-bandwidth vision data over shared memory as well as the popular robot operating system (ROS) to connect high-level components, including our mapping pipeline. As sensors, we use cameras as well as an inertial measurement unit (IMU). On our ground-based robots, we additionally include wheel odometry measurements to improve the accuracy of the filter output, as indicated in Figure 2. We expect these three sensor modalities to be available on real planetary rovers, see for example the selflocalization architecture proposed by Souvannavong, Lemaréchal, Rastel, and Maurette (2010) for the ExoMars rover, which is planned to be launched in 2020. Within our perception layer, we employ semi-global matching (SGM; Hirschmüller, 2008) on an FPGA for dense stereo reconstruction and use this to compute keyframe-based visual odometry (Hirschmüller, Innocent, & Garibaldi, 2002). In addition, we perform marker-based visual detections of other robots and estimate their 6D poses (Olson, 2011). For both visual odometry and robot detections, we estimate the uncertainty and pass it to the subsequent filter and SLAM.
Our focus is on the localization and mapping layer, which contains three major modules: First, we fuse IMU and visual odometry estimates in a local reference filter (Schmid et al., 2014b) for realtime robust local state estimation, which we discuss in detail in Section 4. Second, on our ground-based rovers that operate in rough terrain, we compute a stereo-error adaptive local obstacle and terrain classification directly on depth images (Brand, Schuster, Hirschmüller, & Suppa, 2014). Performing this step early on allows a consideration of the association between original camera viewpoints and depth data, which is lost when aggregating it into maps later on.
The results can be used for fast obstacle avoidance and local path planning on local 2.5D cost maps. Third, our 6D multi-robot SLAM framework contains components to create submaps by integrating depth data along the trajectories given by our local pose estimates, for incremental online SLAM graph optimization, as well as for the generation of loop-closure constraints. For relocalization in areas previously visited by the same and by other robots, we designed a method to match local, partial maps based on the 3D geometry of the environment. We present the individual components in detail in Sections 5 and 6.
As introduced in Schuster et al. (2015), the localization and mapping modules are executed on board all robots within a multirobot team in a distributed fashion. This ensures the online availability of up-to-date pose and map estimates on each robot at all times, increasing robustness in particular in light of communication losses within the robot team. As we do not assume any initial knowledge about the starting positions of the robots, we do not introduce any shared global coordinate frame. Each robot computes its own maximum likelihood estimate for the poses of every robot in its team, given all measurement data available to it. Further, each robot adds the submaps created by the other robots to its own model of the environment as soon as connections between the robots can be made in the SLAM graph. The modularity of the localization and mapping system gives us the option to run only some of its components on resource-constrained systems like micro aerial vehicles (MAVs). In case the creation of a 3D map might not be feasible due to the sensor setup or limited computational power, such robots can still run the filter for local and the graph optimization for global pose estimation. By exchanging measurement data like robot detections and filter estimates, they can both contribute to and benefit from the joint localization with other robots in a heterogeneous team.
We illustrate exemplary applications of the local and global estimates from our localization and mapping components in the planning and control layer depicted in Figure 2. The local state estimates from our local reference filter constitute valuable input for robot control. By satisfying real-time properties and allowing high output frequencies, they are also suitable for stabilization of highly dynamic systems like multicopters (Schmid, Lutz, Tomić, Mair, & Hirschmüller, 2014a). We keep these system-critical components decoupled from our higher-level modules and thus do not feed any estimates from global optimization back into the filter. Combining its estimates with depth data from stereo vision allows us to realize local path planning with fast obstacle avoidance. Our multi-robot SLAM system runs at a slower rate than the filter, providing online estimates suitable for global path and exploration planning. In addition, in a multi-robot setup, the availability of a joint map and pose estimates for all robots supplies the foundation for coordinated cooperative action.

| LOCAL REFERENCE FILTER
The local reference filter is a loosely coupled vision-aided inertial navigation system (LR-VINS) fusing delta poses from a stereo odometry system with high-frequency data from an IMU. On our ground-based rovers, we additionally integrate high-frequency measurements of the wheel encoders in form of velocity measurements. The LR-VINS estimates 6D position and orientation, linear | 309 velocities as well as sensor biases for the accelerometers and gyroscopes employing an Extended Kalman Filter (EKF). The EKF allows to meet hard real-time constraints imposed by control systems. It has been shown that position (x, y, z) and heading angle (yaw) are unobservable within a VINS, that is, their errors and the corresponding estimated uncertainties of unobservable modes rise unbounded (Weiss, 2012). This property is challenging for EKF-based VINS: It leads to inconsistencies in global estimation and can, further, cause numerical issues in long-term operation. Inconsistencies rise with the unobservable yaw error (Bailey, Nieto, Guivant, Stevens, & Nebot, 2006) and, further, due to spurious observability caused by changing linearization points (Li & Mourikis, 2013).
These findings motivated us to split state estimation by observability: observable modes are estimated within the filter, where convergence to their real values can be expected. In contrast, unobservable modes are only locally estimated within the filter but furthermore globally optimized in the SLAM graph (see Section 6).
The local reference filter (Schmid et al., 2014b) was shown to enable consistent, long-term stable, real-time state estimation for unobservable systems. The approach defines a local state reference consisting at least of the unobservable system states. We periodically switch the reference of the filter into such a local reference frame, that is, we change its reference system, transforming all states and their corresponding covariances. In a VINS, by definition, the uncertainties of position and yaw of the reference frame drop to zero, all other uncertainties are reduced relative to the new reference. Relative measurements (as for example a delta pose) in a global frame can be turned into absolute measurements in a local frame, rendering the estimation locally observable. Thereby, the globally unbounded uncertainty for unobservable system modes becomes locally bounded. In this way, the consistency of the filter is improved. The global state is then computed within the graph optimization process, which can improve global consistency through relinearization. Even though the effect of inconsistency in the EKF was shown to be drastically reduced by the LR-VINS, spurious observability can still occur if the local reference is not measurable at all times. The findings of Hesch et al. (2014) to remove spurious observability could also be applied to the LR-VINS to further improve its consistency, if necessary.
In our localization and mapping framework, we define the local reference as the origin of a submap. The switch is actively triggered by the mapping system, as described in Section 5.1.2. For a better understanding of our framework, we summarize the general principles of the vision-based keyframe INS (Schmid, Ruess, Suppa, & Burschka, 2012) and the local reference-VINS (LR-VINS) algorithm (Schmid et al., 2014b) in the following.

| Vision-based keyframe inertial navigation
In our VINS, we (double) integrate high frequency acceleration and gyroscope measurements of the IMU within the strap down algorithm resulting in the direct system state x. Within the EKF, we estimate the indirect state δ, representing the errors of the direct state. Both states are defined as follows: (1) The direct state includes, in corresponding order, the IMU position, velocity and orientation quaternion relative to the current local navigation frame as well as the IMU accelerometer and gyroscope biases in the IMU frame. The indirect state corresponds to the direct state errors, with attitude errors being minimally parameterized as a three-dimensional orientation vector. We apply the standard INS error propagation equations within the EKF prediction step to calculate error uncertainties. The use of the IMU data preintegration model (Forster et al., 2017;Lupton & Sukkarieh, 2012) could improve linearity of the propagation process and eliminate initial state conditions. Nevertheless, besides vision, we also include highfrequency wheel odometry as velocity measurements, which would complicate preintegration between camera frames. However, we consider the integration of the model as a topic for future work.
Our visual odometry algorithm (Hirschmüller et al., 2002) provides a relative transformation measurement from a keyframe in the past to the last captured image with the according measurement noise. We realize a locally drift-free estimation by rereferencing a keyframe from the past. In total, we keep a history of = n 5 keyframes to improve robustness. To represent the keyframes in our local reference filter, we augment the system (error) state by the IMU (error) pose at the exact capture time of each camera image.
Referencing the corresponding augmented states, we process the delta pose measurements and compensate for measurement delays introduced by the vision pipeline. On our systems, the delay is approx. 250 ms. Thus the delay-compensated state estimate can be directly used for control.

| State augmentation, marginalization, and reference switching
The processing of delta pose measurements requires state cloning and marginalization. Both processing steps, as well as the switching of the filter into a new local reference frame, can be included into the prediction equation of a Kalman Filter. State augmentation is the general form of state cloning and can be written by introducing a state augmentation function g that builds the new state vectorx k at time step k consisting of the current state x k and the newly augmented state x aug (in our case a 6D pose). For simplicity we derive the state and covariance transformations in direct state representation. The corresponding transformations for the indirect state can be calculated analogously: where z k is a sensor measurement disturbed by additive white Gaussian noise with covariance R k . The filter covarianceP k for the new state vector is calculated using the Jacobian of g, evaluated at the current state estimate x k : where we define A k as the augmentation matrix and T k as the noise transformation matrix. For stochastic cloning (Roumeliotis & Burdick, 2002), the augmentation matrix has exactly one 1 per row and the noise transformation matrix vanishes.
We can apply a regular Kalman filter prediction step to the augmented state, which results for the covariance prediction in: where the augmented system matrix Φ k is an identity matrix of corresponding size with the original system matrix in the upper left corner. The augmented noise propagation matrix G k is a zero matrix of corresponding size with the original noise propagation matrix in the top rows. Q k corresponds to the covariance of the system noise.
Analog to Equation (2), we can define a transformation function f that transforms the system states at time step + k 1 (including all augmentations) into the new reference frame defined by the augmented state and, at the same time, removes the augmented reference state from the filter:̄=+ With S k being the Jacobian of f evaluated at the current state estimatē + x k 1 , we calculate the transformed state covariancē̄+ P k 1 as follows: Equation (6) can be rewritten as:̄= with Φ̃k as the modified system matrix,Q k as a combination of Q k and R k , andG k as the corresponding noise propagation matrix. In this form, Equation (7) has exactly the form of a Kalman Filter prediction step, except for the potentially non-quadratic shape of Φ̃k.
We exploit the square root UD filter for our implementation: The covariance matrix is kept in decomposed form as = P UDU T with U as unit upper triangular matrix and D as diagonal matrix. This decomposition guarantees the symmetry and positive definiteness properties of the covariance matrix and improves numerical stability.
While state augmentation and marginalization is trivial in direct covariance representation, it is not directly obvious in UDU T representation. Considering for example marginalization, rows of U are removed, destroying the quadratic form of U. Nevertheless, the size of D is still quadratic. The triangularization process for covariance propagation in square root UD form restores the quadratic shape of U and the corresponding quadratic shape of D.
Therefore, it is convenient to formulate covariance manipulations as propagation.

| Reference switching for VINS
Reference switching can be formulated by defining the function f of Equation (5) as an ordinary frame transformation. In the following, we use subscripts for vectors to indicate their frame and superscripts to indicate the frame they are expressed in. The new reference frame + N x 1 is defined by an arbitrary augmented pose with a position expressed in the current frame N x and its corresponding quaternion With these frame definitions, we introduce an example state x consisting of our IMU state and one single augmented pose defining a new frame with index + N x 1 . All other state variables are expressed in the current local frame N x or in the IMU body frame B, respectively.
Omitting the time index k, the direct and its corresponding indirect state are given by: The matrix x 1 rotates a vector from frame N x to + N x 1 and is calculated from x 1 . A frame switch does not change the reference frame of the estimated IMU biases, the corresponding transformation part in f is therefore the identity transform. Our direct state vector can be transformed using Equation (9).
For the transformation of the covariances of the indirect state, we use the relation between true value, estimated value and error as δ = − y y ŷ for vectors and as where I is the 3 × 3 identity matrix and ⌊ ⌋ … the skew operator of a 3 × 1 vector. Our direct state vector holds quaternions corresponding to a transformation Ĉ m N x (vector transformation from arbitrary frame m to frame N x ), whereas we need the inverse transformation for state switching: x . With these error definitions, SCHUSTER ET AL.

| 311
small-angle assumptions and Equation (9), we find the linearized transformations for the indirect states as: We analyze the effect of state switching for a covariance prediction given in Equation (6) for time step k. Let us assume for simplicity and without loss of generality an identity prediction matrix Φ k , an identity augmentation matrix A k and the state defined in Equation (8). We find the pure switching matrix ′ S k (without marginalization of the reference state) from Equation (10) by linearization around δ using the current estimate x of our state vector: The elements in the rows corresponding to the state defining frame + N x 1 cancel out. As during propagation the covariance matrix is multiplied from the left by S k and from the right by S k T , zero rows will cancel out all correlations of the corresponding state. Therefore, the covariance matrix will be rank deficient. By defining S k as ′ S k without its zero rows, the switching state is marginalized out during the switching operation.
After the transformation of the covariance matrix, we can apply Equation (9) on the direct state. By definition, x 1 1 is zero and represents the identity rotation. For simplicity, we analyzed a full frame switch (including the entire orientation). As roll and pitch angles are observable, in our implementation, we do a partial frame switch for position and yaw only, keeping the navigation frame horizontally aligned to the gravity vector. This is realized by manipulating the rotation matrix to include only yaw components. Please consider that for the implementation of a full state switch, including roll and pitch, the gravity vector has to be included into the state, similar to the work of Lupton and Sukkarieh (2012).

| 3D MAPPING AND MAP MATCHING
In this section, we describe our approach to create local and global 3D maps from stereo data and present our method to match these local maps to generate intra-and inter-robot loop-closure constraints for our multi-robot graph SLAM discussed in detail in Section 6.

| Map creation and composition
To create global 3D maps, we first create local submaps and then compose them, globally optimized, into a joint map. A submap is defined by a local reference frame, that is, the pose of its origin, and associated 3D data structures. It is important to note that for each new submap, we trigger a frame switch in our local reference filter.
Thus, the origin of each submap by definition coincides with a respective local reference frame in the filter. We then add the submap origins to our SLAM graph for global optimization.

| Submap generation
To create 3D submaps, we aggregate depth data and obstacle classification results along the trajectories estimated by our local reference filter. Each submap has two application-dependent 3D representations: First, we create a 3D point cloud at a resolution of 5 cm, with color information and a binary obstacle classification for each point. The latter results from an earlier step in our mapping pipeline and is computed anyway for local obstacle avoidance by a stereo erroradaptive terrain classification algorithm (Brand et al., 2014). We use the point cloud representation for map visualization and map matching, see Section 5.2. Second, we integrate the depth data into a probabilistic voxel space representation. For this, we employ the freely available open-source OctoMap library (Hornung, Wurm, Bennewitz, Stachniss, & Burgard, 2013) and create octrees with a resolution of 10 cm as a tradeoff between computational load and required precision. We currently use the resulting joint maps for autonomous exploration based on expected information gain and plan to employ them for whole-body path planning in future work. Both applications require an explicit distinction between occupied, free and unknown space, which is not available in simple point cloud representations.

| Submap partitioning
We aggregate the dense 3D data into submaps assuming that our filter estimates on each robot are locally sufficiently accurate. To satisfy this assumption, we create a new submap once the filter's estimated uncertainty grows above map resolution. Applying a threshold of 0.1 m on the standard deviation of the robot's position, we not only ensure a limited drift within each submap but, by triggering a frame switch, also a limited accumulated error in the local reference filter. In addition, we employ an empirically determined threshold of 3.5 m accumulated driven distance, restricting the size of the individual submaps to limit their memory and processing time requirements during exchange and matching. As the localization error within a submap thus is limited, we perform global corrections only between the submaps of one or multiple robots. This approach allows us to reduce the high-bandwidth 3D data by aggregating it locally into submaps and exchange only this aggregated data between robots in a multi-robot system, thus saving memory and bandwidth compared to keyframe-based approaches (e. g. see Leishman et al., 2013;Mohanarajah et al., 2015) that keep and potentially exchange the full 3D information at a higher sampling rate.

| Global map composition
To generate joint global maps, we merge the information of all of our submaps based on the latest graph SLAM estimates for their respective origins. This can be done either periodically for visualization or on demand for path and exploration planning. As the computational effort of this merging step grows with the number of submaps, we cache the combined map, except for the currently active one and thus frequently changing submap, and invalidate this cache only on significant changes of the respective estimated submap origins, that is, after new loop closures. For future work we plan to limit the exchange of submaps as well as their composition into a global map to an application-dependent area of interest, as for example proposed by Koch and Lacroix (2016). In addition, we plan to merge submaps after successful matching, similar to Mohanarajah et al. (2015), and to remove deprecated submaps when the same area has been exhaustively mapped again more recently.

| Submap matching
We perform a pairwise matching of the 3D structure of submaps to compute relative transformations accompanied by uncertainty estimates between their origins and thus generate intra-as well as interrobot loop-closure constraints. As input data, our submap matching process receives the submaps from all robots in a multi-robot team as well the latest pose and uncertainty estimates of their origins from the graph SLAM component of the robot on which it is running.
In this study, we improved and extended our map matching concepts, which we first presented in Brand et al. (2015) and applied to multi-robot systems in Schuster et al. (2015). As an extension of our previous work, we improved our matching method with the goal of maximizing the resulting number of loop-closure constraints while maintaining their level of accuracy. Our submap matching works on aggregated stereo depth data from noisy sensors with limited field of view and thus can generate only small numbers loop-closure hypotheses compared to, for example, image-feature based systems.
While our matcher is designed to minimize the number of erroneous data associations, they nonetheless are impossible to rule out. Thus, increasing the total number of loop-closure constraints is important for the graph SLAM to be able to filter such false positives as outliers, or, at least, compensate for their influence to increase the robustness of global estimation. To achieve this, we exploit the properties of partial frame switches in our local reference filters to reduce the dimensionality of the matching problem. As described in Section 4, we switch the local frame of reference in the filter only with respect to the unobservable states x, y, z and yaw. Thus, we can define the origins of all submaps to be aligned to the observable gravity vector, that is, for all of them = roll 0 and = pitch 0. This does neither restrict the content of submaps, being able to represent arbitrary 3D geometries, nor the applicability of our system to aerial robots, as each robot's observable states, including its roll and pitch angles, are F I G U R E 3 Submap matcher architecture with two threads for parallel execution of pairing and matching (numbers 1-4 indicate typical order of main data flow) [Color figure can be viewed at wileyonlinelibrary.com] SCHUSTER ET AL.

| 313
continuously estimated within the filter. We, however, are able to reduce the number of dimensions of the map matching optimization from six to four by solely estimating a relative transformation for the four remaining, unobservable degrees of freedom. Constraining the matcher's optimization steps early on to valid hypotheses with respect to roll and pitch leads to an increased number of valid matches, as we discuss in the following (Sections 5.2.4 and 5.2.5) and demonstrate in our experimental evaluation (Section 7.2).
In Figure 3, we present the architecture of our submap matching module. It runs two parallel threads, one to filter the input data and generate pairs of potentially matching submaps and another to perform the matching itself. As the processing of different pairs of submaps is independent from each other, it would be easy to further parallelize this process for multi-and many-core architectures. We implemented a similar type of parallelization across machines by executing matcher processes on multiple robots in parallel. They share their results to avoid duplicate work as well as double-counting of successful matches.
We base the pairwise matching between submaps on their point clouds and follow a two-step approach: First, we compute potential initial alignments through 3D feature matching on obstacle points and select the best model. Second, we perform an ICP step on the full point cloud for refinement and match uncertainty estimation. In Figure 4, we show a submap match, depicting the keypoints and correspondences used during initial alignment as well as the final transformation after refinement. We will explain the individual steps of pairing and matching in the following subsections.

| Keypoint selection and submap suitability check
To compute 3D features, we select distinctive keypoints based on our precomputed obstacle classification in the point clouds. We employ a 3D voxel grid filter with a bin size of 0.1 m on them to reduce the computational effort of the subsequent steps. As we argue in Brand et al. (2014) and Brand et al. (2015), arrangements of obstacles represent informative and unambiguous geometrical features in both indoor and outdoor environments, with the exception of very self-similar man-made structures. To distinguish these valuable arrangements of obstacles from simple structures like straight walls or ambiguous ones like single stones, we check the distribution of obstacle points within each submap. Therefore, we fit a 3D line model to the keypoints using random sample consensus (RANSAC) and check that at least 10 keypoints are further away from it than an empirically determined threshold of 0.5 m. We thereby replace the heuristic based on bounding boxes in the xyplane described in Brand et al. (2015) with a more general approach.
In addition, as a first step, we filter out submaps if their point cloud contains less than 1,000 points or less than 100 obstacle points. These suitability checks allow us to dismiss uninformative or ambiguous submaps before even including them into the matching process. As the keypoints for each submap are independent of all other submaps, we compute them just once and store them together with the submap data.

| Match pair generation and prioritization
In the submap pairing thread, we select potentially matching pairs of submaps. As for n submaps there are F I G U R E 4 Visualization of the initial alignment (Section 5.2.4) and refinement (Section 5.2.5) steps of a match between submaps created by LRU1 (left, blue) and LRU2 (right, red) during our multi-robot experiment #2 described in Section 7.3. The larger points indicate keypoints located on obstacles (in this case large artificial rocks, similar to those in Figure 7b) that have been used to compute correspondences for the initial alignment. a, Filtered correspondences used for initial alignment between the two submaps. b, Transformation after refinement [Color figure can be viewed at wileyonlinelibrary.com] Thereby σ Δ t denotes the relative positional uncertainty between the poses of two submap origins in terms of standard deviations in the respective dimension  t x y z { , , }. It can be obtained from the SLAM graph by expressing the uncertainty of one of the two submap origins in the pose of the other. While we plan to implement this in the future, we used a rough approximation for σ Δ t in the experiments presented in this study: j referring to the variances in the dimension  t x y z { , , } of the poses of the origins of s i and s j as estimated by our graph SLAM. Although this heuristic is fast and easy to compute, we are aware that it fails in certain cases of wellconnected and multi-robot graphs. The resulting approximation errors, however, can in the worst case lead to an over-filtering of potential matches.
We require > ′ overlap 2 m x y , 2 in order for a pair of submaps to be added to the match queue, and in addition exclude successive submaps from the same robot r, that is, s i r and + s i r 1 . Furthermore, we filter out pairs for which the estimated pose uncertainty is below the precision of the matcher. We approximate the latter by the stereo error, which grows quadratically with the distance Δd to the cameras and is significantly large for our camera setups (Brand et al., 2014). It can be estimated as We add all selected pairs of submaps to a priority working queue, prioritizing them to try the most promising pairs first. We thus rank them according to: The score consists of two parts, the first being a heuristic for the expected probability to match and the second for the expected impact of the match on global optimization. In our multi-robot experiments, we weighed them with an empirically determined α = 0.125 to make a trade-off between the two criteria.

| Feature generation
As the first step of the submap matching process, we retrieve the top element of the submap pair priority queue. We compute 3D features for the keypoints of both submaps and store them for use in later match attempts. Thus, we can avoid unnecessary computation in case a submap is never included in a match attempt. To characterize geometric features in the environment, we chose CSHOT feature descriptors (Tombari et al., 2011) as they exhibit a good tradeoff between performance and computational complexity (Alexandre, 2012). Based on SHOT descriptors (Tombari, Salti, & DiStefano, 2010), they characterize the 3D information of the local spatial neighborhood through unique signatures of histograms of orientations with respect to local 3D reference frames. They are rotational invariant and robust to noise and clutter. The CSHOT extension additionally includes color texture information, our robots are, however, only equipped with black&white cameras. It is a topic for future work to analyze its benefits, in particular for heterogeneous multi-robot systems involving different camera setups. We then cluster the correspondences into match transformations through Hough3D voting (Tombari & Di Stefano, 2010), which allows multiple hypotheses to be found and has been shown to be effective for stereo setups providing noisy 3D data. Due to the asymmetry introduced by its separation of translation and rotation, we compute the Hough3D voting two times, from s i to s j and vice versa to obtain more hypotheses and make them independent of the order of submaps. Basically, each correspondence votes for a translation hypothesis in the 3D Hough space. After this clustering step, a RANSAC registration method is used on each cluster to determine the most likely transformation, including a 3D orientation, associated with it. Compared to a closed-form singular value decomposition (SVD; Umeyama, 1991), RANSAC is more robust to large outliers that are to be expected to occur due to noise and potential symmetries in the submaps' point cloud data.

| Initial alignment: Keypoint matching
As mentioned before, the roll and pitch angles of the coordinate frames of both submap origins are zero. As both angles are well observable in our local reference filter, we expect the size of their estimation errors to be negligible, in particular compared to the accuracy of the map matching that is limited by the noise and resolution of our stereo-based point cloud data. We thus do not need to take their uncertainty estimates into account for the map matching pipeline and can limit the RANSAC optimization to We normalize the points by subtracting their meanp i and compute the vector Δp i between them. We then adapt it to Δ ′ p i such that it is orthogonal to z. The cross product of Δ ′ p i and z gives us a third orthonormal vector, defining the rotation matrix C i . With this, we compute the transformation hypothesis between the submaps with rotation matrix C and translation t as: We also restrict the refinement step to 4D by removing the roll and pitch angles from the ICP optimization problem. Similar to our adapted Hough3D voting, the early incorporation of prior knowledge to reduce the dimensionality of the optimization directs the optimizer toward the correct solution. The alternative, full 6D steps as used in our previous work , runs the risk of first converging to implausible solutions that afterwards get eliminated in post processing by applying restrictions that have been unknown to the RANSAC and ICP algorithms themselves. We demonstrate in our experimental evaluation in Section 7.2 that our 4D matching thus yields a larger number of map matches, that is, loop closures, after applying the same plausibility checks to filter potentially erroneous data associations.

| INCREMENTAL MULTI-ROBOT GRAPH SLAM
In this section, we introduce our multi-robot SLAM graph topology for the decoupled integration of local reference filter estimates.
While we first presented this graph topology in our conference paper F I G U R E 5 Schematic of SLAM graph with submap origins and bounding boxes. The overlapping highlighted rectangles represent submaps that match, resulting in a loop-closure constraint that is added to the graph [Color figure can be viewed at wileyonlinelibrary.com] Schuster et al. (2015), in this study we provide a more detailed description and extended discussion of its properties.
We employ a factor graph (Kschischang, Frey, & Loeliger, 2001) to formulate the SLAM problem in a graphical model, following the formalization used by Kaess et al. (2012). A factor graph is a bipartite Graph SLAM systems can be structured into two parts, a front-end and a back-end. While the back-end deals with the optimization problem, that is, the aforementioned minimization of a nonlinear quadratic error function, the front-end is concerned with the construction of the graph. This includes solving the data association problem as well as asserting dependencies between variable nodes by deciding on a graph topology (Grisetti, Kümmerle, Stachniss, & Burgard, 2011). In the following, we will characterize the measurement constraints relevant for our SLAM system, present our contributions to the multi-robot SLAM graph construction and describe the optimization back-end.

| SLAM front-end: Multi-robot graph topology
We consider a setup with R robots and the following six-dimensional variable nodes in the SLAM graph: • • Landmark poses l { } i (optional): l i represents the pose of the ith globally identifiable (robot-independent) landmark.
Each of them represents a 6D pose with Gaussian uncertainty. On each robot r i that runs a graph optimization module, its first submap pose s r 0 i is connected to an unary prior factor that defines its map origin, which we arbitrarily chose to be located at zero. We decided against an explicit representation of visual odometry keyframes in the SLAM graph as a design tradeoff to ensure a limited growth rate of the graph's size while allowing our local reference filter to internally use arbitrary techniques to integrate such high-frequency (20)

| Intra-and inter-robot measurements
We represent three different types of measurements as factor nodes in our SLAM graph: • Robot detections d { } i : d i represents the transformation between the poses of two different robots r 0 and r 1 . These can be determined by visually detecting r 0 from r 1 (or vice versa) and estimating its 6D pose. In our experiments, we attached planar visual AprilTag markers (Olson, 2011) to the robots in our multirobot team and detect these in the other robots' camera images, utilizing an open source detector implementation (Kaess, 2013).
The quality of pose estimates for planar markers highly varies, in particular depending on view distance and view angle, leading to pose ambiguities (Schweighofer & Pinz, 2006). We therefore perform a worst-case error approximation depending on distance, view angle and camera parameters, which we base on • Submap matches c { } i : c i represents the transformation between the origins of two submaps, which is the result of our submap matching process, described in detail in Section 5.2. This can lead to intra-as well as inter-robot loop closures.
In addition, we integrate the estimates of our local reference filters, thereby connecting robot and submap poses. All these measured and estimated transformations are added to the graph as binary factors representing six-dimensional constraints that connect exactly two nodes each. They all are accompanied by a Gaussian uncertainty estimate. It would be straightforward to add further types of measurements: Global position information from GNSS or the matching of aerial images  could, for example, be represented by additional unary factors.
While we are able to restrict the submap matching itself to 4D, the graph SLAM still needs to work on 6D poses, as some types of observations like landmark or robot detections are measured in 6D. Furthermore, we intentionally do not introduce hard constraints on the roll and pitch angles of the submap origins but integrate them with the, typically low, variance estimated by our local reference filter. This allows the graph optimization to compensate for errors in this estimation with respect to other types of measurements. In the following sections, we will discuss and compare the two different graph topologies that we outlined in Figure 6.

| Graph with sequential odometry measurements
The graph topology typically found in SLAM literature sequentially as an approximation of their Gaussian measurement uncertainty. We thereby enforce the resulting covariance matrices Σ u i r to be nonnegative and above an, experimentally determined, threshold to ensure numerical stability during graph optimization. This rough approximation, however, neglects the aforementioned state In the following, we therefore introduce an adaption of the graph topology that allows a more suitable integration of local reference filter estimates.

| Graph with local reference filter estimates
In Figure 6b, we sketched our novel graph topology, as first proposed in Schuster et al. (2015), in which we replace the aforementioned approximation of sequential odometry measurements u i r with two types of estimates that are directly computed by our local reference filter: • The local reference frames and hence the 6D submap origins s i r are aligned with respect to gravity, as discussed in Section 4. Thus, with this graph topology, they can be dissimilar from the robot poses x i r with respect to the roll and pitch angles, as we indicated in Figure 6b by drawing them separated. Compared to the graph topology for sequential odometry measurements, presented in the previous section, the direct integration of the estimates computed by the local reference filter allows a better representation of the underlying probabilistic structure. Within a local reference frame ( = submap), we thus do not introduce any additional independence assumptions.
Furthermore, pose estimates and delayed measurements for each robot can be added at any point in time without requiring additional methods to remove constraints from the graph or to avoid doublecounting of information, such as antifactors (Cunningham et al., 2013). Our graph topology thus allows a straightforward inclusion of delayed measurements like those received from other robots in case of delayed or interrupted communication.
Integrating the frame switch transformations w i r into the graph, we assume independence between the subsequent submap origins s i r and + s i r 1 . This is an approximation, as during frame switches, several filterinternal states (e.g., velocities, IMU biases and visual odometry keyframes) are transferred across submaps. Correlations between submaps could be explicitly considered in the graph optimization by, for example, creating conditionally independent local maps as described by Piniés and Tardós (2008). This, however, would require additional nodes to be added to the graph that represent all common states between submaps. These nodes thus would expose filter-internal and robot-specific states, such as velocities, IMU biases and visual odometry keyframe augmentations, to the graph SLAM. In the design of our system, we instead focus on an explicit decoupling of the local filter and global graph optimization components in favor of a modular multi-robot system architecture. Based on observability considerations, we define the interface between filter and graph on and between all robots at a pure pose level. Therefore, in our approach, the internal states of each individual robot are not exposed and do not need to be transferred to other robots. A change in sensing modalities on one system thus does not require any changes at the SLAM graph level on any of the robots. This is particularly important in heterogeneous multi-robot setups, in which different robots integrate different types of high-frequency sensor measurements in their local filters. The design decisions are a tradeoff between modularity, computational efficiency and the integration of all available information in a smoothing process. Our explicit decoupling of filter and graph SLAM thereby gives us greater design freedom for both subsystems compared to tightly coupled solutions such as, for example, concurrent filtering and smoothing (Williams et al., 2014).

| SLAM back-end: Incremental optimization
The computational cost of batch graph optimization grows with the size of the graph and is therefore not suitable for online global optimization. We thus decided to utilize the incremental iSAM2 optimizer (Kaess et al., 2012), which is available as open source software within the GTSAM 3.2.1 library (Dellaert, 2015). The key idea of iSAM2 for efficient incremental optimization is the conversion of the factor graph to a Bayes net and further to a Bayes tree. This data structure allows to add new variables and factors while keeping subtrees that are not affected by local loop closures unchanged. In our system, it thus allows for fast average optimization steps on the addition of new measurements and filter estimates, while slower, computationally more demanding optimization steps are limited to the infrequent occurrences of large loop closures.
Overconfident, erroneous loop-closure constraints can corrupt the entire graph optimization result. Robust SLAM back-ends mitigate this risk, for example by applying a dynamic scaling to the respective measurement covariances to reduce the influence of outliers (Agarwal, Tipaldi, Spinello, Stachniss, & Burgard, 2013;Latif, Cadena, & Neira, 2014). We replace the quadratic error term in Equation (19) with a robust error function for the integration of our landmark and robot detections as well as submap match estimates. In particular, we employ the GTSAM implementation of M-Estimators and chose the Cauchy error function, as it is suitable to suppress outliers with large errors (Lee, Fraundorfer, & Pollefeys, 2013). The optimization problem can thus be formulated as an iterative reweighted least-squares minimization with the weights in the kth iteration being computed as where the value of the constant c determines the range of the Mahalanobis distances to be still considered as inliers. This error function allows us to gain robustness by mitigating the influence of large outliers that can originate from incorrect data associations as well as measurement or estimation errors.

| EXPERIMENTAL EVALUATION
In the following, we present and discuss novel experiments. First, in Section 7.2, we evaluate the impact of our novel 4D map matching SCHUSTER ET AL.

| 319
optimizations introduced in this study on a total of 53 single-robot datasets. Second, in Section 7.3, we demonstrate the applicability of our full SLAM system in five new experiments, four of them featuring a significantly larger scenario than in Schuster et al. (2015). Third, in Section 8, we present an application of our localization and mapping system in a novel multi-robot autonomous exploration experiment conducted in the rough-terrain environment of a Moon-analogue test site located on a volcano.
In our previous publications, we already evaluated several aspects of our localization and mapping system. For a better overview, we give a short summary of previous experimental evaluations: • Filter consistency and stability: In Schmid et al. (2014b), we demonstrated the consistency and long-term stability of our local reference filter. For this evaluation, we used a quadcopter as a robot that poses additional challenges due to its inherent instability compared to our rover systems discussed here. We evaluated a simulated 24 h quadrotor flight and showed in real quadcopter experiments the applicability of the local reference filter for the control of highly dynamic systems with limited computational resources.
• Accuracy of SLAM pipeline with map matching: We compared our stereo vision-based 6D SLAM system with submap matching to a particle filter-based 3D localization in Brand et al. (2015) and could show an improved 2D localization accuracy of at least 27% in indoor and outdoor scenarios, in addition to estimating three additional degrees of freedom. A comparison of our full SLAM system, as a distributed and suboptimal data fusion method, to full batch optimization, however, remains a topic for future work.
• Impact of SLAM graph topology and heterogeneous multi-robot SLAM: In Schuster et al. (2015), we evaluated the impact of our novel SLAM graph topology on the overall localization accuracy and demonstrated an improvement of 15% on three different datasets compared to a SLAM graph with sequential odometry graph topology as used previously in Brand et al. (2015). In addition, we presented multi-robot experiments with a team of two rovers with heterogeneous camera systemsone was equipped with small-angle lens cameras (f = 5 mm) and the other one with wide-angle lenses (f = 1.28 mm). Our joint localization and mapping gives an improvement over singlerobot SLAM of 32%.

| Comparison of 4D and 6D map matching
In this section, we present an evaluation of our novel 4D map matching algorithm, comparing it to a 6D matching step used in our previous work .

| Experimental setup
We evaluate our algorithm in two series of single-robot experiments in simulated as well as real-world environments, featuring two different scenarios. In both, the LRU used the 3D voxel-grid maps generated by our mapping pipeline to autonomously explore a previously unknown area based on a maximization of information gain and map quality .

| Results and discussion
In this evaluation, we compare three different variants of the map matching algorithm, demonstrating the impact of both the 4D initial alignment (see Section 5.2.4) and 4D refinement step (see Section 5.2.5) with respect to the resulting number of loop closures. As the baseline, we use a 6D initial alignment and 6D refinement (6D + 6D), similar to our previous work , with an outlier threshold on roll and pitch of ∘ 10 for the initial alignment step and ∘ 1 after the refinement. First, we replace only the initial alignment with our novel 4D method, leading to the combination 4D initial alignment and 6D refinement (4D + 6D). This allows a separate evaluation of the impact of our changes on the two processing steps. The third variant is our proposed method, using both a 4D initial alignment and 4D refinement (4D + 4D). We excluded the combination of a 6D initial alignment and 4D refinement (6D + 4D) as its initial alignment errors in roll and pitch could never be corrected by a 4D refinement.
In Figure 8, we present the number of matches as well as the distribution of errors of the estimated transformation compared to ground truth on both the simulated and the real-world experiments. As we enforce roll and pitch to be zero or smaller than ∘ 1 for the 4D and 6D cases respectively, we only consider 3D translation and the error in yaw that the increased number of matches of the novel 4D method comes at no cost in accuracy compared to the 6D method. As they also have a comparable computational complexity, the 4D matching is to be preferred.
The superiority of the 4D approach can be explained as follows: In our novel 4D matching, we solely generate 4D hypothesis that, by definition, satisfy our constraints on roll and pitch. In contrast, during 6D matching, full 6D hypotheses are computed and then filtered with respect to these constraints. Both, the RANSAC computed for each Hough3D bin during initial alignment, as well as the ICP during refinement, generate a single best hypothesis each. Using 4D constraints at this stage thus forces the optimization to find a solution satisfying them, instead of running the risk of generating a single, unconstrained 6D solution that fits better to the respective cost function but will be removed when filtering implausible hypotheses during post processing.

| Collaborative localization and mapping
To demonstrate our full system for collaborative multi-robot localization

| Experimental setup
The scenarios for our five experiments feature our mobile robotics lab as well as adjacent labs, hallways and outdoor areas: • Experiment #1: Mobile robotics lab with three artificial large stones, featuring the aforementioned tracking system for ground truth.
• Experiment #2: Mobile robotics lab (lower right part in the map) as start and finishing point for both rovers. They drove one large loop each, passing through the adjacent lab, the entrance area as well as long hallways (see Figure 10 for details).
• Experiment #3: Setup similar to #2, with slightly different robot trajectories, in particular within the mobile robotics lab.
• Experiment #4: Similar to #2 and #3, but with LRU2 driving two loops through the central lab.
• Experiment #5: Mobile robotics lab with adjacent labs and two loops of LRU2 leaving and entering the building, thereby including indoor as well as outdoor areas.
In Figure 9, we present the 3D maps generated by our SLAM system for all five experiments. In addition, in Figure 10, we provide a sketch of our experimental setup and exemplarily selected Experiment #2 to show a more detailed map in a visual comparison to an architect's plan.
We used our aforementioned tracking system to acquire (partial) ground truth for the robot poses within the area of the mobile robotics lab. In all five experiments, we did not use any artificial static landmarks in the environment. The intra-and inter-robot loop-closure constraints thus solely stem from robot detections and submap matches. In contrast to our aforementioned autonomous exploration experiments, in this setting we rarely moved the robots' pan/tilt unit and did not use it to perform full scans of the area. We therefore increased the threshold to generate new submaps (see Section 5.1.2) to 7 m of maximum driven distance and 0.2 m uncertainty to allow them to be large enough to contain sufficiently discriminative features for map matching even when the rovers only look straight ahead. We also adapted the map matcher parameters accordingly. As the two robots drive through each other's field of view, we exclude their oriented 3D bounding boxes from visual odometry and 3D mapping according to the 6D pose estimates computed by our SLAM framework. While for this evaluation, we manually controlled both robots, in Section 8 we present an additional experiment with a preliminary extension of our exploration algorithm for multi-robot teams.

| Results and discussion
In the following, we present and discuss the results of our five multirobot experiments, with an exemplary more detailed analysis of Experiment #2. In Figure 9, The cause have been difficulties with stereo matching due to an untextured floor in this area, which also heavily impacted the performance of visual odometry. Our local reference filter, however, was able to compensate for this with wheel odometry and IMU measurements, and the graph SLAM then corrected a large amount of the remaining errors. It is important to note that our map representations do not contain any assumptions about a structured environment, that is, no biases toward even floors or straight walls.
In Figure 11, we present plots of the 3D trajectory errors over  Figure 11. The mean and maximum error values refer to those parts of the trajectories for which ground truth was available.   importance of maximizing the number of loop closures to gain robustness with respect to such failure cases, which are impossible to rule out when working on limited amounts of noisy input data.
In We base our discussion on the results for LRU2 in our multi-robot Experiment #2, the data of the other four experiments exhibit similar effects. In Figure 12, we present the CPU and memory usage of the major components of our SLAM system, which we discuss in the following: •  (Kanzog, 2017).

| The ROBEX space-analogue mission
The objective of the ROBEX space-analogue mission is to study the lunar crust model with the help of seismic measurements. Mt. Etna has been selected as a well-suited Moon-analogue test site as it exhibits natural seismic activity at depths similar to lunar deep quakes. As an important aspect with regard to robot navigation, the volcanic rough-terrain environment is also visually similar to the surface of the Moon. In the ROBEX main experiment, one of our LRU roves autonomously picked up seismic measurement instruments from a lander mockup and deployed them at predefined target locations, see Wedler et al. (2017) for details. We successfully used our localization and mapping framework for single-robot local and global localization and mapping during this mission and used the test site for an additional multi-robot exploration experiment.

| Collaborative multi-robot exploration
We conducted a preliminary collaborative exploration experiment with our two LRU rovers at the Moon-analogue site on Mt.
Etna, depicted in Figure 13, to demonstrate the applicability of our multi-robot mapping methods for planetary exploration. For the experiment, we ran a frontier-based exploration algorithm (Yamauchi, 1998) that employs our global probabilistic 3D voxelgrid maps to compute the expected information gain at each new F I G U R E 1 4 Overview of the site on Mt. Etna for our autonomous exploration experiment and view of navigation stereo camera during the experiment. a, Aerial image of test site with manually overlayed approximate exploration target area (red) and area mapped during the experiment (green). b, Image taken by LRU's navigation camera with local terrain classification overlay (green to red: traversibility from easy to hard obstacles) F I G U R E 1 5 Multi-robot 3D map created during our autonomous exploration experiment on Mt. Etna. Left: Point cloud-based map (resolution 0.05 m, grid size: 5 m) created by our two rovers LRU1 (blue) and LRU2 (red). Right: Height-colored voxel grid representation (resolution 0.1 m) of the same map, showing the slope of the terrain. Similar to Figure 10, the ellipsoids represent the estimated positional standard deviation of the rovers at their submap origins with respect to the start position of LRU2 exploration goal location, as described by Lehner et al. (2017).
This information is used to rank the goals and select the respective next location to be explored. To apply these methods to collaborative multi-robot exploration, we extended them to spatially distribute goal locations between robots. Therefore, each robot communicates each new exploration goal to all other robots and enforces a minimum distance between exploration goals of different robots of at least 7 m, approximately two times the sensor range of our rovers when pointing their pan/tilt units toward leveled ground. For the experiment, we defined a target region of 25 m × 20 m to be explored, as indicated by the red polygon in Figure 14.
The exploration experiment ran for 35 min. The two rovers traveled a combined distance of 394 m and mapped an area of approx. 650 m 2 , including parts outside of the exploration target area that have been traversed to avoid obstacles like large stones. Due to a limited availability of fully charged batteries, we were not able to fully explore the target area and in due time manually set waypoints to drive the rovers back close to their start positions. In Figure 15, we present the final map in its point cloud and probabilistic voxel-grid representations as well as the multi-robot SLAM graph, which, similarly to the previous experiments, is small and sparse with 163 nodes and 224 factors. As we applied a frontier-based exploration algorithm, the rover trajectories exhibited few overlap, resulting in only a single loop closure from our map matching system. To approach this general issue, recent work from our group is concerned with active loop closing that makes a trade-off during the selection of goal locations between exploring new areas and revisiting already mapped places for relocalization, as presented in Lehner et al. (2017).
We faced many additional challenges during the experiments at our volcanic test site at a height of 2,645 m above sea level. In contrast to previous indoor and clouded-sky outdoor experiments, the AprilTag markers on our rovers used for mutual robot observations oftentimes could only be detected from one direction.
In direct sunlight on Mt. Etna, they turned out to be too reflective, leading to severe overexposure in the camera images that made the whole tags appear plain white and thus impossible to decode. For the experiment presented here, we manually ensured that the rovers could see each other's markers at least at the start and end of the experiment. For future work, we plan to extend the robots' behaviors to actively look at each other based on their relative localization estimates to create further loop-closure opportunities.
While in general, obstacles often provide unambiguous 3D structures suitable for keypoint-based feature matching, the restriction to these limited the capabilities of our map matcher in the Moonanalogue scenario. Although the rough-terrain ground exhibited recognizable and thus potentially matchable geometry in some areas, it was not classified as an obstacle and thus excluded from the feature matching process. We are currently looking into replacing this by an obstacle-independent selection of the map matcher keypoints. Further, it might be beneficial to tune some map matcher parameters like the aforementioned keypoint selection to different environments. In this context, we started work on automatic parameter optimization, a topic that Cadena et al. (2016)

| CONCLUSION AND FUTURE WORK
In this study, we have presented a multi-robot 6D localization and mapping system that combines real-time local state estimation with global online optimization. The decoupling of these allows a distributed processing of high-frequency measurements in a multirobot team and leads to a small graph for computationally efficient optimization steps. For this, we employ a novel graph topology to incorporate the results of local reference filters according to their uncertainty estimates. In our distributed system, each robot has its own online global pose and map estimate available at all times, even in case of interrupted communication to any of the other robots.
For 3D mapping, we aggregate high-bandwidth sensor data into submaps to online generate dense point cloud maps (resolution 0.05 m) and probabilistic 3D voxel-grid maps (resolution 0.1 m) from noisy stereo data. Sharing aggregated map data instead of raw image streams between robots allows us to reduce the required bandwidth from 38.75 MB/s to 58 kB/s. We generate loop-closure constraints from visual robot detections as well as intra-and inter-robot submap matches and present in detail our technique for matching stereo vision-based submaps on geometric environment features. In this context, the decoupling of observable and unobservable states through partial frame switching in the local reference filter allows us to introduce a novel optimization: A reduction of the dimensionality from 6D to 4D of the map matching itself leads to a, on average, 40% higher number of loop-closure constraints, as we demonstrated in our evaluation based on the data of 40 simulated and 13 realworld experiments. We integrated all these components into our modular mapping architecture that allows an easy adaption to include resource-limited systems as part of future heterogeneous multi-robot teams.
To evaluate our full SLAM system, we conducted five multi-robot experiments with two rovers in areas of up to 57 m × 53 m. Our localization and mapping components generated accurate joint maps from both robots' stereo data and estimated their trajectories with average 3D translational errors below 0.5 m with respect to partially available ground truth, that is, errors below 0.4% of the robots' respective total trajectories. In addition, we successfully demonstrated the application of our localization and mapping framework for autonomous multirobot exploration and presented insights gained from a novel experiment conducted in the challenging environment of a Moon-analogue test site located on the volcano Mt. Etna, Sicily, Italy.
For future work, we plan to further improve our map matching algorithm by adapting the keypoint selection to include traversable SCHUSTER ET AL.

| 329
but locally discriminative parts of rough terrain to obtain matches in environments with no or few obstacles. Another open challenge is the intelligent and consistent merging of submaps in a multi-robot setup to allow long-term mapping. On the graph optimization level, we work on a separation of observable and unobservable states, similar to filter and map matcher, expecting benefits in large-scale multi-robot scenarios. As we designed our mapping system for the requirements of heterogeneous robot teams, we aim for an evaluation of our approach with a team of flying and driving robots, employing the resulting maps for autonomous multi-robot exploration.