Robotic monitoring of grasslands: a dataset from the EU Natura2000 habitat 6210* in the central Apennines (Italy)

Despite the remarkable growth of the global robotics market, robotic monitoring of habitats is still an understudied topic. This is true, among others, for the species-rich EU Annex I habitat "6210 - Semi-natural grasslands and scrubland facies on calcareous substrates", which is typically surveyed by human operators. In this work, we present a dataset of relevés performed with the quadrupedal robot ANYmal C. The dataset contains information from three plots, including the robot state and the videos and images acquired to assess the habitat conservation status. Additionally, we provide a collection of videos and pictures of two typical species and one early warning species of habitat 6210. The database is publicly available in the provided Zenodo repository and will aid researchers in several fields: robot state information can be used by engineers to validate their algorithms, while the data gathered by the robot can be used to design new methodologies and metrics to assess the habitat conservation status, or to train/test classifiers (e.g., neural networks) for plant classification.

Given the variability of Annex I habitats, and in particular the wide floristic diversity of grasslands, there are no officially recommended lists of TS at either the European or the national scale, and it is recommended to identify the target TS at the regional or even local scale 14. The presence of orchid species is generally considered an indicator of favorable conservation status, and their abundance in the recorded area is particularly relevant as an indicator of the priority status of habitat 6210 [15][16][17][18][19].
The concept of "early warning species" (hereinafter EWS) applies to those plants whose settlement in the habitat patches indicates ongoing processes altering the structure and function of the habitat itself; this concept applies also (perhaps mostly) to non-typical taxa, whose settlement can be a clear sign of transformation. In this sense, non-typical EWS can be used as very informative proxies, pointing out the first functional shifts of the community 20. Among the signals of grassland habitat degradation, the appearance and establishment of edge and mantle species is considered an important early warning indicator that should be monitored, as it reflects ongoing dynamic processes that can lead, in a more or less short time, to a complete alteration of the habitat 9.
The standard collection of vegetation data suitable for assessing grassland conservation status is generally performed through a sampling field relevé 9,14. According to the national standards 9,14, the relevé must be carried out in randomly located, homogeneous 4 × 4 m² plots, whose number is proportional to the total habitat surface. The recommended 4 × 4 m² area complies with the EU standards for grassland sampling 21. Data acquisition and assessment of the habitat conservation status are typically performed only by human operators, a time- and cost-inefficient approach. Additionally, funding for biodiversity preservation is often inadequate 22,23. Given these observations, the goal of the Natural Intelligence project (https://www.nih2020.eu/) is to enrich human monitoring capabilities through the use of robotic systems. In particular, we use legged robots to acquire information on the habitat, mimicking the surveys performed by humans. The choice of this type of robot is a trade-off between mobility and battery life.
In this dataset, we report the data collected during a monitoring mission in occurrences of habitat 6210 in Valsorda, Gualdo Tadino (PG), Italy (Fig. 1), inside the Natura 2000 SAC IT5210014, employing the quadrupedal robot ANYmal C 24 (Fig. 2). The data were collected by a team of robotic engineers and plant scientists. The information included in this dataset can be divided into two groups. The first is a collection of videos and images of three indicator plant species acquired by the robot cameras. The second includes data related to the autonomous monitoring missions; specifically, we report the point cloud of the area, the robot state information, and the acquired videos and pictures.
This dataset has a multidisciplinary scope and can be used by researchers in several fields. For instance, point clouds and information about the robot state could be used by robotic engineers to test or validate their own methods as well as benchmark the robot performance. On the other hand, plant videos and images recorded by the robot could be used by botanists to assess the quality of this information as well as the habitat's condition, or by computer scientists interested in testing their Artificial Intelligence (AI) algorithms for species detection and classification, as well as for exploring new possibilities for integrated human-AI monitoring.
To the best of the authors' knowledge, this is the first publicly available dataset on robotic habitat monitoring.

Methods
The gathering of the full dataset, comprising monitoring missions as well as typical and early warning indicator species, was carried out by a team composed of both robotic engineers and plant scientists in Valsorda, Gualdo Tadino 06023 (PG), Italy, inside the Natura 2000 SAC IT5210014 (Fig. 1). This location was chosen because it hosts a notable stand of the target habitat 6210 in the central Apennines. The data were acquired from the 10th to the 13th of May, 2022. This month was selected because, as stated in the national guidelines for habitat monitoring 12, the optimal sampling period for habitat 6210 spans from May to August, when the majority of the composing species reach their maximum development and flower.
Considering the particular role played by orchid species in this habitat type 15, and based on knowledge of the local flora and vegetation, data collection was carried out in May, an ideal period to detect both the target TS and the EWS in the field. May is the best period to monitor their occurrence, since they all typically flower between April and June 25. In particular, orchid species are practically indistinguishable in the early stages of their life cycle, when only basal leaves are present, but become clearly recognizable at flowering time.
The platform used for data acquisition is the quadrupedal robot ANYmal C 24 (Fig. 2), produced by ANYbotics AG. This robot can move both autonomously and under teleoperation. Information about the environment is collected through a LiDAR sensor and four RGB-D cameras. The former is a Velodyne VLP-16 Puck LITE (https://velodynelidar.com/products/puck-lite/), which acquires a 3D map of the environment. The cameras are four Intel RealSense D435 units (https://www.intelrealsense.com/depth-camera-d435/) able to collect full HD RGB images and/or 30 fps videos. Figure 2 depicts the location of the sensors: the LiDAR is placed on the rear part of the system, while the four cameras are placed one per side. The platform is also equipped with two wide-angle FLIR Blackfly BFS-GE-16S2C-BD2 cameras, which were not used for data gathering. The information about the robot state is collected via the ROS interface of the robot and saved as ROS bag files. For more details about this file type, please see the official ROS bag reference (http://wiki.ros.org/rosbag).
The dataset contains two different sets of data: (i) typical and early warning species data and (ii) monitoring mission data. Both sets were collected with the same device and are related to habitat 6210; however, their content and goal differ. The first batch contains categorized pictures and videos of indicator species for habitat 6210. Each species occurrence was chosen independently, with the sole goal of providing an example of the indicator species. The second batch contains information about habitat surveys performed by the robot using procedures similar to those followed by botanists. Both methodologies are described in detail in the following sections.
Typical and early warning species data. This part of the dataset includes pictures and videos of indicator species for habitat 6210. We selected as indicator species three plant taxa with a key role as explanatory proxies of the conservation status of habitat 6210: specifically, two TS and one non-typical EWS. All data were categorized by botanists expert in the local flora.
Following the guidelines in 6, and in accordance with the suggestion to select the TS of habitat 6210 at the local level 12,14, we selected two typical orchid species among those mentioned in the national Interpretation Manual. As non-typical EWS, we considered edge and mantle species, whose establishment signals ongoing habitat transformation [27][28][29]. Among these species, we chose Asphodelus macrocarpus Parl., a rhizomatous geophyte with rapid vegetative growth, which typically expands from heliophilous forest edges, colonizing grassland habitats. The occurrence and impact of this species on habitat 6210 in the central Apennines are well documented, especially where traditional agropastoral activities are underpracticed or abandoned 27,30,31. Please note that the nomenclature of plant species follows the World Flora Online portal 32.
To acquire the data, plant scientists first identified instances of the indicator species in the study area using the diagnostic keys in 25. Then, an expert human operator teleoperated the robot, placing it in front of the chosen instance, which may include individuals of one or more of the three species. Finally, the video acquisition was started manually, and a video of at least 900 frames was recorded together with 10 pictures. This procedure was performed 16 times for each indicator species; in the case of Dactylorhiza sambucina, it was executed 16 times for each colour form, pink and yellow. Pictures and videos contain at least one instance of the indicator species, but may also contain instances of other species. Table 1 summarizes the entries of this part of the dataset, while Fig. 3 shows an example picture for each indicator species.
The peculiarity of this part of the dataset is that, to the best of the authors' knowledge, it is the first dataset of classified images and videos of indicator species for habitat 6210 taken by a quadrupedal robot. The main goal of this batch of data is to publicly share data that can be used to design novel classification algorithms, or to test already developed methods on robot-acquired data. Given this objective, no other information is provided about these data.
Monitoring mission data. This other part of the dataset contains information about the monitoring missions. The idea is to use the quadrupedal robot ANYmal C to imitate the survey of a standard plot performed by botanists during a field relevé. This means that the robot follows the same standard procedures and collects information about the area, but it neither analyzes nor classifies the data. Each area to be surveyed was randomly chosen, and its GPS coordinates were recorded by a human operator employing a Garmin eTrex 10 device, with an accuracy of at least 3 m. In addition to location information, the time, date, and weather conditions were also noted. Georeferencing is crucial for comparing data with past or future surveys, while date and weather information are useful for approximately estimating the sunlight. Time and date were automatically saved with the robot status, while the weather was visually inspected and manually noted.
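As an illustration of how the recorded coordinates can support such cross-survey comparisons, the sketch below computes the great-circle distance between two plot positions and checks whether they agree within the stated 3 m GPS accuracy. The coordinates and the helper name are hypothetical, not part of the dataset.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 coordinates."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical coordinates of the same plot recorded in two campaigns.
plot_2022 = (43.23555, 12.7791)
plot_2023 = (43.23558, 12.7791)

# Positions closer than twice the 3 m GPS accuracy are compatible
# with being the same plot.
GPS_ACCURACY_M = 3.0
d = haversine_m(*plot_2022, *plot_2023)
same_plot = d <= 2 * GPS_ACCURACY_M
```

At this scale the haversine formula is overkill but numerically safe; it also remains valid when comparing plots from different sites.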
The actual survey is divided into two phases: (i) mapping, and (ii) autonomous monitoring mission. We define mapping as the creation of a 3D map of the environment. In this phase, the LiDAR sensor mounted on the robot scans the surrounding area, measuring the distance between the robot and the obstacles. To improve the digital representation of the environment, an expert human operator guides the robot around while the LiDAR gathers information. The result of this procedure is a set of points, namely a point cloud, which represents the environment in terms of both terrain and potential obstacles. An external video was also recorded to qualitatively document the whole phase.
The point cloud obtained in phase (i) is essential for robot self-localization, which in turn is fundamental for the autonomous locomotion required in phase (ii), i.e., the autonomous monitoring missions. In this phase, the robot mimics the habitat relevé typically executed by plant scientists: it acquires information on plots of the same size as those surveyed by botanists, following the national and international standards. In accordance with these standards 9,21, the area to be monitored is 4 × 4 m², and ANYmal examines it completely. To do so, the robot autonomously moves on a grid, following a set of waypoints that starts at the bottom-right corner of the area and ends at the top-right one; Fig. 4 schematizes the robot motion. During this phase, the robot keeps its orientation constant. When a waypoint is reached, i.e., at the center of each square in Fig. 4, the robot stops and takes a picture with each of the four RGB-D cameras. Each camera also records a video for the whole mission duration. In addition, we recorded the robot status information and an external video of the whole robot motion. No analysis is performed on the robot-acquired data. Table 2 reports information about the three monitoring missions we performed; since the three areas were close to each other, a single mapping phase was sufficient to enable all autonomous monitoring missions. Table 3 summarizes the robot status information, which is collected as ROS bag files of several topics of the robot control architecture. Robot status information includes, for instance, robot base position, joint positions, joint velocities, joint accelerations, joint torques, joint currents, and battery status. Details about the topics and the specifications of the data streams output by the ANYmal C robot are available on the website of ANYmal Research (https://www.anymal-research.org/).
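A waypoint grid of this kind can be sketched as follows. The exact cell size and visiting order used on the robot are not specified here, so the grid resolution and the boustrophedon (serpentine) ordering below are assumptions for illustration only; with an even number of rows, the path starts at the bottom-right cell and ends at the top-right one, matching the described motion.

```python
def grid_waypoints(side_m=4.0, n=2):
    """Centers of an n x n grid of square cells covering a side_m x side_m
    plot, ordered as a boustrophedon path over the rows (bottom row first).
    For even n the path starts at the bottom-right cell and ends at the
    top-right cell. Grid resolution n is an assumed parameter."""
    cell = side_m / n
    waypoints = []
    for row in range(n):
        # Even rows sweep right-to-left, odd rows left-to-right.
        cols = range(n - 1, -1, -1) if row % 2 == 0 else range(n)
        for col in cols:
            x = (col + 0.5) * cell  # x grows to the right
            y = (row + 0.5) * cell  # y grows upwards
            waypoints.append((x, y))
    return waypoints

# Waypoints for a 4 x 4 m plot split into four 2 x 2 m cells.
wps = grid_waypoints(side_m=4.0, n=2)
```

The robot orientation is kept constant along the path, so each waypoint needs only a planar position, not a heading.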
Figure 6 shows an example of (part of) the robot status recorded during part of the Plot 1 mission.

Data Records
In this section we describe the data contained in this dataset, we list the different file formats, and we explain how to visualize them. All data are uploaded on Zenodo 33 at https://doi.org/10.5281/zenodo.7385369, while an example code to extract the bag files is uploaded on Zenodo 34 and GitHub 35 .
The tree structure of the full dataset is depicted in Fig. 5. We provide two sets of data together with a README.txt file to navigate through them. The first set, named "Typical and early warning species", is a collection of videos and images of three indicator species, which are TS or EWS for habitat 6210. A README.txt file explains the structure of this set of data. For each species, we provide a "Pictures" and a "Video" folder containing the pictures and videos of the indicator species. Each species was recorded 16 times. Table 1 summarizes the number of entries of this set of data.
The second set, named "Monitoring missions", contains data related to the mapping and the three monitoring missions. A README.txt file explains the structure of this set of data. The "Mapping" folder contains a README.txt file, a video of the mapping operation, and the acquired point cloud. Each of the three "Plot" folders contains three subfolders and a README.txt file with the time and date at the start of the mission, together with weather information. The three subfolders contain a qualitative video of the operation, the ROS bag files related to the robot state, and the videos and images acquired by the robot. The latter data are named after the camera used to acquire the information: specifically, "depth_front", "depth_left", "depth_rear", and "depth_right" refer to the RGB-D camera placed on the front, left, rear, and right side of the robot, respectively. The bag files are instead named with the date and time at the start of the recording.
Data formats. In the following, we distinguish data elements depending on the file format.
.mp4 and .avi. These are standard file formats for videos and can be viewed with many common multimedia applications. The avi files are videos collected by the robot, both during autonomous monitoring missions and during typical and early warning species data gathering. The mp4 files are videos taken with a digital reflex camera, whose only purpose is to qualitatively show the robot motion.
.ply and .pb. Point clouds have been saved in two formats: ply and pb. Polygon File Format (ply) is a standard format for 3D models and can be opened with several software tools, e.g., MATLAB (https://www.mathworks.com/products/matlab.html), but also with free alternatives like MeshLab. The pb format is the default one used by the ANYmal research software (https://www.anymal-research.org/) to save and load point clouds. Thus, depending on the goal, users may prefer the ply or the pb file format.
.bag. This is the standard format that ROS employs to store robot data. The dataset contains bag files related to the autonomous monitoring missions. These data can be inspected through several ROS packages, but also through other means, for instance, data processing software like MATLAB (https://www.mathworks.com/products/matlab.html). The bag files contain data saved from the so-called ROS topics (http://wiki.ros.org/rostopic), which are streams of data in the form of ROS messages. The particular list of topics that we saved is shown and described in Table 3; these topics contain all the relevant information about the state of the robot and its main components.
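As a minimal, dependency-free illustration of working with the ply files mentioned above, the sketch below reads a PLY header with the Python standard library to recover the declared number of points; the embedded example header is a hypothetical stand-in for the dataset's point cloud. For full point-cloud processing, tools such as MeshLab or MATLAB remain the practical choice.

```python
import io

def ply_vertex_count(fh):
    """Parse the header of a PLY file and return the declared vertex count.
    Both ASCII and binary PLY files share this ASCII header layout."""
    if fh.readline().strip() != "ply":
        raise ValueError("not a PLY file")
    count = None
    for line in fh:
        line = line.strip()
        if line.startswith("element vertex"):
            count = int(line.split()[-1])
        elif line == "end_header":
            return count
    raise ValueError("truncated PLY header")

# Minimal hypothetical ASCII PLY file with three points.
example_ply = """ply
format ascii 1.0
element vertex 3
property float x
property float y
property float z
end_header
0 0 0
1 0 0
0 1 0
"""
n_vertices = ply_vertex_count(io.StringIO(example_ply))
```

Such a quick header check is useful, for example, to verify that a downloaded point-cloud file is complete before loading it into heavier software.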
Example code for data analysis. The robot status is stored as bag files. The extraction of data from bag files can be performed using many ROS packages or other software, e.g., MATLAB (https://www.mathworks.com/products/matlab.html). In Code 1, we provide a minimal example of a MATLAB script to analyse the data contained in the bag files. Running this example requires MATLAB (https://www.mathworks.com/products/matlab.html) and the ROS Toolbox (https://www.mathworks.com/products/ros.html). The code was written and tested on MATLAB R2022a, but it may also work on later MATLAB releases. The function extracts the data of a specified topic from a ROS bag file as a MATLAB structure. Again, the list of topics available in the provided data is shown in Table 3. More detailed descriptions of the topics and the specifications of the data streams output by the ANYmal C robot are available on the website of ANYmal Research (https://www.anymal-research.org/), which can however only be accessed by research institutions after a partnership request is accepted by ANYbotics. Please note that the provided data can be freely accessed without registering to ANYmal Research: the partnership is only needed to obtain very specific details about the robot, and the code provided here is sufficient to visualize and analyze the data.
Code 1 Example MATLAB code to extract data from ROS bags.

Technical Validation
Quality assurance during the fieldwork was provided both by the robotic engineering team, i.e., Franco Angelini (FA), Mathew Jose Pollayil (MJP), and Manolo Garabini (MG), and by the plant scientist team, i.e., Federica Bonini (FB) and Daniela Gigante (DG). All authors oversaw the data acquisition and reviewed the final dataset, meticulously checking for inaccuracies and incongruencies. It is also worth highlighting that the shared data are provided raw, without any post-processing that may alter their soundness. The validity of the data collection is further ensured by a set of choices described in detail in the following.

Location selection. The chosen location is included in the Natura 2000 Site designated as SAC IT5210014 "Monti Maggio - Nero (sommità)" in 2014 36, and, as such, the presence of the target habitat had been previously detected, identified, and assessed according to the standard European and national protocols 9,15,37. Besides wide patches of the target Annex I habitat 6210, the SAC includes four more Annex I habitats, namely 8210, 9210, 9260, and 9340 (for a definition of the Annex I habitat codes, see 2,15). Official data on the occurrence and distribution of these habitats in the SAC IT5210014 are part of its Management Plan, published by the Regional Offices of Umbria 38, and are also available in the EU Standard Data Form of the SAC 39. The habitat maps are part of the Natura 2000 Network Management Plans, adopted by the Umbria Region with 40 and subsequently approved by the Regional Council. In the field, DG and FB ensured that the plots chosen for data collection are typical examples of habitat 6210.
Date selection. As described in the Methods section, the ideal period to perform the data collection is May.
However, in a natural environment, the precise choice of the best sampling period mainly depends on the seasonal climatic variations, in terms of rain and temperature, of the survey area, which influence vegetative growth. In order to acquire meaningful data, DG and FB tracked the flowering progress of the target indicator species starting from mid-April, and the dates 10-13 May were selected based on the orchids' phenology (e.g., time of full blooming) in the study area.
On-field indicator species classification. The section of the dataset named "Typical and early warning species data" contains pictures and videos of the indicator species. As described in the Methods section, the species that are indicators for habitat 6210 were selected following articles published in international journals 6,12,14,26,27,30,31. In the field, the plant scientists (DG and FB), who are experts in the flora of habitat 6210, selected and classified the specific instances of the chosen indicator species following the guidelines in 25.
Mapping accuracy. The validity of the mapping is directly ensured by the autonomous missions. Indeed, in the mapping phase the robot recorded a digital reconstruction of the surroundings that was then used during the autonomous mission phase. Since the robot never failed to localize itself in the environment, and it was also able to autonomously walk and perform the monitoring missions, we can infer that the point cloud is sound.
Monitoring mission validity. The monitoring missions were conducted using methodologies that mimic the field relevé performed by botanists, ensuring compliance with national and international standards 9,14,21. The monitoring software developed by the engineering team prints error messages to the terminal whenever faults occur in the data acquisition process; the absence of such messages ensured that no issues arose during data acquisition. Additionally, no analysis was performed on the robot-acquired data, which ensures that no data corruption was introduced.
Database inspection. At the end of the data collection, the database was created, adding only valid and complete data. Once the dataset was created, both teams carefully revised each entry to check its validity: the plant scientists inspected each video and picture of the "Typical and early warning species data" directory to ensure their correct folder placement, and, analogously, the robotic engineering team inspected the robot status folders and the point cloud folder. In particular, MJP ran test scripts to ensure that no corrupted file was present.

Usage Notes
The presented dataset is highly multidisciplinary and can be employed by researchers working in several different fields. For instance, it can be exploited to pursue the long-term goal of the Horizon 2020 Natural Intelligence project (https://www.nih2020.eu/), i.e., to assist plant scientists in performing habitat monitoring procedures using legged robotic systems. To achieve this objective, two main points need to be tackled: (i) having a robotic platform able to autonomously replicate the survey procedures suggested by national and international standards; and (ii) having algorithms able to (at least partially) assess the habitat conservation status.
As described in the Background & Summary section, the habitat conservation status can be related to various parameters, among which is the presence/absence of typical or early warning species 6,8,9,14. For this reason, achieving point (ii) requires algorithms able to classify different plant species. The part of the proposed dataset named "Typical and early warning species" can be used towards this goal: it provides the labeled pictures and videos that are usually necessary to develop and tune new classification algorithms. This part of the dataset can therefore be very useful for engineers or computer scientists aiming to design new artificial-intelligence-based methods for the detection and classification of plant species, e.g. 41. Alternatively, these robot-acquired pictures and videos can be used to validate existing algorithms that were developed using human-acquired data.
The part of the dataset named "Monitoring missions" can be used to work towards point (i). For instance, the LiDAR-acquired point clouds and the robot status can be exploited to advance research on robotic locomotion, navigation, and obstacle avoidance. Additionally, this part of the dataset can be used by botanists to assess the habitat conservation status and to compare it with past or future data from the same plots, or with data from different plots.
The mentioned usages of the dataset are intended as examples; many others are possible. For instance, instead of using the point cloud or robot status data to develop algorithms for habitat monitoring, they can be used to design new methodologies for search and rescue, inspection and maintenance, etc. Indeed, point clouds representing natural environments are quite useful for validating algorithms previously tested only in simulation, decreasing the sim-to-real gap.
Alternatively, it is possible to exploit the monitoring mission data to build new models. Indeed, these data are georeferenced, so they can be used for remote-sensing analyses similar to 42,43. The idea is to analyze the images/videos taken by the robot during the autonomous mission to classify the plant species, and then to integrate this information into species distribution models or to validate previously defined models. The position of a plant observed during the autonomous monitoring mission can also be precisely estimated by merging the GPS information with the robot position estimate.
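The merging step mentioned above can be sketched with a simple flat-Earth approximation, which is more than adequate at the scale of a 4 × 4 m plot; the reference coordinates, offsets, and function name below are hypothetical, intended only to show the geometry of the computation.

```python
import math

EARTH_R = 6371000.0  # mean Earth radius in metres

def local_to_gps(lat0, lon0, east_m, north_m):
    """Offset a GPS reference point (degrees) by a local east/north
    displacement in metres, using a flat-Earth approximation whose
    error is negligible over a 4 x 4 m plot."""
    dlat = math.degrees(north_m / EARTH_R)
    dlon = math.degrees(east_m / (EARTH_R * math.cos(math.radians(lat0))))
    return lat0 + dlat, lon0 + dlon

# Hypothetical values: the plot's GPS reference and a robot base position
# estimate (east/north offsets in metres) taken from the state topics.
plot_ref = (43.2355, 12.7791)
robot_xy = (3.0, 1.0)
plant_lat, plant_lon = local_to_gps(*plot_ref, *robot_xy)
```

In practice, the dominant error in such an estimate is the 3 m GPS accuracy of the plot reference, not the approximation itself.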

Code availability
The MATLAB scripts used to extract and plot data from the ROS bag (.bag) files are provided on the GitHub page of the Research Center E. Piaggio 35 and are also archived on Zenodo 34. A description of each script is provided in the README file of the GitHub repository.