Multi-sensor, multi-device smart building indoor environmental dataset

A dataset of sensor measurements is presented. The dataset contains discrete measurements from 8 IoT devices placed at various locations in a research lab at the University of Bristol. The Nordic nRF52840 DK IoT devices periodically collect environmental data, namely temperature, humidity, pressure, gas, room light intensity and accelerometer readings, together with a measurement quality indicator. The measurements were taken every 10 seconds over a six-month period between February and September 2022. In addition, we provide the Received Signal Strength Indicator (RSSI) of the IoT devices. The data files are in CSV format, which can be read with a wide range of software libraries. We provide a "README.txt" file that explains the repository and how to use the dataset. Each data file is named according to its creation date and, once it reaches a size of 1MB, it is compressed and archived. A new folder is created automatically every week to store all the data files from that week. The dataset can be used to evaluate drift detection algorithms, such as those for detecting malicious manipulation or anomalies. It can also be used for smart building applications such as occupancy detection. The dataset can be found at https://data.bris.ac.uk/data/dataset/fwlmb11wni392kodtyljkw4n2


The data was acquired using several sensors in a smart building/office environment. The sensors were integrated with an IoT Nordic nRF52840 DK board. The following sensors were employed: (1) "ISL29125" Light Sensors: Collects intensity of the light [1].
The sensors were connected to an IoT device equipped with a microcontroller and radio capabilities. Message Queuing Telemetry Transport (MQTT) [4] was used as the publish/subscribe communication protocol for gathering data and sending it to a central database server for storage.

Data format
The data consists of raw sensor values stored as integers or floating-point numbers. The raw data includes only the time, device ID, sensor type and measured value. The device IP addresses are replaced with random indicators such as "A", "B", "C". Each value is timestamped with a Unix epoch value (also known as Unix time or POSIX time) that indicates when it was recorded. The device ID indicates which IoT device each sensor reading comes from.
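As an illustration of the four fields described above, the following minimal Python sketch parses rows into typed records using only the standard library. It assumes the columns appear in the order time, device ID, sensor type, value, and that the epoch values are in seconds; the README.txt in the repository should be checked for the actual layout. The names `Reading` and `parse_readings`, the sensor labels and the sample rows are hypothetical and not taken from the dataset.

```python
import csv
import io
from datetime import datetime, timezone
from typing import NamedTuple


class Reading(NamedTuple):
    time: datetime   # UTC time decoded from the Unix epoch column
    device: str      # anonymised device indicator, e.g. "A"
    sensor: str      # sensor type label (labels used here are made up)
    value: float     # raw measured value


def parse_readings(text_stream):
    """Yield a Reading for each four-column CSV row, skipping malformed rows."""
    for row in csv.reader(text_stream):
        if len(row) != 4:
            continue
        epoch, device, sensor, value = (field.strip() for field in row)
        try:
            yield Reading(
                datetime.fromtimestamp(float(epoch), tz=timezone.utc),
                device,
                sensor,
                float(value),
            )
        except ValueError:
            continue  # e.g. a header line or a non-numeric field


# Made-up sample rows (assumed column order), not taken from the dataset.
SAMPLE = (
    "1646128800,A,temperature,21.5\n"
    "1646128810,A,temperature,21.6\n"
    "1646128810,B,humidity,40\n"
)
readings = list(parse_readings(io.StringIO(SAMPLE)))
```

Converting the epoch to a timezone-aware UTC `datetime` avoids ambiguity when the readings are later aligned with local events such as working hours.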

Description of data collection
In total, eight identical, severely resource-constrained IoT devices were placed at different locations in the office, each reporting six different sensor values every 10 seconds. The data was collected using non-obtrusive environmental sensors. In order to capture different scenarios within the office environment, 6 different types of sensors were used on each IoT device, namely light, movement (accelerometer), temperature, humidity, gas (VOC/CO2) and pressure sensors. In addition to the sensor readings, we provide Received Signal Strength Indicator (RSSI) values from each device. The IoT devices communicate via radio with an edge device, which is an UMBRELLA node [5]. The edge node forwards the data to a desktop server for storage.

Value of the Data
• The rapid increase in the number of IoT applications has resulted in billions of devices being deployed, producing vast amounts of data. These devices are used for various purposes such as monitoring indoor air quality, estimating occupancy, detecting drift, and planning networks. However, gaining access to real-world data presents a significant challenge due to the reluctance of real-world institutions to disclose it. This limited access makes it difficult to test, standardize, and compare sensor-related technologies. For example, Chimamiwa et al. [6] recently provided a dataset of smart homes over a six-month period, and our proposed dataset has been generated over a similar time frame, with similar sensors, in a working office environment. The continuous monitoring data provided by our dataset is a valuable resource for researchers. Open access to real-world sensor data can benefit the research community, particularly those who do not have the resources or time to create comprehensive datasets. The availability of such datasets can also speed up the development of algorithms for smart buildings and home automation.
• Understanding the true value of a dataset requires contextual information about the environment and about how the dataset was processed. To this end, we provide an openly accessible dataset collected over a period of six months using a diverse set of sensors positioned in multiple rooms within a busy environment. In addition, we include details about the environmental conditions to provide deeper insight and facilitate the interpretation of the data.
• Open access to raw data from sensors will help advance the development of algorithms for smart office/home environments, such as activity and intrusion detection. Data collected continuously over an extended period provides a valuable opportunity to evaluate various machine learning algorithms in areas such as identifying patterns in behavior [7] or detecting anomalies [8].
• An example of how the dataset can be practically used is to test drift detection algorithms designed to identify compromised IoT devices that report false data. This type of manipulation can happen gradually over time and can misrepresent the state of the environment. To demonstrate this, we assume a device has been hacked and is providing incorrect sensor readings. We focus on gradual manipulation since sudden changes are easier to detect. Additionally, there are natural drifts in the data caused by seasonal changes and by abnormal scenarios such as temporary HVAC failures, which result in deviations from ideal indoor temperatures. The dataset can be used to examine both malicious and natural data drifts [13]. Other potential applications of the dataset include occupancy detection, indoor air quality estimation, and evaluating techniques for addressing missing data in time-series data generation.
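The gradual-manipulation scenario described in the last point can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of [13] or of the authors: a synthetic, stationary "temperature" series stands in for real readings, a linear ramp imitates a compromised device, and a simple heuristic flags drift when the mean of a trailing window moves more than a chosen number of baseline standard deviations away from the baseline mean. All function names, window sizes and thresholds are hypothetical.

```python
import random
import statistics


def inject_gradual_drift(values, start, total_offset):
    """Return a copy of values with a linear ramp added from index start onward.

    The ramp grows from 0 at `start` to `total_offset` at the last sample,
    imitating a compromised device that slowly biases its readings.
    """
    span = max(len(values) - 1 - start, 1)
    return [
        v + (total_offset * (i - start) / span if i >= start else 0.0)
        for i, v in enumerate(values)
    ]


def first_drift_index(values, baseline_len, window, threshold):
    """Return the index at which drift is first flagged, or None.

    Drift is flagged when the mean of the trailing `window` samples differs
    from the baseline mean by more than `threshold` baseline standard deviations.
    """
    baseline = values[:baseline_len]
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    for end in range(baseline_len + window, len(values) + 1):
        window_mean = statistics.fmean(values[end - window:end])
        if abs(window_mean - mu) > threshold * sigma:
            return end - 1
    return None


# Synthetic stand-in for a temperature stream (not real dataset values).
rng = random.Random(0)
clean = [21.0 + rng.gauss(0.0, 0.2) for _ in range(2000)]
drifted = inject_gradual_drift(clean, start=1000, total_offset=3.0)

clean_hit = first_drift_index(clean, baseline_len=500, window=100, threshold=3.0)
drift_hit = first_drift_index(drifted, baseline_len=500, window=100, threshold=3.0)
```

With these settings the unmodified series is not flagged, while the manipulated series is flagged some time after the ramp begins. The delay reflects why gradual manipulation is harder to detect than a sudden change. On real data, seasonal drift would also move the window mean, so a fixed baseline such as this one would need to be re-estimated periodically.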

Objective
Real-world datasets tend to be non-stationary because their distribution changes over time. In smart building applications, environmental changes frequently cause anomalous readings and change the trend of the data being streamed. In addition, the low-cost hardware that is standard in environmental sensing, together with security gaps in IoT networks, leaves the streamed data open to malicious attacks.

Data Description
We provide the data as CSV files, named according to timestamps. The file tree is shown in Table 1. Every week, a new folder is created automatically to store all the data files of the completed week. Every data file is named using its date of creation and, when a file reaches a size of 1MB, it is compressed and archived. Each data file includes 4 columns that represent the time (Unix epoch) at which the data was collected, the device ID, the sensor type and the measured value. Statistical values of the dataset, such as minimum, maximum and standard deviation, are shown in Table 2.
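Summary statistics of the kind mentioned above can be computed per device and sensor type with a few lines of Python. The sketch below is only an illustration: the rows are made up, the function name `summarise` is hypothetical, and it uses the sample standard deviation, which may or may not match the convention used for the published table.

```python
import statistics
from collections import defaultdict


def summarise(rows):
    """Group (device, sensor, value) rows and return min, max and std per group."""
    groups = defaultdict(list)
    for device, sensor, value in rows:
        groups[(device, sensor)].append(value)
    return {
        key: {
            "min": min(vals),
            "max": max(vals),
            # Sample standard deviation; defined as 0.0 for a single value.
            "std": statistics.stdev(vals) if len(vals) > 1 else 0.0,
        }
        for key, vals in groups.items()
    }


# Made-up rows, not taken from the dataset.
rows = [
    ("A", "temperature", 21.0),
    ("A", "temperature", 22.0),
    ("A", "temperature", 23.0),
    ("B", "humidity", 40.0),
]
summary = summarise(rows)
```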
The collected data is visualised in Fig. 1 and Fig. 2. The data is collected every 10 seconds for each device and each of its integrated sensors. However, some data is missing in May and June due to a power cut, as shown in Fig. 2.
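Because the nominal sampling interval is 10 seconds, missing stretches such as the one caused by the power cut can be located by looking for unusually large spacings between consecutive timestamps. The following is a minimal sketch assuming the timestamps of one device and sensor are available as a sorted list of epoch seconds. The function name `find_gaps`, the tolerance factor and the sample timestamps are hypothetical.

```python
SAMPLE_PERIOD_S = 10  # nominal sampling interval stated in the text


def find_gaps(epochs, period=SAMPLE_PERIOD_S, tolerance=1.5):
    """Return (gap_start, gap_end, missing_samples) for each gap.

    `epochs` must be sorted epoch seconds. A gap is any spacing between
    consecutive timestamps larger than tolerance * period.
    """
    gaps = []
    for prev, cur in zip(epochs, epochs[1:]):
        delta = cur - prev
        if delta > tolerance * period:
            missing = round(delta / period) - 1
            gaps.append((prev, cur, missing))
    return gaps


# Made-up timestamps: three samples are absent between ...820 and ...860.
timestamps = [1646128800, 1646128810, 1646128820, 1646128860, 1646128870]
gaps = find_gaps(timestamps)
```

The tolerance factor allows for small timing jitter, so that a sample arriving slightly late is not counted as a gap.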

Experiment overview
To collect real-world data, we deployed an end-to-end IoT network in the Communication Systems & Networks (CSN) Research Lab at the University of Bristol. The lab is actively used by a significant number of academic personnel and students. The number of occupants varies from day to day between 0 and 28. The lab is located on the second floor and is therefore exposed to environmental changes such as seasonal temperature, humidity and light fluctuations. Furthermore, the endpoints are placed at different locations in the lab, as shown in Fig. 3, so that the collected data reflects the differences between areas. The network consists of eight stationary, severely resource-constrained IoT endpoints, an additional device acting as the "edge", and a server for data collection and for controlling the experiment. Each IoT endpoint hosts sensors providing temperature, humidity, pressure, gas, accelerometer, and light readings. We collected two additional pieces of information: the measurements' accuracy value, calculated by the environmental sensors, and the received signal strength indicator (RSSI) [9]. The measurements are sampled periodically, every 10 seconds, and sent from the endpoints to the edge device. The experiment started in February 2022 and has been collecting data since. We provide the data collected until September 2022, covering a six-month period. We stored the sensor readings on the cloud server in CSV file format using an application we developed. In our analysis, only four sensor readings (gas, humidity, temperature, and light) were used to illustrate our dataset. We also provide the histograms and time series of the dataset to show the distribution of sensor readings, as can be seen in Fig. 1 and Fig. 2.
Each endpoint device of the network is a data-collecting unit and consists of a Nordic nRF52840 DK board [10] and the following sensors: (1) "ISL29125" Light Sensors: Collects intensity of the light as in Fig. 5.
Each endpoint is identified using both its MAC address and a unique identifier provided by the vendor of the DK board. To easily locate every sensor deployed in the network, a map of the devices has been created, as shown in Fig. 4, where we report only the last digits of the identifiers. If a device fails, we can easily find it in the office rooms.
In every endpoint device, the DK board and the sensors are connected using a breadboard. Communication is implemented using the I2C interface, where the DK board acts as the master and the three sensors act as slaves.
The endpoints are connected in a mesh topology, where the destination of the endpoints' data traffic is a device acting as the edge of the network. To enable communication between the endpoints and the edge of the network, we deployed the Contiki-NG operating system [11] on the endpoints' DK boards. This provides a full-stack implementation for forming mesh networks using the IEEE 802.15.4 Time Slotted Channel Hopping (TSCH) MAC protocol [12], an IPv6 network layer and a UDP transport layer. The adoption of TSCH provides an effective solution to avoid interference and obtain healthy continuous data, as shown in Fig. 7.
The device used as the edge of the mesh network is an UMBRELLA edge [5], equipped with a Nordic nRF52840 SoC and a Raspberry Pi, as illustrated in Fig. 6. The Contiki-NG border router implementation has been deployed on the nRF52840 SoC. In this configuration, data is received by the nRF52840 SoC using the IEEE 802.15.4 communication standard and transferred to the Raspberry Pi.

Experiment Control and Monitoring
On the Raspberry Pi acting as the edge, we run a series of software services, implemented in Python, that provide three functions:
1) Control of the experiment
2) Monitoring of the experiment
3) Data file format and storage
The control function communicates with the connected nRF52840 SoC, extracting the data originated by the endpoints. The monitoring function verifies that all the endpoints are sending sensor data correctly. When an endpoint failure is detected, it is reported together with the date of the failure and the identifier of the endpoint. Finally, the data file format and storage function is responsible for writing the received data to text files in Comma-Separated Values (CSV) format. Moreover, the files are periodically transferred to the server, so that they can be accessed via a cloud service.
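The monitoring function described above can be illustrated with a short sketch. The authors' services are not reproduced here. The code below only assumes that the time of the most recent message from each endpoint is tracked as epoch seconds, and that an endpoint is considered failed after a fixed period of silence. The function name `find_silent_endpoints`, the 60-second timeout and the identifiers are hypothetical.

```python
from datetime import datetime, timezone


def find_silent_endpoints(last_seen, expected_ids, now, timeout_s=60):
    """Return (endpoint_id, last_seen_utc_or_None) for each silent endpoint.

    `last_seen` maps endpoint IDs to the epoch seconds of their latest message.
    An endpoint is reported if it was never seen or has been silent for
    longer than `timeout_s` seconds at time `now`.
    """
    silent = []
    for endpoint_id in sorted(expected_ids):
        seen = last_seen.get(endpoint_id)
        if seen is None:
            silent.append((endpoint_id, None))
        elif now - seen > timeout_s:
            silent.append(
                (endpoint_id, datetime.fromtimestamp(seen, tz=timezone.utc))
            )
    return silent


# Made-up state: "A" reported 5 s ago, "B" 105 s ago, "C" never reported.
expected = {"A", "B", "C"}
last_seen = {"A": 1646128800, "B": 1646128700}
report = find_silent_endpoints(last_seen, expected, now=1646128805)
```

Each entry in the result carries the endpoint identifier and the time it was last heard from, which corresponds to the failure date and identifier mentioned in the text.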

Ethics Statements
The devices were deployed in a university lab space. Access to the area was limited to university students and faculty members. Care was taken to protect individual privacy. The data collection experiment was authorised in accordance with the University of Bristol's research ethics approval process (application reference 10145). The dataset does not contain any personally identifiable information.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Multi-sensor, Multi-device Smart Building Indoor Environmental Dataset (Original data) (University of Bristol Data.Bris Research Data Repository).