Dataset for Bluetooth 5.1 Direction of Arrival with non Uniform Rectangular Arrays

This paper presents a dataset for Bluetooth 5.1 direction of arrival (DoA). The dataset was generated with a specifically designed mathematical model of a non-uniform rectangular antenna array. The Python source files that generated the dataset are also provided. The dataset was conceived as a starting point for developing and validating DoA algorithms for real-life scenarios. Unlike other datasets, it contains Bluetooth signals affected not only by additive white Gaussian noise of varying intensity, but also by coherent interfering signals with random DoA coordinates. The dataset is divided into two branches, the first consisting of pure sinusoidal tones and the second of baseband Bluetooth signals. Since the codebase which generates the data is included, this dataset has a high reuse potential, and it can also be adapted to other types of signals or different array topologies.


Specifications
Subject: Computer Networks and Communications
Specific subject area: Telecommunications and Bluetooth signals
Type of data: Dataset files: *.csv files. Source code for dataset generation: *.py files
How data were acquired: Mathematical simulation of complex signals. The Python source files that generated the data are included.
Data format: Raw
Parameters for data collection: Useful signal azimuth and elevation (degrees), useful signal to noise ratio (SNR) (dB), useful signal to interfering signals ratio (SIR) (dB), frequency offset between useful signal and interfering signals (interf. offset) (MHz), residual Bluetooth receiver offset (res. offset) (kHz)
Description of data collection: Data was generated with a mathematical model of the signal and of the rectangular antenna array. The simulated signal is a baseband Bluetooth 5.
Please note: you need to collect all four parts in one folder. You can then use an extraction software of your choice, e.g. 7-Zip ( https://www.7-zip.org/ ), to recreate the entire directory structure of the dataset.

Value of the Data
This dataset simulates the interaction of Bluetooth 5.1 signals with the URA8 array topology. The latest Bluetooth 5.1 standard has introduced important features to aid direction-finding solutions [1] . URA8 is an antenna array composed of eight elements equally spaced on a square edge. This type of array is thoroughly analyzed in [2] and is shown in Fig. 1 . The mathematical model of the array is presented in detail in Section 3.1 .
• The dataset described in this paper is useful because it provides a valuable tool for the development and validation of DoA algorithms for Bluetooth 5.1 signals. It contains coherent interference with respect to the useful signal, which is always present in real-life scenarios.
Since DoA algorithms are usually tested and compared on sinusoidal signals, both sinusoidal signals and Bluetooth signals are included in this dataset. Moreover, the dataset provides the signals in the form of IQ samples, which is how modern Bluetooth 5.1 devices output them.
Fig. 1. Geometrical configuration of the non-uniform rectangular array with 8 patch elements.
• Anyone who develops Bluetooth 5.1 tracking applications can benefit greatly from this dataset, because the simulated data are comparable to those obtained through actual measurements. The data have been compared with real-world IQ samples generated by an STMicroelectronics transceiver prototype, fully compliant with the Bluetooth 5.1 standard. This prototype has been equipped with an eight-sensor patch antenna array, with the same topology as the mathematical model of the array shown in Fig. 1 and described in Section 3.1 .
Experimental results have shown that the measured IQ samples match the signals modeled in the presented dataset with a signal to noise ratio (SNR) of 60 dB and with a signal to interference ratio (SIR) below 60 dB, showing a mean error equal to 0 and a standard deviation of the error around 1%. The non-zero standard deviation is due to a residual, stochastic, uncompensated frequency offset on the measured IQ samples.
• By including the Python codebase that generates the dataset alongside the dataset itself, the data can be modified and expanded at will. For example, more than two interfering signals could be added, the topology of the array could be modified, or the number of simulated measurements could be increased. The latter aspect is especially valuable when considering the training of artificial neural networks, which require large quantities of data to generalize appropriately.

Data Description
The dataset consists of .csv data, along with the .py Python source files that are used to generate the data. It is divided into two branches, each one corresponding to the type of signal which interacts with the rectangular array. The first branch is devoted to radio frequency (RF) sinusoidal signals, while the second one contains Bluetooth 5.1 signals. Each branch is self-contained within its own folder. The content of the PureTone_data/ and BLE_5_1_data/ folders is described in Sections 2.1 and 2.2 respectively. The Python code that was used to generate the dataset, contained in folders PureTone_code/ and BLE_5_1_code/ , is described in Section 3.3 .

RF 2.4GHz pure tone branch
The directory structure is shown below. The data is organized in a series of simulated tests, each contained within its own folder, whose name always begins with Test_*/ . Each test folder then contains $N_\phi \cdot N_\theta$ sub-folders, where $N_\phi$ and $N_\theta$ are the number of possible azimuth and elevation angles of the useful signal, respectively. The description of every dataset parameter is reported in Table 1 , and is valid for both dataset branches. The range of each PureTone dataset branch parameter is reported in Table 2 .
Table 1 (fragment): azimuth and elevation angle of the useful signal; azimuth and elevation angle of the first interfering signal; azimuth and elevation angle of the second interfering signal.
There are three types of .csv files in each sig_azim<φ_i>_elv<θ_i> folder, all of them UTF-8 encoded and comma-separated. The following description is valid for both dataset branches:
• doares_saz<φ_i>_sel<θ_i>.csv files contain the DoA of the useful signal, estimated by the Multiple signal classification algorithm (MUSIC), which is described in Section 3.2 . They are meant to serve as a ground truth for performance comparison with other DoA algorithms. An example of such a file is shown in Table 3 . The column headers and their abbreviations correspond to the parameters described in Table 1 . The results obtained by MUSIC are contained in the res_azim (resulting azimuth) and res_elev (resulting elevation) columns.
• The second type of .csv files is useful when one does not want to rely on the hierarchy of the directory tree to read the key dataset parameters. Their generic structure is shown in Table 4 . They contain the useful signal angular position range (minimum, maximum, and step), the position of the useful signal, the frequency offset <a_i> , the signal to noise ratio <d_i> and the first interfering signal to useful signal ratio <b_i> . The parameters correspond to those described in Table 1 .
• X_m<n_i>_*.csv files contain the 8 × 70 complex-valued IQ matrices $X_{IQ}$, which result from the interaction of the signals and the rectangular antenna array. These files do not contain any header. Each 8 × 1 column is a complex-valued IQ vector $x_{IQ}^T$. The complex numbers are in the format shown in Table 5 .
Table 3 Example of a doares_saz<φ_i>_sel<θ_i>.csv file.
Table 5 Example of a complex-valued IQ vector $x_{IQ}^T$ contained in X_m<n_i>_*.csv files.
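As an illustration of how the header-less X_m<n_i>_*.csv files described above might be read, the following Python sketch parses a comma-separated grid of complex numbers into a NumPy matrix. It is not part of the shipped codebase. The function names `parse_iq_matrix` and `load_iq_matrix` are hypothetical, and the sketch assumes that each cell uses Python's own complex notation, such as `(1+2j)`; the actual cell format is the one given in Table 5 and may require a different conversion.

```python
import csv
import io

import numpy as np


def parse_iq_matrix(text_stream):
    """Parse header-less, comma-separated complex values into a 2-D array.

    Assumption: every cell is written in Python complex notation,
    e.g. "(1+2j)" or "0.5-1j". Adapt the conversion if Table 5 differs.
    """
    rows = []
    for row in csv.reader(text_stream):
        if not row:  # skip empty lines
            continue
        rows.append([complex(cell.strip()) for cell in row])
    return np.array(rows, dtype=complex)


def load_iq_matrix(path):
    """Open a UTF-8 .csv file and return its content as a complex matrix."""
    with open(path, newline="", encoding="utf-8") as handle:
        return parse_iq_matrix(handle)


# Small in-memory demonstration with a 2 x 2 matrix (hypothetical values).
demo = parse_iq_matrix(io.StringIO("(1+2j),(0.5-1j)\n(-3+0j),(2.5e-1+4j)\n"))
print(demo.shape)
```

For a real dataset file, `load_iq_matrix(path)` would be expected to return an array of shape (8, 70), one row per antenna element.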

Baseband Bluetooth 5.1 signal branch
The directory structure is very similar to that of the PureTone branch and is reported below. The difference lies in the greater number of parameters, due to the presence of the residual frequency offset and the constant tone extension (CTE), which are typical of Bluetooth 5.1 signals. Table 1 provides a description of each parameter, while Table 6 reports the value range that can be assumed by each parameter. The three types of .csv files within each sig_azim<φ_i>_elv<θ_i> folder are identical to those of the PureTone dataset branch.
Table 6 Dataset parameters value range for the BLE_5_1 branch (columns: Parameter, Value range, Measure unit).
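Since both branches encode the DoA of the useful signal in the sub-folder names, a reader may want to recover the angles directly from the directory tree. The sketch below shows one possible way to do so. The names `parse_doa_folder` and `iter_doa_folders` are hypothetical, and the regular expression assumes that the angles are written as plain decimal numbers after `sig_azim` and `_elv`, which should be checked against the actual folder names.

```python
import re
from pathlib import Path

# Assumed naming scheme: sig_azim<azimuth>_elv<elevation>, plain decimal numbers.
_DOA_FOLDER = re.compile(
    r"sig_azim(?P<azim>-?\d+(?:\.\d+)?)_elv(?P<elev>-?\d+(?:\.\d+)?)$"
)


def parse_doa_folder(name):
    """Return (azimuth, elevation) in degrees, or None if the name does not match."""
    match = _DOA_FOLDER.search(name.rstrip("/"))
    if match is None:
        return None
    return float(match["azim"]), float(match["elev"])


def iter_doa_folders(root):
    """Yield (path, azimuth, elevation) for every matching sub-folder under root."""
    for path in sorted(Path(root).rglob("sig_azim*")):
        angles = parse_doa_folder(path.name) if path.is_dir() else None
        if angles is not None:
            yield path, angles[0], angles[1]


print(parse_doa_folder("sig_azim40_elv30"))
```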

Experimental Design, Materials and Methods
The dataset is based on a mathematical model of a non-uniform rectangular antenna array with 8 elements. The following sections provide a description of the mathematical model of the array, of the impinging signals, and of their Python implementation which is provided with the dataset.

Mathematical model
The array model corresponds to a rectangular array of $N = 8$ numbered antenna elements, evenly spaced at a distance $d = \lambda / 2.5$ on a square edge, as shown in Fig. 1 . Because of the lack of a central element, this array is non-uniform.
The rectangular array receives $M$ signals $s_i(n)$, incident with angles

$$(\theta_1, \phi_1), (\theta_2, \phi_2), \ldots, (\theta_m, \phi_m), \ldots, (\theta_M, \phi_M) \qquad (1)$$

where $\theta_m$ and $\phi_m$ are the elevation and azimuth angles of the $m$-th signal. The signals also have defined powers $P_1, P_2, \ldots, P_m, \ldots, P_M$. The array response matrix $A$ describes the interaction of the impinging signals with the rectangular antenna array. For the topology of the array shown in Fig. 1 , the array response matrix is

$$A = [\, a(\theta_1, \phi_1)^T, a(\theta_2, \phi_2)^T, \ldots, a(\theta_M, \phi_M)^T \,] \qquad (2)$$

where the array response vectors $a(\theta, \phi)$ are defined in Eq. 3, with $d = \lambda / 2.5$ and

$$\gamma(\theta) = 2 \pi \sin(\theta) \qquad (4)$$

As mentioned in Section 2 , the dataset is subdivided into two branches. For the Bluetooth 5.1 dataset branch, the signals $s_i(n)$ are a baseband model of the signals at the analog to digital converter (ADC) of the Bluetooth receiver. Apart from the frequency offset, the carrier translation and the RF impairments are negligible for the study of the DoA. Regarding the frequency offset between the transmitter and the receiver, the majority of the offset is corrected by the automatic frequency corrector (AFC). However, a small part of this offset typically remains, which we indicate as $f_{off}$. This residual offset can negatively affect the estimation of the DoA.
Our mathematical model of BLE 5.1 signals is that of binary Gaussian frequency shift keying (GFSK) modulated signals (Eqs. 5 to 7), where $DR$ is the data rate, $f_{off}$ is a possible frequency offset error, $F_s = 16$ MHz is the sampling frequency at the receiver's ADC, $n$ represents the discrete-time index, $h$ is the modulation index, $p(n)$ is the symbol pulse as defined in [3] , and $b_i \in \{1, -1\}$ is the binary symbol to be transmitted. As a mathematical model for the baseband CTE, Eq. 5 with all binary symbols $b_i$ equal to 1 is used.
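To make the GFSK description above more concrete, the following Python sketch generates a baseband GFSK signal with a residual frequency offset. It is a generic textbook construction, not the code shipped with the dataset, and it does not claim to reproduce Eqs. 5 to 7 term by term. The function name `gfsk_baseband`, the bandwidth-time product of 0.5, the four-symbol filter span, and the default values $h = 0.5$ and $DR = 1$ Mbit/s are assumptions chosen as typical Bluetooth Low Energy settings; only $F_s = 16$ MHz is taken from the text.

```python
import numpy as np


def gfsk_baseband(bits, data_rate=1e6, fs=16e6, h=0.5, bt=0.5, f_off=0.0):
    """Return complex baseband GFSK samples for symbols in {+1, -1}.

    Assumed parameters (not from the dataset code): bt, the filter span,
    and the default data_rate and h.
    """
    sps = int(round(fs / data_rate))                  # samples per symbol
    nrz = np.repeat(np.asarray(bits, dtype=float), sps)

    # Gaussian pulse-shaping filter, normalised to unit DC gain.
    sigma = np.sqrt(np.log(2.0)) / (2.0 * np.pi * bt) * sps
    taps = np.arange(-2 * sps, 2 * sps + 1)
    gauss = np.exp(-0.5 * (taps / sigma) ** 2)
    gauss /= gauss.sum()
    shaped = np.convolve(nrz, gauss, mode="same")

    # Instantaneous frequency: peak deviation h * DR / 2 plus residual offset.
    f_inst = 0.5 * h * data_rate * shaped + f_off
    phase = 2.0 * np.pi * np.cumsum(f_inst) / fs
    return np.exp(1j * phase)


# CTE-like signal: all binary symbols equal to 1, as stated in the text.
cte = gfsk_baseband(np.ones(40))
f_mid = np.angle(cte[321] * np.conj(cte[320])) * 16e6 / (2.0 * np.pi)
print(round(f_mid))
```

With all symbols equal to 1 the output settles to a constant-frequency tone at $h \cdot DR / 2$ above the carrier, which is why the CTE can be treated as a sinusoid by DoA algorithms.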
The baseband samples at the Bluetooth receiver, running at a proper sample period $T_s$ without ADC impairments, can be expressed as in Eq. 8, where $s_1, s_2, \ldots, s_M$ are the complex impinging signals, generated by using Eq. 5 with a sampling period $T_s$, and $n = [n_1, n_2, \ldots, n_8]^T$ is the noise vector at the ADC, with zero mean and variance $\sigma^2$, which is related to the SNR of the useful tag signal. The noise vector consists of additive white Gaussian noise (AWGN), which approximates the noise of the receiver, and does not take into account the interfering signals. When one of the $s_i$ signals is a CTE, all binary symbols are set to 1, otherwise they are selected randomly. Considering the IQ sampling process of the Bluetooth receiver and calling $T_{switch}$ and $T_{sample} = 1/F_s$ the switch slot duration and the sampling slot duration respectively, both set to 1 μs or 2 μs, the $X_{T_s}$ data are down-sampled by a factor of $\mathrm{round}((T_{switch} + T_{sample}) / T_s)$, thus obtaining the IQ samples in the form of $X_{IQ}$, which is a complex matrix of size $N \times N_{samp}$.
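The receive chain described above (array response, additive noise, down-sampling) can be sketched in a few lines of Python. The sketch is illustrative only. It assumes the common linear model in which the array output is the array response matrix times the stacked signals plus noise, it assumes that the SNR is referred to the first (useful) signal, and it uses a random placeholder matrix instead of the URA8 response of Eq. 3. The function name `simulate_iq` is hypothetical.

```python
import numpy as np


def simulate_iq(A, s, snr_db, t_switch=1e-6, t_sample=1e-6, t_s=1 / 16e6, rng=None):
    """Return down-sampled IQ samples for array response A (N x M) and signals s (M x L).

    Assumptions: linear model A @ s + n, complex AWGN, SNR referred to s[0].
    """
    rng = np.random.default_rng() if rng is None else rng
    p_useful = np.mean(np.abs(s[0]) ** 2)
    sigma2 = p_useful / 10.0 ** (snr_db / 10.0)        # noise variance per antenna
    shape = (A.shape[0], s.shape[1])
    noise = np.sqrt(sigma2 / 2.0) * (
        rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    )
    x_ts = A @ s + noise                               # samples at period t_s
    factor = int(round((t_switch + t_sample) / t_s))   # down-sampling factor
    return x_ts[:, ::factor], factor


rng = np.random.default_rng(0)
# Placeholder array response with unit-modulus entries (not the URA8 model).
A_demo = np.exp(2j * np.pi * rng.random((8, 3)))
n = np.arange(2240)
s_demo = np.exp(2j * np.pi * np.outer([0.010, 0.012, 0.015], n))
x_iq, factor = simulate_iq(A_demo, s_demo, snr_db=20.0, rng=rng)
print(x_iq.shape, factor)
```

With 1 μs switch and sampling slots and a 16 MHz sample rate, the down-sampling factor is 32, so 2240 input samples yield the 8 × 70 matrix size used in the dataset files.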
Because of the filtering chain, the power of the adjacent channels, alternate channels and all remaining channels is strongly attenuated in comparison to the reflections of the useful signal. This means that, in the BLE 5.1 model, all of the interfering signals are in fact reflections of the useful signal itself.
In the case of the pure tone dataset branch, the signals $s_i(n)$ are defined as in Eq. 9, where $F_c = 2.4$ GHz, $ch$ is the channel spacing (which is an integer multiple of 2 MHz), and $F_s$ is the sampling frequency, set to $F_s = F_c \cdot 32$. In this model, we do not decimate the $s_i$ signals.
The IQ samples are obtained as in Eq. 10, where $n$ is the AWGN noise at the antenna elements. Please note that the noise $n$ added in the pure tone model is much more wideband than the noise $n$ of the Bluetooth model, because the BLE model is a baseband model, while the pure tone model is an RF model. Since the SNR depends on the bandwidth of the noise channel, the signal to noise ratios (SNRs) obtained with the two models are not comparable.
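The size of this bandwidth effect can be estimated with a short calculation. The sketch below assumes, purely for illustration, that the noise in both models has the same flat power spectral density, so that the noise power scales with the sampling frequency; this assumption is not stated in the dataset description. The function name `noise_power_ratio_db` is hypothetical.

```python
import math

FS_BLE = 16e6          # sampling frequency of the baseband BLE model (from the text)
FS_TONE = 2.4e9 * 32   # sampling frequency of the pure tone model, F_s = F_c * 32


def noise_power_ratio_db(bandwidth_a, bandwidth_b):
    """Ratio of noise powers in dB, assuming an identical flat noise density."""
    return 10.0 * math.log10(bandwidth_a / bandwidth_b)


ratio_db = noise_power_ratio_db(FS_TONE, FS_BLE)
print(f"{ratio_db:.2f} dB")
```

Under this assumption, the same nominal SNR would correspond to roughly 37 dB of difference in noise power between the two models, which illustrates why SNR values should not be compared across branches.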

The MUSIC algorithm
In addition to the modeled RF 2.4 GHz and Bluetooth 5.1 signals, this dataset provides a ground truth for the development and testing of different algorithms. For this purpose, the Multiple signal classification (MUSIC) algorithm [4] was implemented and applied to the simulated dataset signals. MUSIC is a signal subspace method for DoA estimation, which exploits the eigen-structure of the autocorrelation matrix of the signal to find the signal sub-space and the noise sub-space. The MUSIC pseudo-spectrum $P(\phi, \theta)$ is obtained as follows:

$$P(\phi, \theta) = \frac{1}{a^H(\theta, \phi)\, Q_n Q_n^H\, a(\theta, \phi)} \qquad (11)$$

where $a(\theta, \phi)$ are the array response vectors defined in Eq. 3 and $Q_n$ is the matrix whose columns are the eigenvectors of the noise sub-space. The DoA estimate of the source signal is then obtained by finding the peak of the pseudo-spectrum defined above.
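The procedure described above can be sketched in Python as follows. This is a minimal, self-contained illustration and not the implementation contained in algorithmics_URA8.py. It assumes a 3 × 3 grid without the central element, a spacing of 0.4 wavelengths, a hypothetical element numbering, and an elevation angle measured from the array normal; the names `POSITIONS`, `steering` and `music` are likewise hypothetical.

```python
import numpy as np

SPACING = 1.0 / 2.5  # element spacing in wavelengths
# 3 x 3 grid without the centre; the numbering order is an assumption.
POSITIONS = SPACING * np.array(
    [(x, y) for y in (-1, 0, 1) for x in (-1, 0, 1) if (x, y) != (0, 0)],
    dtype=float,
)


def steering(theta, phi):
    """Array response for angles in radians; returns an array of shape (..., 8)."""
    theta = np.asarray(theta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    ux = np.sin(theta) * np.cos(phi)
    uy = np.sin(theta) * np.sin(phi)
    phase = 2.0 * np.pi * (ux[..., None] * POSITIONS[:, 0] + uy[..., None] * POSITIONS[:, 1])
    return np.exp(1j * phase)


def music(x_iq, n_sources, theta_grid, phi_grid):
    """Return the pseudo-spectrum and the (theta, phi) of its peak."""
    cov = x_iq @ x_iq.conj().T / x_iq.shape[1]        # sample autocorrelation matrix
    _, vecs = np.linalg.eigh(cov)                     # eigenvalues in ascending order
    q_n = vecs[:, : x_iq.shape[0] - n_sources]        # noise sub-space eigenvectors
    tt, pp = np.meshgrid(theta_grid, phi_grid, indexing="ij")
    proj = steering(tt, pp).conj() @ q_n              # a^H Q_n for every grid point
    spectrum = 1.0 / np.sum(np.abs(proj) ** 2, axis=-1)
    i, j = np.unravel_index(np.argmax(spectrum), spectrum.shape)
    return spectrum, theta_grid[i], phi_grid[j]


# Synthetic check: one tone from elevation 30 deg and azimuth 40 deg, light noise.
rng = np.random.default_rng(0)
tone = np.exp(2j * np.pi * 0.05 * np.arange(70))
noise = 0.01 * (rng.standard_normal((8, 70)) + 1j * rng.standard_normal((8, 70)))
x_demo = np.outer(steering(np.deg2rad(30.0), np.deg2rad(40.0)), tone) + noise
theta_grid = np.deg2rad(np.arange(0, 91))
phi_grid = np.deg2rad(np.arange(0, 360))
spectrum, theta_hat, phi_hat = music(x_demo, 1, theta_grid, phi_grid)
print(np.rad2deg(theta_hat), np.rad2deg(phi_hat))
```

A grid search is the simplest way to locate the peak; its angular resolution is limited by the grid step, which is set to one degree in this example.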

Python implementation
Note on licensing. The Python implementation that is shipped with this dataset is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. If you make improvements to the code, you are invited to share those changes with the community.
File structure and dependencies. The Python implementation is written in Python 3 and is contained inside the *.py files described in the following paragraphs. The following dependencies are needed: os , time , numpy , scipy and matplotlib .
The following paragraphs describe the content of each Python module, with reference to the mathematical equations of the previous sections.
The algorithmic module. The algorithmics_URA8.py module is common to both branches of the dataset. It contains the definition of the conventional steering vector for URA8, as defined in Eq. 3, and the implementation of the MUSIC algorithm described in Section 3.2 . The module provides a function which generates the steering vector. The URA8 steering vector from Eq. 3 is also defined inside the MUSIC implementation, where the MUSIC pseudo-spectrum from Eq. 11 is then estimated.
The signal model modules. For the Bluetooth 5.1 branch, the signals are modeled inside the e2e_URA8_BLE.py module. This module implements the frequency deviation from Eq. 7, the baseband signals $s_i(n)$ with the frequency offset $f_{off}$ as defined in Eqs. 5 and 6 , the directions of the incident signals from Eq. 1, the array response vectors from Eq. 3, and the array response matrix from Eq. 8. For the PureTone branch, the signals are modeled inside the e2e_URA8_PureTone.py module, which implements the signals $s_i(n)$ from Eq. 9. A similar approach is used for the parameters of the PureTone dataset generator.
The testing modules. The e2e_URA8_BLE_DOAview.py and e2e_URA8_PureTone_DOAview.py modules are meant as an aid for setting the dataset parameters prior to the actual generation of the dataset. Their output is a 3D plot of the MUSIC pseudo-spectrum. Both modules make it possible to set the relevant parameters manually.