Dataset for polyphonic sound event detection tasks in urban soundscapes: The synthetic polyphonic ambient sound source (SPASS) dataset

This paper presents the Synthetic Polyphonic Ambient Sound Source (SPASS) dataset, a publicly available synthetic polyphonic audio dataset. SPASS was designed to train deep neural networks effectively for polyphonic sound event detection (PSED) in urban soundscapes. SPASS contains synthetic recordings from five virtual environments: park, square, street, market, and waterfront. The data collection process consisted of the curation of different monophonic sound sources following a hierarchical class taxonomy, the configuration of the virtual environments with the RAVEN software library, the generation of all stimuli, and the processing of this data to create synthetic recordings of polyphonic sound events with their associated metadata. The dataset contains 5000 audio clips per environment, i.e., 25,000 stimuli of 10 s each, virtually recorded at a sampling rate of 44.1 kHz. This effort is part of the project "Integrated System for the Analysis of Environmental Sound Sources: FuSA System" in the city of Valdivia, Chile, which aims to develop a system for detecting and classifying environmental sound sources through deep Artificial Neural Network (ANN) models.

Synthetic data were generated through acoustic virtual simulations using the room acoustics simulation framework RAVEN [1]. The SPASS dataset was built using multiple real monophonic sound examples. Those sounds were collected from public databases such as ESC-50 [2], UrbanSound8K [3], the Making Sense of Sound dataset [4], Audio Event Net [5], and data from the FreeSound website [6]. Then, the audio files were manually inspected and relabeled following a hierarchical urban sound event taxonomy.

Data format
Raw audio files in WAV format
Metadata in Comma-Separated Values (CSV) format

Description of data collection
Five urban soundscapes representing a market, street, park, plaza, and waterfront were simulated using the RAVEN software for virtual acoustic reality. A total of 25,000 10 s audio waveforms were generated. Each waveform simulates an omnidirectional microphone that records a set of randomly selected monophonic sound events convolved with (spatial) impulse responses from the simulated environments.

Data source location
• Institution: Universidad Austral de Chile

Value of the Data
• SPASS is a high-quality synthetic dataset with perfect strong labels and sufficient data volume to effectively train large machine learning models for sound event detection. By contrast, manually providing strong labels for large real audio datasets is a highly time-consuming, error-prone, and costly process.
• This data directly benefits researchers and engineers who require high-quality data to train a machine-learning model for urban sound event detection. There is also an indirect benefit to decision-makers and analysts using the models trained with SPASS to analyze urban soundscapes.
• SPASS was designed to pre-train a base model that can be fine-tuned to actual acoustic tasks related to urban soundscapes.
• The proposed methodology can synthesize a wide variety of soundscapes by providing an appropriate collection of monophonic sound events, e.g., natural, rural, and industrial soundscapes.
• The present dataset has potential use in transfer learning or fine-tuning processes with actual recordings of sound scenes (e.g., STARSS22, SINGA:PURA, or SONYC-UST) for the classification and detection of urban sounds.

Objective
To design and implement a methodology to synthesize a large polyphonic audio dataset with perfect strong labels to train machine learning models for urban sound event detection. This data article helps improve the reproducibility of our experiments and may motivate other groups to continue building on this foundation.

Data Description
The FuSA system [7] taxonomy shown in Table 1 corresponds to the urban sound events that comprise the SPASS dataset. The hierarchical taxonomy considers seven coarse-level categories: Humans, Music, Animals, Environmental, Mechanics, Vehicles, and Alerts. The 33 fine-level categories can also be seen in Table 1.
The following section explains how the RAVEN software was configured to simulate these environments. A total of 5000 10 s audio clips were created for each environment. Consequently, the SPASS dataset consists of 25,000 polyphonic recordings containing diverse urban sound sources. All 10 s audio clips are mono channel, in 32-bit floating-point WAV format, with a sampling rate of 44.1 kHz.
The metadata is a CSV table that includes the audio filename, taxonomic category, onset time (s), offset time (s), source location in xyz (m) coordinates, orientation of the recording microphone in xyz (m), and final location of moving sources in xyz (m). Note that "source location in xyz (m)" refers to the initial position of a moving sound source. The orientation of the recording microphone was meant for binaural recording; for this mono-channel dataset, that information is irrelevant since the microphone is omnidirectional (Table 2).
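As a minimal sketch of how the metadata could be consumed, the snippet below parses one row and derives the event duration from its onset and offset times. The column names used here are assumptions for illustration, not the dataset's actual header:

```python
import csv
import io

# Hypothetical header and row -- the real column names in the SPASS
# metadata files may differ from these.
SAMPLE = io.StringIO(
    "filename,class,onset,offset,src_x,src_y,src_z\n"
    "park_0001.wav,dog,1.25,3.80,12.0,7.5,1.0\n"
)

reader = csv.DictReader(SAMPLE)
for row in reader:
    # Event duration in seconds, derived from the strong labels.
    duration = float(row["offset"]) - float(row["onset"])
    print(row["filename"], row["class"], f"{duration:.2f} s")
```

The same pattern extends directly to the per-environment label files, with one row per sound event occurrence.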
The audio recordings are available in .7z format in the Zenodo open repository for download. The dataset consists of 5 audio folders (.7z format) containing all audio recordings, 5 files with the corresponding labels of the audio folders (.csv format), and 5 files with the probability distribution of each sound event (.xlsx format).
Fig. 1 shows an example of the audio data with the labels superimposed. Each audio clip contains different sound events that are spread randomly in time and may overlap.

Experimental Design, Materials and Methods
All the monophonic sound sources were normalized to [−1, 1] before simulation in the virtual scene. The location-related information was defined in Cartesian coordinates. Different sound absorption coefficients were defined depending on the material of the surfaces. These coefficients were taken from RAVEN's material database.
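The normalization step above is simple peak normalization: every waveform is scaled so its largest absolute sample reaches 1. A minimal sketch (function name and zero-signal handling are our own choices, not from the paper):

```python
def peak_normalize(samples):
    """Scale a waveform so its peak magnitude is 1, i.e. into [-1, 1]."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        # A silent signal stays silent; avoid division by zero.
        return list(samples)
    return [s / peak for s in samples]
```

Because every source is normalized this way before convolution, the relative level of each event in the final mix is governed only by its simulated distance from the virtual microphone.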
RAVEN (Room Acoustics for Virtual Environments) is a room acoustics simulation environment. It performs acoustic simulations in user-defined scenes where impulse responses (IRs) can be generated for different positions within the simulated virtual environment. RAVEN is freely available for academic purposes [8]. Each of the five environments was designed differently. In what follows, we indicate the physical design of each environment:
• Market: A cube 80 m long, 50 m wide, and 80 m high, with a street at the end of one of the wide sides. Small buildings that simulate houses were added on the edge of both long sides and treated as surfaces with different sound absorption coefficients following RAVEN's material database. The floor and buildings were simulated as hard reflective surfaces. All other boundaries were simulated with a sound absorption coefficient close to one, thus creating a free-field acoustic environment.
• Park: A cube 80 m long, 80 m wide, and 80 m high, with a street at the end of one side. Open space with no buildings around. The floor was considered acoustically soft, similar to grass. All other boundaries were simulated with a sound absorption coefficient close to one, thus creating a free-field acoustic environment.
• Square: A cube 80 m long, 80 m wide, and 80 m high, with two opposite streets at the ends of the wide sides. Open space with no buildings around. The floor was simulated as a hard reflective material such as concrete or paving stone. All other boundaries were simulated with a sound absorption coefficient close to one, thus creating a free-field acoustic environment.
• Street: A cube 80 m long, 20 m wide, and 80 m high, with two sidewalks around a street and buildings on the edge of both long sides. The buildings were simulated as surfaces with different sound absorption coefficients following RAVEN's material database. The floor and buildings were simulated as made of a hard reflective material. All other boundaries were simulated with a sound absorption coefficient close to one, thus creating a free-field acoustic environment.
• Waterfront: A cube 80 m long, 80 m wide, and 80 m high, with a street at the end of the long sides and a river/wave area on the opposite side of the street. A pedestrian walkway was located between these two areas. The floor was considered to be made of a hard reflective material such as concrete. The river/wave surface was considered highly reflective, which is characteristic of a sound wave traveling through air and meeting water. All other boundaries were simulated with a sound absorption coefficient close to one, thus creating a free-field acoustic environment.
As mentioned in the previous section, every environment has its unique set of sound sources. Additionally, the probability of appearance of each sound event differs between environments. For example, the likelihood of finding a "dog" was higher in the park environment than that of a "braking" event. The number of events of a given category and environment is randomly selected by drawing from a Poisson probability distribution with the rate parameter (λ) specified in Table 3. The maximum number of occurrences is also limited, as shown in the table.
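The sampling step described above can be sketched as a capped Poisson draw. The λ values would come from Table 3; the cap used here is a placeholder, and the Knuth-style sampler is our own choice of implementation, not necessarily the one used by the authors:

```python
import math
import random

def sample_event_count(lam, max_count, rng=None):
    """Draw an event count from Poisson(lam) using Knuth's method,
    then cap it at max_count, as the paper limits occurrences."""
    rng = rng or random.Random()
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p < threshold:
            break
        k += 1
    return min(k, max_count)
```

For each environment and category, one such draw fixes how many instances of that sound class are placed into a given 10 s clip.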
A computer program was implemented in Matlab and interfaced with RAVEN to generate the impulse responses (IRs) for the virtual environments. The location of the IRs within the simulated geometry was selected randomly with a uniform distribution and the following additional constraints and assumptions:
• The sound events are located at least 2 m from the recording microphone and less than 1 m from the geometric boundaries of the scene.
• The motion simulations are limited to linear motions.
• An airborne vehicle is perceived as motionless from the recording point of view.
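The uniform placement with a minimum microphone distance can be sketched with rejection sampling. The room dimensions, microphone position, and the interpretation of the boundary constraint as a 1 m inward margin are all assumptions for illustration:

```python
import math
import random

def sample_source_position(room=(80.0, 50.0, 80.0),   # assumed scene size (m)
                           mic=(40.0, 25.0, 1.5),     # assumed mic position (m)
                           min_mic_dist=2.0, margin=1.0, rng=None):
    """Rejection-sample a source position: uniform within the scene,
    at least min_mic_dist from the microphone, keeping a margin
    from every boundary (our reading of the paper's constraint)."""
    rng = rng or random.Random()
    while True:
        pos = tuple(rng.uniform(margin, d - margin) for d in room)
        if math.dist(pos, mic) >= min_mic_dist:
            return pos
```

Because the exclusion sphere around the microphone is tiny relative to the scene volume, the rejection loop almost always succeeds on the first draw.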
After randomly selecting a monophonic sound event and its location, the sound event is convolved with the simulated IR. This process imparts all the spatial characteristics to the monophonic signal. All monophonic signals are normalized; therefore, the intensity of each sound event within the simulated environment depends solely on its distance from the virtual recorder. The convolved sound signal is then inserted into a 10 s container waveform at a random onset drawn from a uniform distribution on [0, 10−t], where t is the duration (in seconds) of the sound signal. The start and end times and the spatial position of each sound event are automatically recorded in Cartesian coordinates in the metadata file. This process is repeated according to the probability of occurrence of each sound event, which in turn is associated with the taxonomic class. This procedure results in a polyphonic audio file with start and end time tags of all included sound events and their locations in space.
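The insertion step above can be sketched as follows: the onset is drawn uniformly so the whole event fits inside the 10 s container, and overlapping events simply sum. Function and variable names are our own; the sampling rate matches the dataset's 44.1 kHz:

```python
import random

SR = 44_100           # dataset sampling rate (Hz)
CLIP_LEN = 10 * SR    # 10 s container in samples

def insert_event(container, event, rng=None):
    """Mix an event waveform into the container at a uniformly random
    onset in [0, 10 - t] seconds, where t is the event duration.
    Returns the onset time in seconds, as recorded in the metadata."""
    rng = rng or random.Random()
    latest_start = len(container) - len(event)
    start = rng.randrange(latest_start + 1)
    for i, s in enumerate(event):
        container[start + i] += s   # polyphony: overlapping events add up
    return start / SR
```

Repeating this for every drawn event yields one polyphonic clip together with the onset/offset labels.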

Ethics Statements
This work did not include research on humans, animals, or data collected from social media platforms. We confirmed that the data distribution policies of the primary data sources used to construct SPASS were complied with.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table 1
FuSA system's taxonomy for urban sound events.

Table 2
Metadata labels with examples.

Table 3
Rate (λ) parameter of the Poisson probability distribution for each sound event. The symbol "-" indicates that the class is not considered for that category.