The Ambisonic Recordings of Typical Environments (ARTE) Database

Everyday listening environments are characterized by far more complex spatial, spectral and temporal sound field distributions than the acoustic stimuli that are typically employed in controlled laboratory settings. As such, the reproduction of acoustic listening environments has become important for several research avenues related to sound perception, such as hearing loss rehabilitation, soundscapes, speech communication, auditory scene analysis, automatic scene classification, and room acoustics. However, the recordings of acoustic environments that are used as test material in these research areas are usually designed specifically for one study, or are provided in custom databases that cannot be universally adapted beyond their original application. In this work we present the Ambisonic Recordings of Typical Environments (ARTE) database, which addresses several research needs simultaneously: realistic audio recordings that can be reproduced in 3D, 2D, or binaurally, with known acoustic properties, including absolute level and room impulse response. Multichannel higher-order ambisonic recordings of 13 realistic typical environments (e.g., office, café, dinner party, train station) were processed, acoustically analyzed, and subjectively evaluated to determine their perceived identity. The recordings are delivered in a generic format that may be reproduced with different hardware setups, and may also be used in binaural or single-channel setups. Room impulse responses, as well as detailed acoustic analyses, of all environments supplement the recordings. The database is made open to the research community with the explicit intention to expand it in the future and include more


Motivation
Over the last two decades, there has been a growing interest in studying human hearing in complex acoustic environments that better represent listening situations experienced in real life (e.g., [89,21,23,31,42,55,76,65,87]). This typically involves using a plurality of sound sources arriving from different distances and directions around the listener, combined with reverberation and ambient noise. Reproducing such environments is particularly informative for the study of hearing devices, since their performance is ultimately assessed by their ability to improve speech perception in noisy real-world environments. However, both speech and noise vary dramatically between the clinic and the real world, to the extent that it is often impossible to predict real-world performance of those devices given the clinical data alone. Moreover, even when data is available about performance in more complex acoustic scenarios, it is often unclear how to generalize these results, and it may be impossible to replicate them. This is because realistic test material in research is collected and reproduced with test-specific requirements, which are usually too narrow to be useful in other fields of research. For example, these may include particular speech-in-noise material, reproduced using an arbitrary loudspeaker arrangement, or designed to test a specific signal processing algorithm (e.g., a directional beamformer of a hearing aid). In addition, these studies often contain technology and materials that are not publicly available.
One way to enable reproducibility and offer tighter experimental control is to use a shared corpus of acoustic scenes, which could be played back in different laboratories on different sound reproduction systems. In this paper, the Ambisonic Recordings of Typical Environments ings, a microphone array was employed in combination with the HOA method to allow the reproduction of real-world scenes using appropriately positioned loudspeaker arrays, as well as binaural playback over headphones. By using loudspeaker-based reproduction, subjects can utilize their individual spatial cues, including head movements, and hearing devices can be more easily integrated.
Utilizing realistic listening environments in hearing research may affect outcome measures in different ways. For instance, the ability to understand speech and to communicate with others will be affected by the noise and reverberation that is introduced by the given environment [54]. In particular, the temporal, spectral, and spatial variability of the noise may change the instantaneous SNR of the speech signal that is received at the two ears [31]. Normal-hearing listeners can take advantage of these SNR fluctuations in speech intelligibility either by within-ear glimpsing [24] or better-ear glimpsing [22]. Additional benefit may be provided by the binaural auditory system utilizing interaural time difference cues to spatially unmask the target speech (e.g., [29]). However, the benefit of any of these mechanisms may be reduced by the presence of room reverberation [50]. In addition to these signal energy-related mechanisms, non-target talkers in the environment can also impair intelligibility by distracting the listener, and thereby challenge cognitive mechanisms such as selective attention. Even though these informational masking effects have been widely studied in the laboratory [48,47], their real-life relevance is unclear (e.g., [85]). Given the complexity of all these environment-specific acoustic factors together with the limited understanding of their combined effect on hearing outcomes, their accurate reproduction in the laboratory is important, in particular for assessing functional hearing abilities in hearing-impaired listeners.
Hearing-impaired listeners are considerably more susceptible than normal-hearing listeners to environmental effects on speech reception (e.g., [40]), which may be either due to their reduced auditory sensitivity, frequency selectivity, or temporal acuity, or due to an age-related reduction in their cognitive function. Either way, the real-life auditory and cognitive mechanisms are currently not well addressed in laboratory-based speech-in-noise tests. In such tests, semi-anechoic target sentences are presented in masking noise that consists of speech-shaped broadband noise (e.g., [67]), speech babble, or speech noise (e.g., [17]), and is presented at a prescribed level from a few loudspeaker directions (e.g., [52,62,66]). The listener's score of correctly identified speech is then used to adapt the speech level (i.e., the SNR) until 50% intelligibility is achieved. The resulting speech level, or the speech reception threshold (SRT), is often obtained with ecologically invalid levels of speech and noise, which may be at least partially explained by the rather artificial speech and noise material that is not actually encountered in reality [65,76]. Hence, providing ecologically valid noise-level scenes may be a necessary stepping stone to direct research efforts to acoustics that do not just stem from a clinical convenience, but are also encountered in reality.
ACTA ACUSTICA UNITED WITH ACUSTICA Vol. 105 (2019)
Apart from assessing speech recognition performance in realistic noise, the application of realistic acoustic environments may also be important when sounds other than speech are the main signal of interest (e.g., [49]). The above-mentioned auditory mechanisms are essential in forming a full image of the environment surrounding the listener (e.g., sources behind the head, outside of the visual field) and help the listener to direct attention to a source of interest (e.g., a talker), or to avoid being hit by an approaching object (e.g., a car). The combined listening experience that occurs in the real world may be explored using the complex acoustic stimuli provided by the ARTE database.
Differences between typical laboratory and real-world conditions also have consequences for the operation of hearing devices (e.g., [16,23,74]). Modern hearing devices are equipped with a host of nonlinear algorithms with the primary aim to increase the effective SNR of speech in noise, through algorithms for noise reduction, microphone directionality, and dynamic range compression [45]. These algorithms are often validated under controlled laboratory conditions, but because of their inherent nonlinearities, they may respond differently to the speech and noise signals experienced in real life and, as a result, deliver uncertain outcomes [65]. Furthermore, realistic acoustic environments can be influential in terms of subjective pleasantness, comfort, or the lack thereof [13,88]. A pleasant-sounding environment can be conducive to conversations and music listening, and can be psychologically comfortable. However, hearing aids can also affect perceived acoustic comfort in a way that is unlikely to be captured in clinical settings [75]. Therefore, the design of future hearing aids should benefit from testing them under more realistic conditions.

Existing Databases
As motivated and described above, a real-world acoustic environment database that is suitable for the given application in hearing research requires at least that (i) sound files are provided in a format that allows spatial reproduction of the recorded scenes via arbitrary loudspeaker arrays and headphones with an adequate accuracy, (ii) the sound pressure level is provided to allow the reproduction of each recording at its original level, (iii) some basic acoustic and other scene descriptive information is provided that allows informed selection and comparison of scenes, and (iv) RIRs are provided that allow speech (or other sound) material to be added to the scenes, as for instance required for implementing a speech-in-noise test.
There are several types of databases that have been presented in the literature that are relevant for hearing research and address some of the above requirements, but, to the best knowledge of the authors, none of them satisfies all these requirements. Some databases focus on typical acoustic conditions of specific places and provide detailed descriptions, average values of various acoustic measures, and sometimes room impulse responses. Other databases provide actual sound recordings in different audio formats, but lack important acoustic and other descriptive details. Below is an overview of the existing relevant databases, their intended usage and their limitations for applications in basic and clinical hearing research.
Noise level surveys: Only a handful of studies have been published that surveyed typical noise levels in various daily scenarios. All of these studies sampled the acoustics using ear-level recordings in some manner. They vary in the choice of scenarios, the detail of the derived acoustical data they provide, and the depth of the accompanying acoustical analysis. [72] published the first comprehensive survey of noise and speech level distributions in schools, homes, hospitals, department stores, trains and airplanes. The A-weighted broadband noise levels and standard deviations were reported, but with no details about other acoustic parameters of the noises and environments. [42,81,76] each observed 18-20 subjects in their daily acoustic environments, who were reportedly satisfied, experienced hearing aid users. [42] investigated the subjects' reaction to their "auditory ecology" and recorded using two omnidirectional microphones, mounted on behind-the-ear dummy hearing aids. The broadband noise levels (in dB) of ten environments are provided along with their standard deviations. The data is also divided into situational categories (conversations, TV, music, etc.), rather than location categories only. Subjects also ranked the relative importance of these daily environments to themselves. [81] conducted a similar survey, but presented a much more detailed statistical and acoustical data set. Analysis of recordings of eleven broad scenario categories included their mean octave-band spectra, group broadband level percentiles, mean and standard deviations, frequency of occurrence and some relevant estimates using subjective data. In [76], a particular emphasis was put on estimating the SNR in various daily situations, divided into nine typical categories. Conveniently, the mean and distribution for the better and worse ears were provided in dB and dBA, as well as the power in octave bands. Similar data was reproduced in [87].
In parallel, surveys in the noise control literature have published levels of a large number of environments and situations. The Noise Navigator™ Database [15] compiled over 1700 levels of different objects and scenarios. Mean or maximum sound pressure levels are provided along with the distance from the source during the measurement. In the more recent Non-Occupational Incidents Situations and Events (NOISE) database the focus was on leisure activities, and related statistics about noise exposure levels, mean and maximum levels, and standard deviation, along with a description of the measurement conditions, are provided [14]. As the primary goal of these studies was to survey daily acoustic environments, they tend to suffer from a lack of specific descriptions of the environments that belong to the particular categories. While the derived acoustic data may be sufficient to roughly model an equivalent steady-state reference environment for the categories in the above-mentioned studies, there are no actual recordings available of archetypical scenarios. Furthermore, there are no mentions of the spatial distribution of the various sources, the room acoustic characteristics of these environments, the exact listener positions, or the temporal dynamics of the acoustic scenes.
Soundscapes: In soundscape studies researchers have looked into detailed acoustic and psychological characteristics of particular environments, such as restaurants [51] and public squares [18]. Unfortunately, integrating the output from these studies into a single comprehensive database of daily scenes may be impossible, because they are not standardized. Moreover, soundscape studies do not always make the recordings available for public use. Nonetheless, a number of publicly available soundscape databases do exist with a focus on the sonic and environmental qualities of the scenes, rather than the acoustical ones. Notably, one such database is the World Soundscape Project Database [80]. It includes comprehensive descriptions of the recorded environments, area photos, exact map locations, and sometimes a timeline description of different sound events throughout the recordings. However, whether single or multichannel, e.g., [7], these recordings are generally uncalibrated, so that they may not be useful for controlled clinical research of the kind that is required in speech communication or hearing device work. Similar environments are sometimes used in the study of environmental sounds (other than speech and music), where the scenes provide the backdrop for the target sound. The Database for Environmental Sound Research and Application (DESRA) was an attempt to provide a systematic source of such sounds [37,4].
Room impulse response databases: Several audio databases were released for public use that are primarily intended for evaluating different acoustic speech enhancement methods, which mainly focus on the reverberant characteristics of realistic acoustic environments as captured by the room impulse response (RIR). These databases can be used to synthesize realistic scenes by superposing prerecorded anechoic speech and noise signals that are convolved with the provided RIRs, but they are not well suited for reproducing the full complexity and dynamics of the environments experienced in the real world. [46] recorded multichannel head-related and room-related impulse responses using in-ear and behind-the-ear microphones on a manikin and humans [2]. They included an anechoic chamber, an office, a courtyard, and a cafeteria, and provided the reverberation times of these environments. However, the database specifically pertained to the quiet room conditions, rather than to the occupied locations. The Multichannel Acoustics Reverberation Database at York (MARDY) provides RIRs of different configurations of reflectors and absorbers in a variable-acoustics room [84]. The RIRs were recorded using both an omnidirectional microphone and an eight-element linear microphone array at three different source-receiver distances, and the database was specifically designed for testing de-reverberation algorithms [8]. Similarly, in the Aachen Impulse Response Database (AIR) [43] binaural room impulse responses (BRIRs) are provided for four different rooms in different source-receiver configurations [1]. Single-source recordings were made with a dummy head, with the explicit aim to be employed in studying de-reverberation algorithms. Specific descriptions of the room dimensions and reverberation times are provided as well.
In a more room-acoustically focused approach, another database provides BRIRs at hundreds of receiver locations within three rooms with a fixed source position, in order to have a dense mapping of the source-receiver acoustics [78]. The recordings were done in both omnidirectional and B-format configurations [12].
In other recent databases, the main aim was to support beam-forming algorithms and the absolute acoustic measures were not reported. The Multichannel Impulse Response Database (MIRD) contains impulse responses (IRs) of microphone arrays in a variable reverberation room [39,9]. Different geometrical configurations of linear arrays of eight microphones were used to obtain the responses of loudspeakers on two semicircles around the array. In another database, the scenario was one specific medium-sized room with four target talkers seated around a table and live babbling talkers (0, 8, 24 or 56) surrounding them [86]. Twenty-four microphones were placed in different locations on and between the talkers and obtained different combinations of talkers and speech-babble, in addition to the head-related impulse responses of the room. Finally, the Open Acoustic Impulse Response (OpenAIR) library offers a platform for sharing the IRs that were recorded in different locations using various methods, including multichannel B-format measurements [64]. The database also includes exact map locations, photos, and room acoustical data derived from the measurements [11].
Machine learning databases: The last class of databases reviewed here serves the design and training of machine learning algorithms that perform scene classification and event identification of audio recordings. For example, the yearly Detection and Classification of Acoustic Scenes and Events challenge (DCASE) [5] utilizes the TUT databases. The 2017 database [56], for instance, contains a closed set of 15 scene classes (e.g., home, park, train, grocery store). Within the challenge, new algorithms were set to compete with a baseline level of 61% successful classification (averaged over all scene classes) of 10 s long segments, which were edited from 3-5 minute binaural recordings of the previous year [58].
Similar databases were released in the past (see [57] for a complete list and a review). Two multichannel databases in this category are relevant here. The first is the Multi-channel Acoustic Noise Database (DEMAND), which contains acoustically calibrated 16-channel recordings of everyday scenes of six broad classes: domestic, office, transportation, public, nature, and street, each containing three scenes [79,3]. The second, EigenScape, is a database of everyday scenes that specifically aims to serve applications of classification, was inspired by soundscape research requirements, and provides full 3D fourth-order ambisonic recordings in eight classes (eight scenes each): beach, busy street, park, pedestrian zone, quiet street, shopping center, train station, and woodland [35,6]. Unfortunately, no calibration values or detailed information about the ambisonic processing are provided for EigenScape. Notably, this class of databases contains large amounts of recordings with rich annotation that can be used to robustly train the classification algorithms. However, despite the laborious surveying and annotation done to generate them, they are mostly unsuitable for auditory research because they are uncalibrated in level and other acoustical data are missing.

Goals
From the above discussion, it is evident that no audio database or acoustic survey exists that allows the faithful reproduction of the acoustic environments experienced in the real world, as required for conducting hearing research with highly ecologically valid outcomes. To at least partially address this limitation, the multichannel acoustic scene database, ARTE, is provided here, which was designed with the following goals:
1. Provide to the research community accessible multichannel recordings of a range of realistic acoustic scenes that can be: a) used in a large variety of auditory perception tests with improved ecological validity; and b) played back in loudspeaker arrays of different geometries as well as under headphones.
2. Enable standardization and replication of auditory perception tests that utilize realistic noisy environments.
3. Complement the multichannel recordings with measured multichannel RIRs as well as basic derived acoustic data.
Goals 1b and 2 are addressed by selecting the HOA format as reproduction method (see Sections 1.1 and 2.4), and Goal 3 is addressed by conducting RIR measurements in all environments (see Section 2.2) and deriving the required acoustical data from the measured RIRs and HOA recordings (Section 3). Goal 1a is (partially) addressed by the perceptual evaluation presented in Section 4, but the extent to which it is met will ultimately be revealed when the ARTE database is utilized in future hearing research.

Methods: Database Generation
The process of selecting, recording, processing, analyzing, and preparing the acoustic scenes for file sharing is described below.

Scene Selection
The recorded environments were selected with the intention to cover a broad range of typical everyday situations that take place in diverse acoustic conditions. Recordings were obtained at a variety of private and public spaces in Greater Sydney, such as cafés, a train station, a library, an office, food courts, a living room, and a dinner party. These locations were considered common enough by the authors and mostly appeared in previous studies that surveyed hearing aid usage [76,42], but they also had to offer the physical and technical conditions for the recording equipment to be set up. As such, the present set of scenes were all situated in urban settings, of Western, English-speaking environments. The recordings captured several hours' worth of daily activity in these locations, including incidental conversations, footsteps, machinery noise, vehicles, amplified sounds, animal sounds, and others. For the current database, two-minute excerpts were selected out of these recordings due to storage space limitations. In each environment, the microphone array (see Section 2.2 below) was centrally positioned in the recorded space, and the microphone look-out direction (0°) was directed to the direction of interest in the scene. For specific details about individual scenes, see Section 3.1.

Sound Reproduction Format
As HOA technology has matured over the last years, suitable systems for its loudspeaker reproduction are being set up in more and more laboratories around the world. The application of HOA is widely established in auditory displays, virtual reality, and sound engineering, and is increasingly used in hearing research. Using HOA, the sound field is either synthesized in software or recorded using microphone arrays. While the specific system (microphone and loudspeaker arrays suitable for HOA) required for hearing research is not standardized at present, the HOA technology itself offers a high degree of compatibility, which may alleviate the need for hardware standardization. This is because reproduction on any HOA system is based on the same acoustical principles that enable the (re-)synthesis of the intended sound field, within a known error margin (see Section 2.4). Therefore, the physical sound fields themselves serve as the standard reference for these systems. However, even though the physical sound reproduction error is known or can be measured for any HOA system (e.g., [70]), the accuracy required for hearing research is not clear and may depend on the details of the applied acoustic scenario, as well as on the actual auditory mechanism or hearing ability under assessment.
Having loudspeaker systems that are capable of reproducing three-dimensional sound fields also makes them attractive for hearing device research (e.g., [16,26,36,71,70]), where wearing headphones is mechanically, acoustically and ecologically too far from realistic aided listening conditions to attest for aided spatial perception of sound. Additionally, methods that aim to perceptually restore the sound sources, but not the physical sound field, may fail to realistically simulate the function of hearing device beamformers, which work differently from the human ear. Because of this applicability for hearing device research, as well as the relative independence from the specific technical setups used (recording and reproduction are separate), the HOA 3D sound field reproduction method was applied in ARTE.

Recording and collection of material
2.3.1. General scene recording practices
For each recording done in a public location the authors first obtained the necessary clearances from the relevant property managers. The recordings were always attended by at least one person to ensure that no passers-by came too close to the microphone array, or to the rest of the equipment, and a number of clearly visible signs were placed to inform about the recordings. Curious passers-by were told that their conversations may be picked up by the system and played back in research settings. A minimum distance of about 1.3 m between any acoustical source and the microphone array was maintained to avoid spatial distortion in reproduction due to near-field effects.
Several hours' worth of recordings were typically obtained in all locations, with the recording time window chosen so that it enabled capturing different levels of activity of the location. Also, it had to include the possibility to perform room impulse response (RIR) measurements in conditions that were as quiet as possible, which usually meant obtaining the permission to stay in the premises before or after business hours. Recordings in outdoor locations were only conducted under fair weather conditions with no wind or rain to avoid damage to the equipment. The only outdoor recordings that were carried out were nevertheless corrupted by wind noise and had to be excluded from the database.

Multichannel recording equipment
The acoustic scenes contained in the ARTE database were recorded with a microphone array that was designed and built at the National Acoustic Laboratories. The array consists of 62 miniature microphones (Knowles FC-23629-C36) that are flush-mounted inside rubber seals on a hard plastic sphere with a radius of 0.05 m. The microphones are arranged symmetrically in rings, with 24 microphones in the horizontal plane with an angular separation of 15°, ten microphones at both +25° and −25° elevation with an angular separation of 36°, six microphones at both +50° and −50° elevation with an angular separation of 60°, and three microphones at both +75° and −75° elevation with an angular separation of 120°. Inside the plastic sphere, each microphone is connected to a separate preamplifier, which applies a gain of 14 dB and provides balanced outputs. The 62 output channels are then connected to two 32-channel RME M-32 analog-to-digital converters that are linked to a silent measurement computer via an RME MADI HDSPe sound card. The microphone array was calibrated such that the original sound pressure level of the scene was maintained during reproduction. Recordings were done with Audition 3 (Adobe Systems Inc.), at a sampling rate of 44.1 kHz and a depth of 32 bits, and all subsequent editing and post-processing was done using MATLAB (The Mathworks, Inc.). This recording fidelity is considered sufficient for the HOA-encoded recordings, which exhibit inherent reproduction errors above 7.4 kHz due to spatial aliasing of the 62-element microphone array at high frequencies [71].
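As a quick consistency check of the ring layout described above, the microphone positions can be generated from the stated elevations, counts, and sphere radius. The following Python sketch is illustrative only: the starting azimuth of each ring (`azimuth_offset_deg`) is a hypothetical parameter, since the text does not state how the rings are rotated relative to one another.

```python
import math

# Ring layout as described in the text: (elevation in degrees, microphones per ring).
RINGS = [(0, 24), (25, 10), (-25, 10), (50, 6), (-50, 6), (75, 3), (-75, 3)]
RADIUS_M = 0.05  # sphere radius in metres, from the text


def microphone_positions(rings=RINGS, radius=RADIUS_M, azimuth_offset_deg=0.0):
    """Return a list of (x, y, z) positions in metres, one per microphone.

    Assumption: every ring starts at `azimuth_offset_deg`; the actual
    per-ring offsets of the array are not given in the text.
    """
    positions = []
    for elevation_deg, count in rings:
        step_deg = 360.0 / count  # equal angular separation within a ring
        el = math.radians(elevation_deg)
        for i in range(count):
            az = math.radians(azimuth_offset_deg + i * step_deg)
            positions.append((
                radius * math.cos(el) * math.cos(az),
                radius * math.cos(el) * math.sin(az),
                radius * math.sin(el),
            ))
    return positions


positions = microphone_positions()
# Angular separation per ring, keyed by elevation in degrees.
separations = {elevation: 360.0 / count for elevation, count in RINGS}
```

The ring counts sum to 24 + 2 × (10 + 6 + 3) = 62 microphones, and the per-ring separations reproduce the 15°, 36°, 60°, and 120° values quoted in the text.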

RIR measurements
Multichannel RIRs were measured in all the environments to allow the integration of additional sound sources into the recorded scenes during post-production. This is necessary, for instance, when the recorded scenes are used as background noise in a speech test. In such a case, anechoic target speech material can be convolved with the RIRs and added to the recorded scenes at a given speech level or SNR. The resulting target speech then includes the reverberation that is characteristic for the given environment, which is important for the perceptual integration of the target speech and the background noise. A MATLAB routine generated a 20 s long logarithmic sweep signal [63], which was played through a DSE A2760 amplifier that drove a Tannoy V8 loudspeaker. The multichannel RIR was automatically calculated from the recorded sweep response. The resulting RIR was subsequently displayed, to allow manual verification that the dynamic range was at least 50 dB and that the signal decay was clean. Otherwise, where possible, the RIR measurements were repeated until the response appeared clean enough. The absolute sound pressure level of the sweep was around 90 dB SPL at a distance of 1.3 m, but depended on the given location. In most locations the RIRs were measured in quiet without any people around, but in some locations this was not possible. There, the sweep had to be reduced to more comfortable levels, and an increased ambient noise level from people and traffic was inevitable. In loud measurements, bystanders were advised to wear earplugs, which were provided by the recording team, who also wore them. In such cases the post-processing of the RIRs described in Section 2.4.2 was particularly important to restore a sufficient SNR.
The RIR was measured in each location with the loudspeaker positioned in front of the microphone array at 1.3 m on the horizontal plane. Thereby, the microphone array always pointed to 0° and the loudspeaker to the microphone array. The source-receiver distance of 1.3 m was chosen to minimize any potential near-field effects distorting the perceived spatial image of the reproduced sound source and, at the same time, to realize a short-enough distance that is representative of natural conversations (e.g., see [83]).
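The speech-test use case described in this section (convolving anechoic speech with a measured RIR and adding it to a recorded scene at a given SNR) can be sketched in a few lines of Python. This is a minimal single-channel illustration under stated assumptions, not the processing used by the authors: the function names are hypothetical, the toy signals stand in for real speech, RIR, and scene data, and the SNR is defined here simply as the broadband RMS ratio over the whole excerpt. For the multichannel material, the same steps would be applied per channel.

```python
import math


def convolve(signal, rir):
    """Direct-form linear convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def rms(x):
    """Root-mean-square value of a list of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(speech, rir, scene, snr_db):
    """Reverberate `speech` with `rir` and add it to `scene` at `snr_db`.

    Only the speech is scaled; the scene is left untouched so that its
    calibrated level is preserved.
    """
    reverberant = convolve(speech, rir)[:len(scene)]
    target_rms = rms(scene) * 10.0 ** (snr_db / 20.0)
    gain = target_rms / rms(reverberant)
    scaled = [gain * v for v in reverberant]
    scaled += [0.0] * (len(scene) - len(scaled))  # pad if speech is shorter
    mixture = [n + s for n, s in zip(scene, scaled)]
    return mixture, scaled


# Toy data standing in for anechoic speech, a measured RIR, and a scene excerpt.
speech = [math.sin(0.3 * n) for n in range(200)]
rir = [1.0, 0.0, 0.5, 0.0, 0.25]
scene = [math.cos(1.7 * n) for n in range(204)]

mixture, scaled = mix_at_snr(speech, rir, scene, snr_db=-5.0)
```

Scaling the speech rather than the scene is a deliberate choice here: it keeps the background at the level at which it was recorded, so the mixture remains tied to the calibrated scene level.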

HOA Realization
The process of encoding the microphone array signals into the multichannel HOA format and subsequently decoding it into loudspeaker signals is illustrated in Figure 2, and the corresponding mathematical details are given in Appendix A1. The $Q = 62$ signal channels recorded with the microphone array, $s_q(t)$ with $q = 1, \ldots, Q$, are encoded into $K = 31$ HOA signals, $b_k(t)$ with $k = 1, \ldots, K$, by applying a matrix of $K \times Q$ encoding filters $h_{E,kq}$. The encoding filters were derived in the frequency domain by applying the shape-matching method [68], which also calibrates the microphone array. In this method, the microphone array is placed in the center of a 3D loudspeaker array, and the impulse responses (IRs) are measured from each loudspeaker to each microphone. The encoding filters are then derived from the measured IRs following the calculations described in Appendix A1. The finite impulse response (FIR) encoding filters had a length of 2000 samples at a sampling frequency of 44.1 kHz. Limited by the employed loudspeaker array, different HOA orders, $M_{3D} = 4$ and $M_{2D} = 7$, are provided for periphonic (3D) and horizontal-plane only (2D) playback. All spherical harmonic functions up to the degree $m = M_{3D}$ are provided (25 channels in total) as well as all sectorial harmonic functions (i.e., $m = n$, see Equation (A6) in Appendix A1) with degree $M_{3D} < m \le M_{2D}$ (i.e., 6 additional channels). This results in $K = 31$ HOA channels in total.
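The channel count and the filter-matrix encoding step can be illustrated with a short Python sketch. It assumes the usual counting of ambisonic components (a full 3D set of order M has (M + 1)^2 channels, and each additional degree contributes two sectorial components), which reproduces the 25 + 6 = 31 channels stated above. The function names and the toy filters are hypothetical; the real encoding filters are the 2000-tap FIR filters obtained with the shape-matching method.

```python
def hoa_channel_count(order_3d, order_2d):
    """Number of mixed-order HOA channels.

    Full 3D set up to `order_3d`, plus the two sectorial harmonics of
    each degree above `order_3d` and up to `order_2d`.
    """
    full_3d = (order_3d + 1) ** 2
    extra_sectorial = 2 * max(0, order_2d - order_3d)
    return full_3d + extra_sectorial


def encode(mic_signals, enc_filters):
    """Apply a K x Q matrix of FIR filters to Q microphone signals.

    mic_signals[q] is the sample list of microphone q, and
    enc_filters[k][q] is the FIR tap list from microphone q to HOA
    channel k. Each output is the sum over q of the filtered signals.
    """
    n_out = len(mic_signals[0]) + len(enc_filters[0][0]) - 1
    hoa = []
    for filters_k in enc_filters:
        b_k = [0.0] * n_out
        for s_q, h_kq in zip(mic_signals, filters_k):
            for i, s in enumerate(s_q):
                for j, h in enumerate(h_kq):
                    b_k[i + j] += s * h
        hoa.append(b_k)
    return hoa


# Channel count for the orders used in the text.
n_channels = hoa_channel_count(order_3d=4, order_2d=7)

# Toy example with Q = 2 microphones and K = 1 output channel (hypothetical filters).
toy_hoa = encode(
    mic_signals=[[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]],
    enc_filters=[[[1.0, 0.0], [0.0, 1.0]]],
)
```

In the toy example the first microphone passes through unchanged and the second is delayed by one sample before summation, which makes the filter-and-sum structure easy to verify by hand.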
Since the shape-matching process inherently involves the frequency response of the loudspeakers of the playback array, they had to be pre-equalized. To equalize the loudspeakers, their individual IRs were measured with an omnidirectional 1/4-inch G.R.A.S. Type 46BL low-noise microphone located in the center of the loudspeaker array, which was powered by a G.R.A.S. CC Supply Type 12AL and recorded using an RME M-16 analog-to-digital interface. The equalization filters were realized as mixed-phase FIR filters with a length of 2048 samples, which equalized the anechoic loudspeaker responses between 40 and 10,000 Hz. Applying pre-equalized loudspeaker signals ensured that the final HOA signals provided in the ARTE database are independent of the playback loudspeakers.
As illustrated in Figure 2, the resulting HOA signals are weighted using a matrix of K × G decoding gains g_D, and summed into G loudspeaker signals l_g(t) with g = 1, ..., G. The mathematical details of the decoding process are given in Appendix A1. Since this decoding process depends on the specific layout of the playback loudspeaker array, decoding gains cannot be provided here. Instead, a MATLAB (version R2018a) function is provided in the database that takes the loudspeaker locations as input and calculates the gains g_D. In the present study, the example array of G = 41 loudspeakers shown in Figure 1 was employed. Given the non-regular layout with an increased number of loudspeakers in the horizontal plane, the mixed-order ambisonics method was used (e.g., [30,53]), also with M_3D = 4 and M_2D = 7, utilizing all K = 31 HOA channels. In this case, using the same loudspeaker array as in the shape-matching process ensured that the entire sound reproduction system was calibrated such that any recorded scene would be automatically reproduced at its original sound pressure level. For calibrating any arbitrary loudspeaker array for playback, a diffuse speech-shaped noise is provided (see Table I). The sensitivity of the playback system should be adjusted such that the diffuse noise produces a level of 70 dB SPL in the center of the array. The resulting calibration gain can then be applied to any acoustic scene of the ARTE database to reproduce its original sound pressure level.
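The calibration rule described above amounts to a simple level offset. As a minimal sketch, assuming the level of the decoded diffuse noise has been measured with a sound level meter in the center of the array, the gain follows from the difference to the 70 dB SPL reference. The function name and the example measurement are hypothetical.

```python
def calibration_gain(measured_db_spl, reference_db_spl=70.0):
    """Return (gain in dB, linear amplitude gain) that maps the measured
    level of the diffuse calibration noise onto the reference level."""
    gain_db = reference_db_spl - measured_db_spl
    gain_linear = 10 ** (gain_db / 20)
    return gain_db, gain_linear

# Hypothetical measurement: the uncalibrated system plays the noise at 64 dB SPL,
# so every scene has to be raised by 6 dB.
gain_db, gain_linear = calibration_gain(64.0)
```

The same linear gain would then be applied to the loudspeaker signals of every decoded scene.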
Finally, it should be noted that, depending on the layout of the playback loudspeaker array, not all provided HOA channels may be usable [82, Eqs. 22 and 30]. As a rough guide for regular loudspeaker arrays, N ≥ (M_3D + 1)^2 or N ≥ 2·M_2D + 1 loudspeakers are required for periphonic (3D) or horizontal-plane only (2D) playback, respectively. The number of loudspeakers N has to be increased for non-regular loudspeaker arrays. For "massive loudspeaker arrays" the number of applied loudspeakers should be limited to avoid sound coloration [77]. For "reasonable" 2D sound reproduction, only the sectorial HOA channels (i.e., those for which m = n in Equation (A6), Appendix A1) may be used in the decoding process. However, such (or any other) mapping of a 3D scene into 2D will always introduce spatial distortions, amongst other artifacts.
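The rough guide above can be inverted to estimate the highest usable order for a given number of loudspeakers. The sketch below applies only the two inequalities quoted in the text and therefore assumes a regular array; the function name is hypothetical, and non-regular layouts will support lower orders than it returns.

```python
import math

def max_usable_order(n_loudspeakers, periphonic=True):
    """Largest HOA order M satisfying the rough guide for regular arrays:
    N >= (M + 1)**2 for 3D playback, N >= 2*M + 1 for 2D playback."""
    if n_loudspeakers < 1:
        raise ValueError("at least one loudspeaker is required")
    if periphonic:
        return math.isqrt(n_loudspeakers) - 1
    return (n_loudspeakers - 1) // 2

order_3d = max_usable_order(25, periphonic=True)    # 25 loudspeakers on a sphere
order_2d = max_usable_order(16, periphonic=False)   # 16 loudspeakers on a ring
```

With these inputs the function returns orders 4 and 7, which match the orders M_3D and M_2D provided in the database.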

Processing of RIRs
In addition to the recordings of the acoustic scenes, multichannel RIRs are provided in the ARTE database for all the corresponding environments, and are saved as 31-channel HOA signals. All the RIRs were truncated just before the response "disappeared" in the noise floor of the measurement. Since not all recording locations could be accessed during quiet times, some RIRs were contaminated with substantial ambient noise. For some environments, this resulted in rather short RIRs after truncation. Since the RIRs were measured for rather close source-receiver distances, this is not necessarily a problem in applications where substantial background noise (i.e., the acoustic scenes) is presented to the listener and thereby masks the late reverberation of the target speech. However, this may be a problem if the unprocessed RIRs are used to realize a reverberant sound source in quiet.
In some applications it may be useful to enhance the "directionality" of the RIRs (e.g., [70]). For such cases, a second version of all RIRs is provided in the ARTE database, in which the direct sound component was separated from the rest of the RIR and can be expressed as a single-channel RIR. The direct sound component is then given on a separate (non-HOA) channel, which can be presented from a single loudspeaker located (roughly) in the original direction (and distance) of the direct sound, while the rest of the RIR can be presented via the loudspeaker array. The direct sound component was separated here by applying a one-sided Hanning time window to the 31 HOA signals with a frequency-dependent duration of D = 2/f, which was limited to the interval 0.002 s ≤ D ≤ 0.01 s. The inverse window (i.e., flipped with 50% overlap) was applied to the reverberant part of the RIR such that an addition of both parts would sum up to the original RIR. The direct sound component was then given by the omnidirectional (zeroth-order) HOA channel. Even though this RIR enhancement can be useful in some applications, it should be noted that it may affect the realism or ecological validity of the reproduced sound field.
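The windowing rule above can be illustrated for a single frequency band. This sketch rests on several assumptions that the text does not spell out: the RIR is taken to start at the arrival of the direct sound, the one-sided window is modelled as the falling half of a raised cosine, the complementary window is one minus that window, and the sampling rate is 44.1 kHz. The band splitting that makes the duration frequency dependent is omitted, and all names are hypothetical.

```python
import math

def window_duration(freq_hz):
    """Window duration D = 2/f in seconds, limited to 0.002 s <= D <= 0.01 s."""
    return min(max(2.0 / freq_hz, 0.002), 0.01)

def split_direct_reverberant(rir, freq_hz, fs=44100):
    """Split one band of an RIR into direct and reverberant parts that sum
    to the original (assumed window shape: falling half of a raised cosine)."""
    n_win = round(window_duration(freq_hz) * fs)
    direct, reverberant = [], []
    for n, value in enumerate(rir):
        # Weight is 1 at the direct sound and fades to 0 after D seconds.
        w = 0.5 * (1 + math.cos(math.pi * n / n_win)) if n < n_win else 0.0
        direct.append(w * value)
        reverberant.append((1 - w) * value)   # complementary window
    return direct, reverberant

# Toy response standing in for one band of a measured RIR.
toy_rir = [math.sin(0.1 * n) * math.exp(-0.01 * n) for n in range(600)]
direct, reverberant = split_direct_reverberant(toy_rir, freq_hz=500.0)
```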

Conversion for Binaural Playback
The ARTE database also contains binaural versions of all the recorded acoustic scenes. The binaural signals are mainly provided so that an interested user can get a first impression of the different scenes by listening to them via headphones. These were generated by measuring the head-related transfer functions (HRTFs) for all 41 loudspeakers of the playback array, shown in Figure 1, to the two microphones inside the ears of a Brüel & Kjaer type 4128C Head and Torso Simulator (HATS). The playback array consists of 41 Tannoy V8 concentric loudspeakers installed within an anechoic chamber. Within this array, the loudspeakers are arranged symmetrically in rings on a sphere with a radius of 1.85 m. Sixteen loudspeakers are mounted on the horizontal plane (0° elevation). An additional sixteen loudspeakers are mounted at ±30° elevation (eight each) and eight at ±60° (four each). One loudspeaker is hung directly above the listener's head. The decoded loudspeaker signals for each acoustic scene were then convolved with the corresponding HRTFs and summed separately for the left and right ears to form the binaural signals. Additional diffuse-field binaural versions of the scenes were obtained by removing the ear canal response from the HATS binaural recordings, through equalization of the ear-drum responses to those of an omnidirectional microphone in a diffuse noise field.
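The binaural rendering described above (convolving each decoded loudspeaker signal with its left-ear and right-ear impulse responses and summing per ear) can be sketched as follows. This is a minimal time-domain illustration with hypothetical names; it assumes that all loudspeaker signals share one length and all head-related impulse responses share another.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def binaural_downmix(speaker_signals, hrirs):
    """Sum loudspeaker signals filtered by per-loudspeaker HRIR pairs.

    speaker_signals[g] is the signal of loudspeaker g;
    hrirs[g] is a (left_ir, right_ir) pair for the same loudspeaker.
    """
    n_out = len(speaker_signals[0]) + len(hrirs[0][0]) - 1
    left = [0.0] * n_out
    right = [0.0] * n_out
    for signal, (h_left, h_right) in zip(speaker_signals, hrirs):
        for n, v in enumerate(convolve(signal, h_left)):
            left[n] += v
        for n, v in enumerate(convolve(signal, h_right)):
            right[n] += v
    return left, right

# Two-loudspeaker toy example with one-tap "HRIRs" (hypothetical values).
signals = [[1.0, 0.0], [0.0, 1.0]]
hrir_pairs = [([1.0], [0.5]), ([0.5], [1.0])]
left, right = binaural_downmix(signals, hrir_pairs)
```

For recordings that are minutes long, an FFT-based convolution would be the practical choice; the direct form is used here only to keep the sketch dependency-free.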

Database Delivery and File Format
The ARTE database requires about 10 GB of storage and is available online at https://doi.org/10.5281/zenodo.2261632, where all the necessary information about the implementation, the calibration and playback, and the acoustics of the scenes is available. The documentation of the supplied MATLAB (version R2018a) functions is common to all scenes, and specific examples are given for processes such as applying the HOA decoding to the recordings. Each scene in the ARTE database has its own directory that contains associated data that can be downloaded separately: the 31-channel HOA WAV version of the recording, the binaural WAV version, and a PDF file containing the acoustic parameters of the scene (see Section 3.2 below). Note that the original raw 62-channel recordings are excluded from the database.

Scene Overview
A description of the current set of environments is provided in Table I, along with their mean sound pressure levels in dBA and dB SPL. With the exception of the Train Station scene, which opened with a very dominant announcement, and the Street / Balcony scene, which had noticeable traffic fluctuations, the excerpts sounded rather consistent over their entire duration, while still revealing the acoustic diversity of the particular scene. The consistent behavior is particularly important for applications in speech-in-noise tests, where major level fluctuations would significantly reduce the test-retest reliability. The scenes were also screened for inappropriate language, as well as for any words that could identify a recorded person or reveal critical (or potentially confidential) information. All excerpts were carefully scrutinized by the authors to ensure that no recording and HOA processing artifacts were audible. A fade-in and a fade-out (0.5 s long) were applied to the start and end of each recording, respectively, to provide smooth reproduction.
Several environments require special mention. Two church scenes are included, both excerpted from the same recording session, which was made under identical conditions but at different times around the service. The two recordings were made inside a rather small church with uncharacteristically low reverberation, and mainly differed in their level of conversation noise. The Living Room scene contains a sequence of television advertisements that were not recorded in situ. Instead, the IRs of the television set loudspeakers (stereo) were recorded with the HOA microphone array. Random Australian television advertisements were then recorded offline, convolved with the IRs, and mixed with the ambient noise of the living room to obtain the final scene. Adding the television sound during post-processing provided some freedom in selecting a generic program and ensured that the presentation level was reasonable, independent of the ambient noise. Finally, a speech-weighted diffuse noise scene (Scene 6 in Table I) is included with an arbitrarily chosen sound pressure level of 70 dB SPL. This scene was artificially generated and recorded inside a 3D loudspeaker array (see Section 2.4.2) and is mainly provided for calibrating the applied loudspeaker playback system.
Additional excerpts for the existing as well as for new environments will be appended to the database in the future.
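The 0.5 s fades mentioned in the scene overview can be reproduced on other excerpts with a few lines of code. The shape of the fades is not stated in the text; the sketch below assumes raised-cosine ramps, and the function name is hypothetical.

```python
import math

def apply_fades(signal, fs, fade_s=0.5):
    """Apply a fade-in and a fade-out of fade_s seconds to one channel
    (assumed raised-cosine ramps; the signal should exceed two fade lengths)."""
    n_fade = int(round(fade_s * fs))
    out = list(signal)
    for n in range(min(n_fade, len(out))):
        w = 0.5 * (1 - math.cos(math.pi * n / n_fade))  # rises from 0 towards 1
        out[n] *= w          # fade-in at the start
        out[-1 - n] *= w     # mirrored fade-out at the end
    return out

# Toy example: 2 s of a constant signal at a sampling rate of 8 Hz.
faded = apply_fades([1.0] * 16, fs=8)
```

For multichannel HOA files the same ramp would be applied to every channel so that the spatial image is not altered during the fade.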

Derived Scene Acoustic Data
With reference to the third goal of this study (Section 1.4), the results of an acoustic analysis are provided in the ARTE database to allow the potential user to make an informed decision on which scenes to select for their given application, to allow comparisons with acoustic scenes provided by other existing databases (see Section 1.2), and to help interpret results when applied in subjective experiments. The acoustic analysis was performed by simulating the reproduction of the recorded scenes through the 41-channel loudspeaker array shown in Figure 1. The simulations were realized by first measuring the IRs from each of the 41 loudspeakers to a G.R.A.S. Type 46BL omnidirectional 1/4-inch microphone located at the center of the loudspeaker array. Then, the decoded HOA loudspeaker signals were convolved with those IRs and summed. The advantage of using this simulation was that it enabled offline estimation of the acoustic parameters and thereby avoided remeasuring each time a new excerpt of an acoustic scene was selected or processed. Details of the acoustic analysis are given below, along with exemplary results for two representative scenes. The ARTE database contains results for all acoustic scenes and will be extended to any new scene that will be added in the future.

Sound pressure level
From the simulated omnidirectional recordings, the unweighted and A-weighted sound pressure levels were calculated for all the different scenes contained in the ARTE database. The results are summarized in Table I. The scenes provide a broad range of levels of about 30 dB in 1-4 dB steps.

Reverberation time
The reverberation times (T30) associated with all the acoustic scenes that are contained in the ARTE database were derived from the simulated omnidirectional RIRs following the process described in ISO 3382-2 [41]. The results are summarized in Table I. In the artificial Diffuse Noise and the Street / Balcony scenes the concept of reverberation time does not apply and no values are provided.
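For readers unfamiliar with T30, the sketch below shows the common broadband estimate: the squared RIR is integrated backwards (Schroeder integration), a line is fitted to the decay curve between -5 and -35 dB, and the slope is extrapolated to a 60 dB decay. It is a simplified illustration with hypothetical names, not the authors' implementation; the standard additionally prescribes band filtering and noise handling, which are omitted here.

```python
import math

def t30_from_rir(rir, fs):
    """Estimate T30 in seconds from a single-channel RIR (broadband sketch)."""
    # Schroeder backward integration of the squared response.
    edc = [0.0] * len(rir)
    acc = 0.0
    for n in range(len(rir) - 1, -1, -1):
        acc += rir[n] * rir[n]
        edc[n] = acc
    edc_db = [10 * math.log10(e / edc[0]) for e in edc if e > 0]
    # Keep the part of the decay curve between -5 dB and -35 dB.
    points = [(n / fs, d) for n, d in enumerate(edc_db) if -35.0 <= d <= -5.0]
    t_mean = sum(t for t, _ in points) / len(points)
    d_mean = sum(d for _, d in points) / len(points)
    # Least-squares slope of the decay in dB per second.
    slope = (sum((t - t_mean) * (d - d_mean) for t, d in points)
             / sum((t - t_mean) ** 2 for t, _ in points))
    return -60.0 / slope

# Synthetic exponential decay whose level drops by 60 dB in 0.5 s.
fs = 1000
true_rt = 0.5
synthetic_rir = [10 ** (-3.0 * n / (fs * true_rt)) for n in range(fs)]
estimated_rt = t30_from_rir(synthetic_rir, fs)
```

On measured RIRs the result depends strongly on where the response is truncated relative to the noise floor, which is why the truncation described in the processing of the RIRs matters.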

Spectrum
To characterize the spectral behavior of the different acoustic scenes, the power spectra in third-octave bands were calculated from the simulated omnidirectional recordings. Two examples are shown in Figure 3: the Living Room (scene 4) and the Street / Balcony (scene 10). In addition to a long-term frequency analysis, shown by the solid lines and circles, a short-term frequency analysis was performed using 20-ms long von Hann windows with 50% overlap. The resulting median values are shown by the solid gray lines, and the 25th and 75th, as well as the 5th and 95th percentiles are shown by the dark gray and light gray areas, respectively. The two example spectra differ not only in overall power: the Street / Balcony scene (right panel) also has relatively more low-frequency power than the Living Room scene (left panel). The percentile plots indicate that both recordings contain substantial level fluctuations of more than 20 dB for the Street / Balcony scene and more than 30 dB for the Living Room scene.
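The short-term analysis above can be illustrated with a broadband simplification: the signal is cut into 20-ms Hann-windowed frames with 50% overlap, a level is computed per frame, and percentiles are taken over the frame levels. The third-octave band filtering used in the paper is omitted, the percentile uses linear interpolation as an assumed convention, and all names are hypothetical.

```python
import math

def frame_levels_db(signal, fs, frame_s=0.02):
    """Levels (dB re full scale) of Hann-windowed frames with 50% overlap."""
    n = int(round(frame_s * fs))
    hop = n // 2
    window = [0.5 * (1 - math.cos(2 * math.pi * k / n)) for k in range(n)]
    window_power = sum(w * w for w in window)
    levels = []
    for start in range(0, len(signal) - n + 1, hop):
        power = sum((w * signal[start + k]) ** 2
                    for k, w in enumerate(window)) / window_power
        levels.append(10 * math.log10(max(power, 1e-20)))  # floor avoids log(0)
    return levels

def percentile(values, pct):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Constant toy signal: every frame should sit at 0 dB re full scale.
levels = frame_levels_db([1.0] * 100, fs=1000)
spread = percentile(levels, 95) - percentile(levels, 5)
```

Applied to a real scene, the difference between the 95th and 5th percentiles gives the kind of level fluctuation range reported for the two example scenes.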

Temporal envelope and modulation spectrum
To illustrate the temporal behavior of the different acoustic scenes, the temporally smoothed envelope of the simulated omnidirectional recordings was calculated by applying an A-weighting filter to the waveforms, followed by squaring, low-pass filtering at 16 Hz with a 4th-order Butterworth infinite impulse response (IIR) filter, and then taking the square root. The resulting envelope is shown in Figure 4 for the same two example scenes as used above. The modulation spectrum was derived by applying a simplified, frequency-independent version of the processing described by [44]. The waveforms were band-pass filtered by an A-weighting filter, squared, analyzed by a modulation filter bank with one-octave wide band-pass filters, and normalized by the total power of the A-weighted waveform. For plotting purposes, the spacing of the modulation filters was 0.1 Hz and the spectrum was normalized to its maximum value within the modulation frequency range shown in Figure 4. It can be seen that the temporal envelope of the Living Room scene (top-left panel) contains more temporal fluctuations than the Street / Balcony scene (top-right panel), which contains more continuous noise. This is further evident when considering the corresponding amplitude modulation spectra shown in the bottom panels of Figure 4. The Living Room scene contains strong temporal modulations with a peak frequency of around 2.8 Hz, which mainly stem from the voice presented from the television. The Street / Balcony scene exhibits very strong low-frequency modulations and shows a minimum at around 5 Hz.

Directional characteristics
To illustrate the directional characteristics of the different acoustic scenes, the A-weighted (RMS) sound pressure levels of the signals presented by the 16 horizontal loudspeakers of the array shown in Figure 1 were calculated, and are shown in Figure 5 for the two scenes as a function of loudspeaker direction. These directivity plots were normalized for each scene separately to the maximum occurring sound pressure level. In the Living Room scene, the main energy arrives from the front-left (−45°) and front-right (+45°) directions, which correspond to the locations of the loudspeakers of the television. Additionally, there is increased energy coming from the back, where the kitchen noise came from. The Street / Balcony scene is less directional, and its directivity plot indicates that the microphone array was not directed straight at the road but slightly turned to the left. The energy from behind-left corresponds to a major reflection from a back wall.
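The normalization used for the directivity plots can be sketched as follows: an RMS level is computed for each horizontal loudspeaker signal and expressed relative to the loudest loudspeaker of the same scene. The A-weighting applied in the paper is left out for brevity, and the function name and toy signals are hypothetical.

```python
import math

def normalized_levels_db(speaker_signals):
    """RMS level of each loudspeaker signal in dB relative to the loudest one."""
    levels = []
    for signal in speaker_signals:
        rms = math.sqrt(sum(v * v for v in signal) / len(signal))
        levels.append(20 * math.log10(rms))
    peak = max(levels)
    return [level - peak for level in levels]

# Three toy loudspeaker feeds with amplitudes 1, 0.5 and 0.1.
toy_feeds = [
    [1.0, -1.0, 1.0, -1.0],
    [0.5, -0.5, 0.5, -0.5],
    [0.1, -0.1, 0.1, -0.1],
]
relative_db = normalized_levels_db(toy_feeds)
```

Plotting the returned values against the loudspeaker azimuths would give a polar pattern in which the loudest direction sits at 0 dB.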

Perceptual Evaluation
The overall goal of the ARTE database is to share stimuli that are significantly more realistic than the stimuli currently used in hearing research (Section 1.4). The accuracy of the applied HOA sound reproduction method has already been discussed by [69] and [71] using different acoustic measures (see also Section 2.4.1), and in [70] the effect on speech intelligibility performance has been evaluated in hearing-impaired listeners with directional hearing aids. Here, a perceptual evaluation of the acoustic scenes described in Table I was performed, with the primary goal of understanding how well the reproduced scenes represent the nominal (original) scenes. For example, does the recorded café scene actually come across as a café? Making this association may not be trivial for the listener without any verbal background, visual context, or prior exposure to the scenes. The questions of this task were part of a larger survey concerning complex acoustic scenes, which will be published elsewhere. However, for many applications the ability to correctly identify the exact scene may not be as important as understanding the more general scene category that subjects associate with it. An additional goal of the subjective evaluation was therefore to record the alternative associations that the subjects reported for the different scenes of the ARTE database. This combined information will then help researchers to select environments for their specific application and aid future research on finding more general relevant scene categories, as well as on auditory scene recognition and analysis.

Methods
A group of 66 subjects (18 male, 48 female), aged 19-64 years (mean age 29.3 years), participated. Pure-tone audiograms were measured for all participants: 50 had normal hearing (≤20 dB HL), twelve had slight hearing losses (20-25 dB HL), and four had mild losses (25-35 dB HL). The subjects received either a small gratuity or course credit for their participation. All signals were generated on a PC with an RME MADI sound card connected to two RME 32-channel digital-to-analog converters (M-32). These fed 11 Yamaha XM4180 power amplifiers that drove the 41-channel HOA loudspeaker array described in Section 2.4.3 and located in the anechoic chamber of the Australian Hearing Hub, Macquarie University, Australia. The subjects were seated on a height-adjustable chair with their head located in the acoustic center of the HOA loudspeaker array. Fourteen different environments were presented in two parts, comprising all 13 scenes described in Table I, whereby the Diffuse Noise (scene 6) was included as a reference condition and was repeated at two levels: 60 and 70 dB SPL. The first presentation was a training and familiarization round with the Café (2) environment (see Table I). Then, a randomized sequence of seven additional environments was presented. Following a mandatory break, the test resumed with one repeated environment out of the seven and a randomized sequence of the remaining six environments. The entire test lasted between 1.5 and 3 hours (including tasks unreported here). Test participants were asked to: (i) indicate whether the scene they heard takes place indoors, outdoors, or in a combined space, (ii) try to identify the scene as an open question, and (iii) rate how realistic the scenes sounded to them on a scale from 0 to 10, from completely artificial to completely realistic. The questions were answered using paper and pen, and are given in Appendix A2.
All scenes were two minutes long and played in a continuous loop. The subjects were instructed to listen carefully to the acoustic scenes before answering the questions. The experimenter could monitor the participant via a video camera and talk-back microphone system in the control room to provide assistance, if required, during the test and to change the scene when requested.

Results
The subjective responses about the type of scene (indoors, outdoors, combined space) are listed in three columns of Table II, indicating the various confusions made. For the eight quietest scenes, subjects correctly identified the type of environment they listened to at a rate of 89% or more. In the Street / Balcony scene, which was classified as a combined indoor-outdoor space, most responses (73.8%) indicated that it was completely outdoors, despite some indoor sounds that were part of it. Similarly, the Train Station scene was correctly identified as a combined indoor-outdoor scene by only 52.3% of the subjects, with the remaining responses split between indoors and outdoors. Interestingly, indoor identification rates dropped for the loudest scenes (80.7% and 66.7% for Food Courts 1 and 2, respectively).
Table II. Subjects' place identification through listening to 14 scenes. (a) The number of subjects per scene. Note that for the church and diffuse scenes, only the first (unbiased) occurrences were counted. (b) Type of space identification, with correct answers (%) in boldface: Indoors, Outdoors, or Combined space. The question is not strictly valid for the diffuse scenes, but the typical confusions are shown. (c) The strict identification rate (%), estimated by strict verbal equivalence between the written guess by the subject and the known scene location. (d) Lenient identification rate (%) and labels when they were considered close enough to the known scene location. (e) Common alternative labels that may be adopted in future experiments. (f) The mean realism ratings (0-10) with a confidence interval of two standard errors. See text for further details.
Correct scene identification varied dramatically between scenes (0 to 100%), as listed in the "Strict ID" column of Table II. While the Train Station scene was always successfully identified, the two church scenes were generally confused for something else, or were given a rather vague description (e.g., social gathering).
Some of the answers were inaccurate, but were close enough given the lack of prior knowledge about the scenes, so they were counted as correct. For example, cafeteria and café are very close. This was taken into account in the alternative identification calculation (the "Lenient ID" column of Table II), which accepts more answers as correct. Some identification labels were wrong in the strict sense, but are justifiable and can come across as convincing, especially since they were repeated several times. These answers are listed under "Alternative labels" in Table II. Because the scene identity was always revealed after the initial responses (see Appendix A2), many subjects were able to correctly identify the Diffuse Noise as well as the Church scenes when presented a second time (e.g., they identified Church 2 because they had already heard Church 1, which was recorded in the same place, but at another time). To avoid any learning bias for these two scenes, the identification responses were only counted for the first presentation and were omitted when repeated, which is reflected in the number N of subjects column of Table II. The most frequent scene identification confusions are also listed in Table II under the heading of "alternative responses". Note that the identification questions were asked about the diffuse noise scenes, although they were not strictly answerable, because these were not real places. Three people who associated white noise with these scenes were counted as correct in the lenient identification.
The realism ratings of the scenes exhibit three clusters (Table II, last column): diffuse scenes (mean rating of 4.6-4.8), church scenes (7-7.1), and all the rest (7.9-8.9). The artificial diffuse noise scenes were indeed judged to be more artificial than real, confirming that listeners reacted differentially to these sounds. Additionally, in agreement with the church misidentification, the church scenes were also judged to be less realistic.
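The realism ratings in Table II are reported as means with a confidence interval of two standard errors. A minimal sketch of that summary is given below, assuming the standard error is the sample standard deviation divided by the square root of the number of ratings. The function name is hypothetical and the ratings are invented placeholders, not data from the study.

```python
import math
import statistics

def mean_with_two_se(ratings):
    """Return (mean, lower bound, upper bound) using mean +/- 2 standard errors."""
    mean = statistics.mean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, mean - 2 * standard_error, mean + 2 * standard_error

# Placeholder ratings on the 0-10 realism scale (not taken from the study).
mean_rating, lower, upper = mean_with_two_se([6.0, 8.0, 10.0])
```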

Discussion
Listeners were able to correctly identify a number of scenes, despite the lack of visual cues or additional information. Notably, the train station, the two café scenes, and the office scene were all identified at rates above 83%, regardless of how the identification was estimated. However, the identification was never perfect and may have depended on the subject's familiarity with the specific environment. Even for a straightforward environment such as a living room, imprecise identifications were common, as the scene is dominated by both loud television and kitchen utensils, so that listeners had to decide where they stand in relation to the different sources. The library had more than 60% identification, which was likely affected by the familiarity of subjects with that particular space, as the recording was done in the university library. The two different food court scenes were misidentified more often than expected. Food Court (1) was recorded in the university itself, which may explain the somewhat higher (lenient) rating of 45.4%, compared to 36.3% for Food Court (2), because it is more familiar at least to the many students who participated in this listening test. However, most confusions of these two scenes indicate that they were correctly associated with noisy food establishments of similar or smaller scale. It is not impossible, for example, that a particularly large pub with a few hundred people would sound similar to a crowded food court. Familiarity may have also played a role here, as some people may avoid eating out at food courts altogether. For this reason, some users of the ARTE database may decide that these alternative labels are good enough and can serve as valid scenes, even though they do not adhere to the strict original location. Determining what scene labels should count as 'lenient' or 'alternative' is a somewhat subjective process in itself, but as the complete subjective data is included in Table II, future users of the scenes may reinterpret these labels.
The observed identification rates were only in partial agreement with a previous study [38], in which subjects had to identify stereophonic recordings from 34 locations with an average duration of 10.42 s, RMS-level matched to 80 dB SPL. For instance, the identification rate was 95% for their train station, similar to the ARTE train station (100%), but only 20% for their office and library recordings, compared with 86% and 60%, respectively, in the present study. The differences are difficult to discuss without more details about their recordings, but it is likely that, due to their short stimulus presentation, the acoustic scenes did not include enough sound events and acoustic features that are uniquely associated with the particular scenes to allow a quick judgment.
Identification of the two church scenes was exceptionally low once the presentation order correction was applied. While almost all listeners could tell that this is some kind of a social setting happening indoors, there were very few cues that disclosed the exact purpose of the social interaction. Some participants commented that, as churchgoers, they found that neither the room acoustics (less reverberation than typical for churches; see Section 3.1) nor the conversations sounded like those of a typical church. It seems warranted that these scenes could be used to represent social situations other than a church. In particular, Church (2) was never identified as a church, so it may be renamed as "Social Gathering", as this was the most frequent classification given by subjects. Correct identification of the Street / Balcony scene was also exceptionally difficult for subjects, because of the combined space in which it was recorded. While most of the sound was unmistakable traffic coming from the street, there are some reflections and odd sounds coming from inside the apartment (e.g., dishes and water flowing), which may be difficult to explain without the visual context, especially for lay listeners. This resulted in either incomplete scenes (i.e., "busy road"), or more creative descriptions such as a "road side café". The four respondents whose answers were counted in Table 4.2 identified the scene as being in a house near a busy road. Finally, while the diffuse noise scenes were not meant to sound like any specific place, they were frequently associated with water, either waterfalls or rain.
Identifying the type of space of the recordings (i.e., indoors, outdoors, or combined space) was generally easier for listeners. The ARTE database currently contains only indoor and two combined-space scenes, but unfortunately no outdoor scenes. Most listeners were able to easily identify all indoor scenes, with the exception of the two food courts, where the very loud babble likely masked any room acoustic cues. Also, confusions between combined space and outdoors were rather common for the two relevant scenes (Train Station and Street / Balcony). It is likely that the combined-space option requires a better ability to visualize the scenes and an awareness of room acoustics, something that not all listeners can be expected to have. The unavailability of outdoor scenes did not enable the simulation and subjective evaluation of free-field or natural environments, and thus conclusions from the test cannot be directly extrapolated to such settings based on the available data. Once again, while not strictly correct, it may be justifiable for future users of the ARTE database to treat the Street / Balcony scene as outdoors, given the subjective data of Table II.
The rating of realism turned out to involve a significant degree of ambiguity, which was mainly due to an insufficient explanation of what was meant by "realism". Some subjects referred to the reproduction of the HOA sound system, and other subjects referred to the believability of being in a given scene, once it became known, or to the question of whether the particular sounds actually represent such a scene. Unfortunately, the final ratings likely refer to a combination of all of these aspects. When asked informally after the test about their overall impression of the sound reproduction, most people found it very real-sounding.

General Discussion and Conclusions
While the main aim behind the ARTE database is to provide real-world material in hearing testing, it has been designed in a way that is general enough to be useful in other related fields of research. For example, although the present ARTE database is not intended to serve as a systematic survey of the acoustics in everyday listening environments, it does provide detailed derived acoustic data alongside each recording. Similarly, the level of scene description in ARTE may not match that of soundscape databases, but it may still be suitable for related soundscape as well as ecological acoustics research that requires generating soundscapes under controlled conditions. In room acoustics and digital signal processing, methods based on RIR databases regularly rely on the synthesis of noisy reverberant environments, whereas the ARTE database provides direct recordings of complex everyday environments in addition to RIRs. This results in richer, more complex, and more realistic acoustic scenes than are feasible with RIR-based synthesis methods alone (e.g., by being able to capture acoustic source movement), and may be used for similar applications of improving speech reception through digital signal processing. Finally, in applications such as automatic scene classification, there has been no mention of the absolute levels of the recorded material (except for DEMAND [3]), which is a critical factor in sound perception. The ARTE database is possibly too small to be used in these applications, but it may be used to cross-check the performance of classification algorithms trained on larger, yet uncalibrated, databases.
Although the research questions of the fields of investigation mentioned above (hearing assessment, acoustic communication, soundscapes, room acoustics, scene analysis, and automatic scene or source classification) are inevitably different, their requirements for realistic stimuli may be very similar. The release of the ARTE database provides a novel attempt to address several aspects of realistic scenes in a holistic manner. Because of its comprehensive nature, not all features are going to be useful for all researchers. Rather than deterring researchers who may be interested only in a subset of the features, it is important to emphasize that observations, which will be gathered across different studies that investigate various aspects of everyday listening, may become easier to collect and reproduce using the standardized stimuli. Furthermore, it is also hoped that the comprehensive nature of ARTE will encourage its enhancement in the future, with contributions from other researchers.
The HOA sound-field reproduction method that the database is based on is both a disadvantage and an advantage at the current stage of this technology. On the one hand, the strength of using HOA must not be underestimated, as it lends itself to universal deployment using different hardware and software setups. Moreover, deriving binaural, single-channel, or multichannel audio from the HOA recordings is straightforward, even when the complete setup for HOA reproduction is unavailable. It is a proven method to reproduce sound that is subjectively realistic, as was shown in Section 4.2, and is also rich in terms of the instrumental acoustic measures that can be computed from it: spatial, spectral, temporal and dynamic. On the other hand, processing and reproducing the HOA recordings requires technical knowledge that is currently held by only a few laboratories around the world. The deployment of the ARTE database will hopefully contribute to making HOA technology more accessible to the broader research community, or inspire the development of simpler technologies that could eventually supersede it. Even if this is where the future lies, it would be necessary to investigate to what level of precision realistic virtual scenes have to be reproduced in the laboratory in order to obtain observations that are relevant to the real world. More research using stimuli such as those provided in the ARTE database will be needed to bridge the wide gap between realistic listening environments and the traditional stimuli in psychoacoustics.
The selection of everyday scenes that presently populates the database is not universal for several reasons. The environments were all recorded indoors or in combined indoor-outdoor spaces, all in urban settings of a large, Australian-English speaking metropolis. Even within the limited population tested, large variations were observed with regard to how well the scenes are recognized. The success rate of correctly recognizing or identifying scenes may degrade if tested on populations of other cities, in other countries, both in the developed and the developing world. For example, the food courts that are over-represented in the database may be irrelevant to many people in the countryside, as are the church scenes for the many non-Christians worldwide. All possible applications for research that were indicated above, hearing impairment rehabilitation most notably, should not be bounded geographically or culturally to any specific region. Therefore, more universally applicable research may require adding more scenes to the database that pertain to broader populations, possibly through the inclusion of recordings set in non-English speaking cultures. It will also be critical to add outdoor scenes to the database, which at the moment are completely lacking. However, special care will have to be taken with regard to wind noise, which corrupted several of the recordings that were originally intended for ARTE. Even though wind protectors may be used to reduce wind noise, their benefit for the rather high HOA orders that were recorded here is very limited. Instead, it is highly recommended to obtain wind condition forecasts and select the least windy hours of the day. Finally, some locations may be made somewhat redundant through comparative findings from soundscape studies, but this remains to be seen.
The ARTE database is purposely designed in a transparent way to encourage other members of the research community to expand it in the future, or to connect it to their own databases.
In some spaces, the number of people may have significantly altered the reverberation that was captured during the recording, compared to that of the RIR measured without them. This discrepancy may be noticeable in circumstances where the recorded scene and a derived stimulus convolved with the respective RIR are mixed, but will have to be examined according to the particular sounds and uses in question.

[...] ARTE database, all SHFs up to a degree of $m = 4$ were considered, as well as all sectorial SHFs (i.e., $m = n$) for $4 < m \le 7$. Equation (A2) was applied here separately at $L = 1000$ equidistant frequencies between $0 \le f \le f_s/2$, at a sampling frequency of $f_s = 44.1$ kHz, which provided the one-sided transfer functions of the encoding filters $h_{E,kq}(t)$. The encoding filters were then derived by extending these one-sided transfer functions by their frequency-reversed complex conjugate (i.e., utilizing the identity $H_{E,kq}(-j\omega) = H_{E,kq}(j\omega)^{*}$) and applying the inverse Fourier transform.
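The last step described above (extending a one-sided transfer function by its frequency-reversed complex conjugate and applying the inverse Fourier transform) can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the transfer function used here is a placeholder first-order low-pass rather than the output of Equation (A2), and the function name is hypothetical. Only L = 1000 and f_s = 44.1 kHz are taken from the text.

```python
import numpy as np

FS = 44_100   # sampling frequency in Hz (from the text)
L = 1000      # number of equidistant frequencies on [0, FS/2] (from the text)


def one_sided_to_filter(H_one_sided):
    """Build a real impulse response from a one-sided transfer function.

    Hypothetical helper: H_one_sided holds samples from DC up to FS/2.
    """
    H = np.asarray(H_one_sided, dtype=complex)
    # Frequency-reversed complex conjugate of the interior bins.
    # DC (first bin) and FS/2 (last bin) are not duplicated.
    mirrored = np.conj(H[-2:0:-1])
    H_full = np.concatenate([H, mirrored])   # length 2 * (len(H) - 1)
    h = np.fft.ifft(H_full)
    # The imaginary part is numerical residue when the spectrum is Hermitian.
    return h.real, np.max(np.abs(h.imag))


freqs = np.linspace(0.0, FS / 2, L)
# Placeholder transfer function (assumption, NOT Equation (A2)).
H = 1.0 / (1.0 + 1j * freqs / 2000.0)
# For this demo the FS/2 bin is forced to be real so that the extended
# spectrum is exactly Hermitian (the DC bin is already real here).
H[-1] = H[-1].real

h, imag_residual = one_sided_to_filter(H)
print(h.shape, imag_residual)
```

NumPy's `np.fft.irfft(H, n=2 * (L - 1))` performs the same Hermitian extension implicitly and can be used as a cross-check of the explicit construction.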

A1.2. HOA decoding
As illustrated in Figure 2, the decoding into $G$ loudspeaker signals, $l_g(t)$, is realized by a weighted sum over all $K$ HOA signals, $b_k(t)$, in which the weights are given by the decoding matrix, which is the pseudo inverse of the re-encoding matrix.
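As a minimal sketch of this decoding step, the Python snippet below forms a decoding matrix as the pseudo inverse of a re-encoding matrix and applies it as a weighted sum over the HOA signals. Several things here are assumptions made for illustration and are not taken from the ARTE processing: the matrix names `Y` and `D`, the six-loudspeaker layout, the random stand-in signals, and in particular the SHFs, which are an unnormalized first-order set (K = 4) rather than the ARTE set ordered as in Table A1.

```python
import numpy as np


def first_order_shfs(azimuth, elevation):
    """Placeholder SHFs (K = 4), unnormalized first order. NOT the ARTE set."""
    az = np.asarray(azimuth, dtype=float)
    el = np.asarray(elevation, dtype=float)
    return np.stack([
        np.ones_like(az),
        np.cos(az) * np.cos(el),
        np.sin(az) * np.cos(el),
        np.sin(el),
    ])  # shape (K, G): row k is the k-th SHF sampled at the G directions


# Hypothetical loudspeaker layout: four on the horizon, one above, one below.
az = np.deg2rad([0, 90, 180, 270, 0, 0])
el = np.deg2rad([0, 0, 0, 0, 90, -90])

Y = first_order_shfs(az, el)   # re-encoding matrix, shape (K, G)
D = np.linalg.pinv(Y)          # decoding matrix, shape (G, K)

rng = np.random.default_rng(0)
b = rng.standard_normal((Y.shape[0], 480))   # stand-in HOA signals b_k(t)

# Each loudspeaker signal is a weighted sum over all K HOA signals:
# speaker_signals[g, t] = sum_k D[g, k] * b[k, t]
speaker_signals = D @ b
print(speaker_signals.shape)
```

For a layout whose re-encoding matrix has full row rank, as in this example, re-encoding the loudspeaker signals with `Y` recovers the original HOA signals, which is one way to sanity-check a decoder of this kind.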
Here, $Y_k(\theta_g, \phi_g)$ denotes the $K$ SHFs sampled at the directions (azimuth $\theta_g$ and elevation $\phi_g$) of the $G$ playback loudspeakers, as described above. The order of the HOA channels $k = 1, 2, \ldots, K$ is summarized in Table A1.

The following question was given separately on another page, as it revealed the correct scene identity to the listener.

Imagine you are situated in [SCENE NAME] whose sound environment is being virtually reproduced now. In the following question you will be asked to subjectively rate various aspects of this sound environment. There are no wrong or right answers.

3. How realistic do you find this audio environment?
Completely Artificial (0) - Completely Realistic (10)