A toolbox for rendering virtual acoustic environments in the context of audiology

A toolbox for creation and rendering of dynamic virtual acoustic environments (TASCAR) that allows direct user interaction was developed for application in hearing aid research and audiology. This technical paper describes the general software structure and the time-domain simulation methods, i.e., transmission model, image source model, and render formats, used to produce virtual acoustic environments with moving objects. Implementation-specific properties are described, and the computational performance of the system was measured as a function of simulation complexity. Results show that on commercially available commonly used hardware the simulation of several hundred virtual sound sources is possible in the time domain.


Introduction
Hearing aids are evolving from simple amplifiers to complex signal processing devices. Current hearing devices typically contain spatially sensitive algorithms, e.g., directional microphones, direction of arrival estimators, or binaural noise reduction, as well as automatic classification of the acoustic environment that is used for context-adaptive processing and amplification [25]. Several of these features cannot be tested in the current lab-based setups for hearing-aid evaluation, because they employ rather simple acoustic configurations. Furthermore, it was shown in several studies that hearing aid performance depends on the spatial complexity of the environment, and that the hearing aid performance in simple laboratory conditions is not a good predictor of the performance in more realistic environment or in the real life [36,13,6,9,24]. Finally, recent developments in hearing aid technology led to an increased level of interaction between the user, the environment and the hearing devices, e.g., by means of motion interaction [39,40], gaze direction [29] or even with brain-computer interfaces [17]. Thus, for an improved assessment of hearing aid benefit as well as for the development and evaluation of user interaction techniques, a reproduction of complex listening environments in the laboratory may be beneficial.
Advances in computer technology in combination with recent multi-channel reproduction [7,34,14] and acoustic simulation methods [3,31,42] allow for the reproduction of virtual acoustic environments in the laboratory. Limitations in reproduction and simulation quality have been studied in terms of perceptual effects [8,22] as well as in terms of technical accuracy of hearing aid benefit prediction [21,33].
These studies support the general applicability of virtual acoustic environments to hearing-aid evaluation and audiology, but show that care must be taken in designing the simulation and reproduction methods, to ensure that the outcome measures are not biased by the artifacts of the applied methods.
Several further requirements apply when using virtual acoustic environment in hearing research and audiology. To allow for a systematic evaluation of hearing device performance, virtual acoustic environments need to be reproducible and scalable in their complexity. The presence of early reflections and late reverberation in the simulation is essential for the application of hearing aid evaluation, since both of these factors may affect hearing aid performance [36]. For assessment of user interaction, but also for the analysis of hearing aid benefit, simulation of the effects of motion of listeners and sources might be desired. These effects do not only include time-variant spatial cues, but also Doppler-shift and time-variant spectral cues due to comb filtering.
Existing virtual acoustic environment engines often target authentic simulations for room acoustics [19,32], resulting in a large computational complexity. They typically render impulse responses for off-line analysis or auralization and thus do not allow studying motion and user interaction. Other interactive tools, e.g., the SoundScapeRenderer [1], do not provide all features required here, such as room simulation and diffuse source handling.
To accommodate the requirements listed above, a toolbox for acoustic scene creation and rendering (TASCAR) was developed as a Linux audio application [23] with an open source core [37] and commercial support [38]. The aim of TASCAR is to interactively render complex and time varying virtual acoustic environments via loudspeakers or headphones. For a seamless integration into existing measurement tools of psycho-acoustics and audiology, low-delay real-time processing of external audio streams in the time domain is applied, and interactive modification of the geometry is possible. This technical paper aims at describing the general structure of applications in hearing aid evaluation and audiology, the applied underlying simulation and rendering methods, and their specific implementation. A measurement of the computational performance and its underlying factors is provided to allow for an estimation of maximum simulation complexity in relation to the available computing power. This paper also serves as a technical reference for the TASCAR open source software (TASCAR/GPL).

General structure
The structure of TASCAR can be divided into four major components (see Figure 1 for an overview): The audio player (block a in Fig 1) serves as a source of audio signals. The geometry processor (block b) controls position and orientation of objects over time. The acoustic model (blocks c) simulates sound propagation, room acoustics and diffuse sounds. Finally, the rendering subsystem (block d) renders the output of the acoustic model for a physical reproduction system.
A virtual acoustic environment in TASCAR is defined as a space containing several types of objects: point sources (e.g., speakers, distinct noise sources), diffuse sources (e.g., remote traffic, babble noise), receivers (e.g., dummy head), reflectors (e.g., boundaries of a room) and obstacles. Source objects are provided with the audio content, delivered either by the internal audio player module, or externally e.g., from physical sources, audiological measurement tools, or digital audio workstations (DAW).
Objects in the virtual acoustic scene can change their positions and orientations over time. Information about the object geometry at a given time is taken either from sampled trajectories, from algorithmic trajectory generators, or from external devices, e.g., a joystick or head-motion tracker (interactive controller of an object's movement, e.g., motion of a dummy head). Geometry information is exploited in the acoustic model to modify the input audio signals delivered by the audio player. Modifications performed by the acoustic model mimic basic acoustic properties like distance law, reflections and air absorption. The resulting sound corresponds to the time-variant spatial arrangement of the objects in the virtual scene. Geometry data can also be exchanged with external modules, e.g., game engines, to make the visualization consistent with the acoustic scene content.
At the final stage of the acoustic model, there is a receiver model, which encodes the modified signals into a receiver type specific render format, used subsequently by the rendering subsystem for the reproduction of the simulated environment on a physical reproduction system.

Geometry processing
Each object in a virtual acoustic environment is determined by its position p(t) and orientation Ω(t) in space at a given time t. Position is defined in Cartesian coordinates p = (p x , p y , p z ), and orientation is defined in the Euler angles, Ω = (Ω z , Ω y , Ω x ), where Ω z is the rotation around the z-axis, Ω y around the y-axis and Ω x around the x-axis.
Trajectories Γ for a moving object are created by specifying the position and orientation for more then one point in time: where t 1 , t 2 , · · · ∈ R are the sampling times of the trajectory. The time variant position p(t) is linearly interpolated between sample times of Γ p , either in Cartesian coordinates, or in spherical coordinates relative to the origin, respectively. The time variant orientation Ω(t) is linearly interpolated from Γ Ω , in Euler-coordinates. To apply the orientation to objects, a rotation matrix O is calculated from the Euler coordinates.

Acoustic model
For each sound source object k, the acoustic model modifies its associated original source signal x(t) delivered by the audio player using geometry data into an output signal y(t) that is then used as input signal to a receiver. The performed computations simulate basic acoustic phenomena as described below. Signals y(t) serve at the subsequent stage to calculate the output of a receiver (see Section on render formats below).
The acoustic model consists of the source model (omni-directional or frequency-dependent directivity), the transmission model simulating sound propagation, an image source model, which depends on the reflection properties of the reflecting surfaces as well as on the 'visibility' of the reflected image source, and a receiver model, which encodes the direction of the sound source relative to the receiver into an receiver output for further processing by the rendering subsystem.

Image source model
Early reflections are generated with a geometric image source model, i.e., reflections are simulated for each reflecting plane surface with polygon-shaped boundary by placing an image source at the position of the reflection. Each image source is rendered in the time domain, in the same way as primary sources. This is different to the more efficient "shoe-box" image source models commonly used in room acoustic simulations [3], which calculate impulse responses by solving the wave equations. For a first order image source model, each pair of primary source and reflector face creates an image source, where the plane on which the reflector lies is a symmetry axis between the primary and image source (see Fig 2). The image source position p img is determined by the closest point on the (infinite) reflector plane p cut to the source p src : p img = 2p cut − p src .
For higher order image source models, lower order image sources are treated as primary sources leading to higher order image sources.
The image source position itself is independent of the receiver position. However, for finite reflectors there are two types of reflections in TASCAR, and depending on the receiver position it is determined which reflection type is executed (see Fig 2). If the intersection point p is of the line connecting the image source with the receiver and the reflector plane lies within the reflector boundaries, the image source is 'visible' in the reflector, and a 'specular' reflection is applied. If p is is not within the reflector boundaries, the source is 'invisible' from a receiver perspective and the 'edge reflection' is applied. For 'edge' reflections, the apparent image source position is shifted so that the distance between the source and receiver remains unchanged, whereas the receiver, edge of the reflector and the apparent source position form one line (see Fig 2, right panel). The angle θ by which the image source is shifted to create effective image source controls a soft-fade gain by which the source signal is multiplied g: The coefficient κ = 2.7 was chosen for a rough approximation of diffraction of speech-shaped signals and medium-sized reflectors. If a receiver or a sound source are behind the reflector, the image source is not rendered. A reflector object has only one reflecting side in the direction of the face normal.
To simulate the reflection properties of a reflector object, the source signal is filtered with a first order low pass filter determined by a reflectivity coefficient ρ, and a damping coefficient δ, which can be specified for each reflector object: In room acoustics material properties are commonly defined by frequency dependent absorption coefficients α(f ). These can be calculated from the reflection filter coefficients ρ and δ by The filter coefficients ρ and δ can be derived from frequency dependent absorption coefficients by minimization of the mean-square error between desiredα(f ) and α(f ) derived from the filter coefficients.

Source directivity
For the simulation of source directivity, the receiver position relative to the source is calculated. Frequency-dependent directivity with omni-directional characteristics at low frequencies and higher directivity at high frequencies is achieved by controlling a low-pass filter by the angular distance between the receiver and the source direction. The normalized relative receiver positionp rec,rel is The cosine of the angular distance is thenp x,rec,rel . The cut-off frequency f 6dB defines the frequency, for which −6 dB at ±90 degrees are achieved. With ξ = πf 6dB /fs log(2) , a first order low-pass filter with the recursive filter coefficient c lp , is applied to the signal, to achieve the frequency-dependent directivity, or in other words, the direction-dependent frequency characteristics.

Transmission model
The transmission model simulates the delay, attenuation and air absorption, which depend on the distance r(t) between the sound source (primary or image source) and the receiver, as well as attenuation, caused by obstacles between source and receiver. Point sources follow a 1/r sound pressure law, i.e., doubling the distance r results in half of the sound pressure. Air absorption is approximated by a simple first order low-pass filter model with the filter coefficient a 1 controlled by the distance: where f s is the sampling frequency and c the speed of sound. The empiric constant α = 7782 was manually adjusted to provide appropriate values for distances below 50 meters. This approach is very similar to that of [27] who used an FIR filter to model the frequency response at certain distances. However, in this approach the distance parameter r can be varied dynamically. The distance dependent part of the transmission model without obstacles can then be written as where x(t) is the source signal at time t, and y(t) is the output audio signal of the transmission model. The time-variant delay line uses either nearest neighbor interpolation or sinc interpolation, depending on the user needs and computational performance of the computing system. Obstacles are modeled by plane surfaces with polygon-shaped boundaries. The acoustic signal is split into a direct path, which is attenuated by the obstacle-specific frequency-independent attenuation a o , and an indirect path, to which a simple diffraction model is applied. The diffracted path is filtered with a second order low pass filter which is controlled by the shortest path from the source via the obstacle boundary to the receiver. With the angle θ o between the connection from the intersection point of the shortest path with the obstacle boundary to the source position, and the connection from the receiver position to the intersection point, the cut-off frequency of the low-pass filter is with the aperture a = 2 A/π defined as the radius of a circle with the same area A as the obstacle polygon. This simple diffraction model is based on the diffraction on the boundary of a circular disc [2], however, position-dependent notches are not simulated. The diffracted signal is weighted with 1 − a o and added to the attenuated signal.

Diffuse sources
Sound sources with lower spatial resolution, e.g., diffuse background noise or diffuse reverberation [42], are added in first order Ambisonics (FOA) format. No distance law is applied to these sound sources; instead, they have a rectangular spatial range box, i.e., they are only rendered if the receiver is within their range box, with a von-Hann ramp at the boundaries of the range box. Position and orientation of the range box can vary with time. The size of the range box is typically adjusted to match the dimension of the simulated room. The diffuse source signal is rotated by the difference between box orientation and receiver orientation. Diffuse reverberation is not simulated in TASCAR. To use diffuse reverberation, the input signals of the image source model can be passed to external tools which return FOA signals, e.g., feedback-delay networks or convolution with room impulse responses in FOA format [12]. A smooth transition between early reflections from the image source model and diffuse reverberation based on room impulse responses can be achieved by removing the first reflections from the impulse responses. To account for position-independent late reverberation, room receivers can render independent from the distance between source and receiver, e.g., the transmission model can be replaced by a room-volume dependent fixed gain.

Receiver model
The interface between the acoustic model and the rendering subsystem is the receiver. A receiver renders the output of the transmission model depending on the relative position and orientation between receiver and sound source. Signals from the transmission models belonging to all sound sources are summed up after direction-dependent processing. The render format determines the number of channels and the method of encoding the relative spatial information into a multi-channel audio signal. The output signal of a receiver is z(t) = (z 1 (t) , z 2 (t) , . . . , z N (t)) .
The receiver functionality can be split into the panning or directional encoding of primary and image sources y(t), and the decoding of diffuse source signals f (t) in first order Ambisonics format with Furse-Malham normalization ('B-format'): In the panning part, the driving weights w = (w 1 , w 2 , . . . , w N ) depend on the direction of the relative source position in the receiver coordinate system, p rel,k = O −1 rec (p k − p rec ); O rec is the receiver orientation matrix, and p k is the position of the k-th sound source. y k (t) is the output signal of the transmission model, i.e., it contains the distance-dependent gain, air absorption and obstacle attenuation, for the k-th source; K is the number of all primary and image point sources.
In the diffuse decoding part, D is the receiver-type specific first order Ambisonics decoding matrix for the w, x, y and z channels of the first order Ambisonics signal, andÔ −1 rec is the rotation matrix for first order Ambisonics signals, to compensate the receiver orientation. f l is the first order Ambisonics signal of the l-th diffuse source, rotated by the source orientation; L is the number of all diffuse sources, including diffuse reverberation inputs.

Render formats
The render formats of TASCAR can be divided into three categories: Virtual microphones simulate the characteristics of microphones. They primarily serve as a sensor in a virtual acoustic environment. Speaker-based receiver types render signals which can drive real or virtual loudspeakers, used for auralization of virtual scenes. Ambisonics receiver types render the scenes to first, second or third order Ambisonics format, which can be rendered to virtual microphones, loudspeakers or other reproduction methods using external decoders. Receivers can render either for three-dimensional reproduction or for two-dimensional reproduction. In both cases, the directional information of the relative source position is encoded in the normalized relative source position,p However, in the two-dimensional case p rel is projected onto x, y-plane before the normalization by setting its z-component to zero. In both cases, the acoustic model, containing all distance-dependent effects, and the image source model are calculated based on the three-dimensional relative source position.

Virtual microphones
The virtual microphone receiver type has a single output channel. The driving weight is It's directivity pattern can be controlled between omni-directional and figure-of-eight with the directivity coefficient a; with a = 0 this is an omni-directional microphone, with a = 1 2 a standard cardioid, and with a = 1 a figure-of-eight. The diffuse decoding matrix is The factor √ 2 of the w-channel is needed to account for the Furse-Malham normalization of the diffuse signals.

Speaker-based render formats
This class of render formats contains all types which render the signals directly to a loudspeaker array. The number N and position s n of speakers can be user-defined;s n is the normalized speaker position. A measure of angular distance between a source and a loudspeaker is d n = 1 −s np T rel . The most basic speaker-based receiver type is nearest speaker panning (NSP). The driving weights are: Another commonly used speaker-based render format is two-dimensional vector-base amplitude panning (VBAP) [34]. The two speakers n 1 and n 2 which are closest to the source are chosen. A gain vector (g n1 , g n2 ) T based on the normalized speaker positions and the normalized relative source position in the x, y-plane is defined: Then the driving weights are For ambisonic panning with arbitrary order, the signal of each source is encoded into horizontal Ambisonics format (HOA2D). Decoding into speaker signals is applied after a summation of the signals across all sources. In the decoder, the order gains can be configured to form a 'basic' decoder or a 'max r E ' decoder [14]. An equal circular distribution of loudspeakers is assumed for this render format. Although this receiver applies principles of Ambisonics, it is a speaker-based receiver, because encoding and decoding is combined.
All speaker based receiver types use a max r E first-order Ambisonics decoder for decoding of diffuse sounds: g is the decoder type dependent gain; for max r E this is g = 1 √ 2 in the two-dimensional case and g = 1 √ 3 in the three-dimensional case [14].

Ambisonics-based receivers
First, second and third order receiver types were implemented. They follow the channel sequence and panning weight definition of the Ambisonics Association [4], using Furse-Malham normalization. The Ambisonics-based receivers encode plane waves, i.e., they do not account for near-field effects. For two-dimensional encoding, all output channels which are zero, w n ≡ 0, are discarded.

Binaural rendering
Binaural signals and multi-channel signals for hearing aid microphone arrayŝ z = (ẑ 1 , . . . ,ẑ m ) are generated by rendering to a virtual loudspeaker array, i.e., using a speaker-based render format, and applying a convolution of the loudspeaker signals z n with the corresponding head-related impulse responses (HRIRs) h n,m for the respective loudspeaker directions. The HRIRs can be either recorded (e.g., [28,41]) or modeled [18].

Implementation
The implementation of TASCAR utilizes the Jack Audio Connection Kit [16], a tool for real-time audio routing between different pieces of software, and between software and audio hardware. The audio content is transferred between different components of TASCAR via JACK input and output ports. The JACK time-line is used as a base of all time-varying features, for data logging and as a link to the time-line of external tools.
The audio signals are processed in blocks. Time-variant geometry and the dependent simulation coefficients, e.g., delay, air absorption filter coefficients or panning weights, are updated at the block boundaries. The simulation coefficients are linearly interpolated between the boundaries. This approximation by linear interpolation might be inaccurate if the simulation coefficients vary non-linearly within a block, e.g., panning weights during fast lateral movements.
Render formats and algorithmic trajectory generators are implemented as modules. Object properties, like geometry data, reflection properties and gains, and the time-line can be controlled interactively via a network interface.
To achieve parallel processing in TASCAR, virtual acoustic environments can be separated into multiple scenes. Independent scenes can be processed in parallel. Feedback signal paths, e.g., caused by room coupling or external reverberation, are possible, but will lead to an additional block of delay. The delay and processing order of scenes is managed by the JACK audio back-end.

Performance measurements
For a rough estimation of the factors of computational complexity in TASCAR, the CPU load was measured as a function of several relevant factors. The performance measurements were done with version 0.169 of TASCAR. All underlying render tools are part of the TASCAR repository [37].
Methods CPU load C caused by audio signal processing was assessed using the 'clock()' system function, after processing 10 seconds of white noise in each virtual sound source. The number of primary sources K, number of output channels N , block size P , maximum length of delay lines l d and the render format was varied (see Table 1 for an overview of the parameter space). No image sources were processed, i.e., all simulated sources were primary sources, and no reflectors were used during the performance measurements. Each measurement of a combination of K, N , P , l d and render format was repeated twice. The CPU load C is time per cycle τ P in samples divided by length of cycle P in samples.
Number of sources and number of output channels are directly related to the numerical complexity in the receiver module. The block size controls the frequency of the geometry update. Memory usage is mainly affected by the maximum delay line length. One delay line is allocated in memory for each sound source. At 44.1 kHz sampling rate, the memory usage of the delay lines is 520 Bytes per meter and source. Different render formats may differ in their numerical complexity.

Results
A one-way analysis of variances revealed that at all tested factors except for the delay line length and repetition showed a significant influence on the τ P at a significance level of p = 0.05. Thus, in the following analysis the data was averaged across l d and repetitions.
To provide an estimation of the contribution of different factors to the numerical complexity, a model function based on the implementation was fitted to the measured data: In this model, a 0 represents the overhead by framework which is not related to the simulation properties. a 1 is an estimate of geometry processing time, which is performed for each source, but not depending on the number of audio samples per processing block P . The factor a 2 is related to source audio processing time per sample in the transmission model, and the processing time spent in the receiver, which does not depend on the number of speakers. a 3 is an estimate of the post processing time per audio sample in the receiver, which does not depend on the number of sources. a 4 is time per audio sample for each loudspeaker and sound source, i.e., time spent in the panning function of the render format. The model parameters were found by minimizing the mean-square error between the measured and predicted CPU load C, and are shown in Table 2. An example data set for one architecture and receiver type is shown in Fig 3. It is often required to estimate the maximum number of sound sources K for a given CPU, render format and loudspeaker setup (affecting N ) and latency constraint (affecting P ). Eq (19) can be transformed to As an example, K max was calculated for all tested combinations of CPU model and receiver type, for C = 90% and P = 1024. These results are given in the last two columns of Table 2, for N = 8 and N = 48.
The results show that on CPU models which are commonly used at the time of writing, several hundred sound sources can be simulated. From the tested render formats, 'HOA2D' was most efficient, especially for larger values of N . These results take only a single core into account. On multi-core computers, more complex environments can be simulated by splitting them into multiple environments of lower complexity, and rendering them in parallel.

Validation and applications
The proposed simulation tool is based on established render formats, such as VBAP [34] or HOA [14]. The physical and perceptual properties of these render methods have been extensively studied [30,15,11,35,1,5,8]. The limitations for applications in hearing aid evaluation differ from perceptual limitations [21]. They depend on the sensitivity of hearing aid algorithms and the applied hearing aid performance measures on spatial aliasing artifacts of the render methods. Thus the optimal render method depends on the context of a specific application of the proposed simulation tool. Based on the data by [21], a specific TASCAR scene can be designed such that it meets the requirements of an application-specific receiver, e.g., a human head with two-microphone hearing aids on each ear. Distance perception in human listeners is believed to be dominated by the direct-to-reverberant ratio [10]. In the proposed simulation tool with a simple image source model and position-independent externally generated late reverberation, the distance perception may depend on simulation parameters. Thus, in a previous study the distance perception and modeling with room-acoustic parameters in simulations with TASCAR was evaluated [22]. It was shown in a comparison of binaural recordings in a real room and a simulation of the same geometry that in the simulation a distance perception similar to real rooms can be achieved.
An overview over a number of possible applications is shown in Fig 4. The simplest application of TASCAR is to play back a pre-defined virtual acoustic environment via multiple loudspeakers (Fig 4.a). For subjective audiological or psycho-acoustic measurements in virtual acoustic environments, without hearing aids or aided with conventional hearing aids, the audio input of virtual sound sources can be provided by external measurement tools (Fig 4.b). TASCAR can also be applied to assess hearing aid (HA) performance in simulated virtual environments, based on instrumental measures, or with human listeners, e.g., in combination with the open Master Hearing Aid (openMHA) [26]. Subjective or instrumental evaluation of research hearing aids can be performed by feeding the output of the virtual acoustic environment directly to the inputs of a research hearing aid [20] (Fig 4.c). An example study of this use case can be found in [24], where hearing aid performance in eight different virtual acoustic environments of different spatial complexity was assessed. Test stimuli as well as the configuration of virtual acoustic environment and the research hearing aid can be controlled from the measurement platform, e.g., MATLAB or GNU/Octave (Fig 4.d).
Motion data can also be recorded from motion sensors or controllers, to interact with the environment in real-time, or for data logging (Fig 4.e).

Summary and conclusions
In this technical paper, a toolbox for creation and rendering of dynamic virtual acoustic environments (TASCAR) was described, which allows direct user interaction. This tool was developed for application in hearing aid research and audiology. The three main modules of TASCAR -audio player, geometry processor and acoustic model -form the simulation framework. The audio player provides the tool with audio signals, the geometry processor keeps track of the distribution of the objects in the virtual space, and the acoustic model performs the room acoustic simulation and renders the scene into a chosen output format. The simulation uses a transmission model and a geometric image source model in the time domain, to allow for interactivity, and for a simple physical model of motion-related acoustic properties, such as Doppler shift and comb filtering effects. TASCAR allows selecting from a number of various rendering formats, customized to the needs of a range of applications, including higher order Ambisonics and binaural rendering formats.
Performance measurements quantify the influence of factors related to simulation complexity. The results show that, despite some limitations in terms of complexity of the virtual acoustic environment, several hundred virtual sound sources can be interactively rendered, even over huge reproduction systems and on consumer-grade render hardware.
It can be concluded that the proposed tool is suitable for hearing aid evaluation. It offers a set of features, e.g., dynamic time-domain geometric image source model, diffuse source handling, directional sources, which is to current knowledge unique in this combination.