A Comparison on the Localization Performance of Static and Dynamic Binaural Ambisonics Reproduction with Different Order

Ambisonics is a family of spatial sound reproduction systems based on spatial harmonic decomposition and finite-order approximation of the sound field. Ambisonics signals were originally intended for loudspeaker reproduction. By using head-related transfer function (HRTF) filters, binaural Ambisonics converts Ambisonics signals for static or dynamic headphone reproduction. In the present work, the performance of static and dynamic binaural Ambisonics reproduction is evaluated and compared. The mean binaural pressure errors across target source directions are first analyzed. A virtual source localization experiment is then conducted, and localization performance is evaluated by analyzing the percentages of front-back and up-down confusion, the mean angle error, and the discreteness of the localization results. The results indicate that binaural Ambisonics reproduction of insufficiently high order (for example, order 5 to 10) cannot recreate the correct high-frequency magnitude spectra in the binaural pressures, which degrades localization in static reproduction. Because a dynamic localization cue is included, dynamic binaural Ambisonics reproduction yields clearly better localization performance than static reproduction of the same order. Even 3-order dynamic binaural Ambisonics reproduction exhibits adequate localization performance.


Introduction
Conventional Ambisonics is a family of loudspeaker-based spatial sound reproduction systems and techniques. Based on spatial harmonic decomposition and finite-order approximation of the sound field, it aims to reconstruct the target sound field within a local region and below a certain frequency limit [1]. Binaural Ambisonics is a binaural rendering technique that evolved from conventional Ambisonics; it converts conventional Ambisonics signals for headphone reproduction by using head-related transfer function (HRTF) filtering or virtual loudspeakers [2][3][4]. Owing to its flexibility, binaural Ambisonics has been included in the MPEG-H 3D Audio standard by ISO and IEC [5]. It has also been applied to virtual reality (VR), which has developed rapidly in recent years.
In essence, binaural Ambisonics is a spatial harmonic decomposition and finite-order approximation of the binaural pressures or, equivalently, a finite-order approximation of the sound field within a region around the head. Increasing the order of binaural Ambisonics raises the upper frequency limit for accurate reconstruction of the binaural pressures and thus improves the perceived performance of reproduction. However, (L-1) order binaural Ambisonics reproduction requires M ≥ L² virtual loudspeakers, corresponding to M ≥ L² pairs of HRTF-based filters [6]. The cost of signal processing for binaural Ambisonics reproduction therefore also increases with the order. In practical applications, the order is chosen as a compromise between performance and the cost of signal processing.
Binaural Ambisonics can be further classified into static and dynamic reproduction. The former neglects the dynamic variation of binaural signals caused by head turning; that is, it simulates the situation in which the listener's head is immobile during listening. In contrast, the latter uses a head tracker to detect the instantaneous head orientation of the listener and simulates the dynamic variation of binaural signals caused by head turning.
Many studies have evaluated the performance of both conventional and binaural Ambisonics reproduction, including analyses of the error in binaural pressures and of timbre change, and examinations of virtual source localization in reproduction [7][8][9][10]. However, few works in the literature compare the perceived performance of static and dynamic binaural Ambisonics reproduction. Because dynamic binaural Ambisonics includes the dynamic cue for auditory localization, it is expected to exhibit localization performance superior to that of static binaural Ambisonics reproduction of the same order.
In the present study, the localization performance of static and dynamic binaural Ambisonics reproduction with different orders was experimentally evaluated and compared. The results provide some guidelines for choosing the order of binaural Ambisonics in practical applications.

Coordinate System
A clockwise spherical coordinate system is used. The origin of the coordinate system is located at the center of the head. A spatial position is specified by distance 0 ≤ r < ∞, azimuth 0° ≤ θ < 360° and elevation -90° ≤ φ ≤ 90°, where φ = -90°, 0° and 90° represent the bottom, horizontal and top directions, respectively. In the horizontal plane, θ = 0°, 90° and 180° represent the front, right and back directions, respectively.
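The coordinate convention above can be illustrated with a short conversion between (θ, φ) and Cartesian unit vectors. This is a minimal sketch: the axis assignment (x = front, y = right, z = up) and the function names are assumptions made for illustration, since the text does not define Cartesian axes.

```python
import math


def direction_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert (azimuth, elevation) in degrees to a Cartesian unit vector.

    Assumed axes (not specified in the text): x = front, y = right, z = up.
    Azimuth is clockwise seen from above: 0 = front, 90 = right, 180 = back.
    """
    theta = math.radians(azimuth_deg)
    phi = math.radians(elevation_deg)
    x = math.cos(phi) * math.cos(theta)  # front component
    y = math.cos(phi) * math.sin(theta)  # right component
    z = math.sin(phi)                    # up component
    return (x, y, z)


def unit_vector_to_direction(x, y, z):
    """Inverse conversion; returns (azimuth, elevation) in degrees."""
    norm = math.sqrt(x * x + y * y + z * z)
    # Clamp to guard against rounding slightly outside [-1, 1].
    sin_phi = max(-1.0, min(1.0, z / norm))
    elevation = math.degrees(math.asin(sin_phi))
    azimuth = math.degrees(math.atan2(y, x)) % 360.0
    return (azimuth, elevation)
```

With these assumed axes, θ = 90° in the horizontal plane maps to the +y (right) axis and φ = 90° maps to +z (top).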

Spatial Ambisonics
The sound pressure at an arbitrary field point (r, Ω) = (r, θ, φ) caused by a far-field point source at position (rS, ΩS) = (rS, θS, φS) can be expressed as a plane wave, which can subsequently be decomposed by spherical harmonic functions as in Eq. (1), where S0 is the amplitude of the plane wave; k is the wave number; $j_l(kr)$ is the l-order spherical Bessel function; the superscript "*" denotes the complex conjugate operator; and $Y_l^m$ is the l-order, m-degree complex-valued spherical harmonic function given by Eq. (2), in which $P_l^{|m|}$ is the associated Legendre polynomial.
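For reference, one standard form of the plane-wave expansion and of the orthonormal complex spherical harmonics is sketched below. The time factor $e^{j2\pi ft}$, the normalization, and the phase convention are assumptions and may differ from the authors' Eqs. (1) and (2); here the plain $j$ is the imaginary unit, $j_l$ is the spherical Bessel function, and $\gamma$ is the angle between the field direction $\Omega$ and the source direction $\Omega_S$.

```latex
% Plane wave of amplitude S_0 arriving from direction \Omega_S
% (assumed time factor e^{j 2 \pi f t}):
P(r,\Omega,\Omega_S,k) = S_0\, e^{\,jkr\cos\gamma}
  = 4\pi S_0 \sum_{l=0}^{\infty} j^{\,l}\, j_l(kr)
    \sum_{m=-l}^{l} Y_l^{m*}(\Omega_S)\, Y_l^{m}(\Omega)

% Orthonormal complex spherical harmonics with elevation \phi and azimuth \theta
% (assumed normalization):
Y_l^{m}(\Omega) = \sqrt{\frac{2l+1}{4\pi}\,\frac{(l-|m|)!}{(l+|m|)!}}\;
  P_l^{|m|}(\sin\phi)\, e^{\,jm\theta}
```

The argument of the associated Legendre polynomial is sin φ rather than the cosine of the polar angle because φ is measured from the horizontal plane.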
In spatial Ambisonics reproduction, suppose M loudspeakers are arranged uniformly on a spherical surface with a sufficiently large radius around the listener. The direction of the i-th loudspeaker is Ωi = (θi, φi), and the corresponding signal amplitude is Ei. The reproduced pressure is then a linear combination of the plane waves from all loudspeakers and can also be decomposed by spherical harmonic functions, as in Eq. (3). Matching Eq. (1) with Eq. (3) and truncating the order to (L-1) yields a set of linear equations for the loudspeaker signals, written in matrix form with the superscript "T" denoting the transpose operator. The column vector YS of length L² represents the L² spherical harmonic components, or normalized independent (encoding) signals, of the (L-1) order Ambisonics.
Y is an L² × M matrix whose elements are the spherical harmonic functions of the loudspeaker directions. The loudspeaker signals are obtained from the encoded signals through the decoding matrix D (Eq. (10)), which is given by the following pseudo-inverse of matrix Y:

$$\mathbf{D} = \mathbf{Y}^{H}\left(\mathbf{Y}\mathbf{Y}^{H}\right)^{-1} \tag{11}$$

where the superscript "H" denotes the Hermitian (conjugate) transpose operator. Eq. (9) indicates that the (L-1) order Ambisonics requires at least L² loudspeakers, that is, M ≥ L². Therefore, as the order increases, the system becomes complex. In addition, an (L-1) order Ambisonics is able to reconstruct the target sound field within a spherical region of radius rH and up to a frequency limit of fmax.H. The relationship among them is given by [6]:

$$f_{\max.H} = \frac{(L-1)\,c}{2\pi r_H} \tag{12}$$

where c = 343 m/s is the speed of sound. Eq. (12) is a consequence of the Shannon-Nyquist spatial sampling theorem and indicates that the radius of the region and the upper frequency limit for accurate reconstruction of the target sound field increase with the order of Ambisonics.
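The mode-matching argument of this section can be summarized compactly as follows. This is a sketch in standard notation, assuming a unit-amplitude source ($S_0 = 1$), the $e^{j2\pi ft}$ time factor, and a matrix $\mathbf{Y}$ of full row rank; whether the matrix elements carry the complex conjugate, and how the steps map onto the authors' equation numbers, are assumptions.

```latex
% Reproduced field: superposition of M loudspeaker plane waves
P'(r,\Omega,k) = 4\pi \sum_{l=0}^{\infty} j^{\,l}\, j_l(kr) \sum_{m=-l}^{l}
  \Big[ \sum_{i=1}^{M} E_i\, Y_l^{m*}(\Omega_i) \Big]\, Y_l^{m}(\Omega)

% Matching every (l, m) term with the target field for l \le L-1:
\sum_{i=1}^{M} E_i\, Y_l^{m*}(\Omega_i) = Y_l^{m*}(\Omega_S),
  \qquad l = 0, \dots, L-1, \quad |m| \le l

% Matrix form (L^2 equations, M unknowns) and minimum-norm solution for M \ge L^2:
\mathbf{Y}\,\mathbf{E} = \mathbf{Y}_S, \qquad
\mathbf{E} = \mathbf{D}\,\mathbf{Y}_S, \qquad
\mathbf{D} = \mathbf{Y}^{H}\left(\mathbf{Y}\mathbf{Y}^{H}\right)^{-1}
```

The L² matching conditions can be satisfied exactly only if there are at least L² unknown loudspeaker signals, which is the origin of the requirement M ≥ L².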

Binaural Ambisonics Reproduction
In traditional binaural reproduction, the input stimulus is filtered by a pair of HRTFs at the target direction and then reproduced by a pair of headphones. Alternatively, in binaural Ambisonics reproduction, each Ambisonics loudspeaker signal is filtered by a pair of HRTFs at the corresponding loudspeaker direction, and the results are summed to form the binaural (headphone) signals. In other words, binaural Ambisonics reproduces the loudspeaker signals by using virtual loudspeakers, as in Eq. (13), where α = L or R denotes the left or right ear, respectively. The minimal number M of virtual loudspeakers needed for (L-1) order binaural Ambisonics reproduction should also satisfy Eq. (9). However, binaural Ambisonics reproduction is free from the restrictions of practical loudspeaker configurations in conventional loudspeaker reproduction, which makes higher-order reproduction realizable. On the other hand, higher-order binaural Ambisonics requires more virtual loudspeakers, and hence more HRTF-based filters, and thus makes the signal processing complex.
In static binaural Ambisonics reproduction, the direction of the target source with respect to the head is fixed. Therefore, the binaural signals in Eq. (13) do not vary when the head turns. In dynamic binaural reproduction, on the other hand, the binaural signals should be constantly updated according to the instantaneous orientation of the head. This can be implemented by two methods. One method is to constantly update the HRTFs in Eq. (13) according to the instantaneous directions of the virtual loudspeakers with respect to the head. Because head turning is equivalent to turning the target source in the opposite direction, the other method is to constantly update the loudspeaker signals Ei in Eq. (13) according to the instantaneous direction of the target source with respect to the head. To avoid the audible artifacts caused by updating the HRTF-based filters, the second method is preferred. Fig. 1 is the block diagram of a dynamic binaural Ambisonics system using the second method. First, the target source direction information is encoded into independent signals according to Eq. (7). The encoded signals are then fed to the decoder described by matrix D in Eq. (11), yielding signals for M loudspeaker reproduction. Finally, the M loudspeaker signals are converted to binaural signals by using HRTF-based filters. During reproduction, the instantaneous head orientation of the listener is detected by a head tracker, based on which the virtual loudspeaker signals Ei are updated.
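The second method can be sketched for the simplest case of pure yaw (rotation about the up-down axis). The function names and the `encode`, `decode` and `binauralize` callables below are hypothetical placeholders for the three stages described above, not the authors' implementation; pitch and roll are ignored in this sketch.

```python
def relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Azimuth of a world-fixed source relative to the turned head, in degrees.

    Both angles are clockwise (0 = front, 90 = right). Turning the head by
    +yaw is equivalent to turning the source by -yaw, so the relative azimuth
    is the difference, wrapped to [0, 360).
    """
    return (source_azimuth_deg - head_yaw_deg) % 360.0


def render_frame(source_azimuth_deg, source_elevation_deg, head_yaw_deg,
                 encode, decode, binauralize):
    """One update of the dynamic chain; the three callables are hypothetical."""
    rel_az = relative_azimuth(source_azimuth_deg, head_yaw_deg)
    encoded = encode(rel_az, source_elevation_deg)  # direction-dependent encoding
    speaker_signals = decode(encoded)               # decoding to M virtual loudspeakers
    return binauralize(speaker_signals)             # fixed HRTF filters, never updated
```

Only the encoding step depends on the tracked head orientation, so the HRTF-based filters stay fixed from one frame to the next.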
In addition, with rH = 0.0875 m taken as the average radius of the head, Eq. (12) also yields the Shannon-Nyquist frequency limit fmax.H for accurate reconstruction of binaural pressures in (L-1) order binaural Ambisonics reproduction.
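As a quick numerical check, the frequency limit can be evaluated for the orders studied in this paper. The sketch assumes that Eq. (12) has the form fmax.H = (L-1)c / (2π rH), which reproduces the limits of about 1.9, 3, 6 and 11 kHz quoted in the results; the function name is illustrative.

```python
import math

SOUND_SPEED = 343.0   # m/s, value given in the text
HEAD_RADIUS = 0.0875  # m, average head radius given in the text


def frequency_limit_hz(order, radius_m=HEAD_RADIUS, c=SOUND_SPEED):
    """Upper frequency limit for Ambisonics order (L-1), assumed form of Eq. (12)."""
    return order * c / (2.0 * math.pi * radius_m)


if __name__ == "__main__":
    for order in (3, 5, 10, 18):
        limit_khz = frequency_limit_hz(order) / 1000.0
        print(f"order {order:2d}: about {limit_khz:.1f} kHz")
```

The limit scales linearly with the order and inversely with the radius of the reconstruction region.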

Method for Analyzing the Error in Binaural Pressures
To evaluate the performance of binaural Ambisonics reproduction with various orders, the error in the binaural pressures is first analyzed. For a target plane wave from direction ΩS, the binaural pressures can be calculated by filtering the input stimulus with a pair of far-field HRTFs at direction ΩS, as in Eq. (14). The binaural pressures for Ambisonics reproduction can also be calculated by filtering each loudspeaker signal with the corresponding HRTFs and then summing, as in Eq. (15). The error in the binaural pressures can be evaluated from Eq. (14) and Eq. (15). The mean normalized square error εα(f) of the complex-valued pressure over MS target directions is calculated as in Eq. (16) [11]. A low εα(f) means a small error in the binaural pressures for Ambisonics reproduction. Similarly, the mean normalized square error εα,mag(f) of the pressure magnitude is calculated by replacing the complex-valued pressures in Eq. (16) with their magnitudes, as in Eq. (17) [11].
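The error analysis can be sketched for a single ear and a single frequency as follows. The normalization used here (squared error divided by the squared target magnitude, averaged over directions and expressed in dB) is an assumption about the form of Eqs. (16) and (17); the exact definition in [11] may differ. The function names are hypothetical.

```python
import math


def reproduced_pressure(speaker_signals, hrtfs):
    """Sum of loudspeaker signals E_i weighted by the HRTF of each loudspeaker direction."""
    return sum(e * h for e, h in zip(speaker_signals, hrtfs))


def mean_normalized_square_error_db(target, reproduced, magnitude_only=False):
    """Mean normalized square error over target directions, in dB (assumed form).

    target, reproduced: complex pressures, one entry per target direction,
    for one ear and one frequency.
    """
    if not target or len(target) != len(reproduced):
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for p, p_hat in zip(target, reproduced):
        if magnitude_only:
            diff = abs(p_hat) - abs(p)  # compare magnitudes only
        else:
            diff = abs(p_hat - p)       # complex-valued error
        total += diff ** 2 / abs(p) ** 2
    mean = total / len(target)
    return float("-inf") if mean == 0.0 else 10.0 * math.log10(mean)
```

A pure phase error between target and reproduced pressure contributes to the complex-valued error but not to the magnitude error, which is why the two measures are reported separately.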

Results of Error in Binaural Pressures
The HRTFs used were obtained by BEM-based calculation on a 3D laser-scanned model of the KEMAR artificial head. The directional resolution of the HRTFs was 1°. The M virtual loudspeakers for binaural Ambisonics reproduction were arranged nearly uniformly on the surface of a sphere. The MS target source directions were also arranged nearly uniformly on the surface of a sphere [12].
As an example, M = 400 and MS = 900 were chosen. According to Eq. (9), M = 400 virtual loudspeakers are suitable for binaural Ambisonics reproduction up to (L-1) = 19 order. Fig. 2(a) shows the mean normalized square error εR(f) of the complex-valued pressure for (L-1) = 3, 5, 10 and 18 order reproduction. Because the results for the left and right ears are similar, Fig. 2(a) shows only the results for the right ear. The vertical lines in Fig. 2(a) mark the Shannon-Nyquist frequency limits fmax.H for (L-1) = 3, 5, 10 and 18 order binaural Ambisonics reproduction. They are 1.9 kHz, 3 kHz, 6 kHz and 11 kHz, respectively, as calculated from Eq. (12).
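The relationship between the number of virtual loudspeakers and the highest supported order can be checked with a few lines of integer arithmetic. This is a minimal sketch assuming the condition M ≥ L² stated in the introduction; the function names are illustrative.

```python
import math


def max_order_for_loudspeakers(num_loudspeakers):
    """Highest order (L-1) whose L**2 components fit into M loudspeakers."""
    if num_loudspeakers < 1:
        raise ValueError("at least one loudspeaker is required")
    return math.isqrt(num_loudspeakers) - 1


def min_loudspeakers_for_order(order):
    """Smallest M that satisfies M >= L**2 for order (L-1)."""
    return (order + 1) ** 2
```

For M = 400 this gives a maximum order of 19, so the 18-order reproduction analyzed here, which needs at least 361 loudspeakers, fits within the chosen layout.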
It is observed that, for each order of reproduction, the error is less than -10 dB below the corresponding Shannon-Nyquist frequency limit fmax.H and increases above it. Increasing the order reduces the errors at high frequencies. In contrast, within the low-frequency range of 0.2 kHz to 0.6 kHz, increasing the order may increase the error. However, in this low-frequency range the errors are always less than -40 dB and thus insignificant. Fig. 2(b) shows the corresponding mean normalized square error εR,mag(f) of the pressure magnitude. As the order increases, the tendency of εR,mag(f) is similar to that of εR(f) around and above the corresponding fmax.H. Within the low-frequency range of 0.2 kHz to 0.6 kHz, increasing the order from 3 to 10 decreases the error, whereas further increasing the order to 18 increases the error. However, in this low-frequency range the errors are always less than -50 dB and thus also insignificant.
The above analysis indicates that binaural Ambisonics reproduction of very high order is required to accurately reconstruct the binaural pressures at high frequencies, which makes the signal processing very complex.

Method for Virtual Source Localization Experiment
A series of virtual source localization experiments was conducted to evaluate the localization performance of binaural Ambisonics reproduction. To evaluate the effect of the Ambisonics order on static and dynamic binaural Ambisonics reproduction, 2 × 3 = 6 combinations of the following conditions were included: (1) two reproduction manners, namely static and dynamic binaural Ambisonics reproduction; (2) three Ambisonics orders, (L-1) = 5, 10 and 18.
In addition, traditional dynamic and static binaural reproductions were chosen as control conditions. In fact, traditional binaural reproduction can be seen as binaural Ambisonics reproduction of infinite order. Therefore, this experiment included 8 reproduction conditions, i.e., (3 different orders + 1 traditional reproduction) × 2 reproduction manners.
The experiment was conducted via a virtual auditory display (VAD) [13]. The VAD was based on a PC running Windows, with software written in C++. An electromagnetic head tracker (Polhemus FASTRAK) detected the orientation of the subject's head. It was able to detect head turning in three degrees of freedom: turning around the left-right axis (pitch), around the front-back axis (tilt or roll) and around the up-down axis (rotation or yaw). The VAD synthesized the binaural signals according to the direction of the target virtual source relative to the instantaneous orientation of the subject's head. The HRTFs used for binaural synthesis were identical to those used in the error analysis. Dynamic and static reproduction were implemented by turning the head tracker on and off, respectively. The synthesized binaural signals were rendered through a pair of headphones (Beyerdynamic DT770PRO). The update rate and system latency of the VAD were 60 Hz and about 25.4 ms, respectively.
Pink noise with full audible bandwidth was used as the stimulus. For each reproduction condition, 16 target virtual source directions were chosen. These directions were located at three elevations, φS = -45°, 0° and 45°, with five azimuths θS = 0°, 45°, 90°, 135° and 180° at each elevation, plus one direction at the top. The experiment was conducted in a listening room with background noise lower than 30 dBA. Eight subjects (four males and four females) aged 23 to 28 years participated in the experiment. All subjects had normal hearing and had experience in localization experiments. Before the formal experiment, all subjects were familiarized with the procedure. During the experiment, the subjects were required to indicate the perceived direction by pointing with an electromagnetic tracker fixed on a stick. For each reproduction condition and target direction, the stimulus was reproduced 3 times in random order, yielding 3 repeats × 8 subjects = 24 judgments.
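The set of target directions and the number of judgments per direction follow directly from this design and can be enumerated as below. This is an illustrative sketch; the variable names are hypothetical, and the azimuth assigned to the top direction is arbitrary.

```python
ELEVATIONS_DEG = (-45, 0, 45)
AZIMUTHS_DEG = (0, 45, 90, 135, 180)
REPEATS = 3
SUBJECTS = 8

# Five azimuths at each of three elevations, stored as (azimuth, elevation).
target_directions = [(az, el) for el in ELEVATIONS_DEG for az in AZIMUTHS_DEG]
# Plus the top direction; azimuth is undefined at the pole, 0 is a placeholder.
target_directions.append((0, 90))

judgments_per_direction = REPEATS * SUBJECTS
```

The list holds 3 × 5 + 1 = 16 directions, and each direction receives 24 judgments per reproduction condition.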

Results of the Virtual Source Localization Experiment
Four indices, namely the percentage of front-back confusion, the percentage of up-down confusion, the unsigned mean angle error Δ2 and the mean discreteness κ⁻¹, are used to evaluate the localization performance. The unsigned mean angle error Δ2 is defined as the average of the angular difference between the perceived direction and the target direction [14], computed from rs(n) and rI(n), the vectors of the target direction and the perceived direction of the nth judgment, respectively. Here N = 3 repeats × 8 subjects = 24 is the total number of judgments, and the notation "⋅" denotes the scalar (dot) product of two vectors. The mean discreteness κ⁻¹ describes the dispersion of the judgments: the lower the value of κ⁻¹, the smaller the dispersion.
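The two direction-based indices can be sketched as follows. The angle error averages the arccosine of the dot product between unit direction vectors, as described above. For κ⁻¹, the sketch uses the common estimate for a Fisher distribution, κ⁻¹ = (N - R)/(N - 1), with R the length of the resultant of the N unit judgment vectors; that this matches the authors' definition is an assumption. Function names are hypothetical.

```python
import math


def _unit(v):
    """Scale a 3-component vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def unsigned_mean_angle_error_deg(targets, judgments):
    """Mean angle (degrees) between target and perceived direction vectors."""
    total = 0.0
    for r_s, r_i in zip(targets, judgments):
        dot = sum(a * b for a, b in zip(_unit(r_s), _unit(r_i)))
        dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
        total += math.degrees(math.acos(dot))
    return total / len(targets)


def inverse_kappa(judgments):
    """Dispersion estimate (N - R) / (N - 1) for N >= 2 judgment vectors (assumed form)."""
    units = [_unit(v) for v in judgments]
    n = len(units)
    resultant = math.sqrt(sum(sum(u[k] for u in units) ** 2 for k in range(3)))
    return (n - resultant) / (n - 1)
```

Identical judgments give R = N and therefore κ⁻¹ = 0, while widely scattered judgments shorten the resultant and increase κ⁻¹.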
Prior to calculating the mean angle error and the mean discreteness, the judged directions in front-back and up-down confusion cases are resolved. That is, the front-back and up-down confusions are corrected by reflecting the judgments about the appropriate plane before the analysis. In addition, a series of homogeneity tests is conducted to check the consistency of the raw localization results. The Kruskal-Wallis H test at a significance level of α = 0.05 is used for the homogeneity tests. The results show no significant differences in any of the tests, i.e., the localization results are consistent across subjects and repetitions and are therefore reliable and stable.
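Resolving confusions by reflection can be sketched in the clockwise coordinate system used here. The detection criterion below (target and judgment on opposite sides of the lateral plane or of the horizontal plane) is a simple assumption for illustration; the paper does not state its exact criterion, and stricter rules are possible. Function names are hypothetical.

```python
import math

_EPS = 1e-9  # directions this close to a mirror plane are not treated as confusable


def is_front_back_confused(target_az_deg, judged_az_deg):
    """True if target and judgment lie on opposite sides of the lateral plane."""
    t = math.cos(math.radians(target_az_deg))
    j = math.cos(math.radians(judged_az_deg))
    if abs(t) < _EPS or abs(j) < _EPS:
        return False
    return t * j < 0.0


def is_up_down_confused(target_el_deg, judged_el_deg):
    """True if target and judgment lie on opposite sides of the horizontal plane."""
    return target_el_deg * judged_el_deg < 0.0


def resolve_confusions(target, judged):
    """Reflect a judged (azimuth, elevation) into the target's hemispheres."""
    (t_az, t_el), (j_az, j_el) = target, judged
    if is_front_back_confused(t_az, j_az):
        j_az = (180.0 - j_az) % 360.0  # mirror about the lateral plane
    if is_up_down_confused(t_el, j_el):
        j_el = -j_el                   # mirror about the horizontal plane
    return (j_az, j_el)
```

For example, a judgment at (150°, -30°) for a target at (45°, 45°) is reflected to (30°, 30°) before the angle error is computed.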
Tab. 1 lists the percentages of front-back (F-B) and up-down (U-D) confusion, the unsigned mean angle error and the mean discreteness of the localization results for the various reproduction conditions. The statistics are computed over the target virtual source directions, all subjects, and all repeats of each subject. When calculating the percentages of front-back confusion, the top direction and the directions at azimuth θS = 90° are excluded from the statistical analysis. When calculating the percentages of up-down confusion, all directions at elevation φS = 0° are excluded from the statistical analysis.
A. Front-back confusion
In static reproduction, the localization results exhibit high percentages of front-back confusion (from 41.1% to 49.7%) for both traditional binaural reproduction and binaural Ambisonics reproduction of various orders. Dynamic reproduction reduces, or even almost eliminates, the front-back confusion. The results of a multi-way ANOVA indicate that the reproduction manner (static/dynamic) is significant for traditional binaural reproduction and for binaural Ambisonics reproduction of various orders, whereas the order of binaural Ambisonics is not significant for either static or dynamic reproduction.

B. Up-down confusion
For both static and dynamic binaural Ambisonics reproduction, the percentage of up-down confusion decreases as the Ambisonics order increases. The up-down confusion of dynamic Ambisonics reproduction is clearly less than that of static reproduction. For the same reproduction manner (static or dynamic), 5-order Ambisonics reproduction exhibits a much higher percentage of up-down confusion than traditional binaural reproduction, whereas 18-order reproduction yields a percentage of up-down confusion similar to that of traditional binaural reproduction. The results of a multi-way ANOVA indicate that both the Ambisonics order and the reproduction manner are significant.
C. Unsigned mean angle error and mean discreteness
For static traditional reproduction and static binaural Ambisonics reproduction of different orders, the unsigned mean angle error and the mean discreteness are similar across conditions and relatively high. With all other conditions equal, the unsigned mean angle error and mean discreteness of dynamic reproduction are lower than those of static reproduction. In addition, the unsigned mean angle error and mean discreteness of dynamic binaural Ambisonics reproduction roughly decrease with the order. The 18-order static or dynamic binaural Ambisonics reproduction yields performance similar to that of traditional static or dynamic binaural reproduction, respectively. The results of a multi-way ANOVA validate the above observations.
Fig. 6 plot the localization results for the 8 reproduction conditions according to the method in [15]. In these figures, the judged directions in front-back and up-down confusion cases have been resolved. The symbols "+" and "o" stand for the target directions and the mean directions of the judgments, respectively. A blue solid ellipse or a green dashed ellipse means that the judgments are classified statistically as a Fisher or Kent distribution, respectively, at a 95% confidence level. Each ellipse describes the discreteness of the judgments about the mean direction. The results show that 5-order and 10-order static binaural Ambisonics reproduction yield lower localization accuracy in terms of mean direction and discreteness. Dynamic reproduction improves localization performance. For dynamic reproduction, the mean direction and the discreteness improve continuously as the order increases from 5 (fmax.H = 3 kHz) to 10 (fmax.H = 6 kHz) and then to 18 (fmax.H = 11 kHz). For both static and dynamic binaural Ambisonics reproduction, the (L-1) = 18 order reproduction yields localization performance similar to that of traditional binaural reproduction. In the dynamic case, even the 10-order reproduction achieves performance comparable to that of traditional binaural reproduction.

A Supplementary Experiment for the 3-Order Dynamic Binaural Ambisonics Reproduction
The results of the above experiments indicate that dynamic binaural Ambisonics reproduction yields better localization performance than static reproduction of the same order. Moreover, 5-order dynamic binaural Ambisonics reproduction exhibits localization performance comparable to or even better than that of traditional static binaural reproduction, although it shows a slightly higher percentage of up-down confusion. To further examine the performance of dynamic binaural Ambisonics reproduction of lower order, a supplementary localization experiment on 3-order dynamic binaural reproduction was conducted.
The experimental conditions and method of analysis were identical to those in Sections 4.1 and 4.2. The results are also listed in Tab. 1. As observed, 3-order dynamic binaural Ambisonics reproduction exhibits localization performance comparable to or even better than that of traditional static binaural reproduction or 18-order static binaural reproduction. Fig. 7 plots the localization results.

Discussion
The results of the above experiments indicate that, for static reproduction, binaural Ambisonics of very high order (for example, order 18 or higher) is required to achieve localization performance similar to that of traditional binaural reproduction. Static binaural Ambisonics of insufficient order degrades the localization performance, exhibiting higher percentages of front-back and up-down confusion as well as larger unsigned mean angle error and mean discreteness. Dynamic binaural Ambisonics reproduction clearly improves localization performance compared with static reproduction of the same order. A 3- to 5-order dynamic binaural Ambisonics reproduction is enough to achieve adequate localization performance.
Interaural cues, especially the low-frequency interaural time difference (ITD) below 1.5 kHz, dominate lateral localization. Both the high-frequency spectral cue and the dynamic cue contribute to front-back and vertical localization [16,17]. The combination of these two cues enhances localization. However, the information provided by the spectral and dynamic cues is somewhat redundant: one cue alone enables front-back and vertical localization to some extent when the other cue is absent.
For static reproduction, the dynamic cue is omitted, and front-back and vertical localization therefore depend on the spectral cue. According to Eq. (12), with rH = 0.0875 m taken as the average radius of the human head, an (L-1) = 18 or higher order binaural Ambisonics is required to recreate the correct binaural pressures up to 11-12 kHz (which barely covers the high-frequency spectral range for localization). Therefore, the experimental results for static reproduction are consistent with the simple analysis from Eq. (12). In addition, Tab. 1 indicates that even the localization performance of traditional binaural reproduction is somewhat unsatisfactory. This is because non-individualized HRTFs were used in the present work. Using individualized HRTFs in traditional binaural synthesis and reproduction improves localization performance [18]. When the dynamic cue is included, front-back and vertical localization in dynamic binaural Ambisonics reproduction depend less on the spectral cue. It can be estimated from Eq. (12) that 3-order Ambisonics is able to create the correct binaural pressures up to 1.9 kHz, which covers the frequency range (up to 1.5 kHz) in which the ITD and its dynamic variation act as the dominant localization cues. Of course, increasing the order of dynamic binaural reproduction enhances the spectral cue at high frequencies and thus further improves localization. Therefore, the experimental results for dynamic reproduction are also consistent with the simple analysis from Eq. (12).
There are two practical applications of binaural Ambisonics. One is to convert Ambisonics signals for headphone reproduction. In practice, because the order of the original Ambisonics signals is usually not high enough (typically no more than 3 to 5), dynamic binaural Ambisonics reproduction is preferred. Otherwise, static binaural Ambisonics reproduction of insufficient order will degrade the localization performance.
The other application of binaural Ambisonics is to synthesize the binaural signals for headphone reproduction directly. M = L² pairs of HRTF-based filters are required for (L-1) order binaural Ambisonics synthesis, independent of the number of virtual sources. In contrast, one pair of HRTF-based filters is required for each virtual source in traditional binaural synthesis. Therefore, when a single virtual source or only a few virtual sources are synthesized, the signal processing of traditional binaural synthesis is simpler than that of binaural Ambisonics, and much simpler than that of binaural Ambisonics of very high order (for example, order 18). In this case, traditional binaural synthesis is preferred, especially for static reproduction. Of course, individualized HRTFs are needed to further improve the localization performance of static reproduction, which may be somewhat difficult in practice.
On the other hand, when a complex virtual auditory scene with multiple sound sources (including direct sound sources and image sources for room reflections) is synthesized, the signal processing of binaural Ambisonics may be much simpler than that of traditional binaural synthesis, especially for dynamic binaural Ambisonics of moderate order. This is because multiple sound sources share a common set of HRTF-based filters in binaural Ambisonics synthesis. The number of common HRTF-based filters depends only on the order of binaural Ambisonics and is independent of the number of virtual sources to be synthesized. Moreover, dynamic binaural Ambisonics synthesis avoids the audible artifacts caused by updating the HRTF-based filters in traditional dynamic binaural synthesis [19]. Therefore, dynamic binaural Ambisonics synthesis of appropriate order is suitable for synthesizing a complex auditory scene with multiple sound sources.
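The trade-off between the two synthesis approaches can be illustrated by counting HRTF-based filter pairs. This sketch assumes the minimum loudspeaker count M = L² and compares filter pairs only; the cost of encoding and decoding in binaural Ambisonics is ignored, and the function names are illustrative.

```python
def filter_pairs_ambisonics(order):
    """HRTF filter pairs for (L-1) order binaural Ambisonics with M = L**2."""
    return (order + 1) ** 2


def filter_pairs_traditional(num_sources):
    """HRTF filter pairs for traditional synthesis: one pair per virtual source."""
    return num_sources


def ambisonics_needs_fewer_filters(order, num_sources):
    """True if binaural Ambisonics uses fewer filter pairs than traditional synthesis."""
    return filter_pairs_ambisonics(order) < filter_pairs_traditional(num_sources)
```

Under these assumptions, a 3-order system uses 16 filter pairs and becomes the cheaper option once a scene contains more than 16 sources, whereas an 18-order system uses 361 pairs.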

Conclusions
Both the dynamic cue and the high-frequency spectral cue contribute to front-back and vertical localization. Owing to the lack of the dynamic localization cue and to errors in the high-frequency spectral cue, static binaural Ambisonics reproduction of insufficient order degrades the localization performance in terms of front-back confusion, up-down confusion, unsigned mean angle error and mean discreteness. To create the correct binaural pressures up to 11 kHz (which barely covers the high-frequency spectral range for localization), binaural Ambisonics of order 18 or higher is required, which makes the signal processing rather complex. By including the dynamic localization cue, dynamic binaural Ambisonics reproduction exhibits much better localization performance than static binaural Ambisonics reproduction of the same order. A 3- to 5-order dynamic binaural Ambisonics reproduction is enough to achieve adequate localization performance even if non-individualized HRTFs are used in binaural synthesis.
The results of the present work are applicable to the design of VADs for various uses. Of course, the quality of a VAD is not determined solely by its localization performance. Timbre is another important perceptual attribute of a VAD. The timbre of binaural Ambisonics reproduction should be explored in the future.