Ultrasonic Sensor-Based Personalized Multichannel Audio Rendering for Multiview Broadcasting Services

An ultrasonic sensor-based personalized multichannel audio rendering method is proposed for multiview broadcasting services. Multiview broadcasting, a representative next-generation broadcasting technique, renders video image sequences captured by several stereoscopic cameras from different viewpoints. To achieve realistic multiview broadcasting, multichannel audio that is synchronized with a user's viewpoint should be rendered in real time. For this reason, both a real-time person-tracking technique for estimating the user's position and a multichannel audio rendering technique for virtual sound localization are necessary in order to provide realistic audio. Therefore, the proposed method is composed of two parts: a person-tracking method using ultrasonic sensors and a multichannel audio rendering method using MPEG Surround parameters. In order to evaluate the perceptual quality and localization performance of the proposed method, a MUSHRA listening test is conducted, and the directivity patterns are investigated. It is shown from these experiments that the proposed method provides better perceptual quality and localization performance than a conventional multichannel audio rendering method that also uses MPEG Surround parameters.


Introduction
Recently, a wide range of multimedia technologies for accessing multimedia content through digital TVs (DTVs), personal media players (PMPs), and digital cameras has been developed rapidly. This development is particularly evident in the field of broadcasting services, which has progressed toward more realistic and immersive broadcasting [1][2][3][4][5]. To this end, a representative next-generation broadcasting service that supports realistic and immersive multimedia is currently entering the spotlight in the form of 3-dimensional television (3DTV) technologies [5][6][7].
3DTV is a technology that provides realistic and stereoscopic video content to users and can be further classified into stereoscopic and multiview methods. Stereoscopic 3DTV is currently being produced and sold on the market and has become an essential component for watching 3D movies at home. As a glasses-free alternative, however, multiview-based 3DTV is emerging as an attractive option, since it not only delivers more realistic visual content to users but also has a wider viewing range. Thus, there is a great deal of ongoing research on multiview TVs that attempts to miniaturize the screen size and reduce the price [7].
Multiview broadcasting renders the video sequences captured by a set of cameras from different viewpoints. By rendering these video sequences on a multiview monitor or a multiview TV, users can experience 3D effects from different viewpoints without requiring 3D glasses [7]. Under a multiview broadcasting framework, however, the transmitted multichannel audio signal must also be realistically rendered at different viewpoints in order to increase both the visual and auditory realism. To realize such an audio service, two sequential processes are necessary: (1) tracking the user's viewpoint and (2) rendering the multichannel audio specifically at the user's position. Thus, this paper proposes a person-tracking-based multichannel audio rendering method for multiview broadcasting services, in which person tracking is performed using ultrasonic sensors, and multichannel audio rendering is performed using MPEG Surround parameters.
The remainder of this paper is organized as follows. Following this introduction, Section 2 briefly explains a multiview broadcasting system. Next, Section 3 proposes an ultrasonic-based person-tracking method for a personalized audio service. After that, Section 4 describes a conventional parameter-based audio rendering method and then proposes a new rendering method using MPEG Surround parameters on the basis of the constant power panning law. Section 5 then evaluates the performance of the proposed method in terms of perceptual audio quality and audio localization. Finally, this paper is concluded in Section 6.

Multiview Broadcasting System
Figure 1 presents a schematic diagram of a multiview and multichannel audio broadcasting system. As shown in this figure, the broadcasting system is composed of two parts: the first part acquires and transmits multiview images and multichannel audio contents, and the second part renders and plays the resultant multiview images and multichannel audio. In the first part, multiview videos consist of video sequences that are simultaneously captured by a set of cameras placed at different viewpoints, which can then be encoded using a video encoder such as H.264. On the other hand, multichannel audio contents are recorded using multiple microphones or a microphone array and are then encoded using an audio codec such as MPEG-2 advanced audio coding (AAC). Next, both video and audio contents are transmitted to a multiview receiver via a broadcasting network. In the second part, the transmitted multiview video contents are processed and rendered to generate 3D contents that are adjusted to the particular viewpoint of each user. Similarly, multichannel audio is rendered for each viewpoint and played through 5.1 multichannel loudspeakers or stereo headphones.

Ultrasonic Sensor-Based Person Tracking
In this section, we describe how the viewpoint of a user can be estimated in order to deliver audio effects appropriate to a particular viewpoint, as mentioned in Sections 1 and 2. Recently, a number of methods pertaining to person tracking have been reported [8][9][10][11][12], which are commonly classified into two categories: vision-based tracking and active sensor-based tracking. The former tracks a person's eyes or face [8][9][10][11][12], and the latter tracks a person's position using sensors such as an active badge, a radio frequency identification (RFID) device [11], or other sensors [12,13]. It should be noted that vision-based tracking methods have a disadvantage in terms of processing time, since they are based on image-processing techniques. In contrast, active sensor-based tracking methods can be implemented with less processing time than vision-based tracking methods but require sensors for estimating the viewpoint of each user. Among them, tracking methods utilizing ultrasonic devices have been shown to provide comparatively high accuracy and to be relatively inexpensive compared to RFID tags or other active badge devices [14,15]. Consequently, in this paper, a person-tracking system using ultrasonic devices is constructed, which consists of two ultrasonic transducers and an ultrasonic receiver for person tracking.
Figure 2 presents the block diagram of a person-tracking system for estimating the user's viewpoint, where an ultrasonic receiver attached to the user's headphones or clothes receives an ultrasonic signal from two ultrasonic transducers. The distance between the ultrasonic receiver and each transducer is estimated and then delivered to a person-tracking server over Bluetooth. Finally, the server estimates the viewpoint using a triangulation technique.
Figure 3 shows how to calculate the view position, or coordinates, of the user by using the two ultrasonic sensors. The detailed procedure for person tracking is as follows. First, the relative distance between the $i$th ultrasonic sensor and the receiver, $d_i$, is calculated using

$$d_i = \sqrt{(x_{\text{receiver}} - x_i)^2 + (y_{\text{receiver}} - y_i)^2}, \qquad (1)$$

where $(x_{\text{receiver}}, y_{\text{receiver}})$ and $(x_i, y_i)$ are the coordinates of the receiver and the $i$th sensor, respectively. From (1), the coordinates of the receiver are then calculated by solving the two distance equations for $(x_{\text{receiver}}, y_{\text{receiver}})$. Finally, $(x_{\text{receiver}}, y_{\text{receiver}})$ is passed to multichannel audio panning in order to provide auditory realism in the multiview system.
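As a minimal sketch of this triangulation step, the following Python code solves the two distance equations in closed form. It assumes, beyond what the paper states, that both sensors lie on one horizontal line (for example, along the top of the display), that the user stands on the positive-y side of that line, and that the screen centre is at x = 0. The function names `locate_receiver` and `viewing_angle_deg` are hypothetical.

```python
import math


def locate_receiver(d1, d2, x1, x2, y_sensors=0.0):
    """Estimate the receiver position from its distances to two sensors.

    Assumption (not from the paper): both sensors sit at the same height
    y_sensors, and the receiver is on the positive-y side of them.
    """
    if x1 == x2:
        raise ValueError("the two sensors must have different x positions")
    # Subtracting the two squared-distance equations cancels y and the
    # quadratic term in x, leaving a linear equation in x.
    x = (d1 ** 2 - d2 ** 2 + x2 ** 2 - x1 ** 2) / (2.0 * (x2 - x1))
    # Back-substitute into the first equation; clamp small negative values
    # that measurement noise can produce.
    y_squared = max(d1 ** 2 - (x - x1) ** 2, 0.0)
    return x, y_sensors + math.sqrt(y_squared)


def viewing_angle_deg(x, y, y_sensors=0.0):
    """Angle of the receiver from the screen normal, in degrees.

    Assumes the screen centre is at x = 0 on the sensor line.
    """
    return math.degrees(math.atan2(x, y - y_sensors))
```

For example, with sensors at x = -0.5 m and x = 0.5 m, the distances measured from a receiver at (0.3, 2.0) map back to that position. With only two sensors, the mirror-image solution behind the sensor line is indistinguishable, which is why the sketch fixes the sign of y.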

Parameter-Based Audio Rendering
Figure 4 presents the block diagram for the proposed parameter-based audio rendering method, which is based on the constant power panning law using MPEG Surround parameters [16,17]. In this figure, panning gains in the proposed method are first calculated according to the user's viewpoint, and $N$ different channel level difference (CLD) parameters are extracted from the audio bitstream after applying a CLD parser. Next, the CLD parameters are transformed into absolute gain values, that is, six channel power gains for the 5.1 audio channels. The relationship between the scale factors for the CLD parameters and the channel power gains is given by (3) [16,18], where $i$ is the channel index, and $G_i$ and $G_{i+1}$ are the $i$th and the $(i+1)$th channel power gains, respectively. Note here that the two channels must be adjacently located. If $i$ is equal to $N$, $G_{i+1}$ indicates $G_1$. In (3), $\sigma_{i,i+1}$ is a scale factor transformed from the CLD using the relationship in (4), where $\mathrm{CLD}_{i,i+1}$ is the CLD parameter between the $i$th and the $(i+1)$th channels. Next, the channel power gains are modified depending on the panning gains calculated from a particular viewpoint, and the modified channel power gains are finally converted back into CLD parameters to create a modified bitstream for the MPEG Surround decoder.
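To make the CLD-to-gain conversion concrete, the sketch below uses the usual definition of a channel level difference as a power ratio in decibels, $\mathrm{CLD} = 10 \log_{10}(P_a / P_b)$. This is an assumption about the general form only: the paper's own scale-factor equations, the CLD quantization tables, and the one-to-two (OTT) tree structure of MPEG Surround are not reproduced, and the function names are hypothetical.

```python
import math


def cld_to_power_gains(cld_db):
    """Split unit power between two channels according to a CLD in dB.

    Assumes CLD = 10 * log10(P_a / P_b); returns (g_a, g_b), g_a + g_b = 1.
    """
    ratio = 10.0 ** (cld_db / 10.0)
    return ratio / (1.0 + ratio), 1.0 / (1.0 + ratio)


def power_gains_to_cld(g_a, g_b, floor=1e-12):
    """Convert two channel power gains back into a CLD in dB.

    The floor avoids log(0) when one channel is silent; its value is an
    arbitrary choice for this sketch.
    """
    return 10.0 * math.log10(max(g_a, floor) / max(g_b, floor))
```

A CLD of 0 dB splits the power equally, and converting the resulting gains back returns the original CLD, which mirrors the parse, modify, and re-encode path described above.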
Several approaches have been proposed for audio panning in the MPEG Surround parameter domain [19][20][21][22][23]. For example, the constant power panning law was directly applied to the channel power gains according to the desired panning angle [20,21]. However, in such a direct application, the panned sound image was incorrectly localized or disappeared when the desired panning angle was larger than the aperture angle of the speaker pair. This problem arose because the audio rendering coverage was limited to the aperture angle between two speakers and each transformed channel power gain was related to only two adjacent channels.
To remedy this problem, the proposed method applies the constant power panning law to the channel power gains according to the minimum aperture angle, instead of the desired panning angle.This change is especially effective when the desired panning angle is larger than any other aperture angle among the speaker pairs.In this section, a conventional channel power gain modification method in [20,21] is reviewed, and then the proposed method is described in detail.

Conventional Channel Power Gain Modification.
To follow the user's viewpoint estimated as described in Section 3, the angles to be panned are computed and denoted as $\theta_i$ for $i = 1, 2, \ldots, N$ (Figure 5). Note that $N = 5$ for a 5.1-channel speaker configuration and that the angle associated with the user's viewpoint is $\theta_1$. In addition, the low frequency enhancement (Lfe) channel is omitted because it can be generated by using the other five channels. In a conventional channel power gain modification method [20,21], the proportion of $\theta_i$ to the aperture angle between the $i$th and the $(i+1)$th speakers, $\theta_{i,i+1}$, is calculated as in (5), where $\theta_{N,N+1} = \theta_{N,1}$. Next, the panning gains associated with $G_{\text{out},i}$ are calculated as in (6), where $G_i$ and $g_i$ denote the power gain of the $i$th input channel and the panning gain that is contributed from the $i$th input channel to the $(i+1)$th speaker, respectively. In addition, the power gain of the center channel is used as the panning gain of the Lfe channel. However, the conventional audio panning method described previously has some drawbacks. First, due to the sine-law amplitude panning method [24], possible panning angles in the conventional method are limited by the aperture angle of each pair of loudspeakers. Second, the conventional method does not consider the interchannel coherence (ICC) parameters for panning, though the ICC parameters play an important role in providing the spatial diffuseness of audio quality as well as localization performance at low frequencies [20].
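The idea of pairwise constant power panning can be sketched as follows. The code uses one common sine/cosine form of the constant power law, in which the proportion of the panning angle to the aperture angle is mapped onto a quarter circle; the paper's exact expressions may differ in detail, and the function name `constant_power_pan` is hypothetical.

```python
import math


def constant_power_pan(power_in, pan_deg, aperture_deg):
    """Distribute one channel's power over a pair of adjacent speakers.

    Returns (power kept by the original speaker, power moved to the next
    speaker). The two parts always sum to power_in.
    """
    if aperture_deg <= 0.0:
        raise ValueError("aperture angle must be positive")
    if not 0.0 <= pan_deg <= aperture_deg:
        # This is the limitation noted in the text: a single speaker pair
        # cannot represent a panning angle beyond its aperture.
        raise ValueError("panning angle must lie within the aperture")
    phi = (pan_deg / aperture_deg) * (math.pi / 2.0)
    return power_in * math.cos(phi) ** 2, power_in * math.sin(phi) ** 2
```

Panning by half of a 30° aperture splits the power equally between the two speakers, while a request for 60° on the same pair is rejected, which illustrates the first drawback described above.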

Proposed Channel Power Gain Modification.
In this section, a new audio panning method is proposed to overcome the drawbacks of the conventional method. Figure 6 shows the procedure for the proposed channel power gain modification method. In this figure, each panning angle calculated from the user's viewpoint, $\theta_i$, is first compared to the apertures of all loudspeaker pairs, for example, the five pairs of loudspeakers for the 5.1-channel speaker configuration, $\theta_{i,i+1}$ for $i = 1, 2, \ldots, 5$. Then, if the panning angle is smaller than the minimum aperture angle, the conventional method described in Section 4.1 is applied for audio panning. Otherwise, each output signal is rearranged to adjacent channels in advance before CLD panning is applied to each pair. This procedure overcomes the problem in which each channel component disappears in the output channels when the panning angle is larger than the aperture angle in the sine-law amplitude panning method [24]. In other words, the output channels are arranged into another output channel corresponding to this minimum aperture angle before the panning process is applied. In addition, the remaining angle $\theta_{\text{remain}}$ can be obtained relative to the desired panning angle $\theta_i$ using (7). Next, similar to (5), the proportion of $\theta_{\text{remain}}$ to the aperture angle between the $i$th and the $(i+1)$th speakers, $\theta_{i,i+1}$, can be calculated as in (8). The modified panning gains associated with $G'_{\text{out},i}$ are then calculated as in (9), where $G_i$ and $g'_i$ denote the power gain of the $i$th input channel and the modified panning gain that is contributed from the $i$th input channel to the $(i+1)$th speaker, respectively. Thus, the actual output gains of each channel are calculated as in (10), where $G_{\text{out},j}$ and $g_k(j)$ denote the actual output gain of output channel $j$ and the panned signal component corresponding to each speaker pair $k$, that is, (Ls&L), (L&C), (C&R), (R&Rs), and (Rs&Ls).
Finally, panned CLDs are obtained from both the conventional and proposed modification methods and are reestimated from the panning gains, where $\mathrm{CLD}^{\text{panned}}_{k}$ denotes the channel level difference of the panned audio from the $k$th one-to-two (OTT) box. In addition, $G_{\text{out},j}$ denotes the panning gain calculated for each channel, where $j$ is replaced with R (right channel), L (left channel), C (center channel), Rs (right surround), and Ls (left surround). Subsequently, the panned CLDs are used for MPEG Surround decoding, resulting in the panned multichannel audio shown in Figure 7 [16,17].
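One possible reading of this procedure is sketched below: a channel is first handed over to the neighbouring speaker for every full aperture that the panning angle covers, and only the remaining angle is then panned within the final speaker pair. This is a simplified illustration, not the authors' equations. The speaker azimuths follow the common 5.1 layout (0°, ±30°, ±110°) and are an assumption, as are the clockwise ring order and all names in the code.

```python
import math

# Assumed speaker azimuths in degrees, clockwise positive (not from the paper).
SPEAKER_AZIMUTH = {"C": 0.0, "R": 30.0, "Rs": 110.0, "Ls": 250.0, "L": 330.0}
RING = ["C", "R", "Rs", "Ls", "L"]  # clockwise order around the listener


def aperture(i):
    """Aperture angle between ring speaker i and its clockwise neighbour."""
    here = SPEAKER_AZIMUTH[RING[i]]
    nxt = SPEAKER_AZIMUTH[RING[(i + 1) % len(RING)]]
    return (nxt - here) % 360.0


def pan_channel(start, power_in, pan_deg):
    """Pan one input channel clockwise by pan_deg; return power per speaker."""
    if pan_deg < 0.0:
        raise ValueError("this sketch handles clockwise panning only")
    i = RING.index(start)
    remain = pan_deg
    # Rearrange the channel to the adjacent speaker while the remaining
    # angle still covers the whole aperture of the current pair.
    while remain >= aperture(i):
        remain -= aperture(i)
        i = (i + 1) % len(RING)
    # Constant power panning of the remaining angle within the final pair.
    phi = (remain / aperture(i)) * (math.pi / 2.0)
    out = {name: 0.0 for name in RING}
    out[RING[i]] += power_in * math.cos(phi) ** 2
    out[RING[(i + 1) % len(RING)]] += power_in * math.sin(phi) ** 2
    return out


def pan_all(powers, pan_deg):
    """Sum the panned contributions of every input channel per speaker."""
    total = {name: 0.0 for name in RING}
    for name, power in powers.items():
        for speaker, value in pan_channel(name, power, pan_deg).items():
            total[speaker] += value
    return total
```

With these assumed azimuths, panning the center channel by 60° leaves no power in C and distributes it between R and Rs, whereas a single C&R pair with a 30° aperture could not represent that angle at all. The total power over all speakers is preserved.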

Performance Evaluation
To evaluate the performance of the proposed audio panning method, the perceptual quality and localization performance were compared to those obtained using the conventional method. During these experiments, a multiple stimulus with hidden reference and anchor (MUSHRA) test [25] was conducted in order to evaluate the perceptual quality, and a directivity pattern analysis was used to evaluate the localization performance.

Perceptual Quality.
For the MUSHRA listening test, we used the following as references and candidates: (1) a hidden reference, (2) a 7 kHz low-pass filtered anchor, (3) a 14 kHz low-pass filtered anchor, (4) audio signals processed by conventional CLD-based audio panning [20,21], and (5) audio signals processed by the proposed CLD-based audio panning. Three music genres (classical, rock, and heavy metal) were selected as audio signals, and ten people with no hearing problems participated in these experiments. Figure 8 illustrates the experimental results of the MUSHRA test. When the panning angle was smaller than the minimum aperture angle, for example, at a 30° panning angle, the proposed method had audio quality comparable to the conventional method, except for classical music signals. The MUSHRA score for classical music signals processed by the conventional CLD-based panning method was slightly higher than that for the proposed CLD-based panning method because classical music signals were less dynamic than those from other genres such as rock and heavy metal. In other words, while the conventional method computed panning gains once for every pair of channels by applying (6), the proposed method computed each panning gain by taking into account more than two channels, as shown in (10). This resulted in perceptual degradation for classical music signals. In spite of such an artifact, it was found that the spatial impression for panned audio processed by the proposed method was more stable than that for the conventional method.
On the other hand, when the panning angle was larger than the minimum aperture angle, for example, at a 60° panning angle with a 30° minimum aperture, the audio quality of the panned audio processed by the conventional method notably degraded. Even though the proposed method had a smaller MUSHRA score for classical music signals than the conventional method, it was also found that the participants heard unnatural artificial noise due to incorrect panning when the panning angle was larger than the minimum aperture.

Localization Performance.
To evaluate the localization performance, panned audio with only one channel signal was played, and the frequency response was measured using a dummy head. The directivity patterns for panning angles of 0°, 30°, and 60° were then analyzed. The amplitudes of the frequency responses at 500 Hz were measured by rotating the dummy head in increments of about 10°. For this experiment, a KU100 dummy head [26] was used.
Figure 9 shows the directivity patterns of the panned signals for 30° and 60° at 500 Hz. To estimate the position of the sound image localization, it was assumed that the sound image was localized at the position exhibiting maximum power. As illustrated in this figure, the measured power became maximal at a rotated position of about 90°, which corresponds to a forward-facing direction, when no audio panning was applied. Similarly, the measured power became maximal at a rotated position of about 120°, relative to the panned direction, when an audio panning of 30° was applied. It can also be seen that the directivity pattern of the conventional method is correctly presented for a panning angle of 30°. However, when the panning angle was increased to 60°, the polar pattern of the conventional method was not correctly presented, whereas the directivity pattern obtained by the proposed CLD-based panning method shows that the audio signal rotated in the correct direction, although there were localization errors of around 5° to 10°.
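The maximum-power assumption used in this analysis can be expressed in a few lines. The sketch below takes a directivity pattern sampled at a set of rotation angles, picks the angle with the largest measured power, and compares it with the expected direction, using the convention from the text that a rotated position of 90° is forward facing. The function names are hypothetical, and the values in the usage note are synthetic, not measurements from the paper.

```python
def localized_angle(rotation_deg, power):
    """Return the rotation angle at which the measured power is largest."""
    if len(rotation_deg) != len(power) or len(power) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    best = max(range(len(power)), key=lambda k: power[k])
    return rotation_deg[best]


def localization_error(rotation_deg, power, pan_deg, forward_deg=90.0):
    """Signed difference between the measured peak and the expected angle.

    forward_deg = 90 follows the text: 90 degrees is forward facing, so a
    panning of pan_deg is expected to peak at forward_deg + pan_deg.
    """
    return localized_angle(rotation_deg, power) - (forward_deg + pan_deg)
```

For a synthetic pattern sampled every 10° that peaks at 130°, the estimated error for a 30° panning angle is 10°. Because the peak can only fall on a sampled angle, the resolution of such an estimate is bounded by the rotation step.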

Conclusion
In this paper, an ultrasonic sensor-based personalized multichannel audio rendering method was proposed to increase audio realism in multiview broadcasting services. To this end, a real-time person-tracking method was first developed by using two ultrasonic transducers and an ultrasonic receiver in order to estimate the viewpoint of a user. Second, a parameter-based audio panning method using MPEG Surround parameters was proposed to increase the auditory realism. In the proposed method, panning gains were calculated according to the user's viewpoint estimated by the ultrasonic-based person-tracking method. Next, five different channel level difference (CLD) parameters were extracted from the audio bitstream after applying a CLD parser. Finally, the CLD parameters were transformed into six channel power gains for the 5.1 audio channels. In particular, the proposed method applied the constant power panning law to the channel power gains according to the minimum aperture angle, instead of the desired panning angle that was used for a conventional panning method. Thus, the proposed method could be more effective than the conventional method when the desired panning angle was larger than any other aperture angle among the speaker pairs. In order to evaluate the performance of the proposed audio panning method, the perceptual quality and localization performance were assessed using a MUSHRA test and a directivity pattern analysis, respectively. Consequently, it was shown from the tests that the proposed audio panning method achieved a better average MUSHRA score and better localization performance than the conventional audio panning method.

Figure 1: Schematic diagram of a multiview and multichannel audio broadcasting system.

Figure 3: Calculation of the coordinates of a user's view position using two ultrasonic sensors and an ultrasonic receiver.

Figure 4: Block diagram of the proposed audio rendering method using MPEG Surround parameters.

Figure 5: Schematic diagram of the relationship between the panning and aperture angles in a 5.1-channel speaker configuration, where C, L, R, Ls, and Rs denote the center, left, right, left surround, and right surround channels, respectively.