Three-dimensional Sound Source Localization Using Inter-channel Time Difference Trajectory

Sound source localization is one of the basic and essential techniques for intelligent robots in terms of human-robot interaction and has been utilized in various engineering fields. This paper suggests a new localization method using an inter-channel time difference trajectory, which is a new localization cue for efficient 3-D localization. As one of the ways to realize the proposed cue, a two-channel rotating array is employed. Two microphones are attached on the left and right sides of the spherical head. One microphone is in a circular motion on the right side, while the other is fixed on the left side. According to the rotating motion of the array, the (source) direction-dependent characteristics of the trajectories are analysed using the Ray-Tracing formula extended for 3-D models. In simulation, the synthesized signals generated by the fixed and rotating microphone signal models were used as the output signals of the two microphones. The simulation showed that the localization performance is strongly dependent on the azimuthal position of a source, which is caused by the asymmetry of the trajectory amplitude. Additionally, the experimental results of the two experiments carried out in the room environment demonstrated that the proposed system can localize a Gaussian noise source and a voice source in 3-D space.


Introduction
Recently, intelligent robots have been developed to not only support arduous human tasks, but also to interact with people in order to meet various human needs [1,2]. As an independent object with its own intelligence [3], a robot needs to recognize environmental changes, such as the appearance of unidentified objects or the acoustic events for missions completed. For example, robots working in households should detect user voices and simultaneously be aware of other acoustic events, such as noises emitted from home appliances and other voices from electric devices. As a result, they can pay attention to speakers with more natural human-robot interaction (HRI) skills. In this situation, the technology of sound source localization (SSL) is employed to estimate the acoustic source direction using the acoustic signals from the microphone array; this is one of the most important building blocks of HRI. In addition, intelligent robots need to estimate the azimuth and elevation angles of a source together (i.e., 3-D SSL), due to the fact that a sound event occurs at an arbitrary direction in 3-D space. It is noteworthy, however, that a lot of techniques need to be carried out simultaneously with the given limited resources and the computational power for SSL is restricted. As a result, the computationally efficient 3-D SSL method is increasingly required. In addition, the source direction is defined by the inter-aural polar coordinates shown in Figure 1.
In the last few decades, many different SSL algorithms that are applicable to intelligent robots [4,5] and to other engineering systems (e.g., teleconference systems [6] and surveillance units [7]) have been proposed. Although the size and shape of the microphone array and the number of microphones differ according to the constraints of each application, certain localization cues (or direction estimation cues) are commonly used: inter-channel time difference (ICTD), inter-channel level difference (ICLD) and inter-channel spectral difference (ICSD). ICTD is defined as the time difference between the arrivals of a sound wave-front at the microphones; ICLD is defined as the difference between the sound pressure levels at the microphones; and ICSD is the difference between the spectral contents at the microphones. ICTD has been used as the most powerful localization cue [8-10] in almost all applications. By comparison, ICLD and ICSD have not been used as frequently in practical applications, but are nevertheless employed in biomimetic research, including that of ear-based SSL systems [11,12].
Figure 1. The sagittal plane is a vertical plane dividing the space into right and left halves. Sources on each sagittal plane share the same azimuth. The median plane is the mid-sagittal plane that bisects the space symmetrically from left to right. The horizontal plane is perpendicular to the sagittal plane and passes through the centre of the coordinates.
In general, most SSL systems use more than four microphones for 3-D SSL. Except when directional microphones are used, 3-D SSL is not possible with only two microphones whose locations do not change with time, because many directions share the same localization cues and front-back confusion occurs, even when localizing a single source [13,14]. This is called the cone-of-confusion in 3-D space. In the absence of an additional structure (e.g., a spiral-shaped structure), more than four microphones in different (imaginable) planes are necessary in order to solve the cone-of-confusion problem inherent to 3-D SSL [15]. However, when a two-channel moving array is used for 3-D SSL, the measured localization cue, such as ICTD, changes according to the array motion. From the changing pattern of the ICTDs, 3-D SSL is then expected to be achievable. The (microphone) position-dependent ICTDs can therefore be written as $\tau(\varphi_S, \theta_S \mid \alpha)$ (equation (1)), where $(\varphi_S, \theta_S)$ is the source direction and $\alpha$ is the parameter that identifies the location or the motion of the moving microphone. Here, the (source) direction- and (microphone) position-dependent ICTDs are named the ICTD trajectory, which is the new concept for a localization cue for efficient 3-D SSL.
This paper proposes a 3-D SSL method using the ICTD trajectory induced by the circular motion of a two-channel rotating array. This array was selected as one of the possible ways to realize a specific ICTD trajectory. Figure 2 shows the schematic drawing of the suggested two-channel rotating microphone array installed on the spherical head. One of the two microphones is attached to the rotating plate on the right side of the spherical head, indicated by the red-coloured circle in Figure 2; the other microphone is fixed on the left side of the head, indicated by the blue-coloured circle. In this paper, in order to generate a known movement of the array, a circular motion is given to the right-sided plate.
This paper is organized as follows: in section 2, we introduce the new specific localization cue, namely the ICTD trajectory relevant to the circular motion of the array. The mathematical derivation of the ICTD trajectory is presented using the extended Ray-Tracing formula for 3-D models. The relationship between the parameters of the ICTD trajectory and the source direction is also presented. Section 3 describes the proposed 3-D SSL algorithm based on the source direction estimator. In section 4, the localization performance of the proposed SSL algorithm is examined using simulations; the signal models of both the rotating and fixed microphones are presented. Section 5 shows the experimental setup and the results using two kinds of sources. The discussion is presented in section 6 and the concluding remarks are given in section 7.
Figure 2. Schematic of the rotating microphone array installed on the spherical head. One of the two microphones is attached to the rotating plate on the right side of the head (red-coloured circle). In this paper we call this component the "rotating microphone". The rotating part moves in a clockwise direction on the $Y_R Z_R$ plane and the shift angle of this part is measured with respect to the $+Z_R$ axis. The other microphone is fixed on the left side of the head (blue-coloured circle). It is called the "fixed microphone".

Localization Cue: ICTD Trajectory
Most of the conventional localization algorithms have been developed using a microphone array fixed at given positions [4,5,8-12,15] and with constant direction-dependent cues, on the assumption that the source position does not vary. The most commonly used time delay estimation (TDE) method is the generalized cross-correlation (GCC) function, which is employed to estimate the ICTD between a selected pair of microphones [16,17]. Among the various GCC functions, the GCC-phase transform (PHAT) function is widely used because it is well known for its robust estimation of ICTD in a reverberant field [18].
If we use $N$ omnidirectional microphones, $N(N-1)/2$ GCC-PHAT functions are obtainable. The GCC-PHAT function between the $i$th and $j$th microphones is defined as below [16]:
$$R_{x_i x_j}(\tau) = \int_{-\infty}^{\infty} \frac{G_{x_i x_j}(f)}{|G_{x_i x_j}(f)|}\, e^{j 2\pi f \tau}\, df \quad (2)$$
where $f$ is the frequency and $\tau$ is a time delay variable. $x_i$ and $x_j$ are the $i$th and $j$th microphone output signals. $G_{x_i x_j}$ and $R_{x_i x_j}$ are the cross-spectral density function and the GCC-PHAT function between $x_i$ and $x_j$, respectively. The measured ICTD between $x_i$ and $x_j$, $\tau_{ij}$, is calculated by $\tau_{ij} = \arg\max_{\tau} R_{x_i x_j}(\tau)$. After that, the direction of the sound source can be estimated from the relationship between the geometry of the microphone array and the multiple ICTDs. For example, when two microphones are placed on the horizontal plane and separated by $L$ m, the azimuth of a source can be estimated by using equation (3) [19]:
$$\hat{\varphi}_S = \sin^{-1}\left(\frac{c\,\tau_{ij}}{L}\right) \quad (3)$$
where $c$ is the speed of sound and $\hat{\varphi}_S$ is the estimated azimuth. As another example, SSL using ICTD maps [20,21] was applied to a microphone array fixed on a robot head.
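The GCC-PHAT time delay estimation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `gcc_phat` and `azimuth_from_ictd`, the zero-padding length, the small regularization constant and the default speed of sound are all assumptions made for the example, and the azimuth mapping assumes the far-field relation $\sin\varphi = c\tau/L$ for two free-field microphones.

```python
import numpy as np

def gcc_phat(x_i, x_j, fs, max_tau=None):
    """Estimate the delay of x_i relative to x_j (seconds) with GCC-PHAT.

    Returns (tau, lags, r): the delay at the peak, the lag axis in seconds,
    and the GCC-PHAT curve on that axis.
    """
    n = len(x_i) + len(x_j)               # zero-pad to avoid circular wrap-around
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    G = X_i * np.conj(X_j)                # cross-spectral density estimate
    G /= np.abs(G) + 1e-12                # PHAT weighting: keep the phase only
    r = np.fft.irfft(G, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that the lag axis runs from -max_shift to +max_shift.
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))
    lags = np.arange(-max_shift, max_shift + 1) / fs
    return lags[np.argmax(r)], lags, r

def azimuth_from_ictd(tau, spacing, c=343.0):
    """Far-field azimuth (radians) for two microphones `spacing` metres apart."""
    return np.arcsin(np.clip(c * tau / spacing, -1.0, 1.0))
```

With this sign convention a positive delay means that the wave-front reaches microphone $j$ before microphone $i$. The `max_tau` argument can be used to restrict the search to physically possible delays.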

ICTD Trajectory
When using an immovable microphone array, the observation of constant localization cues (e.g., the ICTDs) can be used to efficiently estimate a direction, provided there are sufficient sensors. However, for a two-channel array, the measurement of a single constant localization cue does not guarantee successful SSL, due to the cone-of-confusion problem. Thus, a useful 3-D SSL cue should have additional characteristics that depend on the source direction, i.e., on the azimuth and elevation angles.
When using the two-channel rotating array, if the angular velocity is given as $w_R$, the position-dependent ICTD must be a periodic function with a period of $2\pi / w_R$. The localization cue we suggest is likewise based on the ICTD concept. We assumed that the Doppler effect caused by the relative motion between the rotating microphone and a source can be ignored, because the speed of the rotating microphone, i.e., the radius of the rotating circle multiplied by the angular speed, is quite small compared with the speed of sound. Furthermore, in the application scenario where a talking person, as the target source, is walking inside a room, the sound source is supposed to move slowly. Thus, the new specific localization cue, which includes the position-dependent feature of the circular motion of the rotating array, is defined in equation (4) for the source at $(\varphi_S, \theta_S)$ as the ICTD expressed as a function of the shift angle, $\tau(\varphi_S, \theta_S \mid \theta_{shift})$, where $\theta_{shift} \in [0, 2\pi)$ is measured clockwise from the $+Z_R$ axis. $\theta_{shift}$ indicates the position of the microphone on the right side, given a constant rotating radius ($r_R$).

Extended Ray-Tracing Formula for 3-D Models
The well-known Ray-Tracing formula [22,23] has been widely used for 2-D models to approximate the inter-aural time difference. However, in order to obtain the ICTD trajectories, this formula should be extended to 3-D models such as a two-channel rotating array installed on a sphere. Figure 3 shows the nomenclature required to model the propagation distance between the source and the sensor locations. Once the propagation distance is derived as a function of the shift angle of the rotating part, the ICTD trajectory is obtainable under the assumption that the speed of sound is independent of frequency in a non-dispersive medium. In order to derive the propagation distance, three direction vectors from $C_H$ to the positions of the two microphones and the source need to be expressed as functions of the shift angle and the source direction. The Ray-Tracing formula for the 3-D model includes the concept of the critical circle, which is the counterpart of the critical point in the 2-D model [22]. When the observation point is hidden by the head, the wave-front from the source first propagates directly to the critical circle and then propagates along the surface to the observation point. These propagation steps are shown in Figure 4.
Figure 3. The red circle represents the rotating microphone at the rotating part on the $Y_R Z_R$ plane. This microphone rotates with a constant speed ($r_R \times w_R$), where $r_R$ and $w_R$ are the rotating radius and angular velocity, respectively. The rotation centre $C_R$ is located at $(\sqrt{r_H^2 - r_R^2}, 0, 0)$. The shift angle ($\theta_{shift}$) of the rotating plate is defined as the angle from the $+Z_R$ axis in a clockwise direction. The blue circle represents the fixed microphone, which is located at $(-r_H, 0, 0)$. $r_H$ is the radius of the spherical head and $\theta_R$ is the angle between the X axis and the direction vector to the rotating microphone from $C_H$.
With the array dimensions given in Table 1, the resulting ICTD trajectories of the frontal sources on the horizontal plane are those shown in Figure 5. It is clear that the mean of the ICTD trajectories varies according to the change in the azimuth angle of the source, because the $Y_R Z_R$ plane, on which the circular motion of the rotating microphone occurs, is perpendicular to the X axis. Also, for the frontal sources, the resulting up-and-down pattern of the ICTD trajectories is expected because of the clockwise microphone motion from the $+Z_R$ axis. On the other hand, we can conjecture that the pattern of the ICTD trajectories of the rear sources will be reversed. In addition, for the laterally biased sources, no significant features are visible because the source is on the X axis, which is the symmetry axis of the array's rotating motion. Figure 6 shows the ICTD trajectories for the sources on the median plane. The vertical axis has no physical meaning because we added a 0.3 ms offset to the original ICTD trajectories each time the elevation angle of the source increases by 45°, in order to present all ICTD trajectories in a single graph. As shown in Figure 6, as the elevation angle increases, the phase of the trajectory is shifted. The source elevation angle is defined on the sagittal plane and the shift angle of the rotating part is also defined on a sagittal plane, i.e., the $Y_R Z_R$ plane; see Figures 1 and 3. This is why the ICTD trajectory is shifted by as much as the change in the elevation angle of a source. It therefore seems possible to estimate a source's elevation angle by finding the phase shift of the trajectory. In summary, we can expect that the mean value of the ICTD trajectory will be a useful cue for azimuth estimation and that the amount of shift of the ICTD trajectory can be used to efficiently estimate the elevation.
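The behaviour described above (a mean that follows the azimuth and a phase that follows the elevation) can be reproduced with a simplified ray-tracing sketch. The code below is an illustration under stated assumptions rather than the paper's derivation: it assumes a far-field source (plane wave), the direction vector $(\sin\varphi, \cos\varphi\cos\theta, \cos\varphi\sin\theta)$ for the inter-aural polar coordinates, the sign convention ICTD = (path to fixed microphone minus path to rotating microphone) / $c$, and hypothetical dimensions `R_H` and `R_R` that are not the values of Table 1.

```python
import numpy as np

C = 343.0     # speed of sound in m/s (assumed)
R_H = 0.10    # head radius in m (hypothetical, not from Table 1)
R_R = 0.035   # rotating radius in m (hypothetical, not from Table 1)

def source_vector(az, el):
    """Unit vector towards the source; X to the right, Y to the front, Z up."""
    return np.array([np.sin(az), np.cos(az) * np.cos(el), np.cos(az) * np.sin(el)])

def path_offset(mic_unit, src_unit, r_h):
    """Far-field ray-tracing path length relative to the head centre.

    Visible sensor: direct path. Hidden sensor: direct path to the
    critical circle followed by an arc along the sphere surface.
    """
    gamma = np.arccos(np.clip(mic_unit @ src_unit, -1.0, 1.0))
    return np.where(gamma <= np.pi / 2,
                    -r_h * np.cos(gamma),
                    r_h * (gamma - np.pi / 2))

def ictd_trajectory(az, el, shift, r_h=R_H, r_r=R_R, c=C):
    """ICTD (seconds) for each shift angle (radians, clockwise from +Z_R)."""
    shift = np.asarray(shift, dtype=float)
    s = source_vector(az, el)
    x_c = np.sqrt(r_h**2 - r_r**2)            # rotation centre on the X axis
    rot = np.stack([np.full(shift.shape, x_c),
                    r_r * np.sin(shift),
                    r_r * np.cos(shift)], axis=-1) / r_h
    fixed = np.array([-1.0, 0.0, 0.0])        # fixed microphone at (-r_H, 0, 0)
    return (path_offset(fixed, s, r_h) - path_offset(rot, s, r_h)) / c

shift = np.deg2rad(np.arange(360))
front = ictd_trajectory(0.0, 0.0, shift)         # source in front
top = ictd_trajectory(0.0, np.pi / 2, shift)     # source above the head
```

In this sketch the trajectory of the frontal source peaks at a shift angle of 90° and that of the overhead source peaks at 0°, which agrees with the phase-shift behaviour described for Figure 6.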

Characteristics of the ICTD Trajectory of the Rotating Microphone Array
In section 2.2, examples of the ICTD trajectories obtained using the Ray-Tracing formula were shown. In section 2.3, we describe the characteristics of the ICTD trajectories of the rotating microphone array. First, the relation between the mean of the ICTD trajectory and the azimuth angle of the source will be derived. Second, the relation between the phase shift of the trajectory and the elevation angle will be shown. In addition, the amplitude of the ICTD trajectory will be presented as a function of the azimuth angle only.
First of all, the mean value of the ICTD trajectory is defined as equation (13):
$$\overline{\mathrm{ICTDT}} = \frac{1}{2\pi}\int_{0}^{2\pi} \tau(\varphi_S, \theta_S \mid \theta_{shift})\, d\theta_{shift} \quad (13)$$
The wave propagation from a source to a microphone is strongly dependent on the azimuth angle of the source; see Figure 4 and equation (10). For example, when a sound source is to the left of the head, only direct wave propagation occurs to the fixed microphone. On the other hand, consecutive propagation along the direct and indirect paths occurs from the source to the rotating microphone, because the rotating microphone is hidden by the head from the view of the source. If a source is to the right of the head, the wave propagation characteristics are reversed. In particular, for a source with an azimuth angle within $[-\theta_R, +\theta_R]$, the propagation characteristic to the rotating microphone changes according to its shift angle. In order to represent the propagation characteristics more precisely, we divide them into three categories: (case 1) the wave propagation is along the direct path only, (case 2) the consecutive propagation is along the direct and indirect paths and (case 3) there is a transition between case 1 and case 2, depending on the shift angle. In terms of these categories, Table 2 shows the propagation characteristics according to the azimuth angle of the source. For sources with azimuth angles within $[-\theta_R, 0]$, the propagation to the rotating microphone corresponds to case 3. The transition from case 1 to case 2 occurs when the rotating microphone passes $\theta_b$ and the subsequent transition from case 2 to case 1 occurs at $\pi + \theta_b$, where $\theta_b$ is defined in equation (14). For sources with azimuth angles within $[0, +\theta_R]$, the transition from case 1 to case 2 occurs at $\theta_b$ and the subsequent transition from case 2 to case 1 occurs later within the same revolution, where $\theta_{shift} \in [0, 2\pi)$. With respect to the azimuth interval, the mean value of the ICTD trajectory is derived as a function of the azimuth angle of the source only, as shown in Figure 7.
It is apparent that a one-to-one relationship exists between the mean value of the ICTD trajectory and the azimuth angle of the source. Therefore, it is possible to estimate the azimuth angle of the source once the mean value of the ICTD trajectory is obtained.
Figure 6. The ICTD trajectories for sources on the median plane are shown. As the source elevation changes, the trajectory pattern shifts. For the source on the top of the head, the distance between the rotating microphone and the source is the shortest at $\theta_{shift}$ = 0°, which indicates that the ICTD is maximal. The distance increases up to $\theta_{shift}$ = 180° and goes back to the shortest distance within a single period.
Table 1. Physical dimensions of the rotating microphone array.
On the other hand, the specific shift angles that correspond to the maximal or minimal values of the ICTD trajectory are useful for finding the elevation angle of the source. These specific shift angles are defined as below:
$$\theta_{shift}^{max} = \arg\max_{\theta_{shift}} \tau(\varphi_S, \theta_S \mid \theta_{shift}), \qquad \theta_{shift}^{min} = \arg\min_{\theta_{shift}} \tau(\varphi_S, \theta_S \mid \theta_{shift}) \quad (15)$$
which implies that $\theta_{shift}^{max}$ and $\theta_{shift}^{min}$ are equal to $\pi/2 - \theta_S$ and $3\pi/2 - \theta_S$, respectively. Note that the elevation angle increases from the +Y axis in an anticlockwise direction, while the shift angle of the rotating microphone increases from the $+Z_R$ axis in a clockwise direction (see Figures 1 and 3). In summary, by finding the two parameters of the ICTD trajectory, i.e., the mean value $\overline{\mathrm{ICTDT}}$ and $\theta_{shift}^{max/min}$, the azimuth and elevation angles of the source can be found independently.
In addition, as shown in Figure 5, the amplitude of the ICTD trajectory changes as the azimuth angle is varied. Naturally, we can expect the trajectory amplitude to be dependent on the azimuth angle only. Its definition is given below:
$$\mathrm{ICTDT}_{pp} = \tau(\varphi_S, \theta_S \mid \theta_{shift}^{max}) - \tau(\varphi_S, \theta_S \mid \theta_{shift}^{min}) \quad (16)$$
We express the amplitude of the ICTD trajectory as its peak-to-peak value using the specific shift angles in equation (15). Figure 8 visualizes this amplitude as a function of the azimuth angle. It is notable that the ICTD trajectories of the left-sided sources have a larger $\mathrm{ICTDT}_{pp}$ than those of the right-sided sources, except for the source at $(\varphi_S, \theta_S)$ = (-90°, 0°). The variation of the ICTD trajectory is caused only by the motion of the rotating microphone. When the sphere hides the entire trajectory of the rotating microphone's motion from the view of the source, the wave propagation of case 2 occurs and the variation of the propagation distances becomes the largest (see Table 2). Also, when the source moves from the left to the right, the portion of direct wave propagation increases and $\mathrm{ICTDT}_{pp}$ decreases. Equation (17) gives $\mathrm{ICTDT}_{pp}$ for each azimuth interval.
Figure 8. The values of $\mathrm{ICTDT}_{pp}$ are a function of the azimuth angle. In particular, the left-sided sources within $[-\pi/2 + \theta_R, -\theta_R]$ have the same $\mathrm{ICTDT}_{pp}$, which corresponds to the time taken for a wave-front to travel the length of $2 r_R \theta_R$, and $2 r_R \theta_R / c$ is equal to 0.206 ms. $2 r_R \theta_R$ is the greatest length made by the rotating motion on the surface within a full revolution. As a source approaches the right, $\mathrm{ICTDT}_{pp}$ decreases. As exceptions, the $\mathrm{ICTDT}_{pp}$ values of the sources at (-90°, 0°) and (+90°, 0°) are zero, because these sources are located on the X axis, which is perpendicular to the $Y_R Z_R$ plane.

Localization Algorithm
The localization of a source can be achieved using the one-to-one relationship between the parameters of an ICTD trajectory and a source direction, as described in section 2.3. However, it is not easy to apply this approach to a real situation where a source and other noises are present simultaneously. In addition, the duration of a source signal varies and can be too short to calculate $\tau(\theta_{shift})$ over a full revolution, even in the single-source case. Therefore, a new SSL method that is practically feasible in a real environment is necessary. Section 3.1 presents the source direction estimator (SDE) based on the ICTD trajectory and section 3.2 summarizes the proposed 3-D SSL algorithm.

Source Direction Estimator
As mentioned before, we used the conventional GCC-PHAT function [16] to obtain the ICTD trajectories. Equation (18) redefines the GCC-PHAT function so that it depends on the shift angle of the rotating microphone:
$$R_{x_F x_R}(\tau \mid \theta_{shift}) = \int_{-\infty}^{\infty} \frac{G_{x_F x_R}(f \mid \theta_{shift})}{|G_{x_F x_R}(f \mid \theta_{shift})|}\, e^{j 2\pi f \tau}\, df \quad (18)$$
Thus, $G_{x_F x_R}(f \mid \theta_{shift})$ is strongly dependent on the shift angle of the rotating microphone. It should be noted that the relative motion between a sensor and a source is so small that the Doppler effect in the measured signals is negligible [24]. Under this assumption, the SDE is defined in equation (19) using the GCC-PHAT functions and $\tau(\varphi_S, \theta_S \mid \theta_{shift})$, which is the entry of the constructed ICTD trajectory database for a source at $(\varphi_S, \theta_S)$. If a source is present at $(\varphi_a, \theta_a)$ only, then ideally the SDE is 1 at $(\varphi, \theta) = (\varphi_a, \theta_a)$ and 0 in all other directions. Thus, once the SDE is generated, it is possible to estimate the source direction via peak detection.

Localization Algorithm for Rotating Microphone Array
In this section, the proposed SSL algorithm is described. On the basis of the weak Doppler effect (due to the small relative motion), the signals of the fixed and rotating microphones collected within (at least) a single period are segmented into $N_f$ frames, each including $N_{fft}$ samples. The angle allocated to each frame is the shift angle measured at the time the middle sample of the frame is collected. Sections 4.1 and 4.2 give more information about the segmentation process. In the real system, the shift angle is measured directly from the encoder signal; see Figures 17 and 18 for more details. Then, we can obtain $R_{x_F x_R}(\tau \mid \theta_{shift})$ and $\mathrm{SDE}(\varphi_S, \theta_S)$ for every possible direction using equations (18) and (19). The final decision is made by detecting the peak in the SDE; we assume that the number of dominant sources is given by the recognition group prior to the SSL process. If it is reported that a single source is recognized, then the source direction can be estimated by equation (20):
$$(\hat{\varphi}_S, \hat{\theta}_S) = \arg\max_{(\varphi, \theta)} \mathrm{SDE}(\varphi, \theta) \quad (20)$$
where $\hat{\varphi}_S$ and $\hat{\theta}_S$ are the estimated azimuth and elevation angles of a source, respectively. For the localization of multiple sources, various peak detection strategies are applicable when multiple peaks are present in the SDE. However, since our research focused on single-source localization, we used the simplest global peak detection of equation (20). Figure 9 shows the procedure of the proposed SSL algorithm.
Figure 9. Procedure of the proposed SSL algorithm. First, the signals of the fixed and rotating microphones are segmented into frames to calculate $R_{x_F x_R}(\tau \mid \theta_{shift})$. Next, by using the constructed ICTD trajectory database, the SDE can be obtained for every direction. Finally, by finding the dominant peak(s) in the SDE in descending order, the source direction(s) can be estimated.
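The matching step between the measured GCC-PHAT functions and the ICTD trajectory database can be sketched as follows. The exact form of equation (19) is not assumed here: this illustration simply averages, over all frames, the normalized GCC-PHAT value sampled at the lag that each candidate trajectory predicts, so that an ideal single source scores 1 in its own direction. The function name `source_direction_estimator`, the array layouts and the toy sinusoidal database are hypothetical.

```python
import numpy as np

def source_direction_estimator(gcc, lags, database):
    """Score every candidate direction against the measured GCC-PHAT curves.

    gcc:      (n_frames, n_lags) GCC-PHAT curves, one per shift-angle frame
    lags:     (n_lags,) uniformly spaced, ascending lag axis in seconds
    database: (n_dirs, n_frames) candidate ICTD trajectories in seconds
    returns:  (n_dirs,) scores; the peak marks the estimated direction
    """
    peak = np.maximum(np.max(np.abs(gcc), axis=1, keepdims=True), 1e-12)
    gcc = gcc / peak                                   # per-frame normalization
    step = lags[1] - lags[0]
    idx = np.rint((database - lags[0]) / step).astype(int)
    idx = np.clip(idx, 0, len(lags) - 1)               # nearest lag bin
    frames = np.arange(gcc.shape[0])
    return gcc[frames[None, :], idx].mean(axis=1)

# Toy example: nine candidate directions with sinusoidal trajectories.
shift = np.deg2rad(np.arange(0, 360, 10))
lags = np.arange(-40, 41) / 44100.0
means = np.array([-2e-4, 0.0, 2e-4])
phases = np.deg2rad([0.0, 90.0, 180.0])
database = np.array([m + 1e-4 * np.sin(shift + p) for m in means for p in phases])

true_index = 4                                         # simulated source direction
gcc = np.zeros((len(shift), len(lags)))
bins = np.rint((database[true_index] - lags[0]) / (lags[1] - lags[0])).astype(int)
gcc[np.arange(len(shift)), bins] = 1.0                 # ideal impulse-like GCC-PHAT

sde = source_direction_estimator(gcc, lags, database)
estimate = int(np.argmax(sde))                         # global peak detection
```

Because the lag axis is discrete, neighbouring trajectories can share lag bins in some frames, which gives non-zero scores away from the true direction even in this idealized case.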

Simulation
In section 4, we evaluate the performance of the proposed SSL algorithm using synthesized signals. To do this, signal models of the fixed and rotating microphones were needed. These models are given in section 4.1 and the results of the simulation for a single source are described in section 4.2.
The localization performance is evaluated with respect to the localization error, which is defined as the angle between the true and perceived direction vectors. In this simulation, the physical dimensions of the rotating microphone array are given in Table 1.
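The localization error defined above can be computed as in the sketch below. The mapping from the inter-aural polar angles to a unit direction vector, $(\sin\varphi, \cos\varphi\cos\theta, \cos\varphi\sin\theta)$, is an assumption made for this illustration, as are the function names.

```python
import numpy as np

def direction_vector(az_deg, el_deg):
    """Unit vector for inter-aural polar angles (assumed axis convention)."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    return np.array([np.sin(az), np.cos(az) * np.cos(el), np.cos(az) * np.sin(el)])

def localization_error_deg(true_dir, est_dir):
    """Angle in degrees between the true and perceived direction vectors."""
    tdv = direction_vector(*true_dir)
    pdv = direction_vector(*est_dir)
    cos_angle = np.clip(np.dot(tdv, pdv), -1.0, 1.0)   # guard against rounding
    return float(np.rad2deg(np.arccos(cos_angle)))

err_front_top = localization_error_deg((0.0, 0.0), (0.0, 90.0))
err_lateral = localization_error_deg((90.0, 0.0), (90.0, 120.0))
```

A single angular error of this kind avoids over-penalizing elevation errors near the X axis: for a source at an azimuth of 90°, a 120° elevation error corresponds to almost no change in the actual direction.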

Signal Models of the Fixed and Rotating Microphones
As shown in Figure 3, the rotating microphone array is installed on a spherical head with a radius of $r_H$. One of the two microphones is fixed at $(-r_H, 0, 0)$ on the surface of the spherical head (this microphone is hereafter called the "fixed microphone" for convenience). The output signal of the fixed microphone in the continuous-time domain, denoted as $x_F(t)$, can be modelled as below:
$$x_F(t) = h_{S x_F}(t) * s(t \mid \varphi_s, \theta_s) \quad (21)$$
where $h_{S x_F}(t)$ is the spherical impulse response [25] from the source position to the fixed microphone position on the spherical head, $s(t \mid \varphi_s, \theta_s)$ is the source signal, and $*$ indicates the convolution operator. As shown in equation (21), $h_{S x_F}(t)$ is not a function of $\theta_{shift}$ because this microphone does not move. However, the other microphone (i.e., the rotating microphone) is located on the rotating plate and moves in a circular motion on the $Y_R Z_R$ plane (see Figures 2 and 3). The measured signal of the rotating microphone is therefore strongly dependent on the shift angle. The signal model of the rotating microphone, denoted as $x_R(t)$, can be defined as:
$$x_R(t) = h_{S x_R}(t \mid \theta_{shift}) * s(t \mid \varphi_s, \theta_s) \quad (22)$$
where $h_{S x_R}(t \mid \theta_{shift})$ is the spherical impulse response from the source position to the rotating microphone position. In this case, $h_{S x_R}$ is a function of $\theta_{shift}$ due to the circular motion of the microphone.
The synthesized signal refers to the discrete-time domain signal. The synthesized signal of the fixed microphone, denoted as $x_F[n]$, is generated by simply discretizing $x_F(t)$, as shown below:
$$x_F[n] = x_F(n \Delta t_S) \quad (23)$$
where $\Delta t_S$ is the sampling period and $x_F[n]$ is the $n$th sample of the synthesized signal of the fixed microphone. On the other hand, the motion of the rotating microphone makes the generation of $x_R[n]$ more complicated. For example, when we assume that the rotating microphone is shifted by $+\theta_N$° in a clockwise direction from the $+Z_R$ axis and is fixed during the measurement, the output signal of the rotating microphone is:
$$x_R(t \mid \theta_{shift} = \theta_N) = h_{S x_R}(t \mid \theta_N) * s(t \mid \varphi_s, \theta_s) \quad (24)$$
By using this notation, $M_{x_R}(\cdot)$, which is the matrix of conditioned (continuous) output signals, can be modelled as equation (25) and is composed of $N_f$ output signals with a resolution of $\Delta\theta_N$ degrees:
$$M_{x_R}(t) = \begin{bmatrix} x_R(t \mid 0) \\ x_R(t \mid \Delta\theta_N) \\ \vdots \\ x_R(t \mid (N_f - 1)\Delta\theta_N) \end{bmatrix} \quad (25)$$
where $x_R(t \mid \theta_{shift})$ is equal to $x_R(t \mid \theta_{shift} + 2\pi)$ due to the circular motion of the rotating microphone, which means that the process is cyclo-stationary when the source content is stationary [26]. We assume that the other dimensions do not vary.
In this simulation, we set the sampling frequency ($f_S$) and the number of frames ($N_f$) to 44.1 kHz and 360, respectively. Thus, $\Delta\theta_N$ becomes 1°, and $M_{x_R}[\cdot]$, which is the matrix of conditioned discretized signals, is modelled in equation (26); its rows are the discretized signals $x_R[n \mid k\Delta\theta_N]$ for $k = 0, \dots, N_f - 1$. From equation (26), the synthesized signal $x_R[n]$ is obtained by selecting, for each sample $n$, the row that corresponds to the shift angle of the rotating microphone at that sampling instant (equation (27)). For instance, when the source is located in the direction of $(\varphi_s, \theta_s)$ = (0°, 0°) and its signal content is Gaussian white noise, the resulting values included in $M_{x_R}[\cdot]$ are presented in Figure 10. The amplitude of the synthesized signal increases as the shift angle of the rotating microphone gets close to 90° and generally decreases as the shift angle approaches 270°. This is a reasonable result: when the rotating microphone approaches the source direction, the measured signal is less attenuated by the spherical head. In this simulation model, the angular velocity of the rotating plate is 600 rpm. The synthesized output signal of the rotating microphone is collected along the signal detection line with $w_R$ of 600 rpm. In this case, the synthesized microphone outputs are presented in Figure 11.
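The collection of the rotating microphone signal along the signal detection line can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, the shift angle is taken to start at 0° at sample 0, and each sample is assigned to a row of the conditioned-signal matrix by rounding its shift angle down to the 1° grid (the rounding rule is not specified in the text).

```python
import numpy as np

def detection_line_rows(n_samples, fs=44100.0, rpm=600.0, n_frames=360):
    """Row of the conditioned-signal matrix visited at each sample index."""
    n = np.arange(n_samples)
    shift_deg = (360.0 * (rpm / 60.0) * n / fs) % 360.0   # shift angle per sample
    step = 360.0 / n_frames                               # angular resolution
    return np.floor(shift_deg / step).astype(int) % n_frames

def synthesize_rotating_signal(conditioned, fs=44100.0, rpm=600.0):
    """Pick one sample per time index along the detection line.

    conditioned: (n_frames, n_samples) array whose k-th row is the signal the
    microphone would record if it stayed at the k-th shift angle.
    """
    n_frames, n_samples = conditioned.shape
    rows = detection_line_rows(n_samples, fs, rpm, n_frames)
    return conditioned[rows, np.arange(n_samples)]

# At 600 rpm and 44.1 kHz one revolution takes 0.1 s, i.e. 4410 samples.
rows = detection_line_rows(4411)
```

With these settings each 1° row is visited for about 12 consecutive samples (4410 / 360 = 12.25) before the detection line moves to the next row.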

Simulation Results
Various criteria to evaluate the SSL performance have been suggested by previous researchers [4, 6, 11-12, 15, 20]. One of the most commonly used criteria is based on the absolute error between the true and perceived directions, and it can be applied to the evaluation of azimuth or elevation angle estimations separately. However, for the evaluation of 3-D SSL performance, it is more reasonable to incorporate both azimuth and elevation together. If we express the perceived (or estimated) azimuth and elevation angles as $\hat{\varphi}_S$ and $\hat{\theta}_S$ respectively, then the true and perceived direction vectors ($\vec{tdv}$, $\vec{pdv}$) are defined with respect to the inter-aural polar coordinates, and the localization error is the angle between them.
By using equation (19), the SDE is obtained from the GCC-PHAT functions and the database of the approximated ICTD trajectories. Figure 13 shows the calculated SDE. The dominant peak is clearly visible, and bell-shaped side edges originating from the peak spread out primarily along the elevation angle axis. This result is due to several factors. First, if the time resolution were infinitesimally small, the bell-shaped edges would become invisible. However, the acquisition and processing system has its limitations, such as a finite $f_S$. As a result, adjacent ICTD trajectories may overlap each other. More specifically, the locations of the peaks of the GCC-PHAT functions partially match more than one ICTD trajectory in the time domain. Thus, the side edges become visible. Also, we can expect that as the time interval increases, the overlapped region will expand and the SDE values corresponding to the side edges will increase. Second, even if the SDE is calculated in the discrete-time domain with a denser time resolution, the side edges still appear, because the signal bandwidth is limited and the calculated GCC-PHAT function is therefore not equal to an ideal impulse. Besides, the effect of the rotational motion on the synthesized signals remains, although it is not remarkable.
Therefore, the processing in the discrete-time domain and the motion of the rotating microphone cause the bell-shaped edges.
We examined the 3-D SSL performance of the proposed SSL algorithm for a Gaussian white noise source with respect to the localization error mentioned above. The range of the source direction is as follows: the azimuth angle spans from -90° to +90° at 10° intervals and the elevation angle varies from 0° to +330° (-30°) at 30° intervals. The number of source directions is 228. It is assumed that the rotating microphone array system is located in a free field. Figure 14 shows the localization error distribution for all of the source directions. Generally, the performance improves as the source moves to the left, away from the rotating microphone, due to the left-right asymmetry of the azimuth-dependent ICTD trajectory amplitude (see Figure 8). Also, as expected, no elevation-dependent feature is visible. The distribution of the mean errors along the azimuth angle is shown in Figure 15.

Computational load comparison
To be an efficient 3-D SSL method, the signal processing costs must be light. In this section, the computational load of the proposed localization method is compared with those of the delay-and-sum beamformer [27] and the steered response power (SRP)-PHAT method [28]. For example, SRP-PHAT requires frequency-domain processing to perform the phase transform (PHAT). Here, if the ...
Figure 13. SDE for the source at the front side of the rotating array is shown. It was found that the dominant peak is around the true direction of the source (0°, 0°). Additionally, the bell-shaped side edges originated from the peak due to the regional overlap of adjacent ICTD trajectories. The shape of the peak is stretched in the direction of the elevation angle axis due to the short up-and-down motion of the rotating microphone, compared with the width of the array.
Figure 14. Localization errors for 228 directions are depicted. As we can see, the elevation-dependent feature was not found. However, it was quite visible that the SSL performance is strongly dependent on the azimuth angle of a source only.

Figure 15. The mean error along the azimuth angle of a source. The errors for the left-sided sources are almost the same, except for the leftmost source. The right-sided sources tend to be estimated with worse resolution than the left-sided sources.
Thus, the total SRP-PHAT processing cost is $\frac{M(M+1)}{2} \times 5N_{fft}\log_2 N_{fft} + \frac{M(M-1)}{2} \times (N_\phi N_\theta + 7N_{fft})$. In the same way, the cost of the proposed localization algorithm is $\frac{3M-1}{2} \times 5N_{fft}\log_2 N_{fft} + (M-1) \times (N_\phi N_\theta + \frac{7}{2}N_{fft})$ and the cost of the delay-and-sum beamformer is $MN \times N_\phi N_\theta$, where $N$ is the frame length, 100. Therefore, the approximated costs are: (1) delay-and-sum beamformer, $MN(N_\phi N_\theta)$; (2) SRP-PHAT method, $M^2(N_\phi N_\theta + 12N_{fft})$; (3) the proposed method, $M(N_\phi N_\theta + \frac{9}{2}N_{fft})$. It is noteworthy that if $N_{fft}$ is smaller than $N_\phi N_\theta$, the signal processing cost of the proposed localization method is roughly $N$ or $M$ times lower than that of the delay-and-sum beamformer or the SRP-PHAT method, respectively. This is reasonable because the proposed localization cue can be computed from $M$ microphone pairs when using the $(M+1)$-channel microphone array, whereas the SRP-PHAT method utilizes $M(M-1)/2$ microphone pairs for a single localization process.
On the other hand, it is known that the SRP-PHAT method can be applied to situations where the signal-to-noise ratio (SNR) is less than zero. In the proposed localization method, however, the TDE error directly affects the localization performance because the proposed cue is based on the measured ICTD trajectory. Thus, it can reasonably be expected that the performance of the proposed method will degrade significantly more than that of the SRP-PHAT method when SNR < 0.
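To make the comparison concrete, the three cost expressions above can be evaluated numerically. The sketch below is only an illustration: the function names are hypothetical, and the example values ($M = 8$, $N_{fft} = 256$, and a 2° search grid giving $N_\phi = 91$ and $N_\theta = 180$) are assumptions chosen for the example, not figures reported in the paper. Only $N = 100$ comes from the text.

```python
import math


def cost_das(M, N, n_phi, n_theta):
    """Delay-and-sum beamformer cost: M * N * (N_phi * N_theta)."""
    return M * N * n_phi * n_theta


def cost_srp_phat(M, n_fft, n_phi, n_theta):
    """SRP-PHAT cost, using the full expression quoted in the text."""
    fft_term = M * (M + 1) / 2 * 5 * n_fft * math.log2(n_fft)
    search_term = M * (M - 1) / 2 * (n_phi * n_theta + 7 * n_fft)
    return fft_term + search_term


def cost_proposed(M, n_fft, n_phi, n_theta):
    """Cost of the proposed method, using the full expression in the text."""
    fft_term = (3 * M - 1) / 2 * 5 * n_fft * math.log2(n_fft)
    search_term = (M - 1) * (n_phi * n_theta + 3.5 * n_fft)
    return fft_term + search_term


# Assumed example values (not from the paper, except N = 100).
M, N, N_FFT, N_PHI, N_THETA = 8, 100, 256, 91, 180

das = cost_das(M, N, N_PHI, N_THETA)
srp = cost_srp_phat(M, N_FFT, N_PHI, N_THETA)
prop = cost_proposed(M, N_FFT, N_PHI, N_THETA)
print(f"delay-and-sum: {das:.0f}, SRP-PHAT: {srp:.0f}, proposed: {prop:.0f}")
print(f"ratios to proposed: {das / prop:.1f}, {srp / prop:.1f}")
```

With these assumed values the ordering matches the text: the proposed method is cheapest, followed by SRP-PHAT and then the delay-and-sum beamformer. The exact ratios depend on how $N_{fft}$ compares with $N_\phi N_\theta$.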

Experiment
We developed a rotating microphone array according to the proposed design (see Figure 2 and Table 1). It should be noted that the two microphone signals needed to be transmitted wirelessly for safety reasons. Thus, both a microphone and a transmitter needed to be placed inside the rotating block. An ultrasonic motor was chosen to make this block rotate inside the head. Details about the structure of the proposed array and the measurement process are provided in section 5.1. Section 5.2 shows the results of the two experiments for the feasibility test: one involving a Gaussian white noise source and the other involving a voice source.

Experimental Set-up
For our proposed rotating microphone array, we chose a wireless system (Q240, RFQ) consisting of a dual-channel receiver, two transmitters, and two microphones (QB686, RFQ). In order to put a transmitter unit and a microphone together in a rotating block, the electronic boards inside the transmitter unit had to be rearranged and installed in a cylindrical plastic block. Figure 16 shows the interior arrangement of the necessary blocks and other units inside the spherical head. There are two cylindrical blocks, two ultrasonic motors, one encoder, and one motor driver. The cylindrical block on the right side is called the "rotating block"; it consists of the rearranged electronic boards used to transmit the microphone signal (#1) and the pin-type microphone located 3 cm from the centre of the cap. This block is connected to the ultrasonic motor (USR-E3T/24V, SHINSEI), which is driven by the motor driver (D6060E, SHINSEI). Additionally, the encoder is attached to the motor, so the shift angle of the microphone is measured using the encoder signal. The other block on the left side is hereafter called the "fixed block" for convenience. Its pin-type microphone (#2) is attached at the centre of the cap, and its transmitter unit is outside the block. The left and right side views are also presented in Figure 17. The physical dimensions are the same as those in Table 1, except the rotating radius $r_R$, which is 3 cm in the real array. Therefore, the ICTD trajectory database needed to be reconstructed.
Figure 16. A top view of the hemisphere showing the interior arrangement of the rotating and fixed blocks, two ultrasonic motors, one encoder, and one motor driver. The rotating block on the right side contains the electronic boards for transmitting the microphone signal. The shift angle of the rotating microphone is measured by using the encoder.

Experimental Results
The experiments for the feasibility test were carried out in a room environment: the room size was 3.2 × 5.5 × 2.8 m³ (width × length × height) and the reverberation time ($t_{60}$) was 0.26 seconds. The input signal was produced through a full-range speaker (TC9FSD13, VIFA) on the speaker jig. Figure 18 shows the rotating array system placed in the room. Two experiments were conducted in order to check the feasibility: one involving a Gaussian white noise signal and the other using a male voice as a source signal.

Gaussian white noise source
First, an experiment using a Gaussian noise source as the input signal to the speaker was conducted. In this experiment, the SSL performance for a source in the median plane was evaluated. Only the elevation angle of the source was varied, from -30° to 210° in 10° intervals. The source signal was Gaussian white noise with frequency content from 1.5 kHz to 20 kHz, generated by a random noise generator (SF-06, RION), and was played for longer than one rotation period. The rotation speed was set to 54 rpm. For example, the measured microphone signals and the z-phase encoder signal for the source at (0°, 0°) are depicted in Figure 19. The total measurement time was 3 seconds and the signal duration was set to 2 seconds. By using the z-phase encoder signal, we collected the samples within a single rotation period and allocated $N_{fft}$ samples to each frame. For the 25 directions in the median plane, the mean localization error was 1.75° and the standard deviation was 1.65°. Therefore, the experimental result showed that our proposed SSL algorithm is applicable to the SSL of a Gaussian noise source.
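The frame allocation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes non-overlapping frames, a constant rotation speed between two consecutive z-phase pulses, and a shift angle of 0° at the pulse. The function name `frames_in_one_rotation` is hypothetical. For reference, one rotation at 54 rpm lasts 60/54 ≈ 1.11 s.

```python
def frames_in_one_rotation(z_start, z_end, n_fft):
    """Split the samples between two consecutive z-phase pulses into frames.

    z_start, z_end: sample indices of two consecutive z-phase pulses
    n_fft:          frame length in samples
    Returns a list of (first_sample_index, shift_angle_deg) pairs. The shift
    angle is evaluated at the frame centre, assuming constant rotation speed
    and a shift angle of 0 deg at the z-phase pulse (both are assumptions).
    """
    period = z_end - z_start
    if period <= 0 or n_fft <= 0:
        raise ValueError("z_end must exceed z_start and n_fft must be positive")

    frames = []
    start = z_start
    while start + n_fft <= z_end:          # keep only complete frames
        centre_offset = start + n_fft / 2 - z_start
        frames.append((start, 360.0 * centre_offset / period))
        start += n_fft
    return frames


# Toy example: a 4096-sample rotation period split into 1024-sample frames.
for first_sample, angle in frames_in_one_rotation(1000, 5096, 1024):
    print(first_sample, angle)
```

Each frame of the rotating-microphone signal is then paired with the same-indexed frame of the fixed-microphone signal, so that one time-delay estimate is obtained per shift angle.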

Male voice source
The previous experiment employed a Gaussian white noise signal as a source. In this experiment, a male voice was used as the sound source, without using a speaker jig. The speaker's position was fixed during the measurement so that his mouth was at (45°, 0°) while speaking. The rotation speed of the rotating block was reduced to 21 rpm in order to include the silent region. The output signals of the two microphones and the encoder signal are depicted in Figure 20. It is known that voice signals are not stationary in time. Also, the spectral modification depends strongly on the relative position of the sensor and the source. If the microphones were not attached to an object such as a sphere, but located in the free field, the spectral contents of the measured microphone signals would be the same. Figure 21 shows the GCC-PHAT functions along the shift angle of the rotating microphone; the empty black-coloured circles show the estimated ICTDs. In the region where sufficient signal content was collected, the GCC functions were obtained quite reasonably, because the peak location changes in an approximately sinusoidal form. This can be interpreted to mean that the peak location of each function shifts up and down as the rotating microphone moves in a circular motion. The smoother peaks result from the comparatively narrow frequency band of the measured voice signals. Figure 22 shows the SDE for all possible directions with 2° resolution in both the azimuth and elevation directions. Consequently, the dominant peak in the SDE was found. As we examined earlier, bell-shaped side edges originate from the peak. Negative values were found in some regions. This result seems reasonable because a GCC function can have a negative value, which indicates that considerable content in the measured signals is out of phase.
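For readers unfamiliar with GCC-PHAT, the sketch below shows the basic computation for a single pair of frames. It is a simplified illustration rather than the processing chain used in the experiment: it uses a naive $O(n^2)$ DFT from the standard library instead of an FFT, computes a circular correlation, and estimates the delay only to the nearest sample. The function names are hypothetical.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(x1, x2, eps=1e-12):
    """Circular GCC-PHAT function of two equal-length frames."""
    X1, X2 = dft(x1), dft(x2)
    cross = [a.conjugate() * b for a, b in zip(X1, X2)]
    # PHAT weighting: discard the magnitude, keep only the phase.
    weighted = [c / abs(c) if abs(c) > eps else 0j for c in cross]
    return [v.real for v in dft(weighted, inverse=True)]


def estimate_delay(x1, x2):
    """Delay of x2 relative to x1 in samples (positive: x2 lags x1)."""
    r = gcc_phat(x1, x2)
    n = len(r)
    k = max(range(n), key=lambda i: r[i])
    return k if k <= n // 2 else k - n   # map upper half to negative lags


# Toy check: x2 is x1 circularly delayed by 3 samples.
rng = random.Random(0)
x1 = [rng.gauss(0.0, 1.0) for _ in range(64)]
x2 = x1[-3:] + x1[:-3]
print(estimate_delay(x1, x2))  # prints 3
```

Because the PHAT weighting flattens the spectrum, the sharpness of the peak depends on the usable bandwidth, which is consistent with the smoother peaks observed for the comparatively narrow-band voice signal.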
Finally, the direction of the source was estimated by finding the location of the (positive) peak in the SDE, as in equation (20).
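The peak search can be sketched as a sum of GCC values along the ICTD trajectory predicted for each candidate direction, followed by an argmax. The code below is a toy illustration and not a reproduction of equation (20): `toy_trajectory` is a hypothetical stand-in for the ICTD trajectory database (a sinusoid whose mean depends on azimuth and whose phase depends on elevation), and its gain and amplitude values are arbitrary assumptions.

```python
import math


def toy_trajectory(az_deg, el_deg, shift_deg, mean_gain=10.0, amp=4.0):
    """Hypothetical ICTD trajectory in samples (NOT the paper's model).

    The mean depends on azimuth only and the shift angle of the maximum
    equals the elevation angle; mean_gain and amp are arbitrary.
    """
    mean = mean_gain * math.sin(math.radians(az_deg))
    return mean + amp * math.cos(math.radians(shift_deg - el_deg))


def sde(gcc_frames, shift_angles, az_deg, el_deg, trajectory):
    """Sum the GCC values along the trajectory predicted for one direction."""
    total = 0.0
    for r, shift in zip(gcc_frames, shift_angles):
        lag = round(trajectory(az_deg, el_deg, shift))
        total += r[lag % len(r)]          # negative lags wrap around
    return total


def localize(gcc_frames, shift_angles, candidates, trajectory):
    """Return the candidate (azimuth, elevation) with the largest SDE value."""
    return max(
        candidates,
        key=lambda d: sde(gcc_frames, shift_angles, d[0], d[1], trajectory),
    )


# Toy data: ideal GCC functions (unit impulses) for a source at (30, 60).
shift_angles = list(range(0, 360, 10))
true_direction = (30, 60)
gcc_frames = []
for shift in shift_angles:
    r = [0.0] * 64
    r[round(toy_trajectory(*true_direction, shift)) % 64] = 1.0
    gcc_frames.append(r)

candidates = [(az, el) for az in range(-90, 91, 15) for el in range(0, 360, 30)]
print(localize(gcc_frames, shift_angles, candidates, toy_trajectory))
```

In this toy setting, candidate directions whose trajectories partly overlap the true one still collect some GCC energy, which gives an intuition for the bell-shaped side edges around the dominant peak.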

Discussion
The concept of the proposed localization cue, which is a (source) direction- and (microphone) position-dependent ICTD trajectory, can be applied to a circular microphone array as well. In general, if a microphone array is composed of (M+1) sensors, the information from every possible microphone pair is taken into consideration in order to improve the SSL resolution in practice. If an M-channel circular microphone array is located on the right side of the sphere and one additional microphone is fixed on the other side, the ICTD trajectory, which depends on the position of each element of the M-channel circular array, can be reproduced exactly as the proposed ICTD trajectory. Thus, the proposed localization cue-based 3-D SSL is also applicable to the circular microphone array. However, the more microphones that are used for SSL, the more costly it is to produce the microphone array, especially due to the price of the analog-to-digital converters (ADCs), which is proportional to the number of channels. Sequential sampling and signal processing could be an alternative way to reduce the production cost.
On the other hand, the source position was supposed to be outside the rotating microphone array. However, noises emitted by the (ultrasonic) motor and its driver inside the sphere could act as interior noise sources. Thus, we needed to suppress the propagation of these noises into the microphone by combining the microphone and the electronic boards in a cylindrical block, as shown in Figures 16 and 17. In addition, the directivity of the pin-type microphone (QB686, RFQ) utilized in the research was compared with that of the omnidirectional ¼ inch microphone (4178, B&K). It is generally known that the remote microphone is used for public speaking, i.e., the primary source is a speaker's voice. Thus, this type of microphone needs to have directionality. For comparison, two directivity patterns were measured and are shown in Figure 23. The omni-directionality of the B&K microphone is clearly visible, and the directivity pattern of the pin-type microphone is asymmetric with respect to the 90° direction. If we consider that the microphones are facing outward through the block cap and that the directivity pattern of the microphone is asymmetric, the interior noises are not a serious problem.
As mentioned before, we assumed that the sound source is fixed. In daily life, a source moves slowly compared with the rotation period of the array. However, in a situation where there is a fast-moving source, the patterns of the peak and the side edges in the SDE would be quite different compared with those in Figures 12 and 21. Usually, the movement of the source occurs along the azimuth angle axis. Therefore, the peak shape in the SDE would be stretched along the time axis according to the direction of the source movement and the magnitude of the peaks would be suppressed.
In this case, without information about the initial direction of the fast-moving source, its direction cannot be estimated using a single measurement because the peak shape in the SDE is not a time-dependent feature. Although a fast-moving source could be tracked by increasing the angular velocity of the rotating part, doing so can raise a safety issue.

Conclusion
This paper proposed an ICTD trajectory as a new 3-D SSL cue and, as one of the possible ways to realize the proposed cue, discussed a two-channel rotating microphone array. The characteristics of the ICTD trajectory induced by the circular motion of the rotating array were analysed using the Ray-Tracing method: the mean value of the ICTD trajectory depends only on the azimuth angle of a source, and the shift angle corresponding to the maximum (or minimum) ICTD is directly related to the elevation angle of a source. Also, the amplitude of the ICTD trajectory is asymmetric with respect to the front side, which is caused by the circular motion of the rotating microphone on the right side of the sphere. The simulation results demonstrated that the amplitude of the ICTD trajectory is the essential factor for the SSL performance. The results of the two experiments carried out in the room environment demonstrated that the 3-D SSL method using the ICTD trajectory of the two-channel rotating microphone array can effectively localize a Gaussian white noise source and a voice source in 3-D space. It is noteworthy that the estimator takes the form of a line integral of GCC-PHAT functions, similar to the steered response power (SRP)-PHAT method [27,28].