Sound Field Interpolation for Rotation-Invariant Multichannel Array Signal Processing

In this paper, we present a sound field interpolation for array signal processing (ASP) that is robust to rotation of a circular microphone array (CMA), and we evaluate beamforming as one of its applications. Most ASP methods assume a time-invariant acoustic transfer system (ATS) from sources to the microphone array. This assumption makes it challenging to perform ASP in real situations where sources and the microphone array can move. Therefore, considering a time-variant ATS is an essential task for the use of ASP. In this study, we focus on one such movement, the rotation of the CMA. Our method interpolates the sound field on the circumference of a circle, where microphones are equally spaced, based on the sampling theorem on the circle. The interpolation enables us to estimate the signals at the microphone positions before the rotation. Hence, conventional ASP, which assumes a time-invariant ATS, is applicable after interpolation without modification. We developed two beamforming schemes, one for batch and one for online processing, that combine the minimum power distortionless response beamformer and sound field interpolation. We evaluated the dependences of the interpolation on frequency and rotation angle using the signal-to-error ratio. Additionally, simulation results demonstrated that the two proposed schemes improve the beamformer's performance when the CMA rotates.



I. INTRODUCTION
I. INTRODUCTION
Array signal processing (ASP) methods remain an important research topic. Examples of related topics include beamforming, source separation, and estimation of the direction and time difference of source arrival. Independent low-rank matrix analysis [1] and multichannel nonnegative matrix factorization [2] are state-of-the-art source separation methods using sophisticated models. They have been extended in studies such as independent deeply learned matrix factorization [3] and an alternative update rule of the demixing matrix, the so-called iterative source steering [4]. Novel approaches for beamforming have also been presented, e.g., time-frequency-bin-wise switching beamforming [5], [6] and time-varying spatial covariance matrix (SCM) estimation [7], [8]. These advanced methods have achieved high performance through modification of the spatial or source models or of the calculation methodology. At the same time, they require a time-invariant acoustic transfer system (ATS) to maintain their performance. In other words, these methods assume that the microphone array and the sources do not move. A change in the ATS necessitates re-estimation of the spatial filter, making real-time processing difficult. Most ASP methods require statistical information such as the SCM to estimate the spatial model; therefore, re-estimation of the filter requires a long time duration, which becomes a bottleneck for online processing in a real environment. As mentioned above, the assumption of a time-invariant ATS is one factor to be considered in the practical use of online ASP in a real environment. The problem of a time-variant ATS is separated into two cases: moving sources and moving sensors. The basic approach in the former case is blockwise processing that combines direction-of-arrival (DOA) information estimated by another module [9], [10], tracking multiple sources using DOA estimates, and separating the sources.
However, even such an approach requires the re-estimation of the spatial filter for every block in which the DOAs change. In comparison, Taseska and Habets [11] realized online source separation by estimating the SCM sequentially together with DOA information. Our method addresses the latter case, which is not a highly active area of research; the overview of ASP by Gannot et al. [12] introduces several studies on the former case, but not on the latter. Examples of moving sensors include the situation where a robot or a human wearing a microphone array on the head, or a human wearing hearing aids, rotates the head to listen attentively to the ambient conversation. Also, Valimaki et al. [13] have considered controlling spatial acoustic information in virtual reality spaces. For such a situation, Tourbabin and Rafaely [14] interpolated the sound field with a motion compensation matrix in the spherical harmonic (SH) domain to estimate the DOA for a moving humanoid robot. They used Wigner's D-matrix, which is related to spherical and symmetric rigid rotors, as the rotation matrix. Corey and Singer [15] surveyed beamforming performance with two types of deformable microphone arrays under rotation, assuming that the SCM after rotation is available as training data. Casebeer et al. [16] dealt with the pose change of mobile microphone arrays and proposed time-varying SCM estimation using a recurrent neural network. In addition to processing in dynamic scenarios where sensors and sources move, various methods for dynamic scenario simulation have also been studied [17], [18].
We assume the sensor-moving situation described above. In this study, in particular, we take one of the ATS variations, the rotation of a microphone array, into consideration and propose a sound field interpolation for ASP that is robust to the array's rotation, using a circular microphone array (CMA). That is, even if the CMA rotates, the proposed scheme enables the time-variant ATS to be regarded virtually as a time-invariant one, thereby enabling any conventional ASP to work well. A strength of the proposed method is its independence from any particular kind of downstream ASP. The conceptual diagram of the framework is shown in Fig. 1. Interpolation of the sound field before ASP virtually restores a time-invariant ATS. The proposed method utilizes the periodicity of the sound field on the circumference of a circle and the equivalence between sensing the sound field with an equally spaced CMA and discretizing the sound field. These two points and the noninteger sample shift theorem for discrete signals enable the estimation of the sound field at positions where there are no microphones by a simple calculation. Moreover, we apply this scheme to beamforming, one example of ASP, extend it to online beamforming, and confirm its efficacy via numerical simulation.
Our study, focusing on the CMA rotation, is related to two research topics: modal ASP and interpolation. First, it is necessary to touch on modal ASP, i.e., SH-domain processing or, in the case of restriction to two-dimensional space, circular or cylindrical harmonics (CH)-domain processing [19]. SH-domain processing decomposes the sound field into different directivities with characteristics such as monopole and dipole. Handling the frequency-wavenumber spectra of each directivity enables the control of the spatial filter and the generation of desired directivity patterns. It is often applied, for example, to beamforming [20], [21], [22], [23], [24], SCM estimation [25], sound field reproduction [26], and DOA estimation [14], [27]. In particular, the concept in [14] resembles that of our work, and they have a theoretical relationship (see the discussion in Section II-C2). Unlike the method in [14], our method works in the original signal domain, not in the SH or CH domain. This allows us to utilize any ASP techniques as they are and to make a connection with interpolation, as we describe next. This idea enables beamforming in the short-time Fourier transform (STFT) domain to be naturally extended from batch to online processing, working seamlessly even under CMA rotation. The other related topic is interpolation. Ueno et al. [28] interpolated the sound field of a desired area by solving an optimization problem using the Helmholtz equation. Yamaoka et al. [29] interpolated the generalized cross-correlation function of two sound sources via the sinc function to obtain their time difference at the noninteger sample level. Schüldt [30] introduced trigonometric interpolation to solve the problem of oscillation in polynomial beamforming [31] by using the symmetry and periodicity of the CMA, as in our method. These studies show that the acquisition of noninteger sample points improves estimation accuracy.
Our method applies the noninteger sample shift theorem to the sound field on a circle to achieve interpolation. It leads to a simple linear transformation for the equally spaced CMA. An expansion with unequally spaced CMA has been discussed in [32].
This paper includes some of the content of the conference paper [33] in which we reported the sound field interpolation and applied it to beamforming. The contribution of this paper is that we extended beamforming to online processing. In addition, note that we modified the formulation of the sound field interpolation for an even number of microphones because we found an error in the sign in equation (3) of [33]. Moreover, we rethought the handling of the Nyquist frequency component in the sound field interpolation and conducted new experiments. The remainder of this paper is organized as follows. In Section II, we explain the idea of our sound field interpolation and formulate it. Also, we discuss several points related to the formulation. In Section III, we describe how to apply the interpolation to beamforming as an example of multichannel signal processing with batch and online methodologies. Then, in Section IV, we evaluate the performance of the sound field interpolation itself using the signal-to-error ratio (SER) as a metric and measure the accuracy of beamforming as downstream processing. Finally, we conclude this paper in Section V.

II. SOUND FIELD INTERPOLATION

A. Formulation
First, we consider a continuous sound field on the circumference of a circle, z(θ), θ ∈ [0, 2π), as shown in Fig. 2. Here, θ indicates the spatial angle. Obviously, z is a periodic function with period 2π, although we cannot know its concrete form a priori. When we set M microphones on the circle's circumference at even intervals and record the sound field with them, the mth observed signal is represented as

z_m = z(2πm/M), m = 0, 1, ..., M − 1. (1)

That is, sensing a sound field with a CMA is equal to the discretization of the sound field along the spatial angle. Then, assuming that the maximum spatial frequency of z(θ) is less than half of the spatial sampling frequency on the circle's circumference, M/2π, according to the sampling theorem, we can say that the discrete signal z_m can reconstruct the continuous sound field z(θ). Generally speaking, this assumption is not strictly satisfied. For example, for a plane wave e^{jωt} coming from the direction of 0 rad, the Fourier coefficients of (1) are represented by the Bessel function of the first kind, as in Z_k = e^{jωt} ∫_0^{2π} e^{−j(kθ + rω cos θ / c)} dθ, where k is the order of the Fourier coefficient, c is the sound speed, and r is the circle's radius. As is well known, Z_k does not have finite support in k. However, for a large k, |Z_k| can be sufficiently small for practical use, depending on ω. Concretely, |Z_k| is almost negligible for k > ωr/c, as discussed by Alon and Rafaely [34]. Fig. 3 shows some examples of |Z_k| in the case of r = 0.05 m. We can see that |Z_k| is close to zero for a large k. On the basis of this observation, we proceed with the discussion, assuming that the sampling theorem holds. We will investigate the effect of this approximation error experimentally in a later section.
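As an illustrative numerical check of this decay (our own sketch, not code from the paper), the integral above can be evaluated directly on a periodic grid; for a 3 kHz plane wave and r = 0.05 m, ωr/c ≈ 2.7, and the coefficient magnitudes drop sharply beyond that order:

```python
import numpy as np

def coeff_mag(k, omega, r=0.05, c=343.0, n_grid=4096):
    # |Z_k| = |∫_0^{2π} exp(-j(kθ + (ωr/c) cos θ)) dθ| on a periodic grid;
    # the global factor e^{jωt} has unit modulus and is dropped.
    theta = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    integrand = np.exp(-1j * (k * theta + (omega * r / c) * np.cos(theta)))
    return abs(integrand.mean() * 2.0 * np.pi)

omega = 2 * np.pi * 3000.0               # 3 kHz, so ωr/c ≈ 2.7 for r = 0.05 m
mags = [coeff_mag(k, omega) for k in range(9)]
# the coefficients are O(1) up to k ≈ ωr/c and then decay rapidly
```

The rapid decay for k > ωr/c is what justifies treating the sampled field as approximately band-limited.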

B. Formulation of Linear Interpolation
To formulate the sound field interpolation, we utilize the noninteger sample shift theorem in the Fourier domain. We can say that the δ-sample-shifted signal z_m(δ) corresponds to the observation with the CMA rotated by Δ = 2πδ/M rad. Specifically, when we designate the sound field observed with the CMA in the reference position (i.e., not rotated) as z_m, the observation of the same sound field with the Δ-rad-rotated CMA is represented as z_m(δ) = z(2πm/M + Δ). From the shift theorem, z_m(δ) can be expressed as

z_m(δ) = Σ_{n=0}^{M−1} U_mn(δ) z_n. (2)

Although the shift theorem using the discrete Fourier transform (DFT) F_D does not strictly hold with a noninteger δ, we assume its satisfaction. The coefficient U_mn(δ) is defined as

U_mn(δ) = (1/M) Σ_{k=−⌊M/2⌋}^{⌈M/2⌉−1} e^{−j(2πk/M)(n−m−δ)}, (3)

where j = √−1; for even M, the k = −M/2 term is the Nyquist component discussed below. The detailed derivation and the closed form of (3) are provided in Appendix A. From (2), we can also represent the sound field interpolation as a matrix operation with the rotation transform matrix U(Δ), defined by

z(δ) = U(Δ) z, [U(Δ)]_mn = U_mn(δ), (4)

where z = [z_0, ..., z_{M−1}]^T and z(δ) = [z_0(δ), ..., z_{M−1}(δ)]^T. Note that U(Δ) is a cyclic and unitary matrix, and its components do not depend on the frequency of the observation.
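The interpolation of (2) and (3) can be checked numerically. The sketch below is our own illustration (the symmetric order band and sign conventions are one plausible reading of the formulation); for odd M, the kernel exactly recovers the rotated samples of a field band-limited to orders |k| ≤ (M − 1)/2:

```python
import numpy as np

def rotation_transform(M, delta):
    # Band-limited interpolation kernel of eq. (3):
    # U_mn(δ) = (1/M) Σ_k exp(-j(2πk/M)(n - m - δ)) over the symmetric order band.
    ks = np.arange(-(M // 2), (M - 1) // 2 + 1)
    m = np.arange(M)[:, None, None]
    n = np.arange(M)[None, :, None]
    return np.exp(-2j * np.pi * ks * (n - m - delta) / M).sum(-1) / M

# A field band-limited to orders |k| <= 2 is exactly recovered with M = 5 mics.
M, delta = 5, 0.37                       # δ samples <=> Δ = 2πδ/M rad of rotation
rng = np.random.default_rng(0)
coef = rng.standard_normal(5) + 1j * rng.standard_normal(5)
field = lambda th: sum(ck * np.exp(1j * k * th) for ck, k in zip(coef, range(-2, 3)))
theta_m = 2 * np.pi * np.arange(M) / M   # reference microphone angles
U = rotation_transform(M, delta)
# U @ field(theta_m) equals the samples observed after rotating the CMA by Δ
```

Applying U to the reference samples reproduces the rotated observation to machine precision in this band-limited case.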

C. Discussion

1) Discussion of Nyquist Frequency Component:
The major difference between odd and even M in (3) is whether or not we must handle the Nyquist frequency (NyqF) component. It is generally known that we cannot identify whether the NyqF component is positive or negative. Thus, a noninteger sample shift of an even-numbered-point signal requires handling of the shifted NyqF component. As shown in (3), for even M, the term containing (−1)^{n−m} e^{−jδπ} corresponds to the NyqF component. For example, supposing a noninteger shift of a real-valued even-numbered-point signal via (2) and (3), this complex-valued term translates the signal into a complex-valued signal, causing a contradiction. To avoid this negative effect, we consider two alternatives. One is to substitute zero into δ in only this term. The other is to take only the real part of this term. We respectively call them "zero phase NyqF (ZPN)" and "real part of NyqF (ReN)" in this paper. In addition, we call the case in which the complex value of the term is used directly "CxN" to discriminate among them.
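Under our reading of these definitions (the concrete forms below are our own reconstruction, not the paper's code), a quick numerical check shows that only ZPN and ReN keep a real-valued even-M observation real after a noninteger shift:

```python
import numpy as np

def rotation_transform_even(M, delta, nyquist="ReN"):
    # Kernel of eq. (3) for even M, with the NyqF (k = -M/2) term
    # (-1)^{n-m} * phase / M handled in three ways:
    #   "CxN": phase = exp(-jπδ)  (complex value used directly)
    #   "ZPN": phase = 1          (δ = 0 substituted in this term only)
    #   "ReN": phase = cos(πδ)    (real part of the term)
    assert M % 2 == 0
    m = np.arange(M)[:, None]
    n = np.arange(M)[None, :]
    ks = np.arange(-M // 2 + 1, M // 2)                  # band without NyqF
    U = np.exp(-2j * np.pi * ks * (n[..., None] - m[..., None] - delta) / M).sum(-1) / M
    phase = {"CxN": np.exp(-1j * np.pi * delta), "ZPN": 1.0, "ReN": np.cos(np.pi * delta)}[nyquist]
    return U + ((-1.0) ** (m - n)) * phase / M

z = np.random.default_rng(1).standard_normal(6)          # real observation, M = 6
imag_leak = {v: np.abs((rotation_transform_even(6, 0.3, v) @ z).imag).max()
             for v in ("CxN", "ZPN", "ReN")}
# only CxN leaks energy into the imaginary part of a real-valued signal
```

The band part of the kernel is real because the ±k terms pair into conjugates; only the treatment of the unpaired NyqF term decides whether realness is preserved.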
2) Relationship to Spherical Harmonics Expansion: This sound field interpolation has a close relationship with the SH expansion on the two-dimensional circle circumference, i.e., the CH expansion [19]. It is particularly meaningful to compare the rotation transform matrix (4) with the rotation matrix in the previous work by Tourbabin and Rafaely [14]. They proposed a rotation matrix using Wigner's D-matrix in the SH domain, R(θ), as follows:

R(θ) = diag(e^{jΨθ}, ..., e^{jθ}, 1, e^{−jθ}, ..., e^{−jΨθ}),

where diag(·) is the diagonal matrix whose elements are the arguments, and Ψ is the SH order. On the other hand, coming back to the beginning of our derivation of the sound field interpolation, i.e., the sample shift in the Fourier domain, the kth Fourier coefficient of the shifted signal is

Z_k(δ) = e^{j(2πk/M)δ} Z_k = e^{jkΔ} Z_k. (7)

By translating (7) into an equation with the DFT matrix F, we can formulate

z(δ) = F^{−1} E(Δ) F z, (8)

where E(Δ) = diag(e^{jk_0 Δ}, ..., e^{jk_{M−1} Δ}), with k_n = n for n < ⌈M/2⌉ and k_n = n − M otherwise, is the matrix representing the phase rotation for the δ-sample shift, and ⌈·⌉ indicates the ceiling function. By multiplying (8) by the matrix F from the left, we obtain the following equation:

F z(δ) = E(Δ) F z.

Multiplying the multichannel observation vectors z(δ) and z by F constitutes the CH transform of the sound field. Therefore, our sound field interpolation method is equivalent to multiplying the observation by E(Δ) in the CH domain. Here, we raise a concrete example with the CH order Ψ = 2, i.e., the number of microphones M = 5. In this case, the rotation matrix in [14] corresponding to the second order is transcribed in the DFT ordering as

R(Δ) = diag(1, e^{jΔ}, e^{j2Δ}, e^{−j2Δ}, e^{−jΔ}),

which is equal to the phase rotation matrix E(Δ) above. In addition, (8) shows the diagonalization of the rotation transform matrix, U(Δ) = F^{−1} E(Δ) F, by the DFT matrix, which implies that U(Δ) is a cyclic matrix.
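These properties can be verified numerically. The following sketch (our own, with the order mapping k_n chosen as one plausible convention) builds U(Δ) = F^{−1} E(Δ) F and checks the unitary and cyclic (circulant) structure stated above:

```python
import numpy as np

M, delta = 5, 0.42
Delta = 2 * np.pi * delta / M
F = np.fft.fft(np.eye(M))                                # DFT matrix
ks = np.where(np.arange(M) < (M + 1) // 2, np.arange(M), np.arange(M) - M)
E = np.diag(np.exp(1j * ks * Delta))                     # phase rotation per CH order
U = np.linalg.inv(F) @ E @ F                             # U(Δ) = F^{-1} E(Δ) F

unitary = np.allclose(U.conj().T @ U, np.eye(M))         # |E| entries are 1
cyclic = np.allclose(U, np.roll(np.roll(U, 1, 0), 1, 1)) # circulant structure
```

Any matrix diagonalized by the DFT matrix is circulant, so the cyclic property follows directly from the factorization.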

D. Versatility
Although in this paper we demonstrate only beamforming as downstream processing, (4) shows the versatility of the proposed method. That is, as long as the multichannel observation by the CMA is available, any downstream ASP is applicable, e.g., source separation, DOA estimation, and sound source localization. As one example, in [35], the self-rotation angle of the CMA is estimated on the basis of our method. Moreover, another advantage is the adaptability to rapid rotation. Conventional approaches for dynamic scenarios assume gradual temporal changes of the environment, making it difficult for them to adapt to abrupt sensor movement. In contrast, the proposed method performs sound field interpolation using the rotation angle information at each frame, enabling it to accommodate even rapid sensor movement. This advantage enables the extension from batch processing to online processing.

III. APPLICATION TO BEAMFORMING

A. Problem Setting
Although the sound field interpolation is available for various types of multichannel signal processing, such as blind source separation and source localization, this paper focuses on beamforming as the application of the interpolation. We consider the following conditions, as described in Section I. An example of this situation is when a humanoid robot or a human wearing a CMA on the head rotates the head to listen to ambient conversations carefully while the speakers talk without moving, as shown in Fig. 4.
1) The sound sources do not move, and the CMA only rotates about its center.
2) The steering vector of the target sound source observed by the CMA at the reference position, a f , is given, where f indicates the frequency index. The steering vector is often assumed to be obtained or estimated in advance in beamforming research [6], [36], [37].
3) The rotation angle for every time frame, θ_t, is given, where t indicates the time frame index. We can obtain this information using another module, e.g., a CMA with an acceleration sensor such as an inertial measurement unit [38], hardware measurement by an outer camera(s), or some sensor localization scheme [35], [39], [40]. Under the above conditions, we design two methodologies for applying the sound field interpolation to beamforming. One is the use of a pre-estimated spatial filter (Section III-B); the other is online spatial filtering (Section III-C). We assume the use of the minimum power distortionless response beamformer (MPDR-BF) [41]. In the following sections, we call the observation by the CMA in the reference position without rotation the "reference observation" and assume that the CMA is at the reference position at the start time.

B. Batch Processing With Predesigned Spatial Filter
In this process, we always use a fixed spatial filter w_f that is estimated in advance. In other words, we require a reference observation long enough to estimate the SCM V_f under the assumption that the sound sources and the CMA do not move during that time. Here, the MPDR-BF is formulated as

w_f = V_f^{−1} a_f / (a_f^H V_f^{−1} a_f), (12)

where V_f = (1/T) Σ_t x_tf x_tf^H is the SCM of the multichannel observation x_tf, assuming an ergodic process. Before beamforming, estimating the reference observation x̂_ref,tf by the following equation enables the direct use of the pre-estimated spatial filter:

x̂_ref,tf = U(−θ_t) x_tf, (13)
which means that the interpolation along the inverse rotation can virtually put the rotated CMA back into the reference position. Then, we can enhance the target source by using the filter (12) and the estimated reference observation as usual:

y_tf = w_f^H x̂_ref,tf = w_f^H U(−θ_t) x_tf. (14)

Another viewpoint on the pre-estimated beamformer: Beamforming with sound field interpolation by (14) results in filtering of the interpolated observation. At the same time, (14) indicates that we can obtain a spatial filter at a CMA position different from the reference position by using the interpolation, as follows:

y_tf = (U(−θ_t)^H w_f)^H x_tf =: w_f(−θ_t)^H x_tf.

The interpolated filter w_f(−θ) is expanded as

w_f(−θ) = U(−θ)^H w_f = Ṽ_f^{−1} ã_f / (ã_f^H Ṽ_f^{−1} ã_f),

where ã_f = U(−θ)^H a_f, Ṽ_f = U(−θ)^H V_f U(−θ), and we use the fact that U(−θ) is unitary. This formula might imply that interpolation of the observation and that of the steering vector are theoretically identical. One possibility based on this discussion is that the steering vector at the reference position and the sound field interpolation may enable the estimation of the steering vector of a sound source after rotation. Although this is not the main focus of this paper, it may be a topic of future work.
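As a compact numerical sketch of (12)-(14), the following code (our own illustration, with a hypothetical steering vector and a synthetic SCM; `rot_matrix` rebuilds U(Δ) via the diagonalization of Section II) undoes the rotation and applies a fixed MPDR filter:

```python
import numpy as np

def rot_matrix(M, delta):
    # U(Δ) via the diagonalization U = F^{-1} E(Δ) F (Section II)
    F = np.fft.fft(np.eye(M))
    ks = np.where(np.arange(M) < (M + 1) // 2, np.arange(M), np.arange(M) - M)
    return np.linalg.inv(F) @ np.diag(np.exp(2j * np.pi * ks * delta / M)) @ F

rng = np.random.default_rng(2)
M, delta = 5, 0.8                                 # Δ = 2πδ/M rad of rotation
a = np.exp(2j * np.pi * rng.random(M))            # toy steering vector a_f (hypothetical)
X = np.outer(a, rng.standard_normal(400) + 1j * rng.standard_normal(400)) \
    + 0.1 * (rng.standard_normal((M, 400)) + 1j * rng.standard_normal((M, 400)))
V = X @ X.conj().T / X.shape[1]                   # SCM from reference observations

Vinv_a = np.linalg.solve(V, a)
w = Vinv_a / (a.conj() @ Vinv_a)                  # MPDR-BF of eq. (12)

x_ref = X[:, 0]                                   # one reference-position frame
x_rot = rot_matrix(M, delta) @ x_ref              # what the rotated CMA observes
x_hat = rot_matrix(M, -delta) @ x_rot             # eq. (13): undo the rotation
y = w.conj() @ x_hat                              # eq. (14): beamformer output
```

The distortionless constraint w^H a_f = 1 holds by construction, and the inverse rotation recovers the reference observation exactly for this band-limited model.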

C. Online Processing of Spatial Filter
Although the batch processing described in Section III-B can be a powerful solution when the ATS, except for the CMA rotation, is stationary, this condition is rarely satisfied strictly in the real world. Considering the situation in which the ATS changes slightly, we advocate online processing with sound field interpolation for beamforming and design an update algorithm in this section. We expect online processing to enable us to deal with slight variations of the ATS other than the CMA rotation. We introduce a well-known smoothing (forgetting) factor [42] to update the SCM in the online processing. In addition, we use a matrix inversion lemma, the Sherman-Morrison formula, which reduces the complexity of calculating the covariance inversion that appears in the MPDR-BF formulation.

Algorithm 1: Online Beamforming Update Algorithm With Sound Field Interpolation.
Firstly, we estimate the reference observation from the observation by (13), as in the batch processing. By using the interpolated observation, we can estimate the SCM at the tth frame, V̂_tf, from that of the (t − 1)th frame, V̂_(t−1)f, and the smoothing factor α as follows:

V̂_tf = α V̂_(t−1)f + (1 − α) x̂_ref,tf x̂_ref,tf^H.

Such a formulation with a smoothing factor has often been seen in various studies on online signal processing [43], [44]. Furthermore, the Sherman-Morrison formula enables the calculation of the inverse of V̂_tf with low complexity and its application to MPDR beamforming (12) in every time frame. The inverse is formulated as

V̂_tf^{−1} = (1/α) [ V̂_(t−1)f^{−1} − (1 − α) V̂_(t−1)f^{−1} x̂_ref,tf x̂_ref,tf^H V̂_(t−1)f^{−1} / (α + (1 − α) x̂_ref,tf^H V̂_(t−1)f^{−1} x̂_ref,tf) ].

After that, updating the spatial filter using this inverse matrix enhances the target source online. Algorithm 1 illustrates the whole pseudocode summarizing the above formulation for framewise online processing. Note that V̂_tf^{−1} can be initialized with, for example, a random matrix, the identity matrix, or the inversion of x_tf x_tf^H averaged over the first several time frames.
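The recursive inverse update can be sketched as follows (our own illustration of the smoothing update V̂_tf = α V̂_(t−1)f + (1 − α) x̂ x̂^H combined with Sherman-Morrison, verified against direct inversion):

```python
import numpy as np

def sm_update(Vinv_prev, x, alpha):
    # Inverse of V_t = α V_{t-1} + (1-α) x x^H, given V_{t-1}^{-1},
    # via the Sherman-Morrison formula (O(M^2) instead of O(M^3)).
    u = Vinv_prev @ x
    denom = alpha + (1 - alpha) * np.real(x.conj() @ u)  # x^H V^{-1} x is real for Hermitian V
    return (Vinv_prev - ((1 - alpha) / denom) * np.outer(u, u.conj())) / alpha

rng = np.random.default_rng(3)
M, alpha = 4, 0.99
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
V_prev = A @ A.conj().T + np.eye(M)                  # Hermitian positive definite SCM
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
Vinv_new = sm_update(np.linalg.inv(V_prev), x, alpha)
V_new_direct = np.linalg.inv(alpha * V_prev + (1 - alpha) * np.outer(x, x.conj()))
# Vinv_new matches the direct inversion without recomputing a full inverse per frame
```

The rank-one structure of the update is what makes the per-frame cost quadratic rather than cubic in the number of microphones.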

IV. EXPERIMENTAL EVALUATION

A. Experimental Condition
We conducted computational simulations to evaluate the performance of the proposed sound field interpolation and its influence on ASP. We used eight samples (four female and four male voices) with a sampling rate of 16 kHz from the SiSEC database [45] as anechoic sound sources. We generated observed signals as convolutive mixtures using room impulse responses (RIRs) simulated by the RIR generator [46] on the basis of the image method [47]. In this environment, the reverberation time RT60 was approximately 100 ms. We used such a small RT60 for the conceptual confirmation of the proposed method, although the method is not theoretically affected by reverberation. We mixed two arbitrary sources selected from among the eight sources from different directions so that the angle between the two sources was 30, 60, ..., 180 deg, as shown in Fig. 5. In this manner, we simulated twelve environments (two patterns at each of six angles). We simulated the signals with an equally spaced M-channel CMA with a radius of 0.05 m. We set the reference position of the CMA such that the first-channel microphone is in the positive direction of the horizontal axis. Also, we simulated the same sound field with the CMA rotated by Δ rad. Then, we estimated the observation signals at the reference position from the signals obtained after rotation, using the rotation angle φ = 180Δ/π deg as a known value. For analysis, we conducted the STFT using a 1/8-shifted Hamming window with a length of 64 ms. We performed two simulations. First, we evaluated the sound field interpolation performance using the SER defined as

SER = 10 log_10 ( Σ_{t,f} |x_mtf|^2 / Σ_{t,f} |x_mtf − x̂_mtf|^2 ),

where x_mtf is the mth-channel STFT complex-valued spectrum at the tth time frame and fth frequency bin, and x̂_mtf is its estimate. We set the number of microphones M in the range of 3-8 and varied the rotation angle φ. We evaluated the case of one source; that is, we did not mix sound sources in the first simulation.
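The SER metric above is straightforward to implement; the following small helper (our own sketch of the definition) computes it in decibels for complex STFT arrays:

```python
import numpy as np

def ser_db(x, x_hat):
    # SER = 10 log10( Σ|x|^2 / Σ|x - x̂|^2 ), summed over time frames and frequency bins
    return 10.0 * np.log10(np.sum(np.abs(x) ** 2) / np.sum(np.abs(x - x_hat) ** 2))
```

For example, an estimate that is uniformly 10% off in amplitude yields an SER of about 20 dB.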
Second, we evaluated the source enhancement performance with an MPDR-BF [41] after interpolation using the signal-to-distortion ratio (SDR) and source-to-interference ratio (SIR) [48] in two ways as described in Section III. The former is batch processing described in Section III-B, and the latter is online processing in Section III-C. In the former experiment, we set M = 5 and φ as 10, 20, 30, 36, and 40 deg. The rotational angle of 36 deg corresponds to a 0.5 sample shift along the circle because the angle between the two microphones is 72 deg. In the latter experiment, we set M = 5 and 8, used sound sources with a long time length, and changed the CMA position twice halfway through playing the sound. The detailed setup will be explained in the following section (Section IV-E1). In either case, to estimate the filter, we used the relative transfer function (RTF) [49] calculated using RIRs from the target source to each microphone.
In addition to the above, we evaluated the robustness against microphone position errors. Sound field interpolation requires an equally spaced CMA, while in some cases it is difficult to locate the microphones at completely equal intervals. In this analysis, we modeled the microphone position perturbation as the deviation from the equally spaced positions along the same circumference, following a zero-mean Gaussian distribution. The details are given in Section IV-D. As will be shown, the higher frequency components were not easy to estimate, and there was little variance among channels.

B. Interpolation Accuracy
To simplify the analysis, we restricted the frequency range to 0-3 kHz in the SER and averaged the SERs in decibels. Fig. 7 shows the dependence of the SER on the rotation angle, where the vertical axis is the average of the SERs in the eight environments with M = 5 and 8. Also, the baseline illustrates the SER without interpolation. As shown, the behavior is periodic. This is because the CMA rotation by the angle between adjacent microphones merely shifts the microphone index; e.g., when M = 5 and the rotation is 72 deg, the corresponding permutation is (z_0, z_1, z_2, z_3, z_4) → (z_1, z_2, z_3, z_4, z_0).

Fig. 8 shows channelwise SER improvements in the frequency range up to 3 kHz relative to the case without sound field interpolation. One might expect that using more microphones would always result in better performance because the spatial sampling rate increases. Interestingly, however, these results illustrate that using more microphones does not always improve the SER, as in the case of changing the number of microphones M from 3 to 4 in the CxN case, i.e., when the NyqF component is kept complex. As explained in Section II-C1, in ZPN, the NyqF component does not contribute to the interpolation, except at angles where e^{jδπ} = 1. When M is even, one Fourier component (the NyqF component) is thus wasted for interpolation. For example, in the ZPN case with M = 4, one of four components, i.e., 25% of the information, is wasted. Therefore, using odd M is sometimes more efficient than using even M in our method if the available number of microphones is three or four. By contrast, ReN always improves the SER with increasing M. From these results, ReN is judged to be effective for sound field interpolation, and we use ReN in the following evaluations when M is even.
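The index-shift behavior at integer-sample rotations can be confirmed directly (our own check; δ = 1 sample corresponds to 72 deg for M = 5):

```python
import numpy as np

M = 5
F = np.fft.fft(np.eye(M))
ks = np.where(np.arange(M) < (M + 1) // 2, np.arange(M), np.arange(M) - M)
U_1sample = np.linalg.inv(F) @ np.diag(np.exp(2j * np.pi * ks * 1.0 / M)) @ F  # δ = 1, i.e., 72 deg
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])    # arbitrary reference samples z_0..z_4
shifted = (U_1sample @ z).real              # integer shift => pure index permutation
```

An integer-sample rotation therefore introduces no interpolation error at all, which explains the periodic dips of the SER curves.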

C. Source Enhancement with Batch Processing
First, we produced the MPDR-BF filter w using the RTF and the multichannel STFT spectrogram observed with the CMA at the reference position (no rotation). We applied w to the following three spectrograms: without CMA rotation (No-Rot), without interpolation when the CMA rotates (No-Int), and with interpolation when the CMA rotates (Int). Also, we applied another MPDR-BF, calculated from the same steering vector and the interpolated spectrogram, to the interpolated spectrogram (Int+Re-est). We can use this method to predict the performance of the online beamforming described in the next section. We also used the unprocessed case (No-Proc), i.e., the observation, and No-Rot as baselines and compared them with the other three cases. No-Rot shows the best performance of the MPDR-BF when the ATS does not change. Fig. 9 shows that small changes in the ATS greatly affect the ASP performance, and that the proposed method (Int) outperforms the case without interpolation (No-Int) and comes close to the best performance (No-Rot). The degradation in the No-Int case along the rotation angle resembles the SER curve of the "5ch baseline" in the range of 0-40 deg in Fig. 7. This shows that the distortion of the observation itself directly affects the SDR and SIR. One of the reasons why the Int+Re-est case did not perform better than the Int case is the mismatch between the SCM estimated from the interpolated spectrogram and the pre-estimated steering vector. The mismatch interferes with the production of a spatial filter that suppresses the interfering signal and degrades the enhancement performance. Because such a mismatch might also occur in the extension to online processing, we expect the performance of online processing to be similar. The proposed method could not estimate the high-frequency components, as shown in the SER results (Fig. 6), but it significantly improved the ASP performance, especially in batch processing.

D. Robustness Evaluation
In this evaluation, we gave the microphone position perturbation as the deviation from the equally spaced microphone positions and confirmed the robustness of the sound field interpolation and downstream ASP against the perturbation. In this experiment, we assumed that the position error of each microphone, ε_m, follows the zero-mean Gaussian distribution, ε_m ∼ N(0, ς²). We conducted 100 trials by varying the variance ς² from 1 to 10 deg²; for example, ς² = 10 means that 99.7% of the perturbations lie within 3ς (≈ 10 deg) from the property of the Gaussian distribution. Fig. 10 shows the boxplots of the SERs obtained with M = 5, 6, 7, and 8 when one sound source is present and the rotation angle is 10 deg. The horizontal axis is ς². As shown in the figure, a perturbation of several degrees affects the performance, although the degradation of the SER was within approximately 1 dB when the position errors were small. We can also find that the larger the number of microphones M, the more easily the SER degrades. Conceivable reasons for this observation include the fact that a smaller angle between adjacent microphones is more sensitive to perturbation and that a larger number of microphones results in a larger total amount of perturbation.

Table I

E. Online Beamforming

1) Setup: The experimental conditions are almost the same as those described in Section IV-C. The differences are as follows: we used two source signals with a time length of 40 s and simulated impulse responses with a reverberation time of RT60 = 330 ms, which corresponds to a typical office room, in addition to 100 ms. The two sources were located at the angles of π/6 and 4π/6 rad with the same alignment as shown in Fig. 5. We set the frame length to 4096 samples (256 ms). We also used the 1 s-wise SDR, i.e., the segmental SDR, to evaluate the source enhancement performance and confirmed the effect of changing the smoothing factor α.
We initialized the inverse of the SCM, V̂_tf^{−1}, as the inverse of x_tf x_tf^H averaged over the first 10 frames (≈ 320 ms). The CMA rotated twice; the first rotation, to θ_1 deg, occurred at 10 s, and the second, to θ_2 deg, at 25 s. We considered the two cases of (θ_1, θ_2) = (30, 60) and (72, 0) deg. We generated such observations by concatenating the observations of three different microphone array positions in the simulation. Note that 0 deg means the reference position. In addition, we designed the direction-dependent covariance (DDC) method as an additional baseline of the online approach. In DDC, the SCM is reset at the time of rotation and preserved. Then, when the CMA position moves to a previously visited angle, the preserved SCM is used immediately. In other words, DDC is block processing based on angle, not time, and has an anglewise dictionary of SCMs.
2) Segmental SDR: Fig. 11 shows the segmental SDRs in the two patterns of CMA rotation. Here, we used the smoothing factor α = 0.99, which produced the highest segmental SDR in a preliminary experiment. As shown in the top four figures (0 ⇒ 72 ⇒ 0 deg), Int-online outperforms No-Int-online and No-Int-DDC-online and is close to No-Rot because the interpolation is achieved completely at the rotation of 72 deg when M = 5, as described in the previous section. When M = 8, although the interpolation is not achieved completely, high performance is maintained because more microphones are used. This tendency is the same regardless of the reverberation time. The difference between Int and Int-online is only the time length of the observation used for estimating the spatial filter; Int-online uses only the information in the previous time frames, and therefore, it degrades the SDR slightly more than Int. In comparison, the bottom four graphs show different trends with respect to the reverberation time and the number of microphones. In the case of the rotation of 0 ⇒ 30 ⇒ 60 deg, Int-online improves the SDR at RT60 = 100 ms, although it cannot improve the SDR at RT60 = 330 ms because of incomplete interpolation, in contrast to the case above where the interpolation is achieved completely. The estimation accuracy of the spatial filter is another possible reason for the degradation. However, this possibility is denied by the result that No-Rot works well while Int does not. A more likely reason is that room reflections invalidate the plane-wave assumption in the sound field interpolation. This degradation can be mitigated by using more microphones, although a more severely reverberant environment will require even more microphones to achieve source enhancement.
Moreover, the comparison between No-Int-online and No-Int-DDC-online is interesting. Their rough trends in SDR are almost the same; that is, both SDRs remain degraded during rotation in every scenario. Considering that the only difference between the two methods is the update of the SCM, this demonstrates that even if the SCM is updated immediately, a mismatch between the SCM and the steering vector always occurs as long as the steering vector is fixed (as explained in problem setting 2) in Section III-A). Where they differ is in how quickly the SDR recovers. In the 0 ⇒ 72 ⇒ 0 scenario, after the second rotation, which returns the array to its initial position, the SCM is updated correctly after several frames and the mismatch is resolved. Therefore, No-Int-DDC-online reaches the same SDR as Int-online more rapidly than No-Int-online, owing to the immediate reset of the internal statistics. In contrast, the mismatch is never resolved in the 0 ⇒ 30 ⇒ 60 scenario, and the SDR does not improve. These observations indicate the necessity of steering vector estimation for handling microphone movement without interpolation.
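The mismatch discussed above concerns the two inputs of the MPDR beamformer, which combines the SCM V and the steering vector a as w = V^{-1} a / (a^H V^{-1} a). A minimal sketch of the standard MPDR filter computation (names illustrative):

```python
import numpy as np

def mpdr_filter(V, a):
    """Minimum power distortionless response beamformer:
    w = V^{-1} a / (a^H V^{-1} a), so that w^H a = 1 (distortionless)."""
    Vinv_a = np.linalg.solve(V, a)          # V^{-1} a without explicit inverse
    return Vinv_a / (a.conj() @ Vinv_a)
```

If the array rotates and V adapts to the new geometry while a stays fixed, the distortionless constraint w^H a = 1 no longer corresponds to the true source direction, which is the persistent mismatch observed for No-Int-online and No-Int-DDC-online.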

F. Discussion
In this study, we assumed that there were no rigid bodies, such as reflectors, inside the CMA during the evaluation. Our approach is based solely on the Fourier series expansion of the sound field on a circle. Even if a rigid body exists inside the CMA, the smoothness of the sound field on the circle might not be significantly affected, particularly when the body is rotationally symmetric, such as a sphere or a human head. Therefore, our approach is expected to work well in such cases. Indeed, we have confirmed that the proposed interpolation technique works well even in the presence of such rigid bodies inside the CMA, in ongoing work toward a real-world system for self-rotation-angle estimation [35].
Possible sensor movements in 3D space comprise three translations along the axes and three rotations: "pitch", "yaw", and "roll". In this study, we focused on "yaw", which corresponds to head rotation in the horizontal plane, for two reasons: 1) it involves rapid motion, and 2) the head may not immediately return to its original position. Translational motion of the head along the three (usually two) axes tends to be slower, as it occurs during activities such as walking. In addition, "pitch" and "roll" (tilting and nodding of the head) typically end with the head rapidly returning to its initial position. We therefore consider interpolation against "yaw" to be more important than that against "pitch" and "roll". A spherical microphone array might be required to handle all types of motion, which we will consider in future studies.

V. CONCLUSION
We presented a new framework of beamforming robust to CMA rotation using a sound field interpolation method and applied it to batch and online beamforming. The interpolation method virtually regards the time-variant ATS as a time-invariant one by exploiting the periodicity of the sound field observed with the CMA and its noninteger sample shift. Experimental results illustrated that our simple method can estimate the lower-band spectrum and assist ASP even when the CMA rotates. Future work includes improving the estimation accuracy of the higher-frequency components by another approach, conducting additional experiments with other ASP methods, e.g., source separation and source localization, and confirming the performance with a real device in real environments. Although the proposed method exploits the characteristics of the CMA, its extension to an arbitrary array configuration is also important future work.

ACKNOWLEDGMENT
The authors would like to thank S. Luan and Prof. Toda of Nagoya University, who pointed out a misformulation.

APPENDIX
DERIVATION OF SOUND FIELD INTERPOLATION
We consider the case where M is even. Let the Fourier coefficients of z_m (m = 0, ..., M − 1) be Z_k (k = −M/2 + 1, ..., M/2). Then, the noninteger sample shift δ is formulated using the shift theorem as

z_{m+δ} = (1/M) Σ_{k=−M/2+1}^{M/2} Z_k e^{j2πk(m+δ)/M}.

Kouei Yamaoka (Student Member, IEEE) received the B.Sc. degree in information engineering and the M.E. degree in engineering from the University of Tsukuba, Tsukuba, Japan, in 2017 and 2019, respectively. He is currently working toward the Ph.D. degree with Tokyo Metropolitan University, Hino, Japan. His research interests include acoustic signal processing, signal enhancement, source localization, and asynchronous distributed microphone arrays. He is a Member of the Acoustical Society of Japan.
Nobutaka Ono (Senior Member, IEEE) received the B.E., M.S., and Ph.D. degrees in mathematical engineering and information physics from the University of Tokyo, Tokyo, Japan, in 1996, 1998, and 2001, respectively.
He was a Research Associate with the University of Tokyo in 2001 and became a Lecturer in 2005. He then became an Associate Professor with the National Institute of Informatics, Tokyo, Japan, in April 2011, and a Professor in 2017. In 2017, he moved to Tokyo Metropolitan University, Hino, Japan. He is the author or coauthor of more than 280 articles in international journals and peer-reviewed conference proceedings. His research interests include acoustic signal processing, especially microphone array processing, source localization and separation, machine learning, and optimization algorithms. He was a tutorial speaker at ISMIR 2010 and ICASSP 2018.