Enhancing target speech based on nonlinear soft masking using a single acoustic vector sensor

Abstract: Enhancing speech captured by distant microphones is a challenging task. In this study, we investigate the multichannel signal properties of the single acoustic vector sensor (AVS) to obtain the inter-sensor data ratio (ISDR) model in the time-frequency (TF) domain. Then, the monotone functions describing the relationship between the ISDRs and the direction of arrival (DOA) of the target speaker are derived. For the target speech enhancement (SE) task, the DOA of the target speaker is given, and the ISDRs are calculated. Hence, the TF components dominated by the target speech are extracted with high probability using the established monotone functions, and then, a nonlinear soft mask of the target speech is generated. As a result, a masking-based speech enhancement method is developed, which is termed the AVS-SMASK method. Extensive experiments with simulated data and recorded data have been carried out to validate the effectiveness of our proposed AVS-SMASK method in terms of suppressing spatial speech interferences and reducing the adverse impact of the additive background noise while maintaining less speech distortion. Moreover, our AVS-SMASK method is computationally inexpensive, and the AVS is of a small physical size. These merits are favorable to many applications, such as robot auditory systems.


Introduction
With the development of information technology, intelligent service robots will play an important role in smart home systems. Auditory perception is one of the key technologies of intelligent service robots [1]. Research has shown that special attention is currently being given to human-robot interaction [2], and to speech interaction in particular [3,4]. Service robots usually work in noisy environments, where directional spatial interferences such as competing speakers at different locations, air conditioners, and so on may be present. As a result, additive background noise and spatial interferences significantly deteriorate the quality and intelligibility of the target speech, and speech enhancement (SE) is considered the most important preprocessing technique for speech applications such as automatic speech recognition [5].
Single-channel SE and two-channel SE techniques have been studied for a long time, while practical applications have a number of constraints, such as limited physical space for installing large-sized microphones. The well-known single-channel SE methods, including spectral subtraction, Wiener filtering, and their variations, are successful in suppressing additive background noise, but they are not able to suppress spatial interferences effectively [6]. Besides, mask-based SE methods have been widely applied in many SE and speech separation applications [7]. The key idea behind mask-based SE methods is to estimate a spectrographic binary or soft mask to suppress the unwanted spectrogram components [7][8][9][10][11]. For binary mask-based SE methods, the spectrographic masks are "hard binary masks", where a spectral component is set to 1 for a target speech component and to 0 for a non-target speech component. Experimental results have shown that the performance of binary mask SE methods degrades as the signal-to-noise ratio (SNR) decreases, and the masked spectrum may lose speech components due to the harsh black-or-white binary decision [7,8]. To overcome this disadvantage, soft mask-based SE methods have been developed [8]. In soft mask-based SE methods, each time-frequency component is assigned a probability linked to the target speech. Compared to the binary mask SE methods, the soft mask SE methods have shown better capability to suppress the noise with the aid of some a priori information. However, the a priori information may vary with time, and obtaining it is not an easy task.
By further analyzing the mask-based SE algorithms, we have the following observations. (1) It is a challenging task to estimate a good binary spectrographic mask. When noise and competing speakers (speech interferences) exist, the speech enhanced by the estimated mask often suffers from the phenomenon of "musical noise". (2) The direction of arrival (DOA) of the target speech is considered as a known parameter for the target SE task. (3) A binaural microphone and an acoustic vector sensor (AVS) are considered the most attractive front ends for speech applications due to their small physical size. The physical size of an AVS is about 1-2 cm³, and the AVS also has merits such as signal time alignment and a trigonometric relationship of signal amplitudes [12][13][14][15][16]. A high-resolution DOA estimation algorithm with a single AVS has been proposed by our team [12][13][14][15][16]. Some effort has also been made for the target SE task with one or two AVS sensors [17][18][19][20][21]. For example, with the minimum variance distortionless response (MVDR) criterion, Lockwood et al. developed a beamforming method using the AVS [17]. Their experimental results showed that their proposed algorithm achieves good performance for suppressing noise, but introduces a certain distortion of the target speech.
As discussed above, in this study, we focus on developing the target speech enhancement algorithm with a single AVS from a new technical perspective in which both the ambient noise and non-target spatial speech interferences can be suppressed effectively and simultaneously. The problem formulation is presented in Section 2. Section 3 shows the derivation of the proposed SE algorithm. The experimental results are given in Section 4, and conclusions are drawn in Section 5.

Problem Formulation
In this section, the sparsity of speech in the time-frequency (TF) domain is discussed first. Then, the AVS data model and the corresponding inter-sensor data ratio (ISDR) models, which were developed by our team in a previous work [13], are presented for completeness. After that, the derivation of monotone functions between ISDRs and the DOA is given. Finally, the nonlinear soft TF mask estimation algorithm is derived.

Time-Frequency Sparsity of Speech
In speech signal processing research, the TF sparsity of speech is a widely accepted assumption. More specifically, when there is more than one speaker in the same space, the speech TF sparsity implies the following [5]. (1) It is likely that only one speaker is active during certain time slots. (2) For the same time slot, if more than one speaker is active, it is probable that different TF points are dominated by different speakers. Hence, the TF sparsity of speech can be modeled as:

$$S_m(\tau,\omega)\,S_n(\tau,\omega) = 0, \quad m \neq n, \tag{1}$$

where S_m(τ,ω) and S_n(τ,ω) are the speech spectra at (τ,ω) for the mth speaker and nth speaker, respectively. (3) In practice, at a specific TF point (τ,ω), it is most probably true that only one speech source with the highest energy dominates, and the contributions from the other sources are negligible.

AVS Data Model
An AVS unit generally consists of J co-located constituent sensors, including one omnidirectional sensor (denoted as o-sensor) and J-1 orthogonally oriented directional sensors. Figure 1 shows the data capture setup with a single AVS. It is noted that the left bottom plot in Figure 1 shows a 3D-AVS unit implemented by our team, which consists of one o-sensor with three orthogonally oriented directional sensors denoted as the u-sensor, v-sensor, and w-sensor, respectively. In theory, the directional response of each oriented directional sensor has dipole characteristics, as shown in Figure 2a, while the omnidirectional sensor has the same response in all directions, as shown in Figure 2b. In this study, one target speaker is considered. As shown in Figure 1, the target speech s(t) impinges from (θ_s,φ_s); meanwhile, the interferences s_i(t) impinge from (θ_i,φ_i), where φ_s, φ_i ∈ (0°,360°) are the azimuth angles, and θ_s, θ_i ∈ (0°,180°) are the elevation angles.

To simplify the derivation, room reverberation is not considered, and the received data of the AVS can be modeled as [13]:

$$\mathbf{x}_{avs}(t) = \mathbf{a}(\theta_s,\phi_s)\,s(t) + \sum_{i=1}^{M_i} \mathbf{a}(\theta_i,\phi_i)\,s_i(t) + \mathbf{n}_{avs}(t), \tag{2}$$

where x_avs(t), n_avs(t), and a(θ_s,φ_s) are defined respectively as:

$$\mathbf{x}_{avs}(t) = [x_u(t),\; x_v(t),\; x_w(t),\; x_o(t)]^T, \tag{3}$$

$$\mathbf{n}_{avs}(t) = [n_u(t),\; n_v(t),\; n_w(t),\; n_o(t)]^T, \tag{4}$$

$$\mathbf{a}(\theta_s,\phi_s) = [u_s,\; v_s,\; w_s,\; 1]^T = [\sin\theta_s\cos\phi_s,\; \sin\theta_s\sin\phi_s,\; \cos\theta_s,\; 1]^T. \tag{5}$$

Here, x_u(t), x_v(t), x_w(t), and x_o(t) are the received data of the u-sensor, v-sensor, w-sensor, and o-sensor, respectively; n_u(t), n_v(t), n_w(t), and n_o(t) are assumed to be additive zero-mean white Gaussian noise captured at the u-sensor, v-sensor, w-sensor, and o-sensor, respectively; s(t) is the target speech; s_i(t) is the ith interfering speech; the number of interferences is M_i; and a(θ_s,φ_s) and a(θ_i,φ_i) are the steering vectors of s(t) and s_i(t), respectively.
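The anechoic data model above can be illustrated with a minimal Python sketch. It assumes real-valued, time-aligned sample lists and the channel order (u, v, w, o); the function names `steering_vector` and `avs_mixture` and the `noise_std` and `seed` parameters are hypothetical helpers introduced only for this illustration, not part of the original work.

```python
import math
import random


def steering_vector(theta_deg, phi_deg):
    """Steering vector [u, v, w, 1] for elevation theta and azimuth phi (degrees)."""
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    return [
        math.sin(theta) * math.cos(phi),  # u-sensor gain
        math.sin(theta) * math.sin(phi),  # v-sensor gain
        math.cos(theta),                  # w-sensor gain
        1.0,                              # o-sensor (omnidirectional)
    ]


def avs_mixture(target, target_doa, interferers, noise_std=0.0, seed=0):
    """Return the four channel signals (u, v, w, o) of an anechoic mixture.

    target: list of samples; target_doa: (theta_deg, phi_deg);
    interferers: list of (samples, (theta_deg, phi_deg)) with the same length.
    """
    rng = random.Random(seed)
    a_s = steering_vector(*target_doa)
    channels = [[gain * s for s in target] for gain in a_s]
    for samples, doa in interferers:
        a_i = steering_vector(*doa)
        for c in range(4):
            for t, s in enumerate(samples):
                channels[c][t] += a_i[c] * s
    if noise_std > 0.0:
        # Additive zero-mean white Gaussian noise, independent per channel.
        for c in range(4):
            for t in range(len(target)):
                channels[c][t] += rng.gauss(0.0, noise_std)
    return channels
```

For example, `avs_mixture(s, (45, 45), [(s1, (90, 135))], noise_std=0.01)` would produce a four-channel mixture of one target and one interferer, where `s` and `s1` are placeholder sample lists.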

Monotone Functions between ISDRs and the DOA
Definition and some discussions on the inter-sensor data ratio (ISDR) of the AVS are presented in our previous work [13]. In this subsection, we briefly introduce the definition of ISDR first, and then present the derivation of the monotone functions between the ISDRs and the DOA of the target speaker.
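The ISDRs between the channels of the AVS are defined as:

$$I_{ij}(\tau,\omega) = \frac{X_i(\tau,\omega)}{X_j(\tau,\omega)}, \quad i \neq j, \tag{13}$$

where i and j are the channel indices, each referring to u, v, w, or o. Obviously, there are 12 different computable ISDRs, which are shown in Table 1. In the following, we carefully evaluate I_ij, and it turns out that only three ISDRs (I_uv, I_vu, and I_wo) hold an approximate monotone relationship between the ISDR and the DOA of the target speaker. According to the definition of ISDRs given in Equation (13), we look at I_uv, I_vu, and I_wo first. Specifically, we have:

$$I_{uv}(\tau,\omega) = \frac{X_u(\tau,\omega)}{X_v(\tau,\omega)}, \tag{14}$$

$$I_{vu}(\tau,\omega) = \frac{X_v(\tau,\omega)}{X_u(\tau,\omega)}, \tag{15}$$

$$I_{wo}(\tau,\omega) = \frac{X_w(\tau,\omega)}{X_o(\tau,\omega)}. \tag{16}$$

Substituting Equations (9) and (10) into Equation (14) gives:

$$I_{uv}(\tau,\omega) = \frac{u_s + \varepsilon_{tus}(\tau,\omega)}{v_s + \varepsilon_{tvs}(\tau,\omega)}, \tag{17}$$

where ε_tus(τ,ω) = N_tu(τ,ω)/S(τ,ω), and ε_tvs(τ,ω) = N_tv(τ,ω)/S(τ,ω). Similarly, we get I_vu and I_wo:

$$I_{vu}(\tau,\omega) = \frac{v_s + \varepsilon_{tvs}(\tau,\omega)}{u_s + \varepsilon_{tus}(\tau,\omega)}, \tag{18}$$

$$I_{wo}(\tau,\omega) = \frac{w_s + \varepsilon_{tws}(\tau,\omega)}{1 + \varepsilon_{tos}(\tau,\omega)}. \tag{19}$$

In Equation (19), ε_tws(τ,ω) = N_tw(τ,ω)/S(τ,ω) and ε_tos(τ,ω) = N_to(τ,ω)/S(τ,ω). Based on the assumption of TF sparsity of speech given in Section 2.1, if the TF points (τ,ω) are dominated by the target speech from (θ_s,φ_s), the energy of the target speech is high, and the values of ε_tus(τ,ω), ε_tvs(τ,ω), ε_tws(τ,ω), and ε_tos(τ,ω) tend to be small. Then, Equations (17)-(19) can be approximated as:

$$I_{uv}(\tau,\omega) = \frac{u_s}{v_s} + \varepsilon_1(\tau,\omega), \tag{20}$$

$$I_{vu}(\tau,\omega) = \frac{v_s}{u_s} + \varepsilon_2(\tau,\omega), \tag{21}$$

$$I_{wo}(\tau,\omega) = w_s + \varepsilon_3(\tau,\omega), \tag{22}$$

where ε_1, ε_2, and ε_3 can be viewed as zero-mean ISDR modeling errors introduced by interferences and background noise. Moreover, ε_i(τ,ω) (i = 1, 2, 3) is inversely proportional to the local SNR at (τ,ω). Furthermore, from Equation (5), we have u_s = sinθ_s·cosφ_s, v_s = sinθ_s·sinφ_s, and w_s = cosθ_s. Then, substituting Equation (5) into Equations (20)-(22), we obtain the following equations:

$$I_{uv}(\tau,\omega) = \cot\phi_s + \varepsilon_1(\tau,\omega), \tag{23}$$

$$I_{vu}(\tau,\omega) = \tan\phi_s + \varepsilon_2(\tau,\omega), \tag{24}$$

$$I_{wo}(\tau,\omega) = \cos\theta_s + \varepsilon_3(\tau,\omega). \tag{25}$$

Equations (23)-(25) give the desired approximate monotone functions between I_uv, I_vu, and I_wo and the DOA (θ_s or φ_s) of the target speaker, since the corresponding inverse functions arccot, arctan, and arccos are all monotone.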
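As a quick numerical check of the relationships between I_uv, I_vu, I_wo and the DOA, the following Python sketch evaluates the three ISDRs at a single TF point in the idealized case where the point contains only the target speech (all ε terms are zero). The DOA and the spectrum value `S` are arbitrary illustrative numbers, and `isdr` is a hypothetical helper name.

```python
import math


def isdr(x_i, x_j):
    """Inter-sensor data ratio I_ij = X_i / X_j at one TF point."""
    return x_i / x_j


# Illustrative target DOA and target spectrum value at one TF point (assumed values).
theta_s, phi_s = math.radians(45.0), math.radians(60.0)
S = complex(0.8, -0.3)

# Noise-free channel spectra: X_u = u_s*S, X_v = v_s*S, X_w = w_s*S, X_o = S.
u_s = math.sin(theta_s) * math.cos(phi_s)
v_s = math.sin(theta_s) * math.sin(phi_s)
w_s = math.cos(theta_s)
X_u, X_v, X_w, X_o = u_s * S, v_s * S, w_s * S, S

I_uv = isdr(X_u, X_v)  # approximately cot(phi_s)
I_vu = isdr(X_v, X_u)  # approximately tan(phi_s)
I_wo = isdr(X_w, X_o)  # approximately cos(theta_s)

# Recover the angles from the real parts of the ISDRs.
phi_from_vu = math.degrees(math.atan(I_vu.real))
theta_from_wo = math.degrees(math.acos(I_wo.real))
```

In this noise-free setting the unknown spectrum S cancels in every ratio, which is why the ISDRs depend only on the DOA.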
However, except for I_uv, I_vu, and I_wo, the other ISDRs do not hold such a property. Let us take I_uw as an example. From the definition in Equation (13), we can get:

$$I_{uw}(\tau,\omega) = \frac{u_s}{w_s} + \varepsilon_4(\tau,\omega), \tag{26}$$

where ε_4 can be viewed as the zero-mean ISDR modeling error introduced by unwanted noise. Obviously, Equation (26) is valid only when w_s is not equal to zero. Substituting Equation (5) into Equation (26) yields:

$$I_{uw}(\tau,\omega) = \tan\theta_s \cos\phi_s + \varepsilon_4(\tau,\omega). \tag{27}$$

From Equation (27), we can see that I_uw is a function of both θ_s and φ_s. In summary, after analyzing all of the ISDRs, we find that the desired monotone functions between the ISDRs and θ_s or φ_s are those given in Equations (23)-(25). It is noted that Equations (23)-(25) are derived under the assumption that v_s, u_s, and w_s are not equal to zero. Therefore, we need to find out where v_s, u_s, and w_s are equal to zero. For clarity of presentation, let us define an ISDR vector I_isdr = [I_uv, I_vu, I_wo].
From Equation (5), it is clear that when the target speaker is at angles of 0°, 90°, 180°, or 270°, one of v_s, u_s, and w_s becomes zero, which means that I_isdr is not fully available. Specifically, we need to consider the following cases.

Case 1: the elevation angle θ_s is about 0° or 180°. In this case, u_s = sinθ_s·cosφ_s and v_s = sinθ_s·sinφ_s are close to zero. Then, the denominators in Equations (20) and (21) are close to zero, and we cannot obtain I_uv and I_vu, but we can get I_wo.

Case 2: θ_s is away from 0° and 180°. In this condition, we need to look at φ_s carefully.

(1) φ_s is about 0° or 180°. Then, v_s = sinθ_s·sinφ_s is close to zero, and the denominator in Equation (20) is close to zero, which makes I_uv invalid. In this case, we can compute I_vu and I_wo properly.

(2) φ_s is about 90° or 270°. Then, u_s = sinθ_s·cosφ_s is close to zero, and the denominator in Equation (21) is close to zero, which makes I_vu invalid. In this case, we can obtain I_uv and I_wo properly.

To visualize the discussion above, a decision tree for handling the special angles in computing I_isdr is plotted in Figure 3. When I_isdr = [I_uv, I_vu, I_wo] has been computed properly, with simple manipulation of Equations (23)-(25), we get:

$$\phi_s = \operatorname{arccot}\big(I_{uv}(\tau,\omega) - \varepsilon_1(\tau,\omega)\big), \tag{28}$$

$$\phi_s = \arctan\big(I_{vu}(\tau,\omega) - \varepsilon_2(\tau,\omega)\big), \tag{29}$$

$$\theta_s = \arccos\big(I_{wo}(\tau,\omega) - \varepsilon_3(\tau,\omega)\big). \tag{30}$$

In Equations (28)-(30), the arccot, arctan, and arccos functions are all monotone, which is what we expected. Besides, we also note that (θ_s,φ_s) is given, and I_uv, I_vu, and I_wo can be computed by Equations (14)-(16). However, ε_1, ε_2, and ε_3, which reflect the impact of noise and interferences, are unknown. According to the assumptions made in Section 2.1, if we are able to select the TF components (τ,ω) dominated by the target speech, and the local SNR at these (τ,ω) points is high, then ε_1, ε_2, and ε_3 can be ignored, since their values approach zero at these points. In such conditions, we obtain the desired formulas to compute (θ_s,φ_s):

$$\phi_s \approx \operatorname{arccot}\big(I_{uv}(\tau,\omega)\big), \quad \phi_s \approx \arctan\big(I_{vu}(\tau,\omega)\big), \quad \theta_s \approx \arccos\big(I_{wo}(\tau,\omega)\big). \tag{31}$$
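The case analysis for the special angles can be written as a small decision function. The sketch below is only an illustration of that logic: the name `valid_isdrs` and the tolerance `tol_deg`, which quantifies "about 0°" and similar phrases, are assumptions introduced here, since the text does not specify a numeric threshold.

```python
def valid_isdrs(theta_deg, phi_deg, tol_deg=5.0):
    """Return which of I_uv, I_vu, I_wo are usable for a given target DOA (degrees).

    tol_deg is a hypothetical tolerance deciding when an angle counts as
    "about" one of the special angles.
    """
    def near(angle, targets):
        # Smallest absolute angular distance to any target, with wrap-around.
        return any(abs((angle - t + 180.0) % 360.0 - 180.0) <= tol_deg
                   for t in targets)

    if near(theta_deg, (0.0, 180.0)):
        # Case 1: u_s and v_s are both close to zero, only I_wo is usable.
        return {"I_uv": False, "I_vu": False, "I_wo": True}
    if near(phi_deg, (0.0, 180.0)):
        # Case 2 (1): v_s is close to zero, so I_uv = X_u / X_v is invalid.
        return {"I_uv": False, "I_vu": True, "I_wo": True}
    if near(phi_deg, (90.0, 270.0)):
        # Case 2 (2): u_s is close to zero, so I_vu = X_v / X_u is invalid.
        return {"I_uv": True, "I_vu": False, "I_wo": True}
    return {"I_uv": True, "I_vu": True, "I_wo": True}
```

A caller would then use only the ISDRs flagged as usable when estimating the DOA at each TF point.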

Nonlinear Soft Time-Frequency (TF) Mask Estimation
As discussed above, Equation (31) is valid when the (τ,ω) points are dominated by target speech with high local SNR. Besides, we have three equations to solve for two variables, θ_s and φ_s. In this study, from Equation (31), we estimate θ_s and φ_s in the following way: where ∆η_1 and ∆η_2 are estimation errors. Comparing Equation (31) and Equations (32)-(35), we can see that if the estimated DOA values (θ̂_s(τ,ω), φ̂_s(τ,ω)) approximate the real DOA values (θ_s,φ_s), then ∆η_1 and ∆η_2 should be small. Therefore, for the TF points (τ,ω) dominated by the target speech, we can derive the following inequality: where φ̂_s(τ,ω) and θ̂_s(τ,ω) are the target speaker's DOA estimated by Equations (34) and (35), respectively, and θ_s and φ_s are the given DOA of the target speech for the SE task. The parameters δ_1 and δ_2 are predefined permissible parameters (each referring to an angle value). Following the derivation so far, if Equations (36) and (37) are met at a (τ,ω) point, we can infer that this point is dominated by the target speech with high probability. Therefore, using Equations (36) and (37), the TF points (τ,ω) dominated by the target speech can be extracted, and a mask associated with these points can be designed accordingly. In addition, we need to take the following facts into account. (1) The value of φ_s belongs to (0,2π]. (2) The principal value interval of the arccot function is (0,π), and that of the arctan function is (−π/2,π/2). (3) The value range of θ_s is (0,π]. (4) The principal value interval of the arccos function is [0,π]. (5) To make the principal values of the inverse trigonometric functions match the values of θ_s and φ_s, we need to add Lπ to avoid ambiguity. As a result, a binary TF mask for preserving the target speech is designed as follows: where L = 0, ±1, and (∆φ(τ,ω), ∆θ(τ,ω)) is the difference between the estimated DOA and the real DOA of the target speaker at TF point (τ,ω).

Obviously, the smaller the value of (∆φ(τ,ω), ∆θ(τ,ω)), the more probable it is that the TF point (τ,ω) is dominated by the target speech. To further improve the estimation accuracy and suppress the impact of outliers, we propose a nonlinear soft TF mask as: where ξ is a positive parameter and ρ (0 ≤ ρ < 1) is a small parameter close to zero, which reflects the noise suppression effect. The parameters ∆_1 and ∆_2 control the permitted degree of the estimation difference (∆φ(τ,ω), ∆θ(τ,ω)). When the parameters ∆_1, ∆_2, and ρ become larger, the capability of suppressing noise and interferences degrades, and the probability that a selected (τ,ω) point is actually dominated by the target speech also decreases. Hence, selecting the values of ρ, ∆_1, and ∆_2 is important. In our study, these parameters are determined through experiments. Future work could focus on selecting these parameters based on models of human auditory perception. Finally, we emphasize that the mask designed in Equation (39) is able to suppress the adverse effects of the interferences and background noise while preserving the target speech.
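To make the role of ξ, ρ, ∆_1, and ∆_2 more concrete, the following Python sketch shows one possible soft mask with the qualitative behavior described above. It is not Equation (39): the sigmoid shape, the way the two angle differences are combined, and the default values are assumptions made purely for illustration, and `soft_mask` is a hypothetical name.

```python
import math


def soft_mask(d_phi, d_theta, delta1, delta2, xi=5.0, rho=0.01):
    """Illustrative soft mask value for one TF point (angles in degrees).

    d_phi, d_theta: differences between the estimated and the given DOA.
    delta1, delta2: permitted differences; xi: sigmoid steepness;
    rho: floor applied to TF points judged not target-dominated.
    The sigmoid form is an assumption, not the published mask.
    """
    d_phi, d_theta = abs(d_phi), abs(d_theta)
    if d_phi >= delta1 or d_theta >= delta2:
        return rho
    # Closeness in (0, 1]: 1 means the estimate matches the given DOA exactly.
    closeness = 1.0 - 0.5 * (d_phi / delta1 + d_theta / delta2)
    weight = 1.0 / (1.0 + math.exp(-xi * (2.0 * closeness - 1.0)))
    return max(rho, weight)
```

With this form, TF points whose estimated DOA is close to the given DOA receive a weight near 1, the weight decays smoothly as the difference grows, and points outside the permitted range are attenuated to ρ rather than set to zero.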

Proposed Target Speech Enhancement Method
The diagram of the proposed speech enhancement method (termed as AVS-SMASK) is shown in Figure 4, which is processed in the time-frequency domain. The details of each block in Figure 4 will be addressed in the following context.

The FBF Spatial Filter
As shown in Figure 4, the input signals to the FBF spatial filter are the data captured by the u-, v-, and w-sensors of the AVS. With the given DOA (θ_s,φ_s), the spatial matched filter (SMF) is employed as the FBF spatial filter, and its output can be described as: where w_m^H = a^H(θ_s,φ_s)/||a(θ_s,φ_s)||² is the weight vector of the SMF, and a(θ_s,φ_s) is given in Equation (5). [·]^H denotes the vector/matrix conjugate transposition. Substituting the expressions in Equations (5), (3), and (9)-(11) into Equation (40) yields: where N_tuvw(τ,ω) is the total noise component given as: It can be seen that N_tuvw(τ,ω) in Equation (42) consists of the interferences and background noise captured by the directional sensors, while Y_m(τ,ω) in Equation (41) is the mixture of the desired speech source S(τ,ω) and the unwanted component N_tuvw(τ,ω).
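The spatial matched filter can be sketched for a single TF point as follows. This is a minimal illustration under the assumption that only the three directional entries (u_s, v_s, w_s) of the steering vector are used, in line with the stated filter inputs; the function name `smf_output` is hypothetical.

```python
import math


def smf_output(x_u, x_v, x_w, theta_deg, phi_deg):
    """Spatial matched filter output at one TF point for a given DOA (degrees).

    x_u, x_v, x_w: complex spectra of the u-, v-, and w-sensor at (tau, omega).
    """
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    # Directional part of the steering vector (real-valued, so a^H = a^T).
    a = [
        math.sin(theta) * math.cos(phi),
        math.sin(theta) * math.sin(phi),
        math.cos(theta),
    ]
    norm_sq = sum(g * g for g in a)  # equals 1 up to rounding for this vector
    return (a[0] * x_u + a[1] * x_v + a[2] * x_w) / norm_sq
```

When the TF point contains only the target, the three inputs are u_s·S, v_s·S, and w_s·S, and the output reduces to S, which matches the description of Y_m(τ,ω) as the desired speech plus a residual noise term.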

Enhancing Target Speech Using Estimated Mask
With the estimated mask in Equation (39) and the output of the FBF spatial filter Y_m(τ,ω) in Equation (41), it is straightforward to compute the enhanced target speech as follows: where Y_s(τ,ω) is the spectrum of the enhanced speech, that is, an approximation of the target speech. For completeness, our proposed speech enhancement algorithm, termed the AVS-SMASK algorithm, is summarized in Table 2.

(1) Segment the output data captured by the u-sensor, v-sensor, w-sensor, and o-sensor of the AVS unit by the N-length Hamming window;
(2) Calculate the STFT of the segments: X_u(τ,ω), X_v(τ,ω), X_w(τ,ω), and X_o(τ,ω);
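The first two steps of the algorithm, and the final masking step, can be sketched in plain Python as follows. This is an illustrative, unoptimized sketch: the direct DFT stands in for an FFT, the hop size is a free parameter not specified in the text, applying the mask by element-wise multiplication is the usual convention for TF masks and is assumed here, and the names `hamming`, `stft`, and `apply_mask` are hypothetical.

```python
import cmath
import math


def hamming(n):
    """N-length Hamming window."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def stft(signal, frame_len, hop):
    """Naive STFT: a list of frames, each holding frame_len // 2 + 1 complex bins."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + k] * window[k] for k in range(frame_len)]
        bins = []
        for b in range(frame_len // 2 + 1):
            acc = 0j
            for k, x in enumerate(seg):
                acc += x * cmath.exp(-2j * math.pi * b * k / frame_len)
            bins.append(acc)
        frames.append(bins)
    return frames


def apply_mask(spectrum, mask):
    """Element-wise product of a TF mask and the spatial filter output."""
    return [[m * y for m, y in zip(mask_row, spec_row)]
            for mask_row, spec_row in zip(mask, spectrum)]
```

In practice, each of the four channels would be passed through `stft`, the ISDRs and mask would be computed per TF point, and `apply_mask` would be applied to the spatial filter output before an inverse STFT.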

Experiments and Results
The performance evaluation of our proposed AVS-SMASK algorithm has been carried out with simulated data and recorded data. Five commonly used performance metrics have been adopted: SNR, the signal-to-interference ratio (SIR), the signal-to-interference plus noise ratio (SINR), the log spectral deviation (LSD), and the perceptual evaluation of speech quality (PESQ). The definitions are given as follows for completeness.
(1) Signal-to-Noise Ratio (SNR): (2) Signal-to-Interference Ratio (SIR): (3) Signal-to-Interference plus Noise Ratio (SINR): where s(t) is the target speech, n(t) is the additive noise, s_i(t) is the ith interference, and x(t) = s(t) + s_i(t) + n(t) is the received signal of the o-sensor. The metrics are calculated by averaging over frames to get a more accurate measurement [22]. (4) Log Spectral Deviation (LSD), which is used to measure the speech distortion [22]: where ψ_ss(f) is the power spectral density (PSD) of the target speech, and ψ_yy(f) is the PSD of the enhanced speech. Smaller LSD values indicate less speech distortion. (5) Perceptual Evaluation of Speech Quality (PESQ). To evaluate the perceptual enhancement performance of the speech enhancement algorithms, the ITU-PESQ software [23] is utilized.
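For orientation, the following Python sketch computes the energy-ratio metrics and a log spectral deviation using conventional textbook definitions. These are assumptions for illustration and may differ in detail from the definitions used in this study: the ratios are computed over whole signals rather than averaged over frames, the SIR sums the interferer powers (which presumes uncorrelated interferers), and the LSD is taken as the root-mean-square difference of the log PSDs in dB. All function names are hypothetical.

```python
import math


def power(x):
    """Mean power of a real-valued signal."""
    return sum(v * v for v in x) / len(x)


def ratio_db(num, den):
    """Power ratio expressed in decibels."""
    return 10.0 * math.log10(num / den)


def snr_db(s, n):
    """Target power over noise power."""
    return ratio_db(power(s), power(n))


def sir_db(s, interferers):
    """Target power over the summed power of all interferers."""
    return ratio_db(power(s), sum(power(i) for i in interferers))


def sinr_db(s, x):
    """Target power over the power of everything else in the mixture x."""
    residual = [xv - sv for xv, sv in zip(x, s)]
    return ratio_db(power(s), power(residual))


def lsd_db(psd_ref, psd_est):
    """Root-mean-square difference between two PSDs on a dB scale."""
    sq = [(10.0 * math.log10(r) - 10.0 * math.log10(e)) ** 2
          for r, e in zip(psd_ref, psd_est)]
    return math.sqrt(sum(sq) / len(sq))
```

A frame-averaged variant would apply the same functions to short segments and average the resulting values.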
In this study, the performance comparison is carried out with the comparison algorithm AVS-FMV [17] under the same conditions. We do not take other SE methods into account since they use different transducers for signal acquisition. One set of waveform examples that is used in our experiments is shown in Figure 5, where s(t) is the target speech, s i (t) is the i-th interference speech, n(t) is the additive noise, and y(t) is the enhanced speech.

Experiments on Simulated Data
In this section, three experiments have been carried out. The simulated data of about five seconds duration is generated, where the target speech s(t) is male speech, and two speech interferences s i (t) are male and female speech, respectively. Moreover, the AURORA2 database [24] was used, which includes subway, babble, car, exhibition noise, etc. Without loss of generality, all of the speech sources are placed one meter away from the AVS.

Experiment 1: The Output SINR Performance under Different Noise Conditions
In this experiment, we have carried out 12 trials (numbered as trial 1 to trial 12) to evaluate the performance of the algorithms under different spatial and additive noise conditions following the experimental protocols in Ref. [25]. The details are given below: (1) The DOAs of the target speech, the first speech interference (male speech), and the second speech interference (female speech) are (θ_s,φ_s) = (45°,45°), (θ_1,φ_1) = (90°,135°), and (θ_2,φ_2) = (45°,120°), respectively. The background noise n(t) is chosen as babble noise; (2) We evaluate the performance under three different conditions: (a) there exists only additive background noise: n(t) ≠ 0 and s_i(t) = 0; (b) there exist only speech interferences: n(t) = 0 and s_i(t) ≠ 0; (c) there exist both background noise and speech interferences: n(t) ≠ 0 and s_i(t) ≠ 0; (3) The input SINR (denoted as SINR-input) is set to −5 dB, 0 dB, 5 dB, and 10 dB, respectively. Following the settings above, 12 different datasets are generated for this experiment.
(3) For the compared algorithm AVS-FMV: F = 32 and M = 1.001, following Ref. [17]. The experimental results are given in Table 3.

Table 3. Output signal-to-interference plus noise ratio (SINR) under different noise conditions.

As shown in Table 3, for all of the noise conditions (Trial 1 to Trial 12), our proposed AVS-SMASK algorithm outperforms AVS-FMV [17]. From Table 3, we can see that our proposed AVS-SMASK algorithm gives about 3.26 dB, 4.14 dB, and 2.25 dB improvement over AVS-FMV under the three different experimental settings, respectively. We can conclude that our proposed AVS-SMASK is effective in suppressing the spatial interferences and background noise.

Experiment 2: The Performance versus Angle Difference
This experiment evaluates the performance of SE methods versus the angle difference between the target and interference speakers. Let us define the angle differences as ∆φ = φ_s − φ_i and ∆θ = θ_s − θ_i (here, the subscripts s and i refer to the target speaker and the interference speaker, respectively). Obviously, the closer the interference speaker is to the target speaker, the more limited the speech enhancement is. The experimental settings are as follows. (1) PESQ and LSD are used as metrics.
(2) The parameters of the algorithms are the same as those used in Experiment 1. (3) Without loss of generality, the SIR-input is set to 0 dB, while the SNR-input is set to 10 dB. (4) We consider two cases.

• Case 1: ∆θ is fixed and ∆φ is varied. With (θ_1,φ_1) = (45°,0°), the DOA of the target speaker moves from (θ_s,φ_s) = (45°,0°) to (θ_s,φ_s) = (45°,180°) with 20° increments. Hence, the angle difference ∆φ changes from 0° to 180° with 20° increments. Figure 6 shows the results of Case 1. From Figure 6, it is clear that when ∆φ→0° (the target speaker moves closer to the interference speaker), for both algorithms, the PESQ drops significantly, and the LSD values are also large. These results indicate that the speech enhancement is very limited if ∆φ→0°. However, when ∆φ > 20°, the PESQ gradually increases, and the LSD drops. It is encouraging to see that the PESQ and LSD performance of our proposed AVS-SMASK algorithm is superior to that of the AVS-FMV algorithm for all of the angles. Moreover, our proposed AVS-SMASK algorithm has a clear advantage when ∆φ ≥ 40°.

Figure 7 shows the results of Case 2. From Figure 7, we can see that when ∆θ→0° (the target speaker moves closer to the interference speaker), for both algorithms, the PESQ and LSD performance is also poor. This means that the speech enhancement is very limited if ∆θ→0°. However, when ∆θ > 20°, it is encouraging to see that the PESQ and LSD performance of our proposed AVS-SMASK algorithm is better than that of the AVS-FMV algorithm for all of the angles. In addition, it is noted that the performance of both algorithms drops again when ∆θ > 140° (the target speaker moves closer to the interference speaker around a cone). However, this phenomenon does not appear in Figure 6. In summary, the experimental results show that our proposed AVS-SMASK algorithm is able to enhance the target speech and suppress the interferences when the angle difference between the target speaker and the interference is larger than 20°.

Experimental results are given in Figures 8 and 9 for Case 1 and Case 2, respectively. From these results, we can clearly see that when the DOA mismatch is less than 20°, our proposed AVS-SMASK algorithm is not sensitive to DOA mismatch. Besides, our AVS-SMASK algorithm outperforms the AVS-FMV algorithm under all of the conditions. However, when the DOA mismatch is larger than 20°, the performance of our proposed AVS-SMASK algorithm drops significantly. Fortunately, a DOA estimation accuracy of 20° is easy to achieve.

Experiments on Recorded Data in an Anechoic Chamber
In this section, two experiments have been carried out with the recorded data captured by an AVS in an anechoic chamber [25]. Every set of recordings lasts about six seconds and is made with the target speech source and the interference source broadcasting at the same time along with the background noise, as shown in Figure 1. The speech sources, taken from the Institute of Electrical and Electronics Engineers (IEEE) speech corpus [26], are placed in front of the AVS at a distance of one meter. The SIR-input is set to 0 dB, while the SNR-input is set to 10 dB. The recordings were sampled at 48 kHz and then down-sampled to 16 kHz for processing.

Experiment 4: The Performance versus Angle Difference with Recorded Data
In this experiment, the performance of our proposed method has been evaluated versus the angle difference between the target and interference speakers (∆φ = φ_s − φ_i and ∆θ = θ_s − θ_i). The experimental settings are as follows. (1) PESQ is taken as the performance metric.
(2) The parameters of the algorithms are the same as those of Experiment 1. (3) Considering the page limitation, we only consider changes of the azimuth angle φ_s while θ_s = 90°. The interfering speaker s_1(t) is at (θ_1,φ_1) = (90°,45°). φ_s varies from 0° to 180° with 20° increments. Then, there are 13 recorded datasets. The experimental results are shown in Figure 10. It is noted that the x-axis represents the azimuth angle φ_s. It is clear that the overall performance of our proposed AVS-SMASK algorithm is superior to that of the compared algorithm. Specifically, when φ_s approaches φ_1 = 45°, the PESQ degrades quickly for both algorithms. When the angle difference ∆φ is larger than 30° (φ_s is smaller than 15° or larger than 75°), the PESQ of our proposed AVS-SMASK algorithm goes up quickly, and is not sensitive to the angle difference.

Experiment 5: Performance versus DOA Mismatch with Recorded Data
This experiment is carried out to evaluate the performance of the speech enhancement algorithms when there are DOA mismatches. The experimental settings are as follows. (1) PESQ and LSD are taken as the performance metrics. (2) The parameters of the algorithms are the same as those of Experiment 1. (3) The target speaker is at (θ_s,φ_s) = (45°,45°), and the interference speaker is at (θ_1,φ_1) = (90°,135°). The azimuth angle φ_s is assumed to be mismatched. We consider the mismatch of φ_s (denoted as φ_s″) varying from 0° to 30° with 5° increments. The experimental results are shown in Figure 11, where the x-axis is the mismatch of the azimuth angle φ_s (φ_s″). It is noted that our proposed AVS-SMASK is superior to the compared algorithm under all conditions. It is also clear that our proposed algorithm is not sensitive to DOA mismatch when the mismatch is smaller than 23°. We are encouraged to conclude that our proposed algorithm will offer good speech enhancement performance in practical applications where the DOA may not be accurately estimated.

Conclusions
In this paper, aiming at the hearing technology of service robots, a novel target speech enhancement method with a single AVS has been proposed to suppress multiple spatial interferences and additive background noise simultaneously. By exploiting the AVS signal model and its inter-sensor data ratio (ISDR) model, the desired monotone functions between the ISDRs and the DOA of the target speaker are derived. Accordingly, a nonlinear soft mask has been designed by making use of speech time-frequency (TF) sparsity with the known DOA of the target speaker. As a result, a single AVS-based speech enhancement method (named AVS-SMASK) has been formulated and evaluated. Compared with the existing AVS-FMV algorithm, extensive experimental results using simulated data and recorded data validate the effectiveness of our AVS-SMASK algorithm in suppressing spatial interferences and the additive background noise. It is encouraging to see that our AVS-SMASK algorithm also introduces less speech distortion. Due to page limitations, we did not show the derivation of the algorithm under reverberation. The signal model and ISDR model under reverberant conditions will be presented in our paper [27]. Our preliminary experimental results show that the PESQ of our proposed AVS-SMASK degrades gradually when the room reverberation becomes stronger (RT60 > 400 ms), but the LSD is not sensitive to the room reverberation. Besides, there is an argument that learning-based SE methods achieve the state of the art. In our opinion, in terms of SNR, PESQ, and LSD, this is true. However, learning-based SE methods require large amounts of training data, a much larger memory size, and a high computational cost. In contrast, the application scenarios of this research differ from those of learning-based SE methods, and our solution is more suitable for low-cost embedded systems.
A real demo system was established in our lab, and the conducted trials further confirmed the effectiveness of our method when room reverberation is moderate (RT60 < 400 ms). We are confident that, with only four sensor channels and without any additional training data, the subjective and objective performance of our proposed AVS-SMASK is impressive. Our future study will investigate deep learning-based SE methods with a single AVS to improve generalization and the capability to handle different noise and interference conditions.