Multi-Angle Fusion for Low-Cost Near-Field Ultrasonic In-Air Gesture Recognition

With the growing interest in Human-Machine Interface (HMI), increasing effort is being made to provide always-on, low-cost, touchless control solutions for Internet-of-Things (IoT) edge devices. In this study, we explore near-audible ultrasound for in-air ultrasonic gesture design and recognition. We propose to use beamforming followed by stages of feature extraction and a Temporal Convolutional Network (TCN) for classification. The study is applied to a small form factor concentric hexagonal array of 7 microphones, where a beamforming stage is leveraged for spatial feature extraction and for fusing ultrasonic gesture features from different angles. For such a limited number of microphones, we show that a customized Filter-and-Sum (FaS) beamformer with a set of 5-tap filters is well suited for this application. We optimize the beamformer by fitting a fixed beam in the ultrasonic frequency domain, making the ultrasonic band of interest (18 kHz - 24 kHz) available for use. The beamformer generates parallel readings in time of Doppler shifts from a set of assigned angles as a hand gesture is performed near the array. A TCN of only 10k parameters is used to classify these parallel readings into predefined symbols to build a gesture alphabet. The TCN operates in two modes sharing the same TCN structure, with the option to switch between them by loading a different set of coefficients. Features from concatenated beamformed frequency points are learned with a per-symbol classification accuracy in the range 92%-100%, computed on a test set and visualized in the form of normalized confusion matrices. The proposed system gives users a degree of flexibility, where gesture diversity can be obtained by grouping the trained symbols from the built alphabet in a post-training design stage. This paves the way for flexible, intuitive and easy-to-remember gestures.


I. INTRODUCTION
Over the last decade, consumer electronics such as smartphones have witnessed innumerable innovative features that changed the way we communicate with smart devices. In the era of Internet-of-Things (IoT) and smart homes, more emphasis is being placed on Human-Machine Interface (HMI) as opposed to traditional means of control such as physical buttons or touch screens. With the trend of miniaturization, IoT edge devices are of small form factor, lacking the space to host physical buttons or a touch screen. IoT edge devices are also always-on and low-power devices, where natural language processing solutions are implausible because of their compute-intensive nature. This encourages researchers and product engineers to look at touchless and in-air control mechanisms falling under the umbrella of Human-Machine Interface (HMI) and gesture recognition.
HMI and Deep Learning (DL) empower different technologies to build gesture recognition systems. Generally, HMI is an umbrella term for various technologies in the physical layer. Gesture definition depends on variations in 1) technologies, 2) design strategies, and 3) data sets. Covering all such variations in full depth would require a dedicated literature survey. Instead, we survey examples of distinct sensing technologies and highlight their fundamental advantages and disadvantages. We focus on ultrasound, the main topic of this article, as an emerging technology of interest demonstrated in commercial products and in the technical literature referenced in this article.
Wearable gloves, such as stretch gloves [1], can be used to map hand gestures. Although the obtained gestures are of finger-movement resolution, enabling complex gestures, the inconvenience of having to wear gloves to perform a gesture hinders their employment in real products. Optical sensor-based systems such as Microsoft Kinect and Leap Motion [2], [3] have also been explored for gesture recognition. The high gain from these technologies is the ability to capture 3D (depth) information of the performed gestures. However, the vulnerability of optical solutions to illumination conditions and their high computational cost remain the main concerns in the context of an always-on gesture recognition system. Electromagnetic sensor-based technologies such as WiFinger [4] construct fine-grained gestures by reusing a single commodity WiFi device. The limited availability of wireless access points outdoors makes this type of system unstable. Furthermore, dynamic multipath, shadowing, and fading signal components due to environmental effects make fine-gesture recognition more challenging. Another example is Soli [5], a gesture sensing technology from Google used for motion tracking. The underlying approach to hand gesture recognition in Soli is to utilize the temporal accuracy of radar while maintaining the high throughput of the sensor to minimize latency. In 2019, Google employed Soli in its Pixel 4 smartphone series [6], where only swipe left/right gestures are demonstrated. In fact, these gestures can already be achieved with existing smartphone hardware using ultrasound, as shown in [7], [8], [9], [10]. In radar systems, the wave propagates at the speed of light, which is ~1 million times greater than the speed of sound. The range resolution of a radar system is given by the speed of the transmitted wave divided by twice its bandwidth. This implies that far more bandwidth is needed at microwave frequencies than with ultrasound [11], which further entails more processing effort.
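To make this bandwidth argument concrete, take a 1 cm range resolution as a purely illustrative target; with nominal propagation speeds, the standard range-resolution relation gives

$$\Delta R = \frac{v}{2B} \;\Rightarrow\; B = \frac{v}{2\,\Delta R}, \qquad B_{\text{radar}} \approx \frac{3\times10^{8}}{2\times10^{-2}} = 15\ \text{GHz}, \qquad B_{\text{ultrasound}} \approx \frac{343}{2\times10^{-2}} \approx 17\ \text{kHz}.$$

The same spatial resolution thus costs roughly six orders of magnitude less bandwidth in air-coupled ultrasound than in radar.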
Recently, industry has been paying attention to ultrasound-powered applications in consumer electronics. For example, OnePlus and Xiaomi are replacing the infrared (IR) sensor used for ear-proximity detection with an ultrasound-based solution, enabling notch-less screens and saving hardware cost [12]. Although smartphones are not targeted in this study, we follow the same endeavor in building ultrasound-powered applications for small form factor IoT devices. We propose a MIMO system (a 7-microphone array) to collect echoes from hand movements and then classify them into user-defined gestures. We emphasize a small form factor array in line with the trend of miniaturization. The overall system consists of a dedicated beamformer, feature extraction, a low-cost TCN classifier and finally a task-driven gesture design stage, as shown in Fig. 1. In our study, we take advantage of beamforming to collect spatial ultrasonic echoes and improve their Signal-to-Noise Ratio (SNR). As an array does not include any mechanical parts, it can be used unobtrusively to collect ultrasonic features from multiple angles. In the context of this study, we look at the use-case of in-air ultrasonic gesture recognition on a small form factor array. The proposed array is a concentric hexagonal array of 7 microphones, suitable for deployment in IoT devices that are too small to host physical buttons.
If a hand hovers over the array, Doppler shift readings are generated according to the speed of the hand. Accordingly, the array is responsible for localizing these ultrasonic reflections. Off-the-shelf beamforming techniques such as Minimum Variance Distortionless Response (MVDR) and Delay-and-Sum (DaS) suffer from practical and performance issues, which are highlighted in the following sections.
DaS, or phased-array, beamforming can be implemented in both the analogue and the digital domain. In the analogue domain, the delay and weighting of the arriving signals are implemented using analogue components such as amplifiers and delay lines. These components are susceptible to noise and are expensive to implement, giving the floor to discrete-time digital beamforming. In the discrete-time domain, the signal is first sampled, then weighted and delayed to achieve a beamformed output. When reflections are received at highly elevated angles, representing a fractional sample delay requires extra computational cost [13]; for example, the incoming samples must be convolved with a fractionally shifted sinc function, otherwise beamforming becomes unstable at certain elevated angles, resulting in noise leakage and degraded beamforming performance. A minimal illustration of such a fractional-delay filter is sketched below.
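The following is a minimal sketch (not the implementation used in this study) of a windowed-sinc fractional-delay filter; the tap count and Hamming window are illustrative assumptions:

import numpy as np

def fractional_delay(x, delay_samples, num_taps=21):
    # Windowed-sinc fractional-delay filter: a common way to realize the
    # non-integer sample delays discussed above (num_taps is an assumption).
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    h = np.sinc(n - delay_samples)      # sinc kernel shifted by the fractional delay
    h *= np.hamming(num_taps)           # window to reduce truncation ripple
    h /= h.sum()                        # normalize the DC gain
    return np.convolve(x, h, mode="same")

Compared with an integer-sample shift, every output sample now costs num_taps extra multiply-accumulate operations, which is exactly the overhead referred to above.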
Considering ultrasound-based use-cases [7]-[9], [11], [14]-[22] (summarized in Table 1), we review some of the beamforming technologies used in these studies. The authors in [20] employ beamforming to find directional ultrasonic Time-of-Flight (ToF) readings from 7 sensors organized in a zigzag geometry, one of which is a transducer operating at 217 kHz. As the authors highlight, they forgo 3D beamforming since the tiny y-axis aperture does not provide any y-axis resolution. A similar study [21] uses DaS beamforming on a 7-MIC linear array to form depth images of, e.g., a hand. In an attempt at visual and ultrasonic feature fusion, [22] combines ToF ultrasonic readings from a 16-MIC array with a camera output to form depth and visual images. In another similar study [18], a DaS beamformer is mapped to a 64-MIC array to create bigger depth images. The same study suggests using a handheld reflector to improve the received SNR, as bare hands did not provide enough SNR for the classification of 8 in-air gestures. This approach falls into the category of wearable-based gesture recognition systems, and many users would find holding a reflector while performing a gesture to control a device inconvenient. High SNR is important for classifying ultrasonic gestures; therefore, beamforming can be positioned as an essential pre-processing stage to achieve high SNR of ultrasonic echoes [17], [23].
MVDR, as an option, is employed in some use-cases where the noise and interference are assumed to be uncorrelated with the signal-of-interest (SOI) [24]. Interfering signals, i.e., reflections from nearby objects, can be described as coherent signals (amplitude-scaled and phase-shifted), suggesting that MVDR is not well suited for beamforming in-air ultrasonic echoes [18]. In fact, the authors in [19] use MVDR beamforming with Convolutional Neural Network (CNN) layers for feature extraction, combined with Long Short-Term Memory (LSTM) to support sequence prediction. This is followed by classification of 5 static gestures on the array; however, the top-1 accuracy of the system is only 64%. Therefore, we forgo practical comparisons with MVDR in this study. Given the low number of microphones in the proposed 7-MIC array, we use a beam-fitting Filter-and-Sum (FaS) beamformer to 1) stabilize the beamformer performance in the 3D surroundings where gestures are most likely to occur and 2) improve the beamforming performance in the wideband ultrasonic spectrum up to the Nyquist range defined by the λ/2 pitch of the array.
FaS beamformers have been used before on a linear array, as in [25], where a 20-tap polynomial filter structure is applied to match the array response with a reference beam. Furthermore, the authors in [26] apply 39-tap filters to each microphone of an 8-MIC circular array for far-field speech applications. In a different study [27], a 71-tap filter is assigned to each microphone to create a FaS beam response using a weighting vector to control the importance of fitting a beam at different frequencies. However, [25]-[27] analyze beamforming patterns only over the azimuth span, with no insights given on the elevation angle, and they use relatively long FIR filters. In this study, we tailor FaS for the purpose of ultrasonic Doppler shift localization. As far as realistic ultrasonic gestures are concerned, users perform such gestures in the 3D surroundings of the array. Therefore, the whole 3D beam response of the array should be considered to assess the beam pattern at high elevation angles. 3D multi-angle beamforming is used to collect low-dimensional features in the form of a series of down-mixed frames having 300 Hz of bandwidth around the transmitted tone. This is implemented to avoid exploding the complexity of a classifier. Examples of such complexity scaling can be seen in UltraGesture [15] and [14], where CNNs of 2.48M and ~170M parameters, respectively, are trained for ultrasonic gesture recognition without beamforming. These scenarios of the so-called 'curse of dimensionality' [28] should be avoided in always-on applications. The beamformed frames are represented in low-dimensional frequency points with background subtraction to remove a predefined transmitted ultrasonic tone. A single 23 kHz ultrasonic tone is used in this study as an example.
The proposed beamformed features in section IV are passed to a Temporal Convolutional Network (TCN). TCNs are sequence labeling classifiers that are gaining attention over recurrent ones such as LSTMs [29]. They provide a flexible receptive field over input frames while leveraging compute parallelism. They also have more stable gradients during training [30], unlike LSTMs, which are vulnerable to vanishing gradients. Therefore, a TCN is used in this study to accept multi-angle beamformed frequency features and classify a set of designed gesture symbols. We benchmark the work in this study against related literature in Table 1. All of the mentioned studies consider a fixed number of gestures (5-12) using ultrasound, or 24 if combined with a gravity sensor in a smartphone [9]. We further highlight the gesture recognition performance by showing the classification accuracy of each related study. Since ultrasound is vulnerable to noise, we show the noise considerations of each study and how noisy environments have received little attention. The ultrasound signals used range from a single tone to a chirp, and from low frequencies (e.g. 18 kHz) to high frequencies (e.g. 300 kHz). In turn, each study processes the ultrasonic echoes differently: the echoes are used to construct Time-of-Flight information, B-mode (depth) images, Doppler shifts, Signal-to-Noise Ratio (SNR) or difference Channel Impulse Response (dCIR). The hardware complexity of each study is highlighted by the number of hardware elements used, such as the number of microphones or speakers, and by whether beamforming is involved. The beamforming algorithm used in each applicable study is mentioned, where we show that the Delay-and-Sum algorithm dominates for its simplicity. We also distinguish the studies by the classifier used, for instance whether a Convolutional Neural Network (CNN) is used or whether no classifier is involved.
In a previous work [10], gesture recognition capabilities using only one microphone and one speaker are explored. The study shows how a limited number of gestures can be identified using Press/Hold/Release patterns performed on top of a microphone. Press/Hold/Release patterns are produced by a hand mimicking pressing a virtual button in air, allowing users to design gestures that vary with the length of the Hold period. In [23], Delay-and-Sum is employed to capture gestures over the azimuth span of an array; however, recognition becomes unstable for gestures performed at elevated angles. In this study, we employ improved beamforming to enable better detection of symbols forming a gesture alphabet. Symbols in a designed gesture alphabet are automatically labeled using a tool that assigns labels to ultrasonic patterns. We further show how these symbols can be combined in a post-training stage to generate user-specific gestures. The contributions of this work can be summarized in the following points: • To design ultrasonic gestures by fusing spatial and time-variant Doppler shift readings from different angles (section II).
• To tailor Filter-and-Sum beamformer for the purpose of Doppler shift localization (section III).
• To employ a low-cost temporal convolutional network (TCN) having a small number of parameters for ultrasonic patterns classification (section V).
• To integrate and profile a low-cost MIMO system for in-air ultrasonic feature collection (section VI).
• To propose the concept of post-training user specific compound ultrasonic gestures, instead of a trained and fixed set (section VII).

II. GESTURE DESIGN
We start by intuitively defining the physics of the ultrasonic patterns induced by the performing hand. Let $f_{t,\theta,\beta}$ be the Doppler shift at time $t$ and at a steering angle $(\theta, \beta)$, denoting azimuth and elevation, respectively. The rate of change of this Doppler shift over the past $T$ events can be described as $\frac{d}{dt} f_{(t-T:t),\theta,\beta}$. Let mode 1 patterns $M1_{t,\theta,\beta}$ and mode 2 patterns $M2_{t,\theta,\beta}$ be the gesture alphabet definitions of the two modes at event $t$ and steering angle $(\theta, \beta)$. The overall system with the dual-mode functionality is shown in Fig. 2.
Referring to (1) as mode 1, we leverage readings from 6 steering angles surrounding the array, from which 6 possible directions to tap at are indicated by $Tap_n$, where $n$ is the index of the direction. When a hand performs a tap, a harmonic Doppler shift pattern $H$ is produced. Users can also use two hands to perform a synchronous double tap at two different steering angles, producing two $H$ patterns having the same phase; this pattern is called $DTap_{n1,n2}$. Users can also use both hands to tap alternately at two steering angles, where one hand creates pattern $H$ at one direction while the other hand creates the phase-shifted pattern $\bar{H}$ at another direction; we refer to this as $ATap_{n1,n2}$. All patterns or noises that do not abide by the definitions in (1) are set to No-Gesture. Mode 1 forms in total an alphabet with 8 symbols.
Mode 2, on the other hand, is designed with a set of tap-less symbols: →, ←, ↑, ↓ and Choose. The basic principle in mode 2 is to track the Doppler shift readings as they appear at the steering angles during the hand movement. For instance, as a hand moves from left to right over the array, the angles steered to the left first capture the induced Doppler shift. When the hand is midway from left to right, the angles steered at the middle localize the Doppler shift, and the same applies when the hand finally stops at the right. Mode 2 is defined in (2), where StA is an abbreviation for steering angle. Mode 2 forms in total an alphabet with 5 symbols.
The aforementioned ultrasonic patterns that satisfy the definitions in (1) and (2) are captured in a series of overlapping time frames. A shift of 10 ms (960 samples) is applied to this series of frames, which means that each two consecutive frames share 20 ms of samples, maintaining a smooth transition between them. A matrix $X_i$ for each $i$th microphone can be written as in (3), i.e., $X_i = [\,x_{i,1}, x_{i,2}, \ldots, x_{i,F}\,]$, with one frame per column. If we set the maximum number of overlapped frames to $F$ and the number of time samples per frame to $N$, then each $X_i$ is of size $N \times F$. Each $i$th matrix at each receiving microphone contains samples without beamforming. Beamforming in section III is applied to produce readings from steering angles with the same format as the matrix in (3); this matrix is given the name $X_{\theta,\beta}$. In section IV, we perform feature extraction at these steering angles to retrieve discriminative features as per the aforementioned definitions in (1) and (2). A TCN, discussed and built in section V, is used to distinguish mode 1 and mode 2 symbols. A minimal framing sketch is given below.
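The following sketch builds the per-microphone frame matrix of (3) under the framing parameters stated in this paper (96 kHz sampling, 30 ms frames, 10 ms shift); the function name is ours:

import numpy as np

FS = 96_000                 # sampling rate used in this study (Hz)
FRAME = int(0.030 * FS)     # 30 ms frame length -> 2880 samples
HOP = int(0.010 * FS)       # 10 ms shift -> 960 samples (20 ms overlap)

def frame_matrix(x, num_frames):
    # Arrange a single microphone stream into the N x F matrix X_i of (3):
    # each column is one overlapping time frame.
    cols = [x[f * HOP : f * HOP + FRAME] for f in range(num_frames)]
    return np.stack(cols, axis=1)       # shape (N, F)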

III. BEAMFORMING
Beamforming is a technique used for spatial discrimination in sensor arrays for directional signal transmission or reception. In our system, beamforming is performed at the receiver end to spatially filter ultrasonic echoes from a moving hand. To provide a self-contained literature review, we give more details of the beamforming algorithms used in most ultrasonic gesture recognition systems. Designing beamformers depends on the optimality criterion and the noise assumptions, and a design choice is mostly motivated by the problem setup. In our case, we show how these assumptions are translated into equations and how we converge to the chosen Filter-and-Sum solution. In the coming subsections, we present Delay-and-Sum, Minimum Variance Distortionless Response (MVDR) and Filter-and-Sum beam fitting as possible methods to implement beamforming. We emphasize the usage of a general-purpose Filter-and-Sum, where the optimality criterion is to fit the same beam shape across the whole ultrasonic band 18 kHz - 24 kHz. Solving such a Filter-and-Sum (FaS) beamforming problem can be interpreted as inserting a larger number of virtual microphones that improve the spatial discrimination of wideband signals and hence improve the global signal-to-noise ratio (SNR), as highlighted in [26].
Using beam response matching in the frequency domain as the optimality criterion allows for better wideband noise reduction than, for instance, MVDR, which assumes the noise is uncorrelated with the signal of interest.
In this study, gesture recognition can be built on any 300 Hz of bandwidth within the ultrasonic band bounded by the ~7 mm pitch of the concentric hexagonal array. In other words, we can transmit any pure single tone in this range and down-mix the received signal; the 300 Hz around the received down-mixed tone is then monitored for unique patterns. We emphasize that the desired beamformer must operate over all of these frequency points because other electronic devices might use the same narrow bandwidth. Incidents of interference can be avoided by using a frequency-hopping approach to occupy unused ultrasonic bandwidth in 18 kHz - 24 kHz.

A. DELAY-AND-SUM
Delay-and-Sum is a simple beamformer that is used for source localization by delaying and summing the signals received by the array. Recall $x_{i,f}$ to be a recorded frame at each $i$th microphone from hand reflections $s_f[n]$ at steering angle $(\theta, \beta)$. These reflections reach each microphone with a unique physical delay $\delta_{(\theta,\beta,i)}$, measured with respect to the center of the array. A simple phase-delay operator produces $z_{i,f}[n]$ by compensating $\delta_{(\theta,\beta,i)}$ with the calculated discrete-time delay $\Delta_{(\theta,\beta,i)} = f_s\, \delta_{(\theta,\beta,i)}$, i.e., $z_{i,f}[n] = x_{i,f}[n + \Delta_{(\theta,\beta,i)}]$. Considering $y_{(\theta,\beta,f)}[n]$ as the final beamformed output frame at the defined steering angle, the whole set of compensated microphone outputs $z_{i,f}[n]$ is summed and averaged by the number of microphone elements $P$, as shown in (7):
$y_{(\theta,\beta,f)}[n] = \frac{1}{P} \sum_{i=1}^{P} z_{i,f}[n]$.
Since proper physical delay calculation is essential for a functioning beamformer, the pitch between the array elements must be at $\lambda/2$, where $\lambda$ is the wavelength corresponding to the maximum frequency in the desired band. Given $\lambda f = v_s$, where $v_s$ is the speed of sound, and a maximum ultrasonic tone of 24 kHz, the pitch between the array elements must be ~7 mm. To satisfy this condition, a concentric hexagon of $P = 7$ microphones is built, 6 of which divide the azimuth view into equal angular steps of $60^\circ$. The equilateral triangles generated by the 6 circumferential microphones ensure that all neighboring elements on the array are equidistant. A slightly similar approach is seen in [20], where cascaded hexagons of pMUT elements are fabricated, forming a beehive structure for ultrasound Time-of-Flight (ToF) measurements.
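A minimal DaS sketch with integer-sample delay compensation follows (a simplification, not the study's implementation; it deliberately omits the fractional-delay handling discussed earlier, and the wrap-around of np.roll is tolerated for illustration):

import numpy as np

V_S = 343.0  # nominal speed of sound (m/s)

def delay_and_sum(frames, mic_pos, theta, beta, fs=96_000):
    # frames: (P, N) per-microphone frames; mic_pos: (P, 3) positions (m);
    # theta/beta: azimuth/elevation of the steering angle (radians).
    u = np.array([np.cos(beta) * np.cos(theta),
                  np.cos(beta) * np.sin(theta),
                  np.sin(beta)])                  # unit vector toward the source
    delays = -(mic_pos @ u) / V_S                 # arrival delay per mic (s)
    shifts = np.round(delays * fs).astype(int)    # discrete-time compensation
    y = np.zeros(frames.shape[1])
    for i, frame in enumerate(frames):
        y += np.roll(frame, -shifts[i])           # z_i[n] = x_i[n + delta*fs]
    return y / len(frames)                        # average over P microphones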
B. MINIMUM VARIANCE DISTORTIONLESS RESPONSE (MVDR)
MVDR minimizes the array output power while keeping a distortionless response toward the steering direction, yielding the weight vector
$w = \frac{R_{i+n}^{-1} D}{D^{H} R_{i+n}^{-1} D}$,
where $R_{i+n}$ is the interference-plus-noise covariance matrix and $D$ is the steering vector. Although an MVDR beamformer is effective in rejecting uncorrelated noise, it is of less value in cases of correlated noise, a known drawback of MVDR and other similar adaptive beamformers [18].
In the context of this study, correlated noise may appear from ultrasound reflections off room walls or nearby objects, resulting in coherent signals, i.e., amplitude-scaled and phase-shifted nearby reflections.

C. FILTER-AND-SUM
Let $x_{i,f}[n]$ in (4) be represented in the frequency domain as $X_{i,f}(e^{j\omega})$, where $\Delta_{(\theta,\beta,i)} \sim \delta_{(\theta,\beta,i)}$. A set of per-microphone FIR coefficients $h_{i,k}$ of length $K+1$, where $K$ is an even integer, is multiplied with $X_{i,f}(e^{j\omega})$ to obtain
$Z_{i,f}(e^{j\omega}) = \sum_{k=0}^{K} h_{i,k}\, e^{-j\omega k}\, X_{i,f}(e^{j\omega})$.
The responses of all the microphones are then summed using the $(K+1)P$ FIR coefficients to generate a microphone array transfer function $F_{\theta,\beta}(e^{j\omega})$.
The delay $\Delta_{(\theta,\beta,i)} = f_s\, \hat{\delta}_{(\theta,\beta,i)}$ is estimated by constructing virtual sources forming a hemisphere, as shown in Fig. 3, and using (12). $Loc_{(\theta,\beta)}$ is the location of the source at the surface of the hemisphere at a radius $d_r$ from the center of the array (at the zero-origin location). Each $i$th microphone in the array ($Loc_i$) is located on the xy-plane. The distance $d_r$ is subtracted from $\|Loc_{(\theta,\beta)} - Loc_i\|$ to find the intra-delay between the array elements, where $\hat{\delta}_{(\theta,\beta,i)}$ can be estimated as
$\hat{\delta}_{(\theta,\beta,i)} = \frac{\|Loc_{(\theta,\beta)} - Loc_i\| - d_r}{v_s}$.
We then find a set of FIR coefficients $h$ such that the transfer function matches a reference beam $R_{\theta,\beta}$ (Fig. 4). In this study, we use a 3D Gaussian-shaped reference beam, mainly motivated to address the lack of directionality at lower frequencies [31]. A delay-estimation sketch is given below.
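The sketch below implements the delay estimate of (12); the hemisphere radius d_r is not specified in the text, so its default here is an assumption:

import numpy as np

def steering_delays(mic_pos, theta, beta, d_r=1.0, v_s=343.0):
    # Place a virtual source on a hemisphere of radius d_r around the array
    # center (d_r value assumed) and measure path differences per microphone.
    src = d_r * np.array([np.cos(beta) * np.cos(theta),
                          np.cos(beta) * np.sin(theta),
                          np.sin(beta)])                 # Loc_(theta, beta)
    dists = np.linalg.norm(src - mic_pos, axis=1)        # ||Loc_(θ,β) - Loc_i||
    return (dists - d_r) / v_s                           # delta_hat per mic (s)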
The transfer function in (11) can be further written in matrix form as $F_{\theta,\beta}(e^{j\omega}) = h\, G(\theta, \beta, \omega)$, where $h$ is a row vector of length $(K+1)P$ containing the coefficients of all microphones, arranged as $h = [h_{1,0}, \ldots, h_{1,K}, \ldots, h_{P,0}, \ldots, h_{P,K}]$, and the columns of $G$ are a reformation of the array 3D spatial response in terms of $(\theta, \beta, \omega)$. The $\ell_2$-MMSE optimization problem of (18) can then be formulated as
$\min_h \sum_n \| h\, G_n - r_n \|_2^2 + \epsilon \| h \|_2^2$,
where $\epsilon$ is a hyper-parameter used for L2 regularization and $r_n$ is the reference beam at each $n$-elevation level in Fig. 3.
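A minimal sketch of solving this regularized fit follows, assuming the sampled array response and reference beam have already been stacked into a matrix G and a vector r; the real/imaginary stacking keeps the FIR coefficients real-valued, and the default eps is an assumption:

import numpy as np

def fit_fas_coefficients(G, r, eps=1e-3):
    # L2-regularized least-squares fit of (18) for the stacked FIR vector h.
    # G: (M, (K+1)P) complex array response sampled over (theta, beta, omega);
    # r: (M,) reference-beam samples; eps: regularization weight (assumed).
    A = np.vstack([G.real, G.imag])            # keep h real-valued
    b = np.concatenate([np.real(r), np.imag(r)])
    A_aug = np.vstack([A, np.sqrt(eps) * np.eye(G.shape[1])])
    b_aug = np.concatenate([b, np.zeros(G.shape[1])])
    h, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return h                                   # length (K+1)P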

IV. FEATURE EXTRACTION FROM STEERING ANGLES
For validation, we generate a beam at one direction $(0^\circ, 45^\circ)$ and leverage the symmetry of the hexagonal array. A circular shift of the FIR coefficients of the circumferential microphones, excluding the central microphone, captures in total 6 beamformed outputs. Looking at (11) and (14), the FIR filter length of each microphone can be set by the user. If $K$ is set to 0, there is a single-tap filter for each microphone, and the solution boils down to DaS-like beamforming with a scaling factor $|h_i|$ and a phase angle $\angle h_i$. However, we increase the efficiency of beamforming by gradually adding more taps while recording the gained performance, as shown in Fig. 5. This is performed using the aforementioned optimization (18), and assessed using the Mean-Square-Error (MSE). We track the MSE of all possible solutions by finding the difference between the reference beam and the generated beam over the Nyquist frequency range for different filter taps. The average across the frequency points is indicated by the black dotted line. Fig. 5 and (14) can be thought of as gradually inserting virtual microphones to increase the sampling points of the received acoustics: as the number of taps increases, more delay and scale interpolation is introduced. Fig. 5 shows that the MSE saturates at around $K+1 = 5$ (i.e., $K/2 = 2$), a 5-tap filter, suggesting this number of taps for the FaS beamformer in this application. FaS beamforming is not sensitive to the nature of the environmental noise, since a beam pattern $R_{\theta,\beta}$ is mapped to the real array response (in terms of $\theta, \beta$) at each frequency $\omega$. Furthermore, we show in Fig. 6 the beam response of the proposed FaS when a chirp (18 kHz - 24 kHz) is received by the array, to test beamforming in the whole desired ultrasonic band. FaS achieves a minimum normalized gain of 0.1 with no leakage at high elevation angles. This positions FaS beamforming as a candidate for directional noise attenuation across the whole desired ultrasonic band, allowing an unused ultrasonic tone to be assigned in cases of interference with other devices.
In this study, a continuous ultrasonic tone in the range 18 kHz - 24 kHz is omni-directionally transmitted to the surrounding 3D space, after which FaS beamforming is applied to form a set of beamformed frames in a matrix $X_{\theta,\beta}$. These beamformed frames are down-mixed to a 300 Hz bandwidth of frequency points. The choice of the 300 Hz bandwidth is based on a maximum hand speed $v_h$ of 1 m/s. This can be derived from the two-way Doppler relations in (19),
$f_{r(app)} = f_{tr}\, \frac{v_s + v_h}{v_s - v_h}, \qquad f_{r(dep)} = f_{tr}\, \frac{v_s - v_h}{v_s + v_h}$,
given a speed of sound $v_s = 331.5$ m/s, where $f_{r(app)}$ and $f_{r(dep)}$ are the generated Doppler shifts when a hand approaches and departs the array, respectively. The down-mixing is performed on the overlapped frames of the matrix $X_{\theta,\beta}$ at each steering angle. A mixing tone is generated at the receiver ($f_{mix} = f_{tr} - 0.5\,BW$, where $BW \approx 300$ Hz and $f_{tr} = 23$ kHz) to be mixed with the beamformed frames, which are then low-pass filtered to remove any aliasing effect. The resulting frames are down-sampled by 100, after which a 32-point FFT is applied on the down-mixed samples. The absolute values of the first 11 points of the generated FFT represent the frequency magnitude of ~300 Hz of ultrasound bandwidth, centered on the transmitted ultrasonic tone. Background subtraction is achieved by means of continuous absolute frame subtraction to generate clean Doppler shift readings for the definitions in (1) and (2). We concatenate 11 frequency points from six angles to form in total 66 points. Samples of the collected data are shown in Fig. 11, and a feature-chain sketch follows below.
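The sketch below strings the stated steps together for one beamformed frame; the low-pass filter order and cutoff are assumptions (the text only requires anti-aliasing before the downsample by 100):

import numpy as np
from scipy.signal import butter, lfilter

FS = 96_000
F_TR = 23_000.0                  # transmitted tone (Hz)
BW = 300.0                       # monitored bandwidth (Hz)
F_MIX = F_TR - 0.5 * BW          # receiver mixing tone

def downmix_features(frame):
    # Per-frame feature chain: mix, low-pass, downsample by 100, 32-point
    # FFT (zero-padded), keep |FFT| of the first 11 bins (~300 Hz at 30 Hz/bin).
    n = np.arange(len(frame))
    mixed = frame * np.cos(2 * np.pi * F_MIX / FS * n)   # shift band near DC
    b, a = butter(4, 480 / (FS / 2))   # anti-alias LPF (order/cutoff assumed)
    lp = lfilter(b, a, mixed)
    ds = lp[::100]                     # downsample by 100 -> 960 Hz rate
    spec = np.abs(np.fft.fft(ds[:32], n=32))
    return spec[:11]                   # ~300 Hz of frequency points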

V. BUILDING A TCN

A. TCN STRUCTURE
As motivated in the introduction, a fully connected 3-layer TCN with causal convolutions [30] is structured to classify the predefined ultrasonic patterns. The built TCN uses a set of 1D filters of size 3 across all the layers, as shown in Fig. 7. These 1D filters are dilated during the convolution process. The definition of such a dilated convolution at each layer is given in (20) as
$Conv_{dilated}(f) = \sum_{j=0}^{J-1} kernel(j)\, \cdot\, in(f - d \cdot j)$,
where $kernel(j)$ is a filter of length $J = 3$, $in$ is an input vector, $d$ is the assigned dilation rate and $f$ is the index of each streamed frame. Recall that we start by performing gestures at $\sim 45^\circ$ of elevation and beamform across 6 directions surrounding the array. The 6 independent 2D features are concatenated to form a 2D matrix of size $66 \times 30$, where 30 is the number of overlapped frames. This forms the size of the multi-angle fused 2D input features used in training, as sketched below.
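A reconstruction sketch of such a TCN in Keras follows; the channel width of 16 is taken from the 16-channel output mentioned in the text, while the dropout rate is an assumption, so the exact parameter count will differ from the paper's 10k figure:

import tensorflow as tf
from tensorflow.keras import layers

def build_tcn(num_classes, channels=16):
    # 3 causal dilated Conv1D layers, kernel 3, dilation 2^(l-1), over a
    # 30-frame x 66-point multi-angle input (channel width assumed).
    inp = layers.Input(shape=(30, 66))
    x = inp
    for l in range(3):
        x = layers.Conv1D(channels, kernel_size=3, padding="causal",
                          dilation_rate=2 ** l, activation="relu")(x)
        x = layers.Dropout(0.1)(x)               # dropout rate assumed
    x = layers.Lambda(lambda t: t[:, -1, :])(x)  # last step: RF = 15 frames
    out = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

model = build_tcn(num_classes=8)                 # mode 1 alphabet: 8 symbols
model.compile(optimizer="adam", loss="categorical_crossentropy")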
The TCN learns the patterns from real-time recordings of Doppler shifts in a 300 Hz down-mixed bandwidth, presented in the form of a band-limited spectrogram. Therefore, there is no need to decode or phase-lock the incoming signal with the transmitter.
The used TCN in Fig. 7 is fully connected across the channels in each layer, where the dilation ratio ($d$ in (20)) increases as a function of the layer index ($2^{l-1}$), i.e., as the features propagate through each $l$th layer. The receptive field $RF_L$ of the final layer ($L = 3$) can be calculated as in (21),
$RF_L = RF_0 + \sum_{l=1}^{L} (J - 1)\, 2^{l-1} = 1 + 2(1 + 2 + 4) = 15$ input frames,
where $RF_0$ is the receptive field of the input to itself, which is equivalent to a single input frame. The calculated receptive fields are shown in Fig. 7, where 15 input frames are needed to output the first 16-channel values at the last layer. Given that the first input frame is equivalent to 30 ms and the next 14 consecutive frames are overlapped by 20 ms, the resulting receptive field amounts to $10 \times 14 + 30 = 170$ ms.

B. ULTRASONIC PATTERNS COLLECTION AND LABELING
We transmit the ultrasound tone omni-directionally from underneath the center of the array. The array is built on a breadboard with tiny holes through which a speaker transmits the assigned ultrasound tone. To have a robust gesture recognition system, the collected training data set has to be properly labeled. As soon as a gesture is performed near the array, all of the microphones receive echoes from the performing hand within a relatively small period of time, which can be as low as 0.2 s. Properly labeling the classifier input at the correct time is difficult and time consuming since the gestures are relatively fast and spontaneous. Therefore, we have built an automatic tool for labeling in-air ultrasonic patterns by spotting time-variant Doppler shift values at all angles.
It is important to note that in our study we assume that the transmitter (the speaker) location is fixed relative to the location of the array. Moving the speaker while performing the gestures will cause two components of Doppler shift readings, the first component is caused by the movement of the speaker and the second one is caused by the hand gesture.   Therefore, the definitions mentioned in (1) and (2) will no longer be valid.
Each gesture symbol set in the definitions (1) and (2) is collected and stored in a separate '.wav' file, each containing a number of gesture symbols of the same type. The principle behind the built tool is to utilize the change in echo intensity and Doppler shift to label the input of the TCN. When the array is continuously beamforming at these steering angles, there are moments where the TCN input buffer covers only silence. This can be seen when the allocated ~300 Hz bandwidth is empty, indicating that there is no recorded activity near the array, as shown in the red segment in Fig. 8. The tool performs a sum operation (integration) over all frequency bins from all steering angles and across the ~300 Hz bandwidth of each. A peak detection algorithm is applied on non-overlapping windows of 30 frames (equivalent to the assigned TCN input buffer). Whenever a peak is spotted in a non-overlapping window, the window is marked as a valid TCN input, and invalid otherwise. The gesture symbol activity is depicted in the detected peaks in Fig. 8, with green and red segments for valid and invalid TCN inputs, respectively. For each mode, we collect a data set of ~200,000 sliding windows, similar to the ones shown in Fig. 11. Algorithm 2 and Fig. 8 show the detailed steps of labeling and data collection, and a sketch of the windowing logic follows below.
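The sketch below captures the windowing logic of the labeling tool; the peak-validity threshold is not specified in the text, so it is left as a caller-supplied assumption:

import numpy as np

WIN = 30   # non-overlapping window = TCN input buffer length (frames)

def label_windows(x_con, threshold):
    # Integrate all 66 concatenated frequency points per frame, then mark
    # each 30-frame window containing a peak as a valid TCN input
    # (threshold value is an assumption).
    energy = x_con.sum(axis=0)                    # sum across the rows
    valid = []
    for start in range(0, energy.size - WIN + 1, WIN):
        window = energy[start:start + WIN]
        valid.append(window.max() > threshold)    # peak -> gesture activity
    return valid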
Using Keras [32], a neural-network library written in Python, with the Adam optimizer [33], the TCN is trained separately on the two data sets for mode 1 and mode 2. We activate the TCN layers using rectified linear units (ReLU) and the output classification layer using Softmax activation. Dropout is applied during training for regularization. All experiments were performed on a Windows 10 machine with 16 GB RAM, an Intel i5-7300HQ, and a GTX 1050 with 4 GB VRAM.
For the peak detection algorithm to work as a labeling tool, the data collection and the labeling are done in normal environments such as offices. Once the labeling is complete, different noise recordings can be superimposed on the labeled data. The most critical types of noise are the ones that can clip the microphones. If the microphones receive ultrasonic reflections with clipped information, the received echoes will contain wideband noise that can contribute to false accept and false reject cases. Fortunately, clipping incidents happen instantly, unlike ultrasonic patterns that last longer. We show in section VII a naive approach to remove flickers of false patterns.

Algorithm 2 Labeling algorithm
Input: Stream of concatenated frequency points X_con from six steering angles
Output: Labeled sliding windows
1: Initialization: sum X_con across the rows.
2: Create non-overlapping vectors m_{1x30}, such that (m_{1x30} ⊂ M_j), where 30 is the TCN input buffer length.
3: for all (m_{1x30} ⊂ M_j) do
4:   Find the location of the local maximum in m_{1x30}; store it in a PK vector whose size equals the number of peaks (the number of gesture symbols).
5: end for
6: Slide a window over X_con, skipping windows where a peak location pk is not visible.

VI. SYSTEM PERFORMANCE AND ANALYSIS
For prototyping, 7 analog microphones of type Knowles SPU0410LR5H-QB, operating over 100 Hz - 80 kHz, are used in the experiment. A Roland OCTA-CAPTURE sound card and audio interface is used for data acquisition and analog-to-digital conversion (sampling at 96 kHz and 24-bit resolution). Playrec (http://www.playrec.co.uk/), a multi-channel Matlab utility (MEX file) for sound card access, is used as the acquisition tool. Using the sound card, the 7 microphone channels are read in parallel and ported to Matlab 2019, in which the FaS beamforming and the TCN inference are applied. For testing, we transmit a 23 kHz tone from the speaker underneath the array and instantiate the process shown in Fig. 11. Using a circular shift of the coefficients, the 7 channels are convolved with the corresponding FIR coefficients to generate outputs at 6 steering angles. The rest of Fig. 11 shows the feature extraction process containing a mixer, LPF, down-sampling and finally a 32-point FFT. As shown in Section V, the first 11 points in each beamformed FFT vector (~300 Hz) are background-subtracted to generate sliding windows of 30 frames for training the TCN. The collected data set is augmented with noise files recorded from sources that can be considered background noise, such as shouting, a fan, running water, a hairdryer, a keyring, etc. To prepare the data for the TCN, the augmented data set is split into 80% for training, 10% for validation and the remaining 10% for reporting the test accuracy in the form of confusion matrices. The experiment consists of first training the TCN using the training and validation sets and then testing it using the test set. The training, validation and test accuracies for both mode 1 and mode 2 reach ~99%, ~98%, and ~98%, respectively. We emphasize the test accuracy as it shows the performance of the TCN on an unseen data set. The confusion matrices in Fig. 9 and Fig. 10 show the per-symbol test classification accuracy for mode 1 and mode 2. They report the percentage of the automatically labeled test samples that are classified correctly, where the x-axis shows the true labels and the y-axis shows the predicted labels; accordingly, each column of the confusion matrices sums to 100%. The range of the per-symbol test accuracy is 92%-100%, giving more insight into how well the system detects each independent symbol in both mode 1 and mode 2. There is a trade-off between the power of the transmitted signal and the maximum distance at which the hand can be positioned. In normal cases, it is not desirable to transmit high power and drain the battery quickly. Therefore, we transmit just enough power to enable Doppler shift patterns at 10 cm - 30 cm from the array. The measured ultrasound tone intensity is around 50 dB SPL at 10 cm from the speaker. Starting with detection of mode 1 symbols, we show the real-time TCN classification output as Softmax values of the predicted gesture symbols in Fig. 12. Using both hands, a user starts by performing a double tap ($DTap_{n1,n2}$) for around 3 seconds, then shifts to an alternating tap ($ATap_{n1,n2}$) for another 3 seconds. The user momentarily stops for 1 second and then performs a single-hand tap at direction $(300^\circ, 45^\circ)$ for around 7 seconds. The user again stops for around 2 seconds and then continues the single-hand tapping at $(0^\circ, 45^\circ)$ for 3 seconds, after which the user completely stops.
In another experiment, the same TCN structure is loaded with mode 2 coefficients, with which the TCN can recognize mode 2 gesture symbols. Fig. 13 shows 15 seconds of a user performing gesture symbols in mode 2. The user performs the following sequence of symbols: ↑, ↓, →, ← and finally Choose. An obvious difference between mode 1 and mode 2 symbols is the width of the detected symbol, which is mainly attributed to the nature of the gesture. We refer to gesture symbols from the mode 1 alphabet as state-based and to gesture symbols from the mode 2 alphabet as event-based. For instance, a swipe left is only visible from when the hand starts the swipe to when it finishes, resulting in a fraction-of-a-second gesture symbol indicated by the narrow pulses of the Softmax values. More specifically, for mode 1 symbols almost the whole TCN input buffer is occupied with valid features, while for mode 2 symbols the TCN input buffer is only partially occupied with valid features.

A. COMPUTE COMPLEXITY
The compute pipeline is shown in Fig. 14, where $MAC_F$ stands for the MAC operations of feature extraction (before the TCN) and $MAC_{TCN}$ for the TCN MAC operations. This is an example of how the compute complexity of this pipeline can be reduced by leveraging the naturally slow hand movements, as illustrated below.
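As a back-of-the-envelope illustration (not the paper's profiling figures), the per-inference MAC count of the TCN structure sketched in section V can be estimated as follows; the channel width of 16 is an assumption carried over from that sketch:

def tcn_macs(frames=30, feat=66, ch=16, kernel=3, conv_layers=3, classes=8):
    # Approximate MACs per TCN inference for a 3-layer causal TCN
    # (kernel 3, assumed 16 channels, 30-frame x 66-point input).
    macs = frames * feat * kernel * ch                     # first conv layer
    macs += (conv_layers - 1) * frames * ch * kernel * ch  # remaining layers
    macs += ch * classes                                   # output dense layer
    return macs

print(tcn_macs())   # ~1.4e5 MACs per inference under these assumptions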

VII. GESTURES DIVERSITY IN POST TRAINING
In studies such as [8] and [9], gestures like (←, →), (→, ←), (↑, ↓) or (↓, ↑) are treated as separate gestures that are added to the system during training. While this increases the complexity of a classifier due to the growing number of classes, we provide this diversity in a post-training stage through various combinations of mode 1 or mode 2 symbols.
Recalling that the number of symbols in the gesture alphabet is equivalent to $M1_{t,\theta,\beta}$ or $M2_{t,\theta,\beta}$, a new set of compound gestures can be created by grouping these symbols together from a single mode at a time. Let $G_{comp}$ be the number of new compound gestures and $N_{comp}$ be the number of gesture symbols inside each compound gesture, e.g. 2. If we allow repeated gesture symbols, then for an alphabet of $M$ symbols the number of possible compound gestures is $G_{comp} = M^{N_{comp}}$, which is $8^2 = 64$ in $M1_{t,\theta,\beta}$ and $5^2 = 25$ in $M2_{t,\theta,\beta}$. On the other hand, the number of compound gestures without repeated symbols is $G_{comp} = \frac{M!}{(M - N_{comp})!}$, giving $\frac{8!}{6!} = 56$ in $M1_{t,\theta,\beta}$ and $\frac{5!}{3!} = 20$ in $M2_{t,\theta,\beta}$, as sketched below. The new compound gestures are described as series of TCN classes from mode 1 or mode 2, which can be set by the user as shown in Fig. 17. In this article, we highlight the ability of the classifier to robustly detect the ultrasonic gesture symbols (mode 1 and mode 2 patterns) such that the compound gestures can be accurately detected.
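The counts above can be verified directly with itertools; the symbol names below are placeholders:

from itertools import permutations, product

mode1 = [f"sym{k}" for k in range(8)]   # placeholder names for the 8 symbols
with_rep = sum(1 for _ in product(mode1, repeat=2))    # M**N_comp = 8**2
without_rep = sum(1 for _ in permutations(mode1, 2))   # M!/(M-N_comp)! = 8!/6!
print(with_rep, without_rep)                           # -> 64 56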
However, it is noticed that during fast transitions between symbols, flicker noise appears at other classes; this is seen at seconds 3, 8 and 9.5. This type of flicker noise is attributed to overlapping features from different symbols appearing in the same TCN input buffer, leading the TCN to confuse symbols. Therefore, we add a buffer vector $O$ at the output of the TCN classification that stores the latest credible Softmax values of all the symbols. The credibility assessment is implemented using a threshold value $th$ such that if the maximum Softmax value exceeds this threshold, the buffer is updated; otherwise the latest credible inference result $O_{inf-1}$ is repeated. Using this naive denoising, we show how Fig. 15 is denoised to Fig. 16. The denoising procedure is described in Algorithm 3 and in Fig. 17. As these compound gestures are defined by the user, we maximize the performance of detecting the ultrasonic symbols, as shown in the diagonal elements of the confusion matrices in Fig. 9 and Fig. 10. Therefore, we only need to emphasize the ability of the classifier to robustly detect the designed ultrasonic symbols to form ultrasonic words. When the TCN outputs a vector containing the Softmax values of the designed symbols, the maximum value in this vector is compared with a threshold; we use a threshold of 0.9 in this study. If the maximum Softmax value is greater than or equal to 0.9, the detected symbol is marked as credible and is stored in a series of buffers forming a vector. This vector contains a sequence of credible symbol detections that is compared with a set of user-defined sequences, to finally report the matched sequence, as shown in Fig. 17. An intuitive explanation of our approach is to build a classifier that robustly detects A, B or C symbols, such that the added post-detection stage can track AB, BA, CB, etc., by element-wise comparison. Note that each compound gesture is bounded by the No-Gesture class. In practice, this is equivalent to a small pause at the beginning and end of the user-defined sequences to prevent overlapping between compound gestures.
It is important to note that there is a trade-off between the length of the sequence, the latency and the accuracy of the compound gestures. On average, a user performs a gesture symbol in 0.5 seconds. Increasing the size of the sequence results in a long compound gesture and accordingly a slow interactive system. Furthermore, increasing the size of the sequence reduces the classification accuracy: assuming independent detections, the accuracy of a compound gesture such as ABC is the product Pr(A) · Pr(B) · Pr(C), where Pr(A), Pr(B) and Pr(C) are the associated TCN output classification accuracies (diagonal elements in Fig. 9 and Fig. 10). Therefore, as the sequence grows, the classification accuracy decreases and the compound gesture latency increases. We set the compound gesture sequence length to a minimum of 1 symbol and a maximum of 3 symbols. For example, using the 5 mode 2 symbols, one can generate gesture sequences of 3 symbols (i.e., $5^3 = 125$ compound gestures). Using the independence assumption, we show the reduction in the compound classification accuracies in Fig. 18. The user can find the permutations that correspond to high compound classification accuracies. All the measurements shown in Fig. 8 - Fig. 16 are recorded from the designed system operating under real-time requirements. Essentially, this ultrasound-based gesture recognition system is designed to be hosted by edge devices where all the computations happen locally without any connection to cloud servers. Because many interface systems are moving towards miniaturization, there will not be enough area to host physical buttons. Instead, a small form factor ultrasonic array can be leveraged as a contact-less control interface.

Algorithm 3 Denoising algorithm
Input:
• Softmax values vector C of size 1 × G, where $\sum_{g=0}^{G-1} c(g) = 1$ and G is the number of gesture symbols.
• Confidence threshold th = 0.9.
Output: Denoised class buffer O
1: Initialization: Create a buffer O of size 1 × G with an initial No-Gesture Softmax value of 1.
2: For each inference, if max(C) ≥ th, update O with C; otherwise repeat the latest credible buffer O_{inf-1}.

One counterargument would be to use Natural Language Processing (NLP) for control purposes. However, this requires a connection to the cloud, as NLP recognition models are computationally demanding and cannot be ported locally to an offline edge device. The proposed ultrasonic array is compact, and the proposed gestures are flexible, simple and easy to remember. Among the many use-cases that can be served by such technology in a smart home are lowering a curtain using slide down, or fast-forwarding a song using double slide right.
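A minimal sketch of Algorithm 3 follows; the class layout (index of No-Gesture) is an assumption:

import numpy as np

class Denoiser:
    # Hold the latest credible Softmax vector and repeat it whenever the
    # current inference falls below the confidence threshold (Algorithm 3).
    def __init__(self, num_classes, no_gesture_idx=0, th=0.9):
        self.th = th
        self.buf = np.zeros(num_classes)
        self.buf[no_gesture_idx] = 1.0        # start from No-Gesture
    def update(self, softmax_vec):
        if softmax_vec.max() >= self.th:      # credible inference
            self.buf = softmax_vec.copy()
        return self.buf                       # otherwise repeat O_{inf-1}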

VIII. CONCLUSION
This study proposes a low-cost pipeline of beamforming, feature extraction and temporal convolutions for the purpose of in-air ultrasonic gesture recognition. The study showed how hand movement patterns are localized at different directions using customized beamforming to construct frequency features from multiple angles. The microphone array spots Doppler shift readings generated at a set of predefined angles by a moving hand. Unique ultrasonic echoes are visualized across time and space. Features from six beamformed outputs are fused using fully connected causal convolutions in a Temporal Convolutional Network (TCN) to classify two modes of a gesture alphabet. This is all achieved under low-resource settings of compute complexity and memory usage, where the TCN has only 10k parameters. The user is given a post-training degree of flexibility to gain gesture diversity. We envision such a small form factor array being hosted by small form factor IoT edge devices, leveraging a low-cost ultrasound-based Human-Machine Interface.
MIN LI (Member, IEEE) received the B.E. degree (Hons.) in electrical engineering from Zhejiang University, China, and the Ph.D. degree in electrical engineering from Katholieke Universiteit Leuven, Belgium. He has been with Lucent Bell Labs Research, Microsoft Research, the University of Illinois at Urbana-Champaign, USA, IMEC, NXP Semiconductors, and Goodix Technology. He is currently a Senior Principal Architect with Goodix Technology, leading various innovations of human-computer interaction from high-level concept down to low-level silicon implementations. He is also a part-time Associate Professor with the Eindhoven University of Technology, Eindhoven, The Netherlands. He is also a member of the IEEE Design and Implementation of Signal Processing Systems Technical Committee (IEEE-DISPS TC). He has been serving on TPCs of many premium international conferences.
JOSÉ PINEDA DE GYVEZ (Fellow, IEEE) is currently a Fellow with NXP Semiconductors, where he coordinates research and development efforts on low-power design technologies. His industrial responsibilities are positioned at the interface between design and technology. He was a Faculty Member with the Department of Electrical Engineering, Texas A&M University, USA. He also holds the professorship Resilient Nanoelectronics (part-time) with the Department of Electrical Engineering, Eindhoven University of Technology, The Netherlands. This professorship fills a gap between industry and academia by bringing industrial knowledge into classrooms, and open innovation into NXP. He has been an Associate Editor of several IEEE TRANSACTIONS and is often involved in program and steering committees of international symposiums.