Canonical Cortical Graph Neural Networks and its Application for Speech Enhancement in Audio-Visual Hearing Aids

Despite the recent success of machine learning algorithms, most models face drawbacks when considering more complex tasks requiring interaction between different sources, such as multimodal input data and logical time sequences. On the other hand, the biological brain is highly sharpened in this sense, empowered to automatically manage and integrate such streams of information. In this context, this work draws inspiration from recent discoveries in brain cortical circuits to propose a more biologically plausible self-supervised machine learning approach. The approach combines multimodal information using intra-layer modulations together with Canonical Correlation Analysis, and a memory mechanism to keep track of temporal data, with the overall approach termed Canonical Cortical Graph Neural Networks. It is shown to outperform recent state-of-the-art models in terms of clean audio reconstruction and energy efficiency for a benchmark audio-visual speech dataset. The enhanced performance is demonstrated through a reduced and smoother neuron firing rate distribution, suggesting that the proposed model is well suited for speech enhancement in future audio-visual hearing aid devices.


Introduction
According to the World Health Organization (WHO), the number of people requiring hearing rehabilitation worldwide is estimated to rise from 430 million today to 700 million by 2050, with nearly 2.5 billion people presenting at least some degree of hearing impairment [23]. Beyond the impairment itself, deafness also impacts social relationships [19,16] and environment perception [22], leading to other psychological and health conditions [15]. Under such circumstances, employing high-end energy-efficient technological approaches to build cross-modal sensory devices, i.e., combining audio and visual inputs to enhance hearing aid devices, seems a plausible way to improve individuals' quality of life.

⋆ The authors are grateful to FAPESP grants #2013/07375-0, #2014/12236-1, #2017/02286-0, #2018/21934-5, #2019/07665-4, and #2019/18287-0, CNPq grants #307066/2017-7 and #427968/2018-6, as well as the Engineering and Physical Sciences Research Council (EPSRC) grant EP/T021063/1.
In the last decades, machine learning-based techniques have proven suitable for tackling problems in virtually any field of science, industry, or even daily life, ranging from computer vision [36] to medicine [29] and satellite imagery processing [12,28]. Machine learning has also been successfully employed in the context of speech enhancement [34,35], whose aim is to enhance speech quality and intelligibility when noise degrades them significantly [5]. Nevertheless, such methods may suffer massive performance degradation in the presence of overwhelming noise [7]. Many works address this problem using multimodal audio-visual (AV) information fusion. Combining AV information usually demands more sophisticated approaches, which intrinsically involve several challenges, such as data alignment, bridging semantic gaps between low-level features and high-level information [8], and learning coherent and correlated latent patterns across different input modalities.
Combining noisy audio and clean images for clean signal reconstruction is analogous to reading the lips and body movements of a speaker in a boisterous environment, e.g., a pub with loud music, to obtain additional information and create a context that enhances the quality of speech suppressed by the loud sound. Ngiam et al. [21], for instance, proposed a multimodal method capable of improving a target modality feature representation. Further, Adeel et al. [2] provided several improvements to the field, presenting a chaotic model for lip-reading that integrates the Internet of Things (IoT) and a 5G Cloud-Radio Access Network, further improving AV information transmission for real-time speech reconstruction. Other works employ deep learning-based approaches that exploit AV cues to estimate clean audio [4,20], also considering speakers of distinct languages [13].
Recently, Passos et al. [25] proposed a multimodal self-supervised Graph Neural Network (GNN) that combines AV data using Canonical Correlation Analysis Graph Neural Networks (CCA-GNN) [37], also modeling the temporal information in the graph using the so-called prior-frame positional encoding. The method obtained outstanding results in terms of audio reconstruction and energy efficiency, the latter analyzed through the neuronal activation rate.
Despite the advantages presented in [25], the model lacks some features expected of a biologically plausible approach. In this context, Passos et al. [24] proposed a multimodal approach using burst-dependent learning [26], a method inspired by more recent studies on the physiological mechanism of pyramidal neurons that regulates learning through the frequency of bursts, where the credit assignment problem is addressed by the primary principles of pyramidal neurons suggested by Körding and König [18]. In parallel, the study of canonical cortical circuits [9,14] provides some interesting insights into how the brain processes multimodal information. Canonical cortical circuits model pyramidal neurons as receiving different modalities of information, modulated in an excitatory or inhibitory fashion on deeper layers. Moreover, biologically plausible models should not underestimate the importance of memory in the learning process, which plays a fundamental role in information inference, acting as an intrinsic context for novelty comprehension.
The attributes mentioned above motivated the development of the Canonical Cortical Graph Neural Network, a novel self-supervised architecture that remodels and improves the ideas developed in [25] by introducing a more biologically plausible approach to modulating and filtering the multimodal information inside the so-called cortical graph layers. It also introduces a memory concept inside each cortical graph block, composed of a mechanism to "forget" irrelevant facts and update itself considering every consecutive node, which is presented in a logical time-step sequence using prior-frame-based positional encoding. Experiments conducted over the AV ChiME3 [5] dataset show that the Canonical Cortical Graph Neural Network obtained state-of-the-art results, outperforming CCA-GNN in terms of convergence speed, audio reconstruction, and neuron firing rates. Regarding the latter, the model not only obtained lower activation rates, but also produced smoother firing-rate distributions.
The main contributions of this paper are presented as follows:
1. To propose the Canonical Cortical Graph Neural Network, a biologically plausible model for multimodal information feature extraction.
2. To introduce a novel paradigm for intra-layer information fusion and memory modeling.
3. To provide a self-supervised energy-efficient model for correlated feature extraction for AV-based clean audio data reconstruction.
4. To foster the literature regarding speech enhancement and AV hearing aids.
The remainder of this paper is organized as follows. Section 2 provides a theoretical background regarding Graph Neural Networks with Canonical Correlation Analysis and the prior frame-based graph positional encoding, while Section 3 introduces the novel Canonical Cortical Graph Neural Network. Further, Sections 4 and 5 present the methodology and the experimental results, respectively. Finally, Section 6 states conclusions and future work.

Theoretical Background
This section provides a brief theoretical background regarding Graph Neural Networks with Canonical Correlation Analysis and the prior frame-based graph positional encoding.

Graph Neural Networks with Canonical Correlation Analysis
Let $G = (X, A)$ be a graph where $A \in \mathbb{R}^{N \times N}$ represents the adjacency matrix and $X \in \mathbb{R}^{N \times F}$ the input data represented by graph nodes. Additionally, $F$ represents the feature space dimension, and $N$ denotes the number of nodes. The CCA-GNN [37] comprises three main components, i.e., a random graph generator $\mathcal{T}$, a graph neural network encoder $f_\theta$, where $\theta$ denotes the network's learnable weights, and an objective function based on Canonical Correlation Analysis. The graph generator produces two augmented versions of the same graph, which are presented to the graph neural network encoder so that the canonical correlation between their outputs can be computed and maximized.
The idea behind such an approach is discarding decorrelated components while preserving correlated ones. In a nutshell, the model tries to keep the more significant information present in both augmented versions and to avoid individual behaviors, such as anomalies and noise. Figure 1 depicts the Canonical Correlation Analysis Graph Neural Network. Each sample represents a node in a graph whose edges describe the relationship between pairs of samples. The random graph generator T produces two augmented versions of this graph, which are employed to feed the GNN model. The output of both versions is compared using the canonical correlation analysis, and the network parameters are adjusted to maximize this metric.
The graph augmentation process follows the same approach presented in [38,30], which conducts random feature masking and edge dropping. In this context, each $t_i \sim \mathcal{T}$ comprises a distinct view, i.e., a transformed version of $G$, sampled at each iteration $i$.
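The augmentation step described above can be illustrated with a minimal sketch in plain Python. The function name `augment`, the probabilities, and the choice to share one feature mask across all nodes are assumptions made for illustration; they are not taken from [38,30].

```python
import random

def augment(features, edges, feat_mask_p, edge_drop_p, rng):
    """Return one randomly augmented view (features, edges) of a graph.

    features: list of per-node feature lists; edges: list of (i, j) pairs.
    Assumption: a single feature mask is sampled and shared by every node.
    """
    n_feats = len(features[0])
    # Decide, per feature column, whether it survives the masking.
    keep = [rng.random() >= feat_mask_p for _ in range(n_feats)]
    masked = [[x if k else 0.0 for x, k in zip(row, keep)] for row in features]
    # Drop each edge independently with probability edge_drop_p.
    kept_edges = [e for e in edges if rng.random() >= edge_drop_p]
    return masked, kept_edges

rng = random.Random(0)
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
E = [(0, 1), (1, 2), (0, 2)]
# Two independently sampled views of the same graph, as fed to the encoder.
view_a = augment(X, E, 0.3, 0.3, rng)
view_b = augment(X, E, 0.3, 0.3, rng)
```

Each call draws a fresh mask and a fresh edge subset, so the two views differ while sharing the underlying graph.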
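The encoder is composed of a two-layered GNN but can be easily replaced by any more sophisticated model. The target function models the learning process as a canonical correlation maximization problem [10] using a self-supervised approach that considers two normalized views, $Z_A$ and $Z_B$, produced over randomly augmented versions of the original graph. The objective is to maximize the correlation between these views, which is formalized as minimizing the following loss:

$$\mathcal{L} = \left\| Z_A - Z_B \right\|_F^2 + \lambda \left( \left\| Z_A^\top Z_A - I \right\|_F^2 + \left\| Z_B^\top Z_B - I \right\|_F^2 \right), \tag{1}$$

where $\lambda$ is a non-negative trade-off hyperparameter, and $I$ is the identity matrix. The left term is the invariance term, which enforces invariance by minimizing the distance between the two views. In contrast, the term on the right side is the decorrelation term, which encourages distinct features to capture different semantics through a regularization procedure. The terms in Equation 1 can be decomposed using a variance-covariance perspective [31]. Let $s$ be an augmented version of the graph sampled from an input $x$, and let $Z_s$ be a view of $s$ obtained from the encoder output, with $D$ output features. The invariance term is minimized using expectation, described as follows:

$$\mathcal{L}_{inv} = \left\| Z_A - Z_B \right\|_F^2 = \sum_{i=1}^{N} \sum_{k=1}^{D} \left( z^A_{i,k} - z^B_{i,k} \right)^2 \cong \mathbb{E}_x \left[ \sum_{k=1}^{D} \mathbb{V}_{s|x}\left[ z_{s,k} \right] \right] \cdot 2N, \tag{2}$$

where $\mathbb{V}$ denotes the variance. Similarly, one can formalize the decorrelation term as follows:

$$\mathcal{L}_{dec} = \left\| Z_S^\top Z_S - I \right\|_F^2 = \left\| \mathrm{Cov}_s[z] - I \right\|_F^2 \cong \sum_{i \neq j} \rho\left( z_{s,i}, z_{s,j} \right)^2, \quad Z_S \in \{Z_A, Z_B\}, \tag{3}$$

where $\rho$ denotes the Pearson correlation coefficient, and $\mathrm{Cov}$ stands for the covariance matrix.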
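As a rough illustration of the objective in Equation 1, the sketch below evaluates an invariance term plus a λ-weighted decorrelation term on two small views. It assumes the loss takes the form used by CCA-based self-supervised graph learning [37], i.e., the squared Frobenius distance between column-standardized views plus λ times the deviation of each view's feature correlation matrix from the identity. The helper names are hypothetical, and columns are assumed to be non-constant.

```python
import math

def standardize(Z):
    """Give every column zero mean and unit variance, then scale by
    1/sqrt(N) so that Z^T Z becomes the feature correlation matrix."""
    n, d = len(Z), len(Z[0])
    out = [[0.0] * d for _ in range(n)]
    for k in range(d):
        col = [row[k] for row in Z]
        mean = sum(col) / n
        std = math.sqrt(sum((c - mean) ** 2 for c in col) / n)
        for i in range(n):
            out[i][k] = (Z[i][k] - mean) / (std * math.sqrt(n))
    return out

def gram_minus_identity_sq(Z):
    """Squared Frobenius norm of Z^T Z - I (decorrelation penalty)."""
    n, d = len(Z), len(Z[0])
    total = 0.0
    for a in range(d):
        for b in range(d):
            g = sum(Z[i][a] * Z[i][b] for i in range(n))
            total += (g - (1.0 if a == b else 0.0)) ** 2
    return total

def cca_loss(Za, Zb, lam):
    """Invariance term plus lam times the decorrelation term."""
    Za, Zb = standardize(Za), standardize(Zb)
    inv = sum((x - y) ** 2 for ra, rb in zip(Za, Zb) for x, y in zip(ra, rb))
    dec = gram_minus_identity_sq(Za) + gram_minus_identity_sq(Zb)
    return inv + lam * dec

# Two uncorrelated features: identical views give a loss of zero.
Z = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_same = cca_loss(Z, Z, 1e-4)
```

Identical, already decorrelated views yield zero loss, while views that disagree are penalized by the invariance term.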

Prior Frame-based Graph Positional Encoding
A time-sequence-based approach for graph positional encoding was recently proposed by Passos et al. [25]. The method computes each node's neighborhood by connecting the node to its $k$ prior frame nodes in the time sequence and attributing connection weights to the edges according to their distances in this time-based space. Figure 2 depicts this idea. The edge weight $w_{ij}$ that connects a node $i$ to a node $j$ is computed from $d_{ij}$, the distance from node $i$ to node $j$ in a frame-step space, following the formulation in [25]. Those values are stored in a distance matrix used to compute the positional encoding of the nodes.
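The prior-frame neighborhood can be sketched as follows. The weighting function used here, a linear decay `(k - d + 1) / k`, is purely a placeholder assumption and is not the formula from [25]. The directed (non-symmetric) adjacency and the function names are also illustrative choices.

```python
def prior_frame_adjacency(n_nodes, k, weight_fn):
    """Weighted adjacency linking each frame i to its k prior frames.

    weight_fn(d, k) maps a frame-step distance d (1..k) to an edge weight.
    """
    A = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for d in range(1, k + 1):
            j = i - d
            if j < 0:
                break  # no frames before the start of the sequence
            A[i][j] = weight_fn(d, k)
    return A

def linear_decay(d, k):
    # HYPOTHETICAL weight: closer prior frames get larger weights.
    return (k - d + 1) / k

# Five frames, each connected to up to three prior frames.
A = prior_frame_adjacency(5, 3, linear_decay)
```

Any other monotonically decreasing function of the frame-step distance could be passed as `weight_fn` without changing the construction.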

Canonical Cortical Graph Neural Network
This paper presents a novel self-supervised approach for training multimodal graph neural networks in a more biologically plausible way. In this context, the architecture combines several concepts inspired by cortical circuits observed in the brain [14,9] to model memory and multimodal information fusion with canonical correlation analysis [10], and to maximize the correlation of the information extracted from different inputs. Figure 3 presents a general overview of the proposed model, which suggests that the most interesting procedures of the model are implemented at the layer level. Therefore, Figure 4 illustrates this in-layer process, also depicting the behavior of each operation.

A formal description of the procedures depicted in Figure 4 and performed inside each cortical layer is provided below. First, one should compute the audio $f_a$, visual $f_v$, memory $f_m$, and modulation $f_w$ filters from the graph convolution outputs, where $\sigma$ stands for the Sigmoid function, $W_a$, $W_v$, $W_m$, and $W_w$ are the weight matrices for the audio, visual, memory, and modulation filters, respectively, $b_a$, $b_v$, $b_m$, and $b_w$ are the corresponding biases, and $[h_a, h_v]$ denotes the concatenation of the audio $h_a$ and the visual $h_v$ graph convolution outputs. Next, the pre-modulation $\rho$ is computed using the pre-modulation weight matrix $W_\rho$ and bias $b_\rho$. Finally, the modulation $\omega$ is computed using the operator $\otimes$, which stands for the dot product.

The memory $\mu$ update is the most intricate step since it considers both "forgetting" irrelevant memories using the memory filter $f_m$ and introducing the new experiences present in the modulation $\omega$. Moreover, the operation is performed individually for each node $n \in \{0, \ldots, N\}$, since the nodes are connected in a logical temporal sequence established by the prior-frame positional encoding [25], and it involves the node's memory $\mu_n$, modulation $\omega_n$, and memory filter $f_m^n$. Subsequently, the memory is updated using the memory weight matrix $W_\mu$ and bias $b_\mu$. Finally, the layer output, i.e., the new node representations of the audio and visual graphs, $h'_a$ and $h'_v$, respectively, is computed. In the case of an intermediate layer, $h'_a$ and $h'_v$ become the node representations of the subsequent layer's audio and visual graphs, respectively. Regarding the output layer, the model performs the canonical correlation analysis between $h'_a$ and $h'_v$ using Equation 1 and backpropagates this value to optimize the network parameters.
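To make the flow of operations inside a cortical layer more concrete, the sketch below chains the filters, pre-modulation, modulation, memory, and outputs named in the text. Every formula in it is an assumption modeled on LSTM-style gating, not the paper's equations: all filters are taken as Sigmoid-activated affine maps of the concatenation $[h_a, h_v]$, the pre-modulation and memory projection use tanh, $\otimes$ is applied element-wise, and the memory starts at zero. The names `cortical_layer`, `affine`, and `init` are hypothetical.

```python
import math
import random

def affine(params, x):
    """Compute W x + b for params = (W, b)."""
    W, b = params
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def tanh(v):
    return [math.tanh(x) for x in v]

def mul(u, v):
    """Element-wise product (assumed reading of the paper's operator)."""
    return [a * b for a, b in zip(u, v)]

def cortical_layer(H_a, H_v, P):
    """H_a, H_v: per-node vectors in time order; P: dict of (W, b) pairs."""
    mu = [0.0] * len(H_a[0])  # assumption: empty initial memory
    out_a, out_v = [], []
    for h_a, h_v in zip(H_a, H_v):
        x = h_a + h_v                          # concatenation [h_a, h_v]
        f_a = sigmoid(affine(P["a"], x))       # audio filter
        f_v = sigmoid(affine(P["v"], x))       # visual filter
        f_m = sigmoid(affine(P["m"], x))       # memory ("forget") filter
        f_w = sigmoid(affine(P["w"], x))       # modulation filter
        rho = tanh(affine(P["rho"], x))        # pre-modulation
        omega = mul(f_w, rho)                  # modulation
        # Forget part of the old memory, then insert the new modulation.
        mu = [fm * m + o for fm, m, o in zip(f_m, mu, omega)]
        m_out = tanh(affine(P["mu"], mu))      # projected memory
        out_a.append(mul(f_a, m_out))          # new audio representation
        out_v.append(mul(f_v, m_out))          # new visual representation
    return out_a, out_v

def init(rows, cols, rng):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows

rng = random.Random(0)
d = 4
P = {name: init(d, 2 * d, rng) for name in ("a", "v", "m", "w", "rho")}
P["mu"] = init(d, d, rng)
H_a = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(6)]
H_v = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(6)]
new_a, new_v = cortical_layer(H_a, H_v, P)
```

Because the memory is carried from one node to the next, the output for a given frame depends on all earlier frames in the sequence, which is the behavior the text attributes to the memory mechanism.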

Methodology
This section describes the dataset and configuration employed during the experiments.

AV ChiME3 Dataset
The dataset used in this paper aims to combine environmental information, i.e., audio and visuals, to train an efficient model for enhancing and amplifying clean audio signals considering multimodal data. The dataset comprises triples composed of image, clean audio, and noisy audio signals, from which the image and noisy audio denote the model's input while the clean signal stands for the desired output, i.e., the instance target in the context of supervised learning. The videos are extracted from the Grid [11] dataset, into which different classes of noise (public transport, pedestrian area, street junction, cafe) extracted from ChiME3 [6], with signal-to-noise ratios (SNR) ranging from −12 to 12 dB, are introduced, composing the AV ChiME3 [5] dataset. Further, the samples are preprocessed to improve the sentence alignment and incorporate multiple visual frames to include temporal data. In total, the dataset contains 989 sequences from 5 different speakers, described as one black male, two white males, and two white females. Each sequence comprises 48 frames, summing up to a total of 47,472 synchronized triples of samples.
Audio feature extraction. Log-FB vectors were employed to extract both clean and noisy audio features. The technique samples the audio signal at 22,050 Hz and then segments it into M 16 ms frames with 800 samples per frame and a 62.5% increment rate. Furthermore, it uses a Hamming window and Fourier transforms to produce a 2048-bin power spectrum. Finally, it employs a logarithmic compression to obtain the 22-dimensional log-FB signals.
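The framing step can be illustrated with a short sketch. It assumes that the 62.5% increment rate means the hop between consecutive frames is 62.5% of the frame length, i.e., 500 samples for 800-sample frames. The function names are hypothetical, and the filter-bank and logarithmic compression stages are omitted.

```python
import math

def frame_starts(n_samples, frame_len=800, increment=0.625):
    """Start indices of the frames that fit entirely inside the signal.

    Assumption: hop = frame_len * increment (500 samples by default).
    """
    hop = int(round(frame_len * increment))
    if n_samples < frame_len:
        return []
    return list(range(0, n_samples - frame_len + 1, hop))

def hamming(n):
    """Standard Hamming window coefficients of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

# Frame layout for a signal of 22,050 samples.
starts = frame_starts(22050)
window = hamming(800)
```

Each frame would then be multiplied by `window` before the Fourier transform is applied.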
Visual feature extraction. The samples used to generate the visual set of features were obtained using an encoder-decoder architecture over the Grid Corpus dataset. Lip regions were detected using the Viola-Jones algorithm [32] and then tracked across the frame sequence using the method proposed in [27]. Additionally, a manual effort was employed to randomly inspect the sentences, ensuring good lip tracking [1]. The encoder-decoder approach is then used to create vectors of pixel intensities, in which the first 50 components are vectorized in a zigzag order and then interpolated to match the equivalent audio sequence.

Experimental Setup
The experiments conducted in this paper compare the proposed Canonical Cortical Graph Neural Networks against the recent state-of-the-art model CCA-GNN for the task of multimodal clean audio reconstruction based on noisy audio and visual features, extracted using log-FB and an encoder-decoder approach, respectively, as described in Section 4.1. Both models comprise a similar architecture composed of two hidden layers, the first comprising 512 and the second 256 neurons. The parameters were selected empirically. The learning is conducted to maximize the canonical correlation for coherent feature extraction during 200 epochs using the Adam optimizer with a learning rate of $10^{-3}$ and a trade-off parameter of $\lambda = 0.0001$ (Equation 1). The hyperparameters of the CCA-GNN follow the configuration employed in [25]. Finally, the graphs are generated considering eight distinct prior-frame scenarios, i.e., $k \in \{3, 5, 7, 10, 15, 20, 25, 30\}$. Notice that the plots provided in the experimental results comprise only $k \in \{3, 10, 30\}$ to illustrate the difference between a low, a medium, and a high number of neighbors/prior frames.
After maximizing the canonical correlation between the two channels, the features extracted by the networks are employed to feed a dense layer responsible for reconstructing the clean signal, using the mean squared error as the cost function. The model is optimized during 2,000 epochs using the Adam optimizer with a learning rate of 0.005 and a weight decay of 0.0004.
The dataset was divided into 20 folds to provide an in-depth statistical analysis. Each fold comprises 50 sequences of 48 frames each, summing up to 2,400 samples per fold. As stated in Section 4.1, the dataset is formed by three subsets: (i) clean audio, (ii) noisy audio, and (iii) clean visual. The noisy audio and the clean visual input are used to feed the multimodal GNNs, while the clean audio is considered the reconstruction target. Finally, each fold is split into train, validation, and test sets, following the proportions of 60%, 20%, and 20%, respectively. The Wilcoxon signed-rank test [33] at a 5% significance level was considered for statistical evaluation.
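The fold arithmetic above can be checked with a small sketch. It assumes the 60/20/20 split is applied at the sequence level, so that the 48 frames of a sequence stay together, and that sequences are shuffled before splitting. Neither detail is stated in the text, and `split_fold` is a hypothetical helper.

```python
import random

FRAMES_PER_SEQUENCE = 48

def split_fold(n_sequences, rng, percents=(60, 20, 20)):
    """Split sequence indices of one fold into train/validation/test.

    Assumption: the split is done per sequence, after shuffling.
    """
    idx = list(range(n_sequences))
    rng.shuffle(idx)
    n_train = n_sequences * percents[0] // 100
    n_val = n_sequences * percents[1] // 100
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

rng = random.Random(0)
train, val, test = split_fold(50, rng)
# Number of frame-level samples in each partition.
sizes = [len(part) * FRAMES_PER_SEQUENCE for part in (train, val, test)]
```

Under these assumptions, a fold of 50 sequences yields 30, 10, and 10 sequences, i.e., 1,440, 480, and 480 samples out of the 2,400 per fold.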

Experiments
This section compares the proposed Canonical Cortical Graph Neural Networks against the state-of-the-art CCA-GNN. The experimental results consider the contexts of feature extraction analysis, clean audio signal reconstruction, as well as neuronal activation and energy efficiency.

Feature Extraction Analysis
This section explores the task of self-supervised feature extraction in terms of canonical correlation analysis. The idea consists of extracting correlated features considering both the audio and visual channels, contributing to better features for clean audio reconstruction. In this context, the Canonical Cortical Graph Neural Networks, presented in Figure 5 as Cortical, showed a performance 75% higher than the CCA-GNN, on average, in terms of canonical correlation, considering a small, medium, and high neighborhood, i.e., $k \in \{3, 10, 30\}$.

Further, Table 1 provides the final MSE values over the testing set considering eight distinct $k$ scenarios, i.e., $k \in \{3, 5, 7, 10, 15, 20, 25, 30\}$. Notice that bold values stand for the best values considering the Wilcoxon signed-rank test [33] at a 5% significance level. Such results show that the proposed model behaves better when exposed to a reduced number of neighbors, i.e., when the historical information is reduced, obtaining the lowest overall MSE in this scenario. This result can be explained by the memory implemented in the architecture: since the memory tries to model the predictions based on past frames, a longer temporal sequence makes this information ambiguous, exposing the model excessively to past instances and thus leading to overfitting. The idea is reinforced by the CCA-GNN approach, where the opposite behavior is observed in most cases, i.e., since CCA-GNN does not implement a memory, higher numbers of neighbors usually lead to lower prediction errors.

Clean Signal Reconstruction
Finally, Figure 7 depicts some examples of clean audio reconstruction for both models. Notice that the Canonical Cortical Graph Neural Networks with a reduced number of past frames, i.e., $k = 3$, obtained almost perfect reconstructions of the clean signal in both cases, which reinforces the idea that the memory implementation replaces the need for longer temporal information, described by a higher number of neighbors. Figure 7(b) shows that CCA-GNN is also capable of producing good representations when the number of past frames is large enough, i.e., $k = 30$, even though the reconstruction is not as good as that of the proposed method.

Table 2 provides a comparison of the Canonical Cortical Graph Neural Network against the state-of-the-art results reported in recent works using the AV ChiME3 dataset and multimodal approaches for audio-visual speech enhancement, i.e., CCA-GNN and CCA Multilayer Perceptron (CCA-MLP) [25], Long Short-Term Memory (LSTM) [3] and Multilayer Perceptron (MLP) [5] based approaches, as well as a canonical correlation-based short-time objective intelligibility deep learning (CC-STOI DL) [17] method. Notice that the proposed approach provided the most accurate results, outperforming all the compared techniques.

Neuronal Activation Analysis
The neuronal activation rate analysis is fundamental for future audio-visual hearing aids since the metric reflects the model's energy efficiency, a critical feature considering energy-constrained environments like hearing aids and embedded devices. In this context, reducing the firing rate directly implies a reduction in energy consumption. Figure 8 depicts the neuron activation rate over the intermediate block layers concerning the audio and visual channels. Notice that both

A numerical representation is presented in Table 3, which comprises a more complete set of neighbors (past frames), i.e., $k \in \{3, 5, 7, 10, 15, 20, 25\}$. The results reinforce the idea that the proposed model relies almost equally upon both noisy audio and visual modalities for the final output since their areas under the curve are practically the same for all cases, except for $k = 10$, with an irrelevant difference, i.e., 256 and 255 for noisy audio and visual, respectively. The results also show that the proposed model outperforms the baseline in terms of neuronal activation and, consequently, energy consumption in every evaluated scenario, showing itself as a suitable approach for speech enhancement in future audio-visual hearing aid devices considering both accuracy and energy performance.

Conclusion
This paper proposed a novel self-supervised method for multimodal correlated feature extraction through canonical correlation analysis maximization. The proposed model comprises a block-based neural network, where each block comprises two Graph Neural Network layers, i.e., one for the noisy audio and the other for the visual features, a memory, and a set of operations to filter, insert, delete, and modulate the input signals. Such operations are inspired by recent discoveries related to cortical cells and their interactions. Experiments were conducted over the AV ChiME3 dataset, designed for the task of multimodal clean audio reconstruction considering noisy audio and clean visual instances, and compared the proposed approach against the CCA-GNN, a similar state-of-the-art method proposed recently for the task. Results show that the proposed Canonical Cortical GNN provides more coherent and better-quality features, reaching higher values of canonical correlation. The proposed approach also obtained more accurate and cleaner reconstructions. Moreover, the model delivers higher energy efficiency, evaluated through the neurons' firing rate. Finally, it also showed itself to be less dependent on a more extended prior-frame sequence, i.e., high values of $k$, since the memory can store and track temporal information.
Regarding future work, we aim to extend a similar cortical-based architecture to Convolutional Neural Networks and applications to two-dimensional data. We also aim to implement the model on chips for training and inference for possible future implementation in hearing aid devices for AV speech enhancement.