Affective Neural Responses Sonified through Labeled Correlation Alignment

Sound synthesis refers to the creation of original acoustic signals with broad applications in artistic innovation, such as music creation for games and videos. Nonetheless, machine learning architectures face numerous challenges when learning musical structures from arbitrary corpora. This issue involves adapting patterns borrowed from other contexts to a concrete composition objective. Using Labeled Correlation Alignment (LCA), we propose an approach to sonify neural responses to affective music-listening data, identifying the brain features that are most congruent with the simultaneously extracted auditory features. To deal with inter/intra-subject variability, a combination of Phase Locking Value and Gaussian Functional Connectivity is employed. The proposed two-step LCA approach comprises a stage that separately couples each input feature set to the emotion labels using Centered Kernel Alignment. This step is followed by canonical correlation analysis to select the multimodal representations with the strongest relationships. LCA enables physiological explanation by adding a backward transformation to estimate the matching contribution of each extracted brain neural feature set. Correlation estimates and partition quality serve as performance measures. The evaluation uses a Vector Quantized Variational AutoEncoder to create an acoustic envelope from the tested Affective Music-Listening database. Validation results demonstrate the ability of the developed LCA approach to generate low-level music based on neural activity elicited by emotions while maintaining the ability to distinguish between the acoustic outputs.


Introduction
Sound synthesis refers to the creation of original audio signals by combining procedures that use embedded representations to extract information properties from complex data of different natures. Generated acoustic data have broad applications ranging from artistic innovation to creating adaptive, copyright-free music for games and videos [1], among others. Acoustic representations for music generation are often derived directly from other audio data sources [2]. However, music perception may involve segregating more complex composition structures such as melody, harmony, rhythm, and timbre. Owing to enhanced perception capabilities [3], sound generation has shown considerable potential with Machine Learning (ML) models fed by raw time-domain data, for which architectures are designed to be tightly coupled to the audio representations [4]. However, learning musical styles from arbitrary corpora implies adapting ideas and patterns borrowed from other contexts to a concrete objective. Style learning poses several challenges to ML architectures; reported issues include capturing and generating music with short- and long-term structure, and performing low-level analysis such as onset/offset detection [5]. Several ML approaches have recently been developed, such as deep CCA, which infers the optimum feature mapping [34], and architectures using convolutional neural networks to compute the similarity between spaces [35], among others. Nonetheless, the performance of the feature alignment strategies described above is adversely affected if the training data are noisy and/or highly variable [36]. Moreover, the signal-to-noise ratio of EEG recordings is poor because the weak signals of interest are overlaid by intrinsic noise with a much larger amplitude than that generated by biological sources, causing intra-subject and inter-subject variability. As a result, feature extraction and feature alignment strategies require multiple repetitions across many runs and trials.
However, in stimulus-response paradigms, auditory datasets hold very few trials per individual since participants tend to tire easily or suffer listening fatigue. Consequently, improving feature alignment strategies to measure the similarity between eliciting audio stimuli and evoked EEG responses remains challenging [37]. This work proposes an approach to sonifying neural responses to affective music-listening data using the introduced Labeled Correlation Alignment (LCA), which identifies the EEG features that are maximally congruent with the simultaneously extracted auditory features. The proposed two-step LCA approach comprises a separate stage that matches both input feature sets with the emotion labels using Centered Kernel Alignment (CKA). Afterward, Canonical Correlation Analysis (CCA) selects the multimodal representations with the strongest relationships. LCA enables physiological explanation by adding a backward transformation to estimate the matching contribution of each extracted EEG feature set. CCA correlation estimates and partition quality are used as performance measures. To deal with inter/intra-subject variability, we evaluate three feature extraction strategies based on Functional Connectivity (FC): the widely used Phase Locking Value, Gaussian Functional Connectivity, and the combination of both FC measures. Discriminating and attending to a specific sound source in an auditory environment is complex due to the variability of both the stimuli and the subjects, which alters responses across test subjects and complicates the identification of activation patterns. In analyses of neuronal activation under auditory stimuli, studies address auditory attention [38] as well as the relationship between EEG and audio, for instance using CCA [39] to determine the correlation between the spaces.
Neural Networks (NNs) have also been applied to improve this correlation [40], although with limitations: they optimize a discriminative representation rather than the final CCA projection [41], or optimize CCA during pre-training but not while training the task [42]. Beyond improving the correlation between auditory attention and EEG and uncovering stimulus-response relationships in Brain-Computer Interfaces (BCIs), the LCA approach also finds patterns in BCI data for applications such as education and music [43].
Consequently, we identify the EEG features most congruent with the evoked auditory data for each label and present the results accordingly. To improve sonification discrimination abilities, we focus on the main aspects, specifically channels, time-windowed dynamics, and bandpass filtering. Additionally, concrete results of generated discriminative acoustic signals are examined.
The agenda is as follows: Section 2 describes the feature extraction methods, Labeled Correlation Alignment, and the variational autoencoders employed for sonification. Further, Section 3 explains the validated affective music listening database, including the preprocessing procedure and the tuning of key parameters for feature extraction. Then, Section 4 summarizes the results in terms of spatial relationship and the effect of time-windowed feature extraction on LCA performance. Lastly, Section 5 gives critical insights into the achieved performance and addresses some limitations and possibilities of the presented approach.

Extraction of (Audio)Stimulus-(EEG)Responses
A piecewise stationary analysis accounts for the non-stationary behavior inherent to the training data when characterizing the eliciting acoustic stimuli (Y ∈ ℝ) and brain neural responses (X ∈ ℝ). Thus, both feature sets (X ∈ 𝒳, Y ∈ 𝒴) are extracted from M_τ overlapping segments framed by a smooth-time weighting window lasting τ_m ≤ T, with m ∈ M_τ, where T ∈ ℝ is the recording length.
Specifically, a set of time-windowed neural response features, X → 𝒳, is extracted from the EEG electrode montage using two Functional Connectivity (FC) metrics, Phase Locking Value (PLV) and Gaussian FC (GFC), estimated on a trial-by-trial basis, respectively, as [44]:

PLV_{c,c'}(m) = |E{exp(j(φ_c^m(t) − φ_{c'}^m(t))) : ∀t}|, (1a)
GFC_{c,c'}(m) = exp(−‖x_c^m − x_{c'}^m‖₂² / (2σ_φ²)), (1b)

where x_c^m and x_{c'}^m are the real-valued EEG vectors captured at instant m ∈ M_τ from electrodes c, c' ∈ N_C, with c ≠ c'; φ_c^m(t) and φ_{c'}^m(t) are the corresponding instantaneous phases; N_C is the number of testing montage channels; and σ_φ ∈ ℝ⁺ is a length-scale hyperparameter. Notations ‖·‖₂ and E{· : ∀ν} stand for the ℓ2-norm and the expectation operator computed across a variable ν, respectively.
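To make the two connectivity measures concrete, the following sketch computes PLV from Hilbert-transform phases and GFC as a Gaussian kernel on the Euclidean distance between channel segments. The function names (`plv`, `gfc`) and the fixed `sigma` value are illustrative, not taken from the original implementation.

```python
import numpy as np
from scipy.signal import hilbert

def plv(x_c, x_cp):
    """Phase Locking Value between two same-length EEG segments (Eq. 1a)."""
    phi_c = np.angle(hilbert(x_c))    # instantaneous phase via the Hilbert transform
    phi_cp = np.angle(hilbert(x_cp))
    return np.abs(np.mean(np.exp(1j * (phi_c - phi_cp))))

def gfc(x_c, x_cp, sigma=1.0):
    """Gaussian Functional Connectivity: Gaussian kernel on the l2 distance (Eq. 1b)."""
    d2 = np.sum((x_c - x_cp) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# toy check: identical signals are maximally connected under both measures
t = np.linspace(0, 1, 256, endpoint=False)
s = np.sin(2 * np.pi * 10 * t)
print(round(plv(s, s), 3), round(gfc(s, s), 3))  # -> 1.0 1.0
```

Both measures lie in [0, 1]; PLV captures phase synchrony regardless of amplitude, while GFC reacts to the raw waveform distance scaled by σ.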
In parallel, a set of time-windowed acoustic features, Y → 𝒴, is extracted under the music assessment and music listening paradigms [45]: Zero-Crossing Rate, High/Low Energy Ratio, Spectral Entropy, Spectral Spread, Spectral Roll-off, Spectral Flatness, Roughness, RMS Energy, Broadband Spectral Flux, and Spectral Flux for ten octave-wide sub-bands. The extracted acoustic features are described in detail in [46,47]. Furthermore, the feature set is completed by the short-time auditory envelopes extracted as in [48].
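A few of the listed descriptors can be sketched with plain numpy. The functions below are simplified stand-ins for the features detailed in [46,47]; the 85% roll-off threshold and the pure-tone example are assumptions for illustration.

```python
import numpy as np

def zero_crossing_rate(y):
    """Fraction of consecutive sample pairs that change sign."""
    return np.mean(np.signbit(y[:-1]) != np.signbit(y[1:]))

def rms_energy(y):
    """Root-mean-square energy of the frame."""
    return np.sqrt(np.mean(y ** 2))

def spectral_rolloff(y, sr, pct=0.85):
    """Frequency below which `pct` of the cumulative spectral magnitude lies."""
    mag = np.abs(np.fft.rfft(y))
    cum = np.cumsum(mag)
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    return freqs[np.searchsorted(cum, pct * cum[-1])]

sr = 8000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)        # 1 s pure 440 Hz tone
print(zero_crossing_rate(y) * sr / 2)  # roughly 440 for a 440 Hz tone
print(spectral_rolloff(y, sr))         # roll-off sits near the tone frequency
```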

Two-Step Labeled Correlation Alignment between Audio and EEG Features
The proposed feature alignment procedure between eliciting audio stimuli and aroused EEG responses consists of two steps. Firstly, the similarity of each feature space to the label set is assessed using Centered Kernel Alignment, which allows selecting the extracted representations that match most closely. After selecting the labeled CKA representations, Canonical Correlation Analysis is performed to identify audio and EEG features that are maximally congruent in terms of estimated correlation coefficients.

Supervised CKA-Based Selection of Features
Sonification feature sets must be selected to create music following brain patterns but according to distinct emotional conditions. Hence, the alignment is performed separately between each feature set in Ξ = {X ∈ ℝ^{N_R×P}, Y ∈ ℝ^{N_R×Q}}, where P and Q are the numbers of EEG and audio features and N_R is the number of trials, and the provided labels, denoted Λ ∈ ℤ, employing the CKA algorithm, which includes an additional transformation to estimate the contribution of every input representation. Specifically, we use the supervised empirical estimate of CKA derived in [49], as follows:

ρ(K̄_Ξ, K̄_Λ) = ⟨K̄_Ξ, K̄_Λ⟩_F / (‖K̄_Ξ‖_F ‖K̄_Λ‖_F), (2)

where ‖·‖_F stands for the Frobenius norm and ⟨·,·⟩_F for the corresponding inner product; K̄ ∈ ℝ^{N_R×N_R} is the centered kernel matrix estimated as K̄ = ĨKĨ, with K ∈ ℝ^{N_R×N_R} the kernel matrix, Ĩ = I − 11^⊤/N_R the empirical centering matrix computed across the N_R trials, I ∈ ℝ^{N_R×N_R} the identity matrix, and 1 ∈ ℝ^{N_R} the all-ones vector; K_Ξ ∈ ℝ^{N_R×N_R} and K_Λ ∈ ℝ^{N_R×N_R} are the kernel matrices of each extracted feature set and of the labels, respectively.
The kernel matrix elements for ξ, ξ' ∈ Ξ and labels λ, λ' ∈ Λ are computed on a trial-by-trial basis, respectively, as follows:

κ_Ξ(ξ, ξ') = exp(−‖(ξ − ξ')W_ξ‖₂² / 2), (3a)
κ_Λ(λ, λ') = δ(λ − λ'), (3b)

where W_ξ is the matrix linearly transforming the input set ξ into the selected set ξ̃ in the form ξ̃ = ξW_ξ, with ξ̃ ∈ {X̃ ∈ ℝ^{N_R×P}, Ỹ ∈ ℝ^{N_R×Q}}, and W_ξ W_ξ^⊤ is the corresponding inverse covariance matrix of the multivariate Gaussian function in Equation (3a). A Gaussian function is used as the first kernel κ_Ξ(·,·) ∈ ℝ⁺ in Equation (3a) to assess the pairwise similarity between aligned features, owing to its universal approximation properties and tractability [50]. The second kernel includes the delta operator δ(·,·) in Equation (3b), suitable for dealing with categorical label values.
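A minimal numpy sketch of this empirical CKA estimate follows: a centered Frobenius inner product between a Gaussian feature kernel and a delta label kernel. The toy data, the `sigma` bandwidth, and all function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def center(K):
    """Apply the empirical centering matrix I - 11^T/n on both sides."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cka(K_x, K_l):
    """Empirical Centered Kernel Alignment between two kernel matrices."""
    Kc, Lc = center(K_x), center(K_l)
    return np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc))

def gaussian_kernel(X, sigma=1.0):
    """Pairwise Gaussian kernel on the rows of X."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def label_kernel(labels):
    """Delta kernel: 1 when two trials share a label, else 0."""
    return (labels[:, None] == labels[None, :]).astype(float)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 5)) + 3.0 * labels[:, None]  # class-separated toy features
K = gaussian_kernel(X, sigma=3.0)
print(cka(K, label_kernel(labels)))  # high for well-separated classes
```

Alignment of a kernel with itself is exactly 1, which gives a quick sanity check on the centering and normalization.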

CCA-Based Analysis of Multimodal Features
This unsupervised statistical technique aims to assess the pairwise linear relationship between the multivariate projected feature sets Ξ̃ = {X̃, Ỹ} obtained by supervised CKA-based selection and described in different coordinate systems (EEG and audio). To this end, both representation sets are mapped into a common latent subspace to become maximally congruent. Namely, the correlation between the EEG and auditory features is maximized across all N_R trials within a quadratic framework constrained to a single-dimensionality latent subspace, as below [51]:

max_{u,v} u^⊤ Σ_X̃Ỹ v, subject to u^⊤ Σ_X̃X̃ u = 1 and v^⊤ Σ_ỸỸ v = 1, (4)

where Σ_X̃X̃ ∈ ℝ^{P×P}, Σ_ỸỸ ∈ ℝ^{Q×Q}, and Σ_X̃Ỹ = X̃^⊤Ỹ ∈ ℝ^{P×Q}.
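The single-pair CCA objective above can be solved through the whitened cross-covariance matrix, whose largest singular value is the first canonical correlation. The sketch below adds a small ridge term `reg` (an assumption for numerical stability) and recovers the correlation of two toy sets sharing one latent source.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-6):
    """First canonical correlation between two multivariate sets X, Y."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # whiten the cross-covariance: M = Lx^{-1} Sxy Ly^{-T}
    M = np.linalg.inv(np.linalg.cholesky(Sxx)) @ Sxy \
        @ np.linalg.inv(np.linalg.cholesky(Syy)).T
    return np.linalg.svd(M, compute_uv=False)[0]  # largest singular value

rng = np.random.default_rng(1)
z = rng.normal(size=(500, 1))                  # shared latent source
X = np.hstack([z, rng.normal(size=(500, 2))])  # X carries the latent exactly
Y = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
print(cca_first_pair(X, Y))  # near 1: the shared latent dominates
```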

Sonification via Vector Quantized Variational AutoEncoders
The feed-forward encoder and decoder network converts an input time series ξ = [ξ_t : ∀t], with ξ ∈ Ξ, into a coded form over a discrete finite set (or tokens), z ∈ {z_s : ∀s ∈ S}, drawn from a codebook of size K. To this end, a latent representation h_s = θ_E(ξ) (with H ∈ {h_s}) is encoded and further element-wise quantized according to the vector-quantized codebook {e_k : ∀k}. The VQ-VAE model, denoted μ(ξ), is then trained using the following minimization framework [52]:

L(ξ) = ‖ξ − ξ̂‖₂² + ‖θ_SG(H) − e_z‖₂² + β‖H − θ_SG(e_z)‖₂², (5)

where the first term is the reconstruction loss that penalizes the distance between the input ξ and the decoded output ξ̂ = θ_D(·); the second term penalizes the distance between each encoding value of H and its nearest neighbor e_z in the codebook; and the third term prevents the encoding from fluctuating strongly, ruled by the weight β ∈ [0, 1]. In addition, θ_SG(·) stands for the stop-gradient operation, which passes zero gradients during backpropagation.
Generally speaking, a coding model trained on one auditory signal set ξ ∈ Ξ can be applied to the generation of acoustic data when the encoder is fed signals of a different nature, ξ' ∈ Ξ, provided their homogeneity is assumed. This model is referred to as μ(ξ|ξ'). In light of this, we suggest that the following conditions be met:
- The VQ-VAE coder includes a parametric spectrum estimation based on regressive generative models fitted on latent representations [53]. Therefore, both sets of signals (ξ, ξ') must have similar spectral content, at the very least in terms of their spectral bandwidth.
- In regression models, both discretized signal representations must be extracted using similar recording intervals and time windows to perform the numerical derivative routines. Furthermore, the VQ-VAE coder demands input representations of fixed dimensions; hence, the arrangements extracted from ξ and ξ' must be of similar dimensions.
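The codebook lookup and the three loss terms can be illustrated in plain numpy. Since numpy tracks no gradients, the stop-gradient operator is only indicated in comments, and the random (unlearned) codebook, sizes, and β value are illustrative assumptions.

```python
import numpy as np

def quantize(H, codebook):
    """Nearest-neighbour assignment of each latent vector to a codebook entry."""
    d = np.linalg.norm(H[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

def vqvae_loss(x, x_rec, H, e_z, beta=0.25):
    """Reconstruction + codebook + commitment terms of Eq. (5); the
    stop-gradient only matters under autodiff, so it is implicit here."""
    rec = np.mean((x - x_rec) ** 2)          # reconstruction loss
    codebook_term = np.mean((H - e_z) ** 2)  # pulls codes toward encodings (sg on H)
    commit = beta * np.mean((H - e_z) ** 2)  # keeps encodings near codes (sg on e_z)
    return rec + codebook_term + commit

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4))          # toy encoder latents
codebook = rng.normal(size=(16, 4))  # K = 16 codes, random rather than learned
e_z, idx = quantize(H, codebook)
print(e_z.shape, idx.shape)  # -> (8, 4) (8,)
```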

Experimental Setup
We propose a method for sonifying neural responses to labeled affective music listening using auditory and electroencephalographic features that are maximally congruent with the label set. The method is evaluated to create music within the stimulus-response paradigm using a scheme that encompasses the following stages (see Figure 1): (i) Preprocessing and extracting time-windowed representations: Estimating acoustic features from music data modulating emotions, and Functional Connectivity measures from evoked EEG neural responses. Three strategies for FC extraction are considered: Phase Locking Value, Gaussian Functional Connectivity, and their combination. Different time windows are evaluated for feature extraction from neural brain responses as the conditioning content is devoted to low-level music generation.
(ii) Labeled Correlation Alignment to identify the EEG features that are maximally congruent with the stimulating auditory data by each emotion. To preserve the interpretability of selected arrangements, this stage is performed in a two-step procedure: separate CKA matching between audio and EEG data with the labels, followed by CCA analysis of the selected feature sets.
The contribution of electrodes and bandpass-filtered, time-windowed dynamics to Labeled Correlation Alignment is examined. The subject's influence on overall performance is also considered.
(iii) Labeled audio conditioning content is generated using selected brain neural responses to feed a Vector Quantized Variational AutoEncoder.

We assess the relationship between the captured neural responses and the auditory data in terms of their correlation estimated by CCA as a performance measure; namely, the higher the r-squared coefficient, the more related the brain responses are to the auditory stimuli.
The leave-one-out cross-validation strategy is applied (more precisely, leave-one-subject-out) to compute the confidence of CCA correlation estimates, as carried out in [54]. The discrimination ability of the labeled correlation alignment is also evaluated through the clustering coefficient γ ∈ ℝ⁺, that is, the partition quality of the CCA correlation values, computed from the following quantities: ξ₀, the mean distance between a sample and all other points in the same group; ξ₁, the mean distance between a sample and all other points in the nearest group; ξ_n, the number of samples within the data set; and ξ̄, the center of a group, from which the squared distance of each sample to the center of each group is calculated [55]. This clustering measure trades off inter-class (first term) against intra-class (second term) variability. Consequently, the larger the value of γ, the more distinct the labeled partitions of the extracted features.
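As a rough stand-in for this partition-quality measure, the sketch below computes a silhouette-style score that likewise trades off nearest-other-group distance (inter-class) against own-group distance (intra-class). It is an approximation under that assumption, not the paper's exact γ.

```python
import numpy as np

def partition_quality(X, labels):
    """Silhouette-style score: mean over samples of
    (nearest-other-group distance - own-group distance) / max of the two."""
    scores = []
    for i in range(len(X)):
        same = np.array([j for j in range(len(X))
                         if labels[j] == labels[i] and j != i])
        a = np.mean(np.linalg.norm(X[same] - X[i], axis=1))       # intra-class
        b = min(np.mean(np.linalg.norm(X[labels == l] - X[i], axis=1))
                for l in set(labels) - {labels[i]})               # inter-class
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
A = rng.normal(0, 0.1, size=(20, 2))   # tight group around the origin
B = rng.normal(5, 0.1, size=(20, 2))   # tight group far away
X = np.vstack([A, B])
labels = np.repeat([0, 1], 20)
print(round(partition_quality(X, labels), 2))  # near 1 for well-separated groups
```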


Affective Music Listening Database
The data (publicly available at https://openneuro.org/datasets/ds002721/versions/1.0.2, accessed on 1 April 2023) were collected from a total of N_S = 31 individuals. The test paradigm consisted of six runs capturing brain neural responses, divided into two parts: baseline resting recordings, measured while the participants sat still and looked at the screen for 300 s (first and last runs); and four intervening runs (that is, N_R = 40 trials per subject), each with ten individual trials. During a single trial, a fixation cross was presented for 15 s. A randomly selected musical clip was played for T = 12 s after the fixation cross appeared. The participants were given a short break after listening to the musical stimuli, followed by eight questions in random order to rate the music on a scale (1-9) of induced pleasantness, energy, tension, anger, fear, happiness, sadness, and tenderness. Each participant had 2-4 s between answering the last question and the subsequent fixation cross in the inter-trial intervals. For each subject, the signal set was recorded from 19 channels according to the 10-20 electrode placement (Fp1, Fp2, F7, F3, Fz, F4, F8, T3, C3, Cz, C4, T4, T5, P3, Pz, P4, T6, O1, and O2); each recording lasted 15 s and was sampled at a rate of 1000 Hz, as shown in Figure 2. The music stimuli, designed to examine how music modulates emotions, contained 110 excerpts from scores covering a wide range of emotional responses, as detailed in [56]. It is worth noting that the auditory data are labeled according to the two-dimensional arousal-valence plane, since affective states may be characterized as a consciously accessible condition combining arousal (activated-deactivated) and valence (pleasure-displeasure), resulting in the following four labeled partitions (N_L = 4) [57]: High Arousal Positive Valence, High Arousal Negative Valence, Low Arousal Positive Valence, and Low Arousal Negative Valence.

Time-Windowed Representations of Brain Neural Responses
Preprocessing EEG data consists of the following procedures: (i) High-pass filtering of the raw EEG channel set was performed with a relatively high cutoff frequency to remove linear trends in all N_C electrodes; to this end, a zero-phase 3rd-order Butterworth bandpass filter was employed. Further, the FC feature sets were extracted within a bandwidth f ∈ N_B, with N_B = ⌊F_s/2⌋, where F_s ∈ ℝ⁺ is the EEG sampling frequency. The bandwidths were selected to cover the physiological rhythms that are influential in music appraisal within EEG paradigms, as reported in previous studies [58], namely: θ ∈ [4-8] Hz, α ∈ [8-12] Hz, and β ∈ [12-30] Hz. (ii) Artifact removal was performed for the occipital electrodes (associated with visual processing), which may be highly active because of the visual perception of sound stimuli after target presentation [59]. Another factor contributing to poor occipital signals might be insufficient electrode contact [60]; in this regard, the impedance had outlier values (>100 kΩ) in three subjects. Therefore, both channels (O1, O2) were excluded from further analysis. (iii) Re-referencing to the common-average electrical activity measured across all scalp channels. (iv) Resampling of EEG responses, partitioned by trials, using the onset of each music stimulus as a fiducial mark, and further downsampling to a sampling rate of 80 Hz. (v) Lastly, the piecewise stationary analysis of EEG and auditory data was carried out over a set of time segments (with tested values of [12, 6, 3, 1.5, 0.75, and 0.375] s), windowed by a smooth-time weighting function (namely, a Hann window) with 50% overlap.
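Steps (i) and (v) above can be sketched with scipy: a zero-phase Butterworth bandpass (here the α band, 8-12 Hz, at the 80 Hz downsampled rate) followed by 50%-overlapping Hann-windowed segmentation. The window length and random trial are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, get_window

def bandpass(x, lo, hi, fs, order=3):
    """Zero-phase Butterworth bandpass (filtfilt runs forward and backward)."""
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def hann_segments(x, win_len, overlap=0.5):
    """Overlapping Hann-weighted segments for piecewise-stationary analysis."""
    hop = int(win_len * (1 - overlap))
    w = get_window("hann", win_len)
    return np.array([x[s:s + win_len] * w
                     for s in range(0, len(x) - win_len + 1, hop)])

fs = 80                                             # downsampled EEG rate
x = np.random.default_rng(0).normal(size=fs * 12)   # one 12 s toy trial
alpha = bandpass(x, 8, 12, fs)                      # alpha rhythm, 8-12 Hz
segs = hann_segments(alpha, win_len=fs * 3)         # 3 s windows, 50% overlap
print(segs.shape)  # -> (7, 240)
```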
Further, the FC features are extracted according to Equations (1a) and (1b), where the kernel bandwidth parameter of GFC is optimized to reduce the variability of the probability density function of the observed data, p(X|σ_φ), as detailed in [61]. As a result, we extract one real-valued FC matrix of size N_φ × N_φ, on a single-trial basis at each instant, for every evaluated FC measure and subject.
The FC matrix is vectorized to yield a vector of dimension N_FC = N_φ(N_φ − 1)/2. Accordingly, the feature vector derived from the N_S individuals across all N_R trials has dimension N_X^λ = N_FC × N_τ × N_T × N_S × N_L, extracted for each emotion label λ for purposes of validating the supervised feature alignment. Note that the extracted EEG feature arrangement doubles in size when both FC measures are concatenated.
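The vectorization step keeps only the strictly upper-triangular part of the symmetric FC matrix; for a 17-channel montage this gives N_FC = 17·16/2 = 136 features per window, as the sketch below (with a random symmetric matrix standing in for a real FC estimate) checks.

```python
import numpy as np

def vectorize_fc(C):
    """Upper-triangular vectorization of a symmetric connectivity matrix."""
    iu = np.triu_indices(C.shape[0], k=1)  # entries strictly above the diagonal
    return C[iu]

n = 17                      # channels after removing O1/O2
C = np.random.default_rng(0).normal(size=(n, n))
C = (C + C.T) / 2           # symmetrize the toy matrix
v = vectorize_fc(C)
print(v.size, n * (n - 1) // 2)  # -> 136 136
```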

Time-Windowed Representations of Eliciting Audio Stimuli
Regarding the auditory stimuli, all recordings were sampled at 44,100 Hz and then segmented into N_τ sliding windows with 50% overlap. Moreover, the sampled data are smoothed by squaring and convolving with a square window. To fulfill the condition in Equation (6), the stimuli data are further downsampled to 64 Hz with cubic-root compression. To match the dimension of the EEG training set, the acoustic set is also fixed to a similar size, that is, dim(Ỹ) ∼ dim(X̃). Therefore, within each τ, we extract the first PCA component from each of the 20 acoustic features described above [62]. The array is completed with N_φ − 1 samples of the acoustic envelope. Thus, we extract N_τ(20 + N_φ − 1) acoustic features within each T to be fed into the subsequent alignment procedure.
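The envelope pipeline (squaring, square-window smoothing, downsampling, cubic-root compression) can be sketched as below. The 6400 Hz toy input rate, the 50 ms smoothing window, and the naive integer decimation are simplifying assumptions chosen so the rates divide evenly.

```python
import numpy as np

def auditory_envelope(y, sr_in, sr_out=64, win_ms=50):
    """Envelope: square, smooth with a rectangular window, decimate,
    then cubic-root compress (approximating the pipeline in the text)."""
    win = np.ones(int(sr_in * win_ms / 1000))
    power = np.convolve(y ** 2, win / win.size, mode="same")  # smoothed power
    step = sr_in // sr_out                                    # naive decimation
    return np.cbrt(power[::step])

rng = np.random.default_rng(1)
y = rng.normal(size=6400)               # 1 s of toy audio at 6400 Hz
env = auditory_envelope(y, sr_in=6400)
print(env.size)  # -> 64, i.e. a 64 Hz envelope
```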

Results
Here, we present the results of selecting the EEG features most congruent with the evoked auditory data according to each label. We focus on the main aspects that improve the discrimination abilities of the sonification process. Specifically, we address the influence of channels, time-windowed dynamics, and bandpass filtering on neural responses. Concrete outcomes of generated discriminative acoustic signals are also analyzed.

Electrode Contribution to Labeled Correlation Alignment
First, we consider the spatial relevance of each electrode in the scalp EEG montage in terms of the relationship reached by LCA between the features extracted from neural responses and acoustic stimuli. Figure 3 shows the r-squared values assessed by CCA after applying CKA matching (middle column), displayed at each validated set of window intervals, N_τ. The correlation estimates are averaged across the label set for a generalized interpretation. As seen in the plotted heatmaps, the correlation range varies and spreads differently over the scalp electrodes depending on the evaluated feature extraction method. The top heatmap reveals that PLV obtains the lowest estimates, within [0.05-0.59], with very few electrodes having a detectable contribution. In contrast, GFC extends the correlation interval to [0.05-0.73] (middle plot), while combining both measures results in correlation values of [0.10-0.74] (bottom plot), suggesting that either strategy of improved FC extraction couples apparent brain regions to the acoustic stimuli. Afterward, we evaluate the influence of each channel by averaging its correlation performance across all tested window intervals, as displayed in the matrix row for the whole EEG montage (denoted E{17}). It is worth noting that several electrodes tend toward zero contribution regardless of the extraction method employed. A particular focus is placed on electrodes reported to be susceptible to artifacts during data acquisition in music listening paradigms, specifically those associated with brain neural activity in the frontal cortex [63]. Thus, the bottom row (denoted E{14}) presents the averaged r-squared values and shows that the correlation may increase when the Fp1, Fp2, and Pz electrodes are removed.
The next aspect of consideration is evaluating the discrimination ability of the selected features using the clustering coefficient γ. As displayed in the right column of Figure 3, the partition separability of features extracted by PLV (top plot) is modest due to the low assessed r-squared values. In the case of GFC, the partitions between extracted EEG features differ more pronouncedly, while the combination of GFC and PLV provides the most separable clustering performance across the tested values of the time window τ. This behavior holds for each electrode arrangement evaluated: N_C = 17 (blue line) or N_C = 14 (orange line). For comparison, we assess the discrimination ability of each feature selection procedure after conducting just a single CCA step, which achieves a significantly lower correlation (left column) than the values attained by incorporating the preceding supervised CKA step (middle column). A comparison of the heatmaps shows that a single CCA step results in lower values of γ (dashed lines) regardless of the extraction method used, indicating the increased association between neural responses and acoustic stimuli achieved through LCA.
Lastly, for purposes of physiological interpretability, Figure 4 displays the topoplots reconstructed from the FC feature sets according to the correlation with the evoking auditory data performed by LCA. As seen in the left column, PLV delivers weak r-squared values that are evenly distributed over the scalp. In contrast, GFC increases the contribution of both lobes (central column). This influence is further accentuated by combining GFC with PLV, giving rise to electrodes with strong relevance (right column) and thus increasing their weight in the subsequent sonification stages. Note that the correlation assessments focus more on the frontal and central lobes (painted yellow) when artifact-affected electrodes are removed.

Correlation Estimation for Time-Windowed Bandpass Feature Sets
Here, we investigate the effect of applying time-windowed feature extraction on LCA performance and, in particular, how distinct the EEG responses remain over time, since changing dynamics can play a significant role in music creation. To illustrate this aspect, the upper plot of Figure 5 unfolds the time-varying clustering coefficient at different windows for each extraction method of the previous section (see Figure 3). The pictured scatter plots indicate that the labeled EEG feature partitions become distinguishable when the window is fixed to τ ≤ 3 s, meaning that the captured affective neural responses become more separable regardless of the FC metric used. From this length downward, the narrower the overlapping time segment of feature extraction, the more apparent the neural dynamics become. Note that the labeled partitions of the extracted EEG dynamics differ more pronouncedly in GFC (middle row of the top plot) than in PLV (upper row). However, combining GFC and PLV provides the best group separation (lower row).

Time resolution encoded by the extracted EEG feature sets. Next, we analyze the time evolution of LCA to determine the dynamic resolution of neural responses encoded by the extracted feature sets over time, restricted to the best FC representation strategy (that is, the combination of PLV and GFC). The lower plot in Figure 5 presents the obtained r-squared values and reveals that the dynamics extracted at wide windows (τ ≥ 3 s) are weak, resulting in intervals with almost zero-valued correlation. Comparatively, features extracted at τ ≤ 3 s become stronger and fluctuate over time (left plot of the bottom row). Note that implementing the channel removal strategy (middle plot) improves this behavior. Further, the right plot shows the mean estimate of changes in the time-varying dynamic resolution, computed as the difference between neighboring correlation values, revealing that the separability of affective labels tends to decrease as τ shortens. This effect may, however, be reduced with proper channel selection, as mentioned previously.
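Time-windowed Phase Locking Value extraction can be sketched as follows; the two synthetic channels, sampling rate, and window settings are illustrative assumptions (the actual pipeline operates on multichannel EEG and combines PLV with GFC):

```python
import numpy as np
from scipy.signal import hilbert

def sliding_plv(x, y, fs, win_s, hop_s):
    """Phase Locking Value between two channels over sliding windows.
    win_s and hop_s are the window length and hop in seconds (toy values)."""
    phase = np.angle(hilbert(np.vstack([x, y])))     # instantaneous phases
    dphi = phase[0] - phase[1]                       # phase difference
    win, hop = int(win_s * fs), int(hop_s * fs)
    starts = range(0, len(x) - win + 1, hop)
    # PLV = magnitude of the mean unit phasor of the phase difference
    return np.array([np.abs(np.mean(np.exp(1j * dphi[s:s + win])))
                     for s in starts])

fs = 128
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 10 * t)          # 10 Hz "alpha-like" oscillation
y = np.sin(2 * np.pi * 10 * t + 0.5)    # same rhythm with a constant phase lag
plv = sliding_plv(x, y, fs, win_s=2.0, hop_s=1.0)
# a constant phase lag keeps the PLV near 1 in every window
```

Shrinking `win_s` trades statistical stability for temporal resolution, which is exactly the tension the τ sweep in Figure 5 explores.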
We also examine bandpass-filtered feature extraction following brain oscillations, a valuable musical property. Figure 6 presents the values of r-squared and γ calculated by combining PLV and GFC, extracted at different time windows for the three brain oscillations evaluated (i.e., θ, α, β). Filtering the lowest band (θ waveform, blue line) produces smoother changes in the obtained time-varying dynamic resolution than the baseline signal holding all waveforms (black line). In contrast, extraction of the higher-frequency rhythms (α, orange; β, green) speeds up the time-varying changes in the estimated correlation values (bottom row). However, rapid changes in r-squared imply that the discriminability between affective neural responses fluctuates over time (top row).
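Isolating the θ, α, and β rhythms before feature extraction can be sketched with a zero-phase Butterworth bandpass. The band edges below are the conventional EEG ranges rather than values taken from the paper, and the signal is synthetic:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# Nominal EEG rhythm bands in Hz (conventional ranges, assumed here)
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def bandpass(x, fs, lo, hi, order=4):
    """Zero-phase Butterworth bandpass: one plausible way to isolate a rhythm."""
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

fs = 128
t = np.arange(0, 4, 1 / fs)
eeg = (np.sin(2 * np.pi * 6 * t)       # theta-band component at 6 Hz
       + np.sin(2 * np.pi * 20 * t))   # beta-band component at 20 Hz
theta = bandpass(eeg, fs, *BANDS["theta"])
beta = bandpass(eeg, fs, *BANDS["beta"])
alpha = bandpass(eeg, fs, *BANDS["alpha"])
# each filtered trace retains mostly the power of its own rhythm;
# the alpha trace is nearly empty since no 8-13 Hz component was mixed in
```

Zero-phase filtering (`sosfiltfilt`) avoids introducing phase shifts that would bias subsequent PLV estimates.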
To check the uniformity of the group of test subjects, Figure 7 (top plot) presents the performance of the LCA implementation achieved individually across the channel set and at the considered time windows, using feature extraction based on the combination of PLV and GFC. In the case of the r-squared estimates (green line), there is an appreciable discrepancy in mean and variance values among subjects. Furthermore, a few individuals with a high standard deviation may indicate that their elicited neural responses are far from typical within the subject set. In light of the discrimination ability that motivates the LCA algorithm, we compute the classification of affective feature sets using a GraphCNN framework, similar to the approach presented in [64]. The blue line depicts the calculated classifier accuracy values (mean and standard deviation). For clarity, all subjects are ranked in decreasing order of their achieved mean value, showing a large gap between the best and lowest performers.
To illustrate this point, we compute the heatmap of electrode contribution from the r-squared assessments carried out for both subjects, along with the corresponding reconstructed neural activity topoplots. As can be seen in the bottom plot, the best-performing subject (labeled #1) reaches a robust relationship between auditory and EEG responses, with marked brain zones of activation. Moreover, the enhanced performance holds even within the broadest time window. On the contrary, the worst-performing subject (labeled #27) yields a very sparse correlation heatmap, suggesting a poor contribution from the central brain zone, which is assumed to be important in the Affective Music-Listening paradigm.
[a] Estimated values of r-squared, γ, and accuracy.
[b] Best-performing subject #1 and worst-performing subject #27.

Generation of Affective Acoustic Envelopes
In the last part of the evaluation, we investigate the ability to create music-conditioning content using the brain neural activity selected by LCA. Specifically, the VQ-VAE framework in Equation (5) is trained with affective music stimuli, Ỹ, and then applied to create auditory data by feeding the autoencoder with the most similar representation of the aroused brain neural responses, X, i.e., using the model µ_Λ(Ỹ|X). Due to the highly complex music structure encoded, additional settings are required. Only the acoustic envelope is provided to the encoder as auditory training feature data, without any weighting filter (that is, W_Ỹ = 1), omitting the remaining acoustic features and smoothing the envelope to reduce abrupt changes. When EEG data feed the encoder input, the feature sets have an additional dimension representing the spatial contribution of neural activity. We therefore map the EEG feature matrix into a vector representation by adding one convolutional layer to the VQ-VAE input to reduce its dimension.
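The discrete bottleneck that distinguishes VQ-VAE from a plain autoencoder is a nearest-codebook lookup. The toy sketch below shows only that quantization step, with an arbitrary codebook size and latent dimension; the full model in the paper wraps this between a convolutional encoder and an autoregressive decoder:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent row z[i] to its nearest codebook entry (VQ-VAE lookup)."""
    # squared Euclidean distances between every latent and every codebook entry
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(1)
    return codebook[idx], idx

rng = np.random.default_rng(2)
codebook = rng.normal(size=(8, 4))     # 8 discrete codes, 4-dim latents (arbitrary)
# build latents sitting close to codes 3, 3, and 5, plus small noise
z = codebook[[3, 3, 5]] + 0.01 * rng.normal(size=(3, 4))
zq, idx = vector_quantize(z, codebook)
# the quantizer recovers the indices the latents were built from
```

In training, the straight-through gradient and codebook/commitment losses make this lookup differentiable in practice; only the forward lookup is shown here.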
In the top row of Figure 8, the left plot illustrates an example of a multichannel EEG response, followed by the extracted FC arrangement (middle plot) applied to the Labeled Correlation Alignment, which estimates the correlation assessments fed to the encoder. An example of the generated acoustic envelope at the output is then presented (right plot), reconstructed using VQ-VAE. The right plot shows that the envelope resulting from the trained model µ_Λ(Ỹ|XαX) is sufficiently smooth (orange line). As a comparison, we show the acoustic output produced when encoding the raw EEG set directly (i.e., µ_Λ(Ỹ|X)), which exhibits increased variability and abrupt changes (blue line) that tend to degrade the overall quality of the created music. In the middle row, we show the clustering results obtained for the sets employed for training: input EEG envelopes (left plot), input FC features (center plot), and the acoustic envelopes generated under the model µ_Λ(Ỹ|XαX) (right plot), which show low discriminability between affectively labeled sets. On the other hand, the Labeled Correlation Alignment makes the compared input training sets distinctive.
[a] Time representation of the training sets; all values are normalized for interpretation.
[b] Clustering before LCA implementation. [c] Clustering performed after LCA implementation.


Discussion and Concluding Remarks
This work proposes an approach to sonifying neural responses to affective music-listening data. Based on a provided set of emotions, Labeled Correlation Alignment identifies the EEG features most compatible with the auditory data. To this end, LCA embraces two steps: supervised CKA-based feature selection followed by CCA-based analysis. The results validated on the tested real-world data set demonstrate the ability of the developed LCA approach to create low-level music content based on neural activity elicited by the considered emotions, while maintaining the ability to discriminate between the produced acoustic envelopes.
Still, after the evaluation stage, the following points are worth noting. Feature extraction. Gaussian Functional Connectivity, characterizing the elicited brain activity, enhances the relationship assessment compared to the widely used Phase Locking Value alone. However, the combination of both FC measures better associates the neural responses with the coupled acoustic stimuli. This result suggests that the correlation may benefit from including kernel-based FC to deal with inter-/intra-subject variability. Nevertheless, the validation shows that the electrodes most affected by artifacts must be adequately removed to improve the EEG feature extraction step. This aspect raises the need to consider other connectivity measures, such as Phase-Amplitude Coupling and entropy-based FC representations, which are also used in music appraisal paradigms.
Regarding auditory representations, the validation results demonstrate that short-time acoustic envelopes can complement the widely used methods of acoustic feature extraction. Moreover, to properly estimate the intrinsic latent stochastic models, only these envelopes, coding relationships between neighboring samples, are fed into the variational encoder network that generates the low-level music synthesis. Still, more elaborate representations, such as the Musical Instrument Digital Interface format, may be required when encoding music structures of higher complexity.
Labeled Correlation Alignment. We introduce the two-step procedure to associate multimodal features aligned with the label set, motivated by the fact that a single step of Canonical Correlation Analysis tends to yield weak associations between coupled representation spaces. Additionally, a single-step approach does not benefit from the label set information, resulting in poor discrimination between affective responses. Hence, before Canonical Correlation Analysis identifies highly congruent multimodal features, Centered Kernel Alignment is performed to select the most relevant representations based on the affective labels.
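For reference, a linear CKA score between a feature matrix and a one-hot label matrix can be computed directly, illustrating how label-aligned features can be separated from noise before the CCA step. This is a generic sketch on synthetic data (all dimensions and noise levels are assumptions), not the paper's exact kernel formulation:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two sample-by-feature matrices."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(3)
labels = np.repeat([0, 1], 50)
L = np.eye(2)[labels]                  # one-hot label matrix (the "label kernel")
# features driven by the class label vs. label-independent noise features
informative = labels[:, None] + 0.3 * rng.normal(size=(100, 4))
noise = rng.normal(size=(100, 4))
cka_inf = linear_cka(informative, L)
cka_noise = linear_cka(noise, L)
# alignment with the label matrix ranks informative features above noise
```

Ranking candidate feature sets by such a label-alignment score is one plausible reading of the supervised selection step that precedes CCA.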
Further physiological explanation of the LCA results is possible by adding a backward transformation within CKA to estimate the contribution of each extracted feature set. In particular, the proposed LCA between the eliciting audio stimuli and the aroused EEG responses enables interpretation of the following aspects: (a) Electrode contribution. The correlation estimates focus more on the frontal and central lobes, increasing their relevance in the sonification stage. (b) Short-time dynamics. For narrow windows (τ ≤ 3 s), LCA can deliver affective neural responses that remain separable. Furthermore, bandpass-filtered feature extraction based on brain oscillations may smooth or speed up the EEG dynamics, although the discriminability between affective neural responses can be reduced. (c) Influence of participants. A noticeable difference exists between the subject performing best and the one with the lowest accuracy in the assessed correlation.
From the information above, several aspects can be considered for enhancing the association between multimodal features, such as group-level analysis of joint contributions across individuals, and correlation methods that optimize the projections, for instance, using deepCCA [65].
Generation of low-level music content. Another finding is that the employed variational autoencoder can generate distinctive acoustic envelopes from the EEG representations selected by LCA. However, the encoder network uses a discrete latent representation paired with an autoregressive decoder specially designed for high-quality video, music, and speech. Hence, more effort is needed to adapt the predictive VQ-VAE model to discrete neural representations.
In the future, the authors intend to develop a framework based on variational encoder networks, in which brain neural data can directly affect the latent stochastic representations and the regression models involved, according to the estimated relationship between the coupled spaces. More databases, built according to paradigms other than stimulus-response, will also be validated to deal with information shortages.

Funding: This research was funded by the project Sistema prototipo de procesamiento de bioseñales en unidades de cuidado intensivo neonatal utilizando aprendizaje de máquina - Fase 1: Validación en ambiente simulado (HERMES 55063), funded by Universidad Nacional de Colombia, and by the project Brain Music: Prototipo de interfaz interactiva para generación de piezas musicales basado respuestas eléctricas cerebrales y técnicas de composición atonal (HERMES 49539), funded by Universidad Nacional de Colombia and Universidad de Caldas.

Conflicts of Interest:
The authors declare no conflict of interest.