Effective Phoneme Decoding With Hyperbolic Neural Networks for High-Performance Speech BCIs

Objective: Speech brain-computer interfaces (speech BCIs), which convert brain signals into spoken words or sentences, have demonstrated great potential for high-performance BCI communication. Phonemes are the basic pronunciation units. For monosyllabic languages such as Chinese Mandarin, where a word usually contains fewer than three phonemes, accurate decoding of phonemes plays a vital role. We found that, in the neural representation space, phonemes with similar pronunciations are often inseparable, leading to confusion in phoneme classification. Methods: We mapped the neural signals of phoneme pronunciation into a hyperbolic space for a more distinct phoneme representation. Critically, we proposed a hyperbolic hierarchical clustering approach to specifically learn a phoneme-level structure to guide the representation. Results: We found that this representation yielded greater distances between similar phonemes, effectively reducing confusion. In the phoneme decoding task, our approach demonstrated an average accuracy of 75.21% for 21 phonemes and outperformed existing methods across different experimental days. Conclusion: Our approach showed high accuracy in phoneme classification. By learning the phoneme-level neural structure, the representations of neural signals were more discriminative and interpretable. Significance: Our approach can potentially facilitate high-performance speech BCIs for Chinese and other monosyllabic languages.


I. INTRODUCTION
Speech brain-computer interfaces (speech BCIs) have seen rapid improvements in recent years [1], [2], [3], [4]. Advanced speech BCIs have enabled direct speech synthesis and decoding of spoken words or sentences from neural signals [5], [6], [7], demonstrating great potential for restoring the ability to communicate in patients with aphasia. Recent studies developed high-performance English speech BCIs through decoding phonemes [8], [9], [10], [11]. Moses et al. proposed a speech neuroprosthesis using articulatory features as an intermediate representation, achieving a speed of 15.2 words per minute and a word error rate of 25.6% [8]. Metzger et al. further improved this approach by fusing three speech modalities (facial-avatar animation, text, and audio), achieving a speed of 78 words per minute and a word error rate of 25% [9]. Meanwhile, Willett et al. found that multi-unit activities recorded from the ventral premotor cortex can directly reflect pronunciation movements, resulting in a high-performance speech neuroprosthesis with a speed of 62 words per minute and a word error rate of 23.8% on a 12,500-word vocabulary [10]. Card et al. further reduced the word error rate to 2.66% on the 12,500-word vocabulary by using four 64-electrode Utah arrays placed in the ventral premotor cortex together with an improved language model [11]. These breakthrough studies demonstrated the scientific and practical potential of speech BCIs based on phoneme decoding.
We aim to develop a speech BCI system for Chinese Mandarin. From the perspective of pronunciation, Mandarin is a typical monosyllabic language, where a word commonly contains fewer than three phonemes [3]. For such languages, accurate decoding of phonemes from neural signals plays a crucial role. Speech BCIs for polysyllabic languages (e.g., English) benefit from an error-correction process in phoneme decoding, leveraging the combination rules embodied in phoneme sequences [10]. However, error correction becomes extremely difficult for monosyllabic languages because morphemes are predominantly monosyllabic. Therefore, higher-performance phoneme decoding is essential for developing feasible Mandarin speech BCIs.
However, accurate decoding of spoken phonemes remains challenging. The articulations involved in speech encompass a complex combination of diverse orofacial movements, including those of the lips, tongue, and jaw. In particular, since these movements are highly dexterous, distinguishing phonemes with similar kinematic patterns is difficult. As existing studies have demonstrated, neural activities corresponding to English phonemes with similar kinematic characteristics are susceptible to confusion during classification [12], [13]. These findings motivated us to explore an embedding space capable of amplifying the differences in neural activity among similar Mandarin phonemes.
Hyperbolic space exhibits an exponentially increasing capacity as the radius expands [14], [15], which has the potential to provide a more distinctive feature space for the neural representation of phonemes. As illustrated in Figure 1, within this space, the neural embeddings of similar Mandarin phonemes can be distributed across regions with enhanced capacity, thereby improving their distinctiveness and separability. In this study, we explored the spatial structure of neural activities for Mandarin phonemes and aimed to construct an accurate phoneme classifier for high-performance Mandarin BCIs. To accomplish this, we collected neural signals from the motor cortex of a human participant while he performed Mandarin-speaking tasks. We mapped the neural signals of phoneme pronunciation into a hyperbolic space to seek a more distinct phoneme representation. Embedding learning in a hyperbolic space requires a prior hierarchical structure as guidance; however, the neural structure of phonemes cannot be directly defined. Thus, we proposed a hyperbolic hierarchical clustering approach to specifically learn a phoneme-level structure to guide the representation optimization process. The resulting representation exhibited greater distances between similar phonemes, which significantly reduced confusion in phoneme classification. Our approach demonstrated an accuracy of 75.21% for 21 phonemes and outperformed existing methods across different experimental days.

II. EXPERIMENT DESIGN
We collected neural signals from a Chinese Mandarin-speaking participant. The participant had two 96-channel Utah intracortical microelectrode arrays (Blackrock Microsystems, Salt Lake City, UT, USA) implanted in the hand knob area of the left primary motor cortex [16], a region known to modulate the participant's arm and hand movements [17], [18]. Recent studies have also shown that the hand knob area can modulate orofacial movements during speech production [1], [4]. All clinical and experimental procedures in this study were approved by the Medical Ethics Committee of The Second Affiliated Hospital of Zhejiang University (ethical review number 2019-158, approved on 05/22/2019) and registered in the Chinese Clinical Trial Registry (Ref. ChiCTR2100050705).
Neural signals were acquired at 30 kHz using a 256-channel NeuroPort system (NSP, Blackrock Microsystems) connected to the two 96-channel Utah intracortical microelectrode arrays. The NeuroPort system also recorded audio at 30 kHz through an analog input port, using a microphone placed in front of the participant. We manually sorted the raw spike signals into single-unit activity (SUA) with the Offline Sorter software. The audio signals were manually annotated using the Praat software to mark the start and end timestamps of the participant's pronunciation. As shown in Figure 2A, for each trial, a data segment was extracted from 0.5 seconds before to 1.5 seconds after the Acoustic On (AO) stage. The spike signals were binned into spike counts using a 100 ms window, with a sliding-window overlap ratio of 0.75. After preprocessing, the extracted data segment was represented as a matrix $X \in \mathbb{R}^{N \times T}$, where $N$ is the number of neurons and $T$ is the number of time bins. We flattened this matrix into a vector $x^E$ in Euclidean space, as sketched below.
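For concreteness, the preprocessing pipeline can be sketched in a few lines of Python. This is an illustrative sketch, not our released code: the function name `extract_trial_features` and the assumption that each sorted unit is given as an array of spike timestamps are ours.

```python
import numpy as np

def extract_trial_features(spike_times, t_ao, bin_s=0.1, overlap=0.75,
                           t_pre=0.5, t_post=1.5):
    """Bin sorted spikes into overlapping counts around Acoustic On (AO).

    spike_times : list of N arrays of spike timestamps (s), one per sorted unit.
    t_ao        : Acoustic On timestamp (s) for this trial.
    Returns the flattened Euclidean vector x_E of length N * T.
    """
    step = bin_s * (1.0 - overlap)        # 100 ms window, 75% overlap -> 25 ms step
    starts = np.arange(t_ao - t_pre, t_ao + t_post - bin_s + 1e-9, step)
    X = np.stack([                        # spike-count matrix X in R^{N x T}
        np.array([((u >= s) & (u < s + bin_s)).sum() for s in starts])
        for u in spike_times
    ])
    return X.reshape(-1)                  # flatten to x_E
```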
The paradigm comprised four speaking tasks: (1) 21 consonants, (2) 24 vowels, (3) 20 syllables, and (4) 4 tones. During each task, the participant was asked to pronounce one Mandarin phoneme or syllable per trial. A trial contained several phases. First, a red phoneme was displayed for one second along with a vocal cue. Then, the displayed phoneme turned green, indicating the start of the "Go" stage (3 seconds), during which the participant spoke the prompted phoneme, after which the trial ended. We specifically defined the temporal range of speaking as the "AO" stage. As shown in Figure 2B, all cues were displayed on a computer screen positioned one meter away from the participant.

III. HYPERBOLIC NETWORK FOR MANDARIN PHONEME DECODING

This section presents the hyperbolic network model for speech decoding (HYSpeech), which is proposed to decode spoken phonemes from neural signals. The framework of our approach is illustrated in Figure 2C.

A. Mapping Neural Signals Into the Hyperbolic Space
We first transform the neural signals into the hyperbolic space. The hyperbolic space has constant negative curvature and can be represented by several isometric models, such as the Poincaré disk model [19] and the Lorentz model [20]. Among these, the Poincaré disk model $(\mathbb{D}_c^d, g^c)$ is one of the most effective for modeling tree-structured data [21], [22], [23], [24]. In the Poincaré model, the hyperbolic space is described by a Poincaré disk, denoted as $\mathbb{D}_c^d = \{x \in \mathbb{R}^d : c\|x\|^2 < 1\}$. The disk has a Riemannian metric $g^c = \lambda_x^2\, g^E$, where $\lambda_x = \frac{2}{1 - c\|x\|^2}$ is the conformal factor, $c$ is the curvature, and $g^E$ is the Euclidean metric.
On the Poincaré disk, the distance between two points $x, y \in \mathbb{D}_c^d$ is defined as

$$d_c(x, y) = \frac{1}{\sqrt{c}}\,\operatorname{arcosh}\!\left(1 + \frac{2c\,\|x - y\|^2}{(1 - c\|x\|^2)(1 - c\|y\|^2)}\right),$$

where $d_c(\cdot)$ denotes the hyperbolic distance between points. The distance increases exponentially as the points approach the boundary ($\|x\|$ and $\|y\|$ converge to 1 and the denominator converges to 0), which is well suited to the increasing depth of a tree. This model represents tree-like structures more naturally and effectively, which is especially useful for capturing hierarchical relationships among phonemes. The typical approach for defining operations in hyperbolic space involves transferring the data and operations to the tangent space. In this study, we employ the gyrovector-space formalism, which defines the mutual transformation between the tangent space and the hyperbolic space via Möbius transformations [25], [26], [27]. Let $x^E$ be a flattened neural signal in Euclidean space that lies in the tangent space $T_p\mathbb{D}_c^d$ at the origin $p = 0$. By definition, $x^E$ can be projected to hyperbolic space by the exponential map $\exp_0^c$:

$$x^H = \exp_0^c(x^E) = \tanh\!\big(\sqrt{c}\,\|x^E\|\big)\,\frac{x^E}{\sqrt{c}\,\|x^E\|},$$

where $x^H$ denotes the hyperbolic vector used as the input to the hyperbolic network. Conversely, $x^H$ can be projected back to Euclidean space by the logarithmic map $\log_0^c$:

$$x^E = \log_0^c(x^H) = \tanh^{-1}\!\big(\sqrt{c}\,\|x^H\|\big)\,\frac{x^H}{\sqrt{c}\,\|x^H\|}.$$
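As a concrete reference, the projections above can be written directly in PyTorch. This is a minimal sketch under the stated formulas; the numerical clamps (`eps`, the `1 - 1e-5` bound) are our stability choices, not part of the original method.

```python
import torch

def expmap0(x_e, c=1.0, eps=1e-7):
    """exp_0^c: map a tangent (Euclidean) vector onto the Poincare disk."""
    sqrt_c = c ** 0.5
    norm = x_e.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * x_e / (sqrt_c * norm)

def logmap0(x_h, c=1.0, eps=1e-7):
    """log_0^c: map a point on the disk back to the tangent space at the origin."""
    sqrt_c = c ** 0.5
    norm = x_h.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.atanh((sqrt_c * norm).clamp(max=1 - 1e-5)) * x_h / (sqrt_c * norm)

def poincare_dist(x, y, c=1.0, eps=1e-7):
    """d_c(x, y); grows without bound as x, y approach the disk boundary."""
    num = 2 * c * ((x - y) ** 2).sum(-1)
    den = ((1 - c * (x ** 2).sum(-1)) * (1 - c * (y ** 2).sum(-1))).clamp_min(eps)
    return torch.acosh(1 + num / den) / c ** 0.5
```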

B. Learning the Phoneme-Level Neural Structure by Hyperbolic Hierarchical Clustering
To investigate the hierarchical structure of neural signals, we use a data-driven approach and employ a hierarchical clustering method to construct a binary tree based on pairwise similarities of neural representations. To guide the learning of the hierarchical tree, we use the cost function proposed by Dasgupta [28]:

$$C(T; w) = \sum_{i,j} w_{ij}\,\big|\mathrm{lvs}\big(T[\mathrm{lca}(i, j)]\big)\big|,$$

where $\mathrm{lca}(\cdot)$ denotes the lowest common ancestor of two leaf nodes and $\mathrm{lvs}(\cdot)$ denotes the leaves of the rooted binary subtree $T[\mathrm{lca}(i, j)]$. This cost encourages data points $i, j$ with high similarity $w_{ij}$ to be merged first; thus, a well-constructed tree should have a small cost. To enable continuous optimization of this discrete loss, researchers reformulated it over triplets of data points $i, j, k$:

$$C(T; w) = \sum_{i,j,k} \big[w_{ij} + w_{ik} + w_{jk} - w_{ijk}(T)\big] + 2\sum_{i,j} w_{ij},$$

where

$$w_{ijk}(T) = w_{ij}\,\mathbb{1}\{i,j\,|\,k\} + w_{ik}\,\mathbb{1}\{i,k\,|\,j\} + w_{jk}\,\mathbb{1}\{j,k\,|\,i\}.$$
This formulation simplifies the calculation of $\mathrm{lca}(\cdot)$ and removes discrete operations. The relation $\mathbb{1}\{i,k\,|\,j\}$ holds if $\mathrm{lca}(i,k)$ is a descendant of $\mathrm{lca}(i,j)$. Taking advantage of the continuous notion of $\mathrm{lca}(\cdot)$, recent studies proposed to optimize this cost directly with a gradient-descent approach, including gHHC [29] and HypHC [30]. Let $X_p = \{p_1, \ldots, p_b\}$ denote a batch of logit vectors, and let $(p_i, p_j, p_k)$ denote a triplet sampled from this batch. Our proposed clustering loss function is defined as

$$\mathcal{L}_{clu}(X_p; w, \tau) = \sum_{(i,j,k)} \big[w_{ij} + w_{ik} + w_{jk} - w_{ijk}(X_p; \tau)\big],$$

where

$$w_{ijk}(X_p; \tau) = (w_{ij}, w_{ik}, w_{jk}) \cdot \sigma_\tau\big(d_0(p_i \vee p_j),\, d_0(p_i \vee p_k),\, d_0(p_j \vee p_k)\big),$$

$\sigma_\tau(d)_i = e^{d_i/\tau} / \sum_j e^{d_j/\tau}$ is the scaled softmax function, $p_i \vee p_j$ denotes the continuous lowest common ancestor of $p_i$ and $p_j$, $d_{ij}$ is the hyperbolic distance between two logit vectors $p_i$ and $p_j$, $d_0$ is the hyperbolic distance to the hyperbolic origin, and $\tau$ is the temperature.
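The clustering loss can be sketched as follows. HypHC [30] derives a closed form for the depth $d_0(p_i \vee p_j)$ of the continuous lca (the point of the geodesic through $p_i$ and $p_j$ closest to the origin); in this sketch we approximate that depth by discretizing the geodesic, so `d0_lca` is a numeric stand-in rather than the exact formula, and the helper names are ours.

```python
import torch

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare disk (defined formally in Section III-C)."""
    xy = (x * y).sum(-1, keepdim=True)
    x2, y2 = (x * x).sum(-1, keepdim=True), (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2).clamp_min(1e-7)

def dist0(x, c=1.0):
    """Hyperbolic distance from x to the origin."""
    return (2 / c ** 0.5) * torch.atanh((c ** 0.5 * x.norm(dim=-1)).clamp(max=1 - 1e-5))

def d0_lca(x, y, c=1.0, steps=100):
    """Depth of the continuous lca of x and y: the minimum distance from the
    origin to the geodesic segment between them (numeric approximation)."""
    t = torch.linspace(0, 1, steps).unsqueeze(-1)
    gamma = mobius_add(x, t * mobius_add(-x, y, c), c)  # samples along the geodesic
    return dist0(gamma, c).min()

def clustering_loss(p, w, triples, c=1.0, tau=0.1):
    """Continuous Dasgupta cost over sampled triplets of points p on the disk."""
    loss = p.new_zeros(())
    for i, j, k in triples:
        depth = torch.stack([d0_lca(p[i], p[j], c),
                             d0_lca(p[i], p[k], c),
                             d0_lca(p[j], p[k], c)])
        sims = torch.stack([w[i, j], w[i, k], w[j, k]])
        # the deepest lca belongs to the pair that is merged first
        loss = loss + sims.sum() - (sims * torch.softmax(depth / tau, dim=0)).sum()
    return loss
```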

C. Phoneme Classification With the Guidance of the Hierarchical Neural Structure
Existing studies have proven the effectiveness of hyperbolic neural networks in extracting features from hyperbolic vectors across different tasks [14], [15], [20], [31]. We adopt hyperbolic feed-forward layers [32] for feature extraction, defined as

$$x^L = \exp_0^c\big(f(\log_0^c(x^H))\big),$$

where $f: \mathbb{R}^d \to \mathbb{R}^m$ is a Euclidean function, $x^L$ is the latent vector, and $m$ is the dimension of $x^L$.
Following the gyrovector-space formalism, vector addition and scalar multiplication are defined as

$$x \oplus_c y = \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2},$$

$$r \otimes_c x = \frac{1}{\sqrt{c}}\tanh\!\big(r \tanh^{-1}(\sqrt{c}\,\|x\|)\big)\,\frac{x}{\|x\|},$$

where $\oplus_c$ denotes Möbius addition and $\otimes_c$ denotes Möbius scalar multiplication. Given these operations, a hyperbolic neural network can be naturally defined.
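In code, Möbius scalar multiplication and the hyperbolic feed-forward layer can be sketched as below, reusing `expmap0`/`logmap0` from the earlier sketch; the class name `HypLinear` is ours.

```python
import torch
import torch.nn as nn

def mobius_scalar_mul(r, x, c=1.0, eps=1e-7):
    """Mobius scalar multiplication r (x)_c x on the Poincare disk."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    scale = torch.tanh(r * torch.atanh((c ** 0.5 * norm).clamp(max=1 - 1e-5)))
    return scale * x / (c ** 0.5 * norm)

class HypLinear(nn.Module):
    """Hyperbolic feed-forward layer: x_L = exp_0^c(f(log_0^c(x_H)))."""
    def __init__(self, d_in, d_out, c=1.0):
        super().__init__()
        self.f = nn.Linear(d_in, d_out)   # Euclidean map f applied in tangent space
        self.c = c

    def forward(self, x_h):
        return expmap0(self.f(logmap0(x_h, self.c)), self.c)
```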
A hyperbolic multiclass logistic regression [32] is adopted to generate the classification logits. Given $K$ classes, the logit of $x^L$ for class $k$ is defined as

$$v_k(x^L) = \frac{\lambda_{p_k}^c \|a_k\|}{\sqrt{c}}\, \sinh^{-1}\!\left(\frac{2\sqrt{c}\,\big\langle (-p_k) \oplus_c x^L,\, a_k\big\rangle}{\big(1 - c\,\|(-p_k) \oplus_c x^L\|^2\big)\,\|a_k\|}\right),$$

where $p_k \in \mathbb{D}_c^d$ and $a_k \in T_{p_k}\mathbb{D}_c^d$ are trainable parameters, and $\lambda_{p_k}^c$ denotes the conformal factor at $p_k$. These trainable parameters form a set of hyperbolic hyperplanes $\tilde{H}_{a,p}^c = \{x \in \mathbb{D}_c^d : \langle (-p) \oplus_c x,\, a\rangle = 0\}$. The classification loss is the cross-entropy

$$\mathcal{L}_{cls} = -\sum_i y_i \log \hat{p}_i,$$

where $y_i$ is the one-hot label and $\hat{p}_i$ is the softmax probability.
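A sketch of this hyperbolic MLR, implementing the closed-form logit of Ganea et al. [32]; the per-class loop and the parameter initialization are our choices, and `mobius_add` comes from the earlier sketch.

```python
class HypMLR(nn.Module):
    """Hyperbolic multiclass logistic regression on the Poincare disk."""
    def __init__(self, d, n_classes, c=1.0):
        super().__init__()
        self.a = nn.Parameter(0.01 * torch.randn(n_classes, d))  # tangent normals a_k
        self.p = nn.Parameter(torch.zeros(n_classes, d))         # offsets p_k on the disk
        self.c = c

    def forward(self, x):                        # x: (batch, d) points on the disk
        c, out = self.c, []
        for a_k, p_k in zip(self.a, self.p):
            z = mobius_add(-p_k, x, c)           # (-p_k) (+)_c x
            lam = 2.0 / (1.0 - c * (p_k * p_k).sum()).clamp_min(1e-7)  # conformal factor
            num = 2 * c ** 0.5 * (z * a_k).sum(-1)
            den = (1 - c * (z * z).sum(-1)).clamp_min(1e-7) * a_k.norm()
            out.append(lam * a_k.norm() / c ** 0.5 * torch.asinh(num / den))
        return torch.stack(out, dim=-1)          # logit vector per trial
```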
Combined with the phoneme-level structure loss, the overall loss function is defined as

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\, \mathcal{L}_{clu},$$

where $\lambda$ is a hyper-parameter controlling the trade-off between the clustering loss and the classification loss. The hyperbolic distance is employed as the similarity metric, with similarities computed between logit vectors. The clustering loss encourages the learned neural representations to form tree structures, while the classification loss encourages them to be discriminative. In summary, we use 1-2 hyperbolic feed-forward layers in our architecture, with the curvature set to −1. The input to the model is a flattened hyperbolic vector, and the latent dimension is 256. A hyperbolic multiclass logistic regression then predicts a logit vector. We take the logit vector as a node of the hierarchical clustering tree and feed it into the hyperbolic hierarchical clustering process to learn the phoneme-level neural structure.
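Assembling the pieces from the earlier sketches, a minimal model matching this description (one hyperbolic feed-forward layer, latent dimension 256, curvature −1, i.e. c = 1) might look like the following; `HYSpeechNet` is our illustrative name, not the authors' released code.

```python
class HYSpeechNet(nn.Module):
    """Sketch of the HYSpeech architecture described in this section."""
    def __init__(self, d_in, n_classes, d_latent=256, c=1.0):
        super().__init__()
        self.ff = HypLinear(d_in, d_latent, c)
        self.mlr = HypMLR(d_latent, n_classes, c)
        self.c = c

    def forward(self, x_e):                 # x_e: flattened Euclidean trial vector
        x_h = expmap0(x_e, self.c)          # project onto the Poincare disk
        return self.mlr(self.ff(x_h))       # logit vector, also used as a tree node
```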
For hyperbolic hierarchical clustering, we first sample triplets from the input data for the clustering-loss calculation; in practice, we sample 10 triplets. For each triplet, we compute the pairwise similarities using the hyperbolic distance. During training, the parameter λ balances the classification and clustering tasks; it is initialized to 0.1 and set to 0.5 after 50 epochs. The Riemannian stochastic gradient descent (RSGD) method [33] is adopted to optimize the network parameters.
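The joint training loop can then be sketched as follows. We assume the geoopt library for RSGD; the similarity w = exp(−d) between disk-projected logit vectors and the helper `sample_triples` are our assumptions, chosen to match the description that similarities are computed from hyperbolic distances between logit vectors.

```python
import random
import torch
import torch.nn.functional as F
import geoopt  # Riemannian optimization toolbox providing RSGD [33]

def sample_triples(n, k=10):
    """Sample k random triplets of trial indices for the clustering loss."""
    return [random.sample(range(n), 3) for _ in range(k)]

def train(model, loader, epochs=100, tau=0.1):
    opt = geoopt.optim.RiemannianSGD(model.parameters(), lr=1e-3)
    for epoch in range(epochs):
        lam = 0.1 if epoch < 50 else 0.5           # lambda schedule from the text
        for x_e, y in loader:
            logits = model(x_e)                    # (batch, n_classes)
            nodes = expmap0(logits)                # place logits on the disk (our choice)
            d = poincare_dist(nodes.unsqueeze(1), nodes.unsqueeze(0))
            w = torch.exp(-d)                      # similarity from hyperbolic distance
            loss = F.cross_entropy(logits, y) + lam * clustering_loss(
                nodes, w, sample_triples(len(x_e)), tau=tau)
            opt.zero_grad()
            loss.backward()
            opt.step()
```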

IV. RESULTS
In this section, we first analyze the neural signals and representations obtained from the speaking tasks. Next, we visualize the learned phonemic structure. We then compare our method with existing methods. Finally, we evaluate its performance under different settings.

A. Neural Representations of Phoneme Speaking
By classifying the phonemes into distinct groups according to their articulator movement properties (Figure S1) [34], we observe that the neural activities of intra-group phonemes are significantly more similar than those of inter-group phonemes (p < 0.01, shuffle test, Figure 3), suggesting that phonemes with similar articulations exhibit similar neural signals.
The neural representations visualized in a 2-dimensional space (Figure 4) also confirm that phonemes with similar articulations exhibit similar neural activities (Figure S2) [4], [5]. Specifically, the consonants /sh/ and /r/, which have similar articulations, lie close together and are difficult to discriminate in the raw signal space (Figure 4A).
Through hyperbolic learning, the neural representations become more discriminative, and phonemes with similar articulations are better separated than in the raw signals. For the example of /sh/ and /r/, the learned neural representations are located farther apart (Figure 4D). Similar findings are observed for vowels and syllables. The similar vowels /o/ and /ong/ (their similarity is illustrated in Figure S1B) are closely placed in the raw signal space (Figure 4B) but well separated in the learned representation space (Figure 4E). The syllable tasks further confirm this observation (Figures 4C and 4F). These results demonstrate that, by considering the hierarchical structure of phoneme articulation, our proposed HYSpeech approach enlarges the space between similar phonemes, allowing them to be effectively separated.

B. Analysis of the Learned Neural Phonemic Structures
Here we visualize the hyperbolic hierarchical clustering trees of the 21 consonants learned by our model (Figure 5); the learned trees are binary. Each circle in a tree represents a class of consonants, and since the tree lies in hyperbolic space, the circles are connected by curves. To construct a tree, we average the trial embeddings learned by our model according to their categories and project them as nodes onto the Poincaré disk. These nodes are then connected based on their distances, and the hierarchical clustering tree is built bottom-up (Figure 5).
Referring to the phoneme articulations given in Figure S1C, the clusters displayed in Figure 5 reflect articulatory similarity. For example, at the bottom of the Day-1 clustering tree (Figure 5A) and the top right of the Day-2 clustering tree (Figure 5B), /g/ and /k/ form a cluster. Both /g/ and /k/ are tongue-dorsum plosives (the pink inner circle) with obstructions formed at the soft palate (the blue outer circle); the difference is that /k/ is aspirated while /g/ is not. Although they form similar clusters, in the hyperbolic clustering tree the clusters are distributed near the edge of the Poincaré disk. Consequently, in hyperbolic space the difference is magnified and the confusion between data points is reduced.

C. Comparison of Phoneme Decoding Performance
In this section, we evaluate the performance of the proposed HYSpeech approach on the classification tasks (consonants, vowels, syllables, and tones) and compare it with existing methods. For all approaches, the input layer size equals the dimensionality of the input Euclidean vector $x^E$, and the output layer size corresponds to the number of categories in the classification task. We evaluate performance on a dataset spanning five days, which includes the four pronunciation tasks. The dataset contains 545 trials; details are presented in Table I. The contents of the syllable sets are presented in Tables S1 and S2, and the tonal syllable list is given in Table S3.
1) Baseline Approaches and Evaluation Methods: Considering the limited size of our dataset, the leave-one-out method is employed to calculate accuracy rates. Leave-one-out is a special case of cross-validation performed n times (where n is the sample size), such that each time the classifier is trained on n − 1 samples and tested on the remaining sample. Before calculating the accuracy rates, we select the best learning rate from [0.001, 0.01, 0.05, 0.1]. The mean and standard deviation values below are calculated over 5 random runs. The details of the baselines are as follows:
• Support Vector Machine (SVM). We compare our approach with an SVM with a linear kernel. The "OutlierFraction" parameter is set to 0.05 (i.e., 5% of data points are expected to be outliers), and the "C" parameter is set to 1. Note that the results obtained from SVM are deterministic, so confidence intervals are not provided.
• Gated Recurrent Unit (GRU). We select a GRU as another baseline. The input dimension of the GRU is set to the number of neurons, and the output dimension corresponds to the number of classification categories. The GRU architecture consists of one hidden layer comprising 256 neurons. The parameters are optimized using the Adam optimizer [35].
• gHHC. This method is a hierarchical clustering technique designed for hyperbolic spaces [29]. To apply gHHC, we set the dimension of the Poincaré disk to the original dimension of the data and use the Adam optimizer with a learning rate of 0.01.
• HypHC. Another hierarchical clustering method designed for hyperbolic spaces is HypHC [30]. HypHC directly optimizes the leaf node positions, as the internal structure can be inferred from the leaf nodes. Our study sets the learning rate to 0.001 and the temperature to 0.01, and employs the RSGD optimization technique [33].

2) Comparison With Non-Hierarchical Methods: Results demonstrate that HYSpeech consistently outperforms non-hierarchical methods across all tasks. We compare our proposed HYSpeech approach with non-hierarchical methods (SVM and GRU) on four tasks (consonant, vowel, syllable, and tone) over 5 days and present the results in Table II. For the consonant task, the average accuracy of HYSpeech is 64.89%, outperforming the SVM and GRU by 12.81% and 21.15%, respectively. For the vowel task, HYSpeech maintains a significant advantage, outperforming the SVM and GRU by 11.82% and 20.87%, respectively. In the syllable task, the average accuracy of HYSpeech is 53.16%, outperforming the SVM and GRU by 7.39% and 22.91%, respectively. In the tone task, HYSpeech achieves an accuracy of 72.21%, with improvements of 20.34% and 13.04% over SVM and GRU, respectively. A paired t-test is performed to evaluate the significance of the improvement across all tasks. The average accuracy of HYSpeech across all tasks and all days is 59.00%, which significantly outperforms SVM (52.93%; p = 0.0007) and GRU (49.15%; p = 0.000008).
3) Comparison With Hierarchical Methods: We next compare the HYSpeech approach with two hierarchical methods, gHHC and HypHC, on the four tasks (consonant, vowel, syllable, and tone) over 5 days and present the results in Table II. As the results show, our method consistently outperforms gHHC and HypHC across all tasks. The average accuracy of gHHC across all tasks and all days is 50.91%, significantly below HYSpeech (t-test, p = 0.00005); the average accuracy of HypHC is 54.61%, also significantly below HYSpeech (t-test, p = 0.00001). In the consonant and vowel tasks, HYSpeech outperforms HypHC by 11.90% and 10.16%, respectively. In the tone task, HYSpeech outperforms gHHC and HypHC by 9.83% and 8.37%, respectively. In the syllable task, HYSpeech outperforms gHHC and HypHC by 23.95% and 12.78%, respectively. These results demonstrate that our proposed method outperforms all other methods on all tasks, showing the effectiveness of our proposed structure learning and hierarchical clustering process.

D. Ablation Study
In this section, we conduct an ablation study that evaluates the effectiveness of different components and model performance under different parameter settings.

1) Comparison of Embedding Spaces: Table II and Figure 6A present the classification accuracy of consonants and vowels in hyperbolic and Euclidean spaces. Results show that classification performance in hyperbolic space consistently outperforms that in Euclidean space for all tasks. To further demonstrate the advantage of hyperbolic space, Figure 6B provides the confusion matrix of 11 similar consonants (refer to Figure S3 for the complete version). In Euclidean space, phonemes with similar articulations are frequently confused, resulting in degraded classification performance. In hyperbolic space, however, these phonemes are well separated, leading to better discriminability and overall performance. Figure 6C shows the Top-K performance of our proposed method. Specifically, our method achieves Top-5 accuracies of 97.27% and 81.67% and Top-10 accuracies of 98.18% and 95% in the consonant and vowel tasks, respectively. These results show that using hyperbolic space is beneficial for improving classification performance.
2) Effectiveness of the Hierarchical Clustering Loss: We examine the decision boundaries of our method with and without the clustering loss. Table II presents the classification accuracy of consonants and vowels under this setting, where HYSpeech-N denotes the model with the clustering loss omitted. The results show that decision boundaries are more distinct with the clustering loss, highlighting its importance in learning neural representations. Figures 6D and 6E show the decision boundaries of HYSpeech and HYSpeech-N, where the latent dimension is set to 2. The circles in the figures represent the Poincaré disk; since the space is hyperbolic, the decision boundaries for different consonants are represented by curves of different colors.
3) Decoding Performance Under Different Conditions: We investigate classification performance using neural signals from different stages, including "prepare", "listen", and "read". Specifically, the "prepare" stage refers to the time range from one second before "Prompt" to "Prompt"; the "listen" stage refers to the range from "Prompt" to "Go", which is related to the phoneme sound; and the "read" stage refers to the range from "AO" to one second after "AO". The experiments are conducted on the consonant datasets, and the results are presented in Figure 6F. As expected, classification performance progressively increases from the "prepare" stage to the "read" stage.
4) Cross-Task and Cross-Day Decoding Performance: We first evaluate the cross-task performance. As shown in Figure 6G, the classification performance on syllable data, when the model is trained on consonant or vowel data, decreases but remains significantly above chance level. This is reasonable: although a syllable is composed of a consonant and a vowel, the two are combined and form phonemic connectives at their junction when the syllable is pronounced. Diverse types of phonemic connectives exist, limiting the recognition performance for consonants and vowels.
We then evaluate the cross-day performance. As shown in Figure 6H, we progressively use multiple days of data for training and calculate the final classification accuracy with the leave-one-out method. The results show that performance degrades slightly when multi-day datasets are used for training. Whether historical data help improve performance on a given day depends on whether the neural modes of the historical data cover that day's modes. Since neural modes vary considerably across days, this usually happens only when a large amount of historical data is available; otherwise, the historical data might not help or might even yield a more overfit model, decreasing performance [36]. Additionally, neural data spanning a long time period usually suffer from large variations in neural modes. Our data covered a long time span (Day-1: 210714, Day-2: 220321, Day-3: 220613, Day-4: 220620, Day-5: 220914), so the neural modes can differ substantially among days. It is commonly accepted that single-unit neural signals recorded from the motor cortex show high variability across days [9], [10]. This variability can be large over long time spans, such that multi-day datasets did not improve performance.
5) Influence of Hyper-Parameters: We evaluate the impact of hyper-parameters on the performance of our method. Specifically, we investigate the effects of the hidden size, number of blocks, clustering layer, training strategy, curvature, and learning rate using the consonant dataset. The results are presented in Figure 6J.
We compare three hidden sizes (128, 256, and 512) and find that the best performance is achieved with a hidden size of 256. We also compare three learning rates (0.1, 0.01, and 0.001) and find that the best performance is achieved with a learning rate of 0.001. The curvature (c in Equation 2) indicates the degree of curvature of the hyperbolic space. We compare three curvatures (−1, −2, and −3) and find that the best performance is achieved at a curvature of −2.
Each block contains a set of data points from all categories. We find that the best performance is achieved using 5 blocks (evaluated by the leave-one-out method). As the number of blocks increases, the data size increases; it is worth noting that performance may further improve with an enlarged dataset.
We evaluate the impact of applying the clustering loss to different layers, including the input, latent, and logit layers. We find that the best performance is achieved when clustering is applied to the logit layer. This result suggests that clustering on the neural representations learned in hyperbolic space is useful for improving classification performance.
Finally, we evaluate the influence of the training strategy, including joint and alternate training. In joint training, we initialize λ = 0 and adjust it to an equal ratio of 0.5 after 100 epochs. In alternate training, we fix the switch step to k and let λ alternate between 0 and 1 every k steps, as sketched below. We find that the best classification performance is achieved when the switch step is set to 5.
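For reference, the alternate schedule amounts to a single line; `switch_step` is our name for k.

```python
def alternate_lambda(step, switch_step=5):
    """Alternate lambda between 0 (classification only) and 1 (classification
    plus clustering at full weight) every switch_step steps."""
    return float((step // switch_step) % 2)
```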

V. DISCUSSION AND LIMITATIONS
There is an interesting phenomenon: consonants are classified better than vowels. Similar results have been reported in previous work: [12] reported an accuracy of 36.8% across 24 consonant classes and 27.7% across 16 vowel classes, and [4] reported average classification performance of 36.1% for 24 consonants and 23.9% for 16 vowels. A common explanation is that the articulatory movements of different consonants are more discriminative, whereas vowels are mostly differentiated by tongue position, with fewer differences in articulatory movements. In our work, we suggest that there are fewer classification hierarchies among vowels than among consonants, so the performance gains from our approach are smaller.
Our demonstration is a proof of concept that learning a phoneme-level neural structure with hyperbolic neural networks can improve decoding performance. However, it is important to note that we have not yet developed a complete, clinically viable system for decoding utterances, and much work remains to ensure real-time decoding and to verify the accuracy of utterance decoding. In addition, we have only verified the effectiveness of the algorithm on one individual, and it needs to be confirmed with other participants.

VI. CONCLUSION AND FUTURE WORKS
In this work, we propose a hyperbolic model to decode spoken Chinese phonemes from neural signals. Our approach obtained superior performance compared with existing methods and achieved state-of-the-art results. The significant performance improvement demonstrates that the neural representation of spoken phonemes contains a hierarchical structure, and that computing in hyperbolic space is a suitable way to exploit it, potentially bringing further developments to the area. The findings suggest the feasibility of constructing high-performance Chinese speech BCIs based on phoneme decoding. The proposed idea and methodology are also beneficial for a broad range of neural decoding research.
In the future, we will collect more data while optimizing the model to further enhance decoding performance. Furthermore, we will apply our approach to real-time utterance decoding, with correction by language models, and ultimately build a clinically viable system for everyday use.

Fig. 1 .
Fig. 1. (A) Mandarin phonemes with similar kinematics are represented by the same color with varying saturation levels (best viewed in color). (B) Despite appearing closer together in Euclidean space, these similar phonemes are widely separated in hyperbolic space, as indicated by the longer geodesic distance between them.

Fig. 2 .
Fig. 2. The framework of HYSpeech-based Mandarin phoneme decoding. (A) Neural signal recording and preprocessing. (B) Illustration of the experimental paradigm. (C) Projection of neural signals into the hyperbolic space. A clustering loss and a classification loss are jointly optimized to learn neural representations with hierarchical structures.

Fig. 3 .
Fig. 3. Raster plot of an exemplar neuron across repeated trials of 21 different consonants. Consonants with similar articulations are grouped in similar colors.

Fig. 4 .
Fig. 4. Visualization of neural representations. (A-C) t-SNE of raw neural representations of consonants (left), vowels (middle), and syllables with phonemes (right), respectively. Each point corresponds to a single trial. (D-F) t-SNE of neural representations learned with HYSpeech.

Fig. 5 .
Fig. 5. Visualization of the hierarchical clustering trees learned by HYSpeech from the consonant datasets, for Day-1 (A) and Day-2 (B). Outer circle color: articulator movements; inner circle color: articulation manners (LF is grouped with LL due to its proximity to LL). Different colored curves outside the pink Poincaré disk represent different clusters.


Fig. 6 .
Fig. 6. Performance and comparison. (A) Classification accuracy of 21 consonants, 24 vowels, and 20 syllables in different spaces. Significance levels: * p < 0.05, ** p < 0.01, and *** p < 0.001. (B) Confusion matrices of consonant classification in different spaces; the colored rectangles indicate consonants with similar articulations. (C) Comparison of the Top-N accuracy between our approach and the Euclidean space-based approach. (D-E) The decision boundaries of HYSpeech and HYSpeech-N (without clustering loss). (F) Classification performance using neural signals recorded under different conditions. (G) Classification performance on consonant and syllable data, trained on consonant data only. (left) Classification accuracy of 11 consonants in syllable-set 1: the blue column indicates accuracy tested on consonant data, and the orange column indicates accuracy tested on the syllable set. (right) Classification accuracy of 5 consonants in syllable-set 2. (H) Classification accuracy of 21 consonants using multi-day data; accuracy is calculated by the leave-one-out method. (I) The locations of electrode Arrays A and B. (J) Influence of different parameters on classification performance.