Semantics-Guided Hierarchical Feature Encoding Generative Adversarial Network for Visual Image Reconstruction From Brain Activity

The utilization of deep learning techniques for decoding visual perception images from brain activity recorded by functional magnetic resonance imaging (fMRI) has garnered considerable attention in recent research. However, reconstructed images from previous studies still suffer from low quality or unreliability. Moreover, the complexity inherent to fMRI data, characterized by high dimensionality and low signal-to-noise ratio, poses significant challenges in extracting meaningful visual information for perceptual reconstruction. In this regard, we propose a novel neural decoding model, named the hierarchical semantic generative adversarial network (HS-GAN), inspired by the hierarchical encoding of the visual cortex and the homology theory of convolutional neural networks (CNNs), which is capable of reconstructing perceptual images from fMRI data by leveraging hierarchical and semantic representations. The experimental results demonstrate that HS-GAN achieved the best performance on the Horikawa2017 dataset (histogram similarity: 0.447, SSIM-Acc: 78.9%, Perceptual-Acc: 95.38%, AlexNet(2): 96.24% and AlexNet(5): 94.82%) over existing advanced methods, indicating improved naturalness and fidelity of the reconstructed images. The versatility of HS-GAN was also highlighted, as it demonstrated promising generalization capabilities in reconstructing handwritten digits, achieving the highest SSIM (0.783±0.038), thus extending its application beyond training solely on natural images.


I. INTRODUCTION
The human visual system serves as a crucial sensory organ for acquiring external information [1], making the decoding of brain vision a compelling topic in the field of neuroscience. Functional magnetic resonance imaging (fMRI) [2] is an effective non-invasive method for recording brain activities, and its popularity in visual decoding studies is steadily increasing. Visual stimulus decoding encompasses three distinct tasks: image classification, stimuli recognition, and perceived reconstruction [3]. Among these tasks, reconstructing perceived images is the most challenging, as it requires efficient utilization of the limited information available in fMRI data.
Previous studies have demonstrated the existence of a mapping between cerebral cortical activity and stimulus images [4], enabling the decoding of perceptual images from fMRI data [5], [6], [7], [8]. Several approaches have been explored for perceptual image reconstruction, including machine learning methods, convolutional neural network (CNN) methods, and generative deep learning methods. Machine learning methods employ linear models to map fMRI voxels to handcrafted features (local image structure, Gabor filter features) for visual reconstruction [6]. However, these linear mapping-based approaches are primarily suited for simple stimulus images, such as domino patterns [7], handwritten numbers [8], and English letters [9], and may fall short in reconstructing complex natural images. Notably, researchers have discovered a strong correlation between CNN features and brain activity in the visual cortex [10], leading to the adoption of CNNs for recovering natural images from fMRI. These methods involve linearly mapping fMRI voxels to CNN features and then converting the corresponding features back into images through a decoder. For instance, Wen et al. linearly mapped fMRI signals to specific CNN layer features and utilized a decoding network for video frame reconstruction [11]. Beliy et al. [12] devised an encoder-decoder framework based on CNNs to address the scarcity of fMRI data, where the encoder maps stimulus images to fMRI voxel space, and the decoder performs the reverse mapping. The combination of encoder and decoder enables the use of self-supervision for training. Kai et al. proposed a reconstruction model based on visual attention guidance, inspired by the mechanism of human visual attention: the visual attention distribution is first decoded from fMRI signals, and perceptual images are then reconstructed under its guidance [13].
With the development of image generation models, many studies have begun to utilize deep generative models to reconstruct stimulus images, such as the Variational Autoencoder (VAE) [14], Generative Adversarial Network (GAN) [15], and Latent Diffusion Model (LDM) [16]. These methods typically pre-train a deep generative model on large-scale datasets and then use linear regression or neural networks to learn the mapping from fMRI signals to latent feature vectors of the generative model. In this way, during the inference stage, the corresponding stimulus image can be reconstructed based on the latent feature vectors predicted from fMRI. For example, Ozcelik et al. [17] used ridge regression to decode latent variables from fMRI patterns for a pre-trained Instance-GAN to generate images with similar semantics to the visual stimuli. With the help of deep generative networks, images of different complexity can be reconstructed, such as faces [18], [19], single object-centered images [20], [21], and complex scene images [22], [23]. In particular, since the publication of the latent diffusion model, many visual reconstruction methods based on it have emerged [24], [25], [26], [27], [28], [29], which can reconstruct high-quality complex scene images by utilizing the powerful generative capabilities of the latent diffusion model. Although these methods based on deep generative models have achieved impressive naturalness of reconstructed images, they have several inherent problems: (1) The use of a pre-trained generative model helps enhance the reconstruction quality, but the generated image is generally inconsistent with the semantics of the original image. (2) There is no guarantee that the generated image contains low-level features of the visual stimulus, i.e., the reconstruction commonly fails to match the real image. (3) Even with random noise as input, these models can generate high-quality images, resulting in unreliable decoding. However, for the visual reconstruction task, more emphasis should be placed on consistency with the original image than on the diversity of the generated images. Therefore, the faithful reconstruction of visual stimuli remains to be explored.
To address the problems of the above methods and make the reconstructed image as consistent as possible with the original image, it is necessary to consider how to pass more low-level visual features into the reconstruction space, and how to adequately utilize the limited information in the fMRI signals to guide the generator to restore the complex colors and textures of the natural image. The work of Horikawa and Kamitani [30] identified homology between the visual cortex and deep neural networks (DNNs) in hierarchical representation. This discovery established that DNN features can serve as proxies for the hierarchical representation of human vision, which can be translated from fMRI signals. Fang et al. [31] further emphasized that the lower visual cortex (LVC) exhibits a higher correlation with low-level image features, while the higher visual cortex (HVC) displays stronger correlations with image semantic features. Incorporating information from different visual cortex areas has proven beneficial in enhancing visual decoding performance. However, previous studies merely employed layer-specific DNN features and disregarded the relationship between visual features at various levels of the stimulus image and the visual cortex. Consequently, this limitation resulted in insufficient visual decoding and hindered the model's generalization ability.
Building on these insights, we introduce a novel decoding framework called the hierarchical semantic generative adversarial network (HS-GAN) to reconstruct corresponding perceptual images from fMRI recordings. Drawing inspiration from the hierarchical encoding of the visual cortex, our approach involves constructing an image encoding network that extracts different levels of visual features (hierarchical encoder) from stimulus images and supplements semantic features (semantic encoder), which are then compressed into low-dimensional latent vectors. To preserve more fine-grained details during visual reconstruction, we devise a generative network with skip connections to restore the corresponding visual stimuli from these latent representations. Additionally, we integrate self-attention modules into the generator, enabling the model to effectively leverage important visual information contained in the latent vectors at different levels. To account for the potential nonlinearity of fMRI data, we design a neural decoder with residual connectivity, which efficiently learns the mapping from fMRI to DNN features without overfitting. Given the limited number of paired fMRI-image samples, we divide the model training into two stages. In the first stage, the model is trained on an additional large natural image dataset to incorporate prior knowledge, thereby enhancing reconstruction quality. In the second stage, we train only the neural decoder to learn the transformation from fMRI voxels to the visual and semantic features of the perceptual image. During the inference stage, the neural decoder is employed to predict the corresponding latent representations of perceptual images from test fMRI patterns, which are then fed to the generator to obtain the final reconstructed images. Our primary contributions can be summarized as follows:
• We propose a hierarchical semantic-guided visual reconstruction framework, which successfully decodes hierarchical visual and semantic representations of stimulus images from fMRI patterns. This approach maximizes the utilization of limited visual information in fMRI data, leading to improved reconstruction quality.
• The design of our generator, incorporating skip connections and attention modules, facilitates the recovery of perceptual images from low-dimensional representation vectors, further enhancing the fidelity of the reconstructed images.
• We introduce a neural decoder with residual connectivity, effectively learning the mapping from fMRI to DNN features and bolstering the accuracy of fMRI decoding. Additionally, we introduce a reconstruction loss in the training process of the neural decoder.
• Through extensive validation on two distinct datasets, our model achieves state-of-the-art performance, confirming the efficacy of the proposed approach.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

A. Visual Reconstruction From fMRI
The existing approaches to visual reconstruction can be roughly divided into two groups. The first group emphasizes that the reconstructed images should be similar to the original images in pixel space. Since the reconstructed image is expected to be consistent with the original image, this type of approach focuses on network design and training strategies, and trains its own generative model from scratch. Shen et al. [20] designed an end-to-end DNN generative model to directly learn the mapping from fMRI voxels to images. Moreover, a discriminator and a comparator were employed during generator training to introduce adversarial loss and perceptual loss. In the same year, Shen et al. [21] employed a linear decoder to decode fMRI into DNN features, and then optimized the pixel values of the image using a feature loss to minimize the difference between its DNN features and the DNN features decoded from the fMRI pattern. The method of Beliy et al. [12] consists of an encoder E and a decoder D, where E converts the image to the corresponding fMRI voxels while D maps the fMRI to its corresponding image space. Two combined networks, E-D and D-E, were constructed by stacking E and D back-to-back for unsupervised training on unpaired images and fMRI data. Fang et al. [31] used linear models and shallow DNNs to decode shape features and category features of stimulus images from fMRI, respectively, which were then used as conditional information to train a GAN generator. Kai et al. [13] began by predicting salient regions in the image (foreground attention) from the fMRI pattern, and then used them as a guide for the image decoder to recover the visual stimulus from the fMRI. Their training strategy was similar to that of Beliy et al.
The second group focuses on the similarity of the reconstructed image to the original image in high-level semantic features. Such approaches typically synthesize the reconstructed content with the help of pre-trained generative models (e.g., Instance-GAN), guided by fMRI patterns. Ozcelik et al. [17] utilized ridge regression to decode conditional instance variables of an Instance-GAN from fMRI patterns, which were then used as conditional guidance for the pre-trained GAN to generate images with similar semantics to the visual stimuli. Chen et al. [25] first pre-trained a Masked Autoencoder (MAE) on an additional fMRI dataset, which was used to extract valid representations of fMRI voxels. Subsequently, the MAE-extracted features were utilized as the textual condition to fine-tune the pre-trained LDM to recover stimulus images. However, these methods cannot guarantee that the reconstruction is semantically consistent with the fMRI and thus lack reliability. This makes them unsuitable for some real-world applications, such as patient diagnosis.

B. Visual Information Processing
The processing of visual information can be divided into three different levels. Low-level processing involves the retina, lateral geniculate nucleus (LGN), and primary visual cortex (V1). This is the first step in visual processing, which focuses on perceiving the orientation, lines, and edges of an image. Mid-level processing follows, involving visual regions V2, V3, and V4, which extract shape, object, and color features in the image, respectively. Finally, high-level processing is accomplished by high-level visual areas such as the fusiform face area (FFA), lateral occipital complex (LOC), parahippocampal place area (PPA), and middle temporal area (MT/V5), which show selective responses to faces, objects, places, and movement, respectively. Based on the above conclusions, we should consider the relationship between different levels of visual regions and image features during fMRI decoding.

III. METHOD

A. Overview
Let $(x, y)$ represent an {image, fMRI} data pair, where $x \in \mathbb{R}^{H \times W \times C}$ represents the natural image, and $H$, $W$, and $C$ are the height, width, and number of channels of $x$; $y \in \mathbb{R}^{L}$ represents the fMRI sample collected when the subject viewed image $x$, and $L$ denotes its dimension. Fig. 2a shows that the reconstruction task is to recover the perceived images from fMRI recordings. The visual image reconstruction framework we propose includes three key parts: an image feature encoder, a neural decoder, and a GAN image generator (Fig. 2b). The image feature encoder $E$ includes a hierarchical encoder $E_\eta$ and a semantic encoder $E_\epsilon$. For simplicity, we use $z_h = \{z_{h1}, z_{h2}, z_{h3}, z_{h4}\}$ to represent the hierarchical latent vectors, where $z_h = E_\eta(x)$. In order to introduce category information into the reconstructed image, we use the semantic encoder $E_\epsilon$ to obtain the semantic feature $z_{sm}$ of the original image, which assists the generator $G_\theta$ in reconstructing a semantically meaningful image. The hierarchical latent vectors and semantic representations are concatenated and fed into the generator to reconstruct the perceived image. Let $\hat{x} = G_\theta(z_h, z_{sm})$ denote the recovered image of $x$. Since training the generative model requires a large amount of data, we combine the image feature encoder and the generator to form an autoencoder structure, which allows for self-supervised learning using additional images. At the same time, we also introduce a discriminator $D_\varphi$ for adversarial training. Subsequently, we use the well-trained image feature encoder to guide the neural decoder $D_\psi$ to learn the transformation from fMRI voxels to latent feature vectors, $z_h^*, z_{sm}^* = D_\psi(y)$. In this way, we can decode a rich set of image representations from fMRI. Finally, the natural image corresponding to the fMRI sample $y$ is recovered by
$$x^* = G_\theta\left(D_\psi(y)\right).$$
Note that the neural decoder predicts hierarchical features using voxels from the entire visual cortex (VC), while for the semantic feature it uses voxels from the HVC region, because HVC shows a more significant response to high-level image features [10]. This design takes into account the relationship between different levels of image features and visual areas.

B. Image Feature Encoder
The image feature encoder plays a crucial role in extracting visual features from images at different levels. It comprises two essential components: the hierarchical encoder and the semantic encoder (Fig. 3). Thus, our feature extraction module effectively preserves both low-level features and high-level semantic content, contributing to superior image reconstruction.
1) Hierarchical Encoder: As the backbone of the hierarchical encoder, we employ a ResNet-50 [32] deep network pre-trained on ImageNet. Leveraging the residual connections in this network, we can retain certain low-level features while extracting high-level features from the image. As depicted in Fig. 3a, we utilize the conv1, layer1, layer2, and layer3 modules of ResNet-50 to obtain visual features at various levels of the input image. Since the weights of the ResNet-50 network are fixed during training, convolutional modules are introduced to further process the extracted visual features. These features are then compressed into low-dimensional latent vectors using feature encoding blocks. These blocks consist of a convolutional layer and a global pooling layer to reduce the dimensionality of the feature maps, which are ultimately mapped to 1024-dimensional latent vectors through a fully connected (fc) layer.
2) Semantic Encoder: Since the reconstruction quality is positively correlated with the feature decoding accuracy, selecting a DNN with higher decoding accuracy theoretically achieves better reconstruction [21]. Based on this, we use the "brain-like" VGG-19 network [33] as the semantic encoder. Specifically, we use VGG-19 pre-trained on ILSVRC2012 [34] to construct the semantic encoding network, where VGG-19 has 16 convolutional layers and 3 fully connected (fc) layers. To decrease the computing cost and enhance the decoding precision, the output of the first fc layer of VGG-19 is utilized as the semantic representation. In this way, the dimension of the feature vector is reduced to 4096 and the category information of the original image is preserved (Fig. 3b). The category information of the object assists the generator in reconstructing the underlying details of the stimulus image more accurately.

C. Image Generator
In order to retain more low-level details from the original images in the reconstructed images, we devise a hierarchical semantic GAN inspired by the U-Net [35] design principle (Fig. 2). The skip connections overcome the limitation of traditional encoder-decoder models, which tend to lose low-level features such as shape and texture due to the bottleneck structure during the extraction of high-level features. Consequently, our generator is adept at transferring more low-level details to the reconstruction space. The structure of the generator is shown in Fig. 4. First, an fc layer maps the low-dimensional latent vectors into the image feature space, and the result is fed into a transposed convolution module for feature extraction and 2-fold up-sampling; this module consists of a transposed convolution layer, a normalization layer, and a ReLU activation. Furthermore, in the process of recovering perceived images from hierarchical latent vectors, the image generator must effectively leverage the information embedded within these feature vectors. To address this, we introduce self-attention modules into the image generator architecture, enabling it to emphasize crucial visual information while disregarding less relevant details. Finally, the extracted features are concatenated with the feature maps of the next step along the channel dimension to fuse the different levels of features. It should be noted that we concatenate the semantic vector $z_{sm}$ with the hierarchical latent vector $z_{h4}$ to constrain the generator to preserve the visual features corresponding to the object category during reconstruction. The self-attention mechanism is computed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \tag{1}$$
where $d_k = 1$ and the definitions of $Q$, $K$, and $V$ can be found in [36].
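To make the attention computation concrete, the following is a minimal pure-Python sketch of scaled dot-product attention with $d_k = 1$. It is an illustration only: the function names and the toy inputs are assumptions, and in the actual generator $Q$, $K$, and $V$ would be learned projections of feature maps rather than the raw lists used here.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V, d_k=1.0):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q and K are lists of vectors of equal length; V holds one value
    vector per key. d_k = 1 follows the value stated in the text.
    """
    scale = math.sqrt(d_k)
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example (illustrative values): three positions with 2-D features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X, X, X)
```

Each output row is a convex combination of the value vectors, so positions whose keys match the query more strongly contribute more to the result.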
Model training. We use a combination of image loss, perceptual loss, and adversarial loss during generator training to enhance the quality of the recovered images. The image loss is the mean squared error (MSE) between the reconstructed image and the original image in pixel space:
$$\mathcal{L}_{img} = \frac{1}{N}\sum_{i=1}^{N}\left\| x_i - \hat{x}_i \right\|_2^2, \tag{2}$$
where $x_i$ represents the real image, $\hat{x}_i = G_\theta(E(x_i))$ denotes the corresponding generated image, and $N$ represents the sample size. We use the Learned Perceptual Image Patch Similarity (LPIPS) proposed in [37] as the perceptual loss, which has been shown to improve reconstructed image quality [38]. This loss is defined as:
$$\mathcal{L}_{lpips} = \frac{1}{N}\sum_{i=1}^{N}\left\| \Phi(x_i) - \Phi(\hat{x}_i) \right\|_2^2, \tag{3}$$
where $\Phi(\cdot)$ uses the AlexNet [39] network, whose structure is close to that of the human visual cortex, as a feature extractor for the computation of the perceptual loss. The last term is the adversarial loss, which encourages more natural image reconstruction.
The adversarial loss is defined as follows:
$$\mathcal{L}_{adv} = -\mathbb{E}_{z}\left[\log D_\varphi\left(G_\theta(z)\right)\right], \tag{4}$$
where $D_\varphi$ is the discriminator, which employs convolutional layers to extract features of the input image and then feeds them into a fully connected layer and a sigmoid function to obtain the probability of the input being classified as a real image, and $z = \{z_h, z_{sm}\}$ represents the latent feature vectors. Finally, the reconstruction loss $\mathcal{L}_{gen}$ is:
$$\mathcal{L}_{gen} = \mathcal{L}_{img} + \lambda_1 \mathcal{L}_{lpips} + \lambda_2 \mathcal{L}_{adv}, \tag{5}$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters representing the weights of the perceptual loss and the adversarial loss, respectively. In order to balance the different loss terms, it is necessary to choose appropriate values of $\lambda_1$ and $\lambda_2$. Specifically, we performed a grid search on the interval [0.001, 10] for $\lambda_1$ and $\lambda_2$, and calculated the LPIPS values [37] of the resulting models on the validation set. The experimental results indicate that the best reconstruction performance is obtained when $\lambda_1 = 1.0$ and $\lambda_2 = 0.01$, with the reconstructed images being closest to the original images in visual perception (achieving the lowest LPIPS). In addition, the discriminator training loss is as follows:
$$\mathcal{L}_{D} = -\mathbb{E}_{x}\left[\log D_\varphi(x)\right] - \mathbb{E}_{z}\left[\log\left(1 - D_\varphi\left(G_\theta(z)\right)\right)\right]. \tag{6}$$
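The weighting of the loss terms and the grid search over $\lambda_1$ and $\lambda_2$ can be sketched as follows. This is a simplified illustration under stated assumptions: the candidate grid of powers of ten within [0.001, 10] is an assumption (the text gives only the interval), and `toy_validation_lpips` is a hypothetical stand-in for training a model with the given weights and measuring LPIPS on the validation set.

```python
import math

def generator_loss(l_img, l_per, l_adv, lam1=1.0, lam2=0.01):
    """Weighted sum of image, perceptual, and adversarial loss values."""
    return l_img + lam1 * l_per + lam2 * l_adv

def grid_search(candidates, validate):
    """Return the (lam1, lam2) pair with the lowest validation score."""
    pairs = [(l1, l2) for l1 in candidates for l2 in candidates]
    return min(pairs, key=lambda p: validate(*p))

# Assumed log-spaced grid inside the interval [0.001, 10].
CANDIDATES = [0.001, 0.01, 0.1, 1.0, 10.0]

def toy_validation_lpips(lam1, lam2):
    """Hypothetical surrogate score; NOT a real LPIPS measurement.

    It is constructed to be smallest at lam1 = 1.0, lam2 = 0.01 purely
    so that the search loop has something to minimize in this demo.
    """
    return abs(math.log10(lam1)) + abs(math.log10(lam2) + 2.0)

best_lam1, best_lam2 = grid_search(CANDIDATES, toy_validation_lpips)
```

In practice each call to the validation function would involve a full training run, so the grid is usually kept coarse, as in this sketch.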

D. Neural Decoder
In this study, we employ the neural decoder to convert fMRI recordings into hierarchical latent vectors and semantic features, subsequently reconstructing the corresponding images through the generator. Existing approaches primarily rely on linear regression to establish the mapping from fMRI to DNN feature maps. However, it has been observed that fMRI signals may introduce nonlinearity when the stimulation duration is less than 4.2 seconds [40]. Additionally, during the image presentation experiment conducted by Horikawa and Kamitani [30], brain activity recordings of presented images were acquired without any rest intervals, introducing a form of nonlinearity in the fMRI data. Furthermore, under the assumption of a linear relationship between visual features and brain activity, simple decoding models are insufficient to model complex visual representations of the brain [17], [29], which leads to inadequate decoding of fMRI. In response to this, we devised a neural network with residual connections to effectively learn the mapping from fMRI to image features. In order to prevent overfitting, we incorporated LayerNorm and Dropout layers into the neural network, as depicted in Fig. 5.
In the training of the neural decoder, we use the trained image feature encoder $E$ to instruct the decoder $D_\psi$ to learn the transformation from fMRI to image feature vectors, fixing the parameters of the image generator during this process. We simultaneously minimize two loss functions: the feature loss $\mathcal{L}_{feat}$ and the reconstruction loss $\mathcal{L}_{gen}$. The feature loss includes MSE and cosine similarity terms to ensure that the vectors regressed by the neural decoder are similar to the original feature vectors in both distance and direction. The feature loss term is defined as:
$$\mathcal{L}_{feat} = \mu \mathcal{L}_{mse} + (1 - \mu)\mathcal{L}_{cosine}, \tag{7}$$
where $\mathcal{L}_{mse}$ and $\mathcal{L}_{cosine}$ are defined as follows:
$$\mathcal{L}_{mse} = \frac{1}{N}\sum_{i=1}^{N}\left\| z_i - z_i^* \right\|_2^2, \qquad \mathcal{L}_{cosine} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{z_i \cdot z_i^*}{\left\| z_i \right\|_2 \left\| z_i^* \right\|_2}\right), \tag{8}$$
where $z_i = E(x_i)$ and $z_i^* = D_\psi(y_i)$. The reconstruction loss here is shown in equation (5). Therefore, we optimize the parameters of $D_\psi$ with the following objective:
$$\psi^* = \arg\min_{\psi}\left(\mathcal{L}_{feat} + \mathcal{L}_{gen}\right). \tag{9}$$
During training, the empirical hyperparameter of the feature loss term is set consistently with the literature [12], i.e., $\mu = 0.9$, with the difference that a reconstruction loss is additionally introduced. Note that due to the dimensional differences in fMRI data across subjects, we trained the decoder model separately for each subject. Finally, the perceptual image $x^*$ corresponding to fMRI $y$ can be obtained by $G_\theta(D_\psi(y))$.
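A minimal sketch of a feature loss that combines MSE and cosine terms is given below. It assumes that $\mu$ weights the MSE term and $1 - \mu$ weights the cosine term, which is one plausible reading of the description rather than a confirmed detail; the function names are illustrative, and the sketch operates on a single pair of plain Python vectors instead of batched tensors.

```python
import math

def mse_loss(z, z_star):
    """Mean squared difference between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(z, z_star)) / len(z)

def cosine_loss(z, z_star):
    """One minus the cosine similarity; 0 when directions coincide."""
    dot = sum(a * b for a, b in zip(z, z_star))
    norm = math.sqrt(sum(a * a for a in z)) * math.sqrt(sum(b * b for b in z_star))
    return 1.0 - dot / norm

def feature_loss(z, z_star, mu=0.9):
    """Assumed form: mu * MSE + (1 - mu) * cosine loss, with mu = 0.9."""
    return mu * mse_loss(z, z_star) + (1.0 - mu) * cosine_loss(z, z_star)

# Toy usage (illustrative values): target vs. decoded latent vector.
target = [1.0, 0.0]
decoded = [0.0, 1.0]
loss_value = feature_loss(target, decoded)
```

The MSE term penalizes differences in magnitude, while the cosine term penalizes differences in direction, so a decoded vector must match the target in both respects to reach zero loss.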

E. Self-Supervised Training
The image encoder and generator represent two vital components of our proposed framework. The image encoder extracts visual representations from input images, and the generator converts these representations back into corresponding images. To enhance the performance of both the encoder and generator, we jointly trained these two networks on an additional image dataset. Specifically, we randomly selected 40,000 images from ILSVRC2012 [34] and resized them to 128 × 128 pixels for self-supervised learning in the context of the reconstruction framework. It is important to note that there is no overlap between the selected images and the training or test images in the fMRI dataset.

A. Experimental Implementation
1) Dataset: To evaluate the efficacy of our proposed method, we conducted experiments on two publicly available fMRI datasets: Horikawa2017 [30] and vanGerven2010 [41].
vanGerven2010 dataset: This dataset comprises visual stimuli of the numbers 6 and 9 selected from the MNIST dataset, totaling 100 grayscale images with a resolution of 28 × 28. The choice of these specific numbers is due to their substantial dissimilarity. During the image display trials, fMRI data were collected from one subject while viewing the stimulus images, encompassing voxels in the V1, V2, and V3 regions of the visual cortex. For training purposes, we selected 90 {image, fMRI} data pairs from the dataset, while the remaining pairs were reserved for testing. For the reconstruction of handwritten digits, we trained the neural decoder using fMRI voxels from all of the above visual cortex regions.
Horikawa2017 dataset: In the image display trials of this dataset, fMRI signals were collected from five subjects while viewing a series of images randomly selected from the ImageNet dataset. The training trials comprised 1200 images belonging to 150 categories, and the test trials contained 50 images from different categories. Notably, the image categories in the test set did not overlap with those in the training set. During fMRI data collection, the training trials involved a single collection per image, whereas the test trials were collected 35 times per image. All images were displayed with fixation in a 3T scanner (TR, 3 s; voxel size, 3 × 3 × 3 mm). In accordance with previous studies [21], the fMRI data collected for each test image were averaged to improve the signal-to-noise ratio (SNR). Additionally, this fMRI dataset provides masks for various visual cortex regions, including V1, V2, V3, V4, LOC, FFA, and PPA. For more details about the Horikawa2017 dataset, please refer to [30].
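The averaging of repeated test trials can be illustrated with a minimal sketch. The function names and toy data are illustrative assumptions and do not reflect the dataset's actual file layout. Under the additional assumption of independent, zero-mean noise across repetitions, averaging $n$ repeats reduces the noise standard deviation by a factor of $\sqrt{n}$, i.e., roughly 5.9 for the 35 test repetitions.

```python
import math

def average_repeats(trials):
    """Average repeated fMRI samples of one image, voxel by voxel.

    `trials` is a list of equal-length voxel vectors, one per repetition.
    """
    n = len(trials)
    return [sum(voxel_values) / n for voxel_values in zip(*trials)]

def noise_reduction_factor(n_repeats):
    """Expected drop in noise std, assuming i.i.d. zero-mean noise."""
    return math.sqrt(n_repeats)

# Toy usage: two repetitions of a two-voxel pattern (illustrative values).
averaged = average_repeats([[1.0, 2.0], [3.0, 4.0]])
factor_35 = noise_reduction_factor(35)
```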
2) Evaluation Indicators: Considering the notable complexity differences in stimulus images between the Horikawa2017 and vanGerven2010 datasets, distinct evaluation criteria were employed for assessing the reconstruction quality on these two datasets. For vanGerven2010, we utilize the Pearson Correlation Coefficient (PCC) and the Structural Similarity Index (SSIM) [42] as evaluation indicators to facilitate comparison with prior studies. Given two images $X$ and $Y$, the PCC is:
$$\mathrm{PCC}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y},$$
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, and $\mathrm{cov}(X, Y)$ is their covariance. This metric can be used to assess the linear relationship between the reconstructed and original images. SSIM is a measure designed to reflect human visual perception, measuring the similarity of local structures between the reconstructed image and the original image. Its expression is:
$$\mathrm{SSIM}(X, Y) = \frac{(2\mu_X \mu_Y + c_1)(2\sigma_{XY} + c_2)}{(\mu_X^2 + \mu_Y^2 + c_1)(\sigma_X^2 + \sigma_Y^2 + c_2)},$$
where $\mu_X$, $\mu_Y$ and $\sigma_X^2$, $\sigma_Y^2$ represent the means and variances of $X$ and $Y$, respectively, $\sigma_{XY}$ denotes their covariance, and $c_1$ and $c_2$ are constants.
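Both metrics can be sketched in a few lines of pure Python. This is a simplified illustration, not the evaluation code used in the experiments: the SSIM here is computed once over global image statistics, whereas standard implementations average the same expression over local windows, and the constants $c_1 = (0.01R)^2$ and $c_2 = (0.03R)^2$ for dynamic range $R$ are the commonly used defaults, assumed here because the text does not state their values.

```python
import math

def _mean(v):
    return sum(v) / len(v)

def _cov(x, y):
    """Population covariance of two equal-length pixel lists."""
    mx, my = _mean(x), _mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def pcc(x, y):
    """Pearson correlation coefficient: cov(X, Y) / (sigma_X * sigma_Y)."""
    return _cov(x, y) / math.sqrt(_cov(x, x) * _cov(y, y))

def ssim_global(x, y, data_range=1.0):
    """Single-window SSIM over flattened images (a simplification)."""
    c1 = (0.01 * data_range) ** 2  # assumed default constant
    c2 = (0.03 * data_range) ** 2  # assumed default constant
    mx, my = _mean(x), _mean(y)
    vx, vy, cxy = _cov(x, x), _cov(y, y), _cov(x, y)
    numerator = (2 * mx * my + c1) * (2 * cxy + c2)
    denominator = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return numerator / denominator

# Toy usage: a flattened 2 x 2 image and its intensity-inverted copy.
original = [0.0, 0.25, 0.5, 1.0]
inverted = [1.0 - p for p in original]
```

An image compared with itself scores 1 on both metrics, while the inverted copy yields a PCC of -1, which shows that PCC captures only the linear relationship between pixel intensities.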
For the reconstruction of Horikawa2017 natural images, in order to objectively evaluate the reconstruction quality of HS-GAN, both qualitative and quantitative comparisons are performed in this paper. For qualitative comparison, images reconstructed by different methods are shown directly. For quantitative comparison, we used six metrics: histogram similarity (HS) [43], mutual information (MI) [44], SSIM identification accuracy (SSIM-Acc), perceptual similarity identification accuracy (Perceptual-Acc), and AlexNet(2) and AlexNet(5) identification accuracy. Histogram similarity compares the histograms of the reconstructed and original images, where $n$ is the dimension of the histogram. MI is calculated using the following formula:
$$\mathrm{MI}(A, B) = H(A) + H(B) - H(A, B),$$
where $H(A)$ and $H(B)$ represent the information entropy of images $A$ and $B$, respectively, and $H(A, B)$ is the joint entropy of $A$ and $B$. The equations are as follows:
$$H(A) = -\sum_{a} p_A(a)\log p_A(a), \qquad H(A, B) = -\sum_{a, b} p_{AB}(a, b)\log p_{AB}(a, b).$$
For the identification accuracy metric, the recovered image is assessed using two candidate images: the actual image and a randomly selected one from the test set (excluding the actual image). If the reconstructed image is more similar to the actual image than to the randomly selected one, the reconstruction is deemed successful. The formula is:
$$\mathrm{Acc} = \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j \neq i}\mathbb{1}\left[s(\hat{x}_i, x_i) > s(\hat{x}_i, x_j)\right],$$
where $s(\cdot, \cdot)$ is the similarity metric and $N$ is the number of test images. For the 50 images in the test set, a total of 2450 comparisons are made. SSIM-Acc and Perceptual-Acc use the Structural Similarity Index (SSIM) and LPIPS as similarity metrics, respectively, where the calculation of perceptual similarity is described in equation (3). AlexNet(2) and AlexNet(5) refer to the computation of PCC similarity using image features extracted from the second and fifth layers of AlexNet.
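The pairwise identification procedure can be sketched as follows. The sketch assumes a precomputed similarity matrix in which entry `sim[i][j]` is the similarity between reconstruction `i` and ground-truth image `j`, and in which larger values mean more similar; for a distance such as LPIPS, the values would need to be negated first. The function names and toy matrix are illustrative.

```python
def identification_accuracy(sim):
    """Fraction of pairwise comparisons won by the true image.

    sim[i][j] is the similarity between reconstruction i and ground-truth
    image j (higher = more similar). Each reconstruction is compared
    against every other test image in turn.
    """
    n = len(sim)
    wins = sum(1 for i in range(n) for j in range(n)
               if j != i and sim[i][i] > sim[i][j])
    return wins / (n * (n - 1))

def num_comparisons(n_images):
    """Total number of (reconstruction, distractor) comparisons."""
    return n_images * (n_images - 1)

# Toy usage: 3 reconstructions vs. 3 ground-truth images.
toy_sim = [
    [0.90, 0.10, 0.95],  # reconstruction 0 loses to distractor 2
    [0.20, 0.80, 0.30],
    [0.40, 0.50, 0.60],
]
toy_acc = identification_accuracy(toy_sim)
```

With 50 test images this procedure makes 50 × 49 = 2450 comparisons, and chance level is 50%.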
3) Implementation Details: Our proposed method was implemented using PyTorch, and model training was performed on an NVIDIA 3090 GPU. Self-supervised learning of the reconstruction network was conducted using 40,000 randomly selected images from the ILSVRC2012 dataset. The dimensions of the hierarchical latent vectors and semantic feature vectors were set to 1024 and 4096, respectively. The dimension of the hidden layers of the neural decoder is 2048.
Training settings. During self-supervised training, the input images are resized to 128 × 128, and the generator and discriminator are trained for 400 epochs using the Adam optimizer with an initial learning rate of 2 × 10^−4 and a cosine annealing learning rate schedule. For training stability, the discriminator uses the Patch-GAN design with a patch size of 16. For neural decoder training, the initial learning rate is 3 × 10^−4, the weight decay is set to 1 × 10^−2, and the decoder is trained for 240 epochs using the Adam optimizer with a learning rate scheduler. The batch size for all training sessions is 64. The loss curves of the generator and the discriminator during training are displayed in Fig. 6. It can be observed that the generator loss converges smoothly at around 200 epochs, but we continue training up to 400 epochs to obtain a more robust generator. The hyperparameters of equation (5) in the image reconstruction loss were tuned during training, yielding λ1 = 1.0 and λ2 = 0.01 (see the appendix).
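For reference, a cosine annealing schedule of the kind mentioned above can be written in closed form. The sketch below uses the standard formula $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\pi t / T))$; the minimum learning rate of 0 and a period equal to the full 400 epochs are assumptions, since the text specifies only the initial learning rate and the number of epochs.

```python
import math

def cosine_annealing_lr(epoch, base_lr=2e-4, total_epochs=400, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing.

    min_lr = 0 and a single period spanning all epochs are assumed
    defaults, not values taken from the paper.
    """
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

# Learning rate at the start, midpoint, and end of generator training.
lr_start = cosine_annealing_lr(0)
lr_mid = cosine_annealing_lr(200)
lr_end = cosine_annealing_lr(400)
```

The rate decays slowly at first, fastest around the midpoint, and flattens out near the end, which tends to give stable late-stage training.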

B. Image Reconstruction Performance
1) Natural Image Reconstruction of Horikawa2017: We assess our approach on the Horikawa2017 dataset, and some reconstruction examples are displayed in Fig. 7. As Fig. 7 shows, our model captures the crucial characteristics of the object in the stimulus image, such as shape and contour, and performs well on all subjects. More reconstruction examples can be found in the appendix.
We also compared the image reconstruction results qualitatively and quantitatively with other state-of-the-art methods, including Shen et al. [20], Shen et al. [21], Beliy et al. [12], Fang et al. [31], Kai et al. [13], Ozcelik et al. [17], and Chen et al. [25]. Note that the focus of Ozcelik et al. and Chen et al. differs from that of our approach: they use a generative model (GAN or LDM) pre-trained on a large-scale dataset to synthesize the original image from a noise vector, with fMRI as the conditional guide. For a broad comparison, we nevertheless provide their results. For qualitative comparison, we directly use the recovered images provided by the aforementioned authors in their respective papers. Fig. 8 showcases some of the reconstructed images, all obtained from the fMRI data of Subject 3. To enhance the SNR, all fMRI voxels from the test image were normalized and subsequently averaged. As demonstrated in Fig. 8, the images reconstructed by our method exhibit rich colors, clear contours, and better preservation of underlying details such as shape and texture from the original images. Consequently, our reconstructions appear more natural, clear, and recognizable, representing an advance over previous methods. Compared with previous methods focusing on pixel reconstruction ([12], [13], [20], [21], [31]), HS-GAN achieves further improvement in reconstruction quality. However, fMRI data often suffer from spatial redundancy, noise, and sample sparsity, resulting in poor representation of fMRI signals and potential overfitting to the noise distribution. These challenges make it difficult for our decoder to accurately predict the corresponding image features from the fMRI voxels, so the reconstructions still fall short of replicating the original stimulus (they remain blurry and unclear), and their realism still needs to be improved. Moreover, compared with methods emphasizing semantic similarity ([17], [25]), our method retains more low-level visual features of the original image, providing more realistic and reliable reconstructions. Although semantics-focused approaches can produce relatively high-quality images, they can hardly ensure that the recovered images are consistent with the semantic information of fMRI, as shown in Fig. 8.
To provide an objective evaluation of the reconstruction performance of our proposed method, we quantitatively compared the results with the aforementioned methods using the six metrics described above. Since not all methods allow every metric to be calculated (depending on the content provided by the authors), only the metrics available for each method are reported. Notably, since all reconstructions provided in the papers are for Subject 3, we uniformly use the results of Subject 3 for the metric calculations. As shown in Table I, the images reconstructed by HS-GAN obtained the highest histogram similarity of 0.447, and their mutual information is second only to the method of Shen et al. [21], which indicates that our proposed method better preserves the low-level visual features of the original images and achieves a more reliable reconstruction. Furthermore, HS-GAN obtained the highest SSIM-Acc and Perceptual-Acc (78.90% and 95.38%, respectively), indicating that our reconstructed images are more consistent with human visual perception. For the metrics computed in the AlexNet feature space, HS-GAN also achieves the best performance (AlexNet(2): 96.24%, AlexNet(5): 94.82%), demonstrating that the reconstructed images also retain the high-level features of the original images. We also applied a hierarchical variational autoencoder, VDVAE [45] (Very Deep VAE), to visual reconstruction, and the quantitative comparison shows that the reconstruction of HS-GAN is superior to that of VDVAE, which further supports the advantage of our model. A description of the visual reconstruction utilizing VDVAE is provided in the Appendix, together with some of the reconstructed images. Overall, HS-GAN achieves the best performance, leading on every quantitative metric except MI.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

TABLE III QUANTITATIVE COMPARISON RESULTS OF ABLATION EXPERIMENTS WITH DIFFERENT MODEL COMPONENTS
It should be noted that although the reconstructed images of Ozcelik et al. and Chen et al. appear visually more plausible, their reconstructions tend to differ significantly from the original images and are therefore less reliable. As a result, their performance on the pixel and perceptual similarity metrics is not impressive. For the visual reconstruction task, however, consistency of reconstruction is more important than diversity, particularly for applications in real-world scenarios such as the diagnosis of neurological diseases. In addition, we provide the quantitative evaluation when the model is not trained with additional data; our method still achieves the best performance on several metrics (HS: 0.435, SSIM-Acc: 74.31%, Perceptual-Acc: 90.42%, AlexNet(2): 91.86%) compared to other approaches. This demonstrates the effectiveness of our model design.
2) Grayscale Digit Image Reconstruction of vanGerven2010: To assess the generalization ability of our model beyond natural images, we conducted experiments on the reconstruction of handwritten digit images from the vanGerven2010 dataset. This task is challenging because our image feature encoder and generator were trained on natural images, without additional training on handwritten characters. Specifically, we fixed the parameters of the image feature encoder and generator, and then fed the digit images, scaled to 128 × 128 pixels, into the encoder to extract visual features. Subsequently, a neural decoder was employed to map fMRI signals to the latent vectors. Finally, the predicted latent vectors were fed into the generator to reconstruct the corresponding handwritten digits. The reconstruction results are depicted in Fig. 9, where it is evident that our model successfully reconstructs the digits 6 and 9.
Table II presents the quantitative comparison results for the vanGerven2010 dataset. Notably, our method achieves the highest Structural Similarity Index (SSIM) of 0.783, an improvement of 7.4% compared to TIGAN [46]. In visual comparison, the images reconstructed by HS-GAN have clearer contours. This is mainly attributed to two factors: (1) our model utilizes diverse visual features at different levels to reconstruct the stimulus image and introduces a reconstruction loss in the neural decoder; (2) the specially designed generator effectively transmits more low-level details from the original image into the reconstruction space. Although DGMM [47], TIGAN [46], and DVAE/GAN [48] achieve higher Pearson Correlation Coefficient (PCC) values, they also exhibit blurred reconstructions. These comparative results demonstrate the versatility of our model, which is not limited to template matching and is therefore suitable for reconstructing images from other domains as well. Moreover, our findings show that a model pre-trained on complex images can adapt well to simpler image reconstruction tasks.
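For reference, the two metrics reported in Table II can be sketched in NumPy as follows. This is a simplified illustration, not the evaluation code used in the paper: the function names are hypothetical, and the SSIM here is computed once over the whole image, whereas the standard definition averages the index over local sliding windows. The constants 0.01 and 0.03 follow the usual SSIM definition.

```python
import numpy as np


def pearson_cc(x, y):
    """Pearson correlation coefficient between two flattened images."""
    return float(np.corrcoef(x.ravel(), y.ravel())[0, 1])


def global_ssim(x, y, data_range=1.0):
    """SSIM evaluated over the whole image (assumed simplification, no local windows)."""
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return float(luminance * contrast_structure)
```

PCC depends only on the linear relationship between pixel intensities, while SSIM also penalizes differences in mean luminance and contrast, which is one reason the two metrics can rank methods differently.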

C. Ablation Studies of Different Components
Our proposed HS-GAN incorporates several crucial components, including the semantic encoder, self-attention module, and neural decoder. In this section, we conduct ablation experiments to examine the effects of these components on model performance. The experimental results are presented in Fig. 10 and Table III.
As shown in Fig. 10, incorporating semantic features in the generative model allows the reconstructed images to have more accurate shapes, textures, and colors. For example, for the shell in the fourth column, the reconstructed image after adding semantic features is visually more similar to the original image. The attention module allows the generator to reconstruct the original image with more precise details; for the airplane and the bat, for instance, the reconstructed outline is closer to the real image after adding the attention module. The neural decoder proposed in this paper obtains more natural reconstructions than ridge regression, which is commonly used in previous methods, and achieves better performance in quantitative evaluation. This can be interpreted in two aspects: (1) Ridge regression can only capture linear relationships between fMRI patterns and DNN features, whereas the actual relationships may be complicated and nonlinear. (2) Ridge regression ignores the correlation between DNN feature units, while our decoding model is able to capture this correlation.
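The two limitations of the ridge regression baseline follow directly from its closed-form solution, which the NumPy sketch below makes explicit. The function names and the regularization strength `alpha` are illustrative assumptions; the settings of the baseline used in the ablation are not given here.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge solution W = (X^T X + alpha * I)^{-1} X^T Y.

    X: (n_samples, n_voxels) fMRI patterns.
    Y: (n_samples, n_features) target latent features.
    """
    n_voxels = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_voxels)
    return np.linalg.solve(gram, X.T @ Y)


def predict_ridge(X, W):
    """Predicted features are a purely linear function of the voxels."""
    return X @ W
```

Each column of `W` depends only on the corresponding column of `Y`, so every feature unit is fitted independently of the others, and the prediction `X @ W` is linear in the voxels. A nonlinear decoder with shared hidden layers is not subject to either restriction.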
We present the results of the quantitative evaluation in Table III, where the ridge regression decoder denotes the variant that uses ridge regression to learn the mapping from fMRI to latent features. Table III shows that the complete model achieves the best reconstruction quality and that each model component contributes to improving network performance. In particular, the introduction of category information significantly improves the reconstruction quality, which confirms the effectiveness of our method.

D. Impact of Different Loss Functions
1) Generator Loss Functions: To evaluate the effectiveness of introducing perceptual loss and adversarial loss during generator training, we trained our model using three loss functions: L_img, L_img + L_pl, and L_img + L_pl + L_adv. The reconstructed images and quantitative evaluation results are presented in Fig. 11 and Table IV, respectively. Our findings indicate that:
1) Using only the image loss results in fuzzy and difficult-to-recognize images, with the lowest recognition accuracy. This is attributed to the MSE loss function causing the reconstructed images to lose precise details from the original images.
2) The introduction of perceptual loss leads to clearer images with distinct outlines. This is because perceptual loss places more emphasis on perceptually important features (e.g., edges and textures) and is less sensitive to subtle changes in the image. In addition, the recognition accuracy of SSIM, Perceptual, AlexNet(2), and AlexNet(5) was also significantly improved, which indicates that the reconstructed image recovers most of the visual features of the original image, demonstrating the effectiveness of introducing perceptual loss.
3) Adding adversarial loss further improves the quality of the reconstructed images by forcing the generator to produce more natural-looking images. Additionally, by introducing natural image prior information through self-supervised learning, the images reconstructed by the generator become more natural and recognizable, achieving the highest recognition accuracy.
4) Under pixel-level similarity metrics, blurred images also score highly, which is inconsistent with human visual perception. For example, although the images reconstructed using the MSE loss were blurry, their HS and MI values were still high compared with those of the other experiments. Therefore, we prefer to use similarity assessment in the image feature space to measure reconstruction performance.
For the neural decoder loss functions, the results of quantitative comparisons are presented in Table V. The quality of the reconstructions is improved by introducing the reconstruction loss, and the best performance is obtained on all evaluation metrics. This is because our ultimate target is to reconstruct realistic and reliable stimulus images, and the reconstruction loss makes the latent features predicted by the decoder more suitable for generating natural images. However, when only the reconstruction loss is employed, it is difficult to ensure accurate alignment of the decoder latent space with the image encoding space, which leads to low-quality reconstruction.
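The three generator loss configurations compared in Table IV can be sketched as one function whose optional terms are switched on or off. This is a hedged illustration only: the assignment of λ₁ to the perceptual term and λ₂ to the adversarial term, the feature-space MSE used for the perceptual term, and the non-saturating form of the adversarial term are all assumptions, since formula (5) and the loss definitions are not reproduced in this section. `feat_fn` and `disc_score` are hypothetical stand-ins for a feature extractor and the discriminator outputs.

```python
import numpy as np


def generator_loss(recon, target, feat_fn=None, disc_score=None, lam1=1.0, lam2=0.01):
    """Combine image, perceptual, and adversarial terms; omitted terms are skipped."""
    # L_img: pixel-wise mean squared error.
    loss = np.mean((recon - target) ** 2)
    if feat_fn is not None:
        # L_pl (assumed form): MSE between features of reconstruction and target.
        loss += lam1 * np.mean((feat_fn(recon) - feat_fn(target)) ** 2)
    if disc_score is not None:
        # L_adv (assumed form): -log D(G(.)), with discriminator outputs in (0, 1).
        loss -= lam2 * np.mean(np.log(disc_score))
    return float(loss)
```

Calling the function with only `recon` and `target` corresponds to the L_img configuration, adding `feat_fn` corresponds to L_img + L_pl, and adding `disc_score` as well corresponds to the full loss.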

E. Effectiveness of Self-Supervised Training Strategy
In this section, we perform ablation experiments to verify the effectiveness of the self-supervised training strategy. The results of the quantitative assessment are displayed in Table VI.
The self-supervised training strategy significantly improves the reconstruction quality of the model and achieves better performance on all quantitative metrics. This indicates that jointly training the image encoder and generator on additional image data introduces the prior information of natural images, which makes the images reconstructed by the generator more natural. In addition, to explore the impact of the additional dataset on the reconstruction performance of HS-GAN, we used a larger number of images (100,000) for self-supervised training. The results in Table VI show that the reconstruction performance of the model does not change significantly when using more image data (it stays close to the performance obtained with 40,000 images). This suggests that our model is not data-hungry. Considering the computational burden of a larger dataset, we use 40,000 images for self-supervised training of the model.

F. Effectiveness of Hierarchical Features
We utilize hierarchical and semantic features of images for visual reconstruction. To demonstrate the effectiveness of merging different levels of image features, we perform comparative experiments. Qualitative and quantitative comparisons of the reconstruction results are shown in Fig. 12 and Table VII, respectively.
From Fig. 12, it can be observed that when only the semantic feature is used for reconstruction, the image quality is not satisfactory (the generated image is blurry and hard to identify) and the quantitative evaluation shows the lowest performance. We believe this is because the semantic feature loses most of the low-level features of the original image, making it difficult for the generator to recover precise details. When the visual feature z_h4 of the image is fused, more low-level features are transmitted into the reconstruction space, so the shape, contour, and color of the reconstructed image are more precise, and the model performance is significantly improved.
As shown in Table VII, as more levels of image features are introduced into the generator, the reconstruction quality is further improved. In visual comparison especially, the image reconstructed using the complete method maintains maximum consistency with the original image in terms of low-level features (shape, color, etc.). In the quantitative comparison, noise refers to the evaluation result of the reconstructed image obtained by feeding random noise from a standard Gaussian distribution into the decoder.
Overall, our complete method achieves the best performance (SSIM-Acc: 78.90%, Perceptual-Acc: 95.38%, AlexNet(2): 96.24%, AlexNet(5): 94.82%) and the recognition accuracy far exceeds that of noise-based reconstruction. This demonstrates that HS-GAN learns the complex mapping of fMRI signals to visual features of stimulus images, and the reconstruction is consistent with human visual perception.

Conclusion:
In this research, we introduced a novel approach, the semantics-guided hierarchical feature encoding Generative Adversarial Network (GAN), to address the challenge of reconstructing visual images from fMRI recordings.
Our method draws inspiration from the hierarchical encoding observed in the visual cortex and the homology of information processing between the brain and deep neural networks. Our proposed framework consists of an image feature encoder, which extracts hierarchical and semantic features from input images and encodes them as latent vectors. Subsequently, a neural decoder with residual connections is trained to learn the mapping from the fMRI signal to the image feature space. Finally, the predicted hierarchical and semantic features are combined to reconstruct the image through the generator. We validated our approach on two publicly available datasets and compared its performance with other advanced methods. By leveraging information from different visual cortex regions, our method achieved significantly improved results, yielding reconstructed visual images that are more natural and recognizable than those of previous approaches.
Discussion: This research opens up promising avenues for further advancements in the field of decoding visual information from brain activity. Although our approach achieves competitive results in the realism and semantic consistency of reconstructed images, there are still some limitations. First, due to the high cost of collecting fMRI data, it is difficult to obtain a large number of paired samples, which makes it difficult for the decoder to accurately predict the corresponding image features from the fMRI voxels. In future work, building deep learning models that can effectively understand fMRI patterns will better facilitate downstream tasks. Second, our model employs a two-stage training strategy, using latent feature vectors as the medium for the transformation from fMRI voxels to images, which reduces the dependence on paired samples. However, the two-stage approach leads to some loss of fMRI information, so realizing an end-to-end decoding model from fMRI to image still needs attention. Third, in our experiments, we observed some variation in reconstruction quality across subjects, although this is common in other decoding efforts.
In the future, we should investigate versatile cross-subject models that efficiently project fMRI representations from different subjects into the same embedding space. In addition, while existing methods have significantly improved the quality of images reconstructed from fMRI, what is the upper limit? The exploration of this question in future studies is expected to provide new insights in the field of visual decoding.

APPENDIX
Here, we describe how to use the VDVAE as a generator for visual reconstruction. The VDVAE is a hierarchical variational autoencoder model that consists of 75 layers and is pre-trained on the ImageNet dataset. Specifically, we first trained a neural decoder D_ψ to learn the mapping of fMRI voxels to the embedding space of the VDVAE encoder. Here we took the embedding vectors of the first 31 layers of VDVAE and combined them into a 91168-dimensional feature vector z. Subsequently, we fed the latent feature vector ẑ predicted by the decoder into the image decoder of the VDVAE to obtain the corresponding reconstruction. Some of the reconstructed images are shown in Fig. 13.
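The construction of the 91168-dimensional vector z can be illustrated with the NumPy sketch below, which flattens and concatenates per-layer latents and splits a predicted vector back into per-layer tensors. The per-layer shapes in `LAYER_LAYOUT` are an assumption that is not given in the text; they were chosen only because 31 layers with these shapes sum to exactly 91168 values. The helper names are likewise hypothetical.

```python
import numpy as np

# Assumed (count, channels, height, width) layout of the first 31 latent layers.
# Not taken from the paper: chosen so that the flattened sizes sum to 91168.
LAYER_LAYOUT = [(2, 16, 1, 1), (4, 16, 4, 4), (8, 16, 8, 8), (16, 16, 16, 16), (1, 16, 32, 32)]
LAYER_SHAPES = [(c, h, w) for count, c, h, w in LAYER_LAYOUT for _ in range(count)]


def flatten_latents(latents):
    """Concatenate per-layer latent tensors into a single feature vector z."""
    return np.concatenate([layer.ravel() for layer in latents])


def unflatten_latents(z, shapes=LAYER_SHAPES):
    """Split a (predicted) feature vector back into per-layer latent tensors."""
    sizes = [int(np.prod(shape)) for shape in shapes]
    offsets = np.cumsum(sizes)[:-1]  # split points between consecutive layers
    parts = np.split(z, offsets)
    return [part.reshape(shape) for part, shape in zip(parts, shapes)]
```

In this arrangement the neural decoder only needs to regress one flat vector, and `unflatten_latents` restores the layer-wise structure that a hierarchical image decoder expects.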
In addition, we present all reconstruction examples (50 categories) of Subject 3 in the Horikawa2017 dataset; see Fig. 14. From the reconstruction results, we found that our method performed well on images centered on a single object (e.g., a shell), but not satisfactorily on images with complex backgrounds (row 7, column 4). This may be attributed to the image background interfering with the subject's attention, resulting in a lower signal-to-noise ratio of the recorded fMRI signal. To explore the impact of λ₁ and λ₂ on the image reconstruction performance, we computed the LPIPS for models with different parameters λ₁ = [0.001, 0.01, ..., 10] and λ₂ = [0.001, 0.01, ..., 10], and the results on the validation set are shown in Table VIII. The best results are obtained at λ₁ = 1.0 and λ₂ = 0.01.
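The hyperparameter search over λ₁ and λ₂ amounts to a grid search that keeps the pair with the lowest validation LPIPS. The sketch below shows that selection logic under stated assumptions: `evaluate` is a hypothetical callable that trains a model with the given weights and returns its validation LPIPS, and the intermediate grid values 0.1 and 1.0 are assumed, since the text lists only 0.001, 0.01, ..., 10.

```python
import itertools

# Assumed decade-spaced grid; the source elides the values between 0.01 and 10.
GRID = [0.001, 0.01, 0.1, 1.0, 10.0]


def grid_search(evaluate, lam1_grid=GRID, lam2_grid=GRID):
    """Return the (lam1, lam2) pair with the lowest score, plus all scores."""
    scores = {
        (lam1, lam2): evaluate(lam1, lam2)
        for lam1, lam2 in itertools.product(lam1_grid, lam2_grid)
    }
    best = min(scores, key=scores.get)  # lower LPIPS is better
    return best, scores
```

Because every pair requires a full training run, the cost grows with the product of the two grid sizes, which is one reason coarse logarithmic grids are a common choice for loss weights.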

Fig. 1. The cortical surface map of the brain.

Fig. 2. The visual reconstruction framework proposed in this study. (a) Reconstruct the perceived images from fMRI recordings. (b) Review our overall framework.

Fig. 4. The structure of image generator, where c denotes concatenation in the channel dimension and ConvT block includes a transpose convolution layer, batch normalization, and ReLU activation.

$$H(A, B) = -\sum_{a,b} p_{AB}(a, b) \log p_{AB}(a, b)$$

Fig. 6. The loss curves of the generator and discriminator during training process.

Fig. 8. Qualitative comparison of different methods to reconstruct natural images on the Horikawa2017 dataset.

Fig. 10. Qualitative comparative results of ablation experiments with different model components.

Fig. 11. Qualitative comparison of the generator using different loss functions.

2) Neural Decoder Loss Functions: To demonstrate the effectiveness of introducing reconstruction loss in the training process of the neural decoder, ablation experiments with different loss functions of the neural decoder are performed in this section.

TABLE I QUANTITATIVE COMPARISON OF RECONSTRUCTION RESULTS OBTAINED USING FMRI OF SUBJECT 3

TABLE II QUANTITATIVE COMPARISON OF RECONSTRUCTION QUALITY FOR VANGERVEN2010 DATASET, WHERE THE BEST RESULTS ARE HIGHLIGHTED IN BOLD. (↑: THE HIGHER THE VALUE, THE BETTER THE RECONSTRUCTION PERFORMANCE OF THE METHOD)

TABLE IV QUANTITATIVE EVALUATION OF DIFFERENT LOSS FUNCTIONS FOR THE GENERATOR

TABLE V QUANTITATIVE EVALUATION OF DIFFERENT LOSS FUNCTIONS FOR THE DECODER
and Table IV, respectively. Our findings indicate that: 1) Using only image loss results in fuzzy and difficult-to-recognize images, with the lowest recognition accuracy. This is attributed to the MSE loss function causing the reconstructed images to lose precise details from the original images. 2) The introduction of perceptual loss leads to clearer images with distinct outlines. This is because perceptual loss places more emphasis on perceptually

TABLE VI QUANTITATIVE EVALUATION OF SELF-SUPERVISED TRAINING STRATEGY

Fig. 12. Qualitative comparisons of reconstruction results by fusing different levels of features.

TABLE VII QUANTITATIVE EVALUATION INCORPORATING DIFFERENT LEVELS OF FEATURES

TABLE VIII LPIPS COMPARISON WITH VARIOUS λ₁ AND λ₂ (THE LOWER THE VALUE, THE BETTER)