NeuroSeg-III: efficient neuron segmentation in two-photon Ca2+ imaging data using self-supervised learning

Two-photon Ca2+ imaging technology increasingly plays an essential role in neuroscience research. However, the requirement for extensive professional annotation poses a significant challenge to improving the performance of neuron segmentation models. Here, we present NeuroSeg-III, an innovative self-supervised learning approach specifically designed to achieve fast and precise segmentation of neurons in imaging data. This approach consists of two modules: a self-supervised pre-training network and a segmentation network. After pre-training the encoder of the segmentation network via a self-supervised learning method without any annotated data, we only need to fine-tune the segmentation network with a small amount of annotated data. The segmentation network is designed with YOLOv8s, FasterNet, an efficient multi-scale attention mechanism (EMA), and a bi-directional feature pyramid network (BiFPN), which enhance the model's segmentation accuracy while reducing the computational cost and parameters. The generalization of our approach was validated across different Ca2+ indicators and scales of imaging data. Significantly, the proposed neuron segmentation approach exhibits exceptional speed and accuracy, surpassing the current state-of-the-art benchmarks when evaluated on a publicly available dataset. The results underscore the effectiveness of NeuroSeg-III, which employs an efficient training strategy tailored to two-photon Ca2+ imaging data and delivers remarkable precision in neuron segmentation.


Introduction
The utilization of two-photon microscopy and Ca2+ indicators for imaging neuronal activity is crucial in contemporary neuroscience research [1][2][3][4][5][6]. The advancement in imaging techniques allows high-speed, large-field-of-view recording in vivo, and it has resulted in an influx of experimental data requiring efficient processing [7,8]. A comprehensive pipeline for handling two-photon imaging data has been established, encompassing video denoising, registration, neuron segmentation, and extracting Ca2+ signals from the monitored neurons [9,10]. Precise neuron segmentation is a necessary prerequisite for analysis, yet neuroscientists typically must undertake the laborious task of manually annotating these data. Although manual segmentation remains the gold standard ground truth (GT) for determining whether segmented areas are neurons, it is an inefficient and labor-intensive solution. Previously, numerous automated segmentation methods have been developed to address this issue, capable of performing segmentation in real time with promising accuracy.
As deep learning continues to gain traction in the field of neuroscience, neuron segmentation methods can broadly be classified into unsupervised and supervised approaches, depending on their reliance on manual labels. Among the unsupervised learning methods, some obtain neuron positions by signal filtering in each frame of the video. These methods employ pixel-value thresholding, convolution with specific filters, graph-based clustering, or combinations thereof to emphasize the structural features of neuron boundaries, which facilitates neuron segmentation [11][12][13][14][15]. Another line of unsupervised work uses matrix factorization approaches. In previous studies, principal component/independent component analysis (PCA/ICA) [16], non-negative matrix factorization (NMF) [17], constrained non-negative matrix factorization (CNMF) [18], Suite2p [19], and online analysis of streaming Ca2+ imaging data (OnACID) [20] have been developed for neuron segmentation. Although the unsupervised learning methods mentioned above do not require manual labeling, further improvements are needed in terms of segmentation accuracy and speed.
For the category of supervised learning methods, robust classifiers are trained for neuron segmentation using manually annotated masks. Typically, these methods require a substantial quantity of manual annotations for training; they extract features from the training data, i.e., labelled image (2D) or video (3D) data, and generalize to new data with similar characteristics. Convolutional neural networks (CNNs) and the U-Net architecture are widely employed in supervised learning. Specifically, CNNs are typically used for pixel classification [21], while the U-Net architecture is commonly applied to image segmentation tasks (e.g., UNet2Ds and Shallow U-Net Neuron Segmentation (SUNS)) [22][23][24]. Additionally, techniques such as our previous work utilize a network based on Mask-RCNN to extract neuronal features from 2D two-photon Ca2+ imaging data [25,26]. The 2D projected image approach offers the benefit of processing speed but faces the drawback of information loss, because the projected images may not effectively capture two kinds of neurons: those with slow firing rates and those with fluorescence emissions weaker than the baseline fluorescence or background. In contrast to approaches focused on 2D image processing, spatiotemporal methods that process 3D video offer the potential for improved accuracy in detecting sparsely firing and overlapping neurons, albeit at the cost of increased computational complexity. One notable example is STNeuroNet, an end-to-end model that utilizes a 3D CNN to segment active neurons [27]. Another approach, known as CaImAn, combines both unsupervised and supervised learning paradigms. It employs unsupervised learning to identify active components and then utilizes supervised learning to refine these components [28].
To leverage the high segmentation accuracy of supervised learning methods while minimizing the need for extensive manual labeling, we propose NeuroSeg-III, a novel self-supervised learning approach specifically designed to enhance neuron segmentation. This approach innovatively combines a self-supervised pre-training model with an improved segmentation network, leveraging the transformation invariance and covariance contrast (TiCo) learning method to reduce reliance on annotated data [29]. Our segmentation network incorporates YOLOv8s, FasterNet [30], the EMA attention mechanism [31], and BiFPN [32] to boost accuracy while reducing computational load. We enrich the input data with spatiotemporal information by fusing maximum projection and correlation map images [19,26]. NeuroSeg-III stands out for its generalizability across various Ca2+ indicators and imaging scales, outperforming existing methods in segmentation speed and accuracy.

Dataset from our laboratory
In this study, we conducted two-photon Ca2+ imaging experiments using C57BL/6J mice provided by the Laboratory Animal Center at the Third Military Medical University. The experimental procedures were conducted in accordance with protocols approved by the Third Military Medical University Animal Care and Use Committee.
During the experiments, we first exposed the auditory cortex region [33,34], followed by injection of the Ca2+ indicator (GCaMP6f, Cal-520 AM, or OGB-1 AM) into the same area. Following a 2-hour incubation period, Ca2+ imaging was performed and imaging data were recorded with a custom-built two-photon microscope system (LotosScan, Suzhou Institute of Biomedical Engineering and Technology, Suzhou, China) [35,36].
In total, 212 imaging videos (OGB-1: 61 samples; Cal-520: 132 samples; GCaMP6f: 19 samples) were generated in our laboratory. Three skilled annotators individually labelled each neuron, and the resulting labels were compared to generate a final consensus, which served as the ground truth, encompassing 30-240 neurons in each imaging plane.

Dataset from Allen Brain Observatory (ABO)
There are two groups of ABO datasets utilized in this work. The first group was used for mixed training with the dataset from our lab to demonstrate the segmentation model's generalization capability, while the second group was employed for comparison with other segmentation methods. The ABO-mixed dataset comprises 132 images extracted from 71 two-photon videos. These videos cover various brain regions and layers. Specifically, there are 66 images captured at an imaging depth of 175 µm and another 66 images obtained at an imaging depth of 275 µm. For the data acquired at 175 µm, there are 17 images recorded from primary visual cortex (VISp), 12 images from posteromedial visual cortex (VISpm), 12 images from lateral visual cortex (VISl), 6 images from rostrolateral visual cortex (VISrl), 6 images from anteromedial visual cortex (VISam), and 13 images from anterolateral visual cortex (VISal). As for the images at 275 µm, there are 24 images from VISp, 9 images from VISpm, 15 images from VISl, and 18 images from VISal. Three skilled annotators individually labelled each neuron, and the resulting labels were compared to generate a final consensus as GT.
The second group included 10 videos acquired at an imaging depth of 275 µm and 10 videos acquired at an imaging depth of 175 µm using two-photon microscopy (VISp, Experiment IDs: 501271265, 501484643, 501574836, 501704220, 501729039, 501836392, 502115959, 502205092, 502608215, 503109347, 504637623, 510214538, 510514474, 510517131, 524691284, 527048992, 531006860, 539670003, 540684467, 545446482). Each frame of the 20 ABO videos was cropped from 512 × 512 pixels to 487 × 487 pixels, removing the black boundary regions. The ground truth used in this dataset was carefully proofread from the work of STNeuroNet [27]. To assess the efficacy of neuron segmentation methods, the aforementioned dataset was utilized in a two-round generalization cross-validation that considered the different recording depths. Specifically, the 10 ABO 275 µm videos and the 10 ABO 175 µm videos were alternately employed as the training and validation datasets: in one round, the 175 µm videos were used for training and the 275 µm videos for validation, and in the other round the roles were swapped.
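As an illustration of this cropping step (the crop offset below is our assumption; the paper states only that the black boundary regions were removed):

```python
import numpy as np

def crop_black_border(frame, out_size=487):
    # The paper states only that 512 x 512 frames were cropped to 487 x 487
    # to remove black boundary regions; taking the top-left corner here is an
    # illustrative assumption -- the real offset depends on where the border lies.
    return frame[:out_size, :out_size]

video = np.zeros((5, 512, 512), dtype=np.float32)  # five dummy frames
cropped = np.stack([crop_black_border(f) for f in video])
```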
All mice in both groups of the ABO dataset expressed the GCaMP6f indicator, and each video in the ABO dataset contains approximately 100-400 neurons.

Data preprocessing
In our previous work, we employed an image fusion strategy [26]. However, considering the imaging characteristics of the ABO dataset, we opted to fuse the maximum projection, rather than the average projection, with a correlation map [37][38][39][40]. The final image was produced by linearly weighting the maximum projection and the correlation map at a ratio of 1:1, followed by normalization. Here, we created the correlation map by calculating the weighted multi-dimensional correlation c_w of each pixel's trace with its neighboring pixels' traces f_i, where g_i denotes the Gaussian kernel used for filtering; this process can be interpreted as performing a Gaussian filtering operation on f_i or ∥f_i∥2 [19]. The correlation map values indicate potential neuron locations. These fused images were used as image data for training and validation with the proposed method. Considering the requirements for training and computational costs, we performed a maximum projection of the raw two-photon videos every 20 frames, equivalently down-sampling the original data to 1/20th of its initial frame rate. This maximum-projection dataset was used as the training set for self-supervised learning.
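Assuming the video is a (T, H, W) array, the preprocessing above can be sketched as follows (function and variable names are ours; the correlation map here uses a simple 4-neighbour correlation as a stand-in for the Gaussian-weighted multi-dimensional correlation of Suite2p [19]):

```python
import numpy as np

def fuse_projection_and_correlation(video, window=20):
    """Sketch of the preprocessing: per-window max projection for the
    self-supervised set, plus a fused (max projection + correlation map) image."""
    # Temporal max projection every `window` frames (downsampling to 1/window).
    t = (video.shape[0] // window) * window
    max_stack = video[:t].reshape(-1, window, *video.shape[1:]).max(axis=1)

    max_proj = video.max(axis=0)  # overall maximum projection

    # Stand-in correlation map: correlation of each pixel's trace with the
    # mean trace of its 4-neighbourhood (simplified from the paper's method).
    v = video - video.mean(axis=0)
    neigh = (np.roll(v, 1, 1) + np.roll(v, -1, 1) +
             np.roll(v, 1, 2) + np.roll(v, -1, 2)) / 4.0
    num = (v * neigh).sum(axis=0)
    den = np.sqrt((v ** 2).sum(axis=0) * (neigh ** 2).sum(axis=0)) + 1e-8
    corr_map = num / den

    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    # 1:1 linear weighting of the two images, then normalization.
    fused = norm(0.5 * norm(max_proj) + 0.5 * norm(corr_map))
    return fused, max_stack
```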

Framework of NeuroSeg-III
NeuroSeg-III consists of two major parts: the first is a self-supervised learning network (TiCo) that acquires intricate feature representations from unlabeled data samples [29]; the second is a segmentation network, improved from YOLOv8s, that is fine-tuned to perform neuron segmentation [30][31][32]. Figure 1 illustrates the framework of the proposed approach.

Self-supervised learning
The self-supervised learning network is presented in Fig. 1(A). This technique simultaneously optimizes the objectives of transformation invariance and covariance contrast, thereby efficiently regularizing the covariance matrix of the embeddings. It serves a dual purpose, functioning both as a contrastive learning method and as a redundancy-reduction method. After stochastic data augmentation produces two distorted views x′_i and x′′_i of each image, two identical encoders f_θ and f_ξ (the backbone module of the segmentation network) and two projectors g_θ and g_ξ, whose parameters θ and ξ are not shared directly, generate the feature representations z′_i = g_θ(f_θ(x′_i)) and z′′_i = g_ξ(f_ξ(x′′_i)). The projection network utilizes the feature maps produced by the backbone module of the segmentation network (Fig. 1(B) and Fig. 2). It incorporates adaptive average pooling, ReLU activation, batch normalization (BN), and fully connected (FC) layers. The ultimate feature representations are produced through an additional FC layer.
In the settings of TiCo, we used a momentum encoder technique, i.e., only the parameter θ is updated through backpropagation, while the parameter ξ is updated as the exponential moving average of θ:

ξ_t = αξ_{t−1} + (1 − α)θ_t,

where α ∈ [0, 1] is a hyperparameter and t is the time step. During training, the covariance matrix C_t is updated at time step t:

C_t = βC_{t−1} + (1 − β)(1/N) Σ_{i=1}^{N} z′_i(z′_i)^T,

where β ∈ [0, 1] is a hyperparameter. Hence, the loss function is designed to jointly optimize two objectives:

L_TiCo = −(1/N) Σ_{i=1}^{N} (z′_i)^T z′′_i + (ρ/N) Σ_{i=1}^{N} (z′_i)^T C_t z′_i,

where ρ > 0 weights the second term [29]. The first term aims to minimize the difference between embeddings of various data augmentations of the same image. The second term aims to constrain each vector towards the subspace associated with smaller eigenvalues of the covariance matrix.
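These updates can be sketched in NumPy (variable names are ours, and this is an illustration of the TiCo formulation [29] rather than the authors' implementation):

```python
import numpy as np

def momentum_update(theta, xi, alpha=0.99):
    # xi_t = alpha * xi_{t-1} + (1 - alpha) * theta_t  (EMA of online params)
    return {k: alpha * xi[k] + (1.0 - alpha) * theta[k] for k in theta}

def tico_step(z1, z2, C_prev, beta=0.9, rho=1.0):
    """One TiCo loss evaluation. z1, z2: (N, D) L2-normalized embeddings
    of the two augmented views; C_prev: (D, D) running covariance."""
    N = z1.shape[0]
    # Moving-average covariance of the first view's embeddings.
    C = beta * C_prev + (1.0 - beta) * (z1.T @ z1) / N
    invariance = -np.mean(np.sum(z1 * z2, axis=1))           # pull views together
    contrast = rho * np.mean(np.sum((z1 @ C) * z1, axis=1))  # penalize dominant directions
    return invariance + contrast, C
```

The quadratic-form contrast term pushes embeddings away from directions already heavily occupied in the covariance matrix, which is what prevents representational collapse without negative pairs.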

Segmentation network
The fast development of two-photon Ca2+ imaging technology and Ca2+ indicators poses challenges to the accuracy and speed of neuron segmentation [5,6]. The newly developed YOLO (You Only Look Once) models offer several advantages, including compact model size, rapid processing, and exceptional accuracy [41], and they have been extensively applied in the domain of object detection [42][43][44]. In this study, we adopted the latest YOLOv8s model, considering the balance between hardware capabilities and segmentation accuracy. The original YOLOv8s backbone architecture consists of Conv, C2f, and SPPF modules, wherein the C2f module plays a vital role in learning residual features.
Within the backbone structure, we introduced a modification by replacing the original C2f module with the enhanced C2f-Faster-EMA module. This hybrid module combines the functionality of the C2f module with FasterNet and integrates the EMA attention mechanism (Fig. 2(A)). Furthermore, we opted not to replace the C2f module in the neck module. Instead, we enhanced the neck module by incorporating a BiFPN (Fig. 2(A-C)) in place of the original PANet structure, enabling effective cross-scale connections and weighted feature fusion. Notably, no modifications were made to the segment head module (Fig. 2(D)). These improvements simultaneously reduce the parameter count and enhance the model's segmentation performance through the incorporation of an attention mechanism. The BiFPN in the neck module of YOLOv8s deepens information mining and further improves the model's multi-scale neuronal feature extraction capability while reducing the model's parameters. Although integrating the attention mechanism increased the computational complexity of the model, it proved most effective during image feature extraction. Each Faster-EMA block consists of a PConv layer [30] followed by two Conv layers and the EMA (Fig. 2(B)). As input to the network, we use the fused image obtained from the maximum projection and correlation map; the segmentation result of the network is illustrated in Fig. 2(E).
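The weighted feature fusion that BiFPN performs at each node can be illustrated with its fast normalized fusion rule [32] (a sketch of the fusion rule only, not of the full neck topology):

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """BiFPN-style weighted fusion: each input feature map gets a learnable
    non-negative weight, normalized so the weights sum to approximately 1."""
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU keeps w >= 0
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))
```

Because the normalization avoids a softmax, this fusion stays cheap while still letting the network learn how much each resolution contributes at every cross-scale connection.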
We visualized how the attention module (Fig. 3(A)) highlights specific features to demonstrate the necessity of the EMA module. Figure 3(B) shows that, across different Ca2+ indicators and varying numbers of neurons (n = 7, 19, 34, 70) in the imaging data, including the EMA mechanism led to a significant increase in the deep red area on the map compared with the model lacking the attention mechanism, which failed to focus on neurons. This indicates that the network, when integrated with EMA, effectively learned to use information from neuron regions and consolidate features from these regions.

Data augmentation
We employed data augmentation strategies before feeding the data into TiCo and during the training of the segmentation network. Here we used the same augmentation strategy as TiCo. Each input image after the maximum projection is transformed twice to produce the two distorted views shown in Fig. 1(A). The augmentation pipeline consists of random cropping (probability 1.0), resizing to 224 × 224 (probability 1.0), random horizontal flipping (probability 0.5), Gaussian blurring (probability 0.5), and solarization (probability 1.0). During the training process of the improved YOLOv8s model, the image augmentation pipeline consists of the following transformations: mosaic data augmentation, random blurring, median blurring, random brightness and contrast changes, contrast-limited adaptive histogram equalization, and image compression. The first transformation (mosaic data augmentation) is always applied, while the remaining transformations are applied randomly, each with a probability of 0.05.
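The probabilistic pipelines above can be sketched as a compose-style helper (the transform bodies below are placeholders, not the actual crop/flip/blur implementations):

```python
import random

def make_pipeline(transforms):
    """Apply each (transform, probability) pair in order -- a sketch of the
    probabilistic augmentation pipelines described above."""
    def apply(image, rng=random):
        for fn, p in transforms:
            if rng.random() < p:
                image = fn(image)
        return image
    return apply

# Toy transforms standing in for crop/flip/blur/solarize etc.; a real pipeline
# would operate on image arrays rather than tag lists.
pipeline = make_pipeline([
    (lambda im: im + ["crop"], 1.0),  # always applied
    (lambda im: im + ["flip"], 0.5),  # applied half the time
    (lambda im: im + ["blur"], 0.5),
])
```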

Model training with TiCo
To train our proposed model, we opted for the SGD optimizer and used a fixed learning rate of (0.2 × batch size/256). Due to GPU memory limitations, we selected a batch size of 128. Pre-training was conducted using the maximum projection dataset, which was extracted from the 20 ABO videos (10 videos at 175 µm and another 10 videos at 275 µm) used for comparing the neuron segmentation methods. This dataset was sampled at a rate of 1/20 and did not use any annotations. In total, we trained for 1000 epochs, selecting the model with the lowest loss as our pre-trained model to be transferred to the segmentation network.
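The learning-rate rule above is the standard linear scaling heuristic; a one-line helper makes the arithmetic explicit:

```python
def scaled_lr(batch_size, base_lr=0.2, base_batch=256):
    # Linear scaling rule stated above: lr = 0.2 * batch_size / 256.
    return base_lr * batch_size / base_batch
```

With the batch size of 128 used here, this gives 0.2 × 128 / 256 = 0.1.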

Model training with mixed dataset
To evaluate the performance of the model modifications compared to the original YOLOv8s, we utilized a dataset containing 344 images (212 from our lab and 132 from the ABO). These images were divided into three distinct sub-datasets (training set: 270 images; validation set: 37 images; testing set: 37 images). The segmentation network underwent training for 200 epochs (batch size: 8). The training phase incorporated an early-stopping strategy to improve the model's generalization ability. During the final 10 epochs, mosaic data augmentation was disabled to further enhance the model's performance [45].
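The paper reports an early-stopping strategy without specifying its exact criterion; a minimal patience-based sketch (our assumption) could look like:

```python
class EarlyStopper:
    """Stop training when a validation metric has not improved for
    `patience` consecutive epochs (criterion assumed, not from the paper)."""
    def __init__(self, patience=20, mode="max"):
        self.patience, self.mode = patience, mode
        self.best, self.bad_epochs = None, 0

    def step(self, metric):
        improved = (self.best is None or
                    (metric > self.best if self.mode == "max" else metric < self.best))
        if improved:
            self.best, self.bad_epochs = metric, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training
```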

Model training with the ABO dataset
To evaluate neuron segmentation performance, all methods underwent two-round generalization cross-validation using the 20 ABO videos. The methods evaluated included CITE-On [46], SUNS [24], STNeuroNet [27], the classifier of CaImAn [28], and the ROI classifiers in Suite2p [19]; these models were trained and tested using the video data, and all were optimized according to their original publications. For the training and validation of NeuroSeg-III, we employed image data. The best result of NeuroSeg-III (split = 8) was compared with the other neuron segmentation methods.
The training of NeuroSeg-III was conducted as follows. First, we trained the encoder of the segmentation network with the self-supervised learning network (TiCo). Subsequently, we initialized the encoder of the segmentation network using a transfer learning strategy and then fine-tuned all the weights of the segmentation network. During fine-tuning, we again implemented an early-stopping strategy and disabled data augmentation during the final 10 epochs.

Evaluation metrics
Three metrics (precision, recall, and F1-score) were employed [27] for the evaluation of segmentation performance:

Precision = N_TP / N_detected, Recall = N_TP / N_GT, F1 = 2 × Precision × Recall / (Precision + Recall),

where N_GT is the number of GT neurons, N_TP is the number of true-positive neurons, and N_detected is the number of detected neurons. The degree of overlap between a detected neuron and the GT masks is quantified using the Intersection-over-Union (IoU) metric, measured as:

IoU(m_1, m_2) = |m_1 ∩ m_2| / |m_1 ∪ m_2|,

where m_1 and m_2 are two binary masks. The distance (Dist) between masks is calculated as:

Dist(M_j, m_GT^i) = 1 − IoU(M_j, m_GT^i),

where M_j and m_GT^i are the masks for the detected and GT neurons, respectively. Subsequently, the Hungarian algorithm was utilized for the matching calculations.
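These definitions can be written out directly. The sketch below brute-forces the optimal mask assignment for small examples, whereas the paper applies the Hungarian algorithm to the Dist = 1 − IoU cost matrix; both yield the matching that maximizes total IoU.

```python
import numpy as np
from itertools import permutations

def iou(m1, m2):
    # IoU(m1, m2) = |m1 AND m2| / |m1 OR m2| for two binary masks.
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union else 0.0

def _best_pairs(detected, gt):
    # Exhaustive assignment maximizing total IoU (illustrative only;
    # use the Hungarian algorithm for realistic neuron counts).
    n_det, n_gt = len(detected), len(gt)
    if n_det >= n_gt:
        cands = ([(p[g], g) for g in range(n_gt)]
                 for p in permutations(range(n_det), n_gt))
    else:
        cands = ([(d, p[d]) for d in range(n_det)]
                 for p in permutations(range(n_gt), n_det))
    return max(cands, key=lambda ps: sum(iou(detected[d], gt[g]) for d, g in ps))

def match_and_score(detected, gt, iou_thresh=0.5):
    # A matched pair counts as a true positive when its IoU clears the threshold.
    n_tp = sum(iou(detected[d], gt[g]) >= iou_thresh
               for d, g in _best_pairs(detected, gt))
    precision = n_tp / len(detected) if detected else 0.0
    recall = n_tp / len(gt) if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```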

Results
The data preprocessing, model training, and testing were performed on a device equipped with an Intel Xeon Gold 6258R CPU, an NVIDIA RTX A6000 GPU, and 640 GB of RAM.

C2f-Faster-EMA and BiFPN improved segmentation accuracy while reducing the model's parameters
To demonstrate the efficacy of the model improvements, we carried out ablation experiments in which we selectively removed the FasterNet, EMA, and BiFPN components from the enhanced segmentation network. The results, illustrated in Fig. 4(A), indicate that our enhanced model (0.9394 ± 0.0288) exhibits a significant increase in F1-score compared to the original YOLOv8s (0.9254 ± 0.035, P < 0.0001). Notably, the EMA module has the most pronounced impact on model accuracy. While the FasterNet and BiFPN modules also contribute a slight improvement in accuracy, their primary contribution lies in reducing the model's computational cost (GFLOPS) and parameters (Fig. 4(B)). This reduction facilitates faster training and shorter inference times, making it easier to generalize to new data and to deploy the model on mobile computing devices, especially when hardware resources are limited. Although the EMA structure introduces a slight increase in computational cost and parameters, this increment is negligible compared to its enhancement of segmentation accuracy.

Self-supervised module improved the model's segmentation accuracy while reducing the reliance on GT
Existing neuron segmentation methods based on supervised learning often require a large amount of labelled data to achieve good performance. How, then, can we achieve comparable performance with a small amount of labelled data? To tackle this challenge, we investigated the potential of integrating self-supervised learning techniques with the segmentation network. We divided each of the 20 ABO videos into 1 to 10 segments. For each segment of a video, we generated a merged image by combining features from the maximum projection and correlation maps.
Each training set was composed of merged images created from the segmented video segments plus 10 merged images generated from complete videos. Additionally, each validation dataset consisted of 10 merged images generated from complete videos. We meticulously verified the ground truth for each image. In total, we conducted 10 groups of two-round generalization cross-validation experiments to assess the impact of self-supervised pre-training on the segmentation model's performance. From Fig. 4(C), it is evident that the model trained with pre-trained weights consistently outperforms the model trained without them. The improvement is larger when less labelled data is available, and is most pronounced at split = 1, where loading self-supervised pre-trained weights has the greatest impact. When split = 5, the segmentation network with TiCo pre-training (0.8787 ± 0.226) achieved an F1-score similar to that of split = 10 without pre-trained weights (0.8781 ± 0.223), nearly halving the amount of training data required. This demonstrates that the framework of NeuroSeg-III, which combines the self-supervised module with the segmentation network, can achieve high accuracy with fewer training samples, significantly reducing the labor of labeling data.

Segmentation network achieved precise and generalized neuron segmentation
By training the segmentation network of NeuroSeg-III with a mixed dataset, we evaluated its performance on generalized neuron segmentation tasks. The dataset encompassed images derived from various Ca2+ indicators, imaging scales and depths, brain regions, and neuron activity patterns. This enabled the proposed model to undergo extensive learning and acquire generalized neuron segmentation capabilities.
After completing the hybrid training, we evaluated the model's performance on two-photon imaging datasets, which included three different Ca2+ indicators and imaging scales [47][48][49]. The results, visualized in Fig. 5, demonstrate the segmentation network's exceptional performance across different Ca2+ indicators and imaging scales. Furthermore, we examined the activities of the segmented neurons, as illustrated in Figs. 5(A-C). Our model effectively identified active neurons, indicating that the spatiotemporal information was integrated successfully by the image fusion preprocessing, thereby contributing to good segmentation performance. Consequently, the learned segmentation network of NeuroSeg-III amalgamates diverse neuron characteristics from imaging data, laying the groundwork for generalized neuron segmentation tasks.
In addition to achieving higher accuracy, NeuroSeg-III also had a faster inference speed (Fig. 6(D)) than SUNS, albeit not statistically significantly (P = 0.0501); it processed videos at about 2 ms/frame and was significantly (P < 0.0001) faster than the other methods.

Discussion
Self-supervised learning methods enable the learning of data features without the need for labelled data, representing a future trend in deep learning. We chose TiCo as our self-supervised learning network, which is based on a combination of contrastive learning and redundancy reduction. To enhance the information contained in each training image and reduce the training time, we performed a maximum projection every 20 frames, which is equivalent to subsampling the original video. Additionally, we made slight adjustments to the training parameters specifically tailored to the features of the imaging data. The evaluation results (Fig. 4(C)) indicate that our trained self-supervised model could efficiently capture the data features of the ABO dataset. It improved the segmentation performance of the model regardless of the amount of training data available for the segmentation network, and this improvement was particularly pronounced for smaller subsets of training images. Moreover, using only half of the ground-truth data for training, we achieved the best results, outperforming the segmentation network trained without pre-training.
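The every-20-frames maximum projection amounts to a reshape-and-reduce over the video tensor. A minimal sketch, in which the block size n is the only parameter and dropping an incomplete tail batch is our assumption:

```python
import numpy as np

def max_projection_batches(video, n=20):
    """Collapse every n frames of a (T, H, W) video into one maximum
    projection, yielding T // n images for self-supervised pre-training.

    Frames beyond the last full block of n are dropped (an assumption)."""
    T = (video.shape[0] // n) * n
    batched = video[:T].reshape(-1, n, *video.shape[1:])
    return batched.max(axis=1)  # shape (T // n, H, W)
```

A 40-frame video thus becomes two training images, each summarizing 20 frames of activity.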
The proposed segmentation network was built upon YOLOv8s and underwent the following modifications: the backbone module incorporates FasterNet and the EMA attention mechanism, while the neck module utilizes BiFPN connections. Before training the segmentation network, we applied preprocessing operations similar to our previous work [26]. However, in this case, we tailored the image fusion to the imaging characteristics of the ABO dataset by combining maximum projection, instead of average projection, with correlation maps. This operation allows each 2D input image to contain more spatiotemporal information. We performed ablation experiments to individually demonstrate the effectiveness of each of our improvement modules. Combining FasterNet and BiFPN effectively reduces the model's GFLOPS and parameters (Fig. 4(B)), enhancing computational efficiency. EMA, a novel and efficient multi-scale attention mechanism, introduces a restructuring approach wherein a portion of the channels is reorganized into the batch dimension, while the channel dimension is divided into multiple sub-features. This restructuring ensures that spatial semantic features are evenly distributed within each feature group, effectively retaining channel-level information while minimizing computational overhead. As a result, EMA not only significantly enhances the segmentation performance of the model but also requires only a negligible increase in computational cost. In the visualized heatmap (Fig. 3(B)), it is evident that EMA focuses more on the neuron regions across various neuron counts and imaging scales in two-photon Ca2+ imaging data. Combining the results from the ablation experiments (Fig. 3(A)), it is evident that FasterNet, EMA, and BiFPN, when integrated with YOLOv8s, have a synergistic effect, further enhancing the model's segmentation capability.
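The channel-to-batch restructuring that EMA performs can be illustrated with pure shape bookkeeping. The sketch below shows only the regrouping step, not the attention computation itself; the group count is an assumed hyperparameter.

```python
import numpy as np

def regroup_channels(x, groups):
    """EMA-style restructuring: fold sub-feature groups out of the channel
    dimension into the batch dimension, so that subsequent attention is
    computed independently per group.

    x: feature map of shape (B, C, H, W); C must be divisible by groups.
    Shape bookkeeping only -- the attention itself is omitted."""
    b, c, h, w = x.shape
    assert c % groups == 0, "channels must divide evenly into groups"
    # (B, C, H, W) -> (B*G, C//G, H, W): each group is treated as a sample.
    return x.reshape(b * groups, c // groups, h, w)
```

Because each group now looks like an independent sample, per-group statistics (pooling, softmax weighting) come for free from ordinary batched operations, which is what keeps the extra computational cost negligible.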

Conclusion
In this study, we propose an efficient automated neuron segmentation approach named NeuroSeg-III. The approach, based on self-supervised learning, consists of two modules: a self-supervised network and an improved YOLOv8s segmentation network. We used the self-supervised network to pre-train the encoder by learning unsupervised feature representations of the data. Subsequently, we fine-tuned the segmentation network for the downstream neuron segmentation task with a limited number of labelled data samples. Our approach is a generalized model that achieves promising segmentation results across various brain regions, imaging depths, calcium indicators, and scales of two-photon imaging data. Furthermore, it does not require complex parameter tuning. When testing the segmentation performance of NeuroSeg-III on a public dataset (ABO), our approach outperformed the state-of-the-art methods (CITE-On, Suite2p, CaImAn, STNeuroNet, and SUNS), achieving the highest precision and F1-score. Additionally, NeuroSeg-III not only had the highest segmentation accuracy but was also much faster than the other methods.
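The pre-train-then-fine-tune workflow described above amounts to copying encoder weights into the segmentation backbone before fine-tuning. A dict-based sketch, in which checkpoints are represented as plain key-to-array mappings and the `"backbone."` key prefix is a hypothetical naming convention, not the authors' actual checkpoint format:

```python
def transfer_backbone(pretrained, segmentation, prefix="backbone."):
    """Copy pre-trained encoder weights into a segmentation model's
    backbone, leaving head weights untouched.

    pretrained:   dict of encoder weights from self-supervised training
    segmentation: dict of the full segmentation model's weights
    prefix:       hypothetical key prefix marking backbone parameters."""
    transferred = dict(segmentation)  # keep head weights as initialized
    for key, value in pretrained.items():
        if prefix + key in transferred:
            transferred[prefix + key] = value  # overwrite backbone weights
    return transferred
```

After the transfer, only the (typically small) labelled dataset is needed to fine-tune the whole model end to end.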
In the future, we plan to develop an end-to-end neuron segmentation approach based on self-supervised learning.With such an approach, there would be no need for weight transfer; instead, the model could be trained directly on raw data to extract neuron features and perform neuron segmentation.

FIG. 1 .
FIG. 1. The framework of NeuroSeg-III. (A) Schematic of the self-supervised learning process based on TiCo. The input of TiCo is a series of maximum projection images generated from each video batch (n = 20 frames). // indicates a stop-gradient to halt backpropagation, • symbolizes a bifurcation, ⊕ represents an addition, △ denotes multiplication by a scalar value, and Δt signifies a temporal delay of one unit. Contractions legend: AAP: adaptive average pooling; Proj: projector; FC: fully connected layer. (B) Schematic of the proposed neuron segmentation method utilizing the YOLOv8s architecture. The input of the backbone module is derived from the fusion of a maximum projection and a correlation map, which complement each other's spatiotemporal information. We transfer the pre-trained weights from TiCo to initialize the weights of the backbone module in the training process.

FIG. 2 .
FIG. 2. Elaborate network architecture of NeuroSeg-III. (A) Overview of the improved architecture of YOLOv8s. The block with a light lavender-gray background depicts the backbone module, and the block with a pale orange background depicts the neck module. In the backbone module, we modified the C2f block by incorporating the FasterNet block and the EMA attention mechanism. For the neck module, BiFPN is adopted to allow efficient multi-scale feature fusion, and the C2f block proposed in YOLOv8s continues to be used. (B) The structures of the Faster-EMA, Bottleneck, CBS, and SPPF blocks used in the segmentation network; ⊕ denotes tensor concatenation. The Faster-EMA and Bottleneck blocks take input from the split block, and their output is concatenated as the input for the CBS block. (C) The structures of the C2f-Faster-EMA block used in the backbone module and the C2f block used in the neck module. These blocks receive information from feature fusion as input and output it to the segmentation head. (D) The structure of the segmentation head and the projector in TiCo. (E) Neuron segmentation result by NeuroSeg-III (ABO Experiment ID: 501704220). CBS: convolution, batch normalization, and SiLU activation function; SPPF: spatial pyramid pooling-fast; Up: upsampling; Conv: convolution; PConv: partial convolution; BN: batch normalization; MP: max pooling; FC: fully connected layer.

FIG. 3 .
FIG. 3. Attention mechanism used in the backbone module. (A) Diagram of EMA. 'X Avg Pool' and 'Y Avg Pool' denote the 1D horizontal and vertical global pooling calculations, respectively. ⊕ represents an addition module. (B) Visualization of attention mechanisms across different imaging scales and Ca2+ indicators. Note that the EMA module identifies more neurons with greater detail.

FIG. 4 .
FIG. 4. The improvements over YOLOv8s and the pre-trained weights obtained via self-supervised learning achieve higher segmentation accuracy while reducing the models' parameters. (A) Ablation experiment evaluating the network components, including the FasterNet, EMA, and BiFPN modules (*P < 0.05; **P < 0.005; ****P < 0.0001; n = 37 images; two-sided Wilcoxon signed-rank test; error bars are SD; n.s., not significant). (B) The segmentation module of NeuroSeg-III significantly reduced the GFLOPS and the number of model parameters. (C) Transferring pre-trained weights obtained via self-supervised learning enhances the segmentation ability of our proposed model, especially with few labelled training data. We used 10 ABO 275 μm videos and 10 ABO 175 μm videos separately as the training and validation datasets. The horizontal axis represents the number of partitions applied to the original ABO video.
