Real-time data processing in colorimetry camera-based single-molecule localization microscopy via CPU-GPU-FPGA heterogeneous computation

Because conventional low-light cameras used in single-molecule localization microscopy (SMLM) cannot distinguish colors, a dedicated optical system and/or a complicated image analysis procedure is often necessary to realize multi-color SMLM. Recently, researchers explored the potential of a new kind of low-light camera, called a colorimetry camera, as an alternative detector in multi-color SMLM, and achieved two-color SMLM with a simple optical system and crosstalk comparable to the best reported values. However, extracting images from all color channels is a necessary but lengthy process in colorimetry camera-based SMLM (called CC-STORM), because this process requires the sequential traversal of a massive number of pixels. By taking advantage of the parallelism and pipelining characteristics of FPGAs, in this paper we report an updated multi-color SMLM method called HCC-STORM, which integrates the data processing tasks of CC-STORM into a home-built CPU-GPU-FPGA heterogeneous computing platform. We show that, without sacrificing the original performance of CC-STORM, the execution speed was increased by approximately three times. In HCC-STORM, the total data processing time for each raw image with 1024 × 1024 pixels was 26.9 ms. This improvement enabled real-time data processing for a field of view of 1024 × 1024 pixels at an exposure time of 30 ms (a typical exposure time in CC-STORM). Furthermore, to reduce the difficulty of deploying algorithms on the heterogeneous computing platform, we also report the necessary interfaces for four commonly used high-level programming languages: C/C++, Python, Java, and Matlab. This study not only advances the maturity of CC-STORM, but also presents a powerful computing platform for tasks with heavy computational loads.


Introduction
Single-molecule localization microscopy (SMLM), including photoactivated localization microscopy (PALM), fluorescence photoactivation localization microscopy (FPALM), stochastic optical reconstruction microscopy (STORM), direct stochastic optical reconstruction microscopy (dSTORM) and many other variants, has become an important tool for structural cell biology [1][2][3]. Multi-color SMLM is of great interest, as it increases the information content that can be extracted from biological samples. For example, multi-color SMLM imaging of proteins labeled with different fluorophores can probe the spatial relations [2] and the interactions among proteins at the single-molecule level [4].
Currently, researchers have reported numerous approaches to achieve multi-color SMLM, mainly including sequential excitation [5], chromatic dispersion [6], spectral splitting [7,8], point spread function (PSF) engineering [2,6], and ratiometric methods [2,9]. However, these approaches still face different technical problems. For example, sequential excitation methods are able to correct drift within each color channel, but cannot eliminate the misalignment among different channels caused by accumulated drift during the acquisition process [5,8]. Chromatic dispersion methods experience a reduction in the signal-to-noise ratio (SNR), because the spectral information must be distributed over tens of pixels; moreover, their spectral accuracy is limited by the achievable SNR [6,10], and they fail to resolve overlapping spectral distributions [6]. Spectral splitting methods necessitate additional error-prone alignment procedures between channels [7], require complex and expensive custom-built microscope hardware, and their simultaneous acquisition of multiple channels often suffers from spectral crosstalk [8,10]. PSF engineering methods are hindered by the inability to produce sufficiently distinct PSF shapes, because the spectral emission differences are on the order of tens of nanometers in the peak emission, thus impeding discrimination of spectrally close fluorophores [6]. Ratiometric methods enable simultaneous multicolor imaging using a single excitation laser, with negligible channel shift and chromatic aberration [2]. However, these methods impose stringent requirements on the selection of fluorescent dyes, as well as on the precise measurement of photon counts [2,9].
In 2021, our group reported a completely different approach to realizing multi-color SMLM, called colorimetry camera-based stochastic optical reconstruction microscopy (CC-STORM) [11]. In this approach, the conventional monochrome camera used in SMLM is replaced by a new kind of low-light camera called a colorimetry camera. A colorimetry camera is essentially a customized sCMOS camera with a repeated pattern of color pixels and monochrome (or white, W) pixels. The color pixels include Red (R), Green (G), Blue (B), and Near Infrared (NIR) pixels [12]. CC-STORM utilizes the White channels of the colorimetry camera to capture the positional information of emitters, and employs the color channels to distinguish the colors of different emitters, thus enabling a pixel-level joint encoding of emitter position and color. In this way, CC-STORM uses a very simple optical system to achieve multi-color SMLM with low crosstalk. In fact, CC-STORM was reported to enable two-color SMLM with ∼ 20 nm spatial resolution and < 2% color crosstalk [11]. And, although CC-STORM has so far been used only in two-color two-dimensional SMLM, this method has good potential to be expanded to 4-5 color SMLM [13] and three-dimensional SMLM.
According to its encoding strategy, CC-STORM needs to determine the spatial position and color of each emitter. Specifically, as reported previously [11], the data processing tasks of CC-STORM mainly include (see Section 2.3 for details): subregion extraction, emitter localization with maximum-likelihood estimation (MLE), image extraction from color channels (abbreviated as color extraction), normalized color intensity (NCI) estimation, emitter color recognition, and image reconstruction. Note that the color extraction process requires traversing all pixels in the four color channels. If this process is performed on a CPU or GPU, the pixels in different color channels must be processed sequentially. Therefore, it is not surprising that the color extraction process is the most time-consuming step in the reported CC-STORM [11]. Because the data processing speed of the reported CC-STORM is significantly slower than the typical image acquisition speed (1024 × 1024 pixels @ 30 ms), we could not realize the desired real-time optimization of experimental conditions. Here, real-time processing means that the total time required for processing a raw image in CC-STORM (including at least molecule localization, color recognition, and image reconstruction) must be less than the time required to acquire that raw image.
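The real-time criterion above reduces to a single inequality, encoded in this minimal Python sketch (the numbers in the example are the per-frame figures reported in this paper):

```python
def is_real_time(processing_ms_per_frame, exposure_ms):
    """Real-time processing (as defined above): the total time needed to
    process a raw image must be less than the time needed to acquire it."""
    return processing_ms_per_frame < exposure_ms

# HCC-STORM processes a 1024 x 1024 frame in 26.9 ms at a 30 ms exposure,
# while CC-STORM is roughly three times slower per frame.
hcc_ok = is_real_time(26.9, 30.0)        # True: real-time achieved
cc_ok = is_real_time(3 * 26.9, 30.0)     # False: slower than acquisition
```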
On the other hand, we recently used a customized FPGA-GPU heterogeneous computing platform (HCP) to develop a method called AIO-STORM, which enables real-time data processing of high-density SMLM under high data throughput (1024 × 1024 pixels @ 100 frames per second (fps)) [14]. AIO-STORM primarily uses a commercial FPGA chip (XC7K325, Kintex-7, Xilinx) to accelerate the time-consuming subregion extraction task in high-density SMLM. If this task were executed on a CPU or GPU, it would involve sequential and iterative computation and storage; executing it on an FPGA avoids these issues [15]. After a careful comparison of the tasks in CC-STORM and high-density SMLM, we realized that FPGA-GPU computation would be a promising way to solve the slow data processing problem in CC-STORM.
It is worth noting that CC-STORM involves many computation tasks, and using only FPGA-GPU computation may be insufficient to meet the real-time data processing demand of CC-STORM. Therefore, it would be better to incorporate CPU computation into the FPGA-GPU computing platform, although this requires additional effort to solve the data communication issues among the devices in the CPU-GPU-FPGA platform. Since common algorithm development languages cannot directly realize the required data communication between the FPGA and other devices, a PCIe core is typically used as the data interaction model to develop data communication interfaces [14]. However, developing data communication interfaces is technically challenging, and the data interfaces in different high-level programming languages are not compatible.
In this paper, to facilitate the deployment of CC-STORM or other algorithms onto the CPU-GPU-FPGA platform, we first encapsulated the data communication of the HCP into a dynamic link library (DLL). We then developed the communication interfaces for four high-level programming languages using a hybrid programming approach. Finally, we used FPGA computing to accelerate the color extraction task, and GPU computing to expedite the MLE-based emitter localization task. In this way, we deployed CC-STORM on the customized CPU-GPU-FPGA platform and developed an updated version of CC-STORM, which we named colorimetry camera-based stochastic optical reconstruction microscopy via heterogeneous computing platform (HCC-STORM). Using simulated and experimental datasets, we verified that HCC-STORM approximately triples the execution speed of CC-STORM without sacrificing other performance, meeting the execution speed requirements for real-time image processing of a field of view (FOV) of 1024 × 1024 pixels with an exposure time of 30 ms. We believe that this paper outlines a practical route to real-time image processing for the colorimetry camera used in this study (Retina 200DSC) at its maximum FOV (2048 × 1152 pixels) and a typical exposure time of 30 ms, provided the CPU-GPU-FPGA platform is upgraded accordingly.

Construction of CPU-GPU-FPGA heterogeneous computing platform
The commercial FPGA chip we selected in this study belongs to the Xilinx Kintex-7 series (model: XC7K325), featuring a PCIe Gen 2.0 × 8 interface. We directly connected the FPGA to the motherboard of a graphics workstation, which included a CPU (Intel Core i7-11700K, memory: 32.0 GB) and a GPU (NVIDIA GeForce RTX 3060, memory: 12 GB, CUDA cores: 3584). We housed these devices in a single chassis connected via PCIe slots, forming the hardware implementation of the CPU-GPU-FPGA HCP.
Firstly, we used a PCIe core to implement direct memory access (DMA) functionality on the FPGA (via PCIe). We created a user logic module inside the FPGA to accommodate algorithm tasks. Since the PCIe core used AXI4-Stream for input and output, we developed the corresponding modules to convert this complex onboard data transmission format into a simpler data format with valid signals. Additionally, we placed a first-in first-out (FIFO) memory in the user logic module to store output data. The CPU could read the volume of data in the FIFO to determine whether or not to initiate DMA read operations.
Subsequently, we developed an automated CPU-FPGA data transfer scheme, as shown in Fig. 1. The scheme includes the following steps.
Step 1: Initialize the host computer. This mainly includes: setting the amount of data to transfer from the CPU to the FPGA, allocating memory for data transmission and reception, loading data into the transmission memory, and setting the FIFO read threshold and the read wait-time threshold.
Step 2: Create a sub-thread to transmit data, while the main thread waits.
Step 3: Perform data reception in the main thread if the data volume in the FIFO of the FPGA exceeds the FIFO read threshold. Whether or not data are received, the main thread continues to wait.
Step 4: If the data volume in the FIFO remains below the read threshold and the waiting time exceeds the wait-time threshold, check the data volume in the FIFO: if it is zero, terminate the process; otherwise, perform a final data reception and then terminate. The sub-thread terminates once its data transmission is complete.
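The four steps above can be sketched in software. The following Python model is illustrative only: the FPGA-side FIFO is modeled with a bounded queue, and the capacity, threshold, and timeout values are arbitrary stand-ins for the parameters configured in Step 1.

```python
import queue
import threading
import time

def run_transfer(tx_words, fifo_capacity=1024, read_threshold=256,
                 wait_timeout_s=0.05):
    """Software model of the automated CPU-FPGA transfer scheme (Steps 1-4).
    A sub-thread 'transmits' data into an FPGA-like FIFO while the main
    thread polls the FIFO level and drains it in DMA-like bulk reads."""
    fifo = queue.Queue(maxsize=fifo_capacity)  # stands in for the on-FPGA FIFO
    received = []

    def transmit():                            # Step 2: sub-thread sends data
        for word in tx_words:
            fifo.put(word)                     # blocks if the FIFO is full

    tx = threading.Thread(target=transmit)
    tx.start()

    last_read = time.monotonic()
    while True:
        if fifo.qsize() >= read_threshold:     # Step 3: threshold reached
            for _ in range(fifo.qsize()):      # bulk read, DMA-like
                received.append(fifo.get())
            last_read = time.monotonic()
        elif time.monotonic() - last_read > wait_timeout_s:
            if tx.is_alive():                  # transmitter still busy
                last_read = time.monotonic()
                continue
            while not fifo.empty():            # Step 4: final reception
                received.append(fifo.get())
            break                              # then terminate
        else:
            time.sleep(0.001)                  # main thread keeps waiting
    tx.join()                                  # sub-thread ends after transmit
    return received
```

Calling `run_transfer(list(range(1000)))` returns all 1000 words in order, mimicking a loop-back test of the scheme.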
We implemented the communication between the CPU and the GPU using the CUDA application programming interface (API), and facilitated the data exchange between the FPGA and the GPU through the main memory. We used the CUDA API to bind the main memory to the GPU's device memory. Consequently, when the FPGA wrote data to the computer's main memory, the data were automatically transferred to the device memory of the GPU, thus enabling efficient data interaction between the FPGA and the GPU.

Development of high-level programming language interfaces for the heterogeneous computing platform
Before implementing the high-level programming language interfaces (HPLIs), we encapsulated the host computer program of the HCP by splitting and reorganizing it. We divided the host computer program into three functions according to their purposes: a memory allocation function, an automatic data transmission function, and a memory release function. We then encapsulated these three functions into a DLL, and thus developed the required HCP interfaces for four high-level programming languages: C/C++, Python, Java, and Matlab.
The following notes are important during the development of the DLL.
Note 1: Since we would like to use the DLL in Java, we should wrap it according to the Java Native Interface (JNI) specification.
Note 2: Since a direct interchange of memory between the memory environment of the DLL (which is developed in C/C++) and the runtime environment of Java is not possible, we must invoke JNI library functions for conversion.
Note 3: Standard encapsulation methods are sufficient, if the DLL is intended for use with C/C++, Python, and Matlab.
Note 4: C/C++ and Matlab can directly invoke the internal interface functions of the DLL without requiring additional tools. However, Python requires the assistance of ctypes to convert both the internal interface functions of the DLL and the memory allocated by DLL function calls; once this conversion is completed, the functions can be invoked within the Python programming environment. Java requires the assistance of JNI to import functions from the DLL into its runtime environment, enabling their invocation.
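As an illustration of Note 4, the sketch below shows the ctypes pattern for Python. The entry-point names `hcp_alloc`, `hcp_transfer`, and `hcp_free` are hypothetical stand-ins for the three DLL functions described above, and their signatures are assumptions, not the actual interface.

```python
import ctypes

def bind_hcp(dll_path):
    """Bind the three DLL entry points via ctypes. The names
    hcp_alloc/hcp_transfer/hcp_free and their signatures are illustrative
    stand-ins for the memory-allocation, automatic-data-transmission, and
    memory-release functions described above."""
    dll = ctypes.CDLL(dll_path)
    dll.hcp_alloc.argtypes = [ctypes.c_size_t]          # bytes to allocate
    dll.hcp_alloc.restype = ctypes.c_void_p             # buffer pointer
    dll.hcp_transfer.argtypes = [ctypes.c_void_p,       # tx buffer
                                 ctypes.c_void_p,       # rx buffer
                                 ctypes.c_size_t]       # byte count
    dll.hcp_transfer.restype = ctypes.c_int             # status code
    dll.hcp_free.argtypes = [ctypes.c_void_p]
    dll.hcp_free.restype = None
    return dll

# ctypes also performs the memory conversion mentioned in Note 4:
frame = bytes(range(256)) * 4                           # mock raw-image bytes
buf = ctypes.create_string_buffer(frame, len(frame))    # C-compatible buffer
ptr = ctypes.cast(buf, ctypes.c_void_p)                 # pointer for the DLL
```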

Deployment of the CC-STORM algorithm into the CPU-GPU-FPGA heterogeneous computing platform
In the CPU-GPU-FPGA HCP, we directly employed the HPLIs to control data interaction among the internal devices of the platform. During the deployment of CC-STORM, our sole focus was on task assignment across different devices, which significantly simplified the deployment process.
The execution flow of CC-STORM is illustrated in Fig. 2(a). The main tasks of CC-STORM and their specific deployment on the HCP are as follows.
Task 1: Subregion extraction. Before emitter localization, the raw images required denoising and emitter identification in the White channels. Subsequently, 9 × 9 regions centered on the emitters were extracted. We assigned this task to the GPU to achieve faster execution speed.
Task 2: Emitter localization with MLE. This task involved localizing a massive number of emitters individually, making it suitable for deployment on GPUs, which have abundant threads and computational resources. We assigned this task to the GPU to achieve faster execution speed.
Task 3: Color extraction. This task involved extracting the information from the four color channels of a raw fluorescence image, captured by the colorimetry camera, into the corresponding channel images. This process required iterating through all the pixels in the four color channel images. Thus, performing this task on a CPU or GPU would require sequentially extracting pixels from the different color channel images, leading to considerable time consumption. In contrast, an FPGA can use hardware logic circuitry to simultaneously extract color pixels from the different color channel images, significantly reducing the processing time for this task. Therefore, we employed FPGA computation to accelerate this task.
Task 4: NCI estimation. The NCI information of emitters was estimated using the channel images obtained from color extraction and the Gaussian distributions recovered from emitter localization. This task was complex and computationally intensive, so we deployed it on the GPU.
Task 5: Color recognition and image reconstruction. We determined the color of emitters from the NCI information, and performed image reconstruction using the emitter position information. This task had a relatively low computational load, so the time consumption of deploying it on the CPU was negligible, whereas deploying it on the GPU would incur additional communication overhead. Therefore, we deployed it on the CPU.
In summary, Task 3 had good potential for acceleration and was deployed on the FPGA to improve the execution speed of CC-STORM. Other complex and computationally intensive tasks were deployed on the GPU, while tasks with a small computational load were deployed on the CPU.
We realized the tasks on two computing platforms (Fig. 2(b, c)) and compared the total execution times (Fig. 2(d, e)). It is worth noting that the reported CC-STORM employed a CPU-GPU computing platform, whereas the current HCC-STORM used a CPU-GPU-FPGA computing platform. In the reported CC-STORM, the CPU-based color extraction task consumed more time than all of the other tasks combined. Conversely, in HCC-STORM, the FPGA-based color extraction task required significantly less time, thereby reducing the overall execution time.

Implementation of the color channel image extraction task on FPGA
To implement the color extraction function on the FPGA, we designed a logic circuit and integrated it into the user logic module described in Section 2.1. Since the internal data transmission issues within the FPGA had already been solved, here we only needed to focus on the design of the color extraction module, which included a pattern reading module and four distance judgment modules. In this design, we replicated the raw image data into four copies using hardware circuits, and then directed these copies to the distance judgment modules for the R, G, B, and NIR channels, respectively. The distance judgment modules for the four color channels operated independently without data interaction, forming four parallel processing pipelines. We pre-stored the color channel arrangement matrix (pattern) of the colorimetry camera in the FPGA's Read-Only Memory (ROM). Additionally, we designed a logic circuit to automatically manage the ROM read address, thus constructing the pattern reading module. When the input data were valid, the pattern reading module output the corresponding data to the distance judgment modules, serving as a reference template for the color extraction tasks. We arranged the raw image data and pattern data into 9 × 9 windows before feeding them into the distance judgment modules.
The distance judgment modules were the key components for implementing the color extraction functionality. The workflow for a single color channel within a distance judgment module is described as follows.
Step 1: Construct a distance matrix for all channels. A 9 × 9 matrix representing the distances across all channels was constructed, as illustrated in Fig. 3(a). As depicted in the figure, this matrix exhibited a nonuniqueness issue in its distance values. To ensure result consistency between the FPGA and the algorithm, we refined the distance matrix for all channels into the format shown in Fig. 3(b).
Step 2: Extract pixel positions. We constructed a 9 × 9 register matrix to represent pixel positions, and used it to extract pixel positions from the original 9 × 9 image data window. We recorded, in the pixel position register matrix, the pixel positions in the pattern window that matched the color channel parameter of the module.
Step 3: Construct the distance matrix for the current channel. Using the pixel position register matrix obtained in the previous step, we retained the recorded pixel positions in the full-channel distance matrix and set all other positions to 81, thereby generating the distance matrix for the current channel.
Step 4: Identify the closest position in each row of the distance matrix. We compared the values row by row to determine the position with the smallest value in each row.
Step 5: Identify the closest position in the distance matrix. We compared the minimum values obtained from the previous step to determine the overall minimum of the entire 9 × 9 distance matrix, which represented the position closest to the center.
Step 6: Output the result. We output the effective pixel value closest to the center position in the original image data window, thus completing the extraction of one pixel of a color channel image.
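The six steps can be emulated in software to make the per-channel logic concrete. The sketch below is a behavioral model, not the FPGA implementation: it assumes squared Euclidean distance to the window center and row-major order as the tie-break (the actual refined distance matrix that guarantees unique minima is shown in Fig. 3(b)).

```python
def extract_channel_pixel(image_win, pattern_win, channel, size=9):
    """Behavioral model of one distance judgment module (Steps 1-6).
    Given a size x size raw-image window and the matching pattern window,
    return the value of the pixel of `channel` closest to the center.
    Assumption: squared Euclidean distance with row-major tie-breaking."""
    c = size // 2
    sentinel = size * size                  # 81 for a 9 x 9 window (Step 3)
    # Steps 1-3: per-channel distance matrix; non-matching pixels get 81,
    # which always exceeds any real distance inside the window.
    dist = [[(r - c) ** 2 + (col - c) ** 2
             if pattern_win[r][col] == channel else sentinel
             for col in range(size)]
            for r in range(size)]
    # Steps 4-5: row-wise minima, then the global minimum position.
    _, r, col = min((dist[r][col], r, col)
                    for r in range(size) for col in range(size))
    # Step 6: output the raw-image value at the closest matching position.
    return image_win[r][col]

# Example: a window whose 'R' pixels sit at (3, 4) and (0, 0); the module
# outputs the image value at (3, 4), the position nearer the center (4, 4).
image = [[r * 9 + col for col in range(9)] for r in range(9)]
pattern = [['W'] * 9 for _ in range(9)]
pattern[3][4] = 'R'
pattern[0][0] = 'R'
nearest = extract_channel_pixel(image, pattern, 'R')    # image[3][4] == 31
```

Because the four channel modules differ only in the `channel` parameter, running this function four times with different channels mirrors the four parallel pipelines on the FPGA.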
It is noteworthy that the distance judgment modules for the four color channels are identical, differing only in their respective color channel parameters. When extracting images for up to four color channels, the computation time of this color extraction module is determined only by the FPGA operating frequency and the number of processed pixels. When CC-STORM is expanded to support 3-color or 4-color SMLM, this color extraction module remains operational; moreover, the time required for performing color extraction tasks with the same number of pixels remains unchanged.

Performance evaluation methods for heterogeneous computing platform and high-level programming language interfaces
To evaluate the communication performance of the CPU-GPU-FPGA HCP and its HPLIs, we conducted a series of comprehensive tests. To simulate the data growth associated with the color extraction task, we duplicated the input data of the user logic module in the FPGA, and directly output these two copies from the user logic module. We utilized the HPLIs, including C/C++, Python, Java and Matlab, to control the data transmission on the HCP. Initially, we sequentially transmitted varied amounts of data (1 MB to 1 GB) to evaluate the transfer throughput T of data interaction between the CPU and the FPGA, with the data transmission sequence set as CPU-FPGA-CPU, where T = (Data_i + Data_o) / Time. Here, Data_i represents the total data volume input from the HPLIs, Data_o represents the total data volume output from the HPLIs, and Time represents the duration of the HPLI function call.
Furthermore, we conducted tests to evaluate the overall data interaction efficiency of the HCP. Using the HPLIs, we controlled the interaction of different data amounts (1 MB to 1 GB) among the devices, following the sequence CPU-FPGA-GPU-CPU for data transmission. This evaluation served to assess the overall data transfer throughput T_all = (Data_CF + Data_FG + Data_GC) / Time_all. Here, Data_CF denotes the total data volume transferred from CPU to FPGA, Data_FG represents the total data volume transferred from FPGA to GPU, Data_GC indicates the total data volume transferred from GPU to CPU, and Time_all signifies the total time consumed in the process.
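Both throughput definitions share the same form, the total data moved across all listed legs divided by the elapsed time, which a small helper captures (the volumes and times in the example are arbitrary, not measured values):

```python
def throughput_mb_s(leg_volumes_mb, elapsed_s):
    """Throughput as defined above: for the CPU-FPGA-CPU test the legs are
    (Data_i, Data_o); for the CPU-FPGA-GPU-CPU test they are
    (Data_CF, Data_FG, Data_GC)."""
    return sum(leg_volumes_mb) / elapsed_s

t = throughput_mb_s([1024, 1024], 1.4)            # CPU-FPGA-CPU loop
t_all = throughput_mb_s([1024, 1024, 1024], 1.5)  # CPU-FPGA-GPU-CPU loop
```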

Acquisition of simulated and experimental images and performance evaluation
We utilized simulated datasets with varied activation densities (0.1 ∼ 0.6 µm⁻²) to assess the localization performance of HCC-STORM. At each activation density, we simulated 1000 frames of raw images with randomly distributed emitters and 512 × 512 pixels in each frame. The image generation was configured according to the typical intensity range in SMLM experiments [11].
The average photon count per emitter was set to 5000 photons, the background intensity was set to 336.4 photons [11], and the emitters were randomly distributed in the image. The full width at half maximum (FWHM) of the Gaussian point spread function was set to 1.3 pixels, with a pixel size of 108 nm at the sample plane. The mean emission wavelength was set to 640 nm, and the camera readout noise was set to 2.71 e⁻. We evaluated the localization performance of the two algorithms with the Jaccard index and the root mean square error (RMSE). The Jaccard index was used to evaluate the emitter detection capability, while the RMSE was used to characterize the localization accuracy. The experimental datasets were the same as in the original CC-STORM paper [11]. We processed the experimental raw images with HCC-STORM and CC-STORM, and compared the runtimes and the spatial resolution of the reconstructed images. We quantified the FWHM resolution using a previously reported tool called LuckyProfiler [16], an ImageJ plugin designed to easily and effectively quantify the FWHM resolution of super-resolution images without the need for manual selection.
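The two metrics can be sketched as follows. The greedy nearest-neighbor matching and the 50 nm tolerance radius below are illustrative assumptions; the paper's exact matching procedure may differ.

```python
import math

def evaluate_localizations(ground_truth, detections, tol_nm=50.0):
    """Sketch of the metrics above. Ground-truth and detected emitter
    positions (x, y in nm) are greedily matched within a tolerance radius;
    Jaccard = TP / (TP + FP + FN) scores detection capability, and the
    RMSE over matched pairs scores localization accuracy."""
    unmatched = list(detections)
    sq_errors = []
    for gx, gy in ground_truth:
        best, best_d2 = None, tol_nm ** 2
        for i, (dx, dy) in enumerate(unmatched):
            d2 = (gx - dx) ** 2 + (gy - dy) ** 2
            if d2 <= best_d2:
                best, best_d2 = i, d2
        if best is not None:
            unmatched.pop(best)              # consume the matched detection
            sq_errors.append(best_d2)
    tp = len(sq_errors)
    fp = len(unmatched)                      # detections with no true emitter
    fn = len(ground_truth) - tp              # missed true emitters
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    rmse = math.sqrt(sum(sq_errors) / tp) if tp else float("inf")
    return jaccard, rmse
```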

Characteristics of different hardware computing platforms
Different processors have their own characteristics, as shown in Table 1, and therefore often exhibit different execution speeds when handling the same task. CPUs possess excellent management and scheduling capabilities [17], but exhibit limited computational power [18]. GPUs have a large number of threads and computing units, enabling parallel processing of massive data for compute-intensive tasks [19]. Compared to GPUs, FPGAs exhibit even stronger parallel capability, encompassing not only data parallelism but also pipeline parallelism [19,20]. However, FPGA programming is not easy, and typically suffers from limited resources [19]. In this paper, we tried to build a powerful computing platform by integrating these three processors, so that we could leverage the strengths of each processor while mitigating its weaknesses. The key issue in integrating different processors lies in the communication between them; common communication methods include USB [21], Ethernet [22], and PCIe [14]. Compared to the other communication methods, PCIe offers greater bandwidth, making it suitable for scenarios requiring high throughput. When connecting different processors via the PCIe interface, it is common to utilize a high-level programming language to develop data transfer interfaces. However, the high-level programming languages used in algorithm development often differ, and sometimes multiple languages must be used simultaneously, raising significant challenges for algorithm deployment on HCPs. Additionally, the reported AIO-STORM method employed FPGA-controlled DMA data transfer to achieve 98% of the theoretical throughput, so that the bandwidth advantage of PCIe could be fully used [14,23]. However, this kind of data transfer method necessitates the implementation of data transfer control within the FPGA, which poses considerable technical challenges and requires a high level of technical skill.
In this paper, we used the CPU to control data transfer, so that the development complexity was significantly reduced, although some data transfer throughput was sacrificed. Moreover, the CPU-controlled data transfer could be encapsulated into automatic data transfer interfaces. In this case, we could easily develop the four HPLIs, thus solving the challenges of using diverse algorithm development languages on the HCP. However, because we currently used a low-speed version of PCIe (PCIe 2.0), the data throughput might not meet the demands of the latest sCMOS cameras (see Section 3.2 for details). This issue could be solved by adopting a higher-speed version of PCIe (PCIe 3.0 or higher).
Note that CC-STORM is required to perform both molecule localization and color discrimination. Therefore, utilizing only the GPU and FPGA might be insufficient to meet the computational demand of CC-STORM. The platform developed in this paper increases the computational power by also leveraging CPU computation. The integration of more computational units enables the HCP to accommodate algorithms with various characteristics, thus offering enhanced scalability to a broader array of application scenarios.

Results of data transfer throughput testing for heterogeneous computing platform and high-level programming language interfaces
We conducted tests using the method described in Section 2.5. When different volumes of data were used as input, the data exchange efficiency between the devices in the HCP varied. We found that as the data transfer size increased, the transfer throughput between the CPU and FPGA also increased, as shown in Fig. 4(a). Different HPLIs also demonstrated varied data exchange performance: the Python interface showed superior performance, reaching a maximum throughput of 1548 MB/s, and the C/C++ interface also exhibited good performance, with a maximum throughput of 1503 MB/s. The Java interface demonstrated the poorest data exchange performance, with a maximum throughput of 1134 MB/s, which was likely due to its complex memory conversion processes. The overall data exchange efficiency of the HCP also varied with the volume of the input data, as shown in Fig. 4(b). Among the interfaces, the Python interface achieved the highest throughput, reaching 2116 MB/s. Due to the relatively low PCIe version (PCIe 2.0) of the FPGA board and the limited number of lanes, data interaction between the FPGA and other processors was currently the bottleneck for the overall throughput.
In CC-STORM, the colorimetry camera we used is the Retina 200DSC, which has a typical exposure time of 30 ms, a maximum field of view of 2048 × 1152 pixels, and a maximum data throughput of approximately 150 MB/s. Our current CPU-GPU-FPGA platform and HPLIs are sufficient to meet the throughput requirements for real-time data exchange with this camera. If our CPU-GPU-FPGA platform and HPLIs are used to handle other up-to-date sCMOS cameras, for example, the Teledyne Photometrics Kinetix sCMOS camera operating at full frame rate (3200 × 3200 pixels, 83 fps, 1621 MB/s), the maximum throughput supported by our platform reaches 95% of the maximum data throughput of this camera. Upgrading the PCIe version or increasing the number of lanes in our platform would then suffice to fully meet the throughput requirements of this camera.
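The camera figures above follow directly from frame size and frame rate. The sketch below assumes 16-bit (2-byte) pixels and reports binary megabytes, which reproduces both quoted values:

```python
def camera_rate_mib_s(width, height, fps, bytes_per_pixel=2):
    """Raw data rate of an sCMOS camera, assuming 16-bit (2-byte) pixels
    and binary megabytes (MiB)."""
    return width * height * bytes_per_pixel * fps / (1024 ** 2)

# Retina 200DSC full frame at a 30 ms exposure (~33.3 fps):
retina = camera_rate_mib_s(2048, 1152, 1000 / 30)      # ~150 MiB/s
# Teledyne Photometrics Kinetix at full frame rate:
kinetix = camera_rate_mib_s(3200, 3200, 83)            # ~1621 MiB/s
```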

Comparative analysis of the time consumption of the color channel image extraction task
We used CC-STORM and HCC-STORM to process a set of raw two-color SMLM images consisting of 10000 frames with 512 × 512 pixels in each frame. We recorded the time for color extraction and for overall processing, as shown in Fig. 5. We found that in CC-STORM, the color extraction task occupied approximately 71% of the total processing time. Thus, enhancing the execution speed of the color extraction task was crucial for reducing the overall execution time. After deploying the color extraction task on the FPGA, its execution time decreased from the original 140.9 seconds to 23.1 seconds, enhancing the execution speed approximately sixfold. Consequently, the overall execution time was reduced from the original 198.3 seconds in CC-STORM to 67.3 seconds in HCC-STORM, an approximately threefold increase in speed.
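The reported accelerations are simple runtime ratios over the figures above:

```python
def speedup(before_s, after_s):
    """Fold increase in execution speed when a runtime drops."""
    return before_s / after_s

# Figures reported above for 10000 frames of 512 x 512 raw images:
color_extraction = speedup(140.9, 23.1)   # ~6.1x with the FPGA
overall = speedup(198.3, 67.3)            # ~2.9x for the whole pipeline
```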

Determining emitter localization using simulated images
We used the simulated dataset mentioned in Section 2.6 to compare the localization performance of HCC-STORM and CC-STORM. The localization performance of the two methods was quantified using the Jaccard index and RMSE, with the results shown in Fig. 6. We found that, across various activation densities, HCC-STORM consistently maintained the same localization performance as CC-STORM. The reason is simple: both methods use identical emitter localization algorithms. Although HCC-STORM utilizes FPGA logic circuits to accelerate the color extraction task, the localization functionality does not depend on the data obtained from color extraction. Therefore, it is reasonable that HCC-STORM and CC-STORM exhibit identical localization performance.

Two-color single-molecule localization microscopy
Using the experimental dataset mentioned in Section 2.6, we compared the runtime, color recognition capability, and spatial resolution of the reconstructed images processed with HCC-STORM and CC-STORM. The results are depicted in Fig. 7(a, b).

In summary, we developed communication interfaces for four high-level programming languages (C/C++, Python, Java, and Matlab), and implemented the entire data processing pipeline of CC-STORM on this new heterogeneous platform. We found that this heterogeneous computing platform (HCP), in conjunction with the high-level programming language interfaces (HPLIs), is able to support a data transmission throughput of up to 1548 MB/s, offering a powerful computing platform for CC-STORM.
Using simulated and experimental images, we verified that this new heterogeneous platform is able to increase the execution speed of CC-STORM by approximately three times, while maintaining the same localization and color recognition performance. The updated version of CC-STORM using the heterogeneous platform, called HCC-STORM in this study, could finish the entire data processing pipeline of CC-STORM in approximately 6.7 ms for raw images with 512 × 512 pixels, corresponding to about 26.9 ms for raw images with 1024 × 1024 pixels. The data processing speed of HCC-STORM meets the real-time image processing requirement for the colorimetry camera used in this study (Retina 200DSC), which typically operates at an exposure time of 30 ms and an imaging field of view of 1024 × 1024 pixels.
However, due to limited FPGA resources, it is currently not feasible to use HCC-STORM to perform two-color SMLM with the full chip of the Retina 200DSC (which has a maximum array of 2048 × 1152 pixels). In the near future, this issue could be addressed by employing an FPGA chip with approximately 2.5 times the resources of the current FPGA (XC7K325TFFG900, Xilinx). Additionally, implementing more tasks (for example, post-processing tasks for artifact removal in SMLM [25], or drift correction during image acquisition [26,27]) could be possible after a further upgrade of the FPGA chip and the PCIe interface. We want to point out that certain tools may simplify the algorithm deployment process on HCPs. For example, Vivado HLS development tools could speed up the deployment of algorithms on the FPGA, while the Open Computing Language (OpenCL) might accelerate both the deployment of algorithms across different processors and the development of communication between them.
Finally, we want to point out that the current colorimetry camera can surely be used to develop 3D multi-color SMLM. However, we should be aware that this camera has a much lower quantum efficiency than the back-illuminated monochrome cameras widely used in 3D multi-color SMLM, and that engineered 3D PSFs such as the astigmatism PSF occupy a larger pixel array. Therefore, when using the current colorimetry camera for 3D multi-color SMLM, we should use (or even develop) brighter fluorophores.

Fig. 2 .
Fig. 2. CC-STORM algorithm workflow and task distribution across different computing platforms. (a) CC-STORM algorithm workflow. (b) Task distribution of CC-STORM on a CPU-GPU computing platform. (c) Task distribution of HCC-STORM on a CPU-GPU-FPGA heterogeneous computing platform. (d) Execution time of (b). (e) Execution time of (c).

Fig. 3 .
Fig. 3. Distance matrix for all channels. (a) The original distance matrix. (b) The modified distance matrix.

Fig. 4 .
Fig. 4. Data transfer throughput test results. (a) The results between CPU and FPGA. (b) The overall throughput of the heterogeneous computing platform.

Fig. 5 .
Fig. 5. Comparing the execution time of CC-STORM and HCC-STORM. (a) Execution time of the color extraction task in CC-STORM, with the total execution time shown in the lower right corner. (b) Execution time of the color extraction task in HCC-STORM, with the total execution time shown in the lower right corner.