Colorful image reconstruction from neuromorphic event cameras with biologically inspired deep color fusion neural networks

Neuromorphic event-based cameras communicate transients in luminance instead of frames, providing visual information with fine temporal resolution, high dynamic range, and a high signal-to-noise ratio. Enriching event data with color information allows for the reconstruction of colorful frame-like intensity maps, supporting improved performance and visually appealing results in various computer vision tasks. In this work, we simulated a biologically inspired color fusion system featuring a three-stage convolutional neural network for reconstructing color intensity maps from event data and sparse color cues. While current approaches to color fusion use full-resolution RGB frames, our design uses event data together with quantized color cues of low spatial and tonal resolution, providing a small, high-performing model for efficient colorful image reconstruction. The proposed model outperforms existing coloring schemes in terms of the SSIM, LPIPS, PSNR, and CIEDE2000 metrics. We demonstrate that auxiliary limited color information can be used in conjunction with event data to successfully reconstruct both color and intensity frames, paving the way for more efficient hardware designs.


Introduction
Event cameras deviate from the conventional frame-based camera paradigm and follow the underlying principles of biological vision. They mimic the asynchronous, event-driven nature of neural information processing in biological retinas by transmitting luminance changes with high temporal resolution and low latency [1]. By capturing luminance changes at the pixel level, event-based cameras can detect motion at high speeds, replacing conventional fixed-rate frame-based vision with asynchronous events, making them well-suited for tasks entailing high dynamic range and optical flow [2]. However, event data requires non-traditional algorithmic approaches for computer vision (CV) and image processing. By reconstructing frame-like intensity maps from events, conventional frame-based algorithms can be applied to event-driven data acquisition. This can increase the adoptability of event-based vision systems in a wide range of applications.
Recent studies have successfully demonstrated the reconstruction of gray-level image intensities using deep neural networks [3][4][5], adversarial training [6][7][8], and transformers [9]. Since color-enriched image reconstruction improves performance in dynamic vision tasks, and event cameras generally do not capture color, two main approaches have been developed for colorful event-based image reconstruction: color encoding and color fusion.
With color encoding, event camera sensors were redesigned to include color-sensitive photodiodes. For example, Farian and colleagues introduced a dichromatic spiking pixel circuit designed for intensity-agnostic selective color detection [10]; and Berner and colleagues proposed a vision sensor for the detection of color transitions at high speed with real-time display capabilities [11]. Another approach for color encoding involves integrating multiple event cameras with different color filters. For example, Moeys and colleagues demonstrated the incorporation of RGBW color filters with an event camera, allowing for color event interpolation for reconstruction [12]. Li and colleagues further improved their design with the introduction of the C-DAVIS camera (Color Dynamic and Active-Pixel Vision Sensor) [13]. The C-DAVIS combines monochrome dynamic vision sensor pixels with RGBW active-pixel sensor pixels, capturing both color frames and asynchronous monochrome temporal contrast events. In those designs, color information was encoded in the events generated by the sensor. The implementation of color encoding requires dedicated circuits and specialized hardware designs tailored to accurately encode color information [10][11][12][13].
With color fusion, data from a conventional frame-based RGB camera is merged with event data. Color fusion-based CV typically involves training a model on a large dataset of scenes captured with both event and conventional cameras. It has been successfully applied to video frame interpolation [14] and temporal video frame synthesis [15]. For example, Pini and colleagues proposed a framework that utilizes a deep adversarial neural network with a recurrent module to synthesize RGB frames from an event camera [16]. The proposed architecture uses an initial or periodic set of RGB keyframes and a sequence of events to synthesize color frames for semantic segmentation and object detection. In their design, the RGB keyframes and the event frames must share the same high resolution.
In this work, we propose a biologically inspired color fusion system that embraces the principles of efficient neural processing seen in biological visual pathways. In retinal vision processing, visual data is encoded through center-surround receptive fields, which enhance edges [17]. The brain converts these edges back into a perceptually coherent image [18, 19]. Our proposed framework includes an image reconstruction layer, which mirrors the brain's transformation of edges into an entire picture. Our proposed color cue space is based on neurons' receptive fields and the biologically plausible opponent color perception found in the primate visual system [20], in which colors are encoded by the relative activation of opposing pairs of color channels. This mechanism enables the visual system to enhance color contrasts, achieve color constancy under varying lighting conditions, and efficiently encode a wide range of colors using fewer neural pathways.
In contrast to the studies described above, which use the full RGB frame resolution to generate a colorful output, our biologically inspired approach uses quantized color cues of low spatial and tonal resolution. Our model demonstrates that auxiliary limited color information can be used in conjunction with event data to successfully reconstruct both color and intensity frames, paving the way for more efficient hardware designs. Our model enables a smaller, better-performing vision system that delivers high-quality colorful image reconstruction in real time with lower computational requirements.

Methods
Our color fusion approach features a three-stage convolutional neural network (CNN) for the reconstruction of color intensity maps from event data and sparse color cues (figure 1). Our network design comprises three phases (each described at length below): 1. a 5-layer CNN that predicts the image's Laplacian from the event frames and the color cues; 2. a non-trainable (fixed-parameter) Poisson solver layer that reconstructs the intensity image from the predicted Laplacian; and 3. a lightweight U-Net model for color cue integration. Figure 1 illustrates the model architecture.

Data acquisition and color-cues generation
In this work, we acquired RGB and event data from both the Caltech101 [21] and the neuromorphic N-Caltech101 [18] datasets. Each event-camera file from the N-Caltech101 dataset [21] was converted to an event-frame tensor E of size H × W × T, where H and W denote its spatial dimensions and T is the number of event frames composing each image in the dataset (equation (1)). Our input signals were therefore converted to event-frame tensors E with the size B × H × W × 8, where B is the batch size and 8 is the total number of event-frame and color cue channels: we used 6 event-frame channels (90 × 120 × 6) and 2 color cue channels, the RG cue and the BY cue (specified below in equation (2)). E is defined as

$$E_{x,y,t} = \sum_{e_k\,:\;(x_k,\,y_k)=(x,\,y),\;\,t_k \in [\,t\Delta t,\;(t+1)\Delta t\,)} p_k, \tag{1}$$

where the sum runs over the events $e_k = (x_k, y_k, t_k, p_k)$ falling within the t-th time window, ∆t = 50 ms, and $p_k$ is the polarity of the event, p ∈ {−1, 1}: p = 1 denotes an increasing luminance change and p = −1 a decreasing one.
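For concreteness, a minimal sketch of the conversion in equation (1) is shown below, assuming raw events arrive as coordinate/timestamp/polarity arrays; the array names and the use of microsecond timestamps are illustrative assumptions, not part of the original pipeline:

```python
import numpy as np

def events_to_frames(x, y, t, p, height=90, width=120, n_frames=6,
                     dt_us=50_000):
    """Accumulate event polarities into n_frames windows of dt_us each.

    x, y: integer pixel coordinates; t: timestamps in microseconds;
    p: polarities in {-1, +1}. Returns an (H, W, T) event-frame tensor.
    """
    frames = np.zeros((height, width, n_frames), dtype=np.float32)
    # Assign each event to its 50 ms bin (clamped to the last frame).
    bins = np.minimum((t // dt_us).astype(int), n_frames - 1)
    # Sum polarities per pixel and per time bin.
    np.add.at(frames, (y, x, bins), p)
    return frames
```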
Due to the current lack of specialized hardware for color cues, we simulated their generation process. To generate color cue inputs from a given image, we used the opponent color space, which is inspired by the receptive fields of neurons in the primate visual system [20]. The color opponent mechanism is a fundamental aspect of color processing in biological visual systems. It involves the perception of color based on the relative activation of opposing pairs of color channels: red-green, blue-yellow, and black-white (luminance). Rather than representing colors individually, the brain processes color differences between these pairs of opposing colors. In the opponent system, red (R) and green (G) are perceived as opposites in the red-green channel (RG), and blue (B) and yellow (Y) in the blue-yellow channel (BY). Input RGB images were converted to three opponent channels, the chromatic RG and BY channels and the achromatic intensity (I) channel, using

$$\begin{pmatrix} RG \\ BY \\ I \end{pmatrix} = M_{\mathrm{opp}} \begin{pmatrix} R \\ G \\ B \end{pmatrix},$$

where $M_{\mathrm{opp}}$ is the color opponent transformation matrix, in which a = 0.2989, b = 0.587, and c = 0.114 (the weights of the achromatic intensity channel, $I = aR + bG + cB$). Color cues were derived from the opponent space by quantizing each chromatic channel:

$$\left(C_{RG},\, C_{BY}\right) = \left(Q(RG),\, Q(BY)\right), \tag{2}$$

where Q denotes uniform quantization. Equation (2) yields two color channels representing red, green, yellow, and blue; combinations of the two channels (RG, BY) give magenta (red and blue), cyan (yellow and blue), orange (red and yellow), and lime (green and yellow). Each color channel was down-sampled by a factor of s = 2; the size of the color cues tensor was therefore reduced to

$$\frac{H}{s} \times \frac{W}{s} \times 2.$$

In our method, equation (2) with 3-bin quantization was used to represent color cues of 2-bit depth. Additionally, we applied quantization to each opponent channel using 5 bins and 7 bins to represent color cues with 3-bit and 4-bit depths, respectively. Figure 2 shows the color cues for selected images (a), as well as the quantized range of each opponent channel for the 2-bit, 3-bit, and 4-bit representations.
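A sketch of this cue-generation process follows, under stated assumptions: the intensity row of M_OPP uses the weights a, b, c given above, while the RG and BY rows shown here (simple difference channels) are illustrative, since the paper's exact matrix entries are not reproduced in the text:

```python
import numpy as np

A, B, C = 0.2989, 0.587, 0.114
M_OPP = np.array([[ 1.0, -1.0,  0.0],   # RG row: red vs. green (assumed)
                  [-0.5, -0.5,  1.0],   # BY row: blue vs. yellow (assumed)
                  [   A,    B,    C]])  # I row: luminance (weights from the text)

def make_color_cues(rgb, bins=3, s=2):
    """rgb: H x W x 3 float image in [0, 1] -> (H/s) x (W/s) x 2 quantized cues."""
    opp = rgb @ M_OPP.T                  # RGB -> opponent space
    rg, by = opp[..., 0], opp[..., 1]
    # Down-sample each chromatic channel by a factor of s (naive striding).
    rg, by = rg[::s, ::s], by[::s, ::s]

    def quantize(ch):
        # Uniformly quantize the channel's range into `bins` levels
        # (3 bins ~ 2-bit cues, 5 bins ~ 3-bit, 7 bins ~ 4-bit per the text).
        lo, hi = ch.min(), ch.max()
        q = np.round((ch - lo) / (hi - lo + 1e-8) * (bins - 1))
        return q / (bins - 1) * (hi - lo) + lo

    return np.stack([quantize(rg), quantize(by)], axis=-1)
```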

Intensity reconstruction and colorization neural network
A three-stage CNN was used to reconstruct and colorize frames based on events and color cues. Our neural network comprises the following phases: 1. a simple 5-layer CNN (described in section 2.2.1) that predicts the Laplacian from the event frames (channels 0-5 of the input tensor) and the color cues (the last two channels of the input tensor); 2. a Poisson solver layer (described in section 2.2.2) that reconstructs the image from the predicted Laplacian; and 3. a lightweight U-Net model (described in section 2.2.3) for color cue integration (figure 1).

Laplacian prediction
This stage is based on the 5-layer CNN previously proposed by [19], which predicts the image Laplacian from event data. Unlike the originally proposed method, here the CNN also incorporates the color cue channels, since color cues contain some spatial information about the scene. We used the mean absolute error (MAE) loss to minimize the objective between the predicted Laplacian $\widehat{\mathrm{Lap}}$ and the ground-truth (GT) Laplacian $\mathrm{Lap}$:

$$\mathcal{L}_{\nabla^2} = \mathrm{MAE}\!\left(\widehat{\mathrm{Lap}},\;\mathrm{Lap}\right) = \frac{1}{HW}\sum_{i,j}\left|\widehat{\mathrm{Lap}}_{i,j} - \mathrm{Lap}_{i,j}\right|,$$

where $\widehat{\mathrm{Lap}}$, the predicted Laplacian, is the output of the current phase of the neural network (figure 1).
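As a minimal Keras sketch of such a stage: the 5-layer structure and the 8-channel input follow the text, while the filter counts and kernel sizes are assumptions, since the paper adopts the architecture of [19] without restating them here:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_laplacian_cnn(h=90, w=120, in_ch=8, filters=32):
    """Plain 5-layer CNN: 8-channel input (6 event frames + 2 cues)
    -> 1-channel predicted Laplacian."""
    inp = tf.keras.Input(shape=(h, w, in_ch))
    x = inp
    for _ in range(4):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    # Final layer: one channel, no activation (Laplacian values are signed).
    out = layers.Conv2D(1, 3, padding="same", activation=None)(x)
    return tf.keras.Model(inp, out)

# Training this stage alone would minimize the MAE, e.g.:
# model = build_laplacian_cnn(); model.compile(optimizer="adam", loss="mae")
```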

Poisson solver layer
The Poisson reconstruction layer mirrors the brain's transformation of edges into an entire picture. Our Poisson solver layer solves the Poisson equation (Poisson Integration, PI) to reconstruct surfaces (intensity images) from edge inputs. Even though this layer does not contain learnable parameters, it was incorporated into the network so that errors can be backpropagated for end-to-end training. The implementation of this layer is based on the PI algorithm previously proposed by [22], applied to the predicted Laplacian of the image intensity. The solution to the Poisson equation $\nabla^2 u = g$ can be obtained using the Discrete Sine Transform (DST) and its inverse (IDST) as follows:

$$u = \operatorname{IDST}\!\left(\operatorname{IDST}\!\left(\frac{\operatorname{DST}\!\left(\operatorname{DST}(g)'\right)'}{d}\right)'\right)', \tag{5}$$

where DST and IDST denote the Discrete Sine Transform and its corresponding Inverse Discrete Sine Transform, each independently applied to the columns of the image, and the notation (′) signifies matrix transposition. The denominator d in equation (5) is defined for each pixel location (i, j) as

$$d_{i,j} = \left(2\cos\!\left(\frac{\pi i}{H+1}\right) - 2\right) + \left(2\cos\!\left(\frac{\pi j}{W+1}\right) - 2\right), \tag{6}$$

where H and W are the image dimensions (height and width).
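A minimal NumPy/SciPy sketch of equations (5) and (6) follows. Inside the network, a differentiable DST (e.g., assembled from FFTs of an odd extension) would be needed so gradients can flow through this layer, but the arithmetic is the same:

```python
import numpy as np
from scipy.fft import dst, idst

def poisson_solve(lap):
    """Reconstruct intensity u from its Laplacian via DST-I,
    assuming homogeneous Dirichlet boundary conditions."""
    H, W = lap.shape
    # Forward 2-D DST-I, applied along columns and then rows.
    g_hat = dst(dst(lap, type=1, axis=0), type=1, axis=1)
    # Eigenvalues of the discrete Laplacian in the DST-I basis (equation (6)).
    i = np.arange(1, H + 1)[:, None]
    j = np.arange(1, W + 1)[None, :]
    d = (2 * np.cos(np.pi * i / (H + 1)) - 2) \
      + (2 * np.cos(np.pi * j / (W + 1)) - 2)
    # Divide in the transform domain, then invert (equation (5)).
    u_hat = g_hat / d
    return idst(idst(u_hat, type=1, axis=1), type=1, axis=0)
```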
Here, we used both the MAE and the Structural Similarity Index Measure (SSIM) to compute the loss of this layer:

$$\mathcal{L}_{PI} = \lambda_{PI,\mathrm{mae}}\,\mathrm{MAE}\!\left(\hat{L}_{rec},\,L\right) + \lambda_{PI,\mathrm{ssim}}\left(1 - \mathrm{SSIM}\!\left(\hat{L}_{rec},\,L\right)\right),$$

where $\hat{L}_{rec}$ is the output of the Poisson solver layer and L is the GT intensity image.
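Assuming intensity images are normalized to [0, 1], this stage's loss could be expressed in TensorFlow as in the sketch below; the function name is illustrative, and the default weights follow the values listed later in 'Implementation details':

```python
import tensorflow as tf

def poisson_stage_loss(l_true, l_rec, lam_mae=100.0, lam_ssim=25.0):
    """l_true, l_rec: (batch, H, W, 1) intensity tensors in [0, 1]."""
    mae = tf.reduce_mean(tf.abs(l_true - l_rec))
    ssim = tf.reduce_mean(tf.image.ssim(l_true, l_rec, max_val=1.0))
    # SSIM is a similarity (1 is best), so it enters the loss as 1 - SSIM.
    return lam_mae * mae + lam_ssim * (1.0 - ssim)
```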

Color cues integration
As input to this phase, the predicted intensity (luminance) channel $\hat{L}_{rec}$, the reconstructed intensity image produced by the preceding Poisson solver stage, is concatenated with the color cue channels.
The decoder consists of two convolutional blocks, each comprising three convolutional layers. The first block starts with a transposed convolutional layer with stride = 2 and parameters (30, 2, None, 0), followed by (30, 3, LeakyReLU, 0) and (30, 3, LeakyReLU, 1). The output of the second block of the encoder is added to the output of the first convolutional layer in this block. The second block starts with a transposed convolutional layer with stride = 2 and parameters (20, 2, None, 0), followed by (20, 3, LeakyReLU, 0) and (20, 3, LeakyReLU, 1). The output of the first block of the encoder is added to the output of the first convolutional layer in this block. Finally, the U-Net ends with a last convolutional layer containing three filters, one for each of the three opponent channels, with no activation function.
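A rough Keras sketch of one decoder block is shown below. It interprets each tuple above as (filters, kernel size, activation, ...), leaving the final field aside since the text does not define it, and places the skip-addition after the first layer as described; this is a sketch under those assumptions, not the paper's exact layer configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder_block(x, encoder_skip, filters):
    # Transposed convolution, stride 2, no activation: e.g. (30, 2, None, 0).
    x = layers.Conv2DTranspose(filters, 2, strides=2, activation=None)(x)
    # The matching encoder block output is added to the first layer's output.
    x = layers.Add()([x, encoder_skip])
    # Two 3x3 convolutions with LeakyReLU: e.g. (30, 3, LeakyReLU, .).
    x = layers.LeakyReLU()(layers.Conv2D(filters, 3, padding="same")(x))
    x = layers.LeakyReLU()(layers.Conv2D(filters, 3, padding="same")(x))
    return x

# Usage, assuming encoder features enc1 and enc2 at matching resolutions:
# x = decoder_block(bottleneck, enc2, 30)   # first decoder block
# x = decoder_block(x, enc1, 20)            # second decoder block
# out = layers.Conv2D(3, 3, padding="same", activation=None)(x)  # RG, BY, I
```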
We used the network to minimize two cost functions: (1) the MAE between the real (original) opponent colors of the image and the predicted opponent colors,

$$\mathcal{L}_{opp} = \mathrm{MAE}\!\left(\left(\widehat{RG},\,\widehat{BY}\right),\,\left(RG,\,BY\right)\right),$$

and (2) the perceptual similarity between the original $I_{RGB}$ and the predicted $\hat{I}_{RGB}$ color images, utilizing the perceptual similarity metrics SSIM and LPIPS [23]:

$$\mathcal{L}_{per} = \lambda_{ssim}\left(1 - \mathrm{SSIM}\!\left(\hat{I}_{RGB},\,I_{RGB}\right)\right) + \lambda_{lpips}\,\mathrm{LPIPS}\!\left(\hat{I}_{RGB},\,I_{RGB}\right).$$

We defined $I_{RGB} = \mathrm{opp2rgb}(RG, BY, L)$ and $\hat{I}_{RGB} = \mathrm{opp2rgb}(\widehat{RG}, \widehat{BY}, \hat{L})$, where opp2rgb is the linear transformation from the opponent color space to the RGB color space, computed using the inverse of the opponent matrix $M_{\mathrm{opp}}$.
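A minimal NumPy sketch of opp2rgb follows, reusing the illustrative M_OPP defined in the color-cue sketch above; since the paper's exact chromatic matrix rows are not reproduced in the text, the matrix itself remains an assumption:

```python
import numpy as np

# Inverse of the (illustrative) opponent matrix M_OPP defined earlier.
M_OPP_INV = np.linalg.inv(M_OPP)

def opp2rgb(rg, by, intensity):
    """Map opponent channels (H x W each) back to an H x W x 3 RGB image."""
    opp = np.stack([rg, by, intensity], axis=-1)
    rgb = opp @ M_OPP_INV.T
    return np.clip(rgb, 0.0, 1.0)
```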

Baseline colorization
To compare the results of our method with other colorization techniques and to demonstrate the importance of color cues, we used DeOldify, a pre-trained open-source colorization tool, to colorize the reconstructed intensity channel ($\hat{L}$) generated by the model. DeOldify is an open-source CNN-based image colorization framework that employs deep neural networks trained on large, diverse datasets to assign appropriate colors to grayscale images. It is known for its impressive results, and its architecture is based on a U-Net design [24] built on a pretrained ResNet. As an encoder-decoder architecture, the U-Net is characterized by its ability to mix features from different scales via skip connections.

Implementation details
Our model was implemented using TensorFlow and was trained on an NVIDIA Tesla T4 GPU. The N-Caltech101 dataset sequences were divided into 6097 training sequences, 1306 validation sequences, and 1306 testing sequences. To optimize our model, we utilized the Adam optimizer [25] with a batch size of 16 and an initial learning rate of 0.001. We implemented a dynamic learning rate strategy, triggering a 20% decrease in the learning rate when the model reached a plateau, defined as a lack of sufficient improvement in the validation loss over 6 consecutive epochs:

$$\mathcal{L}_{val}^{(e)} > (1 - \alpha)\,\min_{e' < e}\,\mathcal{L}_{val}^{(e')},$$

where α is the minimal required relative improvement (here, α = 0.005). The minimum learning rate was set at 2 × 10⁻⁶. This strategy contributes to the model's robustness and adaptability, allowing for effective training and performance optimization. We trained each of the three color-cue bit-depth configurations for 150 epochs. We set $\lambda_{\nabla^2} = 100$, $\lambda_{PI,\mathrm{mae}} = 100$, $\lambda_{PI,\mathrm{ssim}} = 25$, $\lambda_{opp} = 150$, $\lambda_{ssim} = 100$, and $\lambda_{lpips} = 500$. We opted for a high LPIPS weight to strongly prioritize the enforcement of similarity in the learned features, emphasizing perceptual criteria during training.
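The training setup above could be expressed with standard Keras components, as in the sketch below. Note that Keras' min_delta is an absolute threshold, whereas the text describes a relative improvement α = 0.005, so this is an approximation rather than the exact schedule:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.8,        # 20% learning-rate decrease on plateau
    patience=6,        # epochs without sufficient improvement
    min_delta=0.005,   # improvement threshold (absolute in Keras)
    mode="min",
    min_lr=2e-6,       # learning-rate floor
)
# model.compile(optimizer=optimizer, loss=total_loss)
# model.fit(train_ds, validation_data=val_ds, epochs=150,
#           callbacks=[reduce_lr])
```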

Results
The results, obtained from the images acquired from the entire Caltech101 and the neuromorphic N-Caltech101 datasets (n = 1296), are summarized in table 1 and demonstrated in figure 4. The corresponding color cues with varying bit depths are shown in figure 2. To compare the results of our method with another colorization technique, as well as to demonstrate the importance of color cues, we used DeOldify, a pre-trained open-source tool, to colorize the reconstructed intensity channel $\hat{L}$ generated by the model; these results are labeled Events + DeOldify. Table 1 presents the SSIM, LPIPS, PSNR, and CIEDE2000 scores for the different methods. SSIM is a similarity metric for images, where a higher score indicates higher similarity to the reference image. LPIPS [23] is a perceptual metric for image quality assessment, where a lower score indicates better perceived image quality. Peak Signal-to-Noise Ratio (PSNR) quantifies the quality of a reconstructed signal by comparing it to the original, measuring the ratio of peak signal power to noise power. CIEDE2000 [26] is a color difference metric that quantifies perceptual color differences between two samples, considering factors such as lightness, chroma, and hue, to provide a more accurate representation of human visual perception.
Our results show that the LPIPS, SSIM, PSNR, and CIEDE2000 similarity measures improved when color cues were introduced to the model, both visually (figure 4) and quantitatively (table 1). The first row represents the baseline DeOldify colorization. The next three rows represent variations of our color-cue-enhanced method, in which the number of bits used to represent the cues was varied from 2 to 4. As expected, increasing the number of bits used to represent the color cues results in higher SSIM and PSNR scores and lower LPIPS and CIEDE2000 scores. The best performance was obtained using 4 bits to represent the color cues, with an SSIM score of 0.819, an LPIPS score of 0.134, a PSNR score of 21.23, and a CIEDE2000 score of 7.19. Using a higher bit depth does not further improve the results, suggesting a sweet spot of colorization efficiency for the given image size and resolution.
To illustrate the accumulated contributions of each stage in the model, figure 3 showcases each intermediate stage of the model. The first and second columns in figure 3 present the inputs to the model: color cues and event frames. The third column presents the output of the CNN, which predicts the Laplacian of the image intensity. The fourth column presents the intensity reconstructed by the Poisson integration layer. The U-Net's contribution to intensity reconstruction is presented in the fifth column. The sixth column exhibits both the color and intensity results from the U-Net. The last column presents the original image, denoted as the ground truth (GT). The results indicate that the U-Net enhances and refines the intensity details of the image and reconstructs more realistic color information from the provided color cues.
Our results demonstrate that the proposed network is capable of reconstructing high-quality images from events and color cues. We show that when color cues are of 2-bit depth and above, our proposed framework outperforms DeOldify's colorization in terms of SSIM, LPIPS, PSNR, and CIEDE2000. This is showcased in figure 4, where DeOldify's colorization yielded several images with a yellowish hue (see the flower and the map in figure 4). Moreover, as also demonstrated in figure 4, in some cases the images colorized by DeOldify remain achromatic, in contrast to the results of our proposed method. However, there are a few exceptions. For example, the body color of the woman shown in figure 4 is closer to the ground truth in DeOldify's colorization than in our proposed framework (pink vs. orange hue).

Discussion
In this work, we simulated a framework for the reconstruction of colorful images from events and auxiliary color cues using artificial neural networks. We evaluated our approach against a successful open-source colorization algorithm, DeOldify [27, 28], which has been trained on large datasets of colorful images from various domains. While DeOldify was able to colorize some images beautifully, such as the butterfly in figure 4, it struggled with others, such as the rooster and the saxophone. This highlights the fact that image colorization is an ill-posed problem [28]. Moreover, DeOldify's model is computationally intensive, with a large size of 171 MB, requiring significant memory and computational resources and resulting in longer inference times.
In contrast, our proposed U-Net architecture has only 88K parameters and a smaller size of 1.5 MB, resulting in lower memory usage and shorter inference times. We also compared the performance of our method and DeOldify using two evaluation metrics, SSIM and LPIPS. The results in table 1 show that our method outperformed DeOldify's colorization in terms of SSIM, while also achieving a significantly lower LPIPS score.
The color cues represent the chromatic information of the scene and are used to improve the accuracy of the reconstruction. We tested our method using color cues of different depths, ranging from 2 to 4 bits. Our results showed that increased color depth leads to a significant improvement in both SSIM and LPIPS scores (figure 4 and table 1), supporting the effectiveness of our approach in utilizing color information for event-based image reconstruction.
In addition to improving image quality, the use of color cues has the advantage of requiring less memory than methods that use RGB frames as auxiliary data, as reported in previous studies [14][15][16]. Specifically, color cues require only $\frac{H}{2} \times \frac{W}{2} \times 2 \times 2$ bits of memory (where H and W are the height and width of the image, respectively), as the color cues were down-sampled in both dimensions and 2 bits were used to represent each of the 2 color channels. In contrast, RGB frames require $H \times W \times 3 \times 8$ bits of memory (accounting for the height, width, 3 color channels, and 8 bits per channel).
Therefore, the memory size ratio between the RGB data and the color cues is

$$\frac{\mathrm{RGB}}{\mathrm{Color\;cues}} = \frac{H \times W \times 3 \times 8}{\frac{H}{2} \times \frac{W}{2} \times 2 \times 2} = 24$$

when 2 bits are used to represent the color cue channels. Similarly, when 3 bits are used to represent the color cue channels, the ratio decreases to 16, and for 4 bits it further decreases to 12. This highlights the efficiency of using color cues as auxiliary data in event-based vision systems. The use of color cues offers advantages over traditional methods employing RGB frames, as it requires significantly less memory, making it a more resource-efficient choice for edge vision systems. The integration of color cues also holds potential for real-world applications, enhancing image reconstruction, visual tracking, and object recognition in scenarios where existing methods typically produce grayscale reconstructions.
For our current work, we used the N-Caltech101 dataset to demonstrate the effectiveness of our approach. The images in this dataset were captured under fixed-target saccades, such that motion in these images was predetermined and did not represent real-life camera motion. Using a simple dataset under controlled conditions, we built a foundational proof of concept and systematically improved the reconstruction algorithm. This approach established a clear baseline for method development, validation, and comparison, enabling focused exploration of fundamental principles before scaling to complex real-world scenarios. Consequently, the proposed reconstruction algorithm was optimized to reconstruct still images rather than video sequences. For our model to fit real data, it will require an additional recurrent scheme and further training on larger and more diverse datasets. This will enable our approach to capture the real motion patterns present in real-world visual scenes. To handle real-world event data, our model's lightweight U-Net can be used to incorporate color cues into other reconstruction models that reconstruct videos from event streams, potentially providing a significant improvement over existing methods, which typically produce grayscale reconstructions [4][5][6][7][8][9][29]. With the aid of color information, image reconstruction, visual tracking, and object recognition can produce more accurate and visually appealing results.
Although there is currently no hardware for the generation of color cues in conjunction with event data, our work may pave the way toward the design of such an optimized system. Future directions could involve refining the current method to address real-world camera motion, enhancing performance on more realistic datasets, including low-light conditions [30], and optimizing real-time processing. Examining the balance between computational efficiency and reconstruction accuracy across various color cue depths is also important. Additionally, extending comparisons to a broader set of existing event-based vision systems and integrating color cues into video reconstruction models could enhance the method's practical applicability. Further adaptations, such as additional training on diverse datasets and the incorporation of recurrent schemes such as convolutional LSTMs [31], may improve its effectiveness in a wider range of real-world scenarios.

Figure 1. The architecture of the proposed reconstruction and colorization model.

Figure 3. The inputs and intermediate stages of the model. Adapted from CaltechDATA 2022. CC BY 4.0.