Polarization-driven Semantic Segmentation via Efficient Attention-bridged Fusion

Semantic Segmentation (SS) is promising for outdoor scene perception in safety-critical applications such as autonomous vehicles and assisted navigation. However, traditional SS is primarily based on RGB images, which limits its reliability in complex outdoor scenes, where RGB images lack the information dimensions needed to fully perceive unconstrained environments. As a preliminary investigation, we examine SS in an unexpected obstacle detection scenario, which demonstrates the necessity of multimodal fusion. Therefore, in this work, we present EAFNet, an Efficient Attention-bridged Fusion Network, to exploit complementary information coming from different optical sensors. Specifically, we incorporate polarization sensing to obtain supplementary information, considering its optical characteristics for robust representation of diverse materials. Using a single-shot polarization sensor, we build the first RGB-P dataset, which consists of 394 annotated pixel-aligned RGB-Polarization images. A comprehensive variety of experiments shows the effectiveness of EAFNet in fusing polarization and RGB information, as well as its flexibility to be adapted to other sensor combination scenarios.


Introduction
With the development of deep learning, outdoor scene perception and understanding has become a popular topic in the area of autonomous vehicles, navigation assistance systems for vulnerable road users like visually impaired pedestrians and mobile robotics [1]. Semantic Segmentation (SS) is a task to assign semantic labels to each pixel of the images, i.e., object classification task at the pixel level, which is promising for outdoor perception applications [2]. A multitude of SS neural networks have been proposed following the trend of deep learning like FCN [3], U-Net [4], ERFNet [5], SwiftNet [6] and so on.
However, the networks mentioned above mainly focus on the segmentation of RGB images, which makes it hard to fully perceive complex surrounding scenes because of the limited color information. Many works on domain adaptation have been presented to cope with SS in conditions without enough optical information [7,8]. Yet, a high level of safety needs to be guaranteed for outdoor scene perception to support safety-critical applications like autonomous vehicles, and algorithmic advances alone are limited in this respect. Multimodal semantic segmentation, which incorporates heterogeneous imaging techniques, is therefore well worth researching, as it can leverage various types of optical information like depth, infrared and event-based data [9,10]. In this paper, we employ polarization as the supplementary sensor information to advance the performance of RGB-based SS, considering its optical characteristics for robust representation of diverse materials. Polarization information is promising for advancing the segmentation of outdoor objects that possess polarization features. With this rationale, this work advocates polarization-driven multimodal SS, which is rarely explored in the literature.
To better explain why RGB sensors alone cannot cope with complex outdoor scene perception, and hence why multimodal semantic segmentation is necessary, we conduct a preliminary investigation in an unexpected obstacle detection scenario. In outdoor scenes, many unexpected obstacles such as tiny animals and boxes are risk factors for safe driving. We choose the Lost and Found dataset [11] to perform an experiment. The dataset is acquired by a pair of cameras with a baseline distance of 23 cm in 13 challenging outdoor traffic scenes with 37 different categories of tiny obstacles, and it provides three types of data as shown in Fig. 1, i.e., RGB image, disparity image and ground-truth label. The images have a resolution of 1024×2048, and the labels contain 3 categories, i.e., coarse annotations of passable areas, fine-grained annotations of unexpected tiny obstacles, and background. Among them, 1036 images are selected as the training set, while the remaining 1068 images are selected as the validation set. Considering the fact that outdoor scene perception applications demand high efficiency, we select a real-time network, SwiftNet [6], to conduct the experiment. We only take the RGB images as the input to train the network, and the other training settings are described in Section 4.1. The detailed results are shown in Table 3 and Fig. 2. The results show that severe over-fitting has appeared, and we find that the model trained merely on RGB images cannot satisfactorily detect small, unexpected obstacles. According to this toy experiment, the model's performance is unacceptable when trained only with RGB images. Therefore, we consider it necessary to incorporate additional sensor information for semantic segmentation to perceive outdoor traffic scenes. As mentioned above, we select polarization as the complementary information, whose potential has been shown in our previous works [12,13] for water hazard detection.
arXiv:2011.13313v1 [cs.CV] 26 Nov 2020
In this work, we leverage a novel single-shot RGB-P imaging sensor, and investigate polarization-driven semantic segmentation. To sufficiently fuse RGB-P information, we propose the Efficient Attention-bridged Fusion Network (EAFNet), enabling adaptive interaction of cross-modal features. In summary, we deliver the following contributions:
• Addressing polarization-driven semantic segmentation, we propose EAFNet, an efficient attention-bridged fusion network, which fuses multimodal sensor information with a lightweight fusion module, advancing many categories' accuracy, especially categories with polarization characteristics like glass, whose IoU is lifted to 79.3% from 73.4%. The implementations and codes will be made available at https://github.com/Katexiang/EAFNet.
• With a single-shot polarization imaging sensor, we present an RGB-P outdoor semantic segmentation dataset. To the best of our knowledge, this is the first RGB-P outdoor semantic segmentation dataset, which will be made publicly accessible at http://www.wangkaiwei.org/download.html.
• We conduct a series of experiments to demonstrate the effectiveness of EAFNet with comprehensive analysis, along with a supplementary experiment that verifies EAFNet's generalization capability for fusing other sensing data besides polarization information.

From Accurate to Efficient Semantic Segmentation
Convolutional Neural Networks (CNNs) have been the mainstream solution to semantic segmentation since Fully Convolutional Networks (FCNs) [3] approached the dense recognition task in an end-to-end way. SegNet [14] and U-Net [4] presented encoder-decoder architectures, which are widely used in subsequent networks. Benefiting from deep classification models like ResNets [15], PSPNet [16] and DeepLab [17] constructed multi-scale representations and achieved significant accuracy improvements. Inspired by the channel attention method proposed in SENet [18], EncNet [19] encoded global image statistics, while HANet [20] explored height-driven contextual priors. ACNet [21] leveraged attention connections and bridged multi-branch ResNets to exploit complementary features. In another line, DANet [22], OCNet [23] and CCNet [24] aggregated dense pixel-pair associations. These works have pushed the boundary of segmentation accuracy and attained excellent performance on existing benchmarks. In addition to accuracy, the efficiency of segmentation CNNs is crucial for real-time applications. Efficient networks such as ERFNet [5], DFANet [25] and SwiftNet [6] were designed for this purpose. They were built on techniques including early downsampling, filter factorization, multi-branch setup and ladder-style upsampling. Some efficient CNNs [26,27] also leveraged attention connections, trying to improve the trade-off between segmentation accuracy and computation complexity. With these advances, semantic segmentation can be performed both swiftly and accurately, and thereby has been incorporated into many optical sensing applications such as semantic cognition systems [28] and semantic visual odometry [26,29].

From RGB-based to Multimodal Semantic Segmentation
While ground-breaking network architectural advances have been achieved in single RGB-based semantic segmentation on existing RGB image segmentation benchmarks such as Cityscapes [30] and Mapillary Vistas [31], in some complex environments or under challenging conditions, it is necessary to employ multiple sensing modalities that provide complementary information of the same scene. Comprehensive surveys on multimodal semantic segmentation were presented in [2,9]. In the literature, researchers explored RGB-Depth [21,27], RGB-Infrared [32,33], RGB-Thermal [34,35], GRAY-Polarization [36,37] and Event-based [10,38] semantic segmentation to improve the reliability of surrounding sensing and the applicability towards real-world applications. For example, RFNet [27] fused RGB-D information on heterogeneous datasets, improving the robustness of SS in road-driving scenes with small-scale unexpected obstacles.
In this work, we focus on RGB-P semantic segmentation by using a single-shot polarization camera. Traditional polarization-driven dense prediction frameworks were mainly dedicated to the detection of water hazards [39,40] or the perception in indoor scenes [41,42]. In our previous works, we investigated the impact of loss functions on water hazard segmentation [43], followed by a comparative study on high-recall semantic segmentation [44]. Inspired by [41], dense polarization maps are predicted from RGB images through deep learning [1]. Instead, current polarization imaging technique makes it possible to sense pixel-wise polarimetric information in a single shot and has been integrated on perception platforms for autonomous vehicles [13]. Following this line, we present a multimodal semantic segmentation system with single-shot polarization sensing. Notably, we found previous collections [36,37,45] of polarization images were mainly gray images without providing RGB information that are critical for segmentation tasks. Besides, they were limited in terms of data diversity and entailed careful calibration between different cameras. In contrast, we are able to bypass the complex calibration and naturally obtain multimodal data with single-shot polarization imaging. As an important contribution of this work, a novel outdoor traffic scene RGB-P dataset is collected and densely annotated, which covers not only specular scenes but also diverse unstructured surroundings. The dataset will be made publicly available to the community to foster polarimetry-based semantic segmentation. Moreover, our work is related to transparent object segmentation [46,47].

Methodology
In this section, we derive the polarization image formation process and explain why polarization images contain rich information to complement RGB images for semantic segmentation. Then, we briefly introduce our integrated multimodal sensor and the novel RGB-P dataset. Finally, we present the Efficient Attention-bridged Fusion Network (EAFNet) for polarimetry-based multimodal scene perception.

Polarization Image Formation
Polarization is a significant attribute of objects, which can represent the optical characteristics of the surface material. Reflection and refraction occur when light strikes the interface between two media. Both the reflected and refracted light have a certain degree of polarization. The polarized light can be orthogonally decomposed into two components, both in a linear polarization state. We illustrate the importance of polarization according to the Fresnel equations:

$$
r_s = \frac{n_1\cos\theta_1 - n_2\cos\theta_2}{n_1\cos\theta_1 + n_2\cos\theta_2}, \quad
t_s = \frac{2 n_1\cos\theta_1}{n_1\cos\theta_1 + n_2\cos\theta_2}, \quad
r_p = \frac{n_2\cos\theta_1 - n_1\cos\theta_2}{n_2\cos\theta_1 + n_1\cos\theta_2}, \quad
t_p = \frac{2 n_1\cos\theta_1}{n_2\cos\theta_1 + n_1\cos\theta_2}
\tag{1}
$$

where $r$ and $t$ are the reflected and refracted portions of the incoming light, the subscripts $s$ and $p$ denote perpendicular polarization and parallel polarization, $n_1$ and $n_2$ are the refractive indices of the two media, and $\theta_1$ and $\theta_2$ are the angles of the incident light and the refracted light, respectively. From Eq. (1), we find that the optical characteristics of the surface material affect the intensity of the two orthogonally polarized components. Therefore, the orthogonally polarized light can partially characterize the surface material.
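To make the dependence on material concrete, the following is a minimal sketch that evaluates the standard Fresnel amplitude coefficients for a given pair of refractive indices. The function name `fresnel_amplitudes`, the air-to-glass indices in the usage note, and the sign convention chosen for `r_p` are illustrative assumptions, not part of the original work.

```python
import math

def fresnel_amplitudes(n1, n2, theta1_deg):
    """Return (r_s, r_p, t_s, t_p) for light travelling from medium n1 into n2.

    theta1_deg is the angle of incidence in degrees. The sign convention for
    r_p is one common choice and is an assumption of this sketch.
    """
    theta1 = math.radians(theta1_deg)
    # Snell's law gives the refraction angle: n1 * sin(theta1) = n2 * sin(theta2)
    sin_theta2 = n1 * math.sin(theta1) / n2
    if abs(sin_theta2) > 1.0:
        raise ValueError("total internal reflection: no refracted ray")
    theta2 = math.asin(sin_theta2)
    c1, c2 = math.cos(theta1), math.cos(theta2)
    r_s = (n1 * c1 - n2 * c2) / (n1 * c1 + n2 * c2)
    r_p = (n2 * c1 - n1 * c2) / (n2 * c1 + n1 * c2)
    t_s = 2 * n1 * c1 / (n1 * c1 + n2 * c2)
    t_p = 2 * n1 * c1 / (n2 * c1 + n1 * c2)
    return r_s, r_p, t_s, t_p
```

For example, with the assumed indices `n1 = 1.0` (air) and `n2 = 1.5` (a typical glass), the parallel reflection coefficient vanishes near the Brewster angle `math.degrees(math.atan(1.5))`, so the reflected light there is almost purely perpendicular polarized. This is the kind of material-dependent cue that a polarization sensor can pick up.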
The polarization image formation can be reduced to the model shown in Fig. 3. In outdoor scenes, the light source is mainly sunlight. When sunlight shines on an object such as a car, polarized reflection occurs. Then, the reflected light with its orthogonally polarized portions enters the camera with a polarization sensor, and the optical information with polarized characteristics is recorded by the sensor. The photoelectric sensor can record the polarized information because its surface is covered by a polarization mask layer with four different polarization directions, and only light with the matching polarization direction can pass through the layer.
Here, we briefly introduce the polarization parameters, namely the Degree of Linear Polarization (DoLP) and the Angle of Linear Polarization (AoLP). They are the key elements that contribute to the advancement of multimodal semantic segmentation. They are derived from the Stokes vector $S$, which is composed of four parameters, i.e., $S_0$, $S_1$, $S_2$ and $S_3$. More precisely, $S_0$ stands for the total light intensity, $S_1$ stands for the superiority of the parallel polarized portion over the perpendicular polarized portion, and $S_2$ stands for the superiority of the 45° polarized portion over the 135° polarized portion. $S_3$, associated with circularly polarized light, is not involved in our work on multimodal semantic segmentation. They can be derived by:

$$
S_0 = I_0 + I_{90} = I_{45} + I_{135}, \quad S_1 = I_0 - I_{90}, \quad S_2 = I_{45} - I_{135}
\tag{2}
$$

where $I_0$, $I_{45}$, $I_{90}$ and $I_{135}$ are the light intensities at the corresponding polarization directions. DoLP and AoLP are then given by:

$$
\mathrm{DoLP} = \frac{\sqrt{S_1^2 + S_2^2}}{S_0}
\tag{3}
$$

$$
\mathrm{AoLP} = \frac{1}{2}\arctan\left(\frac{S_2}{S_1}\right)
\tag{4}
$$

According to Eq. (3), the range of DoLP is from 0 to 1. For partially polarized light, DoLP ∈ (0, 1). For completely polarized light, DoLP = 1. In other words, DoLP measures how strongly the light is linearly polarized. AoLP ranges from 0° to 180°. AoLP can reflect an object's silhouette information, because objects with the same material normally possess similar AoLP. Therefore, AoLP acts as a natural scene segmentation mask. In other words, objects of the same category or with the same material have similar AoLP. We generate a visualization of a set of DoLP and AoLP polarization images, as shown in Fig. 4. We find that the glass area and vegetation area have high DoLP, while other areas have low DoLP, so DoLP offers limited information that is concentrated on areas with polarized characteristics. Besides, the left part of the vegetation and the sky cannot be distinguished merely depending on DoLP. On the contrary, for AoLP, we find that areas of the same category show good continuity of polarization information, which indicates strong spatial priors for SS. AoLP offers a better representation of spatial information, which keeps a consistent distribution within the same category or material, like vegetation, sky, road and glass.
In Section 4, we will further analyze the AoLP's great potential in providing extra spatial information for SS over DoLP with extensive experiments.
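The per-pixel computation of DoLP and AoLP from the four polarized intensities can be sketched as follows. This is a minimal illustration under the usual linear Stokes definitions; the function name `linear_polarization`, the `eps` guard against division by zero, and the use of `atan2` with a wrap into [0°, 180°) are assumptions of this sketch rather than details taken from the released implementation.

```python
import math

def linear_polarization(i0, i45, i90, i135, eps=1e-12):
    """Return (DoLP, AoLP in degrees) for one pixel from four polarized intensities."""
    s0 = i0 + i90    # total intensity
    s1 = i0 - i90    # parallel vs. perpendicular portion
    s2 = i45 - i135  # 45 degree vs. 135 degree portion
    dolp = math.sqrt(s1 * s1 + s2 * s2) / max(s0, eps)
    # atan2 keeps the quadrant information; the modulo maps the angle to [0, 180)
    aolp = (0.5 * math.degrees(math.atan2(s2, s1))) % 180.0
    return dolp, aolp
```

As a sanity check, fully linearly polarized light at 30° observed through ideal polarizers gives intensities proportional to cos² of the angle difference, and the function then returns a DoLP of 1 and an AoLP of 30°, whereas four equal intensities (unpolarized light) give a DoLP of 0.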

Integrated Multimodal Sensor and ZJU-RGB-P Dataset
The RGB-P outdoor scene dataset is captured by using our integrated multimodal sensor for autonomous driving [13], as shown in Fig. 5. The sensor is a highly integrated system which is a combination of polarization sensor, RGB sensor, infrared sensor and depth sensor. The sensor captures polarization information with an RGB-based imaging sensor LUCID_PHX050S, which is an RGB-P sensor. The difference between LUCID_PHX050S and Gray-Polarization sensors is that the former is covered with an extra Bayer array besides the polarization mask.
In addition, the multimodal sensor integrates an embedded system which combines hardware and software, by which we can attain various types of information like semantic information, infrared information, stereo depth information, monocular depth information and surface normal information by utilizing relevant estimation algorithms. Fig. 5(b) shows some examples of the sensor's output information. The highly integrated sensor can broaden the application scenarios of RGB-based sensors [13]. The infrared information can assist nighttime semantic segmentation, and the polarization-RGB-infrared multimodal sensor can offer precise depth information by pairing the sensors with different baselines. We leverage the multimodal sensor to attain pixel-aligned polarization and RGB images, and the main purpose of this work is to adapt RGB-based SS to polarization-driven multimodal SS. RGB-Polarization outdoor scene SS datasets are scarce in the literature. Some research groups have realized the importance of polarization information for outdoor perception. The Polabot dataset [36,37], a GRAY-Polarization outdoor scene SS dataset, consists of around 180 pairs of images at a low resolution of 230×320. The limited number of images and the low resolution make it hard to train a robust SS network for outdoor scenes. In addition, the dataset lacks RGB information, which provides important texture features for classification tasks.
Addressing the scarcity, we build the first RGB-P outdoor scene dataset which consists of 394 annotated pixel-aligned RGB-Polarization images. We collect the images with abundant and complex scenes at Yuquan Campus, Zhejiang University, as shown in Fig. 6. The scenes of the dataset cover road scenes around teaching building area, canteen area, library and so on to provide diverse scenes and to reduce the risk of over-fitting in training SS models.
The resolution of our dataset is 1024×1224, which makes it possible to apply data augmentation like random cropping and random rescaling, which are crucial for improving data diversity and attaining robust segmentation [28]. We label the dataset with 9 classes at the pixel level, i.e., Building, Glass, Car, Road, Vegetation, Sky, Pedestrian, Bicycle and Background. An example of the dataset is shown in Fig. 7, which consists of four pixel-aligned RGB images at four polarization directions and an SS label. However, AoLP and DoLP are the polarization representations ultimately integrated into SS, so the four polarized RGB images need to be processed according to Eq. (3) and Eq. (4). Finally, we select 344 images as the training set, and the other 50 images as the validation set. We name it the ZJU-RGB-P dataset.

Efficient Attention-bridged Fusion Network
In order to combine RGB and polarization features, we present EAFNet, an Efficient Attention-bridged Fusion Network to exploit multimodal complementary information, whose architecture is shown in Fig. 8. Inspired by SwiftNet [6] and our previous SFN [44] with a U-shape encoder-decoder structure, EAFNet is designed to keep a similar architecture with downsampling paths to extract features and an upsampling module to restore the resolution, together with EAC modules to fuse features from RGB and polarization images. Here, we give a brief overview of EAFNet according to Fig. 8. EAFNet is designed to have a three-branch structure with downsampling paths of the same type. They are the RGB branch, the polarization branch and the fusion branch. In order to improve computation efficiency, we employ ResNet-18 [15], a light-weight encoder, to extract and fuse features. After attaining the downsampled and fused features, a spatial pyramid pooling (SPP) module [6,48] is leveraged to enlarge the valid receptive field. Then, a series of upsampling modules are leveraged to restore the feature resolution. Like SwiftNet, EAFNet employs a series of convolution layers with a kernel size of 1×1 to connect features between shallow layers and deep layers. The key innovation here lies in the carefully designed fusion module of EAFNet, namely the EAC module, with inspiration gathered from the Efficient Channel Attention Network [49]. With this architecture, EAFNet is a real-time network, whose inference speed on a GTX 1080Ti reaches 24 FPS (Frames Per Second) at a resolution of 512×1024.
EAC Module is an efficient attention complementary module which is designed for extracting informative RGB features and polarization features, as shown in Fig. 9. The EAC module is an efficient version of the Attention Complementary Module (ACM) [21], which replaces fully connected layers with 1D convolution layers whose kernel sizes are adaptively determined according to the channel number of the corresponding feature maps. On the one hand, the structure reduces computation complexity compared with ACM due to the use of local cross-channel interactions rather than all channel-pair interactions. On the other hand, local cross-channel interaction effectively avoids the loss of information caused by the dimension reduction in learning channel attention.
Assuming the input feature map is $A \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$ and $C$ are the height, width and channel number of the input feature map, respectively, we first apply a global average pooling layer to $A$. Then, we obtain a feature vector $V = [V_1, V_2, \ldots, V_C] \in \mathbb{R}^{1 \times C}$, where the subscript denotes the channel index. The $k$-th ($k \in [1, C]$) element of $V$ can be expressed as:

$$
V_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} A_k(i, j)
$$

Then, the vector $V$ needs to be reorganized by a convolution layer with an adaptive kernel size $K$ to obtain a more meaningful vector $Q = [Q_1, Q_2, \ldots, Q_C] \in \mathbb{R}^{1 \times C}$. $K$ is the key to attaining the local cross-channel interaction attention weights, and it is acquired using $t$:

$$
t = \mathrm{int}\left( \left| \frac{\log_2(C) + b}{\gamma} \right| \right)
$$

where $b$ and $\gamma$ are hyper-parameters, which are set as 1 and 2 in our experiments, respectively. If $t$ is divisible by 2, $K$ is equal to $t$, otherwise $K$ is equal to $t$ plus 1. As the channel depth grows, the EAC module can attain interaction among more channels. To limit the range of $Q$, the sigmoid activation function $\sigma(\cdot)$ is applied to it. $\sigma(\cdot)$ can be expressed as:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

Then, we can get the final attention weights $\omega = [\omega_1, \omega_2, \ldots, \omega_C] \in \mathbb{R}^{1 \times C}$. All the elements of $\omega$ are in the range between 0 and 1. In other words, each element of $\omega$ can be viewed as the key weight of the corresponding channel of the input feature map. Finally, we perform a channel-wise product of $A$ and $\omega$ to get the adjusted feature map $\tilde{A} \in \mathbb{R}^{H \times W \times C}$. Thereby, RGB features and polarization features can be adjusted dynamically by the EAC module. Then, the adjusted features will be fused by the fusion module.
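The steps of the EAC module (global average pooling, a 1D convolution across channels with an adaptive kernel size, a sigmoid, and channel-wise re-weighting) can be sketched in plain Python as below. This is only an illustration under stated assumptions: all function names are hypothetical, the convolution kernel is a fixed placeholder where the real network would use learned weights, zero padding at the channel borders is assumed, and the kernel-size rule follows the wording of the text (the ECA-Net reference implementation instead rounds to the nearest odd size).

```python
import math

def adaptive_kernel_size(channels, b=1, gamma=2):
    """Kernel size K from the channel number, following the rule stated in the text."""
    t = int(abs((math.log2(channels) + b) / gamma))
    # Text rule: keep t when it is divisible by 2, otherwise use t + 1.
    # (ECA-Net's reference code picks the nearest odd value instead.)
    return t if t % 2 == 0 else t + 1

def global_average_pool(feature):
    """feature: list of C channels, each an H x W nested list. Returns C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]

def conv1d_same(vec, kernel):
    """Zero-padded 1D convolution across channels that preserves the length."""
    left = (len(kernel) - 1) // 2
    out = []
    for c in range(len(vec)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = c - left + j
            if 0 <= idx < len(vec):
                acc += w * vec[idx]
        out.append(acc)
    return out

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def eac_attention(feature, kernel):
    """Return (attention weights, re-weighted feature map) for one branch."""
    pooled = global_average_pool(feature)
    weights = [sigmoid(q) for q in conv1d_same(pooled, kernel)]
    adjusted = [[[w * px for px in row] for row in ch]
                for ch, w in zip(feature, weights)]
    return weights, adjusted
```

With these assumptions, the four ResNet-18 stage widths 64, 128, 256 and 512 give kernel sizes 4, 4, 4 and 6. A call such as `eac_attention(feature, [1.0 / k] * k)` with `k = adaptive_kernel_size(len(feature))` uses a uniform placeholder kernel purely for demonstration.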
Fusion Module is leveraged to fuse the adjusted feature maps from the RGB branch and the polarization branch following the EAC modules. As mentioned above, the fusion branch has the same structure as the RGB branch and the polarization branch. The main difference lies in the input feature flow. Assume that at the $i$-th downsampling stage, the RGB branch's feature map is $R_i \in \mathbb{R}^{H \times W \times C}$, and the polarization branch's feature map is $P_i \in \mathbb{R}^{H \times W \times C}$. Fig. 10 illustrates one layer of the fusion branch for the fusion process. The left part is the RGB feature, and the right part is the polarization feature, while the feature flowing through the center arrow is the fusion feature $M_i$ from the previous fusion stage. Then, the fused feature $M_{i+1}$ at the current stage can be expressed as:

$$
M_{i+1} = R_i \otimes \omega_i^{R} + P_i \otimes \omega_i^{P} + M_i
$$

where $\omega_i^{R}$ and $\omega_i^{P}$ are the attention weights produced by the EAC modules of the RGB branch and the polarization branch, $\otimes$ denotes channel-wise multiplication, and $M_{i+1}$, working as the fusion feature, is passed into the fusion branch to extract higher-level features. It should be noted that the first fusion stage of our EAFNet architecture only has the RGB feature and the polarization feature as input information.
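A single fusion stage can be sketched as a channel-wise re-weighting of both branches followed by an element-wise sum with the previous fusion feature. The helper names `scale_channels`, `add_maps` and `fuse_stage` are hypothetical, the feature maps are plain nested lists indexed as channel, row, column, and the attention weights are passed in directly rather than computed by a network.

```python
def scale_channels(feature, weights):
    """Multiply every value of channel c by weights[c]."""
    return [[[w * px for px in row] for row in ch]
            for ch, w in zip(feature, weights)]

def add_maps(a, b):
    """Element-wise sum of two equally shaped nested lists."""
    if isinstance(a, list):
        return [add_maps(x, y) for x, y in zip(a, b)]
    return a + b

def fuse_stage(rgb_feat, rgb_weights, pol_feat, pol_weights, prev_fused=None):
    """Fuse one stage; prev_fused is None at the first fusion stage."""
    fused = add_maps(scale_channels(rgb_feat, rgb_weights),
                     scale_channels(pol_feat, pol_weights))
    if prev_fused is not None:
        fused = add_maps(fused, prev_fused)
    return fused
```

For instance, with the attention weights 0.5244 (RGB) and 0.4214 (AoLP) reported for one channel in the analysis of the EAC module, an RGB response of 1.0 and a polarization response of 2.0 fuse to 0.5244 × 1.0 + 0.4214 × 2.0 = 1.3672 at the first stage.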

Experiments and Analysis
In this section, the implementation details and a series of experiments with comprehensive analysis are presented.

Implementation Details
The experiments concerning polarization fusion are performed on our ZJU-RGB-P dataset, while the preliminary experiment detailed in Section 1 and the supplement experiment are performed on the Lost and Found dataset [11]. The remaining implementation details are the same for all of the experiments. For data augmentation, we first scale the images with random factors between 0.75 and 1.25, then we randomly crop the images with a crop size of 768×768, followed by random horizontal flipping. It is worth noting that the random horizontal flipping of AoLP differs critically from that of DoLP and RGB images. According to Eq. (4), when horizontal flipping is applied to the RGB images at the four polarization directions, the AoLP becomes:

$$
\mathrm{AoLP}_{f} = 180^{\circ} - \mathrm{AoLP}'
$$

where $\mathrm{AoLP}_{f}$ is the final horizontally flipped AoLP image, and $\mathrm{AoLP}'$ is merely the spatially horizontally flipped version of the initial AoLP image. After all the data augmentation, all the processed images are normalized to the range between 0 and 1.
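The special treatment of AoLP under horizontal flipping can be sketched as follows: RGB and DoLP images are simply mirrored, while AoLP is mirrored and its angles are reflected, since a mirror exchanges the 45° and 135° directions. The function names are hypothetical, images are plain nested lists of rows, and the final modulo (which keeps a 0° angle at 0° instead of 180°) is an assumption added for this sketch.

```python
def hflip(image):
    """Mirror a 2D image (list of rows) left to right; valid for RGB planes and DoLP."""
    return [list(reversed(row)) for row in image]

def hflip_aolp(aolp_deg):
    """Mirror an AoLP image in degrees and reflect each angle: 180 - AoLP'."""
    return [[(180.0 - a) % 180.0 for a in reversed(row)] for row in aolp_deg]

def normalize_aolp(aolp_deg):
    """Scale AoLP from degrees in [0, 180) to the range [0, 1)."""
    return [[a / 180.0 for a in row] for row in aolp_deg]
```

Applying `hflip_aolp` twice returns the original image, which is a quick check that the angle reflection is consistent with the spatial mirroring.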
We use TensorFlow and an NVIDIA GeForce GTX 1080Ti GPU to implement EAFNet and perform training. We use the Adam optimizer [50] with an initial learning rate of $4 \times 10^{-4}$. We decay the learning rate with cosine annealing to a minimum of $2.5 \times 10^{-3}$ times the initial learning rate by the final epoch. To combat over-fitting, we use L2 weight regularization with a weight decay of $1 \times 10^{-4}$. Unlike prior works [6,27], we have not adopted any pre-trained weights, in order to investigate the effectiveness of multimodal SS, with the aim of reaching high performance even with limited pairs of polarization images. We utilize the cross entropy loss to train all the models with a batch size of 8. We evaluate with the standard Intersection over Union (IoU) metric.
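The learning rate schedule and the evaluation metric can be illustrated with a short sketch. The cosine formula below is the common annealing form and is assumed to match the training setup; `total_epochs` is a hypothetical parameter because the number of epochs is not stated here, and `class_iou` works on flattened label lists for simplicity.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=4e-4, min_ratio=2.5e-3):
    """Cosine-annealed learning rate, decaying from base_lr to base_lr * min_ratio."""
    lr_min = base_lr * min_ratio
    progress = epoch / total_epochs
    return lr_min + 0.5 * (base_lr - lr_min) * (1.0 + math.cos(math.pi * progress))

def class_iou(pred, label, cls):
    """Intersection over Union of one class for flattened prediction/label lists."""
    tp = sum(1 for p, g in zip(pred, label) if p == cls and g == cls)
    fp = sum(1 for p, g in zip(pred, label) if p == cls and g != cls)
    fn = sum(1 for p, g in zip(pred, label) if p != cls and g == cls)
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")

def mean_iou(pred, label, classes):
    """Average of the per-class IoU values over the given class ids."""
    return sum(class_iou(pred, label, c) for c in classes) / len(classes)
```

With the stated settings, the schedule starts at 4e-4 and ends at 4e-4 × 2.5e-3 = 1e-6 in the final epoch.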

Results and Analysis
Both of AoLP and DoLP can represent polarization information of scenes, but which is the better to be fused into polarization-driven SS remains an open question.
We have made an intuitive analysis of the superiority of AoLP over DoLP for polarization-driven SS in Section 3.1. As a preliminary investigation, according to Fig. 4, we find that AoLP's distribution differs remarkably from that of DoLP. Further, we present the statistics of the value distributions of DoLP and AoLP on the ZJU-RGB-P training set, as shown in Fig. 11. The majority of pixels in the training set have a small DoLP ranging from 0 to 0.4, while the portion of pixels whose DoLP values are larger than 0.4 is rather low, indicating that DoLP offers limited information, merely on categories with highly polarized characteristics.

Table 1. Accuracy analysis on ZJU-RGB-P including per-class accuracy in IoU (%) and mean IoU (mIoU).

For the basic control experiment, we first train the RGB-only SwiftNet on the ZJU-RGB-P dataset as our Baseline. Then, four sets of training are performed for comparison. As shown in Fig. 8, EAFNet is a two-path network, where RGB images and polarization information are input into different paths. To explore the better polarization feature, we select AoLP (marked as AoLP-EX) and DoLP (marked as DoLP-EX) as the input polarization information, respectively. Considering the fact that both AoLP and DoLP can offer polarization information, we also concatenate AoLP and DoLP images along the channel dimension to build a polarization representation for training a variant model of EAFNet (marked as A/D-EX). Finally, we build a three-path version of EAFNet, where we select one path as the RGB path and the other two paths as polarization paths. AoLP and DoLP are passed into the two polarization paths (marked as 3-Path-EX).
All quantitative results of the experiments are shown in Table 1. It can be seen that models combined with polarization information can advance the segmentation of objects with polarization characteristics like glass (73.4% to 79.3%), car (91.6% to 93.7%) and bicycle (82.5% to 86.0%). In addition, we observe that not only is the IoU of classes with polarization characteristics advanced, but the IoU of other classes is also improved to a great extent when combined with polarization information, especially pedestrian (36.1% to 63.8%). Meanwhile, the mIoU is lifted to 85.7% from 80.3%.
Further, we compare and discuss among the groups that fuse polarization-based features. As shown in Table 1, AoLP-EX is the optimal setting, while 3-Path-EX is the worst group. Here, we analyze from the view of data distribution and model complexity. Our main focus is on the classes like glass and car, as the initial motivation of this study is to lift the segmentation performance of objects with polarization characteristics. Making a comparison between AoLP-EX and DoLP-EX, we find that the former can better advance the IoU of glass and car than the latter. It is the data distribution that counts. The analysis of the different distributions in previous sections shows that AoLP offers a better spatial representation like contour information than DoLP, while DoLP only offers meaningful information on areas with high polarization information. In this sense, AoLP provides richer priors and complementary information for RGB-P segmentation. The reason why A/D-EX can attain higher IoU values on glass and car than DoLP-EX is that AoLP complements the spatial features of DoLP. However, it reaches a lower IoU on glass than AoLP-EX, because the interference between DoLP and AoLP features occurs, bringing some side effects and losing some serviceable information. 3-Path-EX is the worst group as the complex architecture of this model prevents the model from exploiting most informative features. Besides, the 3-path structure impairs the capacity of RGB features, which is critical for outdoor scene perception. Eventually, we conclude from the quantitative analysis that utilizing AoLP images to feed in the polarization path can greatly advance polarization-driven segmentation performance.
For qualitative results, we use the Baseline RGB-only model and the AoLP-EX polarization-driven model, i.e., SwiftNet and our EAFNet fed with AoLP, on the ZJU-RGB-P validation set to produce a series of visualization examples, as illustrated in Fig. 12. We find that SwiftNet wrongly segments the pedestrians as cars in the first row of Fig. 12, where EAFNet detects them correctly. In the second row, EAFNet successfully distinguishes glass from the car, while SwiftNet cannot segment the full glass area. Moreover, SwiftNet even segments part of the car as road and part of the pedestrian as vegetation according to the last row of results in Fig. 12. Such wrong segmentation results in outdoor traffic scenes can lead to dangerous situations and even accidents once the model is used to guide autonomous vehicles or assisted navigation [10,28]. These examples show that polarization-driven SS can supply information that is missing when relying merely on RGB images. Therefore, multimodal SS is beneficial for semantic understanding in pursuit of robust outdoor scene perception.

Analysis of EAC Module
The EAC module is the key module of EAFNet, which extracts the attention weights of the RGB branch and the AoLP branch. To better demonstrate the effect of the EAC module, we visualize the feature maps of the fourth downsampling block of the RGB and AoLP branches, together with their EAC attention weights, as shown in Fig. 13. We only visualize the feature maps of the first 16 channels. Here, (i, j) denotes the feature map at the i-th row and the j-th column of the visualization grid, each of which corresponds to one attention weight, and some insightful results can be found. In the RGB branch, we find that the car and glass areas have low responses in the feature maps. On the contrary, the corresponding areas of the AoLP branch have high responses, especially at (2, 4), (3, 2) and (4, 4). Then, the EAC module of each branch extracts its attention weights. Taking (3, 2) as an example to illustrate the complement process, this channel's attention weights are 0.5244 and 0.4214 for the RGB and AoLP branches, respectively. Then, the corresponding feature map is multiplied by the attention weight. Finally, the adjusted feature maps are added up to build the final feature map, and it can be clearly seen that the resulting feature map highlights the car area well.
As it can be seen in Fig. 13, the attention weights of RGB are higher than those of AoLP in most cases. Here, we evaluate the weights generated by EAC Module at all levels and illustrate the average of them as shown in Fig. 14. According to the curve, it can be easily observed that the RGB branch possesses higher weights than the AoLP branch at Layer1, Layer2, Layer3 and Layer4. On the contrary, the AoLP branch has a higher weight than the RGB branch at the first downsampling block, i.e., conv0. As mentioned before, AoLP offers a representation of spatial information and rich priors. Therefore, at the beginning of EAFNet, AoLP offers more distinguished features than RGB. With the features flowing into deeper layers, RGB branch becomes overwhelming. In addition, both of the weight curves keep a similar variation trend and reach the highest at Layer3.

Ablation Study
To better illustrate that fusing polarization information is beneficial, and to verify that EAFNet fuses features better than a simple concatenation of RGB and polarization images, we perform two additional training runs. First, we train SwiftNet directly on AoLP images, which is denoted as SN-AoLP. Second, we train SwiftNet on the concatenation of RGB and AoLP images, denoted as SN-RGB/A. We also include Baseline and AoLP-EX from the previous experiment in the ablation study, as shown in Table 2. We find that SN-AoLP reaches a decent performance, which benefits from the spatial priors of AoLP. The comparison between Baseline and SN-AoLP shows that RGB possesses more distinctive features than AoLP, while AoLP offers meaningful information as well. Moreover, SN-RGB/A does improve the segmentation of classes with polarization characteristics such as glass (73.4% to 75.6%), but it does not yield remarkable benefits for all classes, in contrast with our AoLP-EX. In addition, SN-RGB/A even causes a slight mIoU degradation from 80.3% to 80.2%. This degradation is attributed to the difference between the RGB and AoLP distributions: the interference between RGB and AoLP has an adverse impact on the extraction of distinguishable features. Comparing all the groups, we conclude that AoLP-EX reaches the highest accuracy on all classes, indicating the effectiveness of our EAFNet and the designed polarization fusion strategy.

Table 2. Accuracy analysis of the ablation study on ZJU-RGB-P (%).

Supplement Experiment
To demonstrate that EAFNet can be adapted to sensor combinations other than polarization, we feed disparity images to EAFNet in the unexpected obstacle detection scenario, i.e., the preliminary investigation mentioned in Section 1.
According to Fig. 1, disparity images reflect the contours of the tiny obstacles. Combining RGB and disparity images with EAFNet is therefore a promising way to address the poor results obtained with RGB data alone. Considering the similar distribution between disparity and AoLP images, we train EAFNet on pixel-aligned disparity and RGB images (marked as EAFNet-RGBD). We mark the baseline group as SwiftNet-RGB. All the training strategies are set according to Section 4.1. From the results in Table 3, we observe a remarkable performance gain with the aid of EAFNet and complementary disparity images: the precision and IoU of obstacle segmentation rise from 26.5% to 76.2% and from 20.9% to 52.7%, respectively. This indicates that combining disparity images with EAFNet is effective. To give an intuitive view of the effect of EAFNet-RGBD, we visualize an example in Fig. 15, where EAFNet-RGBD successfully segments the obstacles and most of the road area, whereas SwiftNet-RGB misses most of the road and the obstacles. Therefore, it is essential to perform multimodal semantic segmentation with complementary sensing information, such as polarization-driven and depth-aware features, to obtain a reliable and holistic understanding of outdoor traffic scenes.
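For reference, the two obstacle metrics quoted above can be computed from binary masks as sketched below. The text does not spell out the evaluation code, so this assumes the standard definitions, precision = TP / (TP + FP) and IoU = TP / (TP + FP + FN); the function name `precision_and_iou` and the toy masks are hypothetical.

```python
import numpy as np


def precision_and_iou(pred_mask, target_mask):
    """Compute precision and IoU for one class from binary masks.

    pred_mask, target_mask: arrays of the same shape; nonzero entries
    mark pixels assigned to the class (e.g. "obstacle").
    """
    pred = np.asarray(pred_mask, dtype=bool)
    target = np.asarray(target_mask, dtype=bool)
    tp = int(np.logical_and(pred, target).sum())    # true positives
    fp = int(np.logical_and(pred, ~target).sum())   # false positives
    fn = int(np.logical_and(~pred, target).sum())   # false negatives
    # Return 0.0 when a denominator is empty to avoid division by zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
    return precision, iou


# Toy 2x2 masks: one correct pixel, one false alarm, one missed pixel.
toy_pred = np.array([[1, 1], [0, 0]])
toy_target = np.array([[1, 0], [1, 0]])
print(precision_and_iou(toy_pred, toy_target))
```

Precision penalizes only false alarms, while IoU also penalizes missed obstacle pixels. This is why the two numbers can differ substantially for small objects.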

Conclusion and Future Work
In this paper, we propose EAFNet, which fuses features of RGB and polarization images. We build ZJU-RGB-P with our integrated multimodal vision sensor, which is, to the best of our knowledge, the first RGB-polarization semantic segmentation dataset. EAFNet dynamically extracts the attention weights of the RGB and polarization branches, then adjusts and fuses the multimodal features, significantly advancing the segmentation performance, especially on classes with highly polarized characteristics such as glass and car. Extensive experiments demonstrate the effectiveness of EAFNet in incorporating features from different sensing modalities and its flexibility to be adapted to other sensor combinations such as RGB-D perception. Therefore, EAFNet is a multimodal SS model that can be utilized in diverse real-world applications.
In the future, two research paths can be explored. One is to build further multimodal datasets with the integrated multimodal vision sensor, such as an RGB-Infrared dataset to address nighttime scene understanding. The other is to extend the categories of ZJU-RGB-P to cope with the detection of transparent objects, ice and water hazards.

Funding
This research was supported by the ZJU-Sunny Photonics Innovation Center (No. 2020-03). This research was also funded in part through the AccessibleMaps project by the Federal Ministry of Labor and Social Affairs (BMAS) under the Grant No. 01KM151112.