Exploiting the Distortion-Semantic Interaction in Fisheye Data

In this work, we present a methodology to shape a fisheye-specific representation space that reflects the interaction between distortion and semantic context present in this data modality. Fisheye cameras offer a wider field of view than other types of cameras, but this comes at the expense of high radial distortion. As a result, objects further from the center exhibit deformations that make it difficult for a model to identify their semantic context. While previous work has attempted architectural and training augmentation changes to alleviate this effect, no work has attempted to guide the model towards learning a representation space that reflects this interaction between distortion and semantic context inherent to fisheye data. We introduce an approach to exploit this relationship by first extracting distortion class labels based on an object's distance from the center of the image. We then shape a backbone's representation space with a weighted contrastive loss that constrains objects of the same semantic class and distortion class to be close to each other within a lower dimensional embedding space. This backbone, trained with both semantic and distortion information, is then fine-tuned within an object detection setting to empirically evaluate the quality of the learnt representation. We show that this method leads to performance improvements of as much as 1.1% mean average precision over standard object detection strategies and a 0.6% improvement over other state-of-the-art representation learning approaches.


I. Introduction
Autonomous vehicles (AV) have the potential to change existing transportation systems. However, one major concern is the interaction between their acquisition sensors (cameras) and their deep learning based decision algorithms. This concern exists because the perception decisions made by an autonomous vehicle are dependent on the quality of the data it receives from the surrounding environment. In particular, camera setups with a wider field of view are attractive due to their ability to capture a more holistic representation of the entire scene. For this reason, fisheye camera lenses are gaining attention as the main vision sensor on these AV systems due to their effective receptive field of 180 degrees [1]. As a result of this advantage, fisheye cameras have seen widespread adoption in common vehicle settings such as parking assistance [2] and automated parking [3]. Despite their usage in diverse applications, fisheye cameras come with the unique challenge of exhibiting radial distortion as a function of distance from the center of the image. Analysis of the acquisition process of these cameras has shown that this distortion is an inherent consequence of projecting the hemispherical lens geometry onto a 2D plane [4]. A naive solution to this problem would be to simply apply a transformation that rectifies the distortion. However, it has been shown [5] that these types of approaches introduce artifacts at the edges and reduce the overall field of view of the image.
From a deep learning perspective, radial distortion introduces a plethora of issues because neural networks exhibit performance degradation outside of the pristine data setting [8]. Furthermore, most computer vision applications only consider data from narrow field of view cameras with mild radial distortion. As a result of these discrepancies, research has gone into developing deep learning approaches that maintain performance on fisheye data while simultaneously avoiding the sub-optimal process of rectifying the image. This research can be roughly divided into two sub-categories that we will refer to as model-centric and data-centric approaches. Model-centric refers to approaches [9]-[11] that attempt to change certain architectural features of a model with the intent of better conforming to an identified fisheye feature. Data-centric refers to approaches [12], [13] that manipulate the available training data in an attempt to better generalize to the fisheye setting. While these approaches have seen varied levels of success, they are lacking in the sense that they are very specifically optimized for their task of interest. In other words, these approaches introduce improvements within their target application (monocular depth estimation, semantic segmentation, etc.), but they do not identify core properties that a model's representation of data should have in order to be tuned to the fisheye setting.

FIGURE 1: 1) This is an example fisheye image with associated car objects from the center and edge. 2) Previous methods view these objects as lying on separate manifolds that need to be corrected through a term that identifies them as belonging to the same semantic class, $L_C$. 3) We propose an alternate view of the problem that views distortion and semantic context as belonging to separate sub-manifolds. The model is enabled to learn an intermediate representation that considers both concepts through a loss that enforces understanding of both the semantic class, $L_C$, and a distortion class, $L_{DC}$.

FIGURE 2: This shows the mean average precision (mAP) for a toy experiment on the WoodScape [6] dataset. An object detector is trained using the YOLO v5 [7] framework. Then, we compute total mAP, mAP of just the objects at the edge of the image, and mAP of objects within the center of the image.
In this work, we address this research gap by introducing a representation-centric approach specifically designed for a general fisheye paradigm. Our perspective on this paradigm can be understood by the example presented in Figure 1. Within the first image, there are two cars: one that exists within the center of the frame and one that exists at the edge of the frame. It can be observed that the radial distortion property of fisheye data causes the car at the edge to exhibit a much higher level of distortion compared to the object taken from the center of the frame. To demonstrate the effect of this, we perform a toy experiment on the WoodScape [6] dataset to see how performance varies with respect to detecting center and edge located objects respectively. The results of this experiment on the YOLO v5 [7] architecture can be observed in Figure 2, and they clearly show a significant difference in performance of about 0.06 mAP between objects found in the center compared to those located at the edge of the image. One interpretation, shown in Figure 1, is that the distortion may cause the model to view these objects as coming from a high-distortion manifold $M_H$ and a low-distortion manifold $M_L$ despite their membership within the same semantic class. While the data space reflects the distortion characteristic of the data, standard methods do not integrate distortion into the training paradigm and rely on just semantic labels $Y_C$ with an associated semantic-based loss $L_C$. However, previous work [14]-[16] has shown that a model lacks generalization capability when the learnt representation does not reflect the underlying distribution of the data space. We argue in this work that the underlying distribution of fisheye data reflects not only semantic context or distortion alone, but a complex interaction between both.
We visualize this perspective in part 3 of Figure 1, where objects exist both on a semantic context manifold $M_S$ and on separate distortion-specific sub-manifolds $M_H$ and $M_L$. In this view of the problem, all objects have a label with respect to the semantic manifold, $Y_C$, as well as labels that reflect their location within distortion space, $Y_{DC}$. From this setup, it is then possible to train a model with a loss that integrates both the semantic characteristic $L_C$ and the distortion characteristic $L_{DC}$ of the objects. We implement such a framework by first extracting objects from fisheye data and using their distance from the center to assign distortion-based class labels alongside their semantic class label. We then take advantage of contrastive learning approaches [17] in order to explicitly enforce a model to learn a representation that reflects both the distortion and semantic characteristics of the object through a weighted contrastive loss $\alpha L_{DC} + (1 - \alpha) L_C$. We then fine-tune the learnt representation within an object detection setting to empirically validate the approach. The target contributions of this work are:
1) We introduce a representation-centric approach to training on fisheye data based on using contrastive learning as a way to constrain the interaction between semantics and distortion.
2) We perform an explicit analysis of this trade-off between being distortion-aware and semantically-aware within the context of an object detection setting.
3) We compare against standard object detection and representation learning baselines to demonstrate the advantage of our approach.

A. Rectification Approaches
One application of deep learning on fisheye data involves developing models to rectify the fisheye data with the goal of removing distortion. [18] introduced the use of a CNN as a means to extract features, parse the surrounding scene, and estimate the distortion parameters necessary to rectify the image. [19] built upon this idea to derive distortion parameters by enforcing that straight lines maintain their straightness. Recent work [20] has explored the usage of transformer architectures to model domain shifts across distortions. [21] demonstrated a self-supervised approach for this task. While these works have shown good performance for the task of rectification, the downsides of computational cost, rectification artifacts, and reduction in the field of view remain significant concerns.

B. Model-Centric
Model-centric approaches refer to methods that introduce architectural changes in the hopes of performance improvements on their target task. Early work [22] made use of probabilistic appearance models for pedestrian tracking. [9] showed how changes to the output bounding box shape can improve the mIOU on distorted objects through better conformation to the distorted shape. Other work [10] has introduced new types of feature extraction strategies such as the usage of a hyperbolic convolutional kernel. [23] demonstrated the usage of adaptable deformable kernels for semantic segmentation. [24] introduced an additional feature pyramid block to detect smaller objects. Other ideas [11] showed how approximations to the domain of spherical data can work on fisheye data through the usage of an attention mechanism. While these architectural changes have shown performance improvements, it is unclear from the resultant features which aspects of fisheye data they are leveraging. In our work, we make this explicit by shaping a representation space based on fisheye-centric principles of the interaction between distortion and semantics.

C. Data Centric
Data-centric approaches describe methods that intervene on the training data in the hopes of presenting the model with better views that are invariant to distortion. For example, [12] introduced a set of data augmentations that were tuned to the fisheye setting. Additionally, [13] showed how training with geometric perspectives can enable better training views within the context of a 3D object detection task. The main issue with these approaches is that the augmentations are tuned to a specific task and it is not clear what general augmentation principle is at work. Our work circumvents this by introducing a general representation principle for fisheye data.

D. Contrastive Learning
The main idea behind contrastive learning approaches is to learn a lower-dimensional embedding space where similar pairs of images (positives) project closer to each other than dissimilar pairs of images (negatives). The manner in which these positives and negatives are defined, as well as their usage within the overall framework, is what distinguishes contrastive learning methodologies from each other. Traditional approaches like [25]-[27] all choose positive instances by augmenting an image through some transformation and treating all other instances in the batch as the negative set. Within the domain of fisheye, [28] has utilized existing contrastive learning approaches on fisheye data for the task of semantic segmentation. That work differs fundamentally from ours in the sense that we are proposing a contrastive learning approach specifically geared towards creating a fisheye-specific representation space, rather than a generic space based on previous learning approaches. In order to do this, we leverage the supervised contrastive loss [29], where positive and negative instances are chosen on the basis of belonging to the same semantic category or not. In other words, an additional constraint is placed on the embedding space to guide the model's understanding of similar and dissimilar data points. This loss inspired recent work [16], [30], [31] that generates labels using auxiliary information to shape a representation that is more appropriate for the application domains of medical and seismic data. We introduce a way to use auxiliary distortion and semantic information to shape representations more suitable for the fisheye setting.

A. Dataset
The fisheye dataset utilized in this paper is the WoodScape [6] autonomous driving dataset. This dataset is collected using four surrounding fisheye cameras on a moving vehicle over a variety of urban scenes. It has over 8.2k images containing 5 different object categories: vehicle, pedestrian, bicycle, traffic light, and traffic sign. No public test set for this dataset is available, so the dataset was split into a training, validation, and test split of 80%, 10%, and 10% respectively. This specific dataset was chosen over the other fisheye datasets from [1] for the following reasons. Firstly, WoodScape is the only real-world fisheye AV dataset with object bounding box labels. Other fisheye AV datasets are either unlabeled with respect to object bounding boxes or are entirely simulated. The non-AV fisheye datasets that are both real and have bounding boxes are insufficient to demonstrate the interaction between distortion and semantic classes. This is because the AV setting has a wide distribution of objects across the entire image, while non-AV datasets only have center-focused objects. This makes it hard to study the impact of objects closer to the edge where the distortion is highest.
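The 80%/10%/10% partition described above can be illustrated with a short sketch. This is only one possible way to produce such a split: the random shuffle, the fixed seed, the `split_dataset` helper, and the image identifiers are assumptions made for illustration and are not taken from the paper.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle items and split them into train/val/test lists.

    The shuffle and the seed are assumptions; the paper only states
    the 80/10/10 proportions.
    """
    items = list(items)
    rng = random.Random(seed)
    rng.shuffle(items)
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Hypothetical image identifiers standing in for the WoodScape files.
image_ids = [f"img_{i:05d}" for i in range(8200)]
train_ids, val_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))
```

Splitting at the image level (rather than the object level) keeps all objects from one frame in the same partition.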

B. Statistics
In Figure 2, we observe that the test set performance on objects located at the edge is lower than that on objects located in the center of the image. In this section, we build intuition regarding the interaction between object distortion and regional location. We provide analysis that demonstrates that the performance difference of Figure 2 is due to the radial distortion of the fisheye images. To enable this analysis, we define two categories of objects: central and edge objects. All objects whose center lies within the box with an upper left image coordinate of (.25, .25) and a lower right coordinate of (.75, .75) are central objects, and objects outside this box are edge objects.
The first statistic we investigate is the distribution of the centered and edge objects across different classes in the training set. We can see from Figure 3 that the majority of objects are located closer to the edge of the image. This plot was generated by extracting the center coordinate for every object in every class and then using our definition of center and edge to determine which location bin they belong to. This indicates that in Figure 2 the model was not biased towards detecting central objects better because of a greater prevalence of objects to train on from the center of the image. Another statistic to investigate is the size of objects at different regions of the image. This is shown in Figure 4 for the three most prevalent classes: vehicles, pedestrians, and bicycles. In this plot, every object's distance from the center is plotted against the area of its respective bounding box. We observe that the vast majority of objects in every class maintain a roughly similar size regardless of their distance from the center. This is further validated by the histogram in Figure 5, which shows the number of objects as a function of the object area for the majority classes. This histogram shows that most objects within each class have a similar size to each other. Together, these plots show that the worse performance on the edges is not due to having smaller sized objects to detect compared to the center. Both regions have similarly sized objects, with the difference being the higher radial distortion exhibited by the objects at the edge.

FIGURE 4: This plot shows the relationship between the coordinate distance and object area for every object among the three most prevalent classes in the WoodScape dataset. Distance was computed as the mean squared error between the center coordinate of the image (.5, .5) and the object's center coordinate. Object area was computed by multiplying the height and width of the object's bounding box. The maximum possible distance is 0.707 and the lowest is 0.
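The two quantities plotted in Figure 4 can be sketched as follows. The sketch assumes bounding boxes are given as normalized `(x_min, y_min, x_max, y_max)` tuples, which is an assumption about the label format, and it uses the Euclidean distance to the image center because that is the measure whose maximum is the stated 0.707 (a box centered on an image corner). The function name is hypothetical.

```python
import math


def center_distance_and_area(box):
    """Return (distance from image center, box area) for one object.

    box: (x_min, y_min, x_max, y_max) in normalized [0, 1] image
    coordinates (assumed format).
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    # Euclidean distance to the image center (0.5, 0.5); this choice is
    # an assumption that matches the stated maximum of about 0.707.
    distance = math.hypot(cx - 0.5, cy - 0.5)
    area = (x_max - x_min) * (y_max - y_min)
    return distance, area


# A box centered in the image and a small box near the top-left corner.
print(center_distance_and_area((0.4, 0.4, 0.6, 0.6)))
print(center_distance_and_area((0.0, 0.0, 0.1, 0.1)))
```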
To further build on this idea, we attempt various ways in Figure 6 to quantify the distortion exhibited by objects on the edge in comparison to objects in the center. The first plot shows how distortion changes according to the radial distortion mathematical model described in [4]. This shows that the associated distortion should theoretically increase the further away from the center of the image the associated features are located. To evaluate this empirically, we also extract each object from both the pedestrian and bicycle classes and assign them as center and edge objects according to the conventions we introduce. We then compute BRISQUE [32] features that quantify losses in "naturalness" due to distortions. This results in every object image having an associated 36 x 1 feature vector. We compute a single value by averaging across this feature vector for each object. We compute the mean and standard deviation for this value across all objects in each class and associated regional location and plot the associated Gaussian. We observe that there is significant separation in BRISQUE features between the edge and center located objects. Specifically, this corresponds to a percent overlap of 18.55% for the pedestrian class and 10.90% for the bicycle class, which indicates that the distortion had some effect on the distributions of these objects.
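One way to turn two fitted Gaussians into a percent overlap is the overlapping coefficient, the integral of the pointwise minimum of the two densities. The text does not state which overlap measure was used, so the sketch below is an assumption, and the means and standard deviations in the example are made-up values rather than the paper's BRISQUE statistics.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def gaussian_overlap(mean_a, std_a, mean_b, std_b, steps=20000):
    """Overlapping coefficient of two Gaussians via midpoint integration."""
    # Integrate over a range that covers both distributions (+/- 8 std).
    lo = min(mean_a - 8 * std_a, mean_b - 8 * std_b)
    hi = max(mean_a + 8 * std_a, mean_b + 8 * std_b)
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += min(gaussian_pdf(x, mean_a, std_a), gaussian_pdf(x, mean_b, std_b))
    return total * dx


# Hypothetical center/edge statistics, for illustration only.
overlap = gaussian_overlap(mean_a=0.0, std_a=1.0, mean_b=2.0, std_b=1.0)
print(f"overlap: {100 * overlap:.2f}%")
```

A value near 100% would mean the center and edge distributions are indistinguishable, while a small value indicates clear separation.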

IV. Methodology
Our methodology follows three distinct steps: regional label extraction, followed by pre-training of a ResNet-18 network [33] with a linear combination of contrastive losses, and finally fine-tuning the learnt representation with an object detection head. The overall philosophy behind this approach is to enable the model to recognize both semantic and distortion related information within its representation space. The regional extraction provides us with the labels for distortion, and the contrastive learning operates with a weighted loss that constrains the representations learnt in terms of both semantic and distortion related contexts.

FIGURE 5: This is a histogram of the number of objects at different sizes for each class. The area for each object is computed by multiplying the height and width of each object's bounding box. These areas are then ranked and binned to create this histogram.

A. Regional Class Label Extraction
In order to train a model to recognize both semantic and distortion concepts, we need to acquire labels that reflect both. In the case of semantic information, the label files for every image identify the class that each object belongs to. However, there is no explicit label that reflects the distorted nature of the object. In order to acquire this, we use the bounding box information of each object to obtain the center coordinate of the object. This coordinate is important because the further the object is from the center of the frame, the greater the distortion it exhibits. Therefore, it is necessary to define a threshold such that all objects outside of this threshold belong to a high-distortion class and all objects within this threshold belong to a low-distortion class. With these considerations in mind, the process to define these distortion-based labels is detailed in Figure 7. For each image in the training set, we extract each individual object as its own image patch along with its associated class label and bounding box coordinates. Every object is immediately assigned its original class label, but it is also assigned an additional label to describe its distortion class. This is done by analyzing the center coordinate. If the center coordinate of the object lies in the inscribed box with an upper left coordinate of (.25, .25) and a lower right coordinate of (.75, .75), then we consider the object to be close to the center of the image. In this case, we assign the object the additional label of a lower distortion version of its class. If the center coordinate lies outside of this defined box, we assign the object the label of a higher distortion version of its class. This extraction and label assignment process is repeated across the training set to create a large pool of object patches with associated label information.
For example, this means that every car object receives its semantic class of car as well as its appointment as a high or low distortion version of its class, such as highly distorted car and barely distorted car, resulting in 10 possible distortion classes due to two variants of each of the 5 classes.
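The label assignment described above can be sketched in a few lines. The box format (normalized `(x_min, y_min, x_max, y_max)`), the inclusive treatment of the box boundary, the integer encoding of the 10 distortion classes, and the function names are all assumptions made for illustration.

```python
# The five WoodScape object categories named in the text.
CLASSES = ["vehicle", "pedestrian", "bicycle", "traffic light", "traffic sign"]


def distortion_level(box, lo=0.25, hi=0.75):
    """Return 0 (low distortion) if the box center lies inside the
    inscribed box [lo, hi] x [lo, hi], otherwise 1 (high distortion)."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    inside = lo <= cx <= hi and lo <= cy <= hi  # inclusive bounds: an assumption
    return 0 if inside else 1


def distortion_class_label(class_id, box):
    """Map (semantic class, distortion level) to one of 2 * 5 = 10 labels.

    The encoding 2 * class_id + level is a hypothetical choice.
    """
    return 2 * class_id + distortion_level(box)


# A vehicle near the center and a vehicle near the edge of the frame.
vehicle = CLASSES.index("vehicle")
print(distortion_class_label(vehicle, (0.45, 0.45, 0.55, 0.55)))  # low-distortion vehicle
print(distortion_class_label(vehicle, (0.00, 0.40, 0.10, 0.60)))  # high-distortion vehicle
```

Each extracted patch then carries two labels: its semantic class id and its distortion class id.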

B. Contrastive Pre-Training
After disentangling the training set into object patches labeled with both a semantic and distortion class, we perform a contrastive learning objective that constrains the representation to consider both concepts. The overall block diagram of the proposed method is summarized in Figure 8. Given an input batch of extracted objects $x_k$, associated distortion labels $y_{d_k}$, and associated class labels $y_{c_k}$ that form the triplets $(x_k, y_{d_k}, y_{c_k})_{k=1,\dots,N}$, we perform augmentations on the batch twice in order to get two copies of the original batch, giving $2N$ object patches and corresponding labels. These augmentations are random resized crop, random horizontal flip, random color jitter, and data normalization. This process produces a larger set $(x_l, y_{d_l}, y_{c_l})_{l=1,\dots,2N}$ that consists of two versions of each object patch that differ only due to the random nature of the applied augmentation. Thus, for every object patch $x_k$, distortion label $y_{d_k}$, and class label $y_{c_k}$, there exist two views of the image, $x_{2k}$ and $x_{2k-1}$, and two copies of the labels that are equivalent to each other: $y_{d_{2k-1}} = y_{d_{2k}} = y_{d_k}$ and $y_{c_{2k-1}} = y_{c_{2k}} = y_{c_k}$.
FIGURE 6: This plot shows different statistics regarding distortion of center and edge objects. 1) This shows how distortion varies according to the fisheye polynomial distortion model $d(\rho) = a_0 + a_2\rho^2 + a_3\rho^3 + a_4\rho^4$, where $\rho = \sqrt{x^2 + y^2}$ and $d(\rho)$ represents the associated distortion at distance $\rho$. The parameters $a_0, \dots, a_4$ are chosen from calibration files provided by the WoodScape dataset. We also show what x and y coordinate distances correspond to our definition of center and edge. 2) and 3) To produce these plots, every object for both the bicycle and pedestrian classes was passed through the BRISQUE [32] algorithm to produce a 36 x 1 feature vector. The mean and standard deviation of this vector were taken to produce the associated Gaussian.

From this point, we perform the first step in Figure 8, where a linear combination of supervised contrastive losses is performed on the identified distortion and class based labels. The labeled augmented batch of object patches is forward-propagated through an encoder network $f(\cdot)$ that we set to be the ResNet-18 architecture [33]. This results in a 512-dimensional vector $r_i$ that is sent through a projection network $G(\cdot)$, which further compresses the representation to a 128-dimensional embedding vector $z_i$. $G(\cdot)$ is chosen to be a multi-layer perceptron network with a single hidden layer. This projection network is utilized only to reduce the dimensionality of the embedding before computing the loss and is discarded after training. A supervised contrastive loss is computed on the output of the projection network in order to train the encoder network to have a weighted constraint based on both class and distortion labels. In this case, embeddings with the same class label are enforced to be projected closer to each other while embeddings with differing class labels are projected away from each other. At the same time, another loss enforces embeddings with the same distortion label to be projected closer to each other while embeddings with differing distortion labels are projected away from each other. This results in a class based supervised contrastive loss $L_C$ and a distortion class based supervised contrastive loss $L_{DC}$. The form of the distortion contrastive loss is shown as:

$$L_{DC} = \sum_{i=1}^{2N} \frac{-1}{|DC(i)|} \sum_{dc \in DC(i)} \log \frac{\exp(z_i \cdot z_{dc} / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}$$

where $i$ is the index for the object patch of interest $x_i$. All positives $dc$ for object patch $x_i$ are obtained from the set $DC(i)$ and all positive and negative instances $a$ are obtained from the set $A(i)$. Set $DC(i)$ represents all other object patches in the batch with the same distortion class label as the object patch of interest $x_i$, while set $A(i)$ refers to every other element in the same batch as $x_i$. Additionally, $z_i$ is the $\ell_2$-normalized embedding for the object patch of interest, $z_{dc}$ represents the embedding for the distortion class positives, and $z_a$ represents the embeddings for all positive and negative instances in the set $A(i)$. $\tau$ is a temperature scaling parameter that is set to 0.07 for all experiments. The loss function operates in the embedding space, where the goal is to maximize the cosine similarity between embedding $z_i$ and its set of distortion class positives $z_{dc}$. The class based contrastive loss follows the same setup except that positives are chosen on the basis of only class information. In order to weight the influence that each term has on shaping the representation learnt by the model, we introduce an $\alpha$ parameter. This weighting between each contrastive loss can be represented by $L_{total} = \alpha L_{DC} + (1 - \alpha) L_C$. In this way, we are creating a linear combination of losses from different label distributions for the same object patch.
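To make the weighted loss concrete, the following is a minimal pure-Python sketch of a supervised contrastive loss and its linear combination over the two label sets. It is written for readability rather than speed, it sums over anchors (practical implementations often average over the batch instead), and it skips anchors that have no positives; these choices and the function names are assumptions, not a reproduction of the training code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, tau=0.07):
    """Supervised contrastive loss over one batch of embeddings."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total = 0.0
    for i in range(n):
        others = [a for a in range(n) if a != i]  # the set A(i)
        positives = [p for p in others if labels[p] == labels[i]]
        if not positives:
            continue  # anchor without positives contributes nothing (assumption)
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / tau for a in others}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
    return total


def combined_loss(embeddings, class_labels, distortion_labels, alpha=0.5, tau=0.07):
    """L_total = alpha * L_DC + (1 - alpha) * L_C."""
    l_dc = supcon_loss(embeddings, distortion_labels, tau)
    l_c = supcon_loss(embeddings, class_labels, tau)
    return alpha * l_dc + (1 - alpha) * l_c


# Toy batch: four 2-D embeddings with class and distortion class labels.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
class_labels = [0, 0, 1, 1]
distortion_labels = [0, 1, 2, 3]
print(combined_loss(emb, class_labels, distortion_labels, alpha=0.5))
```

Setting `alpha` to 0 or 1 recovers the purely class based or purely distortion based loss, which is the range explored in the alpha parameter analysis.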

C. Object Detection Fine-Tuning
After training the encoder on objects with a combined contrastive loss, we move to the second step in Figure 8 where the weights of the encoder are transferred into the backbone of the object detection setup and a YOLO v5 object detection head [7] is appended to the output of the encoder. The images from the WoodScape training set are input into this setup and the model is trained to perform standard object detection. In this way, we leverage knowledge learnt from constraining representations based on class and distortion information in order to improve performance for the task of object detection.

A. Training Details
The hyperparameters utilized can be divided into those applied for the contrastive learning step and those applied for the object detection fine-tuning. During contrastive learning, we set a batch size of 64 and training was performed for 25 epochs. A stochastic gradient descent optimizer was used for contrastive pre-training with a learning rate of 0.001, weight decay of 0.0001, and momentum of 0.9. The applied augmentations are random resized crop, random horizontal flip, random color jitter, and data normalization to the mean and standard deviation of the WoodScape dataset. The comparison methods of SimCLR [25], MoCo v2 [26], and PCL [27] were trained in the same manner with certain hyper-parameters specific to each method. Specifically, MoCo v2 was set to its default queue size of 65536. Additionally, PCL has hyper-parameters specific to its clustering step, but the original documentation made these parameters specific to the ImageNet [34] dataset on which it was originally built. To fit these parameters to our setting, the clustering step was reduced in size.
FIGURE 9: This is a visual comparison between objects detected by the YOLO v5 baseline architecture and our method with an alpha parameter of 0.5. We include red and green arrows to highlight where our method did better than the baseline YOLO v5.

Object detection fine-tuning on top of the contrastively trained representation space follows many of the same hyperparameter choices as in the original YOLO v5 training setup. The main points to note are resizing all images to a size of 640 x 640, a training time of 100 epochs, a chosen batch size of 32, and a stochastic gradient descent optimizer with a learning rate of 0.01 that follows a cosine learning rate scheduler as training progresses. Further details of the architecture of the object detection head and its parameters can be found in the original YOLO v5 codebase [7].
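As an illustration of what a cosine learning rate schedule does over the 100 fine-tuning epochs, the sketch below decays the rate from the stated starting value of 0.01 along a half cosine. The final learning rate fraction and the function name are assumptions, and details of the YOLO v5 codebase such as warm-up are not reproduced here.

```python
import math


def cosine_lr(epoch, total_epochs=100, lr_start=0.01, final_fraction=0.01):
    """Cosine-annealed learning rate at a given epoch.

    final_fraction (the ratio of final to initial learning rate) is a
    hypothetical value chosen for illustration.
    """
    lr_end = lr_start * final_fraction
    progress = epoch / total_epochs
    return lr_end + 0.5 * (lr_start - lr_end) * (1 + math.cos(math.pi * progress))


for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(cosine_lr(epoch), 6))
```

The schedule changes slowly at the start and end of training and fastest around the midpoint.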

B. Alpha Parameter Analysis
In order to get a sense of the trade-off between optimizing for distortion and semantic information, we vary the alpha parameter in the combined contrastive loss $L_{total} = \alpha L_{DC} + (1 - \alpha) L_C$ that we introduce. In this way, we can observe how shaping the representation with respect to each loss term affects downstream performance once fine-tuning for the object detection task is performed. We observe in Figure 10 that the choice of alpha has a significant impact on the downstream performance of the model. Specifically, we note that when $\alpha = 0.5$ the performance is highest, and when $\alpha = 0$ or $\alpha = 1$ the performance is no better than the standard object detection baseline. In other words, when the learnt representation is forced to consider both the distortion and semantic information equally, the performance is much better than when the model is subjected to each alone. This trend holds on both sides of $\alpha = 0.5$: performance decreases as $\alpha$ moves away from 0.5 in either direction. This result is significant because it validates the idea that a good representation space for fisheye data is one that reflects this interaction between distortion and semantic information. In particular, $\alpha = 0.5$ yielding the best performance suggests that both are equally important for shaping a good representation space. A possible reason for this performance increase relates to the work of [35], where the authors show that a representation space should not map all instances of a class to the same point, but rather uniformly distribute them across the surface of a hypersphere based on implicit subclasses within each higher order class. Within our setting, this means that objects of the same class should not only map close to each other, but also be distributed based on distortion characteristics. We also observe visually in Figure 9 that our method is able to detect objects more accurately than the standard baseline method. Additionally, we see a performance boost for both center and edge mAP in Table 1.
We also compare our approach at $\alpha = 0.5$ to other representation approaches in Table 2. Our approach beats all other contrastive learning strategies that only consider a simple augmentation as a means of constraining the representation space. We also note that this improvement in performance is consistent across different bounding box intersection over union (IOU) thresholds. This indicates that the resultant model produces higher quality bounding boxes compared to the other methods. Part of the reason for this improvement is that traditional approaches choose positives and negatives that are less reflective of the data distribution. For example, choosing an augmentation as the only positive pair and every other point in the batch as a negative leads to situations where the negatives include points that should be positives. This is avoided in our approach by having access to a much more diverse pool of positives based on both class and distortion related considerations.

C. Experiment Variation Studies
We also study different ways in which to define the distorted and clean image regions, different architectures, and different contrastive learning hyperparameters. Within our work so far, we have defined a box such that all objects inside this box are considered low distortion and all objects outside this box are considered high distortion. In Table 3, we explore the effect of a larger and smaller definition of this boundary. We observe that in both cases performance degrades compared to our standard box based on the midpoint distance between the center and upper left corner of the image. Another way of defining these regions is to discretize the distance from the center into a series of ranges, where each range of values denotes a separate distortion class. To study this possibility, we discretize the distance from the center into $l$ different distortion levels and train the contrastive learning setup in the same way discussed previously, with the difference being the additional distortion classes for every semantic class. We then fine-tune the representation trained in this manner for the task of object detection on the WoodScape dataset and report the results in Figure 11. We observe that any level of discretization beyond the low and high levels we introduced leads to a substantial drop-off in performance in terms of mAP. Part of the reason for this is that distortion in fisheye data does not change at a fast enough rate to cause substantial differences between adjacent distortion levels. Therefore, from a contrastive learning point of view, the loss does not have significant enough differences in terms of features to contrast between. We also ensure that our method works in other object detection frameworks. We show a performance improvement over baseline RetinaNet [36] and EfficientDet [37] in Table 5 when integrating our approach into the backbone pre-training of each network with an alpha weighting of 0.5. Additionally, we analyze the sensitivity of our approach with respect to batch size and temperature scaling in Table 4. As expected, a higher temperature leads to poorer performance, as observed in [29], and an optimal batch size choice results in the best performing setting.
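The multi-level discretization studied above (Figure 11) can be sketched as follows. The text does not specify how the distance ranges were chosen, so equal-width bins over the Euclidean distance from the image center are an assumption, as are the normalized box format and the function name.

```python
import math

# Largest possible distance of a box center from the image center (0.5, 0.5).
MAX_DISTANCE = math.hypot(0.5, 0.5)


def distortion_bin(box, num_levels):
    """Assign a box to one of num_levels distortion levels by distance.

    Equal-width distance bins are an assumption made for illustration.
    box: (x_min, y_min, x_max, y_max) in normalized image coordinates.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    distance = math.hypot(cx - 0.5, cy - 0.5)
    level = int(distance / MAX_DISTANCE * num_levels)
    return min(level, num_levels - 1)  # keep the corner case in the last bin


# The same edge object under 2, 3, and 4 distortion levels.
edge_box = (0.0, 0.4, 0.1, 0.6)
print([distortion_bin(edge_box, n) for n in (2, 3, 4)])
```

With `num_levels` distortion levels and 5 semantic classes, the distortion class label space grows to 5 times `num_levels` classes.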

VI. Conclusion
In this work, we investigate how a contrastive learning methodology can be used to enforce a model's representation space to reflect the distortion and semantic interaction inherent within fisheye data. We show how our method reflects this interaction through experiments that vary the alpha parameter during contrastive pre-training. Additionally, further experiments that compare against different representation learning strategies and discretization levels show that the introduced strategy out-performs existing approaches as well as standard object detection. We conclude from these experiments that a quality representation space is one that reflects the features of the data on which it is trained. In this case, our models are better able to overcome the fisheye radial distortion by being allowed to integrate this information within the training process.