Pre-computation of image features for the classification of dynamic properties in breaking waves

ABSTRACT The use of convolutional neural networks (CNNs) in image classification has become the standard method of approaching many computer vision problems. Here we apply pre-trained networks to classify images of non-breaking, plunging and spilling breaking waves. The CNNs are used as basic feature extractors and a classifier is then trained on top of these networks. The dynamic nature of breaking waves is exploited by using image sequences extracted from videos to gain extra information and improve the classification results. We also see improved classification performance, with a reduction in errors of up to 60%, when pre-computed image features such as the optical flow (OF) between image pairs are combined with infrared images to create new models. The inclusion of this dynamic information improves the classification between breaking wave classes. We also provide corrections to a methodology in the literature from which the data originates to achieve a more accurate assessment of model performance.


Introduction
Large ocean waves carry huge amounts of energy which can be dissipated through a process known as breaking. One example of breaking is when an overturning of the crest causes a collapse and a breakdown to turbulence. The breaking process releases large amounts of energy in the dissipation of the kinetic energy of the wave through this turbulence. Understanding the breaking process, exchange of gases and energy in waves at various scales is of great importance to improving models of ocean-atmosphere interactions such as weather and climate models where the amount of carbon dioxide dissolved in the ocean through breaking plays an important role (Deike et al., 2015, 2017). The breaking process also alters the sea drag coefficient which in turn affects the air flow at the surface along with other influences on the atmospheric boundary layer (Babanin, 2011). The dissipation of large amounts of energy through turbulent breaking is also of interest in coastal engineering applications where waves may slam into cliffs or manmade coastal structures.
The dissipation of energy at small length-scales creates difficulty in modelling breaking waves, as they interact with the air above the surface in the breaking process, creating large and small bubbles through turbulent air entrainment (Lim et al., 2015; Lubin & Chanson, 2017). Because breaking waves are turbulent, breaking is a complex two-phase process at the free surface whereby overturning waves generate bubbles, and jets of water plunging back into the wave create vortices which increase the mixing further. Current numerical methods make great simplifications and assumptions about the fluid flow, at which point the details in the breaking process may be lost (Shi et al., 2020). What happens after breaking is largely unresearched, as numerical methods are unable to simulate the process on large scales and instead parameterise the energy dissipation that results from the wave breaking process. Wave breaking plays a significant role in mediating the exchange of momentum, heat and gases between the atmosphere and the ocean. The process of wave collapse is a nonlinear mechanism of rapid transfer of wave energy and momentum to other motions. To date, there are no adequate mathematical and physical descriptions of such a process.
Ocean waves are difficult to recreate in the laboratory due to the different salinity, temperature and thus density gradients throughout the fluid. The waves are usually generated mechanically and do not experience the same breaking or spraying as ocean waves generated by wind. It is therefore desirable to study breaking waves in a way that makes it feasible to measure turbulent quantities and determine characteristics of these waves in a real-world setting. Examples are the breaking threshold, a criterion that distinguishes breaking waves from non-breaking waves, and the spreading of breaking through the wave.
To overcome these challenges, new methods are being pursued to extract dense or detailed information from ocean waves in real time using remote sensing. A range of different sensors are available (Holman & Haller, 2013). The physical manifestations of their interactions with the ocean surface differ significantly. A systematic use of nearshore remote sensing probably started at the US Army Corps of Engineers Field Research Facility in Duck, North Carolina, USA (e.g. Catalan et al., 2011). The focus of the present paper is on video data. A large video database of waves allows for a thorough analysis of breaking waves using image processing and modern computer vision techniques. The collected image data can also be used in the training of deep learning models for classification (Buscombe & Carini, 2019), clustering, segmentation (Stringari et al., 2021; Sáez et al., 2021), wave tracking (Kim et al., 2020; Stringari et al., 2019) and prediction of wave characteristics (Buscombe et al., 2020; Choi et al., 2020). Such models can be used in further processing the image data collected to automatically detect, track and estimate quantitative wave properties, providing new ways to study and measure the ocean surface and breaking waves in particular. To evaluate the ability of convolutional neural networks (CNNs) in extracting the features and patterns of breaking waves, we look at a simple classification task between non-breaking, spilling and plunging breaking waves. With this task, we look to show that the breaking patterns can be learned and thereafter used for more complex tasks and to create more complex CNN models of breaking waves using remotely sensed video data. Note that it is essential to distinguish whitecaps coming from plunging breakers from those coming from spilling breakers. The exchanges between the ocean and the atmosphere are much stronger with plunging breakers.
Advances in the field of machine learning (ML) and increased processing power of computers over the last two decades have seen ML incorporated into almost every quantitative research field. Fluid dynamics and oceanography have been no different, with applications from the simulation of fluids (Fukami et al., 2019) to flow optimisation and control (Brunton et al., 2020). Image processing and computer vision techniques have also shown great promise in oceanographic research where both classical algorithms and deep learning approaches have been implemented to process large amounts of image data to produce three-dimensional depth maps (Bergamasco et al., 2021; Fedele et al., 2013; Gallego et al., 2011), to reduce 3D reconstruction times (Pistellato et al., 2021) as well as for wave tracking (Stringari et al., 2019), classification and segmentation of waves (Stringari et al., 2021) and wave height estimations (Buscombe et al., 2020; Choi et al., 2020).
Infrared imaging provides something of an advantage for the study of breaking waves because small temperature fluctuations at the breaking region are recorded by the infrared camera (Jessup, Zappa, & Yeh, 1997). These temperature fluctuations manifest as streaky patterns on the back of plunging waves (Jessup et al., 1997) and can be detected with the feature extraction in modern CNNs. These streaky patterns contain distinctive identifying markers for the type of breaking that occurs, which can be seen in Figure 1. In the spilling case the breaking occurs more slowly and the white-water spreads over the back face of the wave. In contrast, in the plunging wave there is more structure to the back of the wave as the breaking happens more suddenly and crashes in front of the wave. Training ML models to learn these features and patterns is the first step in building more complex end-to-end models, as we must first prove that these breaking features are learnable. Infrared cameras are also less sensitive to the reflection of light from the sun on the sea surface than traditional visible wavelength cameras. Using infrared cameras does however come with limitations in image resolution and observation distance.
The optical flow (OF) between a pair of successive images is the apparent motion of the objects appearing in the two frames (Sonka et al., 1993). Thus the OF is a vector field of the displacements of pixels and provides a mapping from the first image to the second image. Errors in this mapping can be caused by the following features: occlusion, where the movement of objects blocks other objects in the second image; a lack of texture (that is, of high gradients in the pixel intensity) on the objects; or sudden brightness shifts such that tracking the motion becomes difficult.
A variety of techniques for calculating the OF field exist, including classical methods and methods using CNNs for the estimation of OF. Classical methods include variational techniques like the Horn-Schunck OF (Horn & Schunck, 1981) which minimises an energy functional over the images, yielding a dense (for each pixel) vector field of displacements. These techniques are based on the assumption that displacements are only one or two pixels in size; thus pyramid or multi-scale approaches are implemented which allow the detection of larger motions. The multi-scale method downsamples the image to calculate a rough OF to be used as a base approximation for each increase in resolution back to the original resolution. This is often referred to as a coarse-to-fine OF estimation.
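For reference, the energy functional minimised by the Horn-Schunck method is commonly written in the form below. The notation is an assumption introduced here for illustration and does not come from the text above: I(x, y, t) is the image intensity, (u, v) are the horizontal and vertical components of the flow, subscripts denote partial derivatives and α is a regularisation weight.

```latex
E(u, v) = \iint \left[ \left( I_x u + I_y v + I_t \right)^2
        + \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \right] \, dx \, dy
```

The first term is a linearised brightness-constancy condition, which is only a good approximation for small displacements; this is the reason the coarse-to-fine strategy is needed for larger motions. The second term penalises non-smooth flow fields.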
The CNN-based methods are instead trained to estimate the motion fields in a more "black box" manner by providing many example images and their ground truth OFs. A CNN model then learns to detect the textures within the images, and the change in location of these textures between the images, thus producing an OF field. The ability to generalise is then dependent on the variety of textures and objects in motion in the images used for training the CNN. Data sets used for this training create an artificial OF by overlaying objects in the images and moving them from one frame to the next; in this way the "ground truth" OF is known.
Optical flow has also recently been utilised to directly estimate the sea surface spectrum from stereo images with a large reduction in computation time compared to performing a dense stereo reconstruction first (Bergamasco et al., 2021).
Video classification algorithms where both raw frames and OFs are combined to help with the classification task have also been explored (Ng et al., 2015; Ye et al., 2015) and shown to give increased performance over the same algorithms without the additional OF information. Although Ng et al. (2015) use a more complex long short-term memory (LSTM) network architecture, which was not implemented for our experiments as we use only two frames compared to a longer video sequence, and Ye et al. (2015) consider a two-stream fusion of the raw images and stacked OF, they both explore the usefulness of additional OF information. However, it is noted that OF on its own in a case of noisy video frames can show degraded classification performance in comparison to just raw image frames. Experiments with multiple stacked frames have also shown improvements to video classification over single frames, which is further boosted through a choice of early, late or "slow" fusion of latent feature maps (Karpathy et al., 2014).
We have applied image processing algorithms to breaking wave image data to probe what extra information can be gained from this approach of analysing fluid flows and in particular breaking waves. Our objective is to achieve better results on the classification task by incorporating some of the dynamical information from the waves using the OF methods described in the following sections. In Buscombe and Carini (2019), the authors claim to have a high precision and recall on each of the classes (after augmentation, F1 scores are approximately 92 for all classes), but in a further analysis we found flaws with their data set and modelling approach. The metrics used to assess the models were significantly degraded after correction of the data set problems described in the next subsection. Thus, we aim to improve upon the corrected results by inclusion of the dynamical information.

Data collection
The data used for this project are from Buscombe and Carini (2019), in which they used a multitude of popular pre-trained CNNs as basic feature extractors for infrared (IR) images of breaking waves in order to classify them. Details of the data acquisition can be found within the cited paper. The data set consists of 9996 images split among three different wave classes: non-breaking, plunging and spilling waves. The IR images were taken at a resolution of 640 by 480 pixels and are downsampled to a resolution of 299 by 299 for the CNN feature extractions. The data set is highly imbalanced and contains relatively few examples of the plunging breakers: of the 9996 total images, 208 are plunging, 2354 spilling and 7434 non-breaking wave images. To increase the number of samples, image augmentation was performed to yield an additional 2000 samples for each class of waves, thus adding 6000 images in total. Augmentation included random rotations up to 20 degrees, small brightness and contrast changes and random cropping of the images. The data set was split into waves where each image is 0.1 second apart from the next and of the same classification. These waves are then split into training and testing sets with 6996 and 3000 images in the train and test sets respectively. Consecutive sample frames from each one of the wave classes are shown in Figure 1.
As documented in Buscombe and Carini (2019), the images were captured using a thermal infrared camera in November of 2016 at a US Army Corps of Engineers Field Research Facility in Duck, North Carolina, USA. The camera was angled at 45° to the sea surface and mounted on a pier at the research facility. The images were sampled continuously at 10 frames per second, while a light detection and ranging device (LIDAR) measured the sea surface elevation in the same field of view. The wave height, as measured by the LIDAR, varied between 0 and 5.94 metres during the 10.5-hour acquisition period. The sequences vary in length from 7 frames to over 700 frames in a sequence (shorter sequences typically belong to the plunge and spill breaking waves, while non-breaking sequences are much longer). The distribution of pixel intensities was used to determine if the image contained a breaking wave. The type of breaking wave detected in these images was then manually classified by examining the patterns in front of and behind the breaking region. Further details of camera and LIDAR specifications are available in Buscombe and Carini (2019).
We made several changes to the data set and how it is processed after finding errors in the manual classification, which are described in the next subsection.

Data set corrections
During this investigation, several issues were identified with the data set used. Firstly, after examining the original authors' code, it was noted that the data set splitting into training and testing sets was randomised. As the data were sequential (because they are extracted from videos recorded at 10 frames per second), it essentially meant that the authors' model was trained on images that were one-tenth of a second away from the testing images, and in most cases the trained model used images either side of the test image. Thus, to model the data set correctly, we separated the images into discrete waves based on the filename (i.e. group images xxx110, xxx120 and xxx130 into wave 1 for the class they belong to).
The grouping of the images into waves revealed another problem with the data set: misclassified images. The grouping method described above revealed some waves in the classes contained only one or two images, but a file search for the images that would surround these revealed images of the same wave put into a different class (e.g. found spill/xx30, spill/xx50 but plunge/xx40). This is a clear mislabelling of the data set as the waves do not change class and back again within 0.2 seconds. Each of these corrections was manually verified.
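The grouping and the consistency check described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes that each image carries an integer frame index parsed from its filename, that consecutive frames differ by 10 in that index, and the names `group_into_waves` and `find_label_conflicts` are hypothetical.

```python
FRAME_STEP = 10  # assumed spacing between consecutive frame indices


def group_into_waves(frame_ids, step=FRAME_STEP):
    """Group frame indices into runs of consecutive frames (one run per wave)."""
    waves = []
    for frame in sorted(frame_ids):
        if waves and frame - waves[-1][-1] == step:
            waves[-1].append(frame)   # continues the current wave
        else:
            waves.append([frame])     # gap in the indices: start a new wave
    return waves


def find_label_conflicts(labelled_frames, step=FRAME_STEP):
    """Flag frames whose two temporal neighbours agree with each other but not with the frame."""
    labels = dict(labelled_frames)
    conflicts = []
    for frame, label in sorted(labels.items()):
        before = labels.get(frame - step)
        after = labels.get(frame + step)
        if before is not None and before == after and before != label:
            conflicts.append((frame, label, before))
    return conflicts


# Hypothetical labelled frames: (frame index, class folder).
frames = [(110, "spill"), (120, "spill"), (130, "spill"),
          (230, "spill"), (240, "plunge"), (250, "spill")]

waves = group_into_waves([f for f, lab in frames if lab == "spill"])
conflicts = find_label_conflicts(frames)
print(waves)      # two one-image "waves" hint at a labelling problem
print(conflicts)  # frame 240 is flagged as a candidate for manual review
```

A flagged frame is only a candidate; as noted above, each correction still needs manual verification.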
Once the data were sorted correctly, we tested the original authors' best model and found that it was significantly less accurate; the original logistic regression (LR) model did not perform well, even on training data. One of the proposed reasons for this underperformance is the large class imbalance in the data set and the fact that there are very few samples of the plunging breaking waves. Thus, the model is unable to find a distinguishing feature between the spilling waves and the plunging waves before over-fitting to the training set.
We caution against randomly splitting sequential data (for example, image data extracted from videos), which exhibit high dependence between consecutive images, into training and testing data sets. Doing so gives false results in regard to the model's ability to generalise, as the testing set contains near-identical examples to the training set. A more valid approach is to split the data into training and testing data on a video basis, rather than on an image basis.
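A split on a per-wave (per-video) basis can be sketched as below. This is an illustrative sketch under stated assumptions, not the code used in this study: the function name `split_by_wave`, the wave identifiers and the 30% test fraction are hypothetical, and the sketch does not stratify by class.

```python
import random


def split_by_wave(wave_ids, test_fraction=0.3, seed=0):
    """Assign whole waves, rather than single images, to the train or test set."""
    unique_waves = sorted(set(wave_ids))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(unique_waves)
    n_test = max(1, round(test_fraction * len(unique_waves)))
    test_waves = set(unique_waves[:n_test])
    train_idx = [i for i, w in enumerate(wave_ids) if w not in test_waves]
    test_idx = [i for i, w in enumerate(wave_ids) if w in test_waves]
    return train_idx, test_idx


# Hypothetical example: ten images belonging to four waves.
wave_ids = ["w1", "w1", "w1", "w2", "w2", "w3", "w3", "w3", "w4", "w4"]
train_idx, test_idx = split_by_wave(wave_ids)
train_waves = {wave_ids[i] for i in train_idx}
test_waves = {wave_ids[i] for i in test_idx}
print(train_waves, test_waves)  # no wave appears in both sets
```

Because every image of a wave lands on the same side of the split, no test image has a near-identical neighbour in the training set. With a class as rare as the plunging breakers, the split would in practice also need to be checked so that each class appears in both sets.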

Optical flow
We have selected two OF methods for use in this study: a classical method, TV-L1 (Monzón et al., 2016; Sánchez et al., 2013; Wedel et al., 2009; Zach et al., 2007), and a CNN-based method, SPyNet (Niklaus, 2018; Ranjan & Black, 2017). The details of the OF algorithms can be found in the respective articles along with working implementations. Figure 2 shows the OF (computed with TV-L1) corresponding to the sample frames shown in Figure 1. These implementations are applied to each sequential pair of the infrared images, and the resulting vector flow is converted to an RGB image which can be used as input to the pre-trained CNNs.
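The conversion of a flow field to an RGB image can be sketched as follows. The text does not state which colour coding was used, so this sketch assumes the common HSV convention in which the flow direction sets the hue and the flow magnitude sets the brightness; the function name `flow_to_rgb` and the nested-list representation are hypothetical choices made to keep the example dependency-free.

```python
import colorsys
import math


def flow_to_rgb(flow, max_magnitude=None):
    """Map a dense flow field (rows of (dx, dy) tuples) to rows of 8-bit RGB tuples."""
    magnitudes = [[math.hypot(dx, dy) for dx, dy in row] for row in flow]
    if max_magnitude is None:
        # Normalise by the largest displacement; fall back to 1.0 for an all-zero flow.
        max_magnitude = max(max(row) for row in magnitudes) or 1.0
    image = []
    for flow_row, mag_row in zip(flow, magnitudes):
        rgb_row = []
        for (dx, dy), mag in zip(flow_row, mag_row):
            hue = (math.atan2(dy, dx) / (2 * math.pi)) % 1.0  # direction -> hue
            value = min(mag / max_magnitude, 1.0)             # speed -> brightness
            r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
            rgb_row.append((round(r * 255), round(g * 255), round(b * 255)))
        image.append(rgb_row)
    return image


# Hypothetical 2x2 flow field: no motion, rightward, downward and leftward motion.
flow = [[(0.0, 0.0), (1.0, 0.0)],
        [(0.0, 1.0), (-1.0, 0.0)]]
rgb = flow_to_rgb(flow)
print(rgb[0][0], rgb[0][1], rgb[1][1])
```

In this coding, static regions are black and opposite directions of motion receive complementary colours, so a three-channel image is produced that a CNN pre-trained on RGB images can accept.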

Training and classification metrics
For the feature extraction, we use different pre-trained image classification models from the TensorFlow Python library. The top (fully connected) layer of these networks was removed so that a vector of features was outputted. This feature vector is then saved for each image or image pair. For comparison with the original paper (Buscombe & Carini, 2019), we then trained the same LR model on these feature vectors. A new fully connected layer was found to give no significant improvement over the LR model. More samples of the plunging class or more aggressive data augmentation could boost the performance of the neural network, but this was not pursued here.
The images were downsampled to the appropriate size for each respective CNN architecture (based on what they had originally been trained on) for optimal feature extraction (for example, 299×299 for the Xception network). We trained several different models, of which some take two or three images as the inputs. For the multi-image inputs, we process each image with the CNN and combine the feature vectors together before training the classification model. Stacking the inputs on top of each other and processing these with the CNNs was found to produce inferior results. This result is likely because the channels in these pre-trained CNNs corresponded to the red, green and blue features, but extracting all those channel features on each grayscale image gave more useful information.
We trained IR, OF, IR+IR and IR+OF models, where IR stands for infrared and OF stands for optical flow image inputs. The addition represents the concatenation of the feature vectors, with IR+IR being two consecutive infrared images. The OF was pre-computed with the robust TV-L1 algorithm with default configuration, and the resulting flow field was visualised as an RGB image (see Figure 2 for sample images) and fed to the feature extractor or pre-trained CNN after being resized and normalised.
The LR classifier is fitted with "balanced class weights", which weights the loss to more heavily penalise incorrect predictions on plunging samples to combat the class imbalance within the data set.
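To give a sense of scale, the sketch below computes balanced weights under the assumption that they follow the heuristic used by scikit-learn's `class_weight="balanced"` option, namely n_samples / (n_classes × class count). The text does not name the library, so this is an assumption. The counts are those of the full data set quoted earlier and are used purely for illustration; an actual fit would use the counts of the training subset.

```python
def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Class counts of the full data set (9996 images), used here for illustration only.
counts = {"non-breaking": 7434, "spilling": 2354, "plunging": 208}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

With these counts an error on a plunging sample is weighted roughly 36 times more heavily than an error on a non-breaking sample, which illustrates how strongly the imbalance has to be compensated.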
The metrics used for the evaluation of the classification predictions are described below, with the inclusion of two more metrics than in the Buscombe and Carini paper. The models were assessed using five different metrics on the classes: precision (Pr), recall (Re), F1 score (F1), informedness (Youden, 1950) (In) and the Brier score (Brier, 1950) (Bs). The first four are defined in terms of the true and false positives (TP and FP) and true and false negatives (TN and FN); in their standard forms the definitions are:
$$\mathrm{Pr} = \frac{TP}{TP + FP}, \qquad \mathrm{Re} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2\,\mathrm{Pr}\,\mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}},$$
$$\mathrm{In} = \frac{TP}{TP + FN} + \frac{TN}{TN + FP} - 1, \qquad \mathrm{Bs} = \frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{R} \left( p_{ti} - l_{ti} \right)^2,$$
where in the Bs the sums are over all N samples and all R classes, p_{ti} represents the predicted probability for sample t and class i, and l_{ti} is the corresponding entry of a vector with a 1 in the position indicating the true label and 0 elsewhere. The Bs thus measures the mean square difference between the predicted probability and the actual labels, and a lower Bs is better. The informedness score estimates the probability of making an "informed decision": a score of 1 indicates a perfect classifier, while a score of 0 indicates random decisions.
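The five metrics named above can be computed with a few lines of code. The sketch below uses their standard one-vs-rest definitions and the multi-class form of the Brier score; the function names and the toy labels are hypothetical and are not results from this study.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, FN and TN for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def class_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and informedness from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0
    informedness = recall + specificity - 1.0
    return {"Pr": precision, "Re": recall, "F1": f1, "In": informedness}


def brier_score(probabilities, labels, classes):
    """Mean over samples of the squared distance between probabilities and one-hot labels."""
    total = 0.0
    for probs, label in zip(probabilities, labels):
        for cls, p in zip(classes, probs):
            target = 1.0 if cls == label else 0.0
            total += (p - target) ** 2
    return total / len(labels)


# Toy example with hypothetical labels.
classes = ["non-breaking", "spill", "plunge"]
y_true = ["non-breaking", "non-breaking", "spill", "spill", "plunge", "plunge"]
y_pred = ["non-breaking", "non-breaking", "spill", "plunge", "plunge", "spill"]

plunge = class_metrics(*one_vs_rest_counts(y_true, y_pred, "plunge"))
uniform_bs = brier_score([[1 / 3, 1 / 3, 1 / 3]], ["spill"], classes)
print(plunge, uniform_bs)
```

Note that with this multi-class definition the Brier score ranges from 0 (perfect, fully confident predictions) to 2 (fully confident and wrong), rather than from 0 to 1 as in the binary form.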

Classification
After fixing the problems with the original data set, we use the original authors' methods as a new baseline measure for our results. The best model from Buscombe and Carini (2019) is an LR model fit to features extracted by a pre-trained MobileNet_V2 (Howard et al., 2017). We found the MobileNet_V2 CNN to perform much worse than using the features extracted from the Xception (Chollet, 2017) pre-trained CNN. All model results below are from features extracted by the Xception network and an LR fit.
From Table 1 we can see that the OF of the images can be used in the classification task, and it provides results comparable to using a single image. However, the best results came from using the two IR images or a combination of the IR images and the OF. The extra information from the use of two images does not affect the non-breaking classifications significantly, but it does have a positive impact on the spill and plunge classifications. In our experiments we also observed that the image augmentation implemented (rotation, zoom, crop) does not help when using the OF and in some cases it decreases performance. The SPyNet model shows difficulty in extracting the features to differentiate the two breaking types (plunge and spill). The CNN used for calculating OF was seen to produce inaccurate and overly smooth flows on these images and thus performs slightly worse than the TV-L1 model. The probabilities of each class obtained from the models are visualised in Figure 3.
Figure 2. Samples of calculated optical flows (OFs) using TV-L1 for each of the given classes. The OF is calculated using the sample infrared (IR) images in Figure 1 and their respective next frame. The difference between the non-breaking and breaking classes is clear while the differences between spilling and plunging classes are more subtle, but the extended "white-water" region can be seen in the spilling OF.
We also tabulate the errors from each of these metrics to make it clearer where exactly the best improvements in our models are. In Table 2 we give these improvements as percentages relative to the baseline IR model. It is clear that most models have performed significantly better in most metrics when compared to the single IR input model. The IR+TV-L1 model reports the majority of the largest improvements (indicated by the bold text).

Misclassifications
In this section we analyse some misclassifications and present confusion matrices (Figure 4) for the models on testing (unseen) data, both with and without augmentation applied. This allows for quick visual identification of misclassifications by looking at the off-diagonal terms in the confusion matrices. For the confusion matrices, each row corresponds to a true label and each column corresponds to the predicted label. Higher performance corresponds to a darker main diagonal in the matrix. We observe that all models perform well on the non-breaking waves, with IR and the SPyNet OF having the largest errors on spilling waves (approximately 19% and 10% of spilling waves classified as non-breaking, respectively). The TV-L1 OF and the combined IR+OF (TV-L1) are the best performing models at separating plunging waves from spilling waves. The majority of misclassifications occur as a wave is just entering or exiting the frame, when an incorrect prediction is understandable.
Table 1. Metrics for tested models. For each metric (column) the best score for each class is indicated by the bold text. Higher scores are better for all metrics except for the Brier score where a score of 0 indicates perfect predictions and confidence. Results for models using augmented data are in parentheses.
Figure 3. The evolution of probabilities for each class from some of the tested models. The horizontal axis corresponds to the image in the sequences. We see that the gain from the OF method in the classification of plunge images comes at the cost of uncertainty in the spill cases.
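A confusion matrix with true labels in the rows and predicted labels in the columns can be built as in the following sketch. The labels are hypothetical toy data, not results from this study, and the function names are illustrative. Row normalisation is one common way to express entries as fractions of each true class, which is how a statement such as "19% of spilling waves classified as non-breaking" would be read off.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count predictions with true labels along rows and predicted labels along columns."""
    index = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def row_normalise(matrix):
    """Express each row as fractions of the samples belonging to that true class."""
    normalised = []
    for row in matrix:
        total = sum(row)
        normalised.append([count / total if total else 0.0 for count in row])
    return normalised


# Toy example with hypothetical labels.
classes = ["non-breaking", "spill", "plunge"]
y_true = ["non-breaking", "non-breaking", "spill", "spill", "plunge", "plunge"]
y_pred = ["non-breaking", "non-breaking", "spill", "plunge", "plunge", "spill"]

cm = confusion_matrix(y_true, y_pred, classes)
cm_norm = row_normalise(cm)
for cls, row in zip(classes, cm_norm):
    print(cls, row)
```

Off-diagonal entries in a row show which classes the samples of that true class are confused with.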

Discussion
We have explored the use of OF features for the classification of breaking wave images. This was tested on an image classification task to gain improved classification of sequences of IR images of breaking waves compared to only using a single IR image. After an initial exploration of the original data set, the original analysis was found to be flawed. A large class imbalance exists, which made it difficult to train and test a robust classifier, and initial results were compromised. After correction of the data, comparisons were drawn between the baseline IR model which was claimed as best by the original authors and several other models which took advantage of the dynamical nature of the waves by performing a feature extraction on sequential images. It was found that a novel combination of the IR and OF information could produce slightly better results, and the inclusion of augmented images provides additional gains to the IR models but acts to somewhat deteriorate the performance of the OF models.
Our experiments have focused on the inclusion of temporal features in the data used to train the models. This gives a significant reduction in errors when compared to the baseline, single infrared image input model. In contrast to Buscombe and Carini (2019), who have quantified the performance of different pre-trained network feature extractions, our focus was thus on the data input to the networks.
As opposed to the results in Table 1 of Buscombe and Carini (2019), we detected and removed a selection bias within the data split caused by the high correlation of the training and testing data when randomly selected. Our results give a baseline which corresponds to Buscombe and Carini's results for the Xception network when the selection bias is removed, and all the models, with the exception of the SPyNet-based OF, have shown significant improvements over the baseline model. Furthermore, our results show that our application of OF as input to a pre-trained CNN has reduced errors on the classification task but, similar to the findings of Ng et al. (2015), noisy OFs lead to a deteriorated performance, as with the SPyNet results.
An analysis of the misclassified cases of the models shows that most misclassifications occur at the beginning or end of the image sequences where the type of wave is unclear. The gains on identifying both the plunging and spilling waves from using the OF can be achieved also by using the two IR images and applying the feature extraction on these. Further improvements may be possible with more regularisation or more aggressive data augmentation, in combination with more complex models, for better performance on the plunging samples. The data set was deemed insufficient to train or fine-tune more complex neural networks without over-fitting, and adaptation of the SPyNet OF network was abandoned after it failed to produce accurate OFs on the images. The inclusion of OF into the features for the classification task improved the results on the plunging wave category, and a combination of OF and IR images gave the best results on the spilling waves. However, the application of augmentation to the OF data did not yield additional gain in classifier performance, and this observation is in line with those of Ng et al. (2015), where they conclude the decrease is because of insufficient detail in the OF images, which is exacerbated by cropping or augmentation. In future work, a higher quality data set in terms of clearer and higher resolution images, and also in terms of a better balance of the breaking wave classes, will make the classification task and training of more complex models easier. Distinctive patterns in wave breaking can be identified, but capturing the temporal evolution of wave properties through a sequence of images remains a challenge. One of the difficulties is the discontinuous nature of breaking waves and the fact that these algorithms have been developed to deal with typically more rigid objects and motions. Thus it may prove necessary to use images from video at high resolution and high frequency to be able to resolve the details in the motions when considering data collection of breaking waves for ML application in the future.
Models that take the temporal dependence of the data into account, such as LSTM models or vision transformer models, may be better suited to more complex tasks related to video sequences of waves; however, these models were not explored in relation to this limited data set. Investigation of how breaking evolves over time and an understanding of how it affects the free surface is crucial to provide accurate parameterisations for numerical forecasting systems.
We have shown that with pre-trained feature extractors and LR, we are able to make use of the dynamic information within image sequences to improve classifier performance. More accurate classifiers could be used to help in processing large batches of breaking wave video sequences for further analysis and ML applications. To increase the transferability of the trained models, some labelled training samples from different camera positions and locations would help in learning a more general representation of the breaking classes. It is likely that the OF may also help in learning this representation, just as it has increased performance in this study.

Conclusions
The introduction of OF to incorporate temporal features leads to significant increases in a model's ability to distinguish between plunging and spilling waves. The novel application of OF (TV-L1 and SPyNet) and sequential images to breaking waves to infer temporal features yields significant performance gains. Moreover, our use of sequential images has shown that improved performance was achieved in nearly every metric for each model tested versus a baseline result obtained using only a single infrared image. The addition of image augmentation to boost the number of image samples available for training leads to further increases for models that have raw infrared images as inputs (IR, IR+IR and IR+OF models), whereas performance degrades where only OF is the input (TV-L1 and SPyNet models). We conclude that this decrease in performance is due to insufficient detail in the OF images, which is then exacerbated by cropping and other augmentations. In creating a model that can differentiate the breaking classifications, we have shown that the breaking patterns are extracted using pre-trained CNNs and could be used in training of more complex CNN models of breaking waves.

Figure 1. Samples of wave dynamics for each of the given classes. The image sequences span a time frame of 1.0, 0.4 and 0.4 seconds for the non-breaking, spilling and plunging sequences, respectively. The breaking waves (spilling and plunging) show turbulent white-water effects which are picked up well by the infrared (IR) camera.

Figure 4. Confusion matrices for each of the tested models. The true class is in each row, and the predicted class in each column.

Table 2. Improvement over IR errors. The IR errors are given as a reference. The rest of the rows contain the relative % improvement over the simple IR input model. Each error (except for Brier score) is calculated as 1−score, where the score is the value reported in Table 1. For the Brier score, since a lower score is best, we leave it as is. The percentages reported are the improvements in respective errors for the models, so a high positive % indicates a large improvement in the metric (reduced error). The best improvements are once again highlighted in bold text.
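The calculation described in this caption can be sketched as follows. The scores used are hypothetical placeholders rather than values from Table 1, and the function names are illustrative.

```python
def error_from_score(score, is_brier=False):
    """Convert a score to an error: 1 - score, except the Brier score, which is already an error."""
    return score if is_brier else 1.0 - score


def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to the baseline (positive means better)."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical scores, chosen only to illustrate the arithmetic.
baseline_f1, model_f1 = 0.80, 0.90
baseline_bs, model_bs = 0.10, 0.04

f1_gain = relative_improvement(error_from_score(baseline_f1),
                               error_from_score(model_f1))
bs_gain = relative_improvement(error_from_score(baseline_bs, is_brier=True),
                               error_from_score(model_bs, is_brier=True))
print(f"F1 error reduced by {f1_gain:.0f}%, Brier score reduced by {bs_gain:.0f}%")
```

In this illustration an F1 score rising from 0.80 to 0.90 halves the error (a 50% improvement), which shows why relative error reductions can look much larger than the change in the underlying score.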