Applications of Deep Learning in Fish Habitat Monitoring: A Tutorial and Survey

Marine ecosystems and their fish habitats are becoming increasingly important due to their integral role in providing a valuable food source and conservation outcomes. Due to their remote and difficult-to-access nature, marine environments and fish habitats are often monitored using underwater cameras. These cameras generate a massive volume of digital data, which cannot be efficiently analysed by current manual processing methods that rely on a human observer. Deep Learning (DL) is a cutting-edge Artificial Intelligence (AI) technology that has demonstrated unprecedented performance in analysing visual data. Despite its application to a myriad of domains, its use in underwater fish habitat monitoring remains underexplored. In this paper, we provide a tutorial that covers the key concepts of DL, helping the reader gain a high-level understanding of how DL works. The tutorial also explains a step-by-step procedure for developing DL algorithms for challenging applications such as underwater fish monitoring. In addition, we provide a comprehensive survey of key DL techniques for fish habitat monitoring, including classification, counting, localization, and segmentation. Furthermore, we survey publicly available underwater fish datasets and compare various DL techniques in the underwater fish monitoring domain. We also discuss some challenges and opportunities in the emerging field of DL for fish habitat processing. This paper is written to serve as a tutorial for marine scientists who would like to gain a high-level understanding of DL, develop it for their applications by following our step-by-step tutorial, and see how it is evolving to facilitate their research efforts. At the same time, it is suitable for computer scientists who would like to survey state-of-the-art DL-based methodologies for fish habitat monitoring.


Introduction
Proper understanding of our planet and its ecosystems is not possible unless suitable tools are developed to explore and learn about our largest ecosystem, the marine environment. Computer Vision (CV) technology, through the deployment of underwater cameras, can help us better comprehend and manage remote marine fish habitats. However, due to the sheer volume of the visual data these cameras produce, manual processing is time- and cost-prohibitive, requiring a radical shift in data analysis through advanced technologies such as Deep Learning (DL).
DL is at the frontier of computer vision. Its deep neural network architectures are capable of learning complex mappings from high-dimensional data to interpretable feature representations. Hence, DL has been successfully applied to various challenging computer vision tasks such as semantic image segmentation (Jing et al., 2020; Pathak et al., 2015; Laradji et al., 2021a; Qi et al.; Chuang et al., 2011), visual object detection (Villon et al., 2016; Kim et al., 2016; Pathak et al., 2018), and tracking (Garcia et al., 2016; Duan and Deng, 2019; Kang et al., 2018; Lumauag and Nava, 2019). These applications have the potential to radically alter the way we interact with the world through computers. Recently, the applications of DL and its underlying Deep Neural Networks (DNNs) for underwater visual processing have received significant attention (Saleh et al., 2020a; Laradji et al., 2021b; Villon et al., 2018; Chuang et al., 2016; Nilssen et al., 2017; Mandal et al., 2018; Naseer et al., 2020; Salman et al., 2020; Siddiqui et al., 2018).
The main advantage of deep learning is its ability to learn features in different data types, such as underwater fish images, through end-to-end training. Training of DNNs is often thought to be easy. Many frameworks take delight in providing a few lines of code that solve some CV tasks, giving the misleading impression that all that is needed is plug and play, using some general Application Programming Interfaces (APIs). In these APIs, the developers have lifted the burden from us and, in doing so, disguised the complexity behind the few lines of code needed to achieve the task at hand. The framework developers have achieved the purpose of "providing a few lines of code", but we, the end users, have been led to believe that we need to spend only a few hours learning the intricacies of the provided APIs.
However, when it comes to training a DL algorithm, things become more complicated. The task of training a DNN is actually as complicated as the problem it is intended to solve. In fish monitoring, for example, the number of input images you use, how you pre-process your images, how you build your models, how you fine-tune the model (using dropout or regularization, for example), how you extract the features, how you combine them to produce final predictions, what metric you use to report your model performance, and which layer you extract features from to feed to your classifier are among the many variables to consider when training a DNN. You can vary any of these factors to further optimize your model and to achieve the best possible accuracy.
Due to the above intricacies, DNNs are, most of the time, not simply an "off-the-shelf" technology that works with all kinds of datasets, even those similar to the one for which a DNN has been meticulously customised. The fact that training a customised high-performance DNN is rigorous and challenging is now widely accepted. However, this challenging process can be facilitated by being patient, paying attention to details, and working systematically. Developing customised DNNs for a specific application, for example underwater fish monitoring, should follow the same systematic steps as developing any other computer vision application (e.g. detection of vehicles in traffic). The only difference lies in the type of data being fed to the DNN.
In this paper, we first present a tutorial that covers the background of DL to help the reader understand the common DL terminology mentioned above. The tutorial also provides a comprehensive overview of the essential systematic steps for developing a supervised DL model, with a focus on underwater fish habitat monitoring.
In the second part of the paper, we survey state-of-the-art research and development on the use of DL for fish monitoring. We synthesize the literature into four main categories covering the common CV tasks of classification, counting, localization, and segmentation of fish images. We investigate different deep learning architectures and their performance. We also survey publicly available underwater fish image datasets. Finally, we provide a comprehensive overview of the challenges in applying DL to marine fish monitoring domains. We also draw a roadmap for future research.
Although a number of relevant review articles exist (Goodwin et al., 2022; Li and Du, 2021; Zhao et al., 2021; Yang et al., 2021; Li et al., 2020; Moniruzzaman et al., 2017; Saleh et al., 2022), our paper has a different approach and motivation that complements prior surveys. Compared to (Goodwin et al., 2022), which provides a survey of the general domain of ecological data analysis, covering a wide array of studies on plankton, fish, marine mammals, pollution, and nutrient cycling, we focus only on fish monitoring. We also provide a detailed analysis of fish datasets and comprehensively review the literature on four key tasks in underwater fish video and image processing. This detailed analysis and review is not provided in (Goodwin et al., 2022) or any of the previous works, making our paper useful for readers who would like to study fish monitoring using DL in more detail and depth, while seeing a comprehensive literature review.
In addition, (Li and Du, 2021) provides a review of studies on fish condition, growth, and behavior monitoring in aquaculture settings. It briefly covers and reviews various DL architectures and their aquaculture applications, unlike the present communication, which is focused mainly on Convolutional Neural Networks (CNNs) and provides a detailed survey and analysis of the underwater fish monitoring literature.
The work presented in (Zhao et al., 2021) covers the general domain of Machine Learning, as opposed to the specific domain of DL in our paper. This is done for aquaculture applications ranging from fish biomass and behavior analysis to water quality prediction, while also briefly covering and reviewing fish classification and detection methods.
A survey of computer vision models for fish detection and behavior analysis in digital aquaculture is provided in (Yang et al., 2021). An interested reader should study (Yang et al., 2021) before reading our paper, due to the background technical details it provides on image acquisition, which are key to developing effective DL datasets and models, as discussed in our paper.
Furthermore, the DL-based studies presented in (Li et al., 2020) and (Moniruzzaman et al., 2017) mainly address the two specific tasks of underwater fish tracking and underwater object detection, respectively. These applications are different from our study. However, since our underwater fish monitoring tasks are related to these applications, our paper can complement these works.
In (Saleh et al., 2022), we have provided a historical survey of fish classification methods from 2003 to 2021. These methods cover traditional CV techniques and modern DL methods, only for fish classification in underwater habitats and not for the general domain of underwater fish habitat monitoring.

Deep Learning
Deep learning is a sub-field of machine learning composed of interrelated algorithms and concepts used in training a deep neural network (Saleh et al., 2022). One of the main reasons behind the extreme popularity of deep learning is the unprecedented performance it has achieved across different fields, especially image recognition.
Deep learning utilizes multi-layered neural networks for automatic learning of input features. Features are distinguishing properties of the learning inputs, e.g. the color or shape of different fish. The deep learning concept was first proposed based on the idea that traditional multi-layer artificial neural networks could learn complex nonlinear features and their relations with more generalization and at a rapid speed. To learn deep features efficiently, researchers found that a modified version of neural networks, i.e. the CNN, works very well in the image processing field (Saleh et al., 2022). In the following sections, we will first introduce the basic concepts of neural networks in general and then describe CNNs and explain how they learn and then process input images.

Neural Networks
A 'neural network' is a computational model that is inspired by biological neural systems and uses simple, non-linear, computational rules to mimic these systems. Neural networks are composed of simple processing elements called neurons. By organising neurons in a layered structure, interconnecting them, and changing the weights associated with each interconnection, a 'neural network' can be trained to solve a complex problem, such as recognising whether a fish is present in an image. It is then possible to store the connections between neurons for later use. Training a neural network to perform different tasks, e.g. recognizing fish in an image or determining where a fish is in an underwater image, is called the 'learning process'. During supervised learning (explained later), inputs are presented to the network, each with a desired output. The learning process determines which interconnections (weights) are most important to the system for learning the task at hand and mapping all the inputs to their desired outputs as well as possible.
The general idea of neural networks is to have layers of neurons for learning the input data. There are three consecutive layer types in a neural network, i.e. input, hidden, and output. The hidden layers can learn the patterns in the data passed to the network through the input layer. It is within the hidden layers that classification, or in some cases regression, of the input data takes place. The hidden layers can learn abstract patterns and features in the data on their own. In general, there will be more layers in a DNN compared to artificial (shallow) networks for image classification tasks, and this is why DNNs are called deep and can achieve higher accuracies.

Neuron
The neuron, also known as a node or perceptron in a neural network, is its basic unit of computing. The neuron takes inputs from other nodes and produces an output. Every input has a weight that is allocated based on its relative significance to other inputs. As depicted in Figure 1, the node applies the activation function (described below) on the weighted sum of its inputs.

Activation Functions
The activation function (Vogels et al., 2005) in a neural network defines whether a given node is "activated" or not based on the weighted sum of input features. The sigmoid function is one of the most commonly used activation functions. It is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{1}$$
where $\sigma(z)$ is the sigmoid function output that will be used as the input for the following node and $z$ is the weighted sum of input features from the previous layer. The sigmoid function is non-linear and its value ranges between 0 and 1. Sigmoid is popular in image classification because its 0-1 range can be interpreted as the probability of "activating" each output class. The output with the largest "activation" value is then selected, thus facilitating the network's ability to classify the image.
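As a minimal sketch of the sigmoid activation described above, the following Python function maps a weighted sum to a value between 0 and 1. The symbol `z` for the weighted sum and the split into two branches (a common trick to avoid overflow in the exponential) are our own illustrative choices, not something prescribed by the text.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: squashes any real number into the range 0 to 1."""
    # Use the algebraically equivalent form for negative z so that
    # math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# A weighted sum of 0 sits exactly in the middle of the output range.
for z in (-5.0, 0.0, 5.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

Large negative sums give outputs near 0 and large positive sums give outputs near 1, which is what allows the output to be read as a probability.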

Bias Node
Another important component of successful neural networks is the "bias" node, which, as shown in Fig. 1, adds a bias value to the sum of input-weight multiplications to increase the model's flexibility. In particular, when all input features are equal to 0, the bias still shifts the weighted sum away from 0, so the network can adjust to the data and reduce the distance between its fitted values and the targets.
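The role of the bias can be illustrated with a single neuron. The sketch below is hypothetical: the function name `neuron_output`, the feature values, and the weights are invented for illustration, and a sigmoid activation is assumed.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature values and weights, chosen only for illustration.
features = [0.0, 0.0, 0.0]   # all input features are zero
weights = [0.4, -0.2, 0.7]

# Without a bias the weighted sum is 0, so the output is fixed at 0.5
# no matter what the weights are.
print(neuron_output(features, weights, bias=0.0))
# With a bias the neuron can still move its output away from 0.5.
print(neuron_output(features, weights, bias=2.0))
```

With zero inputs, the weights have no influence at all, so the bias is the only parameter through which learning can change this neuron's output.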

Loss Function
In machine learning, there is always a function that needs to be decreased or increased to reach the closest possible mapping between the input and output domains. This function is usually known as the objective function. When it needs to be minimised, as in the case of neural network supervised learning, we might refer to it as the cost, loss, or error function. Although different DL publications may define specific meanings for some of these terms, we use them interchangeably in this paper. In general, loss functions measure the performance of a data-based Machine Learning (ML) model. The loss function is important to consider, as it summarises the learning error between predicted values and expected values in the form of a single real number. As an example, the loss function for linear regression is defined as:
$$L = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2 \tag{2}$$
where $m$ is the number of training examples, $\hat{y}_i$ is the value predicted by the model for the $i$-th training input, and $y_i$ is the true value for that input in the training data.
For classification tasks, such as fish species classification, the loss function is generally a cross-entropy loss function. Cross-entropy loss measures the performance of a classification model whose output is a probability value ranging from 0 to 1. The cross-entropy loss increases as the predicted probability diverges from the ground truth. Another classification loss is Hinge Loss. In Hinge Loss, the score of the correct category should, by some safety margin, be higher than the score of each incorrect category.
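The three losses mentioned above can be sketched in a few lines of Python. The function names are our own, and the hinge loss shown is one common multi-class formulation (it sums the margin violations over the incorrect classes), which we assume here for illustration; other variants exist.

```python
import math

def mse_loss(y_pred, y_true):
    """Mean squared error over m training examples (regression)."""
    m = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / m

def cross_entropy_loss(probs, true_class, eps=1e-12):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(max(probs[true_class], eps))

def hinge_loss(scores, true_class, margin=1.0):
    """Penalise every wrong class whose score is not at least `margin`
    below the score of the correct class."""
    correct = scores[true_class]
    return sum(max(0.0, s - correct + margin)
               for j, s in enumerate(scores) if j != true_class)

# Hypothetical three-species example: class 0 is the true species.
print(cross_entropy_loss([0.7, 0.2, 0.1], true_class=0))  # confident and right
print(cross_entropy_loss([0.2, 0.7, 0.1], true_class=0))  # confident and wrong
print(hinge_loss([3.0, 1.0, 0.5], true_class=0))          # margin satisfied
```

The cross-entropy value grows as the probability given to the true class shrinks, while the hinge loss is exactly zero once every incorrect class is beaten by the margin.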

Optimization
In supervised learning, the learning task can be reduced to an optimization problem of the form
$$\theta^{*} = \arg\min_{\theta} L(\theta) \tag{3}$$
where $\theta$ is a parameter vector and $\theta^{*}$ is the value at which the loss function $L(\theta)$, which usually represents the average loss over all training examples, reaches its minimum. $L(\theta)$ can be represented as
$$L(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell(x_i, y_i; \theta) \tag{4}$$
where $(x_i, y_i)$ represents an (input, desired output) training pair and $\ell(x_i, y_i; \theta)$ is the loss on that pair. Similarly, in DL, an optimization method is used to train the neural network by minimising the error function that is defined as
$$J(W, b) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2 \tag{5}$$
where $W$ and $b$ are the weights and biases of the network, respectively. The value of the error function is thus the mean squared loss between the predicted values $\hat{y}_i$ and the true values $y_i$, over $m$ training examples. The value of $\hat{y}_i$ is obtained during the forward propagation step and makes use of the previously mentioned weights and biases of the network, which can be initialised in different ways. Optimization minimizes the value of the error function by updating the values of the trainable parameters $W$ and $b$. The error function is usually minimised by using its gradient with respect to the parameters. The most commonly used optimization method is Gradient Descent (Sun et al., 2019), in which the gradient is obtained by calculating a matrix of partial derivatives (computed using backpropagation, as detailed in the next subsection). These derivatives provide the slope of $L$ simultaneously at each dimension of $\theta$. Therefore, the gradient is used to determine the next direction in which to search for the global optimum. To improve $\theta$ and reach a lower $L$, a small quantity is subtracted from $\theta$ along the gradient (since the gradient gives the direction of steepest rise in $L$, and its negative the direction of descent), such that a minimum, ideally the global optimum, is eventually approached and $L$ is minimized.
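To make the update rule concrete, the sketch below runs plain gradient descent on a one-variable linear regression with a mean squared error loss. The learning rate, the number of steps, the toy data, and the model `w * x + b` are all assumptions made for illustration; in a DNN the gradients would come from backpropagation rather than from the closed-form expressions used here.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0                      # simple initialisation
    m = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/m) * sum(error^2) with respect to w and b.
        grad_w = (2.0 / m) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / m) * sum(errors)
        # Subtract a small quantity in the direction of the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so the minimiser is known in advance.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

For this convex toy problem the procedure approaches the unique minimum. For DNN losses, which are generally non-convex, the same update typically reaches a good local minimum rather than a guaranteed global one.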

Backpropagation
Backpropagation is probably the most important part of learning in neural networks. It is performed after a forward propagation or pass, in which a subset of the training dataset (named a batch), $\{(x_i, y_i)\}_{i=1}^{m}$, and the current network parameters are used to calculate the final layer output and the loss. During the forward pass, the input data is passed to the first layer, which processes it according to its weights and activation function, and the resulting values are passed on to the next layer, hence the term "forward pass". After the forward pass and the calculation of the final layer loss, backpropagation takes place: the derivative of the loss is propagated backwards, layer by layer, and the local gradients of the layers are "chained" together (via the chain rule) to obtain the gradient of the overall loss, $L$, with respect to every parameter, which is then used to minimise $L$.
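The chaining of local gradients can be sketched on a deliberately tiny network: one sigmoid hidden neuron followed by one linear output neuron, with a squared-error loss on a single training pair. The network shape, parameter values, and function names are hypothetical; the finite-difference check at the end is only there to confirm that the hand-derived gradients are consistent with the loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x, y):
    """Forward pass: hidden = sigmoid(w1*x + b1), output = w2*hidden + b2."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    loss = (y_hat - y) ** 2
    return loss, h, y_hat

def backward(params, x, y):
    """Backward pass: multiply local derivatives from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, h, y_hat = forward(params, x, y)
    d_yhat = 2.0 * (y_hat - y)        # dL/dy_hat
    d_w2, d_b2 = d_yhat * h, d_yhat   # output layer parameters
    d_h = d_yhat * w2                 # gradient passed back to the hidden layer
    d_z1 = d_h * h * (1.0 - h)        # sigmoid derivative is h * (1 - h)
    d_w1, d_b1 = d_z1 * x, d_z1       # hidden layer parameters
    return [d_w1, d_b1, d_w2, d_b2]

def numerical_grad(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, y)[0] - forward(down, x, y)[0]) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]        # hypothetical w1, b1, w2, b2
analytic = backward(params, x=1.5, y=1.0)
numeric = numerical_grad(params, x=1.5, y=1.0)
print(analytic)
print(numeric)
```

The two printed gradient vectors should agree to several decimal places, which is a standard sanity check when implementing backpropagation by hand.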

Regularization
Regularization is another important concept in neural network learning. It is a technique that makes small changes to the learning algorithm to improve the performance of the model on testing or out-of-sample data (Bisong and Bisong, 2019). In other words, it reduces the risk of over-fitting the training data by discouraging the formation of overly complex mapping functions or models. Model regularization involves a regularization term being added to the general model loss function, which takes into account the loss function value for all the training dataset examples. Thus, when using regularization, the loss function $L(\theta)$ (described in Eq. 4) becomes
$$\tilde{L}(\theta) = L(\theta) + R(\theta) \tag{6}$$
where $R(\theta)$ is the added regularization function. The most common forms of regularization are L1 and L2 (Ng, 2004). The difference between them is that the L2 penalty is the sum of the squares of the weights, while the L1 penalty is the sum of the absolute values of the weights.
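The two penalties can be sketched as follows. The strength coefficient `lam` that scales the penalty is an assumption on our part (it is common practice but is not introduced in the text), as are the function names and the toy weights.

```python
def l1_penalty(weights):
    """L1: sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """L2: sum of the squares of the weights."""
    return sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam=0.01, kind="l2"):
    """Data loss plus a penalty on the weights, scaled by the assumed strength lam."""
    if kind == "l2":
        penalty = l2_penalty(weights)
    elif kind == "l1":
        penalty = l1_penalty(weights)
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return data_loss + lam * penalty

weights = [1.0, -2.0, 3.0]             # hypothetical weights
print(l1_penalty(weights))             # 6.0
print(l2_penalty(weights))             # 14.0
print(regularized_loss(0.5, weights, lam=0.1, kind="l1"))
```

Because the L2 penalty squares each weight, it punishes large weights much more heavily than small ones, whereas the L1 penalty grows linearly and tends to push small weights to exactly zero.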

Convolutional Neural Network (CNN)
Among the most powerful classes of DNNs are convolutional neural networks. As their name implies, convolutional networks work by performing a convolution (filtering) operation on the input data. A CNN is usually composed of several convolution layers, which extract useful features from the input data by sliding convolution filters across the input image, which is presented to the network as matrices. CNNs have been applied in a range of domains, including medicine (Saleh et al., 2021). CNNs have also been widely applied in underwater visual monitoring and processing for counting, localizing, classifying, and segmenting objects of interest such as fish (Saleh et al., 2020b).
Figure 2: Schematic diagram of a CNN architecture used for the classification of fish images. The architecture consists of five convolutional layers that include the batch norm operation within them, followed by pooling layers (conv1-conv5). In this model, the feature maps from convolutional layers are pooled through pooling layers then flattened through two fully connected layers (fc6 and fc7). The classification output is the result of a fully connected layer and a softmax activation layer (fc8+softmax).
A typical CNN architecture is composed of convolutional layers, pooling layers, non-linear activation layers, and final output layers, as shown in Figure 2. It is through the filtering convolution operation combined with other parts of the CNN that useful features of the input data are extracted and learned automatically. Developing a CNN usually involves finding the appropriate number, size, and structure of convolution filters, pooling layers, and activation functions, and learning their parameters during training by seeing various examples of the inputs. In the list below, we cover these basic building blocks and layers of a typical CNN.
• Convolutional layer: As already mentioned, a convolutional layer applies a filtering (convolution) operation on its input matrix data to generate another matrix called a feature map. The input matrix can contain the input image information or the feature map generated by a previous CNN layer. The feature maps are the core of a CNN, where useful features of an input are extracted and learned across several convolutional layers.
• Batch Normalization: The goal of this operation, which follows the convolutional operation, is to normalize the learning of the network across the current set of training data (batch), hence the name batch normalization. This is done to improve the speed of learning and the convergence of the deep learning model, because otherwise the network may see a very wide variety of features extracted in its convolutional layers, due to wide input variations. Batch normalization is performed by subtracting the batch mean from the layer's input and dividing the result by the batch standard deviation.
• Activation layer: This layer, which follows the batch normalization layer, applies the neuron activation function explained earlier. It is used to increase the non-linearity of the convolutional layer output and increase its power in learning complex data. The most common activation functions used in conjunction with convolutional layers are Rectified Linear Unit (ReLU) and Sigmoid. Activation functions are also used in the final non-convolutional fully-connected layers of a CNN. A common output activation function is Softmax.
• Pooling layer: The output feature map of the convolutional layer, after being batch normalized and passed through the activation function, is often too big for the next convolutional layer to handle. To reduce its size and improve the efficiency of computation, it can be pooled in a pooling layer to generate a reduced-size feature map, while keeping important features. Pooling is a common operation in CNNs and is used in almost all practical convolutional networks. The most common pooling layers are max pooling and average pooling.
• Dropout: To avoid overfitting to the training data, dropout operations are introduced after the pooling layers. Their task is to cut the network's dependence on a single data instance at each training step, by randomly removing (dropping out) features extracted by the previous convolutional layer.
• Fully connected layer: The fully connected layer, also known as the dense layer, is the second-last layer of a CNN, before the output layer. This layer contains a small number of neurons, each of which is connected to every neuron in the previous layer, so the network is said to be fully connected. The fully connected layer takes all the inputs and weights from the previous layer and combines them into a single vector or matrix. This vector is then passed through an activation function, such as the sigmoid, to calculate the output values of the CNN, which are generated by its final output layer.
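The building blocks listed above can be chained into a very small forward pass. The sketch below is written in plain Python for readability rather than speed, and everything specific in it is hypothetical: the 5x5 toy "images", the single hand-picked filter, and the fully connected weights, which a real CNN would learn during training. Dropout is omitted because it is only active during training, and the learned scale and shift parameters that batch normalization layers usually carry are left out for brevity.

```python
import math

def conv2d(image, kernel):
    """Slide the filter over the input (no padding, stride 1) to build a feature map.
    As in most DL frameworks, the filter is not flipped (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def batch_norm(batch, eps=1e-5):
    """Subtract the batch mean and divide by the batch standard deviation."""
    values = [v for fmap in batch for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[[(v - mean) * scale for v in row] for row in fmap] for fmap in batch]

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2d(fmap, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def flatten(fmap):
    return [v for row in fmap for v in row]

def dense(vector, weights, biases):
    """Fully connected layer: every output neuron sees every input value."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two toy 5x5 inputs forming one batch (hypothetical data).
image_a = [[0, 0, 1, 1, 1] for _ in range(5)]   # dark-to-bright vertical edge
image_b = [[1, 1, 1, 0, 0] for _ in range(5)]   # bright-to-dark vertical edge
kernel = [[-1, 1], [-1, 1]]                     # responds to dark-to-bright edges
fc_weights = [[1, 0, 1, 0], [-1, 0, -1, 0]]     # hypothetical two-class weights
fc_biases = [0.0, 0.0]

feature_maps = [conv2d(img, kernel) for img in (image_a, image_b)]
normalised = batch_norm(feature_maps)
pooled = [max_pool2d(relu(fmap)) for fmap in normalised]
probs = [softmax(dense(flatten(p), fc_weights, fc_biases)) for p in pooled]
print(probs)
```

The order of operations (convolution, batch normalization, activation, pooling, fully connected layer, softmax) follows the list above. In this toy setup only the first input contains the edge the filter responds to, so only that input produces a confident class probability.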

Supervised Learning
There are two main approaches to learning in DL: unsupervised and supervised learning. Unsupervised learning is often used to discover the structure and composition of the input and output domains without an explicit, supervised target domain. This approach enables generalization from one input domain to another by transforming data representations that are not directly related to the data distribution of the target domain.
The supervised learning approach, on the other hand, is designed to explicitly map data from the input domain to its output domain via training pairs that exhibit matching representations. These pairs are carefully crafted by a human (supervisor), hence the name. The training process of supervised learning can suffer from instability and is less effective than the unsupervised learning method, because it learns with an accurate target distribution without domain-specific knowledge.
Supervised deep learning uses a deep neural network mechanism to extract useful features from large amounts of input training data that are labelled to show their desired output domain. The learning is done by using the repetitive backpropagation process (Rojas and Rojas, 1996) explained earlier to adjust the internal parameters of the DL architecture, i.e. the weights and biases of its convolutional and fully connected layers, which are used to determine the representation in each layer from the representation in the preceding layer. In general, adjusting the DL architecture and its parameters to map the input training data to their desired outputs as well as possible is the same as optimising a function $f$, through backpropagation, to map the input domain $X$ to its matching output domain $Y$, i.e. $(f : X \mapsto Y)$.

Developing Deep Learning Models
A comprehensive overview of the essential systematic steps for training a DL model is summarized in Figure 3. Even though these steps are general in DL training, we included useful tips arising from our experience in developing DL applications in various domains from medical imaging to marine science applications. Nevertheless, we

Training Dataset
The available training data is essential for developing an efficient DL model. Datasets are becoming increasingly crucial, even more so than algorithms. Perhaps the most important factor when considering a supervised learning dataset is its size. The requirement for a large training dataset to achieve high accuracy is often a big obstacle. Because visual algorithms are trained on pairs of images and labels, in a supervised manner, they can only identify what has already been given to them. As a result, depending on the project, the number of objects to identify, and the required performance, training datasets might contain hundreds to millions of images. However, smaller training datasets with only a few hundred samples per class may also achieve good results (Saleh et al., 2020b; Konovalov et al., 2019a, 2018, 2019b). Nevertheless, a larger training dataset generally yields greater recognition accuracy. Because of the scarcity of datasets and the difficulty of acquiring reliable data, approaches for boosting the accuracy rate from small samples will inevitably become a focus of future studies. The problem of limited sample data can also be alleviated by transfer learning (Mathur et al., 2020; Molchanov et al., 2016; Lee et al., 2018). Furthermore, data augmentation will become increasingly critical. Section 5.3 covers some challenges of limited data and some approaches to address these challenges.
The second factor to consider when preparing a dataset for DL training is class balance. This is critical to ensure that each class to be identified contains a sufficient number of instances to minimise class imbalance biases. These biases happen when the DL model favours one or more classes due to seeing them more often during training.
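A quick way to check for class imbalance is to count the instances per class. The sketch below also derives inverse-frequency class weights, which is one common mitigation (weighting the loss so that rare classes count more); this mitigation, the species names, and the counts are our own illustrative assumptions rather than a method taken from the works surveyed here.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights.
    A perfectly balanced dataset gives every class a weight of 1.0."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label list with a strong imbalance between three species.
labels = ["snapper"] * 80 + ["bream"] * 15 + ["mullet"] * 5

print(Counter(labels))
print(class_weights(labels))
```

Inspecting the counts before training makes it clear which classes may need more images, augmentation, or re-weighting.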
Also, the training dataset is typically divided into two subsets: the training subset, used for training the model, and the validation/test subset, used for assessing the trained model's performance. If the training subset is too large, it can prolong the model training. If, on the other hand, the training subset is too small, the resulting model may not generalise well to unseen inputs. The validation/test subset is typically used to detect and avoid overfitting, which is a common problem in machine learning and happens when the developed model simply memorises the inputs rather than properly learning from them. Cross-validation is another widely used methodology for testing a DL model's training performance, by splitting the training dataset into multiple mutually exclusive subsets of training and testing data. One method of cross-validation is called $k$-fold cross-validation, in which the training dataset is split into $k$ equally sized subsets (folds). In this method, $k-1$ folds are used for training the model, while the remaining fold is used to test the learning performance. This process is repeated until all the folds have been used once as a test/validation set.
In addition to the above, it is usually vital to perform a comprehensive inspection of the dataset before embarking on code development. This helps to clean the dataset, for instance by finding and removing duplicate data instances. It also helps to identify imbalances and biases, as well as the data distribution, trends, or outliers, which supports better model design and a better understanding of possible wrong DNN predictions.
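The fold rotation in cross-validation can be sketched with sample indices alone. The function name `k_fold_indices`, the fixed random seed, and the shuffling step are illustrative assumptions; when the number of samples is not divisible by the number of folds, the fold sizes in this sketch differ by at most one.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k mutually exclusive folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Ten hypothetical images split into five folds.
for fold_number, (train, validation) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"fold {fold_number}: train on {len(train)}, validate on {sorted(validation)}")
```

Each sample appears in exactly one validation fold, so every image is used once for testing and in all the other rounds for training.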
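As one small example of such an inspection, exact duplicate images can be found by hashing their raw bytes. The sketch below works on an in-memory dictionary of hypothetical file names and byte strings so that it is self-contained; in practice the bytes would be read from the image files. It only detects byte-identical copies, so re-encoded or resized duplicates would need a different technique.

```python
import hashlib
from collections import defaultdict

def find_duplicates(images):
    """Group image names whose raw bytes are identical.
    `images` maps a name to the file's bytes."""
    groups = defaultdict(list)
    for name, data in images.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

# Hypothetical stand-ins for image files.
images = {
    "frame_001.jpg": b"\x01\x02\x03",
    "frame_002.jpg": b"\x04\x05\x06",
    "frame_001_copy.jpg": b"\x01\x02\x03",
}
print(find_duplicates(images))
```

Removing such duplicates before splitting the data also prevents the same image from ending up in both the training and the validation/test subsets.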
Fortunately, in the domain of fish habitat monitoring, researchers currently have access to a variety of datasets. Table 1 lists publicly available underwater fish datasets, their sources, and where to get them, in addition to a summary of their features, their labels, and their sizes. The main point to note about these datasets is that they differ in both size and the number of features. Although the number of these fish datasets is still small (17), the diversity of aquatic species they cover is already quite wide, as indicated in Fig. 4. Moreover, each dataset features a different number of images that have varying resolutions. For each image, there is also a ground truth annotated by a human expert, which makes these datasets very useful. For instance, they can be used by researchers to test their DL models or to pre-train them, as the first step, for their more specific fish monitoring tasks.
After preparing the training dataset, or utilising alternative approaches to address the challenge of insufficient data, one can start developing a DL model using a machine-learning development framework.

Development framework
The rapid evolution of DL has led to the creation of a vast number of development libraries and packages that enable the setting up of DNNs with little effort. Usability and availability of resources, architectural support, customisability, and hardware support are among the benefits of using existing machine-learning frameworks. The most commonly used frameworks are PyTorch, TensorFlow, MATLAB, Microsoft Cognitive Toolkit (CNTK), and Apache MXNet. In the context of DL for marine research, as will be shown later in Tables 3 to 5, PyTorch and TensorFlow are the dominant frameworks, while MATLAB and Caffe have been used only in a few works. Overall, details such as the project needs and the developer's preference should be taken into account when choosing the development framework.
When the development framework is chosen, the next step is to find the most suitable network architecture for the task at hand. This sometimes depends on the framework, as some recent methods may not immediately be supported by all frameworks.

Network Architecture
Network architecture is the structure of the DL model, which depends on what it intends to achieve and its expected input and output. Therefore, the type of training dataset and the expected outcome influence the choice of architecture and its performance. DL network architectures can differ in a variety of ways, such as the type and number of layers, their structure, and their order. Before selecting a network architecture, it is critical to understand the dataset you have and the task you are going to complete. For example, Convolutional Neural Networks (CNNs) are known to learn higher-order features, such as colours and shapes, from data within their convolution layers. Therefore, they are well suited to image-based object recognition. On the other hand, Recurrent Neural Networks (RNNs) are capable of processing temporal information or sequential data, such as the order of words in a sentence. This capability is ideal for tasks such as handwriting or speech recognition.
In the context of fish habitat monitoring, the choice of DL architecture matters in the same way. A CNN architecture is better suited to image-based object recognition such as fish classification, while an RNN architecture is better suited to tasks where the input or output is sequential in nature, such as analysing fish image sequences or generating fish habitat descriptions.
To find a suitable architecture, you first need to define your problem. This problem is defined by two questions: (1) What features will you extract? (2) How will you label these features? The features you extract are determined by your data; in other words, you are interested in the representation of the data you have. The number of features you choose to extract is determined by the task you are trying to solve. As described above, DL architectures can learn features such as colours and shapes for image-based object recognition. Before constructing your network, you need to decide what data type you will use and how you will encode the information. After you have defined your task, you should decide which features are important for it, because this guides the construction of the network. For example, if the features you want to extract are fish shape and fish location, then a convolutional architecture is a natural choice. The features you choose to define should be a subset of all the features in the data. For example, for an image-based object recognition network, you would extract features related to fish species. However, your extracted features also need to cover all the data; for example, you will also need features describing the type of water or the type of background. It is important to take all these features into account when defining your network. For a complete discussion on different DL architectures see .

Table 1
Summary of some publicly available datasets containing fish for training and testing deep learning models.

Network Model
When a general network architecture is selected, the next step is to select, or sometimes develop, a network model of that architecture. For instance, once you have decided to use a CNN, you can choose among many varieties of CNN models. The rule of thumb for selecting a CNN is to choose a model that results in a satisfactory training loss for your dataset. Designing an exotic, custom model is not recommended at this stage. Instead, resist that temptation, choose a model big enough to overfit your dataset, and then regularise it properly to improve the validation loss.
For example, one may pick a well-known CNN model, e.g. ResNet, which can be used out-of-the-box, if their task is simple, e.g. fish classification. In later stages, they can customise their model to adequately capture their dataset. We show in Tables 3 to 5 in the next section that ResNet is the most commonly used model for fish counting (Table 3), fish localization (Table 4), and fish segmentation (Table 5).

Training the model
After choosing a model, it is time to set up a full training/validation pipeline. The steps below are recommended at this stage of development.
• Start with a simple model (i.e. a small number of convolutional layers) that can hardly go wrong and visualise the model performance metrics. Do not use an out-of-the-box large model like ResNet, just yet. It is recommended to plot training loss to see how the network is progressing during learning and if the loss is getting smaller. This also shows the speed of learning.
• To better understand the process, it is recommended to use a fixed random seed (for randomly initialising the network parameters) to ensure that the same results can be achieved when running the code twice.
• Do not perform any data augmentation at this stage as it may introduce errors. You can do data augmentation at a later stage after confirming that your network works properly. A brief introduction to data augmentation and other methods is given in subsection 5.2.
• Use the Adam algorithm (Kingma and Ba, 2014), which helps the learning by adapting the learning rate applied to each network parameter.
• The learning rate is an important hyperparameter of a deep learning model. It is usually the most crucial value during training and should be configured using trial and error. Depending on the size of your dataset, a specific learning rate decay may be needed. Learning rate decay is a technique that allows the learning rate to fall during successive training epochs, until training converges. A high learning rate at the start prevents the network from memorising noisy data, whereas decaying the learning rate improves complex pattern learning.
• Implement early stopping, and monitor the learning process by plotting the training and validation losses, to prevent overfitting.
• Add complexity to your model gradually, e.g. add more layers or use off-the-shelf CNN models, and obtain a performance improvement over time.
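The steps above can be illustrated with a minimal sketch. To stay self-contained it uses only the Python standard library and a toy one-feature classifier rather than a CNN; the function names (`make_data`, `train`), the toy data, and all hyperparameter values are assumptions chosen for illustration, not recommended settings. In practice a framework such as PyTorch or TensorFlow would provide the optimiser and scheduler.

```python
import math
import random

def make_data(n, rng):
    """Toy stand-in for image features: one value x, label 1 when x > 0."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 1.0 if x > 0 else 0.0) for x in xs]

def predict(w, b, x):
    """Logistic model: probability that the label is 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def mean_loss(w, b, data):
    """Mean binary cross-entropy over a dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(w, b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def train(seed=0, epochs=200, base_lr=0.1, decay=0.99, patience=10):
    rng = random.Random(seed)                  # fixed seed: repeatable runs
    train_set, val_set = make_data(80, rng), make_data(20, rng)
    w, b = rng.gauss(0, 0.1), 0.0              # random initialisation
    m, v = [0.0, 0.0], [0.0, 0.0]              # Adam moment estimates
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    best_val, best_params, bad_epochs = float("inf"), (w, b), 0
    history = []                               # (train_loss, val_loss) per epoch
    for epoch in range(1, epochs + 1):
        lr = base_lr * decay ** (epoch - 1)    # exponential learning rate decay
        n = len(train_set)
        grad_w = sum((predict(w, b, x) - y) * x for x, y in train_set) / n
        grad_b = sum(predict(w, b, x) - y for x, y in train_set) / n
        params = [w, b]
        for i, g in enumerate((grad_w, grad_b)):   # Adam update
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** epoch)
            v_hat = v[i] / (1 - beta2 ** epoch)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        w, b = params
        train_loss = mean_loss(w, b, train_set)
        val_loss = mean_loss(w, b, val_set)
        history.append((train_loss, val_loss))
        if val_loss < best_val - 1e-6:         # validation loss improved
            best_val, best_params, bad_epochs = val_loss, (w, b), 0
        else:                                  # early stopping counter
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_params, history

best_params, history = train(seed=0)
```

Plotting the two columns of `history` against the epoch number gives the loss curves recommended above, and calling `train` twice with the same seed reproduces the same curves.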

Testing the model
When the model is trained, its accuracy and performance should be tested using a held-out test subset of the dataset. The test set can also be fully independent of the training dataset. The main point to remember is that the test set should not have been used for the training or validation of the model at all.
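As a minimal sketch of keeping the test set separate, the snippet below shuffles a list of image identifiers with a fixed seed and splits it into training, validation, and test subsets. The function name `split_dataset`, the file names, and the 70/15/15 proportions are assumptions for illustration only.

```python
import random

def split_dataset(items, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once with a fixed seed, then cut into disjoint subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# Hypothetical image identifiers standing in for a real dataset.
image_ids = [f"frame_{i:04d}.jpg" for i in range(100)]
train_ids, val_ids, test_ids = split_dataset(image_ids)
```

A purely random split like this can still leak information when consecutive frames of the same video land in different subsets, so for video-derived datasets it is usually safer to split by video or by site rather than by frame.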
The model's performance should be measured by computing metrics suited to the task at hand. A list of the most common metrics used in testing fish monitoring models is given in Table 2. For classification tasks, Classification Accuracy (CA), Precision and Recall rates are appropriate metrics, while the F1-score, which combines precision and recall, can provide a better measure of model performance and is used in fish counting and localization tasks as shown in Tables 3 and 4. The Intersection-Over-Union (IoU) is the appropriate metric for segmentation tasks, while the mean average precision (mAP) metric suits pixel-wise localization of fish in images. Looking at Tables 3 to 5, other metrics such as Mean Square Error (MSE) and Root MSE (RMSE) have also been used in the marine fish monitoring literature. These can be considered and used if required.
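The classification and counting metrics named above can be computed directly from predictions and ground truth, as in the following sketch. It assumes binary labels in which 1 denotes "fish" and 0 denotes "no fish"; the function names and the example values are illustrative only.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred, positive=1):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def mean_absolute_error(true_counts, pred_counts):
    """Average absolute difference between true and predicted counts."""
    return sum(abs(t - p) for t, p in zip(true_counts, pred_counts)) / len(true_counts)

def root_mean_square_error(true_counts, pred_counts):
    mse = sum((t - p) ** 2 for t, p in zip(true_counts, pred_counts)) / len(true_counts)
    return math.sqrt(mse)

# Illustrative labels: 1 = "fish", 0 = "no fish".
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)

# Illustrative per-image fish counts.
count_mae = mean_absolute_error([3, 5, 2], [2, 5, 4])
count_rmse = root_mean_square_error([3, 5, 2], [2, 5, 4])
```

In this example one fish image is missed and one empty image is flagged, so accuracy, precision, recall, and F1 all equal 0.75.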

Fine Tuning the model
If needed, the performance and accuracy of the model can be improved further. The achievable improvement, however, depends strongly on the model's current accuracy. This step may quickly become complicated, since increasing the model accuracy might require several steps such as adjusting the learning rate, collecting new data, or fully modifying the model's architecture. You should keep this fine-tuning step to a reasonable level. Otherwise, the model might overfit the data.

Deploying the model
Finally, the model deployment mode should be chosen. This depends on the application and the deployment requirements. The model can be deployed to run on a local or remote device (on a web server, a Docker container, a virtual private server (VPS), etc.). This will determine whether the results can be accessed remotely or only within the local network. It is recommended to use a cross-platform deployment method to avoid issues related to, for example, the input/output data format or the type of files used for storing data.
The most commonly used cross-platform model deployment method is Docker (Potdar et al., 2020; Abdul et al., 2019), which is virtualization software that allows setting up and running other software environments on top of a base Linux distribution without the need to set up virtual machines. Docker helps build, configure, and run applications using the same Dockerfile. Typically, Docker is the recommended approach for web applications. In this method, you can use a Docker container or a Docker host on your development machine. A Docker container may be the easiest option for web applications. You can also deploy your network to a remote machine via Docker. The advantage of using a container is that you can share the development environment and run tests of your model using multiple Docker containers. You can also install the Docker tool on your local machine to manage containers conveniently.

Table 2
Performance metrics used to compare various surveyed works.
Mean average precision (mAP): Depending on the detection difficulty, the mean across all classes and/or all thresholds is used.
Classification Error (CE): How often the classifier is incorrect, also known as the "Misclassification Rate". $CE = (FP + FN)/(TP + TN + FP + FN)$, where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.

Applications of Deep Learning in Underwater Fish Monitoring
Deep learning has been widely used in marine environments with applications spanning from deep-sea mineral exploration (Juliani and Juliani, 2021) to automatic vessel detection (Chen et al., 2019). However, we confine the scope of this paper to only marine fish image processing, which typically includes four tasks of classification, counting, localization, and segmentation of underwater fish images, as shown in Fig. 5.
Here, the goal is to assist the reader in understanding the similarities and differences across these tasks and their relevant DL models and techniques. We provide a background of what each task involves, what previous works have been published toward addressing it using deep learning, and synthesize the literature on each task.

Classification
As its name implies, in visual processing, classification is the task of classifying images into different categories. There can be only two categories, i.e. a binary classification, in which the images are classified into two groups, e.g. "fish" and "no fish", depending on the presence or absence of fish in an image (e.g. DeepFish dataset described in the first row of Table 1). The classification can also involve multiple "classes" or groups. For instance, consider assigning different underwater fish images into different groups based on the species (e.g. FishPak dataset in Table 1) present in them.
Consider a manual procedure, in which images in a dataset are compared and related ones are grouped based on similar features, but without necessarily knowing what you are searching for in advance. This is a difficult assignment as there could be thousands of images in the dataset. Moreover, many image classification tasks involve images of different objects. It rapidly becomes clear that an automatic system, such as a DNN, is required to complete this task quickly and efficiently.
Classification is the most widely-used and -studied underwater image processing task using DL. In a previous work, we have covered the use of DNNs specifically for the task of underwater fish classification. We refer the reader to (Saleh et al., 2022) for a comprehensive review of prior art on classification.

Counting
The purpose of the counting task is to predict the number of objects existing in an image or video. Object counting is a key part of the workflow in many major CV applications, such as traffic monitoring (Khazukov et al., 2020;Zhang et al., 2017). In the context of marine applications and fish monitoring, counting may be used to map distinct species and monitor fish populations for effective conservation. With the use of commercially available underwater cameras, data gathering can be done more comprehensively. It is, however, difficult to correctly count fish in underwater habitats. To perform effective counting, models must capture the diversity of the objects in terms of posture, shape, dimension, and features, which makes them complex. Meanwhile, manual counting is very time-consuming, costly, and prone to human error.
DL affords a faster, less expensive, and more accurate alternative to the manual data processing methods currently employed to monitor and analyse fish counts. Table 3 lists several of the recent DL techniques used for fish counting. Saleh et al. (Saleh et al., 2020a) created a novel large-scale dataset of fish from 20 underwater habitats. They used Fully Convolutional Networks (FCNs) for several monitoring tasks including fish counting and reported a Mean Absolute Error (MAE) of 0.38%. DL has the potential to be a more accurate method for assessing fish abundance than humans, with results that are stable and transferable between survey locations. Ditria et al. (Ditria et al., 2021, 2020a) compared the accuracy and speed of DL algorithms for estimating fish abundance in underwater images and video recordings against human counterparts, in order to test their efficacy and usability. On single-image test datasets, a DL method performed 7.1% better than human marine specialists and 13.4% better than citizen scientists. On video datasets, DL was better by 1.5% and 7.8% compared to marine specialists and citizen scientists, respectively. Despite this high potential, DL has not been thoroughly investigated for counting underwater fish. One possible reason for the lack of comprehensive research on fish counting is the scarcity of large publicly available underwater fish datasets. In addition, properly annotating fish datasets to train robust DL models is time-prohibitive and expensive. Although work on underwater fish counting is limited in the literature, several previous works have advanced the field. For instance, Tarling et al. (Tarling et al., 2021) created a novel dataset of sonar video footage of mullet fish labelled manually with point annotations and developed a density-based DL model to count fish from sonar images. They counted fish by using a regression method (Xue et al., 2016) and achieved an MAE of 0.30%.
Other researchers (Schneider and Zhuang, 2020;Liu et al., 2018) used sonar images as well because they present substantially different visual characteristics compared to natural images. Counting fish in sonar images, however, is substantially different from counting fish in underwater video surveillance (Mandal et al., 2018). Unlike natural images, sonar images present unique visual characteristics and have lower resolution due to their specific image formation principle.
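To make the idea of density-based counting from point annotations concrete, the sketch below builds a target density map by placing a normalised Gaussian at each annotated point, so that summing the map recovers the count. This is the general construction commonly used for density-map counting and is not claimed to be the exact procedure of any work cited above; the function names, image size, and `sigma` value are assumptions for illustration.

```python
import math

def gaussian_density_map(points, height, width, sigma=1.5):
    """Build a density map in which each annotated point contributes mass 1."""
    density = [[0.0] * width for _ in range(height)]
    for py, px in points:
        kernel = [[math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        total = sum(map(sum, kernel))
        # Normalise inside the image so the point adds exactly one to the sum.
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density

def count_from_density(density):
    """The predicted count is the sum over all pixels of the density map."""
    return sum(map(sum, density))

# Three hypothetical fish annotated as (row, column) points in a 10 x 10 image.
annotations = [(2, 3), (5, 5), (7, 1)]
target = gaussian_density_map(annotations, height=10, width=10)
estimated_count = count_from_density(target)
```

A counting network trained by regression would be asked to output a map like `target` for each image, and its predicted count would be read off with `count_from_density`.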
Using DL, a computer can be taught to identify fish in underwater images, thus eliminating the subjectivity of humans in counting fish. However, its use for fish population and count analysis is dependent on the model performance on a set of well-defined performance metrics and parameters, which is in itself a challenge. In section 3, we discussed how one can train high-performance DL models, how the use of the current DL pipeline (and other methodologies) can be improved, and how future DL models can be designed for better assessing fish population including their abundance and their location, which is the subject of the next subsection.

Localization
Object localization is an essential task in CV, where the goal is to locate all instances of specified objects (e.g. fish, aquatic plants and coral reef) in images. Marine scientists regularly assess the relative abundance of fish species in their environments and track population variations. Various CV-based fish sampling methods for underwater videos have been offered as an alternative to this tedious manual assessment. However, there is no perfect method for automated fish localization. This is mostly owing to the difficulties that underwater videos bring, such as illumination fluctuations, fish movements, vibrant backgrounds, shape deformations, and the variety of fish species.
To address these issues, several research works have been carried out, which are listed in Table 4. Saleh et al. (Saleh et al., 2020a) developed a fully convolutional neural network that localizes fish in realistic fish-habitat images with high accuracy. Jalal et al. (Jalal et al., 2020) introduced a hybrid method that combines motion-based feature extraction, using optical flow (Beauchemin and Barron, 1995) and Gaussian mixture models (Zivkovic and van der Heijden, 2006), with the YOLO deep learning technique (Chaudhari et al., 2020) to identify and categorise fish in unconstrained underwater videos using temporal information. They reported fish detection F-scores on the dataset of (Joly et al., 2014) and on their own dataset. A Gaussian mixture model is an unsupervised generative modelling approach that can be used to learn first- and second-order statistical estimates of input data features (Zivkovic and van der Heijden, 2006). It represents normally distributed subpopulations within an overall population. The weakness of the Gaussian mixture model is that, when it is trained on videos that contain some fish but no pure background, the fish are modelled as background as well, resulting in misdetections in subsequent video frames (Salman et al., 2019). To compensate for this weakness, optical flow can be used to extract features that are caused solely by motion in the underwater video. Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene generated by the relative motion between an observer and the scene (Beauchemin and Barron, 1995). Knausgård et al. (Knausgård et al., 2021) also implemented YOLO (Chaudhari et al., 2020) for fish localization. To overcome their small number of training samples, they employed transfer learning (explained in the next Section). The YOLO technique achieved a Mean Average Precision (mAP) of 86.96% on the Fish4Knowledge dataset (Giordano et al., 2016).
YOLO-based object detection systems have also been used in several other studies to robustly localize and count fish (Jalal et al., 2020;Xu and Matzner, 2018;Knausgård et al., 2021). To test how well YOLO could generalise to new datasets, (Xu and Matzner, 2018) used it to localize fish in underwater video using three very different datasets. The model was trained using examples from only two of the datasets and then tested on examples from all three datasets. However, the resulting model could not recognise fish in the dataset that was not part of the training set.
Other CNN models have also been adapted to robustly detect fish under a variety of benthic background and illumination conditions. For instance, (Villon et al., 2016) and (Choi) used GoogLeNet (Szegedy et al., 2015a), while (Labao and Naval, 2019a) used an ensemble of Region-based Convolutional Neural Networks that are linked in a cascade structure by Long Short-Term Memory networks (Hochreiter and Schmidhuber, 1997). In addition, Inception (Szegedy et al., 2015b) and ResNet-50 (He et al., 2015) were examined in (Zhuang et al., 2017) for fish detection and recognition based on weakly-labelled images. Furthermore, (Han et al., 2020) and (Li et al., 2015) used Fast R-CNN (Region-based Convolutional Neural Network) to detect and count fish. Table 4 demonstrates that state-of-the-art methods (e.g. YOLO and Fast R-CNN) can achieve high accuracy in localization tasks. These methods generally train object detectors from a wide variety of training images (Felzenszwalb et al., 2010;Girshick et al., 2014) in a fully supervised manner. The drawback is that these models depend on instance-level annotations, e.g. tight bounding boxes need to be drawn around fish in training datasets. This is time-consuming and labour-intensive, and makes the use of DL in marine research very challenging, if not impossible. In Section 5.3.4 we discuss how this critical issue can be addressed using weakly supervised localization of objects, where only binary image-level labels showing the existence or absence of an object type are needed for training.
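Evaluating bounding-box localization typically starts from the overlap between predicted and annotated boxes. The sketch below computes box IoU and counts true positives, false positives, and false negatives at a fixed IoU threshold with a simple greedy matching. The box format `(x_min, y_min, x_max, y_max)`, the threshold of 0.5, the function names, and the example boxes are assumptions for illustration; full mAP evaluation additionally ranks predictions by confidence, which is omitted here.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(pred_boxes, gt_boxes, threshold=0.5):
    """Greedy one-to-one matching; returns (TP, FP, FN) at the IoU threshold."""
    unmatched = list(range(len(gt_boxes)))
    tp = 0
    for pred in pred_boxes:
        best_j, best_iou = None, threshold
        for j in unmatched:
            iou = box_iou(pred, gt_boxes[j])
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:      # prediction explains one annotated fish
            unmatched.remove(best_j)
            tp += 1
    fp = len(pred_boxes) - tp       # predictions with no matching annotation
    fn = len(unmatched)             # annotated fish that were missed
    return tp, fp, fn

# Hypothetical annotations and predictions for one image.
gt = [(10, 10, 50, 50), (60, 60, 100, 100)]
pred = [(12, 12, 52, 52), (200, 200, 220, 220)]
tp, fp, fn = match_detections(pred, gt)
```

From these counts, precision, recall, and the F1-score follow in the same way as for classification.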
Similar to fish classification, counting, and localization, fish segmentation, i.e. detecting the entire body of fish in an image, is a critical task in marine research and applications. In the next subsection, we discuss how DL can be used to perform fish segmentation and how it is useful in marine research.

Segmentation
The semantic segmentation task is to predict a label from a set of pre-defined object classes for each pixel in an image (Shelhamer et al., 2017). In the context of marine research, fish segmentation provides a visual representation of the fish contour, which might be helpful for visual verification by a human expert or for estimating fish size and weight. Table 5 lists a number of studies addressing the task of fish segmentation.
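Because segmentation assigns a label to every pixel, its quality is usually measured per pixel as well. The following sketch computes the IoU between a predicted and a ground-truth binary mask, where 1 marks "fish" pixels; the function name and the tiny example masks are assumptions for illustration.

```python
def mask_iou(pred_mask, true_mask):
    """Pixel-wise IoU between two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so define their IoU as 1.
    return intersection / union if union else 1.0

# Hypothetical 3 x 4 masks: 1 = fish pixel, 0 = background.
true_mask = [[0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]
pred_mask = [[0, 1, 1, 1],
             [0, 0, 1, 0],
             [0, 0, 0, 0]]
iou = mask_iou(pred_mask, true_mask)
```

Here three pixels are shared and five pixels are covered by at least one mask, giving an IoU of 0.6.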
Saleh et al. (Saleh et al., 2020a) developed an FCN model that performs fish segmentation in realistic fish-habitat images with high accuracy. Labao et al. (Labao and Naval, 2019b) proposed a DL model that can simultaneously localize fish, estimate bounding boxes around them, and segment them in underwater videos using a unified multi-task CNN. Unlike previous approaches (Qian et al., 2016;Wang and Kanwar, 2021) that relied on motion information to identify the fish body, their proposed method predicts fish object spatial coordinates and per-pixel segmentation using just video frames, independent of motion information. Their approach is more resilient to camera motion or jitter since it does not depend on motion information, making it more suitable for processing underwater videos captured by Autonomous Underwater Vehicles (AUVs). Region Proposal Networks (RPN) (Ren et al., 2017) have also been used for fish segmentation in underwater videos (Alshdaifat et al., 2020). An RPN is an FCN that simultaneously generates boxes around identified objects and assigns them confidence scores of belonging to a specific class.
Computational efficiency is essential in the autonomy pipeline of visually-guided underwater robots. For this reason, (Islam et al., 2020) developed SUIM-Net, a fully-convolutional encoder-decoder model that balances the trade-off between performance and computational efficiency. On the other hand, for higher performance, (Zhang et al., 2022) proposed the Dual Pooling-aggregated Attention Network (DPANet), which adaptively captures long-range dependencies in a computationally friendly manner to enhance feature representation, improving segmentation performance while reducing the required computational resources and time.
All previously discussed models use fully-supervised methods that require a large amount of pixel-wise annotations, which is very time-consuming and expensive, because a human expert must segment and label, for example, each fish in an image. To overcome this serious issue, weakly-supervised semantic segmentation models are used. These models do not need to be trained with pixel-wise annotation (Rajchl et al., 2016). However, due to the lower level of supervision, training weakly-supervised semantic segmentation models is often a more challenging task. Applying weakly labelled ground truth derived from motion-based adaptive Mixture of Gaussians Background Subtraction, (Labao and Naval, 2017) obtained an average precision of 65.91% and an average recall of 83.99%. Recently, several other weakly-supervised methods have been introduced to overcome the cost of a large amount of pixel-wise annotations. These methods rely on weaker forms of annotation, including bounding boxes (Khoreva et al., 2017;Dai et al., 2015), scribbles (Lin et al., 2016), points (Laradji et al., 2021b;Bearman et al., 2016), and even image-level annotation (Pathak et al., 2015;Wang et al., 2018;Ahn and Kwak, 2018;Huang et al., 2018;Wei et al., 2018). Since weakly-supervised methods are integral to the success of important DL-based segmentation tasks, we discuss them further in Section 5.3.
In the previous subsections, we discussed how DL is useful in a number of key applications in fish habitat monitoring. In the following Section, we discuss the many challenges in developing DL models for such applications.

Challenges in underwater fish monitoring
Underwater fish monitoring presents a series of challenges for DL, which have been the focus of many research works. In this section, we first introduce the major environmental challenges faced when developing underwater fish monitoring models. We then show that one of the approaches to properly address these environmental challenges is to use DL. However, DL training for fish monitoring has its own challenges, which will be discussed in detail.

Environmental challenges
In order to work in underwater environments, monitoring models must be able to recognize objects and scenes in complex, non-trivial backgrounds. This presents a challenge both in the development and training of these models and in robustly testing them. The main environmental challenges in underwater visual fish monitoring can be categorized as follows:
1. The environment is noisy, with very large lighting variation. An object viewed from a distance is much less bright than a close-up object. These problems become more acute when the background is not uniform.
2. Underwater scenes are highly dynamic, i.e. the scene's content and objects change very quickly. The background can change from being completely occluded to being visible and vice versa.
3. Depth and distance perception can be incorrect due to refraction. This is more severe for short distances.
4. Images are affected by water turbidity, light scattering, shading, and multiple scattering.
5. The image data are frequently under-sampled due to low-resolution cameras and power constraints underwater.
One of the main approaches used in the literature to address these challenges is for the monitoring models to use hand-crafted features (Rova et al., 2007;Hu et al., 2012;Fouad et al., 2014;Huang et al., 2014;Chuang et al., 2016;Ogunlana et al., 2015;Hossain et al., 2016;Wang et al., 2017;Islam et al., 2019). Hand-crafted features are defined by a human to describe a fish image. For example, a low-level feature can be the histogram of a texture or a Gabor filter response. As a more complex and representative feature, a mid-level feature can be a Scale-Invariant Feature Transform (SIFT) (Lindeberg, 2012), or a Histogram of Oriented Gradient (HOG) (Dalal and Triggs, 2005). However, human-defined features cannot be readily applied to other datasets, and defining them is a time-consuming task that requires manual effort and restricts real-time detection. Moreover, hand-crafted features are limited by human experience, may contain noise, and are difficult to design. For example, a SIFT descriptor does not work well under lighting changes and blur.
In such approaches, a fish image is transformed into a feature space that a computer can process. The feature space is often based on a combination of low-level image features (for example, colour distribution and gradient) and other features in the image such as edges, shapes, and textures. Models using hand-crafted features, however, do not perform well under varying environmental conditions, and the feature space cannot be easily or robustly created. Additionally, the features created are too low-level and cannot be easily used for processing images from different sources.
An alternative way to build prediction models capable of working in the presence of these significant environmental challenges is to use DNNs. However, training effective DNNs requires resolving some other challenges, which we discuss in the subsections below. We also describe some of the approaches in the literature that address them. The reviewed approaches to these common challenges can provide a quick reference for future researchers developing DL-based fish monitoring models.

Model Generalisation
Improving the generalisation abilities of DNNs is one of the most difficult tasks in DL. Generalisation refers to the gap between a model's performance on previously observed data (i.e. training data) and data it has never seen before (i.e. testing data). This is a fundamental problem, with implications for any application using deep neural networks to process image data, videos, etc. This challenge is even more pronounced for more difficult tasks such as fish recognition in underwater environments.
The generalisation problem usually arises because the network over-fits to the training data during training. In other words, the weights of the network are adapted to produce a response that is best suited for reproducing the training examples. During testing, the network produces a response that is a compromise between the different training examples. This mismatch is a common cause of poor performance on test data and is often referred to as the network overfitting to the training data; it can occur even when the network has been trained for many epochs. It occurs because the network "memorises" the training data during training. The training data can become quite large, consisting of hundreds of thousands or millions of examples, which makes the issue of network over-fitting quite significant. In the last few years, there have been significant research efforts toward solving the problem of over-fitting to improve model generalisation.
Previous works have shown that it is possible to prevent the network from over-fitting using techniques called regularisation (Kukačka et al., 2017). There are also some theoretical techniques to make the network more robust to training data. Below, we provide a brief overview of some of these techniques and how they have been applied to solve the problem of deep network over-fitting to training data, to improve generalisation in DL.
• Regularisation Term: It is hypothesised that neural networks with smaller weights result in simpler models that retain the capability of the complete model. A regularisation term is, therefore, added to the model loss function to penalise large weights, shrinking them or removing some of the weight matrix components altogether. The most popular regularisation methods are L1 and L2. For example, Tarling et al. (Tarling et al., 2021) showed that incorporating uncertainty regularisation improves the performance of their multi-task network with a ResNet-50 backend for counting fish in underwater images.
• Batch normalisation: Introduced in Section 2.2 as part of the convolutional layer in CNNs, batch normalisation was first proposed by Ioffe and Szegedy (Ioffe and Szegedy, 2015) to decrease the effect of internal covariate shift. Internal covariate shift is the change in the distribution of the inputs to each layer during training, caused by updates to the parameters of the preceding layers. Internal covariate shift can impede the training of deep neural networks. Batch normalisation is used in the training of almost any DL model and helps to improve model generalisation. In the fish monitoring domain, for instance, Islam et al. (Islam et al., 2020) proposed an optional residual skip block consisting of three convolutional layers with batch normalisation and ReLU non-linearity after each convolutional layer to perform effective semantic segmentation of underwater imagery.
• Dropout: Introduced in Section 2.2 as a common operation in CNNs, dropout reduces the network's dependency on a small selection of neurons and encourages more useful and robust properties and features of the dataset to be learnt. When working with a complex neural network structure, dropout is frequently recommended to introduce additional randomisation, which helps with the generalisation capability of the network. For example, Iqbal et al. (Iqbal et al., 2021) claimed that the inclusion of a dropout layer enhanced the overall performance of their proposed model for automatic fish classification.
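The three techniques above can each be written in a few lines. The sketch below is a plain-Python illustration, assuming a single feature per example for batch normalisation and the common "inverted" form of dropout; the function names and parameter values are illustrative, and DL frameworks provide equivalent built-in layers and weight-decay options.

```python
import random

def l1_penalty(weights, lam):
    """L1 term added to the loss: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 term added to the loss: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p during training."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    # Surviving units are scaled by 1/keep so the expected value is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = [1.0, -2.0, 3.0]
data_loss = 0.5                               # hypothetical task loss
total_loss = data_loss + l2_penalty(weights, lam=0.01)

normalised = batch_norm([1.0, 2.0, 3.0, 4.0])
dropped = dropout([1.0] * 1000, p=0.5, rng=random.Random(0))
```

At test time, `dropout` is called with `training=False` and returns the activations unchanged, while a full batch normalisation layer would use running statistics collected during training, which this sketch omits.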

Dataset Limitation
Preparing training datasets is one of the central and most time-consuming bottlenecks in developing DL models. These models require a large amount of data, e.g. a variety of underwater fish images in different environmental conditions, which should also be labelled and analyzed by humans for supervised learning. Due to these requirements, building a large dataset is usually very challenging, so the available datasets tend to be limited and small. However, compared with DL models trained with a large dataset, the convergence speed and training accuracy of models trained with small datasets are much lower. Generally, increasing the size of training datasets by adding more data is the classic way to accelerate the training and improve the accuracy of DL models, but it is expensive. Therefore, in recent years, researchers have tackled the dataset limitation challenge by devising the new approaches described below.

Data Augmentation
Data augmentation is a technique to increase the number of labelled examples available for DL training. It artificially enlarges the original training dataset by applying various transformations, such as translation, rotation, scaling, and even noise, to the original data instances to make new instances. It is particularly relevant when the quantity or quality of labelled data is insufficient to train a DL model. At the same time, data augmentation can be used to reduce the probability of overfitting and increase model generalisability. In contrast to the techniques listed above for improving model generalisation, data augmentation addresses overfitting at the source of the problem (i.e. the original dataset). This is done under the notion that augmentations can extract additional information from the original dataset by artificially increasing the size of the training dataset. It is also critical to consider the "safety" of a data augmentation (i.e. the possibility of misleading the network post-transformation). For example, rotation and horizontal flipping are typically safe data augmentation techniques for fish classification tasks (Saleh et al., 2020a; Sarigül and Avci, 2017), but they are not safe for digit classification tasks, where, for example, a rotated 6 can be confused with a 9. Another data augmentation technique is to use the super-resolution reconstruction method (Ledig et al., 2017), based on Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), to enlarge the dataset with high-quality images. This has previously been used to improve small-scale fine-grained fish classification (Qiu et al., 2018), and to increase models' predictive performance (i.e. their ability to generalise to new data) (Konovalov et al., 2019a) for underwater fish detection and automatic fish classification (Chen et al., 2018).
Using augmentation techniques such as cropping, flipping, colour changes, and random erasing together can result in enormously inflated dataset sizes. For example, Islam et al. (Islam et al., 2020) used rotation, width shift, height shift, shear, zoom, and horizontal flip to significantly increase the size of their dataset for semantic segmentation of underwater imagery. Another data augmentation technique used when training DL models is scale jittering, which was used in (Mandal et al., 2018) for assessing fish abundance in underwater videos. Gaussian filtering to blur images, together with different degrees of rotation, is another augmentation technique used in the marine monitoring domain, for fish recognition with an underwater drone carrying a panoramic camera (Meng et al., 2018).
However, augmentation is not always favourable, as it might lead to severe overfitting in cases with very few data samples. As a result, it is critical to determine the best subset of augmentation techniques when training a DL model on a limited dataset.
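
As a hedged illustration of how label-preserving augmentations enlarge a dataset, the sketch below applies a horizontal flip, a 90-degree rotation, and additive Gaussian noise to a tiny greyscale image stored as nested lists. The helper names, the noise level `sigma`, and the label "barramundi" are hypothetical; which transformations are "safe" must be decided per task, as discussed above.

```python
import random


def horizontal_flip(image):
    """Mirror each row of the image (left-right flip)."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def add_noise(image, sigma, rng):
    """Add Gaussian noise and clip pixel values to the [0, 255] range."""
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


def augment(image, label, rng, sigma=5.0):
    """Return the original example plus three transformed copies, all with the same label."""
    variants = [image, horizontal_flip(image), rotate_90(image),
                add_noise(image, sigma, rng)]
    return [(v, label) for v in variants]


# A made-up 2x3 greyscale "image" and a hypothetical species label.
image = [[0, 50, 100],
         [10, 60, 110]]
augmented = augment(image, "barramundi", random.Random(42))
```

One labelled image becomes four labelled training examples at no additional annotation cost. DL frameworks offer equivalent, far more complete transformation pipelines, which would normally be preferred to hand-written helpers.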

Transfer Learning
Transfer learning preserves the knowledge obtained while solving one problem and transfers it to another, similar problem. For instance, one may initially train a network on a large object dataset, such as ImageNet, which includes 1000 different object classes, and then use the learned network parameters as the initial parameters in a new classification task, e.g. fish classification. In most cases, only the weights of the convolutional layers are transferred, rather than the complete network including the fully connected layers. This is extremely useful since many image datasets share low-level spatial features and properties that are better learnt from massive datasets. For example, Zurowietz and Nattkemper (Zurowietz and Nattkemper, 2020) presented unsupervised knowledge transfer to make use of their limited amount of training data and avoid time-consuming annotation for object detection in marine environmental monitoring and exploration.
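
The idea of reusing learned feature-extraction weights and training only a new task-specific output layer can be sketched without any DL framework. Everything in the example below is a hypothetical stand-in: `PRETRAINED_W` plays the role of convolutional weights learnt on a large source dataset, the logistic-regression "head" plays the role of the new fully connected layers, and the toy samples are made up. It is an illustration of the principle, not a practical pipeline.

```python
import math

# Hypothetical "pretrained" feature extractor: weights learnt on a large
# source dataset and kept frozen (never updated) on the new task.
PRETRAINED_W = [[0.5, -0.5],
                [0.25, 0.75]]


def extract_features(x):
    """Frozen layer: linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, lr=0.5, epochs=500):
    """Train only a new logistic-regression head on top of the frozen features."""
    feats = [extract_features(x) for x in samples]
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            grad = p - y  # gradient of binary cross-entropy w.r.t. the logit
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


def predict(x, w, b):
    """Classify one input using the frozen features and the trained head."""
    f = extract_features(x)
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) >= 0.5 else 0


# Tiny toy target task (made-up numbers): 1 = fish, 0 = background.
samples = [[2.0, 0.0], [3.0, 0.5], [0.0, 2.0], [0.5, 3.0]]
labels = [1, 1, 0, 0]
w, b = train_head(samples, labels)
predictions = [predict(x, w, b) for x in samples]
```

Because only the small head is trained, far fewer labelled examples are needed than when training the whole network from scratch. A common refinement, not shown here, is to later unfreeze some of the transferred layers and fine-tune them with a small learning rate.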

Hybrid Features
DL architectures have demonstrated excellent capabilities in capturing semantic knowledge that is latent in image features. Hand-crafted features, on the other hand, can provide specific physical descriptions if they are carefully chosen. In addition, it has been demonstrated that CNN features and hand-crafted features describe the attributes of natural images differently. This means that the discriminative ability of a feature may differ from one dataset to another. Therefore, these two types of features may complement each other for better learning.
However, increasing feature dimensions by fusing hand-crafted and DL-generated features can result in increased computational requirements. One way to limit this is to initially use only DL features for a particular dataset, and add hand-crafted features later, only if they are needed to enhance the performance. When working with difficult datasets, such as those of uncommon and rare marine species, more sophisticated algorithms and techniques based on hybrid features may therefore be required. In fact, several research groups have used such strategies to improve the performance of marine species recognition tasks.
For instance, Mahmood et al. (Mahmood et al., 2016) used texture- and colour-based hand-crafted features extracted from their CNN training data to complement generic CNN-extracted features, and achieved a higher coral classification accuracy than when using generic CNN features alone. A combination of CNN and hand-designed features has also been used in (Cao et al., 2016) for marine animal classification, again showing that the combined method achieves higher accuracy than applying a CNN alone. In another work, Blanchet et al. showed that aggregating multiple features outperforms models using a single feature-extraction technique for automated coral annotation in natural scenes (Blanchet et al., 2016).
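
A simple way to picture feature fusion is concatenation. The sketch below is only one possible fusion scheme and does not reproduce the methods of the works cited above: the CNN embedding is a placeholder vector, the hand-crafted descriptor is a basic intensity histogram, and both are L2-normalised before concatenation (an assumption made here so that neither feature type dominates because of its scale).

```python
import math


def colour_histogram(pixels, bins=4):
    """Hand-crafted feature: normalised histogram of 8-bit values from one colour channel."""
    hist = [0] * bins
    for px in pixels:
        hist[min(bins - 1, px * bins // 256)] += 1
    total = len(pixels)
    return [h / total for h in hist]


def l2_normalise(vec):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def hybrid_features(cnn_features, pixels):
    """Concatenate L2-normalised CNN features with L2-normalised hand-crafted features."""
    return l2_normalise(cnn_features) + l2_normalise(colour_histogram(pixels))


cnn_features = [0.0, 3.0, 4.0]                 # placeholder for a CNN embedding
pixels = [10, 20, 70, 130, 200, 250, 255, 90]  # placeholder 8-bit channel values
fused = hybrid_features(cnn_features, pixels)  # 3 CNN + 4 histogram dimensions
```

The fused vector can then be passed to any classifier. Its length is the sum of the two feature lengths, which is the source of the increased computational cost mentioned above.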

Weakly-Supervised Learning
DL methods (LeCun et al., 2015) have consistently achieved state-of-the-art results in a variety of applications, specifically in fully supervised learning tasks like classification and regression (Li et al., 2009; Lin et al., 2014). Fully supervised learning methods create predictive algorithms by learning from a vast number of training patterns, where each pattern has a label showing its ground-truth output (Kotsiantis, 2007). Although current fully supervised methods have been very successful in certain tasks (De Vos et al., 2017; Wörz and Rohr, 2006; Mader et al., 2018), they come with the caveat of requiring a large portion of the data to be labelled, and it is sometimes difficult or extremely time-consuming to obtain ground-truth labels for the dataset. Thus, it is desirable to develop learning algorithms that are able to work with less labelled data (i.e. weakly supervised) (Zhou, 2018; Oquab et al., 2015).
Weak supervision can be particularly useful in underwater fish monitoring, where the limited dataset size and the time- and cost-prohibitive nature of labelling make it difficult to build a dataset suitable for developing effective, smart, and automated habitat monitoring tools and techniques. A number of works in the literature have already used weak supervision for underwater fish habitat monitoring. For example, Laradji et al. (Laradji et al., 2020) proposed a segmentation model that can be trained efficiently on underwater fish images that are not manually segmented, but only labelled with simple point-level supervision. This work demonstrated that, in the marine monitoring context, weakly-supervised learning can effectively improve the accuracy and speed of model development with limited dataset sizes and a limited labelling budget.
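
To make point-level supervision concrete, the following minimal sketch computes a segmentation loss only at the few pixels an annotator has clicked, and ignores all other pixels. This is a deliberately simplified illustration and is not the loss function of Laradji et al.; the function name, the probability map, and the annotated points are all made up.

```python
import math


def point_supervised_loss(prob_map, points, eps=1e-7):
    """Mean binary cross-entropy evaluated only at annotated pixels.

    prob_map: 2D list of predicted fish probabilities, one per pixel.
    points: list of (row, col, label), label 1 = fish, 0 = background.
    Unannotated pixels contribute nothing to the loss.
    """
    total = 0.0
    for r, c, label in points:
        p = min(1.0 - eps, max(eps, prob_map[r][c]))  # avoid log(0)
        total += -math.log(p) if label == 1 else -math.log(1.0 - p)
    return total / len(points)


# Made-up model output for a 3x3 image.
prob_map = [[0.9, 0.2, 0.1],
            [0.8, 0.3, 0.1],
            [0.1, 0.1, 0.6]]
# Three clicks (two on fish, one on background) instead of a full pixel mask.
points = [(0, 0, 1), (2, 2, 1), (1, 2, 0)]
loss = point_supervised_loss(prob_map, points)
```

The annotation effort is three clicks rather than nine per-pixel labels. For realistic image sizes the saving is several orders of magnitude, which is what makes this form of supervision attractive when the labelling budget is limited.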

Active Learning
Active learning is a sub-field of ML and, more broadly, of AI. In active learning, the learning algorithm is allowed to be "inquisitive", that is, it is allowed to choose the data from which it learns, which in theory means the algorithm can do more with less guidance, similar to weak supervision. Active learning systems seek to overcome the labelling bottleneck by posing queries, in the form of unlabelled examples, to be labelled by an oracle (e.g. a human annotator). In this manner, the goal of the active learner is to attain high accuracy by using as few labelled examples as possible, thus minimising the expense of acquiring labelled data; see Figure 6.
In many cases, labels come at little or no cost, like the "spam" flag used to mark spam emails, or the five-star rating that a user might give a movie on a social networking platform. Learning methods use these labels and ratings to help filter spam email and recommend movies that a user might enjoy. In these cases, labels are obtained free of charge, but this is not the case for more sophisticated supervised learning tasks, such as segmenting a fish in an underwater environment. Active learning has been used to reduce this labelling cost in the marine domain. For example, in (Nilssen et al., 2017), active learning was used for the classification of species in underwater images from a fixed observatory. The authors proposed an active learning method that assigns taxonomic categories to single patches based on a set of human expert annotations, making use of cluster structures and relevance scores. Compared to traditional sampling strategies, this active learning method used significantly fewer manual labels to train a classifier.
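
One common query strategy, uncertainty sampling, can be sketched as follows. It is shown only as a generic illustration and is not the cluster- and relevance-based method of Nilssen et al.; the patch identifiers, predicted probabilities, and labelling budget are hypothetical. The learner ranks unlabelled examples by the entropy of its own predictions and asks the oracle to label the most uncertain ones first.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(unlabelled_probs, budget):
    """Return the ids of the `budget` most uncertain unlabelled examples.

    unlabelled_probs: dict mapping example id -> predicted class probabilities.
    """
    ranked = sorted(unlabelled_probs,
                    key=lambda k: entropy(unlabelled_probs[k]),
                    reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for four unlabelled image patches (3 classes).
unlabelled_probs = {
    "patch_a": [0.98, 0.01, 0.01],
    "patch_b": [0.34, 0.33, 0.33],
    "patch_c": [0.70, 0.20, 0.10],
    "patch_d": [0.50, 0.49, 0.01],
}
queries = select_queries(unlabelled_probs, budget=2)
```

The selected patches are sent to the human annotator, the newly labelled examples are added to the training set, the model is retrained, and the cycle repeats until the labelling budget is exhausted.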

Opportunities in applications of DL to underwater fish monitoring
New methodologies and strategies should be developed to advance DL models for various underwater visual monitoring applications, including fish monitoring, and to bring them closer to their terrestrial monitoring equivalents. In a previous study focused on the task of fish classification (Saleh et al., 2022), we discussed some future research opportunities, including (i) utilizing spatiotemporal data to add space and time domain information to the current training algorithms, which mainly learn from fish images regardless of their spatial and/or temporal correlation; (ii) developing efficient and compact DL models that can be deployed underwater for real-time parsing of fish images at the collection edge; (iii) combining image data from multiple collection platforms for improved multi-faceted learning; and (iv) automated fish measurement and monitoring from images captured underwater. Below, we expand on some of the opportunities previously discussed in (Saleh et al., 2022) and explore a few other prospective research areas for increasing the performance and usability of visual fish monitoring tasks.

Knowledge Distillation for Underwater Embedded and Edge Processing
DL models used for fish monitoring applications are usually very large, containing millions of parameters and requiring extensive computational power. To deploy these models on resource-limited devices and in resource-constrained environments such as undersea monitoring sites, different hardware-enabled compression techniques, such as quantizing and binarizing DNN parameters (Lammie et al., 2019), can be used, as discussed in (Saleh et al., 2022). Another method that has attracted a lot of interest for compressing large-scale DL models is knowledge distillation.
Knowledge distillation is a technique for training a student (i.e. a small network) to emulate a teacher (e.g. a large network or an ensemble of networks), as shown in Figure 7. The primary assumption is that the student model can achieve competitive or even superior performance by imitating the teacher model. The main challenge, however, is transferring the knowledge from a large teacher to a smaller student. To that end, Buciluǎ et al. (Buciluǎ et al., 2006) proposed model compression as a way to transfer knowledge from a large model into a small model without a significant loss of accuracy. In addition, several other model compression approaches have been developed, and the community has shown an increasing interest in knowledge distillation due to its potential (Amadori, 2019; Wang et al., 2020; Rassadin and Savchenko, 2017; Kushawaha et al., 2021).
A significant research opportunity lies in applying knowledge distillation to embedded devices and underwater video processors to achieve online and more effective surveillance with high accuracy while using limited resources. This is particularly useful because of the limitations of transferring data from underwater sensors and cameras, and due to the challenging underwater communication in the Internet of Underwater Things (Jahanbakht et al., 2021).
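
A widely used way of training the student is a temperature-based distillation loss, sketched below. This formulation is an assumption introduced here for illustration and is not described in the works cited above: the teacher and student outputs are softened with a temperature `temperature`, the student is penalised for diverging from the softened teacher distribution, and a standard cross-entropy term on the true label is mixed in with weight `1 - alpha`. The logits and hyperparameter values are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of a soft-target term (KL divergence between softened teacher
    and student outputs, scaled by temperature squared) and a hard-label
    cross-entropy term computed at temperature 1."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0.0)
    hard = -math.log(softmax(student_logits)[true_class])
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * hard


teacher_logits = [6.0, 2.0, -1.0]   # made-up outputs of a large teacher
student_logits = [3.0, 1.5, 0.0]    # made-up outputs of a small student
loss = distillation_loss(student_logits, teacher_logits, true_class=0)
```

Minimising this loss pushes the student towards the teacher's full output distribution rather than only the correct class, which is how information about similarities between classes (for example, visually similar species) is passed to the smaller model.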

Merging Image Data from Multiple Sources
As discussed in (Saleh et al., 2022), to train more effective DNNs, multiple data collection platforms such as Autonomous Underwater Vehicles (AUVs) or inhabited submarines can provide varied visual data of the same monitoring subject. This can provide additional monitoring information, such as fish distribution patterns. Although it is straightforward to combine multiple data sources for training a DL network, several issues should be addressed in future research. These include possible preprocessing of part of the data to make it compatible with the rest of the training dataset, class-wise weights (e.g. when the dataset is imbalanced), and the number of outputs of the network. In addition, using multiple training data sources, in particular when using AUVs or submarines, incurs significant data collection and manual labelling costs, which are not always viable.
For this reason, some researchers have focused on learning from data with the least amount of human labelling. To reduce the cost of human-labelled data, several methods have been proposed to train models on data that are unlabelled (Shimada et al., 2021) or that only have pseudo-labels (Wu and Prasad, 2018). Future research can advance this further by developing faster and cheaper annotation tools for underwater fish images.
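
Two of the issues raised above, class-wise weights for imbalanced pooled data and pseudo-labels for unlabelled data, can be sketched in a few lines. Both helpers are generic illustrations rather than the methods of the cited works: the weights follow the common inverse-frequency heuristic, pseudo-labels are kept only when the model's confidence exceeds a threshold chosen arbitrarily here (0.9), and the species names and frame identifiers are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def select_pseudo_labels(predictions, threshold=0.9):
    """Keep only confident predictions on unlabelled images as pseudo-labels.

    predictions: dict mapping image id -> dict of class -> probability.
    """
    selected = {}
    for image_id, probs in predictions.items():
        best_class = max(probs, key=probs.get)
        if probs[best_class] >= threshold:
            selected[image_id] = best_class
    return selected


# Labels pooled from two hypothetical collection platforms (imbalanced).
labels = ["snapper"] * 6 + ["grouper"] * 3 + ["wrasse"] * 1
weights = balanced_class_weights(labels)

# Hypothetical model outputs on two unlabelled video frames.
predictions = {
    "frame_001": {"snapper": 0.95, "grouper": 0.03, "wrasse": 0.02},
    "frame_002": {"snapper": 0.55, "grouper": 0.40, "wrasse": 0.05},
}
pseudo = select_pseudo_labels(predictions)
```

Rare classes receive larger weights so that their errors count more in the training loss, and only the confidently predicted frame is added to the training set with its pseudo-label. The ambiguous frame is left for a human annotator or a later training round.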

Automatic Fish Phenotyping From Underwater Images
Automatic fish phenotyping, i.e. extracting the weight, size, and length of fish, in their natural habitats can provide invaluable information for better understanding marine ecosystems and fish ecology (Goodwin et al., 2022). Although many studies have addressed fish monitoring in aquaculture and fish farm settings (Li and Du, 2021; Zhao et al., 2021), monitoring fish for measurement in natural habitats remains mostly unexplored, and can be investigated in future research. Such research should address problems such as low visibility and light, and fish occlusion and overlap, which are shared with aquaculture monitoring. However, other problems unique to natural habitats, such as cluttered background environments and underwater distance measurement, should be addressed too.

Visual Monitoring of Fish Behavior and Movements
Although some telemetry and satellite tracking devices can be used in limited settings (Lennox et al., 2017), monitoring fish in their natural habitats over a period of time is not achievable using these techniques, mainly due to the hostile underwater signal communication medium (Jahanbakht et al., 2021). Therefore, new visual monitoring techniques should be devised, for instance for tracking fish movements, schooling, and behavior. A possible direction for future studies is to develop a better understanding of fish vision characteristics (Boudhane and Nsiri, 2016) and their implications for the current and next generation of automated DL-based tracking systems (Li et al., 2020) and marine object detection (Moniruzzaman et al., 2017). An example of an alternative tracking method is an image-based identification and tracking method for fish that is designed based on biological water quality monitoring. To improve the fish tracking task, some techniques can also be combined with visual image enhancement algorithms. For instance, when image enhancement methods are used, the underwater images can be corrected for distortion and noise, and the fish tracking task can be performed more easily. In (Saberioon and Cisar, 2016), the authors studied the potential of underwater fish monitoring using visual and underwater sensing methods.
Another challenging research area is developing novel underwater fish tracking algorithms, using DL or other technologies, with low power consumption and real-time speed. For this, various hardware technologies and techniques used in other domains such as biomedical applications can be explored. Of course, any automated vision-based tracking system should be validated through real-world trials, which is a significant undertaking requiring many resources, in order to ensure accurate and real-time tracking of fish.

Summary and Conclusion
The goal of this article was to provide researchers and practitioners with a summary of the contemporary applications of DL in underwater visual monitoring of fish, as well as to make it easier to apply DL to tackle real challenges in fish-related marine science.
DL has progressed as a technology capable of providing unprecedented benefits to various aspects of marine research and fish habitat monitoring. We envision a future where DL, complemented by many other advances in monitoring hardware and underwater communication technologies (Jahanbakht et al., 2021), is widely used in marine habitat monitoring for (1) collecting data and extracting features to improve the quality of automatic monitoring tools; and (2) providing a reliable means of surveying fish habitats and understanding their dynamics. We expect that such a future will allow marine ecosystem researchers and practitioners to increase the efficiency of their monitoring efforts. To achieve this, we need concentrated and coordinated data collection, model development, and model deployment efforts. We also need transparent and reproducible research data and tools, which will help us reach our target sooner.