Deep Learning Approach For Objects Detection in Underwater Pipeline Images

ABSTRACT In this paper, we present automatic, deep-learning methods for pipeline detection in underwater environments. Seafloor pipelines are critical infrastructure for oil and gas transport. The inspection of those pipelines is required to verify their integrity and determine the need for maintenance. Underwater conditions present a harsh environment that is challenging for image recognition due to light refraction and absorption, poor visibility, scattering, and attenuation, often causing poor image quality. Modern machine-learning object detectors utilize Convolutional Neural Network (CNN), requiring a training dataset of sufficient quality. In the paper, six different deep-learning CNN detectors for underwater object detection were trained and tested: five are based on the You Only Look Once (YOLO) architectures (YOLOv4, YOLOv4-Tiny, CSP-YOLOv4, YOLOv4@Resnet, YOLOv4@DenseNet), and one on the Faster Region-based CNN (RCNN) architecture. The models’ performances were evaluated in terms of detection accuracy, mean average precision (mAP), and processing speed measured with the Frames Per Second (FPS) on a custom dataset containing underwater pipeline images. In the study, the YOLOv4 outperformed other models for underwater pipeline object detection resulting in an mAP of 94.21% with the ability to detect objects in real-time. Based on the literature review, this is one of the pioneering works in this field.


Introduction
Submarine pipelines are mainly used to carry oil, gas, and water. Harsh underwater conditions often change the appearance and state of installed pipes. To guarantee the regular operation of the subsea pipeline infrastructure, the detection of submarine pipeline components and leakage is essential. Since remotely operated vehicles (ROVs) can adapt to the harsh sea environment, they can replace human visual underwater inspections. Nowadays, computer vision is used to assist ROVs in completing various underwater tasks, such as underwater pipeline object detection and inspection, tracking, scene reconstruction, and others (Jacobi and Karimanzira 2013; Lu et al. 2017). The primary operational challenge for underwater vehicles is that the underwater environment often significantly degrades visual sensing, even when high-quality cameras are used.
The performance of a vision-based inspection is severely impacted by the quality of underwater imagery, which is often highly degraded by optical artifacts. Those artifacts include poor visibility, light refraction, absorption, scattering, and attenuation. Light scattering occurs when a light ray incident on the object is reflected and deflected multiple times by particles present in the water before reaching the camera; this introduces a homogeneous background noise to the image. Attenuation causes exponential decay of light between the image scene and the camera (Uplavikar, Wu, and Wang 2019). The subsea environment presents a unique challenge to perception that is not present on land: underwater image distributions are highly diverse. The images captured in deep oceanic water look different from those captured in muddy waters or shallow coastal waters. Color distribution is altered by the varying degrees of attenuation that light of different wavelengths encounters when traveling through water. As light propagates differently underwater than in air, a unique set of non-linear image distortions occurs, driven by various factors (such as attenuation and scattering). Underwater images tend to have a dominating green or blue hue since red wavelengths get absorbed in deep water (Schettini and Corchs 2010).
Object detection is a critical problem that is utilized in a wide range of industries for sorting, inspection, monitoring, and other purposes. The traditional vision-based detection method for underwater pipeline and cable detection is based on the edge information in images (Narimani, Nazem, and Loueipour 2009). Harsh underwater environments impact methods that use edge information by reducing object detection accuracy. In order to improve detection speed and accuracy, the generic method based on Convolutional Neural Network (CNN) occupies a dominant position in object detection research today. The CNN can be divided into two main categories (Zhao et al. 2019): Region Proposal-Based Framework (two-stage) and Regression/ Classification-Based Framework (one-stage).
The region proposal-based framework is a two-step process that first gives a coarse scan of the whole scenario and then focuses on regions of interest (RoIs). Girshick et al. (2014) proposed R-CNN, which adopts the CNN to produce RoIs in order to localize and segment objects and a pretrained linear Support-Vector Machine (SVM) classifier to categorize the produced regions of interest. The R-CNN training is expensive in memory and time, since features are extracted from different RoIs and stored on the disk. The Fast R-CNN achieved impressive improvements in both accuracy and efficiency, but not enough for real-time detection (Girshick 2015). The Faster R-CNN uses a Region Proposal Network that shares full-image convolutional features with the detection network. It has been used for real-time detection, face detection (Jiang and Learned-Miller 2017), pedestrian recognition (Zhao et al. 2016), seagrass detection (Moniruzzaman et al. 2019), and in other fields where real-time inference speed is not crucial.
A regression-based framework, also called the single-stage detector based on global regression, maps straight from the image pixels to bounding box coordinates and class probabilities, which can reduce computational cost. To overcome the poor real-time performance of target detection in R-CNN, Redmon et al. (2016) proposed a novel real-time object detector called YOLO. It makes use of the whole topmost feature map to classify and locate objects in one step. Based on YOLO, Redmon et al. proposed YOLOv2 (2017) and YOLOv3 (2018). YOLOv2 adopts a max-pooling layer and batch normalization, which improves detection accuracy and speed. YOLOv3 uses ResNet and the Faster R-CNN RPN, which improves spatial representation. Bochkovskiy, Wang, and Liao (2020) proposed YOLOv4 based on a combination of new features, which improve detection accuracy.
The rest of the paper is structured as follows. Section 2 provides an overview of the related work in the field of object detection. The methodology of our study and elaboration on trained deep-learning models are provided in section 3, followed by the description of the experiment setup given in section 4. A detailed assessment of the obtained results is provided in section 5. The paper conclusion and future work directions are given in section 6.

Related Work
Object detection is one of the tasks of computer vision systems, where its goal is to recognize objects and locate them in an image. Deep learning models are shown to be capable of recognizing and extracting information from images in difficult environments while simultaneously working with a vast amount of data. Underwater object detection is generally achieved by sonar, laser, and cameras. Compared to sonar and laser, the cameras are low-cost, and they can capture more types of visual information with high temporal and spatial resolution.
YOLO has been adopted by various researchers for the purpose of underwater object detection because of its high detection efficiency. As an example, Xu and Matzner (2018) utilized YOLOv3 for underwater fish detection for waterpower application. With high turbidity, rapid velocity, and murky water, the datasets utilized to train and test the model were challenging. The testing of the model yields a mean average precision (mAP) value of 54.92%. Another version of YOLO was used for fish detection in research by Sung, Yu, and Girdhar (2017). They trained the YOLOv1 detector on a custom dataset consisting of 929 fish images with annotation having no negative class images. Testing of the model achieved 65.3% mAP. Raza and Hong (2020) improved the YOLOv3 method for detecting fish in demand for monitoring the marine ecosystem. The improved version of YOLOv3 uses k-means clustering to increase the anchor boxes, transfer learning technique, improved loss function, and increased detection scale. The results show it outperforms the original YOLOv3 on the task of fish detection by 4% in terms of the mAP. Asyraf et al. (2021) investigated four versions of the YOLOv3 detector (they trained the original YOLOv3, Tiny-YOLOv3, YOLOv3-SPP, and Tiny-YOLOv3-PRN) on two open-source datasets to determine the efficiency of the model's ability to detect underwater life. Results showed significant evidence that YOLOv3 can detect underwater objects with a ranging mAP score from 74.88% to 97.56%. Application of the newer version of the YOLO detector, YOLOv4, was demonstrated in research performed by Rosli et al. (2021) for underwater animal detection. The dataset used to train and test the model was challenging due to the varying visibility. The training results show the mAP score of 97.86%.
Aside from fish detection, computer vision has been employed for a variety of other underwater applications. Chen et al. (2021) utilized YOLOv4 for underwater target recognition on a dataset named Underwater Robot Picking Contest (URPC). The URPC dataset contains 4757 images of four target categories: echinus, starfish, holothurian, and scallop. The detection results show 73.48% mAP. Training and testing of the YOLOv4 on the same URPC dataset were conducted by Zhang et al. (2021) achieving testing results of 81.01% mAP. In order to protect the underwater biodiversity, Tian et al. (2021) tackle the problem of aquatic environment pollution. They developed a computer-vision-based autonomous underwater garbage cleaning robot utilizing a modified YOLOv4 detection network. The detection with the trained model achieved results of 90.3% mAP. Lei et al. (2022) utilized the YOLOv4 method for detecting swimming and drowning behavior patterns. Their study resulted in the mAP value of 89.23% for drowning and 93.86% for swimming behavior, respectively.
Underwater object detection is also used in aquaculture for formulating scientific feeding strategies that can effectively reduce feed waste and water pollution, which is a win-win scenario in terms of economic and ecological benefits. The detection of uneaten feed pellets provides rich information for formulating scientific feeding strategies. Hu et al. (2021) utilized improved YOLOv4 to detect uneaten feed pellets in underwater images. The custom dataset consists of blurred and high-density images captured from a net cage located in the cold-water mass area of the Yellow Sea of China. The original YOLOv4 method was improved by changing the PANet network structure, adding the DenseNet shortcut connection, and reducing the number of network layers. The training and testing results of the improved YOLOv4 method achieved the mAP score of 92.61% on the test dataset.
Another use of underwater computer vision is pipeline detection, which is also the focus of this paper. Underwater pipeline detection was studied by Zhao, Wang, and Du (2020). The researchers used the YOLOv3 algorithm to locate the oil spill point of the underwater pipeline. Their training network has two types of detection targets: pipeline and leakage point. The trained model achieved a leakage point detection accuracy of 77.5% at a processing speed of 36 frames per second. Detection accuracy for the pipeline was 93.67%. Based on the literature review, we found just this one paper applying the deep CNN to underwater pipeline object detection (limited to distinguishing just two object classes); hence, to the best of our knowledge, our study may be considered one of the pioneering studies in the field. Next, we present the deep-learning models utilized for this purpose in our study.

Methodology
This section, presenting the methodology set up and elaboration on trained deep-learning models, is divided into two subsections. The first subsection explains the architectures of each version of the utilized YOLO object detector; the second describes Faster RCNN, an object detection method whose detection results are later compared to YOLO results.

Introduction to the YOLO Architectures
For our case study, we chose the YOLO method because it achieves near-state-of-the-art performance for object detection tasks in a variety of applications. The original YOLO paper (Redmon et al. 2016) describes an algorithm that is based on regression; instead of selecting the interesting part of an image, it predicts class probabilities and bounding boxes for the whole image in one run of the algorithm.
The network architecture of the original YOLO model is based on the CNN, as shown in Figure 1. It is the first implementation of the single-stage detector concept and uses reduction layers of dimension 1 × 1 followed by a convolutional layer of dimension 3 × 3, batch normalization, and the leaky ReLU activation function. The YOLOv1 network has 24 convolutional layers and two fully connected layers. Its detection pipeline is shown in Figure 2. The convolutional layers perform feature extraction, while the fully connected layers predict bounding box locations and class probabilities. YOLO splits the input image into cells, typically an S × S grid. Each cell is then responsible for predicting two bounding boxes with corresponding probabilities. YOLO determines the probability that the cell contains a particular class during one pass of the forward propagation. The bounding box around an object has a confidence value corresponding to the IoU score of the bounding box and the ground truth box. Versions YOLOv2 (Redmon and Farhadi 2017) and YOLOv3 (Redmon and Farhadi 2018) use max-pooling layers and a different way of generating bounding box proposals, with network depths of 19 and 53 layers. Additionally, YOLOv3 can perform multilabel classification, achieved by replacing the softmax with logistic regression to calculate the probability that an input belongs to a specific label.
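The IoU score that defines the box confidence above can be illustrated with a minimal sketch. The function name and the corner-based box layout (x_min, y_min, x_max, y_max) are assumptions made for this illustration and are not taken from the YOLO implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this layout is an assumption
    chosen for the sketch.
    """
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so that disjoint boxes give no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
print(iou(predicted, ground_truth))  # overlap 1, union 7
```

A perfectly aligned prediction gives an IoU of 1, and a prediction that misses the ground truth box entirely gives 0.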

YOLOv4
The YOLOv4 (Bochkovskiy, Wang, and Liao 2020) network is composed of four distinct sections: input, backbone, neck, and dense prediction. The structure is shown in Figure 3. The backbone of YOLOv4 is defined as the essential feature-extraction architecture. The backbone is Darknet53, which was used in the original YOLOv3, but it has been enhanced with Cross-Stage-Partial (CSP) connections. As a result, the backbone was named CSPDarknet53. This backbone can improve the CNN's learning potential by assisting in the development of a robust object detection model, especially in our case of underwater computer vision. CSPDarknet53 consists of 53 layers of 3 × 3 and 1 × 1 filters, 725 × 725 receptive fields, and 27.6 M parameters. This architecture has proven superior to its competitor architecture, CSPResNet50 (Bochkovskiy, Wang, and Liao 2020). The authors of YOLOv4 chose a modified version of the Path Aggregation Network (PANet) (Liu et al. 2018) as the architecture's neck. For the prediction step, each feature needs to be flattened first, which is accomplished with Spatial Pyramid Pooling (SPP). The SPP significantly increases the receptive field by bringing out contextual features. The head section consists of dense prediction, which plays an important role in producing the final prediction and locating bounding boxes. This same head section can be found in the YOLOv3 implementation, which detects the bounding box coordinates and confidence score for a specific class. In short, the YOLO head works in three steps. First, it divides the entire image into an N × N grid. Each grid cell has five parameters (x, y, h, w, and the confidence score c), where (x, y) is the offset value between the prediction box and the respective grid cell bound. Parameters (h, w) are the height and width of the prediction box relative to the entire image; the confidence score c is the probability of the class object. Second, the CNN extracts the features and predicts classes with class probability scores. Finally, non-maximum suppression is used to eliminate repetitive bounding boxes. Improvements that help enrich the YOLOv4 capability for underwater usage are the Mosaic and CutMix data augmentation processes (Yun et al. 2019).
The data augmentation method, named Mosaic, was introduced by the original YOLOv4 authors. It mixes four training images, resulting in mixing four different contexts. This allows the detection of objects outside their normal context. In addition, batch normalization calculates activation statistics from four different images on each layer, significantly reducing the need for a large mini-batch size. Regional dropout strategies were used as data augmentation steps to enhance the performance of the CNNs. These augmentations remove informative pixels in training images by overlaying them with a patch of either black pixels or random noise. It makes the model focus on non-discriminative parts of the object but causes information loss. The CutMix augmentation helps the model classify two objects from their partial views in the same images by taking two images and labeling pairs. Its strategy is to cut out and paste patches among training images where the ground truth labels are also mixed proportionally to the area of the patches. The Cutmix augmentation increases localization ability by making the model focus on less discriminative parts of the classified object.
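The cut-and-paste step of CutMix, with labels mixed in proportion to the patch area, can be sketched as follows. This is a minimal illustration in the image-classification form of the technique, assuming NumPy arrays for images and one-hot vectors for labels; the function name and the Beta(1, 1) sampling of the mixing ratio are assumptions of this sketch, not details of the YOLOv4 training code.

```python
import numpy as np


def cutmix(image_a, label_a, image_b, label_b, rng):
    """Paste a random patch of image_b into image_a and mix the labels.

    Images are H x W x C arrays of equal shape; labels are one-hot vectors.
    """
    h, w = image_a.shape[:2]

    # Sample the mixing ratio and derive the patch size from it.
    lam = rng.beta(1.0, 1.0)
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))

    # Random patch centre; the patch is clipped to the image borders.
    cy = int(rng.integers(0, h))
    cx = int(rng.integers(0, w))
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed = image_a.copy()
    mixed[y0:y1, x0:x1] = image_b[y0:y1, x0:x1]

    # Labels are mixed proportionally to the area actually pasted.
    patch_fraction = (y1 - y0) * (x1 - x0) / (h * w)
    mixed_label = (1.0 - patch_fraction) * label_a + patch_fraction * label_b
    return mixed, mixed_label


rng = np.random.default_rng(0)
img_a = np.zeros((8, 8, 3))
img_b = np.ones((8, 8, 3))
mixed_img, mixed_lbl = cutmix(img_a, np.array([1.0, 0.0]), img_b, np.array([0.0, 1.0]), rng)
print(mixed_img.shape, mixed_lbl)
```

For object detection, the bounding boxes of both images would also have to be clipped to the pasted region, which this sketch leaves out.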

YOLOv4 Tiny
YOLOv4 Tiny (Wang, Bochkovskiy, and Liao 2021) is a simplified and lightweight version of YOLOv4 that may be used to design applications for mobile and embedded devices. It works on the same idea as the original model, but with a different set of parameters that minimize the convolutional layer's depth. YOLOv4 Tiny has only two YOLO heads as opposed to three in YOLOv4, and it has been trained from 29 pretrained convolutional layers as opposed to YOLOv4, which has been trained from 137 pretrained convolutional layers. Supposing that the size of the input figure is 416 × 416 and the number of classes is 80, the YOLOv4 Tiny network structure is shown in Figure 4. Those changes helped the network achieve faster detections. The YOLOv4 Tiny method uses a feature pyramid network to extract feature maps with different scales and increase object detection speed without using the spatial pyramid pooling and path aggregation network used in the YOLOv4 method. At the same time, YOLOv4 Tiny uses two feature maps of different scales, 13 × 13 and 26 × 26, to predict the detection results. However, the accuracy of YOLOv4 Tiny is approximately two-thirds that of YOLOv4 when tested on the MS COCO dataset (Lin et al. 2014).
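The two feature map sizes quoted above follow from the input size once the downsampling factor of each head is fixed. The short sketch below assumes strides of 32 and 16 for the two heads; these strides are an assumption of this illustration (they reproduce the 13 × 13 and 26 × 26 maps for a 416 × 416 input) and are not stated in the text.

```python
def grid_sizes(input_size, strides):
    """Side length of the square feature map produced at each stride."""
    return [input_size // stride for stride in strides]


def total_cells(sizes):
    """Number of grid cells summed over all prediction scales."""
    return sum(size * size for size in sizes)


# Assumed strides for the two YOLOv4 Tiny heads.
tiny_strides = (32, 16)
sizes = grid_sizes(416, tiny_strides)
print(sizes)               # [13, 26]
print(total_cells(sizes))  # 845
```

The same helper shows why the input resolution should be a multiple of the largest stride: otherwise the integer division silently drops pixels at the border.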

CSP-YOLOv4
Wang, Bochkovskiy, and Liao (2021) proposed a network scaling approach that modifies not only the depth, width, and resolution but also the structure of the network. CSP-YOLOv4 was introduced to get a better speed/accuracy trade-off by converting the first CSP stage in the backbone into the original DarkNet residual layer. The PAN architecture is CSP-ized in order to reduce the amount of computation effectively.

Modified backbone of YOLOv4
The YOLOv4 has CSPDarknet53 as its backbone. The model backbone can be modified in order to have different detection results. Our paper uses a modified version of the YOLOv4 backbone to compare results obtained with the original backbone CSPDarknet53. Models ResNet50-YOLO and DenseNet201-YOLO were used to train and test the detection and recognition of underwater targets. ResNet50 is a deep convolutional neural network that is 50 layers deep. He et al. (2016) proposed an innovative neural network that won the top position at the ILSVRC competition. The strength of this model lies in skip connections that connect blocks of the network which enables the same performance for higher layers. The residual network (ResNet) improves the efficiency of deep neural networks by adding outputs from previous layers to the outputs of stacked layers, making it possible to train much deeper networks. In a DenseNet architecture, each layer is connected to every other layer, hence the name Densely Connected Convolutional Network. DenseNet requires fewer parameters, as there is no need to learn redundant feature maps. DenseNet concatenates the output feature maps of the layer with the incoming feature maps.
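The difference between the two connection patterns, addition in ResNet and concatenation in DenseNet, can be seen in the tensor shapes alone. The sketch below uses a hypothetical stand-in `layer` (a random linear map over the channel axis followed by ReLU) in place of real convolutions; all names and sizes are assumptions chosen only to illustrate the wiring.

```python
import numpy as np


def layer(x, out_channels, seed):
    """Hypothetical stand-in for a convolutional layer: random 1x1 map + ReLU."""
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((x.shape[-1], out_channels))
    return np.maximum(x @ weights, 0.0)


def residual_block(x):
    """ResNet style: the input is added to the output of the stacked layer."""
    return x + layer(x, x.shape[-1], seed=0)


def dense_block(x, growth, num_layers):
    """DenseNet style: each layer's output is concatenated to all earlier maps."""
    features = x
    for i in range(num_layers):
        new_maps = layer(features, growth, seed=i)
        features = np.concatenate([features, new_maps], axis=-1)
    return features


x = np.ones((4, 4, 8))                                   # H x W x C feature map
print(residual_block(x).shape)                           # (4, 4, 8)
print(dense_block(x, growth=4, num_layers=3).shape)      # (4, 4, 20)
```

The residual block keeps the channel count unchanged, while the dense block grows it by the growth rate with every layer, which is why each dense layer only has to learn a small number of new feature maps.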

Faster RCNN
We compared results in underwater object detection achieved by the YOLO-based models described above to Faster R-CNN. In the field of object detection, the Faster R-CNN is a classic two-stage method. Ren et al. improved the R-CNN method for object detection by adding region proposal networks (RPN) that share CNN layers with the same network for object detection. An overview of object detection with Faster R-CNN is shown in Figure 5. A Faster R-CNN object detection network consists of a feature proposal network for extracting the useful features of the target, an RPN whose task is to propose regions of interest, and a Fast R-CNN detector to classify the regions (Girshick 2015). The whole structure of the feature proposal network consists of 13 convolutional layers. Each convolutional layer is followed by a maximum pooling layer. In practical application, the more convolutional layers are used, the more image features are extracted, and the better the recognition effect of the network on unknown images. The features are used as input to the box regression and classification layer. The RPN outputs the proposed regions and their region scores. The core idea of Faster R-CNN is to avoid the two-stage detection technique. The RPN network is created with extra CNN layers, which perform regression simultaneously to produce the region proposal and the region score. The spatial window sliding technique is used to generate region proposals from the convolutional feature map. For every sliding window location, the RPN predicts more than one region proposal. Fast R-CNN is responsible for classifying the region of interest and fine-tuning the location border, judging whether the region of interest identified by the RPN contains the target and the target category. In this work, the Detectron2 Faster RCNN implementation was used (Detectron2 is a PyTorch-based modular object detection library).

Experiment Setup
This section, presenting the experimental results, is divided into three subsections. The first subsection discusses the preparation of the underwater dataset for YOLOv4 models. The second describes the training process and requirements, and the last subsection lists evaluation measures for trained object detectors.

Data Preparations
The data for the experiment were collected by a remotely operated vehicle (ROV) recording underwater pipelines from different camera angles. After recording the videos, two frames per video second were extracted to create the dataset for analysis. Namely, the dataset consists of 3021 images taken from three main camera shooting directions (above, left, and right angles), with every shooting angle represented by the same number of images. Extracted frames are then labeled using the labeling tool YOLO-Label (2019), as shown in Figure 6. The dataset distribution is shown in Table 1. The dataset was split in an 80:10:10 ratio: the training part consisted of 2415 different images, and the testing and validation parts each comprised 303 images taken from the three camera angles. Each part of the dataset contains the same number of images per camera angle. An annotation of each image is given in a text file. Images were labeled in YOLO format containing details on object class, bounding box coordinates, and the height and width of the bounding box (with the bottom-left point as the origin). Bounding box coordinates consist of center x and center y, which represent the coordinates of the center point of the bounding box. Center x is the distance of the center from the y-axis, and center y is the distance of the center from the x-axis. The coordinates are normalized to lie within the range [0, 1], which makes them easier to work with even after scaling or stretching images.
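Because the annotations are normalized, recovering a pixel-space box only requires the image size. The following sketch parses one annotation line, assuming the common YOLO text layout "class center_x center_y width height" separated by spaces; the function name and the example values are hypothetical and do not come from the dataset.

```python
def parse_yolo_label(line, image_width, image_height):
    """Convert one normalized YOLO annotation line to a pixel-space box.

    Assumed layout: "class_id center_x center_y width height", all but the
    class in the range [0, 1]. Returns (class_id, (x_min, y_min, x_max, y_max)).
    """
    class_id, cx, cy, w, h = line.split()
    cx, cy, w, h = float(cx), float(cy), float(w), float(h)

    # Scale the normalized centre and size back to pixels.
    x_min = (cx - w / 2.0) * image_width
    x_max = (cx + w / 2.0) * image_width
    y_min = (cy - h / 2.0) * image_height
    y_max = (cy + h / 2.0) * image_height
    return int(class_id), (x_min, y_min, x_max, y_max)


# Hypothetical annotation of a box centred in a 416 x 416 frame.
print(parse_yolo_label("0 0.5 0.5 0.25 0.5", 416, 416))
```

Since only ratios are stored, the same annotation remains valid if the frame is resized, which is the property the normalization is meant to provide.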

Training
A neural network framework, in particular the open-source framework Darknet (Redmon 2013-2016), is used to provide flexible APIs and configuration options for performance optimization since it is designed to facilitate and speed up the training of deep learning models (Shatnawi et al. 2018). Darknet is written in C and CUDA, allowing for the execution of the training and detection on the Graphical Processing Unit (GPU). The training was performed on a workstation with the following hardware: Intel(R) Xeon(R) CPU E5-2620 v4 @2.10 GHz, NVIDIA GeForce RTX 2080 Ti (11GB of graphic memory), and 128GB RAM.
The training setup has five types of detection targets: pipeline, leakage point, concrete weight, concrete mat, and pipe coupling. Thus, the configuration files were modified in order to define the parameters used during training. In particular, the number of outputs of the full connection layer of YOLOv4 is set to 5 because we have five classes, and the number of filters is obtained by (classes + 5) × 3. The number of filters for YOLOv4@ResNet50 is set to 50, obtained by (classes + 5) × 5. The YOLOv4 thus uses 30 filters and can detect up to three objects per grid cell, while YOLOv4@ResNet50 uses 50 filters with the ability to detect five objects per grid cell. YOLOv4-Tiny was trained with two subdivision settings: 64, as for YOLOv4, and 16, which sends a larger mini-batch to the GPU during processing. Here, we were able to use a smaller number of subdivisions for YOLOv4-Tiny since its network is shallower compared to other models. Transfer learning is utilized for all YOLO models. Models were pretrained on the public COCO dataset. All training parameters are given in Table 2. When the output confidence is below the threshold value of 0.5, the output is interpreted as containing no target class. The output of the corresponding full connection layer is interpreted as the target class when the maximum confidence is greater than the threshold value. The detection results are compared to the ground truth in order to determine whether the detection is a true positive. The detected bounding box's intersection over union (IoU) score should be at least 50%. Figure 7 shows an example of positive and negative object detection for the intersection over union (IoU) score in the case of pipeline detection. In the case of multiple detections of the same object, only one detection is counted as a true positive. Non-maximal suppression is used to choose the correct detection result.
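The filter arithmetic and the post-processing described above (confidence threshold of 0.5 followed by non-maximal suppression) can be sketched in a few lines. This is an illustrative, greedy per-class variant and not the Darknet implementation; the function names, the detection tuple layout, and the IoU threshold of 0.5 used for suppression are assumptions of this sketch.

```python
def yolo_head_filters(num_classes, boxes_per_cell):
    """Filters before a YOLO layer: (classes + 5) x boxes per grid cell."""
    return (num_classes + 5) * boxes_per_cell


def iou(a, b):
    """IoU of boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(detections, conf_threshold=0.5, iou_threshold=0.5):
    """Drop low-confidence boxes, then apply greedy per-class NMS.

    detections: list of (box, score, class_id) tuples.
    """
    candidates = [d for d in detections if d[1] >= conf_threshold]
    candidates.sort(key=lambda d: d[1], reverse=True)

    kept = []
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        # Suppress overlapping boxes of the same class only.
        candidates = [
            d for d in candidates
            if d[2] != best[2] or iou(best[0], d[0]) < iou_threshold
        ]
    return kept


print(yolo_head_filters(5, 3), yolo_head_filters(5, 5))  # 30 50

detections = [
    ((0, 0, 10, 10), 0.9, 0),
    ((1, 1, 11, 11), 0.8, 0),    # duplicate of the first box
    ((1, 1, 11, 11), 0.7, 1),    # same place, different class
    ((20, 20, 30, 30), 0.3, 0),  # below the confidence threshold
]
print([d[1] for d in postprocess(detections)])  # [0.9, 0.7]
```

Keeping only the highest-scoring box among overlapping detections of one class is what ensures that a single object contributes at most one true positive.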
The network's input should be an image, so the video is processed by extracting frames, which are then forwarded to the YOLO algorithm for object detection. The YOLO output provides the confidence score and the class ID of the object class in the bounding box. After the training, the detection model was tested on the test dataset, which was not included in the training and validation process.

Performance Evaluation
As evaluation metrics, detection accuracy, mean average precision (mAP), and frames per second (FPS) are used. The detection accuracy, as in Equation 1, is the ratio of the number of prediction boxes whose intersection ratio with the annotation boxes is greater than 0.5 to the total number of prediction boxes:

$$\mathrm{Accuracy} = \frac{\text{prediction boxes with IoU} > 0.5}{\text{total number of prediction boxes}} \qquad (1)$$

The mean average precision (mAP) is calculated by taking the mean of the Average Precision (AP) over all classes for the selected IoU threshold, as denoted in Equation 2. The AP of class i, written AP_i, represents the area under the precision-recall curve of that class, while k stands for the number of classes:

$$\mathrm{mAP} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{AP}_i \qquad (2)$$

The Frames Per Second (FPS) metric, as in Equation 3, expresses how many input frames the model can process in one second. The Number of Frames represents the number of processed images, while Total detection time is the time needed to process them, usually a time frame of 1 second:

$$\mathrm{FPS} = \frac{\text{Number of Frames}}{\text{Total detection time}} \qquad (3)$$
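The three metrics can be computed with a few helper functions, sketched below. The AP helper integrates the precision-recall curve with all-point interpolation, which is one common convention and an assumption of this sketch; the text does not state which interpolation was used. All function names and example numbers are hypothetical.

```python
def detection_accuracy(ious, threshold=0.5):
    """Share of prediction boxes whose IoU with the annotation exceeds the threshold."""
    return sum(1 for value in ious if value > threshold) / len(ious)


def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).

    recalls must be sorted in increasing order, with matching precisions.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values (k = number of classes)."""
    return sum(ap_per_class) / len(ap_per_class)


def frames_per_second(num_frames, total_detection_time_s):
    """Processed frames divided by the total detection time in seconds."""
    return num_frames / total_detection_time_s


print(detection_accuracy([0.9, 0.6, 0.4, 0.2]))     # 0.5
print(average_precision([0.5, 1.0], [1.0, 0.5]))    # 0.75
print(mean_average_precision([0.75, 0.25]))         # 0.5
print(frames_per_second(120, 4.0))                  # 30.0
```

In practice the recall and precision lists would come from sweeping the confidence threshold over the ranked detections of one class, with a detection counted as correct when its IoU with a ground truth box is at least 0.5.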
Next, we present numerical results for tested object detector models on our dataset.

Results and Discussions
Next, we present underwater object detection results for different implementations of YOLOv4 and Faster RCNN models. Namely, we compare the obtained results for YOLOv4, YOLOv4 Tiny, CSP-YOLOv4, YOLOv4@ResNet50, and YOLOv4@DenseNet201, as well as for the Faster RCNN object detection. The training and testing of the models were conducted on the same custom dataset. First, we discuss the training process and loss charts. The losses in each batch were calculated from the log file generated during the training phase, where Figure 8(a) shows the loss and mAP plotted against iteration for the YOLOv4 model. The loss decreases, and mAP increases with iterations. The network can be further trained until the average loss decreases below 0.2, and the final loss expectation is 0. The YOLOv4 model started to converge with a good performance at about the 7500th iteration, and its performance stagnated at about the 10000th iteration. It took 23.7 hours to complete the training. Figure 8(b) presents the loss and mAP graph for the YOLOv4-Tiny@16 model, which reaches an average loss below 0.8. Training of the YOLOv4-Tiny@16 model lasted 3.2 hours. The model started to converge after some 8000 iterations. Figure 8(c) shows the loss and mAP graph for the YOLOv4-Tiny@64 model, which reaches an average loss below 0.7 but poor mAP. The poor results for YOLOv4-Tiny@64 suggest that a larger mini-batch sent to the GPU is better for the 416 × 416 image size. With the subdivision set to 16, a better generalization of the problem is obtained. Training of the YOLOv4-Tiny@64 model lasted 6.4 hours, while it started to converge at about the 8500th iteration. Finally, Figure 8(d) shows the loss and mAP graph for YOLOv4@DenseNet201. A deep network with 50,000 training epochs resulted in a long training time of some 65.1 hours. The same number of training epochs was used for YOLOv4@ResNet50, with training lasting 58.8 hours. Those two deep models started to converge at about the 37500th iteration.
The object detection results obtained by the trained models on our custom underwater pipeline image dataset are as follows. In general, excellent results were achieved for the YOLO object detectors, as shown in Table 3. Table 3 presents the results achieved by each trained object detection method in terms of mean average precision (mAP) and accuracy for each target class. The mAP is an often-used metric that averages the per-class average precision at a chosen Intersection over Union (IoU) threshold. In this study, the threshold was set to 0.5, and the YOLOv4 delivered the best mAP result of 94.21% on the tested dataset. This mAP result indicates the superiority of the CSPDarknet53 backbone compared to the other competitive methods on our dataset. It should also be noted that the YOLOv4-Tiny@16 model showed high classification efficiency, achieving an mAP of 92.43% after a short training time of only some 3.2 hours. As expected, the obtained results show that the deeper model architectures deliver higher mAP.
The underwater object detection was tested on both images and videos. The trained neural networks detect targets in given images and display bounding boxes around the detected object. Different implementations of the YOLOv4 prove its ability to be trained and detect objects in underwater environments. This can be seen in Figure 9 showing detecting performances of all tested models. Deeper models, such as YOLOv4, CSP YOLOv4, and YOLOv4@DenseNet201, reveal the benefits of the architecture and prove that better generalization of the problem is ensured by a larger minibatch sent to the GPU processor.
The processing speed should also be highlighted in addition to classification and detection performances. Table 4 shows the processing speed of each tested model when classifying the input images in the test dataset. The obtained results confirmed that the networks, especially YOLOv4 and YOLOv4-Tiny, could detect target classes in real-time while the video is playing. Here, we should emphasize the difference in FPS performance for YOLOv4-Tiny compared to other implementations. Namely, the tiny model could achieve high FPS due to the model's small size (shallow network), resulting in faster inference speed. The detection rate of Faster RCNN was not taken into account since it is a two-stage object detector, and thus, it is not intended for real-time object detection. Comparing our obtained results in underwater pipeline object detection to other underwater object detection studies found in the literature (such as Chen et al. 2021; Hu et al. 2021; Rosli et al. 2021; Tian et al. 2021; Zhang et al. 2021), we can conclude that our obtained result of mAP 94.21% is quite remarkable. Please note that the underwater environment is rather challenging, and the above-referred papers detect different underwater objects (for example, fish and not underwater pipeline objects), often achieving a smaller mAP. A similar problem of detecting underwater pipelines and their components was addressed by Zhao, Wang, and Du (2020), distinguishing only between two target classes (while our research dealt with five underwater target classes). A direct comparison of detection results from different research is not possible due to a lack of public datasets. We could, however, compare results obtained from the model trained on the custom dataset with the YOLOv4 model trained only on the COCO dataset. The YOLOv4 trained only on the COCO dataset without fine-tuning on our dataset cannot detect any target class. We can conclude that the transfer learning done in this research was successful. As further evidence, results were compared with the YOLOv4 model trained from scratch. That model achieved a smaller mAP of 90.98%.
To conclude, our study investigates the performances of seven deep-learning architectures for pipeline component detection from images in challenging underwater environments, achieving a remarkable mAP of up to 94.21% on a custom dataset. As a suggestion for future work, a more challenging dataset should be obtained, containing different underwater conditions. Another applicable aim of the project could be the detection of pipeline failures. Also, the study can be extended to detection inside pipelines if a dataset is acquired.

Conclusion
In this paper, different implementations of the YOLOv4 were trained and tested on the same custom underwater image dataset to investigate the model's robustness in detecting pipeline objects in demanding underwater scenarios. Detection results of YOLOv4, YOLOv4-Tiny, CSP-YOLOv4, YOLOv4@ResNet, and YOLOv4@DenseNet were compared on the test dataset. Further, the achieved detection results were compared to the two-stage object detector Faster RCNN. The study was focused on detecting five object classes in different subsea environments from different camera angles. The YOLOv4 method outperformed other competitive methods in terms of mAP (achieving an mAP of 94.21%), with YOLOv4-Tiny achieving the highest FPS and a high mAP of 92.43%. In comparison to other similar methods, our method gives promising results in dealing with the problem of underwater pipeline detection. This research could be used in the future by autonomous underwater vehicles (AUVs) and remotely operated vehicles (ROVs) to inspect underwater pipelines. To further improve the performance metrics, various image enhancement methods can be used to improve the quality of the underwater imagery dataset.