An efficient detection model based on improved YOLOv5s for abnormal surface features of fish

Abstract: Detecting abnormal surface features is an important method for identifying abnormal fish. However, existing methods suffer from excessive subjectivity, limited accuracy and poor real-time performance. To address these challenges, a real-time and accurate detection model for abnormal surface features of in-water fish is proposed, based on an improved YOLOv5s. The specific enhancements are as follows: 1) We optimize the complete intersection over union (CIoU) loss and non-maximum suppression (NMS) with the normalized Gaussian Wasserstein distance (NWD) metric to improve the model's ability to detect tiny targets. 2) We design the DenseOne module to enhance the reusability of abnormal surface features and introduce MobileViTv2 to improve detection speed; both are integrated into the feature extraction network. 3) Following the ACmix principle, we fuse omni-dimensional dynamic convolution and the convolutional block attention module to address the challenge of extracting deep features within complex backgrounds. We carried out comparative experiments on a validation set of 160 images of in-water abnormal fish, achieving precision, recall, mAP50 and mAP50:95 of 99.5%, 99.1%, 99.1% and 73.9%, respectively, at 88 frames per second (FPS). These results surpass the baseline by 1.4%, 1.2%, 3.2%, 8.2% and 1 FPS, respectively. Moreover, the improved model outperforms other state-of-the-art models on comprehensive evaluation indexes.


Introduction
Aquaculture provides humans with a wealth of nutrients and has become an important part of the global agricultural economy. According to statistics, 88% of global annual fishery production is directly consumed by humans [1,2]. With population growth and economic development, the demand for fish continues to increase and the scale of aquaculture is gradually expanding, which brings huge challenges to fish farming [3]. During fish farming, abnormalities such as diseases and parasites may occur in fish, resulting in a decline in fish attributes, quality and welfare. Fish abnormality detection helps farmers adjust breeding strategies in a timely manner, prevent disease outbreaks and improve breeding efficiency [4,5]. In the past, manual visual inspection was the primary method for abnormal fish detection. However, this method suffers from low efficiency, a high missed detection rate and strong subjectivity. Detecting abnormal features on the surface of fish is an important basis for distinguishing abnormal fish. Therefore, rapid and accurate detection of abnormal surface features of fish has become a hot issue in aquaculture.
Computer vision technology is an effective, cost-efficient and non-invasive detection technique of substantial significance in driving the automation and intelligence of aquaculture [6]. It has great potential for abnormal fish detection in aquaculture [7]. With the development of artificial intelligence techniques such as computer vision and deep learning, especially in object detection, image classification and image segmentation, researchers have begun to apply neural networks to detect abnormal fish surface features and thereby distinguish abnormal fish.
Yasruddin et al. [8] used computer vision and deep convolutional neural networks to detect fish diseases and used Faster-RCNN to train on the surface features of diseased fish. The results showed that the recognition accuracy was satisfactory. Ashraf and Atia [9] used a transfer learning model to learn two different shrimp disease signatures and distinguish diseased shrimps from normal shrimps. Wang et al. [10] proposed a computer vision-based detection method for abnormal surface features of Penaeus vannamei. Rapid detection of Penaeus vannamei diseases is achieved through image enhancement methods such as denoising and feature enhancement, together with the LeNet model. The accuracy of the deep learning model used reaches approximately 96.1%. Chen et al. [11] proposed a two-stage ImageNet deep learning model with a convolutional neural network structure. The model was able to classify three abnormal appearances of grouper, achieving a high average accuracy of 98.94%. Gupta et al. [12] used a convolutional neural network based on VGG19 for fish wound detection, which can classify normal fish and abnormal fish, and the recognition accuracy reached 96.7%. Although deep learning techniques have shown promising results for abnormal fish detection in this body of research, there are still certain limitations:
1) The detection of abnormal fish in complex backgrounds suffers from missed and inaccurate detections.
2) The data sets of abnormal fish surface features were collected on the workbench and are not suitable for abnormal fish detection in underwater scenes.
3) Previous enhancements to convolutional neural networks have drawbacks such as insufficient feature extraction and complex network structure, resulting in an inability to maintain a balance between model complexity, detection speed and detection accuracy.
You only look once (YOLO) [13][14][15][16][17][18] is an advanced single-stage object detection algorithm that has been used for tasks such as small target detection in aquaculture, detection of key components of power transmission lines and detection of cigarette appearance defects. Due to its exceptional performance, it has found extensive applications in land-based recirculating aquaculture systems. Yu et al. [19] proposed a fish skin disease detection model based on YOLOv4, combined with depth-separable convolution and an optimized feature extraction network and activation function. The proposed model has high learning ability and is lightweight. Compared with the baseline, its mean average precision (mAP) and detection speed are increased by 12.39% and 19.31 FPS, respectively. Wang et al. [20] proposed a diseased fish detection model based on improved YOLOv5s, using the C3 structure instead of the cross-stage partial (CSP) structure and replacing all 3 × 3 convolutions in the backbone network with a convolution kernel group composed of parallel 3 × 3, 1 × 3 and 3 × 1 kernels. Together with the introduction of the convolutional block attention module (CBAM) attention mechanism, the model achieved an average accuracy of 99.38%. Prasetyo et al. [21] enhanced the YOLOv4-tiny model for the determination of fish freshness, species classification and biomass estimation. Their approach involved the integration of novel techniques such as the wing convolutional layer (WCL) and tiny spatial pyramid pooling (Tiny-SPP) to refine and balance diverse feature representations. They effectively optimized computational resources by employing bottleneck and expansion convolution (BEC) for feature fusion. To further improve the model's detection accuracy, they introduced an additional small object detector. Zhao et al. [22] proposed a high-precision lightweight model that uses an improved YOLOv4 to detect dead fish, significantly reducing the number of model parameters and the amount of computation. Li et al. [23] introduced a real-time detection approach for identifying abnormal fish behaviors, which combines images of mosaic pixel points with an enhanced version of YOLOv5s, referred to as BCS-YOLOv5. Their proposed method not only improves the extraction of positional information for abnormal fish, but also enables quantitative detection of similar abnormal behaviors. Based on image fusion, BCS-YOLOv5 achieved an inference accuracy of 96.69% on the dataset. The aforementioned studies have focused on enhancing YOLOv4 or YOLOv5 for specific detection tasks, resulting in notable improvements and commendable evaluation metrics.
The above-mentioned studies show that good accuracy has been achieved in detecting obvious abnormal fish surface features. However, there are certain limitations in extracting abnormal surface features for small targets and complex scenes. Therefore, this study presents an enhanced YOLOv5-based detection model designed for abnormal surface features. Several novel improvements are introduced in our method, distinguishing it from prior research, as outlined below:
• We introduce the normalized Gaussian Wasserstein distance (NWD) metric to optimize the loss function and non-maximum suppression (NMS) of YOLOv5s, which enhances the model's ability to detect small targets and speeds up its convergence.
• We introduce the lightweight MobileViTv2 module and design the DenseOne module. These enhancements improve detection accuracy while reducing the model size and number of parameters for resource-constrained edge devices.
• Following the ACmix principle, we obtain the ODC-CBAM module by fusing omni-dimensional dynamic convolution (ODConv) and CBAM, and further integrate it into the feature extraction network, which reduces the missed detection rate and false detection rate for abnormal surface features located in complex scenes.
The rest of this article is organized as follows. Section Ⅱ presents the proposed method, with a detailed description of the structure of YOLOv5s and the improvements made to it. Section Ⅲ describes experimental data collection, data set construction and some experimental details. Section Ⅳ analyzes the experimental results. Section Ⅴ summarizes the work of this article.

Detection methods for abnormal fish
After reviewing the literature and interviewing relevant breeders, we identified the following challenges in discerning abnormal fish by the detection of surface features:
1) Since abnormal surface features of fish occur only occasionally in aquaculture, data are scarce. Moreover, annotating the abnormal surface features requires a lot of time and resources. Therefore, it is difficult to construct a data set of abnormal fish surface features.
2) As shown in Figure 1, the abnormal surface features of longsnout catfish are clearly visible on the workbench. In-water environments differ from those on the table and often exhibit phenomena such as reflection, which can result in unclear fish images. Longsnout catfish in the water may also exhibit problems such as pixel blur, small size due to variations in distance, and serious overlap.
3) Although current convolutional neural networks exhibit good detection accuracy, they have shortcomings such as weak learning ability for abnormal surface features of small targets and slow detection speed.
Motivated by these challenges and by previous work on detecting abnormal fish surface features, this study proposes an object detection model based on improved YOLOv5s. First, the data set is input into a backbone built mainly from MobileViTv2 and ODC-CBAM, which extracts the abnormal surface features of abnormal longsnout catfish in complex backgrounds. Then, the designed DenseOne module is used to improve the reusability of the features and reduce the overall network complexity. Finally, the NWD metric is introduced to optimize the loss function and NMS of the baseline, enhancing the model's ability to detect small targets and accelerating its convergence. The flow chart of the method is shown in Figure 2.

Improved YOLOv5
YOLOv5 represents a notable enhancement over the YOLOv4 introduced in 2020. YOLOv5 has five different versions: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. The difference between the five models is the depth and width of the network [24]. To select a suitable baseline from the five versions, we trained them on the training set of 1280 images of in-water abnormal fish. The training results are shown in Table 1; the detection accuracies of YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x are nearly identical. To balance detection precision and model size for edge devices in actual scenarios, YOLOv5s is selected as the baseline to detect the abnormal surface features of fish. The network structure of YOLOv5s is illustrated in Figure 3. YOLOv5s comprises four main parts: Input, Backbone, Neck and Head. Regarding the Input, YOLOv5s retains the mosaic data augmentation technique and adaptive image scaling of YOLOv4. Furthermore, YOLOv5s integrates the adaptive anchor box calculation into the program, enabling the selection of optimal anchor box values for different data sets. Compared to YOLOv4, several enhancements were introduced to the YOLOv5s Backbone, including the Focus module, CSPDarkNet53 module and spatial pyramid pooling fast (SPPF) module. These additional modules expand the network's receptive field and further enhance its feature extraction capabilities. The CSPDarkNet53 module includes the CSPNet module, the Bottleneck module and the C3 module. The Neck of YOLOv5s adopts the feature pyramid network (FPN) and path aggregation network (PAN) structures. FPN employs a top-down pathway with lateral connections to extract multiscale features and construct the feature pyramid. PAN adds a bottom-up path and facilitates dense localization of high-level features. FPN and PAN aggregate parameters across different layers, raising the accuracy of object detection. At the Head, YOLOv5 generates multi-scale prediction results based on the outputs of the different Neck layers. The bounding box loss function is a complete intersection over union (CIoU) loss [25], which builds upon the generalized intersection over union (GIoU) loss by considering information about the position, scale and shape of the target boxes.

Improvements on YOLOv5s
YOLOv5s is one of the smaller network structures in the YOLOv5 family. It has obvious advantages in model size and detection speed compared with wider and deeper networks, but inevitably sacrifices detection accuracy. According to the definition in the COCO data set, small targets are those with a resolution of less than 32 (pixels) × 32 (pixels). The data set of this study was collected from the abnormal surface features of longsnout catfish in a recirculating aquaculture laboratory. During data collection, certain inherent challenges such as blurred images, complex backgrounds and small targets were identified. The performance of YOLOv5s is notably inadequate when applied to downstream tasks in this domain. As a result, several improvements were made to the vanilla YOLOv5s model, consisting of four main parts:
1) A new NWD metric [26] is introduced to replace the CIoU loss function and NMS of YOLOv5s. We model the bounding boxes as 2D Gaussian distributions and compute their similarity using the NWD between the two distributions. NWD can enhance the model's ability to detect small targets, speed up the convergence of the model and reduce the occurrence of false positives (FP).
2) The MobileViTv2 [27] module is used to replace the 6th layer of the Backbone. This substitution improves the feature representation ability and computing efficiency of the model.
3) The C3 module of the PAN part is replaced with the designed DenseOne module. The DenseOne module is derived from DenseNet [28] and incorporates three additional 1 × 1 convolution operations. By establishing shortcut connections to reuse features, it reduces the model size and number of parameters.
4) The ODC-CBAM module is embedded into both the Backbone and PAN. Following the principles of ACmix [29], ODConv and CBAM are fused. The module incorporates ODConv [30], which dynamically adapts the convolutional kernel to the shape and scale of the longsnout catfish. Simultaneously, the CBAM [31] helps the model emphasize the abnormal surface features of longsnout catfish while suppressing interference from complex backgrounds.
The improved YOLOv5s structure is shown in Figure 4.

NWD metric
In YOLOv5s, intersection over union (IoU) and its extensions are employed as evaluation metrics for the loss function and NMS. Nonetheless, IoU exhibits certain limitations and disadvantages in these applications, as outlined below:
1) The original model employs the CIoU loss function to compute the bounding box localization loss. This loss function extends the concept of IoU and incorporates three geometric properties: bounding box overlap area, centroid distance and aspect ratio. The CIoU loss is calculated as follows:

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{1}$$

$$\alpha = \frac{v}{(1 - IoU) + v} \tag{2}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{3}$$

where $IoU$ is the ratio of the intersection to the union of the predicted bounding box and the actual bounding box, $\rho(b, b^{gt})$ is the Euclidean distance between the centroids $b$ and $b^{gt}$ of the predicted and true boxes, and $c$ is the diagonal length of the smallest enclosing region that covers both boxes. $\alpha$ is a weight factor and $v$ measures the similarity of the width-to-height ratios of the two boxes.
Equations (1)-(3) indicate that the CIoU loss function enhances the detection accuracy of the model by incorporating a penalty term based on the aspect ratio when regressing the predicted bounding boxes. Nevertheless, the CIoU loss function may exhibit reduced sensitivity when dealing with targets that possess extremely large or small aspect ratios. Consequently, this can lead to poor detection performance for small targets, as the loss function may not provide the gradients needed to optimize the network in these scenarios.
2) NMS is a widely employed post-processing technique in object detection. Its primary purpose is to suppress redundant predicted bounding boxes, ensuring that each object is associated with only the most accurate predicted bounding box. However, the selection of the IoU threshold greatly impacts the final detection result. If the threshold is set too high, there is a risk of erroneously rejecting small targets.
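The CIoU loss described in item 1) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the YOLOv5s implementation: boxes are taken to be `(cx, cy, w, h)` tuples, and the helper names `iou` and `ciou_loss` as well as the small `eps` term are introduced here only for the example.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss: 1 - IoU + rho^2 / c^2 + alpha * v (eps is an assumed stabilizer)."""
    i = iou(pred, gt)
    # Squared distance between the two box centers.
    rho2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    x1 = min(pred[0] - pred[2] / 2, gt[0] - gt[2] / 2)
    y1 = min(pred[1] - pred[3] / 2, gt[1] - gt[3] / 2)
    x2 = max(pred[0] + pred[2] / 2, gt[0] + gt[2] / 2)
    y2 = max(pred[1] + pred[3] / 2, gt[1] + gt[3] / 2)
    c2 = (x2 - x1) ** 2 + (y2 - y1) ** 2
    # Aspect-ratio consistency term and its weight.
    v = 4 / math.pi ** 2 * (math.atan(gt[2] / gt[3]) - math.atan(pred[2] / pred[3])) ** 2
    alpha = v / ((1 - i) + v + eps)
    return 1 - i + rho2 / (c2 + eps) + alpha * v
```

With these helpers, a one-pixel horizontal shift gives `iou((0, 0, 4, 4), (1, 0, 4, 4))` of 0.6 for a 4 × 4 box but about 0.95 for a 40 × 40 box shifted by the same amount, which illustrates why IoU-based measures are harsh on small targets.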
Hence, this investigation introduces the NWD metric and incorporates it into YOLOv5s by replacing the CIoU loss function and NMS. To address the concentration of foreground and background pixels of the small longsnout catfish within the center and boundary of the bounding box, this study models the bounding box as a two-dimensional Gaussian distribution. The highest weight is assigned to the center pixel of the bounding box, gradually decreasing towards the border. The NWD metric is employed to assess the similarity between the modeled distribution and the actual pixel distribution, enabling a comprehensive evaluation of their likeness.
The object bounding box $R = (c_x, c_y, w, h)$ can be modeled as a two-dimensional Gaussian distribution $N(\mu, \Sigma)$, as given in Eq (4). The Wasserstein distance between two such distributions is shown in Eq (5). The NWD metric is shown in Eq (6).

$$\mu = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix} \tag{4}$$

$$W_2^2(N_a, N_b) = \left\| \mu_a - \mu_b \right\|_2^2 + \left\| \Sigma_a^{1/2} - \Sigma_b^{1/2} \right\|_F^2 = \left\| \left[ c_{x_a}, c_{y_a}, \frac{w_a}{2}, \frac{h_a}{2} \right]^T - \left[ c_{x_b}, c_{y_b}, \frac{w_b}{2}, \frac{h_b}{2} \right]^T \right\|_2^2 \tag{5}$$

$$NWD(N_a, N_b) = \exp\left( -\frac{\sqrt{W_2^2(N_a, N_b)}}{C} \right) \tag{6}$$

where $\mu$ and $\Sigma$ denote the mean vector and the covariance matrix of the Gaussian distribution, and $c_x$, $c_y$, $w$ and $h$ denote the center coordinates, width and height, respectively. $N_a$ and $N_b$ are the Gaussian distributions modeled from two bounding boxes, and $\|\cdot\|_F$ represents the Frobenius norm. $C$ is a constant closely related to the data set, and we set $C$ to 5 (the average absolute size of the targets in our data set). In the detection of small target longsnout catfish, the NWD metric offers several advantages over the IoU:
1) Modeling the target bounding box as a two-dimensional Gaussian distribution presents a more effective approach for capturing the continuous and variable position deviation within the bounding box. Furthermore, by assigning weights and normalizing the pixels in different regions of the bounding box, we achieve improved performance. In comparison to IoU and its extensions, the NWD method offers substantial advantages in terms of scale invariance and smoothness in handling position deviations.
2) By employing the NWD to measure the similarity between the predicted bounding box and the ground truth box, we can effectively address the sensitivity of CIoU to small position deviations of the target. This approach proves beneficial even when there is no overlap or containment relationship between the two bounding boxes.
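To make the two advantages concrete, the following sketch computes the NWD between two boxes and uses it as the similarity measure in a greedy NMS. It is an illustration under stated assumptions rather than the authors' code: boxes are `(cx, cy, w, h)` tuples, the function names `nwd` and `nwd_nms` are hypothetical, and the suppression threshold of 0.5 is an assumed value; only `c=5.0` comes from the text.

```python
import math

def nwd(a, b, c=5.0):
    """Normalized Gaussian Wasserstein distance between boxes (cx, cy, w, h).

    Each box is modeled as a 2D Gaussian with mean (cx, cy) and
    covariance diag(w^2/4, h^2/4); c is the data-set constant (5 in the text).
    """
    w2_squared = (
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2
        + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-math.sqrt(w2_squared) / c)

def nwd_nms(boxes, scores, threshold=0.5, c=5.0):
    """Greedy NMS that suppresses a box when its NWD to a kept box exceeds threshold.

    The threshold value is an assumption made for this sketch.
    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(nwd(boxes[i], boxes[j], c) <= threshold for j in keep):
            keep.append(i)
    return keep
```

For two non-overlapping 4 × 4 boxes whose centers are 10 pixels apart, IoU is exactly zero, whereas `nwd((0, 0, 4, 4), (10, 0, 4, 4))` evaluates to exp(-2), roughly 0.135, so the measure still carries a usable signal when the boxes do not overlap.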

MobileViTv2
The C3 module, integrated into the backbone of YOLOv5s, serves as a convolutional neural network module specifically designed for feature extraction. It employs multiple convolutional kernels with varying scales to extract a more comprehensive range of feature information, thereby enhancing the model's ability to accurately detect objects of different sizes. However, the incorporation of the C3 module expands both the depth (number of convolutional layers) and width (number of channels) of the backbone network. Consequently, the model experiences a substantial increase in computational requirements due to the presence of multiple convolutional layers, resulting in higher latency during deployment on resource-constrained edge devices.
Mehta and Rastegari [27] proposed a lightweight and mobile-friendly hybrid network called MobileViTv2. MobileViTv2 replaces the multi-head self-attention (MHA) mechanism used in MobileViTv1 [32] with a separable self-attention method. MobileViTv2 first applies a depth-wise separable convolution and a 1 × 1 convolutional layer to the input feature map to extract local information. The depth-wise separable convolution captures spatial information within the input feature maps while maintaining computational efficiency. A transformer module with separable self-attention then extracts global information. The separable self-attention method computes context scores of the input tokens with respect to a latent token L. These scores are then used to reweight the input tokens and generate global information. The transformer module with separable self-attention is implemented with element-wise operations, which reduces the computational complexity. Lastly, the module incorporates a 1 × 1 convolutional layer to integrate local information, perform dimensional transformation and merge features. Therefore, this study uses the MobileViTv2 module to replace the C3 module in the sixth layer of the original model. This substitution aims to increase the model's inference speed and reduce the computational complexity of feature extraction. Refer to Figure 5 for details.
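The separable self-attention step can be sketched as follows. This is a simplified, single-head illustration in plain Python under stated assumptions: tokens are rows of a list-of-lists `x` with k tokens of dimension d, and the weight names `w_i`, `w_k`, `w_v` and `w_o` are hypothetical placeholders for the learned projections. It is not the MobileViTv2 code.

```python
import math

def linear(x, w):
    """Apply a (d_in x d_out) weight matrix w to every token (row) of x."""
    return [[sum(t[i] * w[i][j] for i in range(len(t))) for j in range(len(w[0]))] for t in x]

def softmax(v):
    """Numerically stable softmax over a list of scalars."""
    m = max(v)
    e = [math.exp(s - m) for s in v]
    z = sum(e)
    return [s / z for s in e]

def separable_self_attention(x, w_i, w_k, w_v, w_o):
    """Separable self-attention over k tokens of dimension d.

    Cost grows linearly with the number of tokens because no k x k
    attention matrix is ever formed.
    """
    # 1) Context scores: project each token to a scalar, softmax over tokens.
    scores = softmax([sum(t[i] * w_i[i] for i in range(len(t))) for t in x])
    # 2) Context vector: score-weighted sum of the key projections.
    keys = linear(x, w_k)
    context = [sum(s * key[j] for s, key in zip(scores, keys)) for j in range(len(keys[0]))]
    # 3) Broadcast the context vector onto ReLU(values) element-wise, then project.
    values = [[max(0.0, u) for u in row] for row in linear(x, w_v)]
    mixed = [[u * c for u, c in zip(row, context)] for row in values]
    return linear(mixed, w_o)
```

The design point this sketch tries to show is that every token interacts with one shared context vector through element-wise products, instead of with every other token as in multi-head self-attention.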

DenseOne
Traditional convolutional networks with L layers have L connections (one connection between each layer and its subsequent layer). DenseNet contains shortcut connections between input layers and output layers, giving L(L+1)/2 direct connections. The output of a traditional convolutional network at the Lth layer is shown in Eq (7). The output of DenseNet at the Lth layer is shown in Eq (8).

$$x_L = H_L(x_{L-1}) \tag{7}$$

$$x_L = H_L([x_0, x_1, \ldots, x_{L-1}]) \tag{8}$$

where H(•) is a non-linear transformation function, $x_L$ is the output of the Lth layer of the network and $[x_0, x_1, \ldots, x_{L-1}]$ denotes the concatenation of the feature maps of all preceding layers. DenseNet enhances feature map propagation through short connections, establishing direct connections between each layer and all subsequent layers. This approach effectively improves the model's detection capability by encouraging the reuse of feature maps. Additionally, the input feature maps are processed by the transition layer, which includes a batch normalization layer, a 1 × 1 convolution layer and a 2 × 2 average pooling layer. Furthermore, a 1 × 1 convolution is applied in the bottleneck layer to reduce the dimensionality of the input feature map, resulting in a significant decrease in the number of parameters. Alongside improved parameter efficiency, DenseNet offers several advantages, including enhanced information flow and gradient propagation throughout the entire network. Moreover, it serves as a regularization technique to address overfitting problems in downstream tasks, particularly when dealing with data sets that have limited samples. Figure 6(a) illustrates the details of the DenseNet.
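The connection counts quoted above, and the way concatenation makes the input of each dense layer grow, can be checked with a short sketch. The function names and the example values (64 input channels, growth rate 32) are assumptions chosen for illustration; the growth rate is the number of feature maps each dense layer adds.

```python
def plain_connections(num_layers):
    """A plain network has one connection per layer."""
    return num_layers

def dense_connections(num_layers):
    """Layer l (1..L) receives the block input plus the outputs of all l-1 earlier layers."""
    return sum(l for l in range(1, num_layers + 1))

def dense_block_input_channels(in_channels, growth_rate, num_layers):
    """Input channel count seen by each layer when all earlier outputs are concatenated."""
    return [in_channels + l * growth_rate for l in range(num_layers)]
```

For a five-layer block, `dense_connections(5)` equals 5 × 6 / 2 = 15 against 5 for the plain network, and `dense_block_input_channels(64, 32, 4)` shows the inputs widening layer by layer, which is why 1 × 1 bottleneck convolutions are used to keep the parameter count down.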
We designed DenseOne based on CSPNet [33] and DenseNet. First, feature extraction is performed on the input feature maps through two 1 × 1 convolutions, which form two branches and increase the number of gradient paths in the network. The CSP strategy alleviates the disadvantages caused by explicitly copying feature maps for concatenation. It also balances the computation of the DenseNet part: since the dimensionally reduced feature map that enters the subsequent DenseNet module has half as many channels as the original feature map, the computational bottleneck is reduced by nearly half. Then, the features of the first branch are reused in DenseB so that each layer in the network shares the global information in the feature map. In addition, a concat operation merges the feature maps of the two branches along the channel dimension. Finally, a 1 × 1 convolution is used to recombine the concatenated features. Compared with DenseNet, DenseOne not only increases the number of gradient paths and reduces the amount of computation, but also shows stronger feature expression capabilities. The structure diagram of DenseOne is shown in Figure 6.

ODC-CBAM
Convolution and self-attention can enable the model to make more precise predictions, and they are usually considered as two peer approaches that are distinct from each other. ACmix is a mixed feature extractor that enjoys the benefits of both self-attention and convolution. For a detailed representation of the ACmix approach, refer to Figure 7.
ACmix can be divided into two stages. At stage Ⅰ, ACmix applies 1 × 1 convolution operations to the input feature map, obtaining a rich set of intermediate features containing 3 × C feature maps (H × W × C → H × W × 3C, where C stands for the number of channels and H × W stands for the feature size). At stage Ⅱ, the intermediate feature maps are passed through the convolution path and the self-attention path. For a convolution kernel of size k, the convolution path first uses a fully connected (FC) layer to transform the number of channels to equal the number of all shift directions. Subsequently, features are generated via shifting and aggregation. In the self-attention path, the features obtained in stage Ⅰ are equally divided into queries, keys and values, following the traditional multi-head self-attention module. Finally, the outputs of the convolution path and the self-attention path are added together, with their strengths controlled by two learnable weights.
ACmix introduces 1 × 1 convolutions for the weight mapping part of the convolution and self-attention mechanisms to achieve correlation between the two at the underlying level. ACmix integrates the respective characteristics of convolution operations and self-attention while reducing computational overhead. However, the convolution in ACmix ignores the spatial information and channel information of the convolution kernel, making it difficult for the model to accurately fit features. Moreover, self-attention causes the model to converge too slowly and cannot quickly locate the regions that contain useful features. Therefore, we employed the fusion of ODConv and CBAM to enhance the network's ability to learn deep features within complex factory farming environments.
Dynamic convolution achieves attention-based dynamic weighting of n parallel convolution kernels. The parallelism of the n kernels not only maintains the network's width and depth, but also enhances its representation capability. However, existing dynamic convolutions overlook the other three dimensions of the convolutional kernel space: the spatial size of each kernel, the number of input channels and the number of output channels. To enable the model to learn more complex features, ODConv uses a novel multi-dimensional attention mechanism and a parallel strategy to learn complementary attentions for convolution kernels across all four dimensions of the kernel space.
The CBAM is a lightweight attention module that is widely used in various convolutional neural networks. It comprises two components: the channel attention module (CAM) [34] and the spatial attention module (SAM) [35]. The CAM gathers valuable spatial information from feature maps through average pooling and maximum pooling operations. It produces average-pooled and max-pooled features, which are then fed into a multi-layer perceptron (MLP) consisting of multiple perceptron layers to generate key features. Ultimately, channel attention maps are obtained. The SAM, on the other hand, generates 2D spatial attention maps by applying average pooling and max pooling operations across channels. By leveraging the channel attention module and the spatial attention module, CBAM dynamically learns and adjusts the weight distribution over the channels and spatial dimensions of the feature map. This adaptive learning enhances the network's ability to express distinctive features effectively.
While ODConv and the CBAM attention module are often regarded as distinct paradigms, it has been demonstrated that the extensive calculations involved in both paradigms are essentially accomplished through the same operations. An ODConv with a kernel size of k × k can be divided into two stages: the first stage, where the ODConv is decomposed into k² individual 1 × 1 convolutions, and the second stage, which performs shifting and summation operations. Similarly, the CBAM attention module is also accomplished in two stages: the first stage projects the queries, keys and values in the attention module through different 1 × 1 convolution kernels, and the second stage calculates and aggregates the attention weights. Consequently, the first stages of both paradigms involve similar computational operations.
Therefore, it is possible to fuse the two paradigms: the ODConv module and the CBAM attention module. In this study, the fusion weight is 0.5 for both the ODConv and CBAM attention modules. The ODC-CBAM module is less computationally complex than the pure convolution or attention mechanism, and it can obtain better performance than both paradigms. Therefore, we embedded the ODC-CBAM module before the output of YOLOv5s to improve the model's ability to identify difficult samples and enhance the detection accuracy. In this study, the ODC-CBAM module was used to replace the conventional convolution module in the original model, and the results are shown in Table 2. The best evaluation metrics were achieved with Replacement4 (in Replacement4, ODC-CBAM replaces layer 7 of the backbone and the regular convolution in front of the Head).

Data acquisition
The experimental data were collected from December 15th to December 22nd, 2022, at the Genetic Breeding Center for Longsnout Catfish of the Ministry of Agriculture and Rural Affairs in Pudong New Area, Shanghai. The experimental fish are diseased longsnout catfish provided by the College of Fisheries and Life Science at Shanghai Ocean University. They comprise 20 individuals with a weight range of 50-100 g and a body length of 10-15 cm. The water temperature in the aquaculture environment is maintained at (25 ± 1) ℃, with a dissolved oxygen level of (5 ± 0.3) mg/L and a pH value of (7 ± 0.5). In the early stages of the disease, only a few white spots appear on the surface of the longsnout catfish. Over time, the longsnout catfish develop large areas of abnormal surface features. A fish image acquisition system was developed to obtain raw data more efficiently by capturing in-water images. As shown in Figure 8, this system consists of a BARLUS camera (S97K8F-8D6X10), a support bracket and a circular fish-rearing tank. The rearing tank has a radius of 76 cm, a height of 85.5 cm and a water depth of 40 cm. The BARLUS camera captures 24-bit RGB true-color images with a resolution of 3840 (pixels) × 2160 (pixels) and a frame rate of 60 FPS.

Dataset for improved YOLOv5s
The College of Fisheries and Life Science at Shanghai Ocean University manually screened the videos, resulting in 38 segments containing abnormal surface features of longsnout catfish. The average duration of each segment is approximately 9 seconds. The initial step of this study involves reading each frame from the 38 video segments collected at the Genetic Breeding Center for Longsnout Catfish of the Ministry of Agriculture and Rural Affairs in Pudong New Area, Shanghai. Then, one frame is extracted from every five frames, yielding 4104 images of abnormal longsnout catfish. We employ the structural similarity index (SSIM) algorithm [36] to further screen the original dataset obtained from the video streams. This screening aims to eliminate redundant and noisy images. The SSIM algorithm assesses the similarity of a pair of images based on three main image features: luminance, contrast and structure. It computes a comprehensive SSIM index by combining these three similarity measurements. One of the advantages of SSIM is its consideration of structural information in images, making it robust against lighting variations, noise and distortions. The computation formula for SSIM is shown in Eq (9).

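$$SSIM(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)} \tag{9}$$

where $SSIM(X, Y)$ is a metric used to measure the similarity between images X and Y. $\mu_X$ and $\mu_Y$ denote the mean values of images X and Y, respectively, and their standard deviations are represented by $\sigma_X$ and $\sigma_Y$. The covariance of X and Y is represented by $\sigma_{XY}$. $C_1$ and $C_2$ are constants that can be set arbitrarily.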
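The sampling and screening steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it computes a single-window SSIM over the whole image as in Eq (9), whereas practical SSIM implementations usually average over local windows. The function names, the similarity threshold of 0.95 and the values of `c1` and `c2` (the common convention for 8-bit images) are assumptions that the paper does not state.

```python
from statistics import fmean


def sample_every(frames, step=5):
    """Keep one frame out of every `step` consecutive frames."""
    return frames[::step]


def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM (Eq 9) of two grayscale images given as flat pixel lists.

    c1 and c2 are assumed stabilizing constants, not values from the paper.
    """
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and of equal size")
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean([(a - mu_x) ** 2 for a in x])
    var_y = fmean([(b - mu_y) ** 2 for b in y])
    cov_xy = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def drop_redundant(frames, threshold=0.95):
    """Keep a frame only if its SSIM to the last kept frame is below `threshold`.

    The threshold is a hypothetical value chosen for illustration.
    """
    kept = []
    for frame in frames:
        if not kept or ssim_global(kept[-1], frame) < threshold:
            kept.append(frame)
    return kept
```

As a consistency check on the reported numbers, 38 segments of about 9 seconds at 60 FPS give roughly 20520 frames, and keeping one frame in five yields 4104 images, which matches the size of the original dataset.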
After screening with the SSIM algorithm, the final dataset consists of 1600 in-water surface images of longsnout catfish taken from the video streams. These images were manually annotated using LabelImg. Our labeling principle for the in-water dataset is to annotate the entire fish. The annotation produced XML files containing coordinate information, image size and label name (disease), with 27200 annotated bounding boxes in total. These XML files were saved in the VOC2007 dataset format, creating the abnormal longsnout catfish in-water dataset used in this experiment.
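To make the annotation format concrete, the sketch below reads one VOC-style XML annotation and converts a box to the normalized center format that YOLOv5 training expects. The sample XML, file name and box coordinates are hypothetical, and the conversion step is an assumption about the pipeline, since the text only states that annotations are stored in VOC2007 format.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for illustration; values are not taken from the dataset.
SAMPLE_XML = """<annotation>
  <filename>frame_000001.jpg</filename>
  <size><width>3840</width><height>2160</height><depth>3</depth></size>
  <object>
    <name>disease</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>400</xmax><ymax>380</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return image size and a list of labeled boxes from a VOC XML string."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append({
            "label": obj.findtext("name"),
            "xmin": int(bb.findtext("xmin")),
            "ymin": int(bb.findtext("ymin")),
            "xmax": int(bb.findtext("xmax")),
            "ymax": int(bb.findtext("ymax")),
        })
    return {
        "filename": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "boxes": boxes,
    }


def to_yolo(box, width, height):
    """Convert corner coordinates to normalized (cx, cy, w, h)."""
    cx = (box["xmin"] + box["xmax"]) / 2 / width
    cy = (box["ymin"] + box["ymax"]) / 2 / height
    w = (box["xmax"] - box["xmin"]) / width
    h = (box["ymax"] - box["ymin"]) / height
    return cx, cy, w, h
```

With 27200 boxes over 1600 images, each image carries 17 annotated fish on average, which is consistent with the dense, overlapping scenes described for this dataset.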
To split the dataset for training, validation and testing, we followed an 8:1:1 ratio, with the constraint that the training set and the test set cannot come from the same video sequence. Thus, the dataset was divided into 1280 images for the training set and 160 images each for the validation and test sets. Some sample images from the dataset are shown in Figure 9. As observed from Figure 9, the abnormal longsnout catfish in-water dataset contains challenging samples that make abnormal fish difficult to detect. The specific challenges are as follows:
1) Pixel blur: The fast swimming speed of longsnout catfish makes it difficult to capture the target accurately.
2) Serious overlap: Longsnout catfish tend to gather in groups, so the targets in the collected images overlap heavily.
3) Similarity between features: In the early stages, the features of abnormal longsnout catfish are similar to those of healthy longsnout catfish, so the model needs a strong feature learning ability.
4) Small target: Most abnormal longsnout catfish are young fish that are far away from the image collection system, making them small or tiny targets with few pixels and insufficient features. Therefore, the model's ability to detect small targets needs improvement.
5) Light attenuation: When longsnout catfish are disturbed, the mucous cells on their surface secrete a large amount of acidic mucus, turning the aquaculture water into a gel-like substance that attenuates the light reflected to the camera. This makes surface features more difficult to identify.
6) External interference: Uncontrollable factors, such as the operation of motors in the circulating water aquaculture laboratory where the longsnout catfish are raised, cause slight water surface fluctuations in the aquaculture tanks. As a result, the collected images may contain reflections and inverted images.
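The video-level constraint on the split can be sketched as a group-wise assignment: whole videos, not individual frames, are assigned to a subset, so frames from one sequence never appear in two subsets. This is one possible greedy scheme under assumed inputs, not the authors' procedure; the function name and the per-video image counts in the usage note are hypothetical.

```python
def split_by_video(frame_counts, ratios=(0.8, 0.1, 0.1)):
    """Assign whole videos to train/val/test so that subset sizes approach `ratios`.

    frame_counts maps a video id to its number of retained images.
    Returns (splits, sizes): video ids per subset and image count per subset.
    """
    names = ("train", "val", "test")
    total = sum(frame_counts.values())
    targets = {n: r * total for n, r in zip(names, ratios)}
    splits = {n: [] for n in names}
    sizes = {n: 0 for n in names}
    # Largest videos first; each goes to the subset that is furthest below its target.
    ordered = sorted(frame_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for video_id, count in ordered:
        best = max(names, key=lambda n: targets[n] - sizes[n])
        splits[best].append(video_id)
        sizes[best] += count
    return splits, sizes
```

For example, with 40 hypothetical videos of 40 images each, the scheme returns subsets of 1280, 160 and 160 images. With uneven video lengths the sizes are only approximate, which is the price of keeping sequences intact.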

Experiment platform and training hyperparameters
In this paper, the improved model is trained and evaluated on a deep learning server with the configuration shown in Table 3.

Model evaluation
The aim of this study is to develop an abnormal longsnout catfish surface feature detection model that balances detection accuracy and speed. Mean average precision (mAP) is a commonly used evaluation metric for object detection models. It is calculated from the precision-recall (PR) curve, which is composed of precision and recall [37]. mAP50 and mAP50:95 comprehensively evaluate the model's ability to detect targets of different sizes and shapes and reflect the accuracy of the model more objectively. Accordingly, four indexes are used to evaluate the accuracy of the model: precision, recall, mAP50 and mAP50:95. FPS is the number of frames detected per second, and an FPS of 30 is sufficient for real-time detection. For practical applications in aquaculture, real-time detection of abnormal fish is very important, so FPS is used as the indicator of the model's processing speed. The formulae for precision (P), recall (R), average precision (AP) and mAP are:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 P(R)\,dR$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where $TP$ is the number of true positive samples; $FP$ is the number of false positive samples; $FN$ is the number of false negative samples; $AP$ is the average precision of a category; and $N$ is the number of categories. mAP50 and mAP50:95 differ in the IoU threshold used: mAP50 is the mean AP at an IoU threshold of 0.5, and mAP50:95 is the mean AP averaged over a range of IoU thresholds, typically from 0.5 to 0.95 in steps of 0.05.
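The metrics defined above can be computed as in the following sketch. It is illustrative only: the AP routine uses all-point interpolation of the PR curve in the Pascal VOC style, which is an assumption, since the text does not specify the interpolation scheme, and all function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Area under the PR curve; `recalls` must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_ap(ap_per_class):
    """mAP: mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP50:95.
IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]


def map_50_95(map_per_threshold):
    """Average the mAP values obtained at each of the ten IoU thresholds."""
    return sum(map_per_threshold) / len(map_per_threshold)
```

Because the dataset described here has a single label (disease), N = 1 and mAP coincides with the AP of that category.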

Training result analysis
The training results of this study are shown in Figure 10. Figure 10(a) shows that the precision of the model rose rapidly to 95.53% within the first 70 training epochs and then stabilized at approximately 99.5% as training progressed. Figure 10(b) shows that the recall of the model rose swiftly to 97.6% within the first 50 epochs and then stabilized at around 99.3%. Figure 10(c) shows that the loss value decreased significantly within the first 50 epochs and leveled off after 300 epochs. The training results in Figure 10 demonstrate that the enhanced model performs well in detecting abnormal surface features of longsnout catfish, and the flattening of the loss curve indicates that the model has converged. Timing the training with timestamps shows that one epoch takes about 80 seconds, so 500 epochs take about 11.1 hours, which allows the training time of the model to be planned according to the needs of the actual application. This time may vary depending on the complexity of the model, the size of the dataset and the computing resources used.

Ablation study
To assess the overall performance of the enhanced model, this study conducts ablation experiments on each component of the improvement and analyzes their respective effects. All ablation experiments use the same dataset and hyperparameters.
The training results are presented in Table 5. The table shows that the model's performance is improved by each of the NWD metric, the DenseOne module, the ODC-CBAM module and the MobileViTv2 module individually. Notably, the ODC-CBAM module has the most significant impact, surpassing the baseline mAP50 and mAP50:95 by 2.1 and 3.2%, respectively, with a minimal increase in parameters. This can be attributed to the ODC-CBAM module's integration of the ODConv and CBAM modules based on the ACmix principle, which leverages the strengths of both. As a result, it effectively learns useful features from complex backgrounds and suppresses irrelevant background features, thereby enhancing the model's capability to represent features, especially for challenging samples. The DenseOne module reduces the number of parameters while improving the detection accuracy of the model. Compared to the baseline (Model 1), our proposed model (Model 8) achieves a maximum increase of 2.9% in mAP50 on the abnormal longsnout catfish dataset, with a remarkable growth of 12.25% in mAP50:95. However, it is important to note that the FPS decreases significantly at the highest mAP50:95 value. In the detection of abnormal longsnout catfish, faster inference facilitates timely identification of affected specimens, reducing unnecessary economic losses and environmental pollution. MobileViTv2 uses depthwise separable convolution, separable self-attention and element-wise operations to improve the inference speed of the model. Notably, our method maintains nearly the same detection speed as the baseline while comprehensively enhancing the detection accuracy for abnormal longsnout catfish, albeit with a slight increase in parameters (parameters are usually counted as part of the memory access and do not affect the inference speed of the model), model size and GFLOPs.

Algorithm performance
To verify the detection performance of the enhanced model for abnormal longsnout catfish, the same predetermined test set was fed to the models before and after improvement. The visualization of the detection results in Figure 11 allows a direct comparison. As Figure 11(a) and (b) show, the baseline model is prone to missed detections. The improved model tackles this issue by substituting the NWD metric for the original model's NMS and CIoU loss functions. The NWD metric models each target bounding box as a two-dimensional Gaussian distribution and measures similarity with the normalized Wasserstein distance, which removes the sensitivity of IoU and its extensions to small deviations in object position. Figure 11(c) and (d) visually demonstrate the superiority of the improved model over YOLOv5s in scenes affected by uncontrollable factors, such as pixel blur resulting from the high-speed motion of the longsnout catfish and complex backgrounds involving lighting and reflection. Introducing ODC-CBAM in the backbone network and at the front end of the Head allows valuable abnormal surface features to be extracted from complex backgrounds and suppresses the interference of useless features, such as the background. Moreover, replacing the C3 module in PAN with DenseOne enhances feature reuse, facilitates feature propagation and augments feature expression capability. Consequently, our proposed method not only raises the detection confidence score for abnormal longsnout catfish, but also effectively mitigates false detections.
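The difference between IoU and NWD for tiny targets can be illustrated with a short sketch. It follows the commonly published form of the normalized Gaussian Wasserstein distance, in which a box (cx, cy, w, h) is modeled as a Gaussian with mean (cx, cy) and covariance diag(w²/4, h²/4). The constant `c` is dataset dependent; the value used here is an assumption, and the way the authors combine NWD with CIoU and NMS is not reproduced.

```python
import math


def iou(box_a, box_b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between two boxes modeled as 2D Gaussians.

    c is a normalizing constant tied to typical object size (assumed value).
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the two Gaussians.
    w2_squared = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
                  + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_squared) / c)


def nwd_loss(pred_box, target_box, c=12.8):
    """Regression loss derived from NWD: zero for identical boxes."""
    return 1.0 - nwd(pred_box, target_box, c)
```

For a 6 × 6 pixel box shifted horizontally by 6 pixels, IoU drops to zero and carries no gradient information, while NWD still returns a smooth similarity between 0 and 1. This is the property that makes the metric attractive for the small targets in this dataset.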

Model evaluation on validation set
To better evaluate the feasibility of the improved model in aquaculture, we performed a statistical analysis of the validation results. The confidence threshold (Conf_thres) is 0.001, the IoU threshold (IoU_thres) is 0.6 and the batch size is 32. The results are shown in Table 6. The precision, recall, mAP50 and mAP50:95 of the improved model increased by 1.4, 1.2, 3.2 and 8.2%, respectively. Our method has a much lower missed detection rate and false detection rate than the baseline. Our inference speed increased by 1 FPS compared with the baseline, while the model weight size increased by only 0.9M. The proposed model is suitable for detecting fish with abnormal surface features in real aquaculture and is easy to deploy to edge devices and web interfaces. As can be seen from Figure 12, the improved model can miss extremely small targets. This happens when the IoU between the predicted box and the ground-truth box is less than the threshold set by the model, so the target is counted as a missed detection. In addition, our model is prone to false positives when faced with severe occlusion between targets.

Grad-CAM visualization results
Gradient-weighted class activation mapping (Grad-CAM) [38] is a technique that enhances model interpretability by visualizing the input regions that are crucial for predictions, providing visual explanations without requiring architectural modifications or retraining. In this study, two random images of abnormal longsnout catfish from the test set were selected to generate heat maps with the Grad-CAM method for YOLOv5s before and after the enhancement. The results are presented in Figure 13. In the color spectrum, regions closer to blue indicate a lower proportion of features, while redder regions denote a higher proportion; a higher feature proportion implies greater importance in detecting abnormal longsnout catfish. Figure 13 reveals that the red areas in the original model mainly correspond to the healthy parts of the abnormal fish and to the background, whereas the improved model precisely identifies the abnormal surface features of the longsnout catfish.
Introducing ODC-CBAM in the backbone allows important features related to the abnormal surface of longsnout catfish to be extracted along several dimensions, including the convolutional kernel space and the input and output channels, while inhibiting the learning of background and healthy-part features. Additionally, integrating MobileViTv2 into the Backbone facilitates the extraction and integration of local and global information from the features of the abnormal longsnout catfish, resulting in more comprehensive feature extraction. In the PAN part, the C3 module is replaced with the DenseOne module, which enhances feature reuse through short connections and enables the model to learn a more complete information flow. As a result, the improved model detects abnormal longsnout catfish more accurately and is less affected by complex backgrounds and other uncontrollable factors.
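The core of the Grad-CAM heat maps discussed above is a gradient-weighted sum of feature maps. The sketch below shows only that weighting step on plain nested lists, assuming the activations of a chosen layer and the gradients of the detection score with respect to them have already been obtained from the network. The function name and input layout are hypothetical, and the choice of layer and score in the authors' experiments is not specified here.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from K feature maps and their gradients (each H x W nested lists).

    Returns an H x W map scaled to [0, 1]; higher values mark regions that
    contribute more to the score.
    """
    k = len(activations)
    h = len(activations[0])
    w = len(activations[0][0])
    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum over channels followed by ReLU.
    cam = [
        [max(0.0, sum(weights[c] * activations[c][i][j] for c in range(k)))
         for j in range(w)]
        for i in range(h)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image with a blue-to-red color map, which gives the kind of visualization shown in Figure 13.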

State-of-the-art models' performance comparison
To validate the performance of our proposed method on the abnormal longsnout catfish dataset, we conducted comparative experiments with state-of-the-art methods, including the mainstream one-stage object detection methods YOLOv4, YOLOv5, YOLOv7, YOLOv8 and SSD, and the mainstream two-stage object detection algorithm Faster R-CNN. These comparative experiments were carried out under identical hardware environments, datasets, hyperparameters and training epochs. The results of the comparison are presented in Table 7, while Figure 14 compares the PR curves of the seven algorithms.
The area enclosed by the PR curve reflects the algorithm's performance. From Figure 14 and Table 7, it is evident that our proposed method surpasses the other models in the downstream task of detecting abnormal surface features of longsnout catfish. First, Table 7 shows that the evaluation metrics of YOLOv4 and SSD are relatively poor. Compared with the one-stage object detection methods SSD and YOLOv4, Faster R-CNN achieves higher mAP50 and mAP50:95, but its detection speed is low. This reflects the trade-off between the two families: one-stage algorithms detect faster, while their detection accuracy is inferior to that of two-stage algorithms. YOLOv7 and YOLOv8 fail to outperform the baseline, and their large number of parameters, model size and GFLOPs make them unsuitable for deployment on resource-constrained IoT devices. Notably, the parameters, model size and FLOPs of our proposed model are 7.36M, 15.2MB and 16.4G, respectively, second only to the baseline. This indicates that our method does not require expensive hardware support. In the comparative experiments, our method outperforms the other algorithms in detection accuracy, with mAP50 reaching 99.3%. Finally, the improved algorithm achieves 88 FPS, meeting the real-time detection needs of factory farming and outperforming the baseline by 1 FPS, thus surpassing the other algorithms in detection speed. We also visually compare the proposed method with the six mainstream target detection models mentioned above, as shown in Figure 15. As can be seen from Figure 15, YOLOv8 and YOLOv7 have good detection capabilities for individuals with abnormal surface features. However, there is one false positive, two missed detections and one false detection in the detection of aggregated fish with abnormal surface features. This shows that the ability of these two models to distinguish background interference is weak, and they cannot accurately allocate prediction boxes to aggregated longsnout catfish with abnormal surface features. As a two-stage target detection algorithm, Faster R-CNN has higher detection accuracy than the one-stage algorithms SSD and YOLOv4, but all three recognize fish with abnormal surface features poorly in complex scenes. Our proposed method not only leads the other models in confidence scores and has extremely low missed and false detection rates, but also has higher precision and recall for dense small targets in low-light environments with background distractors.

Conclusions
This study proposes an improved YOLOv5s target detection model for the automatic monitoring of abnormal surface features of fish. Compared with previous manual detection methods, our model is not affected by factors such as emotion, fatigue or subjectivity. It avoids the impact of individual differences or observer bias on detection results, can process large amounts of data in a shorter time and provides more consistent detection results. In the model, we substitute the NWD metric for the CIoU loss function and NMS. This improvement aims to enhance the model's ability to detect small targets and to speed up its convergence. The MobileViTv2 module is added to the Backbone to improve the feature representation ability and computing efficiency of the model. In addition, we design the DenseOne module to improve detection accuracy while reducing the model size and parameters for edge devices. Building on these improvements, the ODC-CBAM modules are integrated into the Backbone and the PAN part of the network, which reduces the missed detection rate and false detection rate for abnormal surface features located in complex scenes. The improved model was evaluated on the validation set, with a precision of 99.5% and a recall of 99.3%, which are 1.4 and 1.2% higher than the baseline, respectively. While the inference speed is increased by 1 FPS, the model size is only increased by 0.9M, achieving a balance between detection speed, model size and detection accuracy.
The experimental results show that the proposed model can be used to detect abnormal surface features of fish quickly and effectively, but it also has certain limitations: 1) Single data type: This study covers only one species of fish with abnormal surface features, and no experiments were carried out on other fish with abnormal surface features. The effect of the model on other species still needs to be verified in future work. 2) Fish density is fixed: The paper did not verify the effect of the model in a high-density breeding scenario. 3) Schools of fish lack minimal targets: In scenarios where there is an extreme lack of abnormal surface feature information, model performance may be poor. Therefore, in future work we will conduct in-depth research on abnormal fish detection for different fish and aquaculture scenarios. Moreover, we will combine multi-modality and transfer learning and construct different abnormal fish surface feature datasets to solve various downstream tasks.
Although our model has certain shortcomings, it can quickly and accurately detect abnormal surface features of longsnout catfish in the water, and its model size and parameters are relatively lightweight. The model can be deployed to embedded devices and web platforms. Therefore, this study provides new ideas for the realization of smart aquaculture.

Declaration of ethical considerations of computer vision in aquaculture
Fish Welfare: We recognize that any technology that affects fish health and welfare needs to be treated with caution. Our research aims to help monitor abnormal fish to improve farming efficiency, and we also emphasize that the impact of these techniques on fish must be minimized.
Privacy Issues: We take the protection of fish privacy seriously when it comes to data collection and image processing. We try to minimize disruption and impact on individual fish and take steps to ensure data security and privacy protection.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Figure 1. Images of abnormal surface features of fish photographed at a table.

Figure 2. Abnormal fish detection method based on improved YOLOv5s.

Figure 6. Schematic diagram of the DenseNet and DenseOne modules.

Note: Replacement 1 is ODC-CBAM replacing the regular convolution of layers 1, 3, 5 and 7 of the Backbone and the front of the Head. Replacement 2 is ODC-CBAM replacing the regular convolution of layers 3, 5 and 7 of the Backbone and the front of the Head. Replacement 3 is ODC-CBAM replacing the regular convolution of layers 5 and 7 of the Backbone and the front of the Head. Replacement 4 is ODC-CBAM replacing layer 7 of the Backbone and the regular convolution in front of the Head.

Figure 8. Schematic diagram of the fish image acquisition system.

Figure 9. Some example images from the dataset of this study.

Figure 11. Model inference results before and after improvement. The first row is the inference result of YOLOv5s; the second row is the inference result of our method.

Figure 12. Examples of missed detections and false detections in the validation set.

Figure 13. Grad-CAM visualization results. The first column is two randomly selected test set images; the second column is the visualization result of the improved model; the third column is the visualization result of the baseline.

Figure 15. Comparison of different model inference results.

Table 1. Comparison of evaluation metrics for the five YOLOv5 models.

Table 2. Experimental results of the ODC-CBAM model at different locations.

Table 3. Deep learning server configuration.

Some of the training hyperparameters of the improved model are as follows. The input image size is 640 × 640, and the optimizer is SGD with decay and a momentum of 0.937. A warm-up strategy, learning rate decay, L2 regularization and data preprocessing techniques are used in the training process. The maximum learning rate is 0.01 and gradually decreases. The batch size is 16 in order to reduce the computing pressure, with a total of 500 epochs of training. The training hyperparameters are shown in the

Table 5. Comparison of model training evaluation metrics.

Table 6. Validation set experimental results.

Table 7. Experimental results of the comparison of the seven models.