One-Year-Old Precocious Chinese Mitten Crab Identification Algorithm Based on Task Alignment

Simple Summary: We developed R-TNET, a detection model tailored to identify one-year-old sexually precocious crabs. By addressing issues such as subtle classification features and overlapping masks, the model accurately identified and localized juvenile crabs on the validation data, reaching a mean average precision of 88.78%. This performance meets practical needs and can be applied to intelligent sorting, thus advancing non-destructive technology in aquaculture.

Abstract: The cultivation of the Chinese mitten crab (Eriocheir sinensis) is an important component of China's aquaculture industry and a field of concern worldwide. It depends on the selection of high-quality, disease-free juvenile crabs. However, an early maturity rate of more than 18.2% and a mortality rate of more than 60% make it difficult to select suitable juveniles for adult culture. The juveniles exhibit subtle distinguishing features, and the methods for differentiating between sexes vary significantly; without training from professional breeders, it is challenging for laypersons to identify and select the appropriate juveniles. Therefore, we propose a task-aligned detection algorithm for identifying one-year-old precocious Chinese mitten crabs, named R-TNET. Initially, the required images were obtained by capturing key frames and were then annotated and preprocessed by professionals to build a training dataset. Subsequently, the ResNeXt network was selected as the backbone feature extraction network, with Convolutional Block Attention Modules (CBAMs) and a Deformable Convolution Network (DCN) embedded in its residual blocks to enhance its capability to extract complex features. Adaptive spatial feature fusion (ASFF) was then integrated into the feature fusion network to preserve the detailed features of small targets such as one-year-old precocious Chinese mitten crab juveniles.
Finally, based on the detection head proposed by task-aligned one-stage object detection, the parameters of its anchor alignment metric were adjusted to detect, locate, and classify the crab juveniles. The experimental results showed that this method achieves a mean average precision (mAP) of 88.78% and an F1-score of 97.89%. This exceeded the best-performing mainstream object detection algorithm, YOLOv7, by 4.17% in mAP and 1.77% in F1-score. Ultimately, in practical application scenarios, the algorithm effectively identified one-year-old precocious Chinese mitten crabs, providing technical support for the automated selection of high-quality crab juveniles in the cultivation process, thereby promoting the rapid development of aquaculture and agricultural intelligence in China.


Introduction
Eriocheir sinensis (Crabidae, Decapoda, Crustacea) is an important aquaculture target in China, and its yearly production amounts to more than 0.7 million tons [1]. It also plays a crucial role in the aquaculture industry worldwide. Since the 1980s, when artificial seawater preparation technology [2] was adopted to overcome the breeding bottleneck of Eriocheir sinensis, China has established a solid industrial chain spanning juvenile crab breeding [3], crab breed cultivation, adult crab breeding, and processing for export, which has become an important way to increase agricultural income and enrich fishermen. However, in the process of crab seed breeding, inbreeding [4] due to artificial reproduction, the use of small-sized parents for seedling breeding, introductions between different aquatic systems, and disorderly breeding have caused serious degradation of germplasms and the mixing of river crab breeding populations. This has significantly impaired breeding performance and brought considerable harm to the river crab aquaculture industry [5].
The issue of precocious maturity in one-year-old Chinese mitten crabs has long hindered the development of the industry [6]. Typically, it refers to individuals weighing over 20 g with developed gonads, known as "old man crabs". In natural habitats such as the Yangtze River Basin, around 5% to 10% of crabs exhibit precocious maturity, while in controlled environments, rates can soar to between 18.2% and 98.0% due to factors such as temperature and nutrition [7]. Cultivating these precocious crabs leads to high adult mortality rates (60% to 90%) and diminished market value due to their small size [8]. Consequently, reducing their proportion is vital for productivity and profitability. However, despite various theories, no definitive method exists to eliminate precocious maturity [9]. Hence, rigorous screening during species selection is crucial.
Traditionally, identifying one-year-old precocious Chinese mitten crabs relies on manual observation, using visual cues such as differences in abdomen morphology and velvet characteristics [10]. However, this method is inefficient and prone to errors. With the advancement of computer vision technology, employing image processing and machine learning algorithms has emerged as a promising approach. These techniques enable the automatic analysis of large image datasets, accurately identifying subtle biometric differences without human intervention [11]. In summary, computer vision-based detection methods for precocious crabs offer practical feasibility and advantages [12].
Recent advancements in aquatic research have focused on the identification and localization of river crabs, leading to significant developments in technology. The use of the YOLOv3 algorithm on underwater images has provided reliable data for automated baiting boats [13], supporting precision feeding strategies through highly accurate predictions of crab biomass and location. Further refinements include the adoption of the lightweight EfficientNet, enabling the model to operate on feeding devices with minimal storage requirements while maintaining high detection accuracy and completeness [14]. This is particularly useful for resource-constrained environments such as automated baiting vessels.
Additionally, the focus has shifted towards the sex classification of crabs to optimize the benefits during the selling stage [15]. Techniques leveraging the ResNet50 network to analyze abdominal features have demonstrated high accuracy and reliability in distinguishing between male and female crabs [16]. However, the effectiveness of these algorithms can be compromised in production environments with complex backgrounds, where high-quality image data are crucial for accurate classification [17]. To address this, Chen et al. [18] proposed a lightweight crab detection and gender classification method using an improved YOLOv4, achieving remarkable accuracy. Gu et al. [19] constructed a dataset of Chinese mitten crab fries and achieved high accuracy in gender classification using an enhanced Faster R-CNN network, even with small samples and inconspicuous gender characteristics.
Identifying one-year-old precocious mitten crabs is challenging due to their juvenile stage, small size, and potential overlap. These crabs exhibit fewer distinct classification features compared to adult gender classification, with distinguishing features not confined to the abdominal region. To ensure accurate detection, deep learning-based target detection algorithms are utilized. These algorithms fall into two categories [20]: region-based methods such as the R-CNN series, which generate candidate regions for feature extraction and classification but are computationally intensive, and single-stage detection methods such as YOLO and TOOD [21], which directly predict the target category and location, offering greater efficiency. Given the challenges of uneven size, feature overlap, and complex backgrounds, the TOOD algorithm is selected as the benchmark model for detecting one-year-old precocious crabs in this study. The main contributions of this paper are as follows: (1) Introducing ResNeXt as the backbone network, embedding a CBAM in the residual block, and using deformable convolution in the last residual block to enhance feature extraction for sexually precocious Chinese mitten crabs. (2) Combining the ASFF module with the traditional FPN for integrated training to solve feature fusion mismatches and reduce the loss of small-target information.
(3) Conducting experiments on the anchor alignment metric formula to optimize the values of p and q, improving accuracy by adjusting the proportion of s.

Crab Dataset
Creating datasets is crucial for deep learning-based target detection [22]. Given the limited availability of datasets for identifying one-year-old precocious Chinese mitten crabs, this study independently constructs one using images primarily from artificially cultured crabs in Yangcheng Lake. These images were collected over ten months, reflecting the morphological characteristics of one-year-old precocious crabs (Figure 1). The early-maturing male crabs have more velvet in the pincer area, while the female crabs have fuller and rounder abdomens (Figure 1A). To ensure relevance to practical needs, the images were captured during the critical period for crab juvenile selection. We placed one-year-old precocious crabs and normal samples into transparent tanks and photographed them from bottom to top using a camera. While variations in background and backlighting affected abdominal feature visibility, occlusion and overlapping further complicated image clarity. The preprocessing and data enhancement techniques addressed these challenges in subsequent steps.

Data Preprocessing
This study acquired a total of 700 high-resolution images (4032 × 3024 pixels in JPG format) of one-year-old precocious Chinese mitten crabs, including 846 males and 928 females. These images were labeled using Labelme 3.16.5 software [23]. To enhance the visibility of features obscured by background effects, the exposure of the original images was increased. Adjusting the exposure to 90% significantly improved the visibility of the abdominal features, facilitating easier distinction of the crabs (Figure 2).
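The exposure adjustment described above can be approximated in code. The following NumPy sketch brightens an image with a gamma-style curve; the `amount` parameter and the mapping itself are illustrative assumptions, since the text does not specify which tool produced the 90% exposure setting.

```python
import numpy as np

def increase_exposure(image: np.ndarray, amount: float = 0.9) -> np.ndarray:
    """Brighten an image by lifting shadows and mid-tones.

    `amount` = 0.9 loosely mirrors the paper's "exposure to 90%" setting;
    the exact transform the authors applied is not specified, so this
    gamma-style mapping is only an illustrative approximation.
    """
    img = image.astype(np.float32) / 255.0
    # A gamma exponent below 1 brightens dark regions more than bright ones.
    brightened = np.power(img, 1.0 - amount * 0.5)
    return (np.clip(brightened, 0.0, 1.0) * 255.0).astype(np.uint8)

# A dark synthetic "abdomen" patch becomes noticeably brighter.
dark_patch = np.full((4, 4, 3), 40, dtype=np.uint8)
bright_patch = increase_exposure(dark_patch)
```

In practice such a step would be applied uniformly to the whole dataset before annotation so that labelers and the model see the same feature contrast.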

Data Augmentation
In computer vision, deep convolutional neural networks are highly regarded for object detection tasks. However, small sample datasets often lead to poor model generalization and an increased risk of overfitting [24]. Our labeled dataset revealed a significant lack of target frames for one-year-old precocious Chinese mitten crabs, impacting the experimental accuracy. To mitigate this, fine-tuning and optimizing the model parameters for different targets were crucial to minimize the global loss function and improve detection efficiency. Given the challenge of collecting extensive data, the dataset was enhanced through brightness adjustment, rotation, and scaling. This resulted in a dataset of 3928 images, including 16,820 early-maturing crabs, with a ratio of 9:1 between the training and validation sets (Table 1).
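The three augmentations named above (brightness adjustment, rotation, and scaling) can be sketched as follows. This is a minimal NumPy illustration, not the authors' pipeline; the parameter ranges are assumed for demonstration.

```python
import numpy as np

def augment(image: np.ndarray, seed: int = 0) -> list:
    """Generate augmented variants via brightness shift, rotation, and scaling.

    A minimal NumPy sketch of the three augmentations named in the text;
    the paper does not give its parameter ranges, so the values here are
    illustrative.
    """
    rng = np.random.default_rng(seed)
    variants = []
    # Brightness adjustment: multiply by a random factor in [0.7, 1.3].
    factor = rng.uniform(0.7, 1.3)
    variants.append(np.clip(image * factor, 0, 255).astype(image.dtype))
    # Rotation: 90-degree steps keep the array rectangular without resampling.
    variants.append(np.rot90(image, k=int(rng.integers(1, 4))))
    # Scaling: 2x nearest-neighbour upsampling by index repetition.
    variants.append(image.repeat(2, axis=0).repeat(2, axis=1))
    return variants

img = np.arange(16, dtype=np.uint8).reshape(4, 4)
augmented = augment(img)
```

Each source image would yield several variants in this way, which is how 700 captured images can expand to a few thousand training samples.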

Experimental Environment
In order to implement the methods in this paper, all experiments were carried out under the Ubuntu operating system with the PyTorch deep learning framework. The server was equipped with an Intel(R) Core i9-11900KF CPU, 64 GB of RAM, and an NVIDIA 3090 GPU (24 GB). The deep learning environment was composed of PyTorch 1.6.1, Python 3.7, CUDA 11.7, and cuDNN 8.5.0.



Overview of TOOD Framework
The Task-Aligned One-Stage Object Detection (TOOD) framework is an advanced single-stage object detection method designed to improve detection accuracy and speed [25]. It addresses the limitations of traditional methods by enhancing task alignment and optimization efficiency. TOOD's key innovation is its task-aligned optimization strategy, which introduces a task alignment mechanism throughout feature extraction, candidate frame generation, classification, and localization. This strategy effectively coordinates classification and localization, significantly improving detection performance.
The original TOOD algorithm uses ResNet50 [26] for feature extraction and employs a Feature Pyramid Network (FPN) to integrate semantic information at different scales. Unlike traditional models with separate branches for classification and localization, TOOD features a unique task-aligned head (T-Head) structure (Figure 3). This structure enhances the interaction between the classification and localization tasks, ensuring prediction consistency. The T-Head has two main parts: the task feature extraction module and the Task-Aligned Predictor (TAP). The T-Head first processes the semantic information from the FPN to compute task interaction features. These features are then fed into two parallel TAP structures, which measure the alignment between the classification and localization tasks.
A single-branch design for the task interaction features can lead to conflicts between the classification and localization tasks due to their differing objectives and their different focuses on feature hierarchy and receptive fields [27]. In the TAP module, a layer attention mechanism processes the task interaction features, dynamically optimizing them to improve classification and localization efficiency (Figure 4). The T-Head adjusts the classification probability and localization prediction based on alignment metrics calculated by task alignment learning (TAL) during backpropagation, enabling it to provide expressive multi-scale features for both tasks.
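The alignment metric that TAL computes can be illustrated numerically. TOOD's published anchor alignment metric is t = s^α · u^β, where s is the classification score and u is the IoU between the predicted and ground-truth boxes; the sketch below uses TOOD's default exponents (α = 1, β = 6), which correspond to the parameters this paper tunes to re-balance the influence of s.

```python
import numpy as np

def task_alignment_metric(cls_score, iou, alpha=1.0, beta=6.0):
    """Anchor alignment metric from TOOD: t = s**alpha * u**beta.

    s (cls_score) is the classification score and u (iou) is the IoU with
    the ground-truth box. alpha=1, beta=6 are TOOD's published defaults;
    the exponents are the knobs tuned in this paper.
    """
    return np.power(cls_score, alpha) * np.power(iou, beta)

# A moderately scored but well-localized anchor outranks a confident
# anchor with a loose box, which is exactly the alignment TAL rewards.
well_aligned = task_alignment_metric(0.6, 0.9)
misaligned = task_alignment_metric(0.9, 0.5)
```

Because β is much larger than α by default, localization quality dominates the ranking; shrinking β (or raising α) shifts more weight to the classification score s.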


Improvement of Feature Extraction Network
To enhance algorithm performance and address the issue of ResNet50 not effectively distinguishing between feature channels during feature fusion, this paper explores using an improved ResNeXt-50 as the feature extraction network [28]. Each residual unit of ResNeXt-50 incorporates a CBAM module [29], which adaptively recalibrates features based on the importance of each channel, thereby enhancing the utilization of the residual units [30]. Additionally, given the variability in the size and gender features of Chinese mitten crabs, a deformable convolutional network (DCN) replaces the standard convolution in the last residual block of ResNeXt-50 [31]. This adjustment allows the network to adapt to target variations through trainable offsets, significantly improving the robustness of target detection. The modified ResNeXt-50 network structure is depicted in Figure 5.

ResNeXt Network
ResNeXt, introduced in 2017, innovates on CNN architecture with grouped convolution, dividing the input into subsets for independent convolution operations before merging the outputs. This approach broadens the network structure without increasing its depth or complexity. The basic ResNeXt module divides the convolution kernel into 32 groups, reducing the input feature maps to 4 channels per group, processing them with a 3 × 3 convolution, and then increasing the channels to 256 and summing the results with residual connections (Figure 6). This method reduces model parameters while enhancing representation and generalization performance, making ResNeXt suitable for complex data tasks such as identifying one-year-old precocious Chinese mitten crabs.
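The parameter savings of grouped convolution can be checked with simple arithmetic. The sketch below compares the 32-group bottleneck described above with a plain bottleneck of the same width; the layer shapes follow the text (256 → 32 × 4 = 128 → 256), and biases are ignored for simplicity.

```python
def bottleneck_params(c_in: int, c_mid: int, c_out: int, groups: int = 1) -> int:
    """Parameter count of a 1x1 -> 3x3 (grouped) -> 1x1 bottleneck, no biases.

    Illustrative arithmetic only; shapes follow the ResNeXt block described
    in the text (256 input channels, 32 groups of 4 channels, 256 output).
    """
    p_reduce = c_in * c_mid                      # 1x1 reduction
    # Grouped 3x3: each group convolves only c_mid/groups channels.
    per_group = (c_mid // groups) ** 2 * 3 * 3
    p_spatial = per_group * groups
    p_expand = c_mid * c_out                     # 1x1 expansion back to c_out
    return p_reduce + p_spatial + p_expand

# ResNeXt-style block: 32 groups of 4 channels (128 = 32 * 4).
resnext = bottleneck_params(256, 128, 256, groups=32)
# A plain (ungrouped) bottleneck of the same width for comparison.
plain = bottleneck_params(256, 128, 256, groups=1)
```

The 3 × 3 stage shrinks from 128 × 128 × 9 to 32 × (4 × 4 × 9) weights, which is where the "wider but not heavier" property of ResNeXt comes from.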




Convolutional Block Attention Module
Attention mechanisms [32], which are crucial in deep learning for computer vision, are exemplified by the Convolutional Block Attention Module (CBAM). The CBAM enhances model performance by emphasizing key parts of the input feature map. It consists of two sub-modules: the Channel Attention Module (CAM) for identifying important channel features and the Spatial Attention Module (SAM) for highlighting relevant locations in the feature map. This sequential processing improves the model's focus on task-relevant features (Figure 7). The CAM uses global average and maximum pooling to extract channel-wide information and learn inter-channel dependencies, creating a channel attention map. The SAM employs these pooling values to highlight important spatial locations, producing spatial attention maps through small convolutional operations. The CBAM's main advantages are its versatility and minimal computational impact, allowing seamless integration into various CNN architectures such as ResNeXt. Embedding a CBAM in ResNeXt's residual blocks enhances feature extraction [33,34], enabling deeper learning of male and female precocious crab features while reducing disturbances such as lighting, background, and angle, thus improving the model's adaptability.
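The two CBAM sub-modules can be sketched with NumPy. For brevity this illustration replaces the CAM's shared MLP and the SAM's 7 × 7 convolution with a plain sigmoid over the pooled maps, so it shows only the pooling-and-gating structure, not a faithful implementation.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """CAM sketch: squeeze spatial dims with average and max pooling,
    then gate each channel. The shared MLP of the real module is omitted."""
    avg = x.mean(axis=(1, 2))                  # (C,)
    mx = x.max(axis=(1, 2))                    # (C,)
    weights = _sigmoid(avg + mx)               # per-channel gate in (0, 1)
    return x * weights[:, None, None]

def spatial_attention(x):
    """SAM sketch: pool across channels, then gate each spatial location.
    The real module convolves the stacked maps with a 7x7 kernel."""
    avg = x.mean(axis=0)                       # (H, W)
    mx = x.max(axis=0)                         # (H, W)
    weights = _sigmoid(avg + mx)               # per-pixel gate in (0, 1)
    return x * weights[None, :, :]

def cbam(x):
    """CBAM applies channel attention first, then spatial attention."""
    return spatial_attention(channel_attention(x))

feat = np.random.default_rng(0).normal(size=(8, 4, 4))  # (C, H, W)
out = cbam(feat)
```

Both gates multiply rather than replace the features, which is why the module can be dropped into existing residual blocks without changing tensor shapes.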


Deformable Convolutional Networks
In traditional ResNeXt models, convolution samples fixed positions on the feature map, limiting the model's ability to handle geometric variations caused by occlusion, distance changes, and the like, which affects localization accuracy. Deformable convolution kernels, by contrast, adjust dynamically to the current image content, allowing the sampling points to shift adaptively in response to changes in the size and position of targets such as Chinese mitten crabs against varying backgrounds. The method learns offsets for the sampling points, enabling the convolution kernel to focus on regions or objects of interest rather than on fixed locations. Despite adding only minimal parameter and computational overhead, deformable convolution significantly enhances target detection accuracy (Figure 8).

Ordinary 2D convolution consists of two key steps: first, the input feature map x is sampled over a regular grid R; then the sampled values are weighted by the convolutional layer parameters w. The grid R defines the size and dilation of the receptive field, as shown in Equation (1):

R = {(−1, −1), (−1, 0), . . ., (0, 1), (1, 1)} (1)

Here, a 3 × 3 convolution kernel with a dilation of 1 is defined. In the standard convolution process, each position p_0 of the output feature map y is computed as shown in Equation (2):

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n) (2)

where p_n enumerates all sampling positions in the regular grid R, and p_0 denotes each position in the input feature map around which semantic information is extracted. In deformable convolution, the regular grid R is augmented with the offsets {Δp_n | n = 1, . . ., N}, where N = |R|, and the formula becomes Equation (3):

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n) (3)

In the formula, Δp_n denotes the offset of the sampling point. Deformable convolution thus introduces a sampling-point offset Δp_n on top of the traditional convolution operation, adjusting the sampling positions of the key elements: sampling now occurs at the irregular, offset positions p_n + Δp_n. Since the offset Δp_n is usually fractional, methods such as bilinear interpolation are needed to obtain the corresponding values. Each offset point is additionally modulated by a weight, as shown in Equation (4):

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n) · Δm_n (4)

In this equation, Δm_n denotes the weight of each offset point, and its value range is [0, 1]. The input feature map contains N channels, and the channels corresponding to the sampling-point offsets number 2N (Figure 9). Because of the introduction of the weight coefficients, an additional N channels are added for the weighting network, so the final prediction result totals 3N channels.

In this experiment, the feature pyramid structure utilizes a bottom-up fusion approach [35], closely linking feature extraction to the output of the underlying residual blocks to improve adaptability to various receptive-field sizes and thus enhance the model's localization accuracy. However, integrating numerous deformable convolutions increases the model's complexity and may slow down detection. To balance speed and accuracy, this study replaces the standard 3 × 3 convolution in the last residual block of ResNeXt with deformable convolution, optimizing the trade-off between parameter count and effective feature extraction.
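As an illustration of Equations (2)–(4), the following minimal NumPy sketch (our own example, not the authors' implementation) evaluates one output position of a modulated deformable convolution: bilinear interpolation handles the fractional sampling positions p_n + Δp_n, and zero offsets with unit weights Δm_n reduce it to standard convolution.

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly interpolate feature map x at fractional position (py, px)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = y0 + 1, x0 + 1
    dy, dx = py - y0, px - x0

    def v(yy, xx):  # zero padding outside the map
        return x[yy, xx] if 0 <= yy < h and 0 <= xx < w else 0.0

    return ((1 - dy) * (1 - dx) * v(y0, x0) + (1 - dy) * dx * v(y0, x1)
            + dy * (1 - dx) * v(y1, x0) + dy * dx * v(y1, x1))

def deform_conv_at(x, w, p0, offsets, masks):
    """Equation (4): y(p0) = sum_n w(p_n) * x(p0 + p_n + dp_n) * dm_n."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # grid R, 3x3, dilation 1
    y = 0.0
    for n, (dy, dx) in enumerate(grid):
        py = p0[0] + dy + offsets[n][0]  # p0 + p_n + dp_n (row)
        px = p0[1] + dx + offsets[n][1]  # p0 + p_n + dp_n (col)
        y += w[dy + 1, dx + 1] * bilinear(x, py, px) * masks[n]
    return y
```

In practice an optimized operator such as torchvision's `DeformConv2d` would be used; this sketch only makes the sampling arithmetic of the equations explicit.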

2.5. Improvement of the Feature Fusion Module
2.5.1. ASFPN
Feature Pyramid Networks (FPNs) are commonly used for multi-scale fusion in deep convolutional neural networks to address multi-scale challenges in target detection [36]. Traditional FPNs up-sample high-level feature maps and fuse them with low-level maps, introducing high-level semantic information but often losing detail, particularly for small targets, owing to the low resolution and the limitations of the up-sampling process. Additionally, the typical methods of fusing feature maps through addition or concatenation may not adequately preserve or utilize the detail information of small targets. To enhance the detection of small targets such as the Chinese mitten crab in this experiment, we replaced the PA-Net in the feature fusion layer with an ASFF-based network [37]. This adjustment focused more clearly on key information and improved the detection performance on the samples (Figure 10).
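For contrast with ASFF, the conventional FPN top-down step described above can be sketched as follows (a simplified illustration with nearest-neighbour up-sampling; real FPNs also apply 1 × 1 and 3 × 3 convolutions around the merge):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(high, low):
    """One top-down FPN step: up-sample the coarse, semantic map and add the
    fine map. Detail in `low` can be diluted by the low-resolution `high`
    map, which is the small-target weakness ASFF is meant to address."""
    return upsample2x(high) + low
```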

Principle of ASFF Module
Adaptive spatial feature fusion (ASFF) targets specific layers, which vary in resolution and channel number, within the feature fusion pyramid. ASFF first standardizes the resolution and channel number across the different layers and then integrates these features to optimize the fusion strategy. This process adaptively fuses the layers, effectively filtering out conflicting information while retaining and enhancing discriminative details. For example, in ASFF-3, the fused feature map is produced by applying the learned weights α^3, β^3, γ^3, and δ^3 to multiply and sum the features from Level 1, Level 2, Level 3, and Level 4. The computational formula is outlined in Equation (5), and the general flow of ASFF is illustrated in the dashed box of Figure 10:

y^l_ij = α^l_ij · x^{1→l}_ij + β^l_ij · x^{2→l}_ij + γ^l_ij · x^{3→l}_ij + δ^l_ij · x^{4→l}_ij (5)
In this equation, y^l_ij denotes the (i, j)-th vector of the output feature map y^l across channels; α^l_ij, β^l_ij, γ^l_ij, and δ^l_ij denote the learnable weights of the feature maps of the four different levels at level l; and x^{1→l}_ij, x^{2→l}_ij, x^{3→l}_ij, and x^{4→l}_ij denote the feature vectors at position (i, j) after the features from Levels 1–4 are resized to level l.
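A minimal NumPy sketch of the ASFF weighting (our illustration with hypothetical shapes, not the released implementation): per-position weight maps are Softmax-normalized as in Equation (6) and used to blend the four resized feature maps as in Equation (5).

```python
import numpy as np

def asff_fuse(feats, lambdas):
    """Fuse four same-sized feature maps with per-position softmax weights.

    feats:   list of 4 arrays (C, H, W), already resized to level l (Eq. (5)).
    lambdas: list of 4 arrays (H, W), the 1x1-conv outputs of Eq. (6).
    """
    lam = np.stack(lambdas)                      # (4, H, W)
    lam = lam - lam.max(axis=0, keepdims=True)   # numerical stability
    w = np.exp(lam) / np.exp(lam).sum(axis=0, keepdims=True)  # Eq. (6)
    # Eq. (5): weighted sum of the four levels, broadcast over channels
    fused = sum(w[k][None] * feats[k] for k in range(4))
    return fused, w
```

By construction the four weight maps sum to 1 at every spatial position, so the fusion is a convex combination of the level features.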
In computing the fused ASFF-3, we used a summation method, so the Level 1 to Level 4 layers had to have the same feature size and number of channels during summation. For this reason, we adjusted the number of channels by up-sampling or down-sampling the features at the different levels. The feature maps of Level 1 to Level 4 were passed through 1 × 1 convolutions to obtain the weighting parameters α^3, β^3, γ^3, and δ^3. These weighting parameters were concatenated and normalized with the Softmax function. The specific calculation formula is shown in Equation (6):

α^l_ij = e^{λ^l_{α,ij}} / (e^{λ^l_{α,ij}} + e^{λ^l_{β,ij}} + e^{λ^l_{γ,ij}} + e^{λ^l_{δ,ij}}) (6)

In this equation, λ^l_{α,ij}, λ^l_{β,ij}, λ^l_{γ,ij}, and λ^l_{δ,ij} are the results obtained from the four feature maps after feature scaling using 1 × 1 convolution. The resulting weight parameters satisfy α^l_ij + β^l_ij + γ^l_ij + δ^l_ij = 1, with each weight in [0, 1].

2.6. Improvement of Head Detection
2.6.1. T-Head Feature Alignment

The TOOD algorithm introduces the Task-Aligned Head (T-Head) to enhance one-stage object detection. Unlike Fully Convolutional One-Stage Object Detection (FCOS), which lacks a distinct localization branch and aligns tasks by weight sharing, the T-Head optimizes the interaction between classification and localization. It computes task-interaction features, predicts alignment, and adjusts spatial predictions based on learning signals. The T-Head uses multiple convolutional layers to process the FPN outputs, providing multi-level features and effective receptive fields for both tasks across scales. The specific formulas are detailed in Equation (7):

X^inter_k = δ(conv_k(X^inter_{k−1})), k > 1, with X^inter_1 = δ(conv_1(X^fpn)) (7)

The equation defines conv_k as the k-th convolutional layer, with ReLU activation δ. The Task Alignment Learning (TAL) method establishes consistency metrics for the classification and localization tasks, including anchor alignment and the loss function, ensuring convergence. During backpropagation, the T-Head adjusts the classification probabilities and localization predictions based on TAL's learning signals, optimizing their distribution. A feature extractor learns task-interactive features from the multiple convolutional layers, generating the task-interaction features.
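The stacked task-interaction layers of Equation (7) can be sketched as follows (an illustration only; for simplicity each conv_k is modeled as a 1 × 1 convolution, i.e., a channel-mixing matrix, rather than a full spatial convolution):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def t_head_interactive(x_fpn, weights):
    """Equation (7): X_k = relu(conv_k(X_{k-1})), X_1 = relu(conv_1(X_fpn)).

    x_fpn:   FPN output feature map, shape (C, H, W).
    weights: list of channel-mixing matrices, each (C_out, C_in),
             standing in for the 1x1 'conv_k' layers.
    Returns the list of multi-level task-interaction features.
    """
    feats = []
    x = x_fpn
    for w in weights:
        x = relu(np.einsum('oc,chw->ohw', w, x))  # conv_k then ReLU
        feats.append(x)                           # keep every level's output
    return feats
```

Both the classification and the localization branches then read from this shared stack of features, which is what lets TAL align the two tasks.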

Anchor Alignment Metric
After T-Head and TAL processing, prediction results and feature alignment are explicitly output. To evaluate these predictions, an anchor alignment metric measures how well anchors align with the task, influencing sample allocation and dynamically refining the anchor predictions. This metric prioritizes high-quality anchors based on classification score and localization accuracy while suppressing misaligned anchors. Equation (8) provides the mathematical expression for this metric:

t = s^p × u^q (8)

In this equation, s represents the classification score, u represents the IoU value, and p and q are weighting coefficients controlling the influence of the classification and localization tasks on task alignment. The parameter t in Equation (8) jointly optimizes both tasks for classification and detection alignment. For each instance, t is calculated for the generated anchors, keeping the anchor with the maximum value as a positive sample and the remaining anchors as negative samples. TAL explicitly measures task alignment at the anchor level, selecting the anchors that perform best in both tasks to achieve alignment.
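A small worked example of Equation (8) (our sketch; the default p and q follow the values adopted later in this paper): computing t per anchor and keeping the maximum as the positive sample shows how a high-IoU anchor can outrank one with a higher classification score.

```python
def alignment_metric(s, u, p=0.4, q=6.0):
    """Equation (8): t = s**p * u**q, where s is the classification score
    and u the IoU between the anchor's predicted box and the ground truth."""
    return (s ** p) * (u ** q)

def assign_samples(scores, ious, p=0.4, q=6.0):
    """TAL assignment: the anchor with the largest t becomes the positive
    sample for the instance; all other anchors are negatives."""
    t = [alignment_metric(s, u, p, q) for s, u in zip(scores, ious)]
    pos = max(range(len(t)), key=t.__getitem__)
    return pos, t
```

With q much larger than p, the metric is dominated by localization quality, which matches the IoU-prioritizing adjustment described below.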
The original algorithm enhances classification and localization interaction through the T-Head and explicitly aligns tasks via TAL, improving alignment issues.In this experiment, focusing on detecting one-year-old sexually precocious Chinese mitten crabs, we adjusted p and q to prioritize IoU optimization in localization.Ultimately, we set p to 0.4 and q to 6 in Equation (8).

Evaluation Indicators
In order to accurately assess the model's localization and recognition accuracy for one-year-old sexually precocious Chinese mitten crabs, we needed to evaluate the corresponding metrics. For this purpose, we calculated the precision (P), recall (R), F1-score, and mAP to measure the model's recognition and classification performance on the experimental samples. The higher these metrics, the better the detection performance of the model. The specific formulas for each evaluation metric are as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 × P × R / (P + R)
mAP = (1 / M) Σ_{k=1}^{M} AP(k)

In the formulas, TP stands for true positive samples; FP stands for false positive samples; FN stands for false negative samples; M stands for the total number of detection target categories; and AP(k) stands for the average precision of the k-th category.
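These metrics can be computed directly from the counts, as in this short sketch (illustrative inputs only, not results from the paper):

```python
def precision(tp, fp):
    """P = TP / (TP + FP): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R = TP / (TP + FN): fraction of actual positives that are found."""
    return tp / (tp + fn)

def f1_score(p, r):
    """F1 = harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def mean_ap(ap_per_class):
    """mAP: the mean of AP(k) over all M categories."""
    return sum(ap_per_class) / len(ap_per_class)
```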

Model Training and Results
The models were trained uniformly on the self-constructed dataset, employing stochastic gradient descent (SGD) for gradient updating. Training involved 50 epochs, each with 100 batches; the training process is shown in Figure 11.

Comparison of Different Feature Extraction Network Modules
We enhanced feature extraction for hard-to-recognize individuals by designing a more complex feature extraction network. Integrating these modules and comparing them with the benchmark network ResNet50, we observed significant improvements in detection performance: the mAP50 metric increased by 3.41 percentage points, while mAP and mAP75 improved by 2.68 and 4.12 percentage points, respectively (Table 2). These results validate the effectiveness of our improvement measures in optimizing the final detection outcome. The ASFPN module efficiently integrates and trains features, enhancing the accuracy of the localization and classification of one-year-old sexually precocious Chinese mitten crabs by selectively and optimally fusing high-level semantic information and low-level features. Comparative experiments with mainstream multi-scale feature fusion methods (e.g., FPN, PAFPN, and BiFPN) confirm the superiority of ASFPN. During the initial ten epochs, while all methods show a rapid increase in the mAP50 metric, ASFPN maintains stability and achieves excellent performance without a noticeable decrease in the metric (Figure 12).


Comparative Analysis with Other Models
In order to highlight the superiority of this paper's algorithm in the target detection task for one-year-old precocious Chinese mitten crabs, we compared our R-TNET model with other mainstream target detection algorithms, including Faster RCNN, SSD, YOLOv5, and Cascade-RCNN, using the same experimental settings and datasets. R-TNET outperformed these algorithms on our self-built dataset, achieving an 88.78% mAP and a 96.14% mAP50. Specifically, it surpassed YOLOv5 and Cascade-RCNN by 4.19 and 5.43 percentage points in mAP50, respectively (Table 3). R-TNET's F1-score exceeded those of Faster RCNN, SSD, YOLOv5, and Cascade-RCNN by 3.1, 5.44, 1.77, and 2.65 percentage points, respectively. These results demonstrate R-TNET's effectiveness in addressing the challenges of one-year-old precocious Chinese mitten crab detection, significantly improving recognition accuracy to meet production needs. In summary, our method outperforms mainstream target detection methods built on various backbone networks, achieving higher accuracy and significant advantages in both the F1-score and mAP indexes compared with the other four methods discussed.

Model Visualization Analysis
To realistically simulate the detection performance of the model, we recorded a new video of the crab selection process and extracted 200 key frames for analysis using the optimally trained weights. The results demonstrate that our algorithm effectively handles complex backgrounds and overlapping crab shells, maintaining robust performance even under challenging conditions (Figure 13). This is particularly true in image (d), where, despite severe occlusion and lower confidence levels, the model accurately identified the targets, meeting basic operational requirements. Additionally, it is worth noting that our model took only 6 s to process these data, many times faster than manual processing. In terms of accuracy, the number of missed and false detections was significantly lower than expected. Ultimately, the model completed, with high accuracy and in a very short time, a task that would take professionals a long time.

Discussion
In this paper, a task-aligned R-TNET target detection algorithm was proposed for the difficult problem of recognizing and selecting one-year-old precocious crabs in Chinese mitten crab culture, aiming to solve the problems of difficult recognition and time-consuming, labor-intensive selection. The experimental results showed that these improvements significantly enhance the detection accuracy for one-year-old precocious Chinese mitten crabs, with the mAP50 index reaching 96.14%, far beyond current mainstream target detection algorithms. Meanwhile, in order to meet the actual needs of farmers, we carried out on-site shooting and processing to verify the model's visualization and identification of one-year-old sexually precocious crab juveniles in practical applications, and the results were in line with expectations.

Conclusions
The algorithm not only detects and classifies one-year-old precocious crab juveniles but also aids in quality grading and quantity prediction during selection.This method shows promise for automating Chinese mitten crab cultivation, reducing labor costs, and boosting economic returns in aquaculture.However, acquiring image data of crabs' specific features remains challenging, necessitating specialized engineering equipment.Continuous optimization of our model is crucial to enhance its practicality and effectiveness in real farming scenarios.

Figure 1 .
Figure 1. The presentation of crab datasets. (A) Difference between precocious crabs and normal samples. (B) Raw dataset with unclear features.


Figure 4 .
Figure 4. Diagram of TAP module structure.


Figure 7 .
Figure 7. Illustration of the network structure of CBAM.


Figure 8 .
Figure 8. Illustration of the sampling locations in 3 × 3 standard and deformable convolutions. (a) The conventional sampling method of standard convolution. (b) Convolutional kernel with added offset. (c) Horizontal and vertical transformations of the convolutional kernel after adding offsets. (d) Rotation transformation of the convolution kernel after adding offsets.


Figure 9 .
Figure 9. Realization of deformable convolution network.

Figure 12 .
Figure 12. Plot of mAP50 values under different feature fusion methods.


Figure 13 .
Figure 13. Detection results of one-year-old precocious Chinese mitten crabs (a-d).


Table 1 .
Dataset details of crabs.


Table 2 .
Comparison of recognition effects using different feature extraction networks.


Table 3 .
Comparison of recognition effects of different detection models on self-built datasets.