MFCF-Gait: Small Silhouette-Sensitive Gait Recognition Algorithm Based on Multi-Scale Feature Cross-Fusion

Gait recognition based on gait silhouette profiles is currently a major approach in the field of gait recognition. In previous studies, models typically used gait silhouette images sized at 64 × 64 pixels as input data. However, in practical applications, cases may arise where silhouette images are smaller than 64 × 64, leading to a loss in detail information and significantly affecting model accuracy. To address these challenges, we propose a gait recognition system named Multi-scale Feature Cross-Fusion Gait (MFCF-Gait). At the input stage of the model, we employ super-resolution algorithms to preprocess the data. During this process, we observed that different super-resolution algorithms applied to larger silhouette images also affect training outcomes. Improved super-resolution algorithms contribute to enhancing model performance. In terms of model architecture, we introduce a multi-scale feature cross-fusion network model. By integrating low-level feature information from higher-resolution images with high-level feature information from lower-resolution images, the model emphasizes smaller-scale details, thereby improving recognition accuracy for smaller silhouette images. The experimental results on the CASIA-B dataset demonstrate significant improvements. On 64 × 64 silhouette images, the accuracies for NM, BG, and CL states reached 96.49%, 91.42%, and 78.24%, respectively. On 32 × 32 silhouette images, the accuracies were 94.23%, 87.68%, and 71.57%, respectively, showing notable enhancements.


Introduction
Gait recognition refers to the technique of identifying individuals or analyzing behavior from their walking or movement patterns. It exploits the unique motion patterns and biological characteristics generated by the human body during walking, converting them into identifiable features [1]. Due to its non-invasive and difficult-to-disguise nature, gait recognition holds significant application potential in forensic and security domains. However, it faces numerous challenges in practical applications due to the complexity of real-world scenarios.
From a data perspective, current research in gait recognition can be categorized into model-based and silhouette-based approaches [2]. Model-based methods construct models from relationships such as the positions of human-body keypoints, using 2D, 3D, or SMPL models to extract gait features through data-driven learning [3]. Silhouette-based methods focus on extracting gait features from gait silhouette images. With the rise of deep learning, gait recognition research has integrated deep learning techniques. Among model-based methods, PoseGait [4] utilizes 3D body pose modeling, GaitGraph [5] employs graph convolutional networks on 2D skeletal models, and HMRGait [6] achieves end-to-end gait recognition using pre-trained human body mesh models based on the SMPL model [7]. Model-based methods exhibit robustness against noise from different clothing but suffer from drawbacks of their own.

Sensors 2024, 24, 5500

This study contributes to advancing gait recognition technology, particularly in handling the challenges posed by smaller-resolution gait silhouettes in practical applications.

Related Works
Our work primarily focuses on two aspects of silhouette-based gait recognition: dataset preprocessing and model design. Currently, the mainstream preprocessing method for silhouette-based gait recognition is the Takemura method. Its purpose is to segment the original gait silhouette images to an appropriate size while removing invalid information and unqualified images as far as possible; the processed images are then used for training or testing. In addition to the Takemura method, our approach adds a preprocessing step that uses super-resolution interpolation algorithms to standardize the input data to a common resolution, enhancing the detail information of smaller silhouette images. Regarding model design, we employ set pooling as the primary feature aggregation method, an approach whose effectiveness has been validated in numerous studies.

Takemura Method
The Takemura method, introduced in 2018 by Noriko Takemura together with the OU-ISIR MVLP dataset, is a method for normalizing gait silhouette images [12]. Specifically, the method starts by identifying the top and bottom of the silhouette to remove excess background. It then determines the horizontal center by statistically analyzing pixel values column by column, adjusts the image height to 64 pixels while scaling the width proportionally, and finally, following symmetry principles, crops the silhouette into a square. Normalization concludes by resizing the silhouette to either 64 × 64 or 128 × 128 resolution, making it suitable for deep learning applications. Figure 1 describes the process. This normalization inevitably involves super-resolution algorithms; however, the original Takemura paper does not specify which algorithm was used. This lack of specification may have led to inconsistent data processing across different models, producing varying results when the same model is reproduced in different studies. In the experimental section of this paper, we compare and discuss the Takemura method's performance under different super-resolution techniques.
Furthermore, when the Takemura method is used to process gait silhouette images, images with smaller silhouettes are often filtered out for various reasons. Figure 2 presents examples of elements that may be removed during preprocessing. Specifically, during filtering, the cumulative pixel value of the entire image is computed. Since gait datasets typically consist of binary images, this cumulative value effectively reflects the size of the silhouette within the image. Images whose cumulative value falls below a certain threshold are removed, and the remaining images are processed by the Takemura method and resized to 64 × 64 pixels. Although the Takemura method inherently loses detail during scaling, in practical applications the distance between the person and the camera may be large, yielding silhouettes significantly smaller than 64 × 64 pixels. This further loss of detail can severely impact model accuracy.
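The pipeline above can be sketched as follows. This is an illustrative reconstruction, not the reference implementation: the pixel-count threshold `min_pixels` is a hypothetical value, and a simple nearest-neighbour resize stands in for whichever interpolation a given study uses.

```python
import numpy as np

def takemura_normalize(sil, out_size=64, min_pixels=100):
    """Sketch of Takemura-style normalization for one binary silhouette.

    min_pixels is an illustrative filtering threshold, not the value
    used in the original method. Returns None when the silhouette's
    cumulative pixel value is below the threshold.
    """
    if sil.sum() < min_pixels:              # filter near-empty frames
        return None
    rows = np.where(sil.any(axis=1))[0]     # occupied rows
    sil = sil[rows[0]:rows[-1] + 1]         # crop excess background
    # horizontal centre from the column-wise pixel mass
    cols = sil.sum(axis=0)
    centre = int(np.round((cols * np.arange(sil.shape[1])).sum() / cols.sum()))
    # scale height to out_size, width proportionally (nearest neighbour)
    h, w = sil.shape
    new_w = max(1, int(round(w * out_size / h)))
    ys = (np.arange(out_size) * h / out_size).astype(int)
    xs = (np.arange(new_w) * w / new_w).astype(int)
    resized = sil[ys][:, xs]
    centre = int(round(centre * out_size / h))
    # crop/pad to a square centred on the silhouette
    half = out_size // 2
    canvas = np.zeros((out_size, out_size), dtype=sil.dtype)
    left = centre - half
    src_l, src_r = max(0, left), min(new_w, left + out_size)
    dst_l = src_l - left
    canvas[:, dst_l:dst_l + (src_r - src_l)] = resized[:, src_l:src_r]
    return canvas
```

A silhouette of any input size thus comes out as a centred 64 × 64 square, or is dropped if too small.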

Super-Resolution
For deep learning models, the input data are crucial. On one hand, the dataset must comprehensively reflect the problem to enable effective training. On the other hand, strict adherence to data formatting requirements is essential. For silhouette-based gait recognition models, gait silhouettes of different resolutions must be standardized to the same size before being input into the model; a mismatch between the data format and the model's expected format can lead to training failures or errors [17]. Super-resolution algorithms effectively address this issue. A further advantage of standardizing silhouette sizes this way is that the model needs no modification when faced with silhouettes of varying sizes: it only needs to be trained once to handle gait silhouettes of all sizes, yielding a robust model.
Super-resolution algorithms are image processing techniques aimed at recovering high-resolution details from low-resolution images or videos. Their primary objective is to infer missing high-frequency details from limited data by leveraging inherent image information, structure, statistical regularities, and any available prior knowledge to enhance image quality. Mathematically, however, super-resolution is an ill-posed problem. The loss of high-frequency detail in the original low-resolution images or videos is irreversible, so any algorithm can only attempt to compensate for it, leaving uncertainty in the restored high-resolution image. It is impossible to completely and accurately reconstruct the original high-resolution image from low-resolution data, and consequently the performance of different super-resolution algorithms varies [18].
This study observes that the reproducibility of results for the same model often varies across gait recognition papers [11]. A comparative analysis found that, besides factors such as the training environment and hyperparameter adjustments, the choice of super-resolution algorithm significantly influences training outcomes. In certain deep learning models, simply substituting a superior super-resolution algorithm improved accuracy over the original study. We attribute this to the heightened importance of the geometric details supplemented by super-resolution when silhouette resolutions are notably low.
The most common super-resolution algorithms are interpolation methods, such as nearest neighbor interpolation [19], bilinear interpolation [20], and bicubic interpolation [21]. Compared to other super-resolution approaches, interpolation algorithms offer greater flexibility and can be applied at every stage of gait recognition data processing, effectively enhancing model performance. Consequently, this paper primarily discusses the impact of various interpolation-based super-resolution algorithms on gait recognition.
Nearest neighbor interpolation is a relatively straightforward super-resolution algorithm: each position in the high-resolution image is filled with the value of the nearest pixel in the low-resolution image. The method is simple and intuitive, with fast computation, but it performs no numerical smoothing and may produce jagged edges. The formula is as follows:

$$F(i+u,\ j+v) = G\big(i + \lfloor u + 0.5 \rfloor,\ j + \lfloor v + 0.5 \rfloor\big)$$

where u, v are decimal fractions in the range [0, 1) and G(i, j) represents the value at point (i, j) in the low-resolution image.
Bilinear interpolation, in contrast to nearest neighbor interpolation, considers the weighted contributions of the four adjacent pixels and therefore usually produces smoother, more accurate results; during image enlargement it typically yields smoother and more natural outcomes. The formula is as follows:

$$F(i+u,\ j+v) = (1-u)(1-v)\,G(i,j) + (1-u)v\,G(i,j+1) + u(1-v)\,G(i+1,j) + uv\,G(i+1,j+1)$$

where u, v are decimal numbers in the interval [0, 1) and G(i, j) denotes the value at point (i, j) in the low-resolution image. Bicubic interpolation, which we ultimately adopted, is a more advanced interpolation method than bilinear interpolation. It further improves on bilinear interpolation by involving more neighboring pixels (a 4 × 4 neighborhood) in the calculation, enhancing the quality and detail of the result and producing smoother, clearer enlarged images. It employs a cubic polynomial as the interpolation kernel, defined as follows:

$$W(x) = \begin{cases} (a+2)|x|^3 - (a+3)|x|^2 + 1, & |x| \le 1 \\ a|x|^3 - 5a|x|^2 + 8a|x| - 4a, & 1 < |x| < 2 \\ 0, & \text{otherwise} \end{cases}$$

The interpolation formula is as follows:

$$F(i+u,\ j+v) = A\,B\,C$$

where the meanings of A, B, and C are as follows:

$$A = \big[\,W(1+u)\ \ W(u)\ \ W(1-u)\ \ W(2-u)\,\big],\qquad
B = \begin{bmatrix} G(i-1,j-1) & \cdots & G(i-1,j+2) \\ \vdots & \ddots & \vdots \\ G(i+2,j-1) & \cdots & G(i+2,j+2) \end{bmatrix},\qquad
C = \big[\,W(1+v)\ \ W(v)\ \ W(1-v)\ \ W(2-v)\,\big]^{\mathsf{T}}.$$
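The bicubic scheme can be sketched directly from these formulas. This is a minimal illustration, not an optimized implementation; the kernel parameter a = -0.5 is the common convention and is assumed here, as the paper does not state its value.

```python
import numpy as np

def bicubic_kernel(x, a=-0.5):
    """Cubic convolution kernel W(x); a = -0.5 is the usual choice."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def bicubic_sample(G, i, j, u, v):
    """F(i+u, j+v) = A @ B @ C over the 4x4 neighbourhood of (i, j)."""
    A = np.array([bicubic_kernel(1 + u), bicubic_kernel(u),
                  bicubic_kernel(1 - u), bicubic_kernel(2 - u)])
    C = np.array([bicubic_kernel(1 + v), bicubic_kernel(v),
                  bicubic_kernel(1 - v), bicubic_kernel(2 - v)])
    B = G[i - 1:i + 3, j - 1:j + 3]     # rows i-1..i+2, cols j-1..j+2
    return A @ B @ C
```

Because the four kernel weights sum to one, a constant image is reproduced exactly, which is a quick sanity check for any interpolation kernel.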

Set Pooling
The concept of set pooling was first introduced in GaitSet, a classic gait recognition model. The method accounts for the fact that the number of gait silhouette images per person can vary arbitrarily, sometimes producing gait sequences of very different lengths; applying a maximum pooling function along the temporal dimension therefore captures more generalized gait information. In recent years this approach has been validated across various studies. Set pooling offers several unique advantages: it treats gait sequences as sets, eliminating the strict temporal-order requirement and reducing the computational burden on hardware. Moreover, as gait recognition datasets have grown richer, particularly in outdoor environments, set pooling has exceeded expectations, demonstrating robustness against visual noise and varying sequence lengths [22], and has outperformed many subsequent methods in complex real-world scenarios. The specific formula is as follows:

$$\text{Feature}(1, c, h, w) = \max_{n}\ \text{Feature}(n, c, h, w)$$

where Feature denotes the feature matrix and n, c, h, w represent the number of feature maps, channels, height, and width of the feature maps, respectively. Taking the maximum along the feature-map dimension n aggregates the matrix into the single feature map Feature(1, c, h, w).
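The aggregation above is a one-liner in practice. The sketch below also demonstrates the "set" property the text relies on: permuting the frames does not change the result.

```python
import numpy as np

def set_pooling(features):
    """Aggregate a variable-length set of frame features of shape
    (n, c, h, w) into a single (1, c, h, w) map by element-wise max."""
    return features.max(axis=0, keepdims=True)

# Order invariance: shuffling the frame dimension leaves the output unchanged.
feats = np.random.rand(7, 4, 8, 8)
pooled = set_pooling(feats)
shuffled = set_pooling(feats[np.random.permutation(7)])
```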

Dataset
The CASIA-B dataset [23] is one of the most widely used datasets for gait analysis. It includes RGB and silhouette multi-view gait data from 124 participants, captured at 11 angles ranging from 0° to 180° at intervals of 18°. The dataset covers three walking conditions: normal (NM), in a coat (CL), and with a bag (BG). For each participant at each viewpoint there are 6 sequences of normal walking, 2 sequences walking in a coat, and 2 sequences walking with a bag, giving a total of 110 gait sequences per person across the 11 viewpoints.

Dataset Splitting and Rank-1
The most commonly used testing protocol for the CASIA-B dataset is the subject-independent protocol, in which individuals in the training set do not overlap with those in the testing set. A typical split assigns the data from the first 74 participants to the training set, leaving the data from the remaining 50 participants for testing. Within the testing set, gait sequences are divided into a probe set and a gallery set [24].
In gait recognition systems, the "probe" refers to the probe (verification) set, while the "gallery" is the gallery (enrollment) set. For each known individual there exists a corresponding set of gait sequences, which together form the gallery set. All gallery sequences are passed through the trained gait model, and the resulting vector representations are stored. To identify a new gait sequence, the sequence is fed into the same model to produce a new vector representation. The system then calculates the Euclidean distance between this vector and every vector in the gallery set and computes the Rank-1 identification rate: if the closest gallery vector belongs to the same individual as the probe, the identification is counted as a success; otherwise, it is a failure. The Rank-1 identification rate over all probes assesses the performance of the gait model. The computational details are illustrated in Figure 3.
The reason for adopting such a testing protocol is that, in real-world applications, it is impractical to pre-train on the identities of the individuals to be recognized. This protocol lets researchers better simulate real-world deployment of gait recognition technology and thus more reasonably assess the performance of gait recognition algorithms on untrained individuals.
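The Rank-1 computation described above reduces to a nearest-neighbour search over the gallery embeddings. A minimal sketch, with hypothetical toy embeddings:

```python
import numpy as np

def rank1_accuracy(probe_vecs, probe_ids, gallery_vecs, gallery_ids):
    """Rank-1: a probe counts as correct when its nearest gallery
    embedding (by Euclidean distance) carries the same identity."""
    correct = 0
    for vec, pid in zip(probe_vecs, probe_ids):
        dists = np.linalg.norm(gallery_vecs - vec, axis=1)
        if gallery_ids[int(np.argmin(dists))] == pid:
            correct += 1
    return correct / len(probe_ids)
```

In the actual protocol the embeddings come from the trained model; here two 2-D points stand in for them.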


Model Overview
As shown in Figure 4, the model in this paper is divided into four parts: super-resolution processing, a backbone network, the multi-scale feature cross-fusion module (MFCF), and single-scale horizontal pyramid mapping (SHPM). The role of super-resolution processing is twofold. First, it compensates for the image details lost in small silhouettes, enhancing recognition rates [15]. Second, deep learning imposes strict formatting requirements on input data to avoid computational errors, so the inputs must be standardized. Moreover, our experiments found that selecting a superior super-resolution algorithm also improved recognition rates for 64 × 64 silhouette images. The backbone network is built on the classic VGGNet [25], using alternating convolutional and pooling layers to extract gait features, followed by set pooling (SP) to aggregate them. The MFCF module integrates shallow and deep features from the backbone network: shallow features contain rich geometric detail but little semantic information, whereas deep features carry more semantic information but fewer geometric details; by collecting and fusing information from different layers, the module lets the two compensate for each other's deficiencies [26]. Compared with other multi-scale feature fusion methods, MFCF ensures sufficient information exchange through iterative fusion of features across scales, minimizing the information loss caused by pooling layers and making the model more sensitive to small silhouette images.
The SHPM includes the crucial component SHPP (single-scale horizontal pyramid pooling), adapted from horizontal pyramid pooling (HPP) [27], which was initially used in pedestrian re-identification and later adopted in gait recognition. HPP segments the feature maps from each module into strip-like regions at four different scales, encouraging the neural network to attend to features of various sizes. The three-dimensional strip features are then compressed into one-dimensional features through global max pooling and global average pooling, which are summed; finally, the pooled features are mapped and classified via a fully connected layer. However, our study found that the multi-scale segmentation employed by HPP did not significantly benefit our model; based on experimental results, a single scale proved sufficient. We therefore simplified HPP into the single-scale SHPP, reducing model parameters while preserving functionality. The specific structure of SHPP is shown in Figure 5.
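The single-scale strip pooling can be sketched as follows. This is a rough illustration only: the strip count is an assumed value rather than the paper's setting, and the final fully connected mapping is omitted.

```python
import numpy as np

def shpp(feature, num_strips=16):
    """Single-scale horizontal pyramid pooling sketch.

    feature: aggregated gait feature map of shape (c, h, w), with h
    divisible by num_strips (an assumed strip count). Each horizontal
    strip is reduced by global max pooling plus global average pooling,
    and the two results are summed. Returns (num_strips, c); the
    per-strip fully connected mapping of the full SHPM is omitted here.
    """
    c, h, w = feature.shape
    strips = feature.reshape(c, num_strips, h // num_strips, w)
    pooled = strips.max(axis=(2, 3)) + strips.mean(axis=(2, 3))
    return pooled.T                      # one c-dimensional vector per strip
```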


Multi-Scale Feature Cross-Fusion Module
In convolutional neural networks (CNNs), pooling layers are a common operation known to benefit many aspects of model performance, such as feature extraction, enlarging receptive fields, enhancing translational invariance, and reducing computational complexity [28]. However, pooling also introduces an irreversible loss of geometric detail in the feature maps, so the high-level feature maps closest to the discriminative modules lack geometric detail, which harms accuracy on smaller targets. One way to address this is to fuse lower-level feature maps, rich in geometric detail, with higher-level feature maps. Traditional multi-scale feature fusion methods either place pooling layers inside the fusion modules, again losing geometric detail, or fail to sufficiently integrate information across scales and neglect further exploration of the fused features, hindering the model's ability to learn from small silhouette images [29].
In light of these challenges, this paper introduces a multi-scale feature cross-fusion (MFCF) module. The module's two branches extract feature maps of specific sizes from different layers of the model; after a series of processing steps, these maps are combined with the output of the backbone network and fed into the SHPM (single-scale horizontal pyramid mapping). The key characteristic of these specific-layer feature maps is that, after pooling, they resemble lower-resolution images that have lost some geometric detail. Overall, after branch-wise inference and fusion, the feature maps are combined with the backbone results and enter the discrimination stage. This raises the weight of low-resolution image features in the final result, guiding the model to focus on the finer details of small silhouettes and ultimately improving recognition accuracy for small-sized silhouette images. The module extracts feature maps from different levels of the model without internal pooling operations, preserving their original resolutions, and continues feature computation with convolutional kernels to augment the semantic information of the fused features [30]. Inter-branch operations perform cross-fusion in multiple stages with various convolutional kernels, using upsampling and downsampling to ensure comprehensive information exchange. For high-resolution image features, this yields additional gait-relevant semantic information that aids discrimination during classification; conversely, for low-resolution features, it recovers more of the lost geometric detail, improving accuracy in the discrimination stage.
Consequently, MFCF retains and exploits detailed information while minimizing computational overhead. Its outputs are transmitted alongside the backbone network outputs to the discriminative modules, increasing the influence of multi-scale feature maps and thereby improving recognition rates for low-resolution gait silhouettes.
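The cross-scale exchange between branches can be sketched as follows. This is a minimal NumPy illustration, not the actual MFCF implementation: the function names (`upsample2x`, `downsample2x`, `cross_fuse`), the 2× scale factor, and simple additive fusion are illustrative assumptions; the real module additionally applies convolutional kernels after each exchange.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling: repeat each pixel 2x along H and W.
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def downsample2x(x):
    # 2x2 average pooling, used only for the cross-scale exchange;
    # each branch itself keeps its native resolution (no internal pooling).
    h, w = x.shape[-2] // 2, x.shape[-1] // 2
    return x[..., :h * 2, :w * 2].reshape(*x.shape[:-2], h, 2, w, 2).mean(axis=(-3, -1))

def cross_fuse(low_level, high_level):
    """One cross-fusion exchange between a high-resolution low-level
    feature map and a low-resolution high-level feature map."""
    fused_high_res = low_level + upsample2x(high_level)    # inject semantics
    fused_low_res = high_level + downsample2x(low_level)   # recover geometry
    return fused_high_res, fused_low_res

low = np.random.rand(1, 32, 64, 64)    # low-level features, high resolution
high = np.random.rand(1, 32, 32, 32)   # high-level features, low resolution
a, b = cross_fuse(low, high)
print(a.shape, b.shape)  # (1, 32, 64, 64) (1, 32, 32, 32)
```

Each branch keeps its own resolution throughout; only copies of the feature maps are resampled for the exchange, which is how the sketch avoids the irreversible detail loss that in-branch pooling would cause.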

Loss
This paper employs joint training of the model using Triplet Loss [31] and Cross-Entropy Loss [32]. Triplet Loss is a commonly used loss function in gait recognition models and is particularly advantageous for learning fine-grained features, as evidenced by numerous studies in the field. However, Triplet Loss may suffer from imbalanced sample quantities between classes during training, where some classes have fewer samples than others. This imbalance can lead to insufficient attention to samples from certain classes, resulting in unstable training and slow convergence. Combining Triplet Loss with another loss function in joint training mitigates these issues. The joint loss function is formulated as

L = a · L_tri + b · L_ce,

where the coefficients a and b are the weighting factors for the two loss functions.
The objective of Triplet Loss is to map samples from the same class into a compact region of the feature space while separating samples from different classes into distinct regions. Each training triplet consists of three samples: an anchor, a positive, and a negative. The anchor and the positive belong to the same class, whereas the negative belongs to a different class. The loss minimizes the distance between the anchor and the positive while maximizing the distance between the anchor and the negative; by adjusting the feature representations of these samples, the model learns to distinguish between classes. Triplet Loss is computed as

L_tri = max(d(a, p) − d(a, n) + margin, 0),

where a denotes the feature representation of the anchor, p the feature representation of a positive sample from the same class as the anchor, and n the feature representation of a negative sample from a different class. The function d(x, y) denotes the Euclidean distance or cosine similarity between features x and y, and margin is a hyperparameter controlling the margin boundary, ensuring that the anchor–negative distance exceeds the anchor–positive distance by at least the margin.

The objective of Cross-Entropy Loss is to minimize the discrepancy between the model's predictions and the true labels, reducing the gap between predicted results and true classes. Cross-Entropy Loss is computed as

L_ce = −(1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} p_ij log(q_ij),

where m is the total number of samples, n the total number of classes, p_ij equals 1 if sample i belongs to class j and 0 otherwise, and q_ij is the predicted probability that sample i belongs to class j.
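The joint loss above can be sketched numerically. This is a hedged, minimal sketch: the function names and example inputs are hypothetical, the distance is taken to be Euclidean (one of the two options mentioned), and single-sample triplets stand in for batched training.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Euclidean-distance variant: pull anchor-positive together and
    # push anchor-negative apart by at least `margin`.
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(d_ap - d_an + margin, 0.0)

def cross_entropy_loss(p, q, eps=1e-12):
    # p: one-hot labels of shape (m, n); q: predicted probabilities (m, n).
    m = p.shape[0]
    return -np.sum(p * np.log(q + eps)) / m

def joint_loss(anchor, positive, negative, p, q, a=1.0, b=1.0):
    # Weighted sum of the two losses; a and b are the weighting factors.
    return a * triplet_loss(anchor, positive, negative) + b * cross_entropy_loss(p, q)

anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # same class: close to the anchor
negative = np.array([1.0, 1.0])   # different class: far from the anchor
p = np.array([[1.0, 0.0]])        # true class is class 0
q = np.array([[0.9, 0.1]])        # predicted probabilities
print(joint_loss(anchor, positive, negative, p, q))
```

With this well-separated triplet the triplet term is already zero, so the joint loss reduces to the cross-entropy term; a hard negative closer to the anchor would make the triplet term positive.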

Implementation Details
The experimental setup and hyperparameters are crucial in deep learning. The setup determines the platform on which the model runs, while appropriate hyperparameter settings greatly assist model convergence. The experimental setup of this study is shown in Table 1, and the hyperparameters are listed in Table 2.

Super-Resolution Algorithm Comparison Experiments
In silhouette-based gait recognition models, gait silhouette images of varying sizes must be standardized to a uniform resolution using super-resolution algorithms before processing. The most common practice is to adjust the resolution to 64 × 64 pixels. This study investigates the impact of different super-resolution algorithms on model accuracy during training. For gait silhouette images in the CASIA-B dataset with larger original sizes and finer details, this study directly compresses them to 64 × 64 pixels using the various algorithms.
To address the case where gait silhouette images are smaller than 64 × 64 in practical applications, this paper uses the simplest interpolation method, nearest neighbor interpolation, to compress gait silhouette images from the CASIA-B dataset to 32 × 32. This simulates the loss of image detail caused by excessive distance in real scenarios. For silhouette images smaller than 64 × 64, the experimental design applies various super-resolution algorithms to upscale them to 64 × 64 in order to investigate how effectively each algorithm recovers image detail. During training, the models uniformly use gait silhouette images compressed directly to 64 × 64. During testing, the trained models are evaluated both on data compressed directly to 64 × 64 and on data first compressed to 32 × 32 and then upscaled back to 64 × 64, yielding the respective accuracy results. The experimental results are shown in Table 3.

From the results, the following conclusions can be drawn. After compressing larger silhouette images to 64 × 64, the highest accuracy is achieved with bicubic interpolation, followed by bilinear interpolation, with nearest neighbor interpolation yielding the lowest accuracy. For silhouette images at 32 × 32 resolution, bicubic interpolation again exhibits the best accuracy. This paper suggests that bicubic interpolation performs best because it fills in more local detail, enabling the model to capture the feature information crucial for recognition more comprehensively and to differentiate gait features more accurately. Additionally, silhouette images processed by bicubic interpolation exhibit smooth grayscale transitions, providing probabilistic information beyond a simple binary black-and-white image. The gradient transitions between high-brightness areas and key points of the human body represent a confidence distribution, allocating more weight to regions with higher likelihood, which enhances the reliability of gait recognition models. The gait silhouettes processed by different super-resolution algorithms are illustrated in Figure 6. These findings offer guidance for future gait recognition research and practical applications: further improvement of super-resolution algorithms, coupled with the fusion of local detail and probabilistic information, holds promise for improving recognition performance and accuracy.
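The difference between hard binary edges and smooth grayscale transitions can be illustrated with two hand-rolled upscalers. This is a minimal sketch: the function names are hypothetical, bicubic is omitted for brevity, and bilinear stands in for the smoothing interpolators the experiments compare.

```python
import numpy as np

def upscale_nearest(img, s):
    # Each output pixel copies its nearest source pixel: values stay binary.
    idx = np.arange(img.shape[0] * s) // s
    return img[idx][:, idx]

def upscale_bilinear(img, s):
    # Each output pixel blends its four source neighbours, producing
    # intermediate grey values along silhouette edges.
    h, w = img.shape
    ys = np.linspace(0, h - 1, h * s)
    xs = np.linspace(0, w - 1, w * s)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# A tiny binary "silhouette" with a hard 0/255 vertical edge.
sil = np.zeros((4, 4)); sil[:, 2:] = 255.0
nn = upscale_nearest(sil, 2)
bl = upscale_bilinear(sil, 2)
print(np.unique(nn))                        # still only 0 and 255
print(((bl > 0) & (bl < 255)).any())        # True: grey transition appears
```

The grey band that bilinear (and, more so, bicubic) interpolation produces along the contour is exactly the gradient transition the paper interprets as a confidence distribution; nearest neighbor preserves only the binary mask.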

Feature Fusion Module Ablation Experiments
To validate the effectiveness of the proposed multi-scale feature cross-fusion module, particularly its accuracy on 32 × 32 silhouette images, experiments were conducted in which the base model was modified to exclude this module and other multi-scale feature fusion modules were integrated for comparison. Four configurations were tested: a unified super-resolution algorithm without any additional module, the MGB module [8] from the original GaitSet model, the FPN module, and the method proposed in this paper. The MGB module, derived from GaitSet, collects and fuses features from different depths to enhance representation. FPN [33] (Feature Pyramid Network) is a classic feature extraction structure that fuses feature maps across levels through top-down and bottom-up pathways, effectively extracting representations with rich semantics and multi-scale information. The experimental data for the different modules are presented in Table 4. Based on these data, it is evident that our proposed method not only achieves the best accuracy on 64 × 64 silhouette images but also significantly surpasses the alternative modules on 32 × 32 silhouette images, making it particularly suitable for recognizing small-scale silhouettes.
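For reference, the top-down pathway of an FPN-style baseline can be sketched as follows. This is a simplified sketch under stated assumptions: the lateral 1 × 1 convolutions and smoothing convolutions of a real FPN are omitted, nearest-neighbour upsampling and addition stand in for the full structure, and the names (`fpn_top_down`, `c3`/`c4`/`c5`) are illustrative.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling along H and W.
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def fpn_top_down(features):
    """Top-down FPN pathway: start from the deepest (lowest-resolution)
    map and successively upsample-and-add into shallower maps."""
    out = [features[-1]]
    for f in reversed(features[:-1]):
        out.append(f + upsample2x(out[-1]))
    return out[::-1]  # return in shallow-to-deep order

c3 = np.random.rand(1, 16, 32, 32)  # shallow, high resolution
c4 = np.random.rand(1, 16, 16, 16)
c5 = np.random.rand(1, 16, 8, 8)    # deep, low resolution
p3, p4, p5 = fpn_top_down([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)  # (1, 16, 32, 32) (1, 16, 16, 16) (1, 16, 8, 8)
```

Note the contrast with MFCF: FPN propagates information only along fixed top-down (and bottom-up) pathways, whereas the cross-fusion module exchanges information between branches in both directions at every stage while each branch keeps its native resolution.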

Comparison with the Existing Methods
In comparison with current mainstream silhouette-based gait recognition models, our model demonstrates superior performance, particularly when handling small-scale silhouettes. Under identical hardware configurations and with a unified super-resolution algorithm, our model is compared with GaitSet, GaitPart, and GaitGL on 64 × 64 and 32 × 32 silhouette images. The experimental results are presented in Table 5. The results show that employing super-resolution algorithms capable of enhancing image details with smooth transitional properties significantly improves model accuracy, most noticeably in the recognition of small-scale silhouette images. Furthermore, this paper proposes a novel multi-scale fusion module in the model architecture. This module is designed to better leverage multi-scale information, focusing on learning from small-scale details and thereby further enhancing the model's accuracy and robustness on small-scale silhouette images. Through these investigations, the effectiveness of the model modifications and super-resolution algorithms is validated, demonstrating their crucial role in enhancing model performance and robustness. This advancement helps bridge the gap between gait recognition research and practical applications.

Figure 1. The Takemura method workflow: (a) begin with the original image, (b) trim excess background from the top and bottom, (c) resize the height to 64 pixels and find the horizontal center of the contour image, and (d) crop the image width to 64 pixels.
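The four steps of this workflow can be sketched in NumPy. This is a hedged sketch, not the reference implementation: the function name `takemura_preprocess` is hypothetical, nearest-neighbour index mapping stands in for the resize step, and the horizontal center is taken as the silhouette's center of mass.

```python
import numpy as np

def takemura_preprocess(sil, out=64):
    """Sketch of Takemura-style cropping: trim empty rows, resize height
    to `out` pixels, centre horizontally on the silhouette, crop width."""
    # (b) trim excess background rows above and below the silhouette
    rows = np.where(sil.sum(axis=1) > 0)[0]
    sil = sil[rows[0]:rows[-1] + 1]
    h, w = sil.shape
    scale = out / h
    # (c) nearest-neighbour resize to height `out` (width scaled to match)
    yi = np.minimum((np.arange(out) / scale).astype(int), h - 1)
    xi = np.minimum((np.arange(int(w * scale)) / scale).astype(int), w - 1)
    sil = sil[yi][:, xi]
    # horizontal centre of mass of the contour
    cx = int(round(sil.sum(axis=0) @ np.arange(sil.shape[1]) / sil.sum()))
    # (d) crop (zero-padding as needed) to width `out` around the centre
    padded = np.pad(sil, ((0, 0), (out, out)))
    cx += out
    return padded[:, cx - out // 2: cx + out // 2]

# A toy 100x80 frame containing an off-centre rectangular "silhouette".
sil = np.zeros((100, 80)); sil[10:90, 30:50] = 1.0
print(takemura_preprocess(sil).shape)  # (64, 64)
```

Centering on the contour before the width crop is what keeps the subject in the middle of every frame regardless of where they appear in the original camera view.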

Figure 2. Some images removed during the preprocessing of the CASIA-B dataset.

Figure 3. The method for calculating recognition rates in gait recognition algorithms.

Figure 6. The gait silhouette images processed by different super-resolution algorithms. The first row shows images at a resolution of 64 × 64 pixels, while the second row displays images at a resolution of 32 × 32 pixels. (a) Nearest neighbor interpolation, (b) bilinear interpolation, (c) bicubic interpolation.

Table 4. Feature fusion module ablation experiments.
