LHFFNet: A hybrid feature fusion method for lane detection

Lane line images exhibit large scale variation and complex scene information, and the high similarity between adjacent lane lines easily causes classification errors. Moreover, distant lane lines are difficult to recognize because their apparent width shrinks with viewing angle. To address these issues, this paper proposes an effective lane detection framework: a hybrid feature fusion network that enhances multiple spatial features and distinguishes key features along the entire lane line. It enhances and fuses lane line features at multiple scales to strengthen the feature representation of lane line images, especially at the far end. Firstly, to enhance the correlation of multiscale lane features, multi-head self-attention is used to construct a multi-space attention enhancement module for feature enhancement across multiple spaces. Secondly, a spatially separable convolutional branch is designed for the jumping-layer structure connecting multiscale lane line features; while retaining feature information at different scales, it emphasizes important lane areas in the multiscale feature information through the allocation of spatial attention weights. Finally, since lane lines are elongated regions and background information in an image far exceeds lane line information, traditional pooling operations have limited flexibility in capturing the anisotropic contexts that are widespread in real environments. Therefore, strip pooling is introduced before the feature output branches to refine the representation of lane line information and optimize model performance. The experimental results show that the accuracy on the TuSimple dataset reaches 96.84%, and the F1 score on the CULane dataset reaches 75.9%.

lane detection model using SHN as the backbone network and achieved excellent detection results. However, the construction of PINet has two drawbacks. Firstly, there is information redundancy, which makes it difficult to distinguish key features during the integration process. PINet fuses upsampled deep features with shallow features through jumping layers, which can lead to insufficient fusion, and the redundant background information contained in deep features can interfere with accurate target localization. Secondly, PINet is built directly on the structure of the backbone network. In actual lane line images the width of lane lines varies, and the local features extracted by shallow layers are insufficient to describe the target. Appropriately enlarging the receptive field can capture more detailed information and improve recognition performance.
In response to the above issues, this paper integrates multiscale local information with high-level and low-level global information to improve the recognition accuracy of existing CNN models. A Hybrid Feature Fusion Network for Lane detection (LHFFNet) with multi-space feature enhancement and multiscale key feature differentiation is proposed to explore how enhancing multiscale feature representation improves the network's performance in detecting lane lines in complex scenes.
For the different scene elements in lane line images, this paper enhances their feature representation through a Multi-Head Self-Attention mechanism (MHSA) 17 after the output of each downsampling layer, strengthens the correlation between adjacent layer features through inter-layer transfer, and deepens the differentiation between features to increase the diversity of information in each layer. Thus, a Multi-space Attention Enhancement Module (MAEM) is proposed, which encodes the multiscale output of each sampling layer in multiple spaces. This effectively improves the output logic of subsequent feature fusion, making the output feature map more expressive. Considering the variability of the spatial distribution in the fused feature maps, a Space Concern Distinguish Module (SCDM) is developed to further emphasize key information regions in multiscale feature maps. In addition, to mitigate as much as possible the inaccurate recognition of long-distance lane lines caused by the decrease in lane pixel width with distance, this paper introduces the Strip Pooling Module (SPM) 18 to further refine the features before output. To investigate the feature extraction ability of this model for lane areas, this paper uses SHN as the backbone network and conducts experiments on the lane detection datasets TuSimple and CULane to test the overall performance of the model. The experimental results demonstrate that feature enhancement and key feature differentiation in the multiple spaces of the multiscale features make the model more accurate in recognizing lane lines. To demonstrate its advancement, this paper compares the proposed LHFFNet with current advanced methods and obtains the best performance. The main contributions of this paper are summarized as follows: (1) To accurately encode the spatial information of each scale feature and carry out progressive feature enhancement, the MAEM is proposed to enhance the output information of the last layer of each downsampling stage and further capture spatial features.
(2) As the spatial distribution of each scale feature differs after the MHSA, the SCDM re-encodes the spatial information of each scale to further distinguish the key regions.
(3) As the pixel width of the lane narrows with the change of viewing angle between the near and far ends of the lane, the SPM is introduced to flexibly capture lane markings at both ends, so that lanes can be identified more accurately.

Related work
Traffic line detection
A large portion of previous lane detection techniques relied on handcrafted low-level features to detect road markings 19,20 . These methods required complex feature extraction processes and, owing to the diversity of road scenes, could not effectively handle various complex scenarios. Although some recent methods have addressed certain limitations of traditional lane detection with more robust approaches, there is still room for improvement.
In recent times, deep learning has become the prevailing method in computer vision research. According to how lane lines are modeled, current deep learning based lane detection can be roughly divided into four categories: segmentation based, anchor based, keypoint based, and parameter prediction based methods. The segmentation based method treats lane detection as a per-pixel classification problem, with each pixel assigned to one of two regions: lane lines or background. Usually, background pixel information far exceeds lane line information. To distinguish between lane lines, background, and different types of lane lines, Pan et al. 6 proposed Spatial CNN (SCNN), which treats lane detection as a multi-class segmentation task, treats different lane lines as different categories, and proposes a message passing mechanism to transfer information between adjacent rows or columns of the feature map. However, long-distance information propagation can cause the loss of lane information, and the information transmission in SCNN is very time-consuming, resulting in slow inference. In response to the low computational efficiency of SCNN, Zheng et al. 21 proposed the Recurrent Feature-Shift Aggregator (RESA), which greatly improves efficiency through parallel information flow; moreover, its multiple-stride information aggregation is more efficient and incurs lower losses than the sequential aggregation in SCNN. Hou et al. 22 applied the Self Attention Distillation (SAD) mechanism in ENet-SAD to capture global contextual information, enabling the network to aggregate and acquire rich contextual information. Neven et al. 23 constructed LaneNet, a multi-task lane detection model with a branch structure comprising binary segmentation and instance segmentation branches. Lane detection is treated as an instance segmentation problem, and clustering algorithms assign pixels to different lane lines to obtain each lane line.
The anchor based lane detection method first designs linear anchors, then calculates the offsets between sampling points and the predefined anchors, and finally uses Non-Maximum Suppression (NMS) to keep the lane lines with the highest confidence. Li et al. 1 innovatively proposed a novel representation for lane line anchors, using rays emitted at different angles from the left, bottom, and right boundaries of the image as anchors. Through these anchors, each lane line is divided into positive and negative samples, the loss is calculated, and the model parameters are updated. Qin et al. 7 proposed a simple and efficient lane detection method that transforms the task into finding a set of specific lane line positions by selecting and classifying positions along the row direction of the image. This reduces the number of classifications and yields faster detection, but the method lacks robustness and generalization ability, resulting in average performance in complex scenarios.
Inspired by human pose estimation, some works treat lane detection as a keypoint estimation and association problem. Ko et al. 16 proposed PINet for lane detection. This method uses SHN as the backbone network, predicts keypoint positions and embedded features based on keypoint estimation and instance segmentation, and uses clustering algorithms to separate lane lines according to the similarity of the embedded features. Qu et al. 24 used a lightweight segmentation network with two branches, one outputting heatmaps indicating whether pixels are keypoints, the other outputting offsets to precisely adjust keypoint positions. Association algorithms then complete local-to-global curve association.
The parameter prediction based method models lane lines with parameters and regresses these parameters to achieve lane detection. Liu et al. 25 proposed outputting lane detection as model parameters related to lane shape, and established a lane line shape model based on road structure and camera pose, thereby providing a reliable physical interpretation of the parameters output by the network. This method uses the Transformer 17 model to enhance feature expression and interaction in visual features, enabling the model to capture global contextual information in slender lane structures. This type of method relies on the accuracy of the input parameters, and parameter instability can significantly affect lane modeling, leading to suboptimal detection performance.

Feature representation enhancement
It is generally believed that feature representation enhancement strategies should be universal across different types of images, including lane images, and many models use aggregated context information as one such strategy. Feature representation enhancement is one of the main subjects of this paper, so its related research is reviewed here.
Jaderberg et al. 26 proposed the spatial transformer network, which re-encodes spatial information and enhances the representation of feature units to improve performance on tasks such as image classification and object detection. SAGAN 27 is a generative adversarial network model that uses a self-attention mechanism to aggregate contextual information in spatial images; by effectively combining self-attention, it significantly improves the quality of generated images. Wang et al. 28 proposed using non-local blocks to enhance the model's overall representation ability for feature extraction units, providing a new approach to scene understanding. SCNN 6 converts traditional layer-by-layer convolution into slice-by-slice convolution within feature maps, effectively extracting and utilizing spatial information across rows and columns of traffic scenes. However, this model requires substantial training resources to optimize, and information loss may occur when information is propagated over long distances. RESA 21 is a module for aggregating spatial information that captures local and global features through recursive feature aggregation, allowing features to be fully described; however, this structure must compute and transmit features at each layer, resulting in high computational complexity. Hou et al. 22 applied the self attention distillation (SAD) mechanism in ENet-SAD for global feature enhancement, enabling the network to aggregate and acquire rich contextual information.

Method
Overall architecture
This section presents the LHFFNet framework, illustrated in Fig. 2. The network is divided into two parts: resizing network and predicting network.
In LHFFNet, there are four hourglass modules and one adjustment network. Firstly, MAEM is used to extract lane features while enhancing attention to feature information at various scales, strengthening feature representation, optimizing information transmission within each hourglass module, and ensuring that information is not blurred or lost when passed across hourglass modules. Then, each hourglass module utilizes the SCDM to retain feature information at different scales. Because all pixels carry equal weight, the original jumping layers in SHN cannot effectively distinguish key from non-key regions in the multiscale feature maps, which harms the distinguishability of the extracted features. To address this, this paper designs the SCDM to emphasize key regions with high weights, enabling the network to balance feature maps with different weights, better distinguish lane lines from the background, extend the output logic of MAEM, and make information at different scales more distinguishable. Finally, the SPM is used in the output branch to further refine the extraction of lane information, flexibly encoding remote lane information in space and improving the model's ability to detect lane lines at remote positions.
Specifically, after an image is input into the network, it is first resized to reduce its size, saving storage space and inference time while reducing computational complexity. The output of the resizing network is then fed into the predicting network for further feature extraction, which outputs the prediction results. The MAEM enhances the feature extraction capability of the model: after features are extracted by the residual module, the obtained feature map is enhanced by MHSA for multi-space feature expression, guiding the network to obtain global contextual information and global semantic features, establishing strong correlations between feature information, and reducing information loss between hourglass modules. At the same time, the SCDM connects feature maps of different scales in the backbone network and introduces a new spatially separable convolution branch into the original jumping layer, giving each pixel region a separate weight. This effectively emphasizes the key areas around lane features, increases the importance of spatial features, and improves the model's ability to learn spatial positions. Then, the output of MAEM is processed by distillation layers and fused with the output features of SCDM for upsampling; the outputs at each scale during upsampling are fused with the corresponding SCDM outputs. Subsequent operations such as upsampling are then performed to predict the confidence, offset, and embedded features refined by the SPM. Based on these outputs, the points on each lane are predicted and then fitted.
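At a very high level, the two-stage data flow described above can be sketched as follows. This is a coarse skeleton with stand-in placeholders only: the real resizing network, hourglass blocks, MAEM, SCDM and SPM layers are not reproduced, and the class and argument names are our own.

```python
import torch
import torch.nn as nn


class LHFFNetSkeleton(nn.Module):
    """Coarse control-flow sketch of the pipeline: a resizing network
    followed by a stack of hourglass blocks, each of which is assumed to
    return its refined feature map plus a (confidence, offset, embedding)
    prediction tuple. All sub-modules are hypothetical stand-ins."""

    def __init__(self, hourglass_blocks):
        super().__init__()
        self.resize_net = nn.Identity()  # placeholder for the resizing network
        self.hourglasses = nn.ModuleList(hourglass_blocks)

    def forward(self, img):
        x = self.resize_net(img)
        outputs = []
        # each block would internally apply MAEM, SCDM skips, and an SPM head
        for hg in self.hourglasses:
            x, out = hg(x)
            outputs.append(out)
        return outputs
```

The per-block outputs would then be post-processed into lane points and fitted with curves, as the text describes.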

Multi-space attention enhancement module
Due to excessive noise in lane line images, the lane markings are often unclear. When the network extracts lane line features, the information extracted by each channel of the feature map differs at each scale. Conventional convolution operations can only extract limited local features of the lane and lack a global perspective. As the network deepens, its regions of interest gradually shift from local textures to global contours, yet the local semantic information contained in the shallow layers is equally important for accurately locating lane lines in the scene, so it also needs to be emphasized. In MAEM, MHSA is applied for feature representation enhancement after the residual structure, so that the model can grasp both shallow local important information and deep global semantic information. The study by Srinivas et al. 29 shows that a hybrid design combining convolution and self-attention enables a model to effectively capture local and global information, establish long-range dependencies, collect and correlate scene information, obtain global features of the data, and strengthen the relationships between objects. This provides strong theoretical support for the design of MAEM. The structure of MAEM is shown in Fig. 3 and consists of two components. The first component is a residual structure designed to extract feature information and generate feature maps of a specific size. Its operation can be expressed as x_i = Y_1 + Y_2, where x_{i−1} is the original input feature map, W_c^1, W_c^2 and W_c^3 represent the parameters of the convolution branch and the residual branch respectively, and Y_1 and Y_2 represent the outputs of the two branches.
The second part is MHSA with relative position encoding 30 . MHSA takes the output of the last convolutional layer in the residual structure as input and transforms the input features with linear transformation matrices W_Q, W_K and W_V to obtain the query, key and value. Relative position encoding is represented by R_H and R_W to improve the position awareness of MHSA. Previous studies have shown that relative position encoding is beneficial for visual tasks 31,32 , as it allows attention to take into account the relative distance between features at different positions. R_H and R_W are summed and dot-multiplied with the query to obtain the content-position correlation score; the key and query are dot-multiplied to obtain the content similarity score; the two scores are summed and converted into weights by the Softmax function; finally, the weights are applied to the value vectors of the feature map to generate the attention. If the input feature map is represented as x_i, the attention generation process can be summarized as Attention(Q, K, V, R) = Softmax(QK^T + Q(R_H + R_W)^T)V, where Q = x_i W_Q, K = x_i W_K and V = x_i W_V. The W_Q, W_K and W_V are parameter matrices learned independently for each attention head; they perform the linear transformations of the query, key and value. Instead of performing a single attention function, MHSA allows each attention head to learn a different feature representation, which increases the capacity and generalization of the model. MHSA divides the input features into N groups and computes single-head attention for each group using different transformation matrices and positional encodings; the single-head attentions are computed in parallel, and their outputs are concatenated and projected again to produce the final output, capturing information from different subspaces. The computation for each attention head can be written as Head_i = Attention(x^i W_i^Q, x^i W_i^K, x^i W_i^V, R_i) (1), where x^i represents the i-th group obtained by dividing the input feature map into N groups, W_i^Q, W_i^K and W_i^V are the linear transformation matrices for the i-th group, R_i is the positional encoding for the i-th head, and Head_i is the attention of the i-th head.
After obtaining the attention weights for each spatial dimension, the multi-space concatenation can be described as MHSA(Q, K, V, R) = Concat(Head_1, ..., Head_N) W^*, where W^* is the matrix used for the final linear transformation, and MHSA(Q, K, V, R) is the output of the MAEM.
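As a concrete illustration, the attention computation described above can be sketched in PyTorch roughly as follows. This is a BoTNet-style block with learned per-axis relative position encodings, not the authors' code; the tensor shapes, the 1/sqrt(d_k) scaling, and the fixed feature-map size are our assumptions.

```python
import torch
import torch.nn as nn


class MHSA2d(nn.Module):
    """Multi-head self-attention over a CxHxW feature map with learned
    relative position encodings R_H and R_W (a sketch of the mechanism
    used in MAEM; shapes are illustrative)."""

    def __init__(self, dim, heads=8, h=16, w=32):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)  # W_Q, W_K, W_V
        # per-axis relative position encodings, broadcast to H x W
        self.r_h = nn.Parameter(torch.randn(1, heads, self.dk, h, 1))
        self.r_w = nn.Parameter(torch.randn(1, heads, self.dk, 1, w))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)

        def split(t):  # (b, c, h, w) -> (b, heads, dk, h*w)
            return t.view(b, self.heads, self.dk, h * w)

        q, k, v = split(q), split(k), split(v)
        # content-content similarity score: q . k
        content = torch.einsum('bndi,bndj->bnij', q, k)
        # content-position correlation score: q . (R_H + R_W)
        r = (self.r_h + self.r_w).view(1, self.heads, self.dk, h * w)
        position = torch.einsum('bndi,ondj->bnij', q, r)
        # sum the scores, normalise with Softmax, apply to the values
        attn = torch.softmax((content + position) / self.dk ** 0.5, dim=-1)
        out = torch.einsum('bnij,bndj->bndi', attn, v)
        return out.reshape(b, c, h, w)
```

In a full MAEM the heads would correspond to the N groups described above, and the concatenated output would pass through a final projection W^*.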

Space concern distinguish module
In PINet, the original jumping layer of SHN consists of a convolution branch and a residual branch; its structure is given in Fig. 4. As discussed before, the two branches in the original jumping layer can integrate feature information from various scales, but they cannot effectively emphasize the crucial regions around lane features. As a result, the original design of SHN cannot adequately recognize and extract lane information. To solve this problem, a spatially separable convolution branch is introduced into the SCDM. The SCDM consists of three main parts; its structure is given in Fig. 5. The first part is the regular convolutional branch, which extracts features and adjusts the input feature map to an appropriate size for the subsequent fusion operations. Its output is denoted N(x_i, W_i), where x_i represents the input feature map and W_i the parameters of the convolution branch.
The second part is the residual block branch, which consists of four convolutional operations. Its output is denoted M(x_i, W_r), where W_r = {W_r^1, W_r^2, W_r^3, W_r^4} are the parameters of the residual branch. The third part is the spatially separable convolution branch. Inspired by Hou et al. 15 , it adopts a "split-merge" framework, using 1 × 3 and 3 × 1 spatially separable convolutions to spatially encode multiscale feature information. The original data is fed into the two branches for convolution, and pixel-level fusion is then carried out; the fused map is passed through the Sigmoid function to obtain a normalized result, and a channel reduction convolution produces the final output feature map O(x_i, W_c). In fact, this branch can be seen as an attention module, where O(x_i, W_c) acts as a weight map. The process can be described as O(x_i, W_c) = conv(σ(Z_1 + Z_2)), where W_c are the parameters of the spatially separable convolution branch, Z_1 and Z_2 are the outputs of the two separable convolutions, σ is the Sigmoid function, and conv denotes the convolutional block for channel reduction.
The obtained weight map is then multiplied with the original input feature map, giving each pixel region a separate weight. The purpose is to differentiate the regions of the original feature map and assign larger weights to key regions during subsequent backpropagation, thereby emphasizing the key areas. In SCDM, the height and width of the output feature map x_{i+1} are half those of the original input. Since O(x_i, W_c) has the same dimensions as the original feature map, another convolution is performed to adjust its dimensionality for the final fusion. This size-reducing convolution can be written as C(O(x_i, W_c), W_j), where x_i ⊙ O(x_i, W_c) is the result of fusing x_i with the weight map obtained from the spatially separable convolution, and W_j represents the adjustable parameters of this convolution. The output x_{i+1} is obtained by fusing the three branch results N(x_i, W_i), M(x_i, W_r) and C(O(x_i, W_c), W_j), i.e. the outputs of the convolution branch, the residual branch, and the dimension-reduced product of the spatially separable convolution branch with the input feature map.
As shown in Eqs. (7)-(13), the SCDM works as follows. Based on the "split-merge" dual-branch structure, it uses spatially separable convolution to obtain weight maps, which are then fused with the original input feature map and the output feature maps of the residual and convolution branches. This fusion not only extracts important features but also emphasizes key regions with higher weights, capturing global lane features while paying more attention to the various forms of lane information and the surrounding key regions in the feature map. SCDM is therefore beneficial for learning more discriminative features during lane detection.
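The spatially separable attention branch described above can be sketched as follows. This is an illustrative PyTorch sketch, not the authors' implementation: the parallel arrangement of the 1 × 3 and 3 × 1 convolutions and the channel counts are assumptions, and the final size-halving convolution C(·, W_j) is omitted.

```python
import torch
import torch.nn as nn


class SpatialSeparableBranch(nn.Module):
    """Sketch of the SCDM weight-map branch: 1x3 and 3x1 convolutions,
    pixel-level fusion, Sigmoid normalisation, a channel-adjustment
    convolution, and re-weighting of the input feature map."""

    def __init__(self, in_ch):
        super().__init__()
        self.conv1x3 = nn.Conv2d(in_ch, in_ch, (1, 3), padding=(0, 1))
        self.conv3x1 = nn.Conv2d(in_ch, in_ch, (3, 1), padding=(1, 0))
        self.reduce = nn.Conv2d(in_ch, in_ch, kernel_size=1)

    def forward(self, x):
        z1 = self.conv1x3(x)            # Z1: horizontal spatial encoding
        z2 = self.conv3x1(x)            # Z2: vertical spatial encoding
        w = torch.sigmoid(z1 + z2)      # pixel-level fusion + normalisation
        w = self.reduce(w)              # O(x_i, W_c): per-pixel weight map
        return x * w                    # weight map applied to the input
```

In the full SCDM this re-weighted map would be convolved down to half resolution and fused with the convolution and residual branch outputs.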

Strip pooling module
In instance segmentation tasks, achieving fine-grained segmentation of feature maps at different positions is highly challenging. Lane line features are usually distributed at different locations in an image and are often accompanied by occlusion and wear. Additionally, owing to perspective, the width of lane lines varies between near and far regions. Since lane lines are long strips, straight or curved, that can appear in any local region of the image, the model must pay close attention to these local details when segmenting the feature maps. When dealing with complex scenes containing lane lines and other objects, traditional average pooling inevitably includes many irrelevant regions, as shown by the red grid in Fig. 6, and is limited in its flexibility to capture the anisotropic context that widely exists in real-world scenes, especially for long strip-like objects such as lane lines.
Inspired by Hou et al. 18 , this paper introduces the SPM before the output branch of the hourglass block. Figure 7 shows the architecture of the SPM. Accurately extracting lane line features enables the model to improve segmentation accuracy, and hence detection performance, particularly at remote edge positions.
The SPM introduces the concept of strip pooling. In strip pooling, a 2D tensor x ∈ R^{H×W} is pooled spatially along either the horizontal or the vertical dimension by averaging the feature values in each row or column. The output of horizontal strip pooling is denoted z^h ∈ R^H and that of vertical strip pooling z^v ∈ R^W; they can be expressed as z_i^h = (1/W) Σ_{0≤j<W} x_{i,j} and z_j^v = (1/H) Σ_{0≤i<H} x_{i,j}. Pooling layers with long, narrow kernel shapes play a crucial role in capturing long-range dependencies in the sparsely distributed regions of lane lines and in encoding band-shaped regions; at the same time, because the kernel is narrow along the other dimension, they effectively capture local details. Moreover, strip pooling enables the collection of remote context from different spatial dimensions.
Assume an input tensor x ∈ R^{C×H×W}, where C is the number of channels. The tensor is first fed into two branches, each consisting of a horizontal or vertical strip pooling layer. A 1D convolutional layer with kernel size 3 is then applied to modulate each location with its neighboring features, producing two intermediate outputs z^h ∈ R^{C×H} and z^v ∈ R^{C×W}.
To obtain the final output tensor Y ∈ R^{C×H×W}, which contains more informative global priors, z^h and z^v are combined by broadcasting into a tensor z ∈ R^{C×H×W} with z_{c,i,j} = z_{c,i}^h + z_{c,j}^v, and the output is Y = Scale(x, σ(l(z))), where Scale denotes element-wise multiplication, σ the Sigmoid function, and l a 1 × 1 convolution. During this aggregation, each position in the output tensor is connected to multiple positions in the input tensor: in Fig. 7, the square in the output tensor is connected to all positions sharing the same horizontal or vertical coordinate. By repeating this aggregation iteratively, long-term dependencies can be established across the entire scene. Unlike global average pooling, strip pooling focuses on long but narrow ranges rather than the entire feature map. This avoids unnecessary connections between remote locations and highlights important areas at remote edge positions, effectively reducing the impact of distance changes on remote lane detection.
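The strip pooling aggregation described in this section can be sketched in PyTorch as follows (after Hou et al.; the layer sizes and the use of 2D convolutions with (3, 1) and (1, 3) kernels to stand in for the 1D modulating convolutions are our assumptions):

```python
import torch
import torch.nn as nn


class StripPooling(nn.Module):
    """Sketch of the SPM: horizontal/vertical strip pooling, per-strip
    modulation, broadcast fusion, Sigmoid gating, and element-wise
    rescaling of the input (Y = Scale(x, sigmoid(l(z))))."""

    def __init__(self, ch):
        super().__init__()
        self.conv_h = nn.Conv2d(ch, ch, (3, 1), padding=(1, 0))
        self.conv_v = nn.Conv2d(ch, ch, (1, 3), padding=(0, 1))
        self.fuse = nn.Conv2d(ch, ch, kernel_size=1)  # the 1x1 convolution l

    def forward(self, x):
        zh = x.mean(dim=3, keepdim=True)  # horizontal strip pool: (b, c, h, 1)
        zv = x.mean(dim=2, keepdim=True)  # vertical strip pool:   (b, c, 1, w)
        zh = self.conv_h(zh)              # modulate along each strip
        zv = self.conv_v(zv)
        z = zh + zv                       # broadcast-add to (b, c, h, w)
        return x * torch.sigmoid(self.fuse(z))
```

Each output position is thereby influenced by every input position sharing its row or column, which is how the long-range dependencies discussed above are formed.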

Experiment
Dataset
The experiments in this paper use two widely recognized lane detection benchmark datasets: the TuSimple dataset 33 and the CULane dataset 6 .
The TuSimple dataset is one of the large-scale datasets currently used for lane detection, provided by TuSimple, and includes 3268 training images, 358 validation images, and 2782 test images at a resolution of 1280 × 720 pixels. The scenes were filmed mainly on highways, mostly in sunny weather with sufficient lighting and clear lane markings. Figure 8 shows some example images. Some images contain worn or blurred lane markings, vehicle occlusion, intersections, and high-curvature curves, which play an important role in comprehensively evaluating the robustness and generalization ability of an algorithm.
CULane is a challenging and broadly scoped lane marking dataset published in 2018. It compensates for the single-scene nature of the TuSimple dataset and better measures the comprehensive performance of a lane detection algorithm. The dataset contains 133,235 images divided into a training set, a validation set, and a test set of 88,880, 9675, and 34,680 images respectively.
The CULane dataset covers nine complex scenarios, as shown in Fig. 9: Normal, Crowded, Night, No line, Shadow, Arrow, Dazzle light, Curve, and Crossroad. The images come from various scenes in Beijing and its surrounding rural areas, including urban roads and highways, covering different times, weather, and lighting conditions. Some example images of the CULane dataset are shown in Fig. 10, and Table 1 provides additional details about both datasets.
For the TuSimple dataset, this paper uses accuracy as the main evaluation metric, evaluated with the independent evaluation source code. Accuracy measures the correctness of the points predicted on the lane lines of a given image: it is the ratio of the number of correctly predicted lane points to the total number of ground-truth points. In addition, this paper also evaluates the false positive (FP) and false negative (FN) rates of the predictions, computed as FP = F_pred / N_pred and FN = M_pred / N_gt, where F_pred is the number of incorrectly predicted lanes, N_pred the total number of predicted lanes, M_pred the number of missed lanes, and N_gt the actual number of lanes.
For the CULane dataset, lane lines are considered to be 30 pixels wide. Consistency between predictions and ground truth is measured with the Intersection over Union (IoU) criterion, and the CULane dataset uses the F1 score for performance evaluation, which takes both precision and recall into account and is widely used in classification problems. The F1 score lies in [0, 1]; the closer it is to 1, the better the model. It is calculated as F1 = 2 × Precision × Recall / (Precision + Recall). Precision is the proportion of samples predicted as positive by the model that actually belong to the positive category, while Recall is the proportion of all true positive samples that the model predicts as positive. The IoU threshold is set to 0.5: a true positive (TP) is a prediction with IoU greater than 0.5, an FP is a prediction with IoU less than 0.5, a true negative (TN) means no such lane exists and none is predicted, and an FN is a missed detection, where a lane exists but none is predicted. Precision and Recall are computed as Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
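The evaluation quantities defined above can be computed directly from the stated counts. The following is a plain-Python sketch of those formulas only, not the official TuSimple or CULane evaluation code; the function and argument names are our own.

```python
def tusimple_accuracy(correct_pts, gt_pts):
    """Accuracy = correctly predicted lane points / ground-truth points,
    summed over the evaluated images."""
    return sum(correct_pts) / sum(gt_pts)


def fp_rate(wrong_lanes, pred_lanes):
    """FP = F_pred / N_pred."""
    return wrong_lanes / pred_lanes


def fn_rate(missed_lanes, gt_lanes):
    """FN = M_pred / N_gt."""
    return missed_lanes / gt_lanes


def f1_score(tp, fp, fn):
    """F1 from TP/FP/FN counts at a fixed IoU threshold (0.5 for CULane)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, 8 true positives with 2 false positives and 2 false negatives gives precision = recall = 0.8 and hence F1 = 0.8.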

Implementation details
In this research, the original images from both the TuSimple and CULane datasets are resized to 512 × 256 pixels and the RGB values are normalized from the range 0–255 to 0–1. To enhance the dataset, augmentations such as shading, noise addition, flipping, panning, rotating, and intensity change are applied. Additionally, to tackle the data imbalance caused by the large number of image frames in both datasets, hard samples with poor loss values are drawn more often during training to increase the selection ratio of challenging data.
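The normalization and one of the augmentations (horizontal flipping, which must also mirror the lane x-coordinates) can be sketched as below. This is a simplified stand-in for the paper's pipeline: the resize step would use an image library such as OpenCV, and the function names here are hypothetical.

```python
import numpy as np

def normalize(img_uint8):
    # img_uint8: H x W x 3 array in [0, 255]; in the full pipeline the image
    # is first resized to 256 x 512 (H x W) before this step.
    return img_uint8.astype(np.float32) / 255.0

def hflip(img, lane_xs, width):
    # Mirror the pixels and remap each lane key point's x-coordinate.
    flipped = img[:, ::-1]
    return flipped, [width - 1 - x for x in lane_xs]
```

Hard-sample mining would sit on top of this: samples whose loss stays high are given a larger probability of being drawn in later epochs.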
The experiments are conducted on a GPU (RTX 3060 Ti, 8 GB) and the code is implemented in PyTorch. During training, each batch consists of 2 images; the learning rate is set to 0.0001 for TuSimple and 0.00001 for CULane. The training epoch for all datasets is set to 1500. The threshold and other hyperparameters are determined through experimentation.
The LHFFNet model achieves accurate lane line predictions by predicting the exact positions of lane points and fitting these points with spline curves.
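The point-prediction-plus-curve-fitting step can be illustrated as follows. The paper fits spline curves; here a cubic polynomial x = f(y) stands in as a simpler approximation, and the function name and sample points are hypothetical.

```python
import numpy as np

def fit_lane(points, sample_ys):
    # points: list of (x, y) key points predicted for one lane.
    # Fit x as a polynomial in y (lanes are roughly functions of the row
    # index in a front-facing view), then sample x at the requested rows.
    xs, ys = zip(*points)
    degree = min(3, len(points) - 1)
    coeffs = np.polyfit(ys, xs, deg=degree)
    return np.polyval(coeffs, sample_ys)
```

A spline (e.g. `scipy.interpolate.splprep`) would pass exactly through the key points, while this least-squares polynomial smooths over small localization errors, which is often the desired behavior when far-end points are noisy.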

Comparison experiment
In order to demonstrate the advantages of the proposed method in the lane detection task, this study compares it with other popular methods on different datasets. The experiments are conducted with SHN as the backbone, and the resulting model is labelled LHFFNet. This paper explores the impact of the number of heads on the lane detection performance of LHFFNet by setting the number of MHSA heads in MAEM to 2, 4, 8, 16 and 32. Figures 11 and 12 show the impact of the number of heads on the detection performance of LHFFNet on the TuSimple and CULane datasets, respectively. As LHFFNet performs better with 8 and 16 heads, the remaining experiments use these two settings.
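The head-count hyperparameter splits the embedding dimension evenly across heads, so each head attends over a smaller subspace. A minimal numpy sketch of multi-head self-attention illustrates the mechanism; it uses identity Q/K/V projections for brevity, whereas the MAEM in the paper uses learned projections plus relative position encoding.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, num_heads):
    # x: (seq_len, dim). Each head works on dim // num_heads channels,
    # so more heads means more, narrower attention subspaces.
    seq_len, dim = x.shape
    assert dim % num_heads == 0, "dim must divide evenly across heads"
    d = dim // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = x[:, h * d:(h + 1) * d]   # identity projections (sketch)
        attn = softmax(q @ k.T / np.sqrt(d))  # scaled dot-product attention
        heads.append(attn @ v)
    return np.concatenate(heads, axis=1)      # concat restores (seq_len, dim)
```

Because `dim` is fixed, doubling the heads halves the per-head dimension, which is why performance does not grow monotonically with head count, as Figs. 11 and 12 indicate.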
The paper presents the results of LHFFNet on the TuSimple dataset and compares them with popular lane detection methods, including SCNN, RESA, PINet, ENet-SAD, LaneNet, PolyLaneNet, UFLD, LaneATT and CondLaneNet34. As shown in Table 2, the TuSimple evaluation requires precise x-axis values at certain fixed y-axis positions. LHFFNet shows high performance in terms of accuracy and false positive rate, and its false negative rate is also reasonable. LHFFNet(8) and LHFFNet(16) denote LHFFNet with 8 and 16 heads, respectively, in MHSA. LHFFNet(8) has lower FP and FN values at the same detection accuracy as RESA, suggesting that the method detects lane lines more accurately. In particular, LHFFNet(16) achieves an advanced accuracy of 96.84% when the number of heads is set to 16. Compared to the PINet network, LHFFNet(16) improves accuracy by nearly 0.1% on the TuSimple dataset. This is because the TuSimple scenes are relatively simple; as the results of the other models show, most can reach detection accuracy above 96%. Nevertheless, compared with other methods, LHFFNet still achieves an accuracy of 96.84%, with FP and FN reduced to 2.66% and 2.44%, respectively, which means LHFFNet attains higher accuracy on the lane line detection task.
Figure 13 shows a comparison of the specific detection results of the proposed model LHFFNet on the TuSimple dataset with PINet.
In simple road scenes with few lanes and obvious lane line features, both LHFFNet and PINet achieve relatively accurate detection results. However, PINet may show slight deviation in the predicted lane lines at the near and far ends; especially when far-end features are not obvious and vehicles cause occlusion, its detection results show significant deviation and point accumulation. LHFFNet achieves more accurate detection at both the far and near ends, and the predicted points closely cover the real lane line. Whether at the near or far end, the detection results of LHFFNet greatly reduce deviation and point accumulation. In curve detection with vehicle occlusion at the far end, LHFFNet also fits the lane line more smoothly.
This paper conducts comparative experiments on representative algorithms of recent years on CULane, mainly comparing SCNN, LaneATT, RESA, PINet, UFLD, and LaneAF35. The specific results in each scenario are shown in Table 3. Comparing LHFFNet with other models when the number of heads is 16, Table 3 shows that the F1 score of LHFFNet(16) is 75.9%, exceeding several previous lane detection methods. In the Crowd and No Line scenarios, it achieves the best results among the compared models. Compared to PINet, detection results improve in every scenario: by 3.5% in the Shadow scenario, 4.1% and 4.4% in the Arrow and Curve scenarios, and 2.9% in the Night scenario; in the other scenarios (except Cross), the improvement over PINet exceeds 1%. LHFFNet thus improves on PINet, and its detection performance is also highly competitive against other mainstream models. Figure 14 compares the detection results of LHFFNet in multiple CULane scenes; from top to bottom, they are Normal, No line, Dazzle light, Shadow, Arrow and Night. LHFFNet achieves good detection results in these complex road scenes. As Fig. 14 shows, LHFFNet fits the lane lines well in every scenario, and the lane lines it fits extend farther and have a smoother shape than those fitted by PINet. For areas that PINet fails to predict, LHFFNet predicts lane lines with slightly sparse key point arrangements, demonstrating the superiority of the hybrid feature fusion of LHFFNet in lane line modeling.

Ablation experiment
The effectiveness of MAEM, SCDM and SPM is discussed in detail in this paper, and the advantages of each module are analyzed. To verify the effectiveness of these components, experiments are carried out on the TuSimple and CULane datasets. LHFFNet uses SHN as the backbone for extracting lane feature information. The ablation experiments are conducted with the number of heads in MAEM set to 8 and 16. Tables 4 and 5 show the ablation results on TuSimple, and Tables 6 and 7 show the ablation results on CULane. The experiments fully demonstrate the influence of the number of heads and of each module on the baseline, and confirm the capability of the proposed modules. The ablation study uses the official metrics to calculate the average performance gap.
First, the number of heads is set to 8. On the TuSimple dataset, adding the MAEM module to the baseline network improves accuracy to 96.77%. Adding MAEM and SCDM yields a further small improvement in accuracy, and FP decreases compared to the baseline, meaning the overall detection performance of the model is more stable. Adding MAEM and SPM yields an accuracy of 96.79%, with a slight increase in FN. With all three modules added to PINet, accuracy is 96.82%, FP is 2.43%, and FN reaches its lowest level, 2.36%. On the CULane dataset, the baseline network increases F1 to 74.9% after adding MAEM and 75.2% after adding SCDM; finally, with the addition of SPM, the F1 score increases to 75.5%. Although adding SCDM increases the number of parameters, it does not affect the real-time detection speed of the model. Next, the number of heads is set to 16. On the TuSimple dataset, adding the MAEM module to the baseline network improves accuracy to 96.77%. Adding MAEM and SPM improves accuracy to 96.80%, with FP decreasing to 2.52% and FN at 2.51%. Then, after adding SCDM, accuracy reaches 96.84%; compared to adding MAEM and SPM, FP slightly increases and FN slightly decreases, remaining stable overall. On the CULane dataset, the baseline network increases F1 to 75.1% after adding MAEM, then to 75.3% after adding SCDM, and finally to 75.9% after adding SPM.

Conclusion
This paper proposes a multi-scale and multi-space feature enhancement and fusion method for lane detection, LHFFNet. The method designs a MAEM that combines MHSA to enhance the correlation of multi-scale lane features, using multi-head self-attention for multi-space attention enhancement and strengthening the output features for subsequent fusion. SCDM introduces spatial separable convolutional branches to connect multi-scale lane line features, retaining feature information at different scales and emphasizing important lane areas in feature maps with variable spatial distribution. SPM flexibly refines lane features at edge positions (especially at the far end), obtaining more accurate contextual information and facilitating accurate modeling of subsequent lane lines. LHFFNet further improves the detection performance of the original model for lane markings, which is of great significance for the safety of autonomous driving. However, for heavy rain and muddy roads, the detection performance of the model still needs to be enhanced. Future work will focus on addressing this limitation by constructing a dataset for rainy driving and conducting the corresponding model construction and testing.


Figure 1. Lane line images in different scenarios.

Figure 2. Overview of LHFFNet. LHFFNet consists of a resizing network and a predicting network. The predicting network consists of four hourglass blocks.

Figure 3. The structure of the MAEM. The first part is a residual structure, and the second part is MHSA using relative position encoding.

Figure 4. The original jumping layer structure of PINet. It contains only one convolution branch and one residual branch.

Figure 5. The structure of SCDM. Compared to Fig. 4, it adds a spatial separable convolution branch to re-encode multiscale information and filter important regions.

Figure 6. Strip pooling and spatial pooling have different effects on analyzing different scenes. Compared to traditional pooling, strip pooling (green grid) can better extract local information of the lane and capture long-range dependencies (horizontal green squares).

Figure 7.

Figure 11. Comparison of accuracy results for different numbers of MHSA heads on the TuSimple dataset, with the number of heads set to 2, 4, 8, 16 and 32.

Figure 12. Comparison of F1 results for different numbers of MHSA heads on the CULane dataset, with the number of heads set to 2, 4, 8, 16 and 32.

Table 2. Comparison of results with other models on the TuSimple dataset.

Table 3. Comparison of results with other models on the CULane dataset (only FP is shown for Cross).

Table 4. Modular ablation study on the TuSimple dataset with the number of MHSA heads set to 8; the baseline represents the results without any substitution or introduction operations.

The experimental results show that, regardless of whether the number of heads is 8 or 16, the accuracy on TuSimple and the F1 score on CULane increase overall as modules are added, indicating that each module is effective in improving model performance.