Element detection and segmentation of mathematical function graphs based on improved Mask R-CNN

There are approximately 2.2 billion people around the world with varying degrees of visual impairment. Among them, individuals with severe visual impairments predominantly rely on hearing and touch to gather external information. At present, reading materials for the visually impaired are limited and mostly take the form of audio or text, which cannot satisfy the need of the visually impaired to comprehend graphical content. Although many scholars have investigated methods for converting visual images into tactile graphics, tactile graphic translation still fails to meet the reading needs of visually impaired individuals because of the diversity of image types and the limitations of image recognition technology. The primary goal of this paper is to enable the visually impaired to gain a greater understanding of the natural sciences by transforming images of mathematical functions into an electronic format for the production of tactile graphics. To enhance the accuracy and efficiency of graph element recognition and segmentation for function graphs, this paper proposes an MA Mask R-CNN model, which uses MA ConvNeXt, a novel feature extraction network proposed in this paper, as its backbone and MA BiFPN, a novel feature fusion network proposed in this paper, as its feature fusion network. The model combines local relations, global relations and channel information to form an attention mechanism that is able to establish multiple connections, and it combines a variety of multi-scale features, thus increasing the detection capability of the original Mask R-CNN model on slender and multi-type targets. Finally, the experimental results show that MA Mask R-CNN attains an 89.6% mAP value for


Introduction
A survey of the World Health Organization reveals that in 2019, there were about 2.2 billion individuals globally with visual impairments [1], and in 2020, the total number of people who were completely blind worldwide amounted to approximately 75 million. With impaired visual functions, people with severe visual impairments cannot experience the same visual pleasures as those with normal vision.
At present, people with severe visual impairments mostly depend on the auditory and tactile senses in order to gain external information. For example, audiobooks can be listened to, as can auditory readers and computer screen reading software. Meanwhile, there are texts and tactile graphics that can be felt by the fingers. These include paper braille publications and paper tactile graphics, as well as Braille e-books, Braille displays and tactile graphic displays [2,3].
A survey found that the China Braille Publishing House produces and distributes approximately 1000 titles of Braille books and periodicals each year, with a total circulation of around 320,000 copies. This averages at about 12,900 titles and 40 copies per publication, which is significantly lower than the average number of publications available to those with normal vision [4]. At present, the majority of reading materials for visually impaired individuals are either audiobooks or braille books. However, these options do not adequately satisfy the need of those with vision impairments for perception of images, which must be converted into tactile graphics in order for them to be touch-read by the blind. Tactile graphics refers to images which can be perceived through touch, composed of convex and concave lines, points and surface textures. Currently, the production of tactile graphics is not fully automated, and most of this work is done by hand, requiring a great deal of time and labor. Furthermore, the production of tactile graphics requires personnel with specialized knowledge to complete. So far, most of this work is done by teachers in schools for the visually impaired. Due to varying abilities and experiences in reading tactile maps, different blind people require tactile graphics of the same content to be presented in different forms, such as different sizes and hierarchical representations. Therefore, in order to reduce the burden on educators, enhance the efficiency of the production of tactile graphics and reduce production costs, the development of an automated tactile graphics creation method with the help of computer technology is of great significance.
The most critical aspect of creating books for the visually impaired is to make them electronic, as electronic books can be easily accessed by the visually impaired in the form of multimedia or tactile graphics. Currently, various scientists and organizations have devoted their attention to the conversion of printed books into electronic books (abbreviated as e-books). The Digital Accessible Information System (DAISY) alliance [5], a joint effort of European and American countries, aims to provide a universally accessible solution for people with dyslexia to access digital audiobooks. The specifications necessary for creating DAISY files include 1) text files (.html or .xml) prepared in HTML (Hypertext Markup Language) or XML (Extensible Markup Language), 2) audio files (.mp3, .wav, etc.), 3) a text and voice synchronization file (.smil) and 4) image files in scalable vector graphics (SVG) format written in XML [6][7][8]. At present, the production of digital audiobooks based on DAISY standards has been widely accepted by countries around the world. Furthermore, the Electronic Publication (EPUB) [9] standard set by the International Digital Publishing Forum (IDPF) is compatible with DAISY standards in terms of document and image preservation. Consequently, during the electronic conversion of scientific publications, images need to be converted from the Dot Image format to SVG format. Nevertheless, at the moment, paper images can only be saved as bitmaps by scanning, and bitmaps cannot be automatically converted to SVG. Taking the mathematics textbook in Figure 1 as an illustration, the basic steps of book digitization are as follows: First, the separation of text content (including mathematical expressions) and graphs can be accomplished using the text-graph separation technology proposed by P. P. Rege et al. [10], the end-to-end trainable neural network proposed by P. Lyu et al. [11] or the character region awareness technology proposed by Y.
Baek et al. [12]. Second, the recognition and digitization of text content can be achieved through the use of optical character recognition (OCR) [13] technology. The development of OCR technology has reached a high level of maturity, and existing technologies such as Tesseract OCR [14] and East [15] make this task easier. The mathematical expressions within the text can be recognized and digitized with InftyReader [16], which is developed by the InftyProject [17]. In addition, deep neural networks with sequence-level training proposed by Z. Wang et al. [18] can also be utilized for this purpose. Both of these techniques can translate the original mathematical expressions in the textual content of books into formats such as LaTeX and MathML. Lastly, the segregated graphs need to be identified and digitized. Currently, although the ImageCaption [19] method such as the end-to-end transformer based model proposed by Y. Wang et al. [20] can achieve semantic description of images, it is primarily applied for describing daily actions. Upon examining Figure 1, it is evident that the function graphs comprise coordinate axes and quadratic functions. Currently, there is no available technology to recognize and convert the graph elements into SVG format for digitizing function graphs.
During the electronic conversion of graphs, there are a series of research methods [21][22][23][24] that can digitize function graphs; however, these methods require that the graphs meet certain criteria. For instance, the graphs must be confined to a specific area (delineated by the x and y axes). A. Balaji et al. [25] analyzed the labeled area through the connected component and fitted the labeled area through the minimum surrounding rectangle, but they only managed to extract the bar graph in the chart. Some scholars, such as J. Chen et al. [26], recognize and redraw function graphs using pattern recognition and save them as SVG. However, the recognition accuracy is low, and it does not have real-time processing capabilities. On one hand, graph acquisition generally introduces noise; on the other, the feature learning abilities of the respective methods may not be sufficient. When compared to traditional graph recognition methods, deep learning models are able to automatically learn the features of the training data, and they possess stronger feature learning abilities, which in turn results in a better performance when using deep learning models for graph recognition. J. Staker et al. [27] successfully applied the deep learning method to recognize the visual representation of chemical molecular structures. Likewise, M. Oldenhof et al. [28] were also able to identify optical patterns of compounds using a deep learning model. Many computer vision tasks today incorporate deep learning techniques. Deep learning has found widespread application across industries, leveraging its high adaptability, powerful feature learning capabilities, ability to handle large-scale data and efficiency. Thus, building upon these strengths, this paper takes a deep learning approach to tackle the instance segmentation of function graphs. In terms of feature extraction in deep learning, the Transformer network proposed by A. Vaswani et al. 
[29] quickly dominated the field of natural language processing as soon as it was introduced. With the subsequent application of the Transformer and its improved models, such as ViT proposed by A. Dosovitskiy et al. [30] and Swin Transformer proposed by Z. Liu et al. [31], in the computer vision domain, networks with a Transformer backbone have become a popular direction, surpassing traditional convolution-based networks in various tasks and achieving state-of-the-art rankings. However, the emergence of ConvNeXt has shifted scholars' attention back to convolutional networks. Z. Liu et al. [32] demonstrated through experiments that ConvNeXt outperforms Swin Transformer and achieves state-of-the-art performance in various image processing tasks with lower computational requirements. For function graphs, ConvNeXt, compared to Transformers, leverages the local connectivity and weight sharing properties of convolutional operations, enabling better capture of spatial information and local features and thereby enhancing feature representation capabilities. In terms of feature fusion in deep learning, BiFPN [33] employs a simple and efficient network structure that addresses the limitations of information propagation and loss in traditional fusion networks while maintaining low computational and storage costs. For function graphs, the multi-scale feature fusion approach in BiFPN offers translational and scale invariance, improving the model's robustness and generalization to object deformations, occlusions and scale variations. Based on the analysis above, this paper explores improvements using ConvNeXt and BiFPN as the base models.
This paper proposes an improved instance segmentation model based on Mask R-CNN [34], which leverages the benefits of convolution and self-attention and effectively improves the accuracy of the model in detecting and segmenting graph elements in function graphs. The contributions of this paper are summarized as follows: 1) For feature extraction, ConvNeXt is used as the base model. By adding an attention module made up of local detail features, remote information features and channel information features to the ConvNeXt block module, the improved ConvNeXt is called MA ConvNeXt in this paper. MA ConvNeXt is able to amplify its ability to extract local information, global information and varied channel feature information.
2) For feature fusion, this paper proposes a new model by adding an Atrous Spatial Pyramid Pooling (ASPP) [35] module and Residual Feature Augmentation (RFA) [36] module to BiFPN, which is called MA BiFPN. The proposed model can enhance the detection capability at different scales.
3) In order to improve the segmentation capability of the model, the aspect ratio of the proposed candidate anchor boxes in the Region Proposal Network (RPN) layer of the model is modified, and a penalty factor is included in the calculation of the Intersection-over-Union (IoU) metric in order to create a margin space in the target detection box.
4) The MA Mask R-CNN model is proposed to detect and segment 13 kinds of subdivided mathematical function graphs, reaching a detection accuracy of 89.6% and mask segmentation mAP score of 72.3%. Through multiple comparison tests, it has been proven that MA Mask R-CNN performs better in terms of category average accuracy and mask segmentation quality of function graphs.
The remainder of this paper is organized as follows. Section 2 introduces the structure of the original Mask R-CNN model. Section 3 describes the improved network structure proposed in this paper and details its specific improvements and loss functions in comparison to the original Mask R-CNN model; these improvements cover feature extraction, feature fusion, the region proposal network and the mask segmentation method. Section 4 reports on the datasets used, as well as the comparative experimental results and corresponding analyses of each model on the dataset. Finally, Section 5 summarizes the work of this paper and outlines future directions and key points.

Mask R-CNN
Mask R-CNN extends Faster R-CNN [37] with an additional branch for predicting the segmentation mask of each target. Since it was proposed, it has achieved good results in object detection and segmentation, and it is a classic model for the instance segmentation task. Figure 2 shows the overall framework of Mask R-CNN. The steps are as follows:
1) The original image is input to the Residual Network (ResNet) [38], a feature extraction network that generates feature maps.
2) The feature maps of the four stages in the feature extraction network are input to the Feature Pyramid Network (FPN) [39], a feature fusion network that produces new feature maps combining high-level feature maps with low-level feature maps.
3) The feature maps generated by FPN are input into the RPN and RoI Align. Anchor boxes and the anchor confidence of the feature maps are obtained in the RPN, and Non-Maximum Suppression (NMS) is used to eliminate anchor boxes with low confidence. Since the anchor sizes generated in the RPN are inconsistent, RoI Align maps anchors of different sizes to a fixed size.
4) After the above steps, the feature maps fixed by RoI Align are input into the fully connected layers (FCL) and the Fully Convolutional Network (FCN). The bounding box and class confidence of the target instance are output by the FCL, and the mask of the target instance is output by the FCN.
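The NMS step mentioned in step 3 can be illustrated with a generic greedy sketch. This is not the paper's code: the `(x1, y1, x2, y2)` box format, the function names and the threshold value are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping proposals and one separate proposal.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)  # -> [0, 2]
```

The second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed at a threshold of 0.5, while the distant third box survives.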

Overall architecture of MA Mask R-CNN
MA Mask R-CNN is an improved model based on Mask R-CNN. In MA Mask R-CNN, the ResNet feature extraction network of Mask R-CNN is replaced by the MA ConvNeXt network, the original FPN feature fusion network is replaced by MA BiFPN, and the original FCN mask head is replaced by PointRend [40]. The overall framework of the model is shown in Figure 3. The process of the MA Mask R-CNN model is as follows:
1) Feature extraction. The mathematical function graph is input into the feature extraction network, MA ConvNeXt, which generates feature maps at each of its four feature extraction stages. Compared with the original ConvNeXt, MA ConvNeXt adds an attention module that captures local, global and channel information at the same time, so that the generated feature maps carry richer semantic information and better localization capability.
2) Feature fusion. The feature maps generated by MA ConvNeXt are input into the MA BiFPN feature fusion network. To enhance the fusion network's ability to recognize slender targets and to fuse multi-scale information, an RFA module is added to the original BiFPN after the top-level semantic information of the top-to-bottom branch, and ASPP modules are added to the input feature maps of the top-to-bottom and bottom-to-top branches.
3) Adjusted RPN. So that the detection box does not fit completely tightly around the target, the size proportion of the anchor boxes is appropriately increased, and a penalty factor is added to the IoU to obtain a more suitable detection box and enhance the segmentation effect. The adjusted RPN classifies foreground and background and performs bounding box regression. After NMS removes the anchor boxes with low confidence, the proposal regions are generated.
4) Prediction. The proposal regions are passed to RoI Align, which outputs feature maps of the same size, and the graph elements are classified and regressed to generate the final prediction boxes and segmentation masks. In the mask branch, PointRend is used instead of FCN to obtain a smoother and finer pixel mask.

MA ConvNeXt network for feature extraction
For feature extraction, this study proposes an improved network named MA ConvNeXt, which is based on ConvNeXt. The overall network structure of MA ConvNeXt is shown in Figure 4. The ConvNeXt-T network is selected as the foundation network. Compared to Transformer-based networks, ConvNeXt does not require complex operations such as block merging, sliding windows and relative location indexing, and it thus provides superior performance with less computation. ConvNeXt borrows its network structure design from the Transformer network and replaces and optimizes the corresponding components with convolutional methods. The MA ConvNeXt network constructed in this study is mainly composed of convolution layers called MA ConvNeXt Blocks. The MA ConvNeXt Block is the core part of the feature extraction network and is responsible for extracting features of mathematical function graphs. In order to deepen the network and obtain better and richer features, the numbers of stacked MA ConvNeXt Blocks in the four stages are set to (3, 3, 9, 3), following the original ConvNeXt-T network structure. In MA ConvNeXt, a 4 × 4 convolution kernel with a stride of 4 is first used for the convolution operation, and layer normalization is then carried out to initially refine the features and improve their distribution in the feature map. After the above processing, the feature maps are input into the MA ConvNeXt Block for attentional multi-scale feature extraction. The general structure of the MA ConvNeXt Block is shown in Figure 5(a). In the MA ConvNeXt Block, a depthwise convolution (Dw Conv) is first carried out on the features. In order to make effective use of the information of different channels at the same spatial position, the processed features are layer-normalized (LN) and then input to a pointwise convolution (Pw Conv), which generates a new feature map by a weighted combination of the independent input feature maps in the depth direction.
Together, these operations greatly reduce the amount of computation and the number of parameters, so the network can be made deeper with the same parameter budget. After an activation function introduces nonlinearity, the feature maps are input into the Attention module.
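The parameter saving from pairing a depthwise convolution with a pointwise convolution can be checked with a short calculation. The sketch below ignores bias terms; the channel count of 96 and the 7 × 7 kernel are illustrative values assumed here, not figures stated in the text above.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of an ordinary k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depthwise convolution followed by a 1 x 1
    pointwise convolution (bias ignored)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 mixing across channels
    return depthwise + pointwise


channels, kernel = 96, 7  # assumed illustrative values
std = standard_conv_params(channels, channels, kernel)        # 451584
sep = depthwise_separable_params(channels, channels, kernel)  # 13920
ratio = std / sep  # roughly 32x fewer weights
```

With these values the separable form needs about 32 times fewer weights, which is the headroom that allows a deeper network at the same parameter budget.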
Figure 5. (a) Structure of the MA ConvNeXt Block. (b) The Attention module in (a), where K is the size of the convolution kernel, P is the padding and G is the grouping. Dw Conv is depthwise convolution, Pw Conv is pointwise convolution and Asym Conv is asymmetric convolution.
In this study, the Attention module is intended to capture local attention as well as good remote information. In general, local attention can be obtained with small convolution kernels, while remote information can be obtained only with large convolution kernels or with the self-attention mechanism. However, the self-attention mechanism destroys the two-dimensional structure of graphs, and large convolution kernels require a large amount of computation. To avoid the respective shortcomings of the self-attention mechanism and large-kernel convolution, the Attention module in the MA ConvNeXt Block is designed as shown in Figure 5(b). The Attention module can be expressed by Eqs (1) and (2).
$$A = \mathrm{PwConv}_{1 \times 1}\Big(\mathrm{DwConv}_{7 \times 7,\, d=3}\big(\mathrm{BN}(\mathrm{AsymConv}_{3 \times 1}(F)) \oplus \mathrm{BN}(\mathrm{AsymConv}_{1 \times 3}(F))\big)\Big) \tag{1}$$
$$\mathrm{Output} = A \otimes F \tag{2}$$
where $A \in \mathbb{R}^{C \times H \times W}$ denotes attention, and its value denotes the importance of different features. $F \in \mathbb{R}^{C \times H \times W}$ denotes the input feature maps. $\mathrm{AsymConv}_{3 \times 1}$ denotes asymmetric convolution with a convolution kernel of size 3 × 1, and $\mathrm{AsymConv}_{1 \times 3}$ denotes asymmetric convolution with a convolution kernel of size 1 × 3. ⊕ denotes element-wise addition, and ⊗ denotes an element-wise product.
The operation steps of the Attention module are as follows: First, the features are input into the asymmetric convolution [41]. An ordinary 3 × 3 convolution is replaced by two asymmetric convolutions of the feature maps whose kernels are (3,1) and (1,3), respectively. This method reduces the amount of computation and can not only enhance the feature extraction of slender objects such as mathematical function graphs but also make up for the information loss caused by the dilation rate of 3 in the following Dw Conv. Second, separate BN operations are carried out before the fusion of the two asymmetric convolution branch feature maps, which avoids a unified BN that would weaken the distinction between the feature maps of the different convolutions. Then, by applying a 7 × 7 large-kernel Dw Conv with a dilation rate of 3 followed by a 1 × 1 Pw Conv, the fused features after the BN operation obtain better remote information and a larger receptive field. Finally, the result of the previous step is used as the weight of the original feature maps, so as to play the role of attention.
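The kernel sizes quoted above translate into concrete weight counts and receptive-field sizes. The following is a small arithmetic sketch using the standard dilated-convolution formula; the helper names are hypothetical, and the padding rule assumes an odd kernel with stride 1.

```python
def effective_kernel(k, dilation):
    """Spatial extent covered by a k-tap kernel with the given dilation."""
    return dilation * (k - 1) + 1


def same_padding(k, dilation):
    """Padding that preserves spatial size for an odd kernel, stride 1."""
    return dilation * (k - 1) // 2


# Per-channel weights: one 3 x 3 kernel versus a (3,1) plus a (1,3) pair.
full_weights = 3 * 3          # 9
asym_weights = 3 * 1 + 1 * 3  # 6

# The 7 x 7 depthwise convolution with dilation rate 3.
k_eff = effective_kernel(7, 3)  # 19
pad = same_padding(7, 3)        # 9
```

A 7 × 7 kernel with a dilation rate of 3 therefore spans a 19 × 19 window while still using only 49 weights per channel. The skipped positions inside that window are what the preceding asymmetric convolutions are described as compensating for.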

The input features pass through the Attention module, and the attention-weighted features are output to the final part of the MA ConvNeXt Block. To prevent overfitting and improve generalization, DropOut and DropPath layers are added at the end of the MA ConvNeXt Block. Both work similarly: DropOut deactivates neurons at a certain rate, while DropPath deactivates the main branch structure at a certain rate. To integrate with the preceding content and provide a comprehensive and intuitive presentation of MA ConvNeXt, Algorithm 1 provides the pseudocode of MA ConvNeXt.
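The difference between DropPath and DropOut is easiest to see in code. Below is a minimal sketch of the common stochastic-depth formulation of DropPath for a single sample. The rescaling by the keep probability and the function signature are assumptions taken from that common formulation, not details given in this paper.

```python
import random


def drop_path(residual, branch, drop_prob, training, rng):
    """Add `branch` to `residual`, dropping the whole branch with
    probability `drop_prob` during training (stochastic depth)."""
    if not training or drop_prob == 0.0:
        return [r + b for r, b in zip(residual, branch)]
    keep_prob = 1.0 - drop_prob
    if rng.random() < keep_prob:
        # Rescale so the expected output matches inference behaviour.
        return [r + b / keep_prob for r, b in zip(residual, branch)]
    return list(residual)  # branch skipped entirely for this sample


rng = random.Random(0)
shortcut = [1.0, 2.0]
block_out = [0.5, 0.5]
eval_out = drop_path(shortcut, block_out, 0.2, training=False, rng=rng)
dropped = drop_path(shortcut, block_out, 1.0, training=True, rng=rng)
```

DropOut would instead zero individual elements of `block_out` independently, whereas this version removes the entire branch for the sample.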

MA BiFPN network for feature fusion
The MA BiFPN feature fusion network is proposed based on the BiFPN network. Mask R-CNN uses FPN as its feature fusion network. FPN has the following defects in feature information fusion:
1) Information is lost during the fusion of adjacent-scale features: the reduced number of channels causes a loss of semantic information when fusing from high level to low level.
2) Because of its top-to-bottom or bottom-to-top structure, FPN pays more attention to the feature maps of adjacent layers, so the feature maps of a high-level layer cannot be transferred directly to a lower layer but must pass through the intermediate layers, which leads to the loss of high-level semantic information.
3) The feature information of each scale is inconsistent.
BiFPN builds a bottom-to-top path on top of FPN to better retain low-level feature information and adds an extra weight to each input to learn the importance of each feature. BiFPN also adds residual connections to enhance the representation of features, removes nodes with only one input edge, applies weighted fusion and iterates the entire fusion network three times. Nodes with a single input edge are removed because they perform no feature fusion and therefore carry less information and contribute little. The weight added to each scale feature map reflects the contribution of that scale. Equation (3) shows the weighted fusion calculation method.
$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i \tag{3}$$
where $I_i$ is the i-th input feature map, $w_i$ is its learnable weight, $\epsilon$ is a small preset constant that keeps the denominator away from zero, and each normalized weight takes a value between 0 and 1.
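The weighted fusion described above can be sketched in a few lines. This is a minimal illustration of BiFPN-style fast normalized fusion on flattened feature vectors. Clamping the weights with ReLU and the value `eps=1e-4` follow the original BiFPN formulation and are assumptions with respect to this paper.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equally sized feature vectors with non-negative weights
    that are normalized to sum to (almost) one."""
    relu_w = [max(0.0, w) for w in weights]  # keep weights non-negative
    total = sum(relu_w) + eps
    norm = [w / total for w in relu_w]       # each value lies in [0, 1)
    length = len(features[0])
    fused = [
        sum(n * f[i] for n, f in zip(norm, features))
        for i in range(length)
    ]
    return fused, norm


# Two feature maps (flattened) with equal learnable weights.
fused, norm = fast_normalized_fusion(
    [[1.0, 2.0], [3.0, 4.0]], weights=[1.0, 1.0]
)
```

With equal weights, the result is close to the element-wise mean of the inputs, and a negative learned weight simply switches its input off.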
BiFPN has the above improvements, but it still does not solve the problems of poor semantic information fusion between non-adjacent layers and information loss caused by multi-scale information transmission. Therefore, this study proposes an improved feature fusion network called MA BiFPN. Its specific structure is shown in Figure 6, whose inputs are the output features of the four stages of MA ConvNeXt.
In order to avoid the information loss caused by the reduction of the number of channels when the highest-level feature map is transmitted to the feature fusion network and to strengthen the contextual information compatible with other layers, this study replaces the path that passes the highest-level feature map of the feature extraction network into the fusion network with the RFA module. The structure of the RFA module is shown in Figure 7. Ratio-invariant adaptive pooling is applied to the highest-level feature map to generate multi-scale feature maps. A 1 × 1 convolution is then used to generate output features with 256 channels, which are up-sampled to the same scale by bilinear interpolation. The Adaptive Spatial Fusion module adaptively combines these contextual features to reduce the blurring effect caused by the interpolation. It assigns weights to each feature map and then aggregates the contextual features into a new feature map, as shown in Eq (4).
$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l} \tag{4}$$
where $y_{ij}^{l}$ denotes the feature vector at (i, j) of the l-th layer. $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$ and $\gamma_{ij}^{l}$ denote the learnable weights (i.e., contribution degrees) of the remaining three layers on the feature maps. $x_{ij}^{n \to l}$ denotes the feature vector generated at (i, j) of the l-th layer after the feature map is resized from the n-th layer.
The newly generated feature map is then combined with the corresponding feature map in the subsequent feature fusion network and propagated to the lower-level feature maps for fusion. At each lower level, the result is processed by the ASPP module and combined with the feature map of that level to generate the fused feature map. Equations (5) and (6) illustrate the feature fusion process of the improved MA BiFPN at level 3.
where and are all learnable parameters, and e is the learning rate. indicates that after processing by the module, is upsampled and scaled to the same size as .
denotes the intermediate feature map at level 3 on the top-to-bottom path, and indicates the scaling of the output feature of to the same size as .  Continuing from the previous part, the feature is obtained by weighting the pooled feature and the feature after passing through the RFA and ASPP modules. Equations (7)-(10) present the feature fusion process of the highest-level feature N4 in the improved MA BiFPN. denotes the three multi-scale features generated by applying Ratio-invariant Adaptive Pooling to , denotes the bilinear interpolation method, denotes the Adaptive Spatial Fusion module, denotes the weighted feature obtained by , denotes the number of convolutions with different parameters in the ASPP module, denotes the convolution operation with different parameter settings in the ASPP module, denotes the feature obtained by applying the ASPP module to , are learnable parameters, e is the learning rate, and indicates the down-sampling of feature to the same scale as feature . When obtaining and extracting features of each layer, FPN only uses convolution whose convolution kernel is 1 1 to reduce the number of channels, greatly reducing the feature content of each scale. Therefore, in this study, ASPP is replaced with the path of feature transfer at each layer of feature fusion. In the BiFPN of three iterations, multi-scale feature maps are passed through ASPP to the next bottom-to-top or top-to-bottom fusion path. The concrete structure of ASPP is shown in Figure 8, where Rate stands for dilation rate. The input feature map is obtained by 4 parallel convolution branches with different dilation rates, and the multi-scale feature maps generated by the input feature map are merged at the end. ASPP can enhance the receptive field of feature fusion and enhance the extraction ability of the overall information, which is conducive to improving the final feature maps obtained by each branch of the feature fusion network. 
To integrate with the preceding content and provide a comprehensive and intuitive presentation of MA BiFPN, Algorithm 2 also provides the pseudocode of MA BiFPN.
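The effect of the parallel dilated branches in ASPP can be illustrated in one dimension. This is a toy sketch, not the module in Figure 8: the rates `(1, 2, 4, 8)` are hypothetical placeholders because the text does not list the rates, and the branches are merged by summation here, whereas ASPP implementations typically concatenate the branches and apply a 1 × 1 convolution.

```python
def dilated_conv1d(signal, kernel, rate):
    """Zero-padded, size-preserving 1D convolution (cross-correlation)
    with dilation; assumes an odd-length kernel."""
    k = len(kernel)
    pad = rate * (k - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j * rate] for j in range(k))
        for i in range(len(signal))
    ]


def aspp_1d(signal, kernel, rates=(1, 2, 4, 8)):
    """Run four parallel dilated branches and merge them by summation."""
    branches = [dilated_conv1d(signal, kernel, r) for r in rates]
    return [sum(values) for values in zip(*branches)]


impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
near = dilated_conv1d(impulse, [1.0, 1.0, 1.0], rate=1)
far = dilated_conv1d(impulse, [1.0, 1.0, 1.0], rate=2)
merged = aspp_1d(impulse, [1.0, 1.0, 1.0])
```

With rate 1 the impulse influences only its immediate neighbours, while with rate 2 it reaches positions two steps away using the same three weights. This is how the branches widen the receptive field without adding parameters.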

Improved region proposal network
After the MA BiFPN feature fusion network, four multi-scale fusion feature maps are obtained. These are then fed into the RPN for bounding box regression. The bounding box regression of the RPN gives only a rough result, which easily loses information needed for the segmentation task. A common RPN judges candidate boxes by the IoU between candidate boxes and ground truth boxes (in target detection, this is the IoU of the prediction box and the ground truth box; in segmentation, it is the IoU of the prediction mask and the ground truth mask). The IoU is calculated as in Eq (11).
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \tag{11}$$
where $A$ is the candidate box, and $B$ is the ground truth box. As can be seen from the equation, if the candidate box is too large, its IoU value becomes too small, and the box is discarded. However, a candidate box that is larger than the ground truth box contains the segmented object more completely and keeps more of the background area, which reduces the information loss that occurs when a candidate box lies close to the object edge. Therefore, in the model of this study, the anchor candidate boxes are enlarged by about 5%, and a penalty factor is added to the IoU to penalize candidate boxes that do not cover the target region. The improved IoU calculation is shown in Eq (12).
The penalty intensity in Eq (12) is a learned parameter, and the penalty term measures the spatial gap between the candidate box and the ground truth box.
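The motivation for changing the IoU criterion can be reproduced numerically: enlarging a perfectly fitting box lowers its plain IoU. The sketch below covers only the plain IoU of Eq (11) and the enlargement step. It does not implement the penalty term of Eq (12), and it assumes "about 5%" means scaling the width and height about the box centre.

```python
def iou(a, b):
    """Plain IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def enlarge(box, ratio=0.05):
    """Scale a box about its centre so width and height grow by `ratio`."""
    x1, y1, x2, y2 = box
    dw = (x2 - x1) * ratio / 2
    dh = (y2 - y1) * ratio / 2
    return (x1 - dw, y1 - dh, x2 + dw, y2 + dh)


ground_truth = (0.0, 0.0, 10.0, 10.0)
tight_iou = iou(ground_truth, ground_truth)           # 1.0
loose_iou = iou(enlarge(ground_truth), ground_truth)  # 100 / 110.25
```

The enlarged box fully covers the target yet scores about 0.907 instead of 1.0, so a purely IoU-based selection would prefer the tight box.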
The improved RPN retains a certain amount of elastic space around the selected candidate boxes, providing better proposal regions for subsequent segmentation.
Figure 9 shows the segmentation strategy of the Mask branch selected by MA Mask R-CNN. MA Mask R-CNN uses PointRend to replace the FCN of the traditional Mask R-CNN for mask segmentation. PointRend introduces the rendering idea of computer graphics and regards the graph as a discretized expression of the real target. The segmentation problem can therefore be regarded as predicting the area occupied by a real target in the discretized graph. In contrast to the FCN, which predicts after upsampling to the same size as the input graph, PointRend predicts directly on the features obtained after multiple pooled downsampling steps. PointRend consists of three main components:
Point Selection: among the rough segmentation feature maps, difficult points with a poor segmentation effect are selected.
Point-wise: by combining fine-grained features and coarse features, point-wise features are constructed at the selected difficult points.
Point head: a simple MLP network is trained to perform point-by-point segmentation prediction on the features of the selected difficult points, and the re-predicted results replace the original coarse predictions.
In this way, PointRend reduces the information loss caused by continuous upsampling. Because it processes a small number of difficult points rather than every pixel of the mask, it not only achieves accurate segmentation with smooth edges but also greatly reduces the consumption of computing resources.
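The Point Selection step can be sketched as picking the positions whose coarse foreground probability is closest to 0.5. This is a simplified illustration on a small grid: the function name and the uncertainty measure are assumptions, and the actual PointRend sampling strategy is more elaborate.

```python
def select_uncertain_points(prob_map, num_points):
    """Return the (row, col) positions whose foreground probability is
    closest to 0.5, i.e. where the coarse mask is least certain."""
    scored = []
    for r, row in enumerate(prob_map):
        for c, p in enumerate(row):
            scored.append((abs(p - 0.5), r, c))
    scored.sort()  # smallest distance to 0.5 first
    return [(r, c) for _, r, c in scored[:num_points]]


# Coarse mask probabilities; values near 0.5 lie on the object boundary.
coarse = [
    [0.05, 0.10, 0.48],
    [0.20, 0.55, 0.90],
    [0.60, 0.95, 0.99],
]
points = select_uncertain_points(coarse, 2)  # -> [(0, 2), (1, 1)]
```

Only these selected points would then be re-predicted by the point head, which is why the cost grows with the number of difficult points rather than with the mask resolution.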

Loss function
In this study, the loss function is the sum of the loss at the RPN stage, the final classification and bounding box regression loss of the model, and the mask branch loss, as shown in Eq (13).
$$L = L_{rpn} + L_{fast} + L_{mask} \tag{13}$$
In Eq (13), is composed of classification loss and bounding box regression loss at RPN stage, as shown in Eqs (14)- (17). where denotes the probability that the i-th anchor is predicted to be a true label, and the value of * denotes the current positive and negative samples, which is 1 when the calculated samples are positive and 0 when the calculated samples are negative.
denotes the bounding box regression parameter predicting the i-th anchor, and * denotes the ground truth corresponding to the i-th anchor.
indicates the number of all samples in a batch, and indicates the number of anchor locations.
is the balance parameter set. ∑ , * denotes classification loss, and ∑ * , * denotes bounding box return loss. uses Softmax Cross Entropy loss function and uses ℎ loss function.
In Eq (13), is composed of classification loss and bounding box regression loss in the final prediction stage of the model, as shown in Eq (18).
where denotes the softmax probability distribution predicted by classifier , … , . denotes the real category label of the corresponding target, and denotes the regression parameter of the corresponding category g predicted by the corresponding bounding box regressor , , , . denotes the bounding box regression parameter corresponding to the real target , , , . still uses Softmax Cross Entropy loss function, while uses ℎ loss function.
, denotes classification loss, and 1 , denotes bounding box regression loss.
in Eq (13) adopts the Binary Cross Entropy loss function on the Mask branch: where 1 means 1 when the k-th channel corresponds to the true category of the target, and 0 otherwise. denotes the output value at the current position, and denotes the result of transforming the output through a sigmoid function.
indicates the label value of the mask at the current location, which is 0 or 1.
denotes the size of the feature map. Figure 10. Part of the dataset is displayed.
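As a rough illustration of two of the loss terms in this section, the sketch below computes a bounding box regression loss and the Binary Cross Entropy mask loss in plain Python. It assumes that the regression term is the Smooth L1 loss commonly used in Mask R-CNN, and that the mask loss is averaged over the feature map of the channel that matches the true category. The function names are hypothetical and are not taken from the authors' code.

```python
import math
from typing import List, Sequence


def smooth_l1(x: float) -> float:
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def box_regression_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of Smooth L1 terms over the four regression parameters."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def mask_bce_loss(logits: List[List[float]], labels: List[List[int]]) -> float:
    """Mean binary cross entropy over the mask of the true-category channel.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    """
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for x, y in zip(logit_row, label_row):
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count
```

For example, a logit of 0 at a foreground pixel corresponds to a probability of 0.5 and contributes a loss of log 2.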

Experimental data and experimental environment
The mathematical function graphs used in the experiment comprise function graphs taken from a real book [42] and self-made high-quality function graphs, for a total of 1200. Some examples are shown in Figure 10. Each graph is composed of multiple graph elements, including the horizontal axis, vertical axis, straight line, quadratic function curve, cubic function curve, sine function curve (covering sine and cosine, whose phase difference is $\pi/2$), arcsine function curve (covering arcsine and arccosine, the inverse functions of sine and cosine), tan function curve, arctan function curve, exponential function curve, logarithmic function curve, composite function curve (free curve) and circle. For labeling information such as equations and point coordinates in the function graph, the density method of J. Chen et al. [26] is used to separate this information from the function entity, and only the function entity is retained.
In terms of processing, the graphs are first preprocessed, including graph enhancement, flipping and size scaling. Second, the Labelme software is used to annotate the graph elements with the polygon annotation method. The annotations include the location and category information of the graph elements and are saved in JSON files. Then, the graph dataset is divided into a training set and a test set at a ratio of 9:1.
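The 9:1 division into training and test sets can be sketched as follows. The shuffling, the fixed seed and the file names are assumptions made for illustration; the source does not state how the samples were assigned to each set.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str], train_ratio: float = 0.9, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the annotation files reproducibly and split them by ratio."""
    shuffled = list(items)
    # A local Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names for the 1200 annotated graphs.
annotation_files = [f"graph_{i:04d}.json" for i in range(1200)]
train_files, test_files = split_dataset(annotation_files)
```

With 1200 graphs this yields 1080 training samples and 120 test samples.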
The experimental environment was as follows: Intel Core i9-12900K 2.4-5.2 GHz CPU, 16 GB RAM, Windows 10 64-bit operating system, Nvidia RTX 3090 Ti (24 GB) GPU, Python 3.7, and PyCharm 2022 as the programming platform.
Table 1 shows the main training parameters used for the models in the experimental section of this study. To ensure the reliability and fairness of the experiments, the main training parameters are kept consistent across models. In Table 1, LEARNING_RATE and WEIGHT_DECAY are set to empirical values that proved effective in experiments. A warm-up strategy was applied to the learning rate during the first 10 epochs, followed by the Adam optimizer in subsequent epochs, which allowed a smaller initial learning rate. Warm-up is a learning rate scheduling strategy that starts with a small learning rate in the initial training phase and gradually increases it to help the model converge better. The Adam optimizer combines momentum with adaptive learning rate adjustment, automatically adjusting the learning rate based on the gradients of the parameters.
WEIGHT_DECAY is used to prevent overfitting by adding a regularization term to the loss function. BATCH SIZE is the number of samples used in each training iteration. In general, a larger BATCH SIZE is preferred as long as it does not degrade generalization. In this experiment, however, the maximum BATCH SIZE is limited to 8 by the available GPU memory. EPOCHS is the number of complete passes over all the samples in the training dataset. After multiple experiments, it is observed that the total loss of each model stabilizes after 150 epochs and shows no significant changes. Therefore, EPOCHS is set to 150.
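A warm-up schedule of the kind described above can be sketched in a few lines. The linear ramp and the base learning rate of 1e-3 are assumptions chosen for illustration; the source only states that the learning rate starts small and increases during the first 10 epochs.

```python
def warmup_lr(epoch: int, base_lr: float, warmup_epochs: int = 10) -> float:
    """Learning rate for a zero-based epoch index with linear warm-up."""
    if epoch < warmup_epochs:
        # Ramp from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # After warm-up the base rate is used; an adaptive optimizer such as Adam
    # would then scale the per-parameter step sizes.
    return base_lr


# Hypothetical base learning rate, 150 epochs as in this study.
schedule = [warmup_lr(epoch, base_lr=1e-3) for epoch in range(150)]
```

The schedule is non-decreasing and reaches the base rate at the tenth epoch.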

Experimental results and analysis
The collected data consist of function graphs sampled from real books and high-quality function graphs made by ourselves. In this section, the effectiveness of the proposed MA Mask R-CNN is demonstrated through experiments. The categories of graph elements are as follows: C1 represents the horizontal axis, C2 the vertical axis, C3 a straight line, C4 a quadratic function, C5 a cubic function, C6 an exponential function, C7 a logarithmic function, C8 a sine function, C9 an arcsine function, C10 a tan function, C11 an arctan function, C12 a composite function, and C13 a circle. Tables 2 and 3 compare the model proposed in this study with other mainstream instance segmentation models in terms of the mAP of target detection and of segmentation on mathematical function graphs, respectively, and Figures 11 and 12 visualize the results of Tables 2 and 3. Mask R-CNN is the original model, YOLACT++ [43], PANET [44] and BlendMask [45] are classic instance segmentation models, and Mask Dino [46] is the state-of-the-art instance segmentation model. The evaluation index is mAP (mean Average Precision) with the IoU threshold set at 0.75. The mAP in Table 2 is calculated as follows:
Step 1: The region selection algorithm is used to obtain the candidate boxes. The IoU between each candidate box and the ground truth box is calculated. By comparing the IoU value with the given IoU threshold, the samples are classified as positive or negative, and a test set similar to that in a classification task is obtained.
Step 2: The test set of Step 1 is scored by the classifier, and the samples are classified as positive or negative by comparing each score with the given score threshold.
Step 3: The precision and recall of the current category are calculated according to the results of Step 2.
Step 4: The AP value of each class is calculated, and the mAP is calculated as the mean of the AP values of all classes.
For the mAP in Table 3, one only needs to replace the IoU calculation target in Step 1 with the candidate mask and the ground truth mask.
Figure 11. Object detection mAP of each method on each function class.
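The four steps of the mAP calculation can be sketched as follows for bounding boxes. This is a simplified illustration under stated assumptions: detections and ground truths come from a single image, each detection is matched to the ground truth box with the highest IoU, and AP is the area under the precision-recall curve after replacing each precision by the maximum precision at any higher recall. The source does not specify the interpolation scheme, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: Sequence[Tuple[float, Box]], gts: Sequence[Box], iou_thr: float = 0.75
) -> float:
    """AP of one class; detections are (score, box) pairs."""
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    tp_flags: List[bool] = []
    # Steps 1-2: visit detections by descending score and label each one.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        ious = [box_iou(box, gt) for gt in gts]
        best = max(range(len(gts)), key=lambda j: ious[j])
        is_tp = ious[best] >= iou_thr and not matched[best]
        if is_tp:
            matched[best] = True
        tp_flags.append(is_tp)
    # Step 3: precision and recall after each detection.
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += int(flag)
        precisions.append(tp / k)
        recalls.append(tp / len(gts))
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Step 4: area under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class: Dict[str, float]) -> float:
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

For the segmentation mAP, `box_iou` would be replaced by an IoU computed between the candidate mask and the ground truth mask, while the rest of the procedure stays the same.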
Combining Table 2 and Figure 11, it can be seen that the target detection mAP of the model proposed in this study improves to varying degrees in each category. In particular, the target detection mAP of the C12 category (composite function) is significantly improved. By introducing an attention module for local information and remote information, and by strengthening the fusion of high-level and low-level information in the feature fusion stage, the MA Mask R-CNN model effectively improves the ability to detect elements in complex mathematical function graphs.
Furthermore, the model proposed in this study performs better than Mask Dino in all categories except C13. The reason is as follows: in the dataset used in this study, C13 (circle) is annotated with a polygon along the edge of the circle, and the entire solid circle is labeled as C13. Unlike the other categories, C13 is therefore not a thin line but a solid shape, and Mask Dino has a significant advantage in handling non-elongated objects of this kind. The specific characteristics and challenges of different object categories should be considered when evaluating the performance of models: while our model does not outperform Mask Dino in the C13 category because of the nature of the annotations, it demonstrates superior performance in the other categories.
Figure 12. mAP of each category in segmentation branch.
Combining Table 3 and Figure 12, it can be seen that the segmentation mAP of the model proposed in this study improves to different degrees in each category. Compared with the state-of-the-art instance segmentation model, Mask Dino, the improvement is limited in categories such as exponential functions (C6), logarithmic functions (C7) and composite functions (C12), and the result in category C13 is slightly inferior to that of Mask Dino. Potential reasons are as follows. Mask Dino has an advantage in detecting the C13 category. Furthermore, it takes dot products between query embeddings and a high-resolution pixel embedding map to predict a set of binary masks. This approach is particularly effective for C13, which has a fixed geometric shape without elongation: it captures the shape information of the target better and generates masks that fit the shape of C13 more closely. Therefore, Mask Dino retains a certain advantage in segmenting the C13 category. In some cases, C6 and C7 have high morphological similarity, and the feature difference between them is not obvious. C12, owing to the particularity of the category, has no fixed morphological rules, and some of its entities resemble other categories, so it relies more on the dataset. The limited improvement on C12 may therefore be due to insufficient training samples and to graph elements of some training samples extending to the edge of the image, which affects the segmentation. The sine function has a wide range of independent variables, more features and a higher requirement for continuity than the other categories in the samples, so it places higher demands on a model's ability to acquire remote information, which results in limited improvement of all preceding models on C8 (sine function).
However, our model shows an obvious improvement in this category, for the following reasons: the improved detection branch, the expanded anchor box ratios in the RPN layer and the penalty factor added to the IoU make the candidate boxes more accurate and leave some elastic space. Building on these gains, PointRend, whose segmentation is based on the idea of graphics rendering, further improves the segmentation accuracy and generates high-quality segmentation masks.
The backbone network of Mask R-CNN, YOLACT++, PANET, BlendMask and Mask Dino in Tables 2 and 3
Table 4 shows the ablation experimental results of this study, in which Params denotes the number of parameters contained in the model (unit: MB), abbreviated as M in the table. Det mAP denotes the mAP for object detection, and Seg mAP denotes the mAP for segmentation. GFLOPs denotes the number of floating-point operations, in billions, required by the model and is used to measure its complexity. FPS denotes the number of graphs the model can process per second. Table 5 displays the runtime of each model mentioned in the experimental section of this study, in which Time denotes the duration from the moment the preprocessed input data is fed into the model, when the timer starts, until the model generates the output results, when the timer stops. According to Table 4, compared with the original Mask R-CNN, the model proposed in this study significantly improves both the mAP of target detection and the mAP of instance segmentation. According to Tables 4 and 5, compared with the original Mask R-CNN and Mask Dino, the accuracy of the proposed model is improved at the cost of only a small loss in speed. Therefore, the ablation results demonstrate the effectiveness of the improved model in this study.
This section does not compare YOLACT++ because its only clear advantage is speed. CAM (Class Activation Mapping) is a common visualization tool for attention mapping. By analyzing Figure 13, it can be found that, compared with the Mask R-CNN, PANET and BlendMask models, MA Mask R-CNN pays more attention to the target graphs in various mathematical function graphs. Compared with Mask Dino, MA Mask R-CNN pays less attention only to the circle. MA Mask R-CNN therefore performs better on mathematical function graphs, which shows that the model is effective in improving attention to local and remote information.
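The Time and FPS measurements reported in Tables 4 and 5 can be approximated with a sketch like the one below. It assumes that the timer covers only the model call on already preprocessed input, which is one reading of the definition of Time given above. `preprocess` and `model` are hypothetical callables and do not refer to the authors' code.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def measure_runtime(
    preprocess: Callable[[Any], Any],
    model: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> Tuple[float, float]:
    """Return (mean seconds per input, inputs processed per second)."""
    total, count = 0.0, 0
    for raw in inputs:
        x = preprocess(raw)
        start = time.perf_counter()  # timer starts when data is fed to the model
        model(x)
        total += time.perf_counter() - start  # timer stops at the output
        count += 1
    if count == 0 or total <= 0.0:
        raise ValueError("no measurable inference time was recorded")
    return total / count, count / total
```

On a GPU, the device would normally need to be synchronized before each timer reading, since kernel launches can return before the computation has finished.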
Figure 14 shows the instance segmentation results of the five models on the test graph. To better display the recognition and segmentation effect of each model on the function graph, the segmentation instances of each model in Figure 14 are extracted separately in Figure 15. In Figure 15(b)-(f), the red line denotes the entity of the logarithmic function, the blue line the entity of the arcsine function, the green line the entity of the circle, and the black line the entity of the coordinate axes. In each of Figure 15(b)-(f), the first image shows the logarithmic function entity segmented by the model from Figure 15(a), the second image the arcsine function entity, the third image the circle, and the fourth image the part of Figure 15(a) that remains after the entities in the first three images are removed. The instance segmentation results in Figure 15 are analyzed below (the five models segment the axes similarly, so the axes are not analyzed):
1) In Figure 15(b), the logarithmic function and arcsine function entities extracted by the original Mask R-CNN are too few and have poor continuity. Although a complete circle is segmented by the circle mask, part of the coordinate axis and of the arcsine function entity is also incorrectly included. The fourth image in Figure 15(b) shows that the instances Mask R-CNN can extract are limited, the features of the residual instances are not fully utilized, and the segmentation at the junctions of different instances is poor.
2) The segmentation effect of PANET in Figure 15(c) is slightly improved. The fourth image in Figure 15(c) shows that a large amount of entity information is still not utilized.
3) In Figure 15(d), the logarithmic function and arcsine function entities extracted by BlendMask are significantly improved. The interference of the arcsine function instance is removed in the circle segmentation, but the part of the axis entity that intersects the circle boundary is still incorrectly segmented. The fourth image in Figure 15(d) shows that BlendMask obtains more entity information than the previous two models, but its utilization rate of entity information still cannot meet the requirements.
4) In Figure 15(e), Mask Dino extracts the various function entities better than BlendMask. The function entities become more complete, but the logarithmic function and arcsine function are still occluded by other functions in places, so no masks are generated for the occluded portions. Mask Dino shows the most significant improvement in segmenting the circle: it eliminates the interference caused by neighboring function entities and coordinate axes, so that the segmented mask contains only the circle. The fourth image in Figure 15(e) shows that Mask Dino has a higher utilization rate of function entity information than the previous three models, although there is still room for improvement.
5) In Figure 15(f), the logarithmic function and arcsine function entities segmented by MA Mask R-CNN (ours) are the most complete and continuous. Its performance on the circle is nearly comparable to that of Mask Dino, as it effectively eliminates the interference from other function entities when segmenting the circle. Moreover, it avoids including the coordinate axis entities at the intersection with the circle in the circle's mask. The fourth image in Figure 15(f) shows that MA Mask R-CNN (ours) leaves the fewest remaining entities after each segmentation instance is removed, indicating that this model makes the best use of the instance information of each function graph.
From the above, it can be found that the model proposed in this study significantly improves function instance segmentation by strengthening the multi-scale fusion of local and global information and by adopting a mask segmentation method that generates higher-quality and smoother results. The model generates more complete and more continuous masks on the function graphs, and the masks are generated correctly for the different function categories.

Conclusions and prospect
In this paper, MA Mask R-CNN is proposed to improve the detection and segmentation capabilities of the original Mask R-CNN model for mathematical function graphs. The improvements are as follows: 1) taking ConvNeXt as the baseline and adding an attention module that combines local and remote information; 2) taking BiFPN as the baseline and adding RFA and ASPP modules to strengthen the fusion of multi-scale information between the high and low layers of BiFPN; 3) adding a penalty factor to the IoU in the RPN stage, which allows the candidate anchor boxes a certain redundancy so that the detection box better contains the target and the segmentation effect is enhanced; 4) adopting the PointRend mask segmentation strategy, which follows the rendering perspective of computer graphics and makes the segmentation results smoother and more complete. Experimental results show that MA Mask R-CNN is better than the other models in instance segmentation of function graphs.
We will continue to improve mathematical function graph instance segmentation for the production of tactile graphics in the following directions:
1) Improve the detection speed of the model and continue to reduce the number of parameters. Pruning makes the model lightweight so that it can be quickly applied to different scenarios.
2) Expand the dataset and optimize its distribution. The classification of the dataset can still be optimized. The generalization ability of the model will be gradually strengthened by expanding a well-distributed dataset.
3) Add deformable convolution [48] to enhance the geometric transformation ability of the model, so that the model can extract more function image features.
4) After the function graph is segmented, skeletonize the mask and redraw the resulting refined mask through fitting. Finally, save the fitted mathematical function graph in SVG format, providing a foundation for the subsequent production of tactile graphics.