Multi-attribute scientific documents retrieval and ranking model based on GBDT and LR

Scientific documents contain many mathematical expressions and passages of text that carry mathematical semantics. Retrieving scientific documents by mathematical expressions alone or by text alone can hardly meet retrieval needs. The real difficulty lies in effectively integrating mathematical expressions with the related textual features. Therefore, this study proposes a multi-attribute scientific document retrieval and ranking model based on GBDT (gradient boosting decision tree) and LR (logistic regression) that integrates the expressions and text contained in scientific documents. First, the similarities of five attributes are calculated: mathematical expression symbols, mathematical expression sub-forms, mathematical expression context, scientific document keywords and the frequency of mathematical expressions. Next, the GBDT model is used to discretize and reorganize the five attributes. Finally, the reorganized features are input into the LR model to obtain the final retrieval and ranking results. The experiment was carried out on the NTCIR dataset. The average MAP@20 for scientific document recall was 81.92%, and the average nDCG@20 for scientific document ranking was 86.05%.


Introduction
Most existing search engines support text retrieval, but still have problems retrieving mathematical expressions, especially expressions without natural language annotations. While traditional search engines are losing their roles in this respect, recent research on mathematical expression retrieval has achieved relatively rich results [1−5].
Focusing on mathematical expressions in LaTeX format, Zhong et al. [6] proposed a mathematical formula retrieval algorithm based on Operator Tree. By matching multiple disjoint common subtrees with the same structure, the maximum number of sub-formulas is matched, which improves the efficiency of formula matching. Although the maximum number of matching sub-forms can improve retrieval accuracy, most sub-forms are more complicated. Therefore, the response time of real-time retrieval is approximately 20 s, which cannot meet the needs of real-time mathematical formula retrieval. To achieve faster sub-formulas retrieval, the team also proposed a strategy based on an inverted index and dynamic pruning [7], which improves the time efficiency of retrieval while ensuring that the retrieval results are still valid.
Focusing on mathematical expressions in MathML format, Schubotz et al. [8] proposed the VMEXT system, which can realize a visual tree of expressions in MathML format. It can also realize human-computer interaction, which is convenient for users to quickly find and improve the expression tree. In addition, similar or identical elements of two expressions can be visualized to calculate the similarity of expressions.
Focusing on mathematical expression images, Davila et al. [9] proposed a mathematical formula matching system. The system is mainly aimed at matching handwritten formulas on the teaching whiteboard with the formulas in course notes. First, the entire image was preprocessed, including formula search and structure correction. Then, the largest match in each image was identified by the symbolic consistent spatial alignment and similar relative sizes. Finally, each mathematical formula was divided into multiple symbol pairs. Symbol pairs are two symbols in a formula that are the nearest geometric neighbor of each other, which indicates the logical relations between them. The angle of a symbol pair is the angle between the line connecting the centers of the symbols and a horizontal line, which is helpful for judging the relationship between the two symbols. The images were sorted by the angle of the symbol pair.
With the development of deep learning, text embedding methods are widely used in natural language processing. Gao et al. [10] tried to apply the same method to formula embedding. They applied neural networks to mathematical information retrieval and proposed the "symbol2vec" model. This model was used to learn the vector representation of mathematical symbols and perform similarity calculations. Similarly, the NTFEM model [11] used an N-ary tree to convert the mathematical formula into a linear sequence. The word embedding model is used to embed the formula, and a weighted average embedding vector is obtained by using a weighting function. In mathematical formula retrieval, the BERT (bidirectional encoder representations from transformer)-based embedding model [12] is proposed to introduce more semantic information when the formula is embedded. The model uses the LaTeX format as the input and the BERT model is used to encode the formula. The index is built according to the embedded formula vector, formula id and post id from which the formula originates, and finally, the cosine similarity is used to obtain the final ranking of the formula.
In terms of fusion retrieval and ranking of mathematical expressions and scientific documents, Pathak et al. [13−15] committed to fusing expressions and related texts for retrieval. First, they proposed the MathIR system composed of three modules: "TS", "MS" and "TMS". This made scientific documents retrieval a similarity calculation of expression and text fusion rather than a simple expression search. Next, the "context-formula" pair was extracted, and the context of the formula was merged for retrieval. Finally, the modules of the system were optimized, and the formula retrieval was effectively integrated with the retrieval module for the text. Similarly, Schubotz et al. [16] regarded formulas and natural text as a single information source. The description of mathematical formula symbols was extracted from the surrounding text of the formula. These mathematical symbol descriptions were used to represent the definition of mathematical symbols. The namespace was formed as an internal data structure for mathematical information retrieval. This method can eliminate the ambiguity of mathematical symbols and better meet the retrieval needs of users. While retrieving mathematical expressions, Wang et al. [17] integrated other attributes to rank scientific documents, such as document category, types of journals to which scientific documents belong, and document citations. The sorting results were optimized by fusing these attributes of scientific documents. To better integrate mathematical expressions and text in scientific document retrieval, a weight parameter was proposed [18]. Based on formula similarity and text similarity, the proportion of text and mathematical expressions is manually adjusted.
In conclusion, current scientific document retrieval and ranking methods can be roughly divided into two types. The first type recalls by mathematical expression similarity and sorts by text similarity, or recalls by text similarity and sorts by expression similarity. Whichever similarity is used for the final sorting, the influence of the other one is weakened. The second type manually adjusts a weight to fuse expression similarity and text similarity, but less experienced users find it difficult to choose suitable parameter values. To solve these problems, this study proposes a multi-attribute retrieval and ranking model of scientific documents that combines mathematical expressions and related texts. This model improves on the second type and removes the need to manually adjust the weights of expressions and texts.
The similarity of five attributes is calculated: mathematical expression symbols (MESY), mathematical expression sub-forms (MESF), mathematical expression context (MECT), scientific document keywords (SDKY) and the frequency of mathematical expressions in scientific documents (FOME). A gradient boosting decision tree (GBDT) and logistic regression (LR) are used for feature reorganization and calculation to obtain the final search results, which improves the rationality of the retrieval. Figure 1 shows a flow chart of the scientific documents retrieval and ranking system (solid lines denote online query flows and dotted lines denote offline index flows). The whole process consists of four parts: query preprocessing, scientific document preprocessing, multi-attribute similarity measure and scientific document retrieval and ranking.

Overview
The query preprocessing module processes the input query. The query is a combination of mathematical expressions and text, which needs to be split. The scientific document preprocessing module extracts mathematical expressions and related text, preliminarily decomposes the mathematical expressions into symbols and calculates the weights of the related text. The module then interacts with the database module to store and index the information corresponding to the scientific documents, which facilitates subsequent similarity calculations. The multi-attribute similarity measure module calculates the similarity of the five attributes of scientific documents. A different similarity calculation algorithm is set up for each attribute according to its characteristics, and the module interacts with the database module to store the calculated similarities. The scientific document retrieval and ranking module fuses the similarities of the multiple attributes. Finally, the similarity between each scientific document and the input query is obtained, and the scientific documents are ranked according to this similarity.

Similarity calculation of mathematical expression symbols (MESY)
When mathematical expressions are retrieved, the input query expression may contain problems such as inaccurate or incorrect mathematical symbols. It is therefore necessary to retrieve each mathematical symbol one by one to improve the fault tolerance of the system.

Definition 1. $Q_{ME}$ is the query expression, the mathematical expression dataset is extracted from the scientific documents, and $T_{ME}$ is the number of mathematical expressions in the dataset.

First, FDS [19] is used to normalize the mathematical expressions in various formats into a unified form by decomposing them into multiple mathematical symbols, each with five attribute values called level, flag, count, ratio, and operator.
The "level" attribute represents the level of a mathematical symbol, based on its position relative to the horizontal baseline. For example, in the mathematical expression 2 b a , the level values of W W , a, b and 2 are 0, 1, 1 and 2, respectively. "Flag" represents the spatial flag bit of a symbol. Table 1 shows the value of the flag taking x as an example. "Count" refers to the sequential position of a symbol in the mathematical expression. "Ratio" refers to the frequency of the operator in the mathematical expression. "Operator" refers to whether a mathematical symbol is an operator. If a symbol is an operator, it is marked as 1; otherwise, it is marked as 0.
In this way, the mathematical expression is converted into a list, which is convenient for the subsequent retrieval of expression symbols. Table 2 shows the membership functions of the five attributes [20]. The balance factors in the functions are determined by curve fitting, according to how the symbols in the dataset are distributed over the values of each attribute. The values of the balance factors are as follows.
After the membership calculation is completed, each symbol corresponds to a five-tuple membership degree vector, denoted by $sym$, where $term$ refers to the current mathematical symbol and $ex$ refers to the expression id corresponding to the current mathematical symbol.
Take the query expression $Q_{ME}$ and the dataset expression $Dt_{ME}$ in Table 3 as examples. The two expressions share three mathematical symbols, and Table 3 shows the attribute values and membership degrees after the decomposition of these three symbols.
Next, hesitant fuzzy sets [21−23] are used to calculate the membership degree of each mathematical symbol. Hesitant fuzzy sets have advantages in dealing with multi-attribute decision-making problems. The formula for calculating the similarity of expressions using hesitant fuzzy sets is shown in Eq (2).
Finally, the normalization calculation of the mathematical symbols is performed to obtain the similarity of the expressions. The specific algorithm is shown in Algorithm 1.

Definition 3
The formula [20] for calculating Symbol Sim in Algorithm 1 is shown in Eq (2). When $\lambda = 1$, the formula reduces to the standard Hamming distance. When $\lambda = 2$, the formula reduces to the standard Euclidean distance. In this study, $\lambda = 2$.
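The role of the exponent parameter can be illustrated with a minimal Python sketch. It assumes the generalized normalized distance commonly used for hesitant fuzzy sets and assumes that similarity is taken as one minus that distance; the function names and the sample five-tuples are hypothetical and are not taken from Algorithm 1.

```python
from typing import Sequence


def hesitant_distance(a: Sequence[float], b: Sequence[float], lam: float = 2.0) -> float:
    """Generalized normalized distance between two membership vectors.

    lam = 1 gives the normalized Hamming distance,
    lam = 2 gives the normalized Euclidean distance.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("membership vectors must be non-empty and of equal length")
    mean_power = sum(abs(x - y) ** lam for x, y in zip(a, b)) / len(a)
    return mean_power ** (1.0 / lam)


def symbol_similarity(a: Sequence[float], b: Sequence[float], lam: float = 2.0) -> float:
    """Assumed similarity: one minus the distance (memberships lie in [0, 1])."""
    return 1.0 - hesitant_distance(a, b, lam)


# Hypothetical membership vectors (level, flag, count, ratio, operator).
query_symbol = (1.0, 1.0, 0.8, 0.5, 1.0)
doc_symbol = (1.0, 0.6, 0.8, 0.5, 1.0)

print(hesitant_distance(query_symbol, doc_symbol, lam=1))  # Hamming-style
print(symbol_similarity(query_symbol, doc_symbol, lam=2))  # Euclidean-style
```

With the exponent set to 2, a single large disagreement between two membership values is penalized more strongly than several small ones, which is one plausible reason for preferring the Euclidean form.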
Taking the two mathematical expressions in Table 3 as an example, suppose that the first is the query and the second is the mathematical expression with id = 1 in the dataset. Algorithm 1 is used to calculate the similarity of these two expressions. The result of the first update is [(…, 1, 1, 1, 1), (1, 1, 1, 1, 1), (1, 1, 1, 1, 1), (1, 1, 1, 1, 1)]. The final calculated SIM = 0.1425.

The mathematical expression sub-form similarity calculation retrieves $Q_{ME}$ as a whole object. Table 4 shows the membership functions corresponding to the three attributes [16], which give the membership values of the three attributes length, level, and flag, respectively.

Contextual text similarity calculation of mathematical expressions (MECT)
BERT (bidirectional encoder representations from transformers) [22−24] is a pre-trained language model that is first pre-trained on unlabeled data and then fine-tuned on the task corpus, and it performs very well on natural language understanding tasks. There are two tasks in the pre-training phase: masked language modeling and next sentence prediction. Joint training on these two tasks makes the learned word vectors more accurate and comprehensive, and it can handle the polysemy problem that word2vec cannot solve.
This study uses mathematical expression contextual text to fine-tune BERT to achieve the similarity calculation of the contextual text. The specific algorithm is shown in Algorithm 3.

Similarity calculation of scientific document keywords (SDKY)
The Jaccard coefficient is used to calculate the similarity of two sets $G_A$ and $G_B$. It is expressed as the ratio of the size of the intersection to the size of the union of the two sets, and it effectively measures the degree of overlap between them. Its definition is shown in Eq (3).

$$\mathrm{Jaccard}(G_A, G_B) = \frac{|G_A \cap G_B|}{|G_A \cup G_B|} \tag{3}$$
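As a minimal illustration of Eq (3), the plain Jaccard coefficient can be computed in Python as follows. The function name and the sample keyword sets are hypothetical, and the convention of returning 0 for two empty sets is an assumption. The length-adjusted variant used later in this study is not reproduced here.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Ratio of the intersection size to the union size of two sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(set_a & set_b) / len(union)


# Hypothetical query words and document keywords.
query_words = {"fourier", "transform", "series"}
doc_keywords = {"fourier", "series", "integral", "kernel"}
print(jaccard(query_words, doc_keywords))  # 2 shared words out of 5 distinct words
```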
Each scientific document usually addresses a specific topic. Extracting the keywords of a document and matching them against the query words can improve the accuracy of the search results. The content of the scientific document is segmented into words, the weight of each word is calculated, and the 5 words with the highest weights are selected as the keywords of the scientific document. The weight calculation method is shown in Eq (4).

Since differences in text length affect the calculated keyword similarity, this study improves the Jaccard coefficient by adding a length-difference term. The calculation of similarity is shown in Eq (5).
where $DT_{WE}$ refers to the keyword collection of scientific documents, and the remaining parameter in Eq (5) is the balance factor.

The frequency of mathematical expressions in scientific documents (FOME)
When scientific documents are retrieved, the same mathematical expression appears with different frequencies in different scientific documents, so the importance and the retrieval order of those documents also differ.
The frequency of mathematical expressions in scientific documents is the product of the frequency of the mathematical expression in the document (EF) and the inverse document frequency (EIDF), which is similar to TF-IDF. The difference is that, when text frequency is calculated, the query text must be exactly the same as the text in the document to count as one occurrence, whereas in mathematical expression retrieval a partially identical expression can also count as one occurrence. For example, when $Q_{ME}$ is $U = IR$, a document expression that contains $U = IR$ as a part is also counted as one occurrence.

The calculation of EIDF requires the number of occurrences of $Q_{ME}$ in the dataset. If $Q_{ME}$ appears in many different scientific documents, its importance decreases accordingly. The calculation method of the EIDF is shown in Eq (7).
where N refers to the total number of scientific documents in the dataset, and INCLUDE(exp) refers to the number of scientific documents containing exp. The specific calculation of INCLUDE(exp) is shown in Eq (8).
Finally, the frequency of mathematical expressions in scientific documents is calculated as shown in Eq (9).

$$Sim_{fre} = EF \times EIDF \tag{9}$$
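The product in Eq (9) can be sketched in Python under stated assumptions. Here EF is assumed to be the share of a document's expressions that contain the query, EIDF is assumed to be the logarithm of N divided by INCLUDE(exp), and partial matching is approximated by plain substring containment. All function names and the toy corpus are hypothetical, and the exact forms used in this study may differ.

```python
import math
from typing import List


def expression_frequency(query: str, doc_expressions: List[str]) -> float:
    """Assumed EF: fraction of the document's expressions containing the query."""
    if not doc_expressions:
        return 0.0
    hits = sum(1 for expr in doc_expressions if query in expr)
    return hits / len(doc_expressions)


def inverse_document_frequency(query: str, corpus: List[List[str]]) -> float:
    """Assumed EIDF: log(N / INCLUDE(exp)), or 0 if no document contains exp."""
    containing = sum(1 for doc in corpus if any(query in expr for expr in doc))
    if containing == 0:
        return 0.0
    return math.log(len(corpus) / containing)


def frequency_similarity(query: str, doc_expressions: List[str],
                         corpus: List[List[str]]) -> float:
    """Sim_fre = EF * EIDF."""
    return (expression_frequency(query, doc_expressions)
            * inverse_document_frequency(query, corpus))


# Toy corpus: each document is a list of expression strings (hypothetical).
corpus = [
    ["U=IR", "P=UI"],
    ["U=IR+E", "F=ma"],
    ["E=mc^2"],
    ["a^2+b^2=c^2"],
]
print(frequency_similarity("U=IR", corpus[0], corpus))
```

In this toy corpus the query is counted in the second document as well, because "U=IR+E" contains it as a part, which mirrors the partial-match rule described for EF.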

Multi-attribute integration of scientific documents
The LR (logistic regression) model is linear regression followed by a (non-linear) sigmoid mapping. It is shown in Eq (10).

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}} \tag{10}$$

where $\theta^{T} x$ is the input of the sigmoid, and $\theta$ and $x$ are both matrices. $\theta$ is the linear regression parameter, $T$ denotes the matrix transpose, and $x$ is the input feature. The LR model has a simple structure and runs fast, but its learning ability and expressive power are very limited. A large amount of feature engineering, such as feature discretization and feature combination, is required to increase the learning ability of the model. Therefore, an approach is needed that automatically discovers effective features and feature combinations and shortens the LR feature experiment cycle. The GBDT model can automatically discover features and carry out effective feature combinations.
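The sigmoid mapping of Eq (10) can be sketched in a few lines of Python. The function names, the optional bias term and the sample weights are assumptions added for illustration and are not part of the description above.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def lr_predict(theta: Sequence[float], x: Sequence[float], bias: float = 0.0) -> float:
    """Apply the sigmoid to the linear combination of weights and features."""
    if len(theta) != len(x):
        raise ValueError("theta and x must have the same length")
    z = bias + sum(t * v for t, v in zip(theta, x))
    return sigmoid(z)


# Hypothetical weights for five attribute similarities.
theta = [1.2, 0.8, 0.5, 0.3, 0.2]
features = [0.9, 0.7, 0.6, 0.4, 0.1]
print(lr_predict(theta, features))
```

Because the score is a monotonic function of a weighted sum, LR on the raw similarities cannot express interactions between attributes, which is the limitation that the GBDT feature recombination is meant to address.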
GBDT (gradient boosting decision tree) [25−28] is a boosted tree model based on the CART regression tree. When each tree is generated, the residual of the previous tree is calculated, and the next tree is fitted to these residuals so that the residuals decrease from tree to tree. It is shown in Eq (11).

A sample is routed through two trees and falls into one leaf node of each tree. The leaf nodes of the two trees are coded: the leaf nodes to which the sample belongs are marked as 1, and the others are marked as 0. The leaf node codes of the two trees are concatenated to form a seven-dimensional sample (1, 0, 0, 1, 0, 0, 0).
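The leaf encoding described above can be sketched in Python as follows. The split of the seven leaves into a three-leaf tree and a four-leaf tree is an assumption made only so that the example reproduces the vector (1, 0, 0, 1, 0, 0, 0); the function name is illustrative.

```python
from typing import List, Sequence


def encode_leaves(leaf_indices: Sequence[int], leaves_per_tree: Sequence[int]) -> List[int]:
    """Concatenate one one-hot block per tree, marking the leaf the sample reached."""
    if len(leaf_indices) != len(leaves_per_tree):
        raise ValueError("need exactly one leaf index per tree")
    encoded: List[int] = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 0 <= leaf < n_leaves:
            raise ValueError("leaf index out of range for its tree")
        one_hot = [0] * n_leaves
        one_hot[leaf] = 1
        encoded.extend(one_hot)
    return encoded


# Assumed layout: tree 1 has 3 leaves, tree 2 has 4 leaves,
# and the sample reaches the first leaf of each tree.
print(encode_leaves([0, 0], [3, 4]))  # prints [1, 0, 0, 1, 0, 0, 0]
```

Each position in the resulting vector corresponds to one root-to-leaf path, so a weight learned by LR for that position is effectively a weight for a specific combination of attribute thresholds.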
Each sample passes through multiple GBDT trees to recombine features. In a GBDT tree, the path from the root node to a leaf node is a combination of different features, so the leaf node uniquely represents this path. The leaf node is input into the LR model as a discrete feature for training. In the final prediction, the input sample passes through each tree of the GBDT to obtain a discrete feature (a set of feature combinations) corresponding to a certain leaf node. The feature is then passed into LR in one-hot form for linear weighted prediction, which gives the final similarity SIM. Figure 3 shows the specific flow chart. For the LR model, the L2 penalty term is used, and the inverse of the regularization strength is 0.05. For the GBDT model, the metric is "binary_logloss", num_leaves is 32, num_trees is 60 and learning_rate is 0.005.

Experimental data and environment
The dataset used in the experiment is "MathTagArticles" in NTCIR-12_MathIR_Wikipedia_Corpus, which includes 31742 scientific documents. "MathTagArticles" includes 16 archive files (coded as wpmath0000001-wpmath0000016), and each archive file contains about 2000 scientific documents. In this study, the hold-out method is used: "wpmath0000001-wpmath0000008" are used for training, "wpmath0000008-wpmath0000012" are used for validation, and "wpmath0000013-wpmath0000016" are used for testing. Table 5 shows the experimental environment.

Relevance ratings
The evaluators are five mathematics graduate students who are familiar with mathematical expressions and scientific documents. For each query, the top 10 results are selected for evaluation. Each result is rated as relevant, partially relevant or not relevant: relevant results are marked as 2, partially relevant ones as 1, and not relevant ones as 0. The results of the same query are marked separately by the five evaluators. Different evaluators should not mark the same retrieval result too differently. For example, when some evaluators mark a result as 2, the other evaluators may mark it as 1 or 2, but not as 0. A further labeling rule is therefore set: for the same result, the difference between the scores of different evaluators must be less than or equal to 1. If it is greater than 1, the marks are invalid.
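The labeling rule above amounts to a simple range check, sketched here in Python. The function name and the sample marks are hypothetical.

```python
from typing import Sequence


def ratings_are_valid(scores: Sequence[int]) -> bool:
    """Marks for one result are valid if no two evaluators differ by more than 1."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("scores must be 0, 1 or 2")
    return max(scores) - min(scores) <= 1


print(ratings_are_valid([2, 2, 1, 1, 2]))  # marks differ by at most 1
print(ratings_are_valid([2, 1, 0, 1, 1]))  # one evaluator gave 2, another gave 0
```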
Finally, the results of the five evaluators are summarized. The evaluators' scores are converted to a comprehensive score as shown in Table 6. Following the majority principle, a total score greater than 7 is considered relevant, a total score greater than 2 is considered partially relevant, and any other score is considered not relevant. In the subsequent evaluation, if a metric distinguishes only relevant and not relevant, partially relevant results are treated as relevant.

Reciprocal rank (RR) is the reciprocal of the rank of the first relevant document in the retrieved results. MRR is the average of the reciprocal ranks over multiple queries, and the calculation method is shown in Eq (13).
where rank(i) refers to the rank of the first relevant document for the i-th query. Table 8 shows the values of P@3, P@5, P@10, and MRR for the 20 queries in Table 6, and Figure 4 shows the values of P@3, P@5 and P@10 for the 20 queries in Table 7. Figure 4 shows that the P@3 of some queries can reach 100%. However, the precision of some queries is low because fewer scientific documents in the dataset match them.
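P@k and MRR can be computed with a short Python sketch, assuming the usual definitions: P@k is the share of relevant results among the top k, and a query with no relevant result contributes a reciprocal rank of 0. The function names and the sample relevance lists are hypothetical.

```python
from typing import List, Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """relevance holds 0/1 flags in ranked order; returns the precision of the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for position, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[Sequence[int]]) -> float:
    """Average of the reciprocal ranks over all queries."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(run) for run in runs) / len(runs)


# Hypothetical binary relevance of the ranked results for three queries.
runs = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(precision_at_k([1, 0, 1, 1, 0], 3))
print(mean_reciprocal_rank(runs))
```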

Ablation experiment
Average precision considers the position factor on the basis of precision. It is more sensitive to the position of sorting. The calculation method is shown in Eq (14).
where r refers to the total number of related documents, and pos(i) refers to the position of the i-th related document in the retrieved results.
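A minimal Python sketch of average precision, assuming the common form in which the i-th related document at position pos(i) contributes i / pos(i) and the sum is divided by r. Here r is taken to be the number of related documents that appear in the ranked list, which is an assumption; the function name and the sample list are hypothetical.

```python
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """relevance holds 0/1 flags in ranked order."""
    positions = [pos for pos, rel in enumerate(relevance, start=1) if rel]
    if not positions:
        return 0.0
    # The i-th related document found at position pos contributes i / pos.
    total = sum(i / pos for i, pos in enumerate(positions, start=1))
    return total / len(positions)


print(average_precision([1, 0, 1, 0]))  # related documents at positions 1 and 3
```

MAP is then the mean of this value over all queries.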
NDCG is the normalized discounted cumulative gain. The calculation method of DCG (discounted cumulative gain) is shown in Eq (15), where $rel_i$ refers to the relevance of the i-th document. There are three levels of relevance: good, fair and bad, which are assigned scores of 3, 2 and 1, respectively.
In the ideal state, the documents are ordered by relevance from largest to smallest, and the DCG of this ordering, which is the maximum value that DCG can take, is the IDCG. Here REL refers to the ordering of the documents in the ideal state, and k refers to the collection of the first k documents. NDCG uses IDCG to normalize the evaluation indicator.
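The normalization can be sketched in Python. This sketch assumes the widely used exponential-gain form of DCG, with gain 2^rel - 1 and discount log2(i + 1), and it builds the ideal ordering from the retrieved list itself; both choices are assumptions, and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the first k results (assumed gain 2^rel - 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg(relevances: Sequence[int], k: int) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg(relevances, k) / ideal


# Hypothetical graded relevance of a ranked list (3 = good, 2 = fair, 1 = bad).
ranked = [2, 3, 1, 3, 2]
print(ndcg(ranked, k=5))
```

A list that is already sorted from most to least relevant yields an NDCG of 1, and any swap that moves a highly relevant document down the list lowers the score.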
The similarities of the five attributes of scientific documents, MESY, MESF, MECT, SDKY and FOME, are calculated separately. To verify the role of each attribute, an ablation experiment was carried out in this study. Each of the five attributes is removed in turn, and the remaining four attributes are input into GBDT and LR for training, which yields five models. These five models are compared with the original model, and the results are shown in Figure 5. In Figure 5, model A represents MESF + MECT + SDKY + FOME, model B represents MESY + MECT + SDKY + FOME, model C represents MESY + MESF + SDKY + FOME, model D represents MESY + MESF + MECT + FOME, model E represents MESY + MESF + MECT + SDKY, and model F represents the full model MESY + MESF + MECT + SDKY + FOME.

As shown in Figure 5, the MESY attribute affects the precision of the model. When it is removed (model A), fewer relevant results are retrieved, and the less relevant results are ranked relatively higher, so the MAP and nDCG of model A are slightly higher. MESF also affects the precision of the model, but has little effect on the ranking. MECT and FOME have little effect on precision, but they affect the ranking of the results. The SDKY attribute helps retrieve more relevant results and affects the ranking to some extent.

Figures 6 and 7 compare the algorithm in this study with Tangent-CFT [4] and MIaS [3]. MIaS is an open-source system, and the Tangent-CFT model was reproduced experimentally. Table 9 gives the average MAP and NDCG of the compared methods. Tangent-CFT [4] is a mathematical expression embedding model built on word2vec that can achieve precise matching of mathematical expression structure. To locate a scientific document according to a mathematical expression, retrieval over "mathematical expression-scientific document" pairs (the scientific documents corresponding to each mathematical expression) is realized. MIaS [3] is an open search engine for mathematical expressions. It can also retrieve the corresponding scientific documents based on the similarity of mathematical expressions. The system builds an XML tree from the structure of mathematical expressions to retrieve both the query expression and expressions that contain the query expression as a sub-expression.

Conclusions
This study proposes a multi-attribute retrieval and ranking model based on GBDT + LR to solve the problem of poor integration of mathematical expressions and relevant texts in scientific document retrieval. The method combines the five attributes MESY, MESF, MECT, SDKY and FOME. GBDT is used to reorganize the features, and LR is trained on the reorganized features. Finally, the similarity of each scientific document to the query is obtained, and the documents are sorted by this similarity.
Future research is expected to complete the semantic retrieval of expression symbols based on the context of expressions. Meanwhile, in terms of semantics, it is better to effectively integrate expressions and text. When sorting the final scientific documents, the attributes of the scientific