Introduction to Mathematical Language Processing: Informal Proofs, Word Problems, and Supporting Tasks

Abstract

Automating discovery in mathematics and science will require sophisticated methods of information extraction and abstract reasoning, including models that can convincingly process relationships between mathematical elements and natural language, to produce problem solutions of real-world value. We analyze mathematical language processing methods across five strategic sub-areas (identifier-definition extraction, formula retrieval, natural language premise selection, math word problem solving, and informal theorem proving) from recent years, highlighting prevailing methodologies, existing limitations, overarching trends, and promising avenues for future research.


Introduction
Prove that there is no function f from the set of non-negative integers into itself such that f(f(n)) = n + 1987 for every n.

Show that the nearest-neighbor interaction Hamiltonian of an electronic quasiparticle in graphene can be written as
Why is the sun's atmosphere hotter than its surface?
If we hope to use machines to derive mathematically rigorous and explainable solutions to address such questions, models must reason over both natural language and mathematical elements such as equations, expressions, and variables.
Given some input problem description, the ideal model is at least capable of recalling relevant statements (premise selection), assigning contextual descriptions to math elements within that text (identifier-definition extraction), and performing robust manipulation of equations and expressions towards an explainable reasoning argument (informal theorem proving). Previous years have advanced many of the components required to deliver this vision. Transformer-based (Vaswani et al., 2017) large language models (LLMs) (Brown et al., 2020; Chen et al., 2021) have begun to exhibit mathematical (Rabe et al., 2020) and logical (Clark et al., 2020) capabilities. Graph-based models also show competence in premise selection (Ferreira and Freitas, 2020b), math question answering (Feng et al., 2021), and math word problems (MWPs) (Zhang et al., 2022b). The evolutionary path of mathematical language processing can be traced from MWPs (Feigenbaum and Feldman, 1963; Bobrow, 1964; Charniak, 1969) and linguistic analysis of formal proofs (Zinn, 1999, 2003) to the present day, where transformers and graph-based models deliver leading metrics in math and language reasoning tasks, complemented by symbolic methods (Zhong et al., 2022). This survey provides a synthesis of this recent evolutionary arc: we consider five representative tasks with examples, describe contributions leading to the current state-of-the-art, and discuss notable limitations of the current solutions, overarching trends, and promising research directions.

Representative Tasks
There is an abundance of tasks considering mathematical language, such as question answering (Hopkins et al., 2019; Feng et al., 2021; Lewkowycz et al., 2022; Mansouri et al., 2022b) and headline generation (Yuan et al., 2020; Peng et al., 2021). Mathematical language processing (MLP) itself has been described in the context of various targeted texts, such as linking variables to descriptions (Pagael and Schubotz, 2014), grading answers (Lan et al., 2015), and deriving abstract representations for downstream applications (Wang et al., 2021). We take an inclusive stance, selecting a few choice tasks spanning surface-level retrieval, as seen in identifier-definition extraction and formula retrieval, through models which require the encoding of formal abstractions and implicit reasoning chains, such as solving MWPs and informal theorem proving. These areas are projected onto an inference spectrum displayed in Figure 1. Extractive tasks are positioned toward the surface form of the text (an information retrieval perspective), including identification of relevant mathematical statements, ranking lists of formulae, and linking variables to contextual definitions. Logical puzzle solvers (Groza and Nitu, 2022) and informal reasoning generation models (Lewkowycz et al., 2022) sit far toward the abstractive side, due to the step-wise and sometimes symbolic reasoning required to address them. The use of ''formal'' versus ''informal'' differentiates strict automated theorem prover (ATP) approaches, which require a consistent formal language representation (Rudnicki, 1992) and hard-coded logic (Bansal et al., 2019), from approaches that input mathematical language and infer without necessary reliance on strict symbolic and logical inference mechanisms. Autoformalization (Szegedy, 2020; Wu et al., 2022) aims to cross this divide. We consider informal methods for solving five representative tasks in this context, with examples given below and visually displayed in Figure 1.

Figure 1: Identifier-definition extraction assigns identifiers (e.g., ψ(x)) to their context. Formula retrieval considers the structure of formulae, and scores them based on similarity to a query formula. Premise selection selects statements most likely to be useful for solving a proof. Solving MWPs (math word problems) involves calculating solutions to arithmetic problems. Informal theorem proving involves the production of proofs and inference chains combining natural and mathematical language.
Identifier-Definition Extraction. The assignment of meaning to otherwise vague mathematical elements. Without context, equations such as p = k are ambiguous. What meaning is attributed to k? This task involves finding (identifier, definition) pairs, such as (k, wavevector) (Kristianto et al., 2012; Stathopoulos et al., 2018).
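As a toy illustration only (not any published system), identifier-definiens pairing can be approximated with a pattern-based heuristic; the regex and example sentence below are our own assumptions:

```python
import re

# Toy pattern: "<article> <noun phrase> <single-letter identifier>",
# e.g., "the wavevector k" -> ("k", "wavevector").
PAIR = re.compile(r"\b(?:the|a|an)\b\s+([a-z]+(?:\s[a-z]+)?)\s+([a-zA-Z])\b",
                  re.IGNORECASE)

def extract_pairs(sentence):
    """Heuristically pair single-letter identifiers with the preceding
    candidate definiens. Real systems instead score many
    (identifier, definiens) candidates with learned models."""
    return [(m.group(2), m.group(1)) for m in PAIR.finditer(sentence)]

print(extract_pairs("The momentum p is the product of the wavevector k and hbar."))
```

On the example sentence this yields the pairs (p, momentum) and (k, wavevector); such surface patterns break down quickly, which is what motivates the learned approaches discussed below.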
Formula Retrieval. Mathematical language includes math elements written in markup languages such as LaTeX. Given a query formula, the Wikipedia Formula Browsing task (Zanibbi et al., 2016a; Mansouri et al., 2022b) involves ranking a list of candidate formulae in terms of their similarity to that formula. For example, given the query x^2 + y^2 = r^2, the formula a^2 + b^2 = c^2 should rank higher than y = mx + c.
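A minimal sketch of similarity-based ranking, assuming a crude character-level tokenization with variable wildcarding in place of real SLT/OPT tree matching:

```python
def tokenize(formula):
    """Split a LaTeX-like formula string into single-character tokens,
    mapping every letter to a VAR placeholder so that structurally
    identical formulae match under variable renaming."""
    return ["VAR" if ch.isalpha() else ch for ch in formula if not ch.isspace()]

def bigrams(tokens):
    return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

def similarity(query, candidate):
    """Jaccard overlap of token bigrams: a crude stand-in for
    structure-aware matching over SLT/OPT tree tuples."""
    q, c = bigrams(tokenize(query)), bigrams(tokenize(candidate))
    return len(q & c) / len(q | c) if q | c else 0.0

query = "x^2 + y^2 = r^2"
candidates = ["a^2 + b^2 = c^2", "y = mx + c"]
ranked = sorted(candidates, key=lambda f: similarity(query, f), reverse=True)
print(ranked)  # the Pythagorean-style formula ranks first
```

Under wildcarding, a^2 + b^2 = c^2 becomes structurally identical to the query (similarity 1.0), while y = mx + c shares only a few bigrams, reproducing the intended ranking from the example above.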

Natural Language Premise Selection (NLPS).
Given a mathematical statement s that requires proof, and a collection of premises P, this task consists of retrieving the premises in P that are most likely to be useful for proving s (Ferreira and Freitas, 2020a; Valentino et al., 2022). For example, given the purple claim statement in Figure 1, an NLPS model should select the green statements as premises, excluding the red.
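A minimal retrieval baseline for this task can rank premises by lexical overlap with the claim; the statements below are invented for illustration, and real systems use learned sentence encoders rather than bag-of-words vectors:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two sentences."""
    ca = Counter(a.lower().replace(".", "").split())
    cb = Counter(b.lower().replace(".", "").split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def select_premises(claim, premises, k=2):
    """Rank premises by lexical similarity to the claim and keep the
    top k: a minimal retrieval baseline, far below encoder-based systems."""
    return sorted(premises, key=lambda p: cosine(claim, p), reverse=True)[:k]

claim = "The energy of the particle at rest is given by the Lorentz factor."
premises = [
    "The Lorentz factor depends on the velocity of the particle.",
    "The energy of the particle involves the Lorentz factor.",
    "A prime number has exactly two divisors.",
]
print(select_premises(claim, premises))
```

The two physics premises share vocabulary with the claim and are retrieved; the number-theoretic statement is excluded, mirroring the green/red distinction in Figure 1.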
Math Word Problem Solving. Solving arithmetic (Roy and Roth, 2016) or algebra (Kushman et al., 2014) word problems. Andrew has 3 dogs. If they each give birth to 2 others, how many dogs will he have? An example requiring premise selection and identifier-definition extraction is given in Figure 1.
Informal Theorem Proving. Outputting reasoning chains from premises in order to ''prove'' a mathematical language statement. From Figure 1, the energy of the particle is E_k = γmc^2. Substituting v = 0 into the Lorentz factor gives γ = 1, and substituting γ = 1 into E_k = γmc^2 gives E_k = mc^2. Such informal reasoning does not rely on formal frameworks, such as Fitch-style proofs, to infer quantitative results (Lewkowycz et al., 2022).
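Written out as displayed math, the substitution chain from the Figure 1 example reads:

```latex
E_k = \gamma m c^2, \qquad \gamma = \frac{1}{\sqrt{1 - v^2/c^2}}
\]
\[
v = 0 \;\implies\; \gamma = \frac{1}{\sqrt{1 - 0}} = 1
\;\implies\; E_k = \gamma m c^2 = m c^2
```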

Methods
We highlight key points abstracted from task approaches in bold, give an overview of methods in Table 1, and discuss approach-specific limitations in the Appendix.

Identifier-Definition Extraction
A significant proportion of variables or identifiers in formulae or text are explicitly defined within a discourse context (Wolska and Grigore, 2010). Descriptions are usually local to the first instance of the identifiers in the discourse. The broad goal of identifier-definition extraction and related tasks is to pair up variables with their intended meaning.
The task has not converged to a canonical form. Despite the clarity of its overall aim, the task has materialized into different forms: Kristianto et al. (2012) predict descriptions given expressions, Pagael and Schubotz (2014) predict descriptions given identifiers through identifier-definition extraction, Stathopoulos et al. (2018) predict if a type matches a variable through variable typing, and Jo et al. (2021) predict notation given context through notation auto-suggestion and notation consistency checking tasks. More concretely, identifier-definition extraction (Schubotz et al., 2016a) involves scoring identifier-definiens pairs, where a definiens is a potential natural language description of the identifier. Given graph nodes from predefined variables V and types T, variable typing (Stathopoulos et al., 2018) is the task of classifying whether edges in V × T are existent (positive) or non-existent (negative), where a positive classification means a variable matches the type. Notation auto-suggestion (Jo et al., 2021) uses the text of both the sentence containing notation and the previous sentence to model future notation from the vocabulary of the tokenizer. This area can be traced from an early ranking task (Pagael and Schubotz, 2014) reliant on heuristics and rules (Alexeeva et al., 2020), through ML-based edge classification (Stathopoulos et al., 2018), to language modeling with Transformers (Jo et al., 2021). Different datasets are proposed for each task variant.
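To make the V × T edge formulation concrete, a toy rule baseline (our own construction, not the model of Stathopoulos et al., 2018) can label an edge positive when the type phrase occurs near the variable:

```python
def type_edges(sentence, variables, types):
    """Enumerate candidate (variable, type) edges V x T and label an
    edge positive when the type phrase occurs within a small token
    window of the variable: a toy rule baseline for variable typing."""
    tokens = sentence.lower().split()
    labelled = {}
    for v in variables:
        for t in types:
            positive = False
            if v.lower() in tokens and t.lower() in tokens:
                distance = abs(tokens.index(v.lower()) - tokens.index(t.lower()))
                positive = distance <= 3  # window size is an arbitrary choice
            labelled[(v, t)] = positive
    return labelled

edges = type_edges("let k denote the wavevector of the plane wave",
                   ["k"], ["wavevector", "operator"])
print(edges)
```

Learned models replace the window rule with classifiers over lexical, type-centric, and (later) contextual embedding features, but the input/output structure of the task is the same.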
There is high variability in scoping definitions. The scope from which identifiers are linked to descriptions varies significantly, and it is difficult to compare model performance even when tackling the same variant of the task (Schubotz et al., 2017; Alexeeva et al., 2020). At a local context, models such as Pagael and Schubotz (2014) and Alexeeva et al. (2020) match identifiers with definitions from the same document ''as the author intended'', while other identifier-definition extraction methods (Schubotz et al., 2016a, 2017) rely on data external to a given document, such as links to semantic concepts on Wikidata and NTCIR-11 test data (Schubotz et al., 2015). At a broader context, the variable typing model proposed in Stathopoulos et al. (2018) relies on an external dictionary of types (Stathopoulos and Teufel, 2015, 2016; Stathopoulos et al., 2018) extracted from both the Encyclopedia of Mathematics and Wikipedia.
Vector representations have evolved to transfer knowledge from previous tasks, allowing downstream variable typing tasks to benefit from pre-trained embeddings. Overall, vector representations of text have evolved from feature-based vectors learned from scratch for a single purpose, to the modern paradigm of pretrained embeddings re-purposed for novel tasks. Kristianto et al. (2012) input pattern features into a conditional random fields model to identify definitions of expressions in LaTeX papers, while Kristianto et al. (2014a) learn vectors through a linear-kernel SVM with input features comprising sentence patterns, part-of-speech (POS) tags, and tree structures. Stathopoulos et al. (2018) extend this approach by adding type- and variable-centric features as a baseline, also with a linear kernel. Alternatively, Schubotz et al. (2017) use a Gaussian scoring function (Schubotz et al., 2016b) and pattern matching features (Pagael and Schubotz, 2014) as input to an SVM with a radial basis function (RBF) kernel, to account for non-linear feature characteristics. Alternative classification methods (Kristianto et al., 2012; Stathopoulos et al., 2018) do not use input features derived from non-linear functions, such as the Gaussian scoring function, and hence use linear kernels. Embedding spaces have been learned in this context for the purpose of ranking identifier-definiens pairs through latent semantic analysis at the document level, followed by the application of clustering techniques and methods of relating clusters to namespaces inherited from software engineering (Schubotz et al., 2016a). These cluster-based namespaces are later used for classification (Schubotz et al., 2017) rather than ranking, but do not positively impact SVM model performance, despite previous evidence suggesting they resolve coreferences (Duval et al., 2002) such as ''E is energy'' and ''E is expectation value''. Neither clustering nor namespaces have been further explored in this context. More recent work learns context-specific word representations by feeding less specific pre-trained word2vec embeddings (Mikolov et al., 2013; Stathopoulos and Teufel, 2016) to a bidirectional LSTM for classification (Stathopoulos et al., 2018). The most recent work predictably relies on more sophisticated pretrained BERT embeddings (Devlin et al., 2018) for the language modeling of mathematical notation (Jo et al., 2021). VarSlot (Ferreira et al., 2022) obtains SOTA results on variable typing (Stathopoulos et al., 2018), and demonstrates robustness to variable renaming, by fine-tuning the sentence-transformers (Reimers and Gurevych, 2019) SciBERT (Beltagy et al., 2019) encoder on augmented data, learning separate representation spaces for variables and mathematical language statements. Four BERT encoder-based approaches (Lee and Na, 2022; Popovic et al., 2022; Ping and Chi, 2022; van der Goot, 2022) were submitted to the Symlink task (Lai et al., 2022), following the trend of knowledge transfer through pretrained embeddings.

Table 1: Summary of different approaches for addressing tasks related to mathematical language processing. The methods are categorized in terms of (i) Learning: Supervised (S), Self-supervised (SS), Unsupervised (UNS), Rule-based (R) (no learning); (ii) Approach; (iii) Dataset; (iv) Metrics: MAP (Mean Average Precision), P@K (Precision at K), Perplexity, P (Precision), R (Recall), F1, Acc (Accuracy), BLEU, METEOR, MRR (Mean Reciprocal Rank), Edit (edit distance); (v) Math format: MathML, LaTeX, natural language (NL), Isabelle formal language. Diagrammatic representations of approaches in identifier-definition extraction (Figure 3), formula retrieval (Figure 4), and MWP solving (Figure 5) can be found in the Appendix.

Formula Retrieval
Formula retrieval is the task of retrieving equations similar to a query equation, with applications in math-aware search engines (Mansouri et al., 2022a). Guidi and Coen (2016) and Zanibbi and Blostein (2011) emphasize the encoding of formulae and their context for retrieval tasks.
Combining formula tree representations improves retrieval. There are two prevalent types of tree representations of formulae: Symbol Layout Trees (SLTs) and Operator Trees (OPTs), shown in Figure 2. Methods reliant solely on SLTs, such as the early versions of the Tangent retrieval system (Pattaniyil and Zanibbi, 2014; Zanibbi et al., 2015, 2016b), or solely on OPTs (Zhong and Zanibbi, 2019; Zhong et al., 2020), tend to return less relevant formulae from queries. OPTs capture formula semantics while SLTs capture visual structure (Mansouri et al., 2019). Effective representation of both formula layout and semantics within a single vector allows a model to exploit both representations. Tangent-S (Davila and Zanibbi, 2017) was the first evolution of the Tangent system to outperform the NTCIR-11 (Aizawa et al., 2014) overall best performer, MCAT (Kristianto et al., 2014b, 2016), which encoded path and sibling information from MathML Presentation (SLT-based) and Content (OPT-based) markup. Tangent-S jointly integrated SLTs and OPTs by combining scores for each representation through a simple linear regressor. Later, Tangent-CFT (Mansouri et al., 2019) considered SLTs and OPTs through a fastText (Bojanowski et al., 2017) n-gram embedding model using tree tuples. MathBERT (Peng et al., 2021) does not explicitly account for SLTs, claiming that LaTeX markup somewhat accounts for them, and therefore encodes OPTs. They pretrain the BERT (Devlin et al., 2018) model with targeted objectives, each accounting for different aspects of mathematical text. They account for OPTs by concatenating node sequences to formula + context BERT input sequences, and by formulating OPT-based structure-aware pretraining tasks learned in conjunction with masked language modeling (MLM).
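Operator trees can be illustrated with Python's ast module on plain arithmetic; the (parent, child) tuples below loosely mimic, but do not reproduce, the tree tuples indexed by Tangent-style systems:

```python
import ast

OP_NAMES = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/", ast.Pow: "^"}

def opt_tuples(expression):
    """Parse an arithmetic expression into an operator tree (OPT) and
    emit (parent, child) tuples, loosely mimicking the tree tuples
    indexed by Tangent-style retrieval systems."""
    def label(node):
        if isinstance(node, ast.BinOp):
            return OP_NAMES[type(node.op)]
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        raise ValueError(f"unsupported node: {node!r}")

    def walk(node):
        tuples = []
        if isinstance(node, ast.BinOp):
            for child in (node.left, node.right):
                tuples.append((label(node), label(child)))
                tuples.extend(walk(child))
        return tuples

    return walk(ast.parse(expression, mode="eval").body)

print(opt_tuples("a + b * c"))  # [('+', 'a'), ('+', '*'), ('*', 'b'), ('*', 'c')]
```

Note how the OPT captures operator precedence (the * subtree hangs beneath +), which is exactly the semantic structure an SLT, reflecting only left-to-right layout, does not make explicit.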
Purely explicit methods still deliver competitive results. Explicit representation methods are those that rely on prescribed representations (structural relations and associated types) rather than learned implicit relationships. Tangent-CFT (Mansouri et al., 2019) and MathBERT (Peng et al., 2021) are two models that employ learning techniques beyond the level of linear regression. Each model is integrated with Approach0 (Zhong and Zanibbi, 2019) through the linear combination of individual model scores, respectively forming the TanApp and MathApp baselines in Peng et al. (2021). Approach0 achieves the highest full bpref score of the individual models. While we focus primarily on the NTCIR-12 dataset, recent work (Zhong et al., 2022) evaluates a selection of transformer-based models on both the NTCIR-12 and ARQMath-2 (Mansouri et al., 2021b) datasets. They confirm that MathBERT delivers SOTA performance on partial bpref, and that Approach0 combined with a fine-tuned dense passage retrieval (DPR) model (Karpukhin et al., 2020) outperforms on full bpref (Approach0 + DPR). Combining explicit similarity-based search (Zhong and Zanibbi, 2019; Meadows and Freitas, 2021) with modern encoders (Khattab and Zaharia, 2020; Karpukhin et al., 2020) delivers leading performance.
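The score-combination pattern behind TanApp, MathApp, and Approach0 + DPR can be sketched as a weighted sum of normalized per-document scores (the scores and weight below are made up for illustration):

```python
def normalize(scores):
    """Min-max normalize a {doc: score} map so scores from different
    retrievers are comparable before combination."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def combine(symbolic, neural, lam=0.5):
    """Linearly combine per-document scores from a symbolic matcher and
    a neural encoder; lam is a tunable interpolation weight."""
    s, n = normalize(symbolic), normalize(neural)
    return sorted(s, key=lambda d: lam * s[d] + (1 - lam) * n[d], reverse=True)

symbolic = {"f1": 3.0, "f2": 1.0, "f3": 2.0}  # e.g., structural match scores
neural = {"f1": 0.2, "f2": 0.9, "f3": 0.8}    # e.g., dense embedding similarities
print(combine(symbolic, neural))
```

Here f3 wins overall despite leading on neither list alone, which is precisely why such combinations can outperform either retriever individually.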

Natural Language Premise Selection
Formal and informal premise selection both involve the selection of relevant statements for proving a given conjecture (Irving et al., 2016; Wang et al., 2017a; Ferreira and Freitas, 2020a). The difference lies in the language in which the premises and related proof elements are encoded (either conforming to a logical form or as they appear in mathematical text). Mathematical language as it occurs in papers and textbooks (Wolska and Kruijff-Korbayová, 2004) is not compatible with existing provers without autoformalization, a widely acknowledged bottleneck for the construction of formal proof libraries (Irving et al., 2016). Typically, when reasoning over large formal libraries comprising thousands of premises, the performance of ATPs degrades considerably, while for a given proof only a fraction of the premises are required to complete it (Urban et al., 2010; Alama et al., 2014). Theorem proving is essentially a search problem with a combinatorial search space, and the goal of formal premise selection is to reduce the space, making theorem proving tractable (Wang et al., 2017a). While formal premises are written in the languages of formal libraries such as Mizar (Rudnicki, 1992), informal premises, as seen in ProofWiki, are written in combinations of natural language and LaTeX (Ferreira and Freitas, 2020a; Welleck et al., 2021a). Proposed approaches either rank (Han et al., 2021) or classify (Ferreira and Freitas, 2020b, 2021) candidate premises for a given proof. Natural language premise selection was originally formulated as pairwise relevance classification, evaluated with F1 (Ferreira and Freitas, 2020b, 2021), but has since been evaluated with ranking metrics (Valentino et al., 2022). Alternatively, Welleck et al. (2021a) propose mathematical reference retrieval as an analogue of premise selection, where the goal is to retrieve the set of references (theorems, lemmas, definitions) that occur in a theorem's proof, formulated as a ranking problem.
Separate mechanisms for representing mathematics and natural language can improve performance. Regardless of the task variation, most current methods do not fully discriminate the semantics of mathematics and natural language, not specifically accounting for aspects of each modality. Ferreira and Freitas (2020b) extract a dependency graph representing dual-modality mathematical statements as nodes, and solve a link prediction task (Zhang and Chen, 2018). Recent transformer baselines (Ferreira and Freitas, 2020b; Welleck et al., 2021a; Han et al., 2021; Coavoux and Cohen, 2021), and those at the shared NLPS task (Valentino et al., 2022), also do not differentiate between mathematical elements and natural language (Tran et al., 2022; Kadusabe et al., 2022; Kovriguina et al., 2022). STAR (Ferreira and Freitas, 2021) purposefully separates the two modalities, encoding distinct representations through self-attention. Explicit disentanglement of the modalities forces STAR to exploit relationships between natural language and mathematics through its BiLSTM layer. Neuroscience research suggests the brain handles mathematics separately from language (Butterworth, 2002; Amalric and Dehaene, 2016; Kulasingham et al., 2021).
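The intuition behind modality separation can be sketched with a toy splitter that isolates math spans from the surrounding language before each stream is encoded separately (this split rule is our own, not STAR's architecture):

```python
import re

# Math spans: $...$ segments or bare LaTeX commands.
MATH_TOKEN = re.compile(r"\$[^$]+\$|\\[a-zA-Z]+")

def split_modalities(statement):
    """Separate a mixed statement into math spans and the remaining
    natural language, so each modality can be fed to its own
    representation mechanism."""
    math_spans = MATH_TOKEN.findall(statement)
    text = MATH_TOKEN.sub(" [MATH] ", statement)
    text = " ".join(text.split())  # normalize whitespace
    return math_spans, text

math_spans, text = split_modalities(
    "The energy $E_k = \\gamma m c^2$ reduces to $m c^2$ at rest.")
print(math_spans)
print(text)
```

The natural language stream keeps placeholder anchors ([MATH]) so the two encoders can later be re-aligned, which is the relationship-exploiting step that disentangled models must learn explicitly.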

Math Word Problems
Solving math word problems dates back to the dawn of artificial intelligence research (Feigenbaum and Feldman, 1963; Bobrow, 1964; Charniak, 1969). It can be defined as the task of translating a problem description paragraph into a set of equations to be solved (Li et al., 2020). We focus on trends in the task since 2019, as a detailed survey (Zhang et al., 2019) captures prior work.
Use of dependency graphs is instrumental to support inference. In graph-based approaches to solving MWPs, embeddings of words, numbers, or relationship graph nodes are learned through graph encoders, which feed information through to tree (or sequence) decoders. Embeddings are decoded into expression trees which determine the problem solution. Li et al. (2020) learn the mapping between a heterogeneous graph representing the input problem and an output tree. The graph is constructed from word nodes and relationship nodes of a parsing tree, either a dependency parse tree or a constituency tree. Zhang et al. (2020) represent two separate graphs: a quantity cell graph associating descriptive words with problem quantities, and a quantity comparison graph which retains the numerical qualities of the quantities and leverages heuristics to represent relationships between quantities, such that solution expressions reflect a more realistic arithmetic order. Shen and Jin (2020) also extract two graphs: a dependency parse tree and a numerical comparison graph. Zhang et al. (2022b) construct a heterogeneous graph from three subgraphs: a word-word graph containing syntactic and semantic relationships between words, a number-word graph, and a number comparison graph. Although other important differences exist (such as decoder choice), it seems models benefit from relating linguistic aspects of problem text through separate graphs.
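The quantity comparison graph construction can be sketched as follows, assuming only numeral extraction and pairwise comparison (the word-quantity association step of the full models is omitted):

```python
import re

def quantity_graphs(problem):
    """Extract problem quantities and build a quantity comparison
    graph: a directed edge (i, j) whenever quantity i > quantity j.
    A minimal sketch of one graph-construction step used by
    graph-to-tree MWP solvers."""
    quantities = [float(q) for q in re.findall(r"\d+(?:\.\d+)?", problem)]
    edges = [(i, j)
             for i in range(len(quantities))
             for j in range(len(quantities))
             if quantities[i] > quantities[j]]
    return quantities, edges

problem = ("Andrew has 3 dogs. If they each give birth to 2 others, "
           "how many dogs will he have?")
print(quantity_graphs(problem))  # ([3.0, 2.0], [(0, 1)])
```

In the full systems these comparison edges bias the decoder toward arithmetically plausible expression orderings (e.g., subtracting the smaller quantity from the larger), rather than being used directly as shown here.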
Multi-encoders and multi-decoders improve performance by combining complementary representations. Another impactful decision is the choice of encoder/decoder, and whether to consider alternative representations of a problem. To highlight this, we consider the following comparison. Shen and Jin (2020) and Zhang et al. (2020) each extract two graphs from the problem text. One is a number comparison graph, and the other relates word-word pairs (Shen and Jin, 2020) or word-number pairs (Zhang et al., 2020). Both encode two graphs rather than one heterogeneous graph (Li et al., 2020; Zhang et al., 2022b), and both use a similar tree-based decoder (Xie and Sun, 2019). A key difference is that Shen and Jin (2020) include an additional sequence-based encoder and decoder. The sequence-based encoder first obtains a textual representation of the input paragraph, then the graph-based encoder integrates the two encoded graphs. Tree-based and sequence-based decoders then generate different equation expressions for the problem, with an additional mechanism for optimizing solution expression selection. In their own work, Shen and Jin (2020) demonstrate the impact of multi-encoders/decoders over each encoder/decoder option individually through ablation. Zhang et al. (2022a) similarly combine top-down and bottom-up reasoning to achieve leading results.
Goal-driven decompositional tree-based decoders are a significant component in the state-of-the-art. Introduced in Xie and Sun (2019), this class of decoder is considered by most of the discussed approaches, and includes non-graph-based models (Qin et al., 2021; Liang et al., 2021). In GTS (the goal-driven tree-structured model of Xie and Sun, 2019), goal vectors guide construction of expression subtrees (from token node embeddings) in a recursive manner, until a solution expression tree is generated. Proposed models do expand on the GTS-based decoder through the inclusion of semantically aligned universal expression trees (Qin et al., 2020, 2021), though this adaptation is not as widely used. Some state-of-the-art models (Liang et al., 2021; Zhang et al., 2022b) follow the GTS decoder closely.
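A skeletal, non-learned analogue of goal-driven tree decoding: recursively expand a prefix token sequence into an expression tree, where each operator spawns left and right sub-goals (GTS itself predicts these tokens with learned goal vectors rather than reading them from a list):

```python
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def decode_prefix(tokens):
    """Recursively expand a prefix token sequence into an expression
    tree, mirroring in skeletal form how a goal-driven decoder expands
    a goal into an operator plus left/right sub-goals."""
    token = tokens.pop(0)
    if token in OPS:
        left = decode_prefix(tokens)   # left sub-goal
        right = decode_prefix(tokens)  # right sub-goal
        return (token, left, right)
    return float(token)               # leaf: a problem quantity

def evaluate(tree):
    """Bottom-up evaluation of the decoded expression tree."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left), evaluate(right))
    return tree

# "Andrew has 3 dogs; each gives birth to 2 more": 3 + 3 * 2
tree = decode_prefix(["+", "3", "*", "3", "2"])
print(evaluate(tree))  # 9.0
```

The recursion terminates exactly when every open sub-goal has been resolved to a leaf quantity, which is the structural property that makes top-down decoding a natural fit for equation generation.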
Language models that transfer knowledge learned from auxiliary tasks rival models based on explicit graph representation of problem text. As an alternative to encoding explicit relations through graphs, other work (Kim et al., 2020; Qin et al., 2021; Liang et al., 2021) relies on pre-trained transformer-based models, and those which incorporate auxiliary tasks assumed relevant for solving MWPs, to learn such relations latently. However, it seems that auxiliary tasks alone do not deliver competitive performance (Qin et al., 2020) without the extensive pre-training on large corpora seen in BERT-based transformer models. These use either the ALBERT (Lan et al., 2019) encoder and decoder (Kim et al., 2020), or a BERT-based encoder with a goal-driven tree-based decoder (Liang et al., 2021). More recent work (Cao et al., 2021; Jie et al., 2022; Zhang et al., 2022a) involves iterative relation extraction frameworks for predicting mathematical relations between numerical tokens.

Informal Theorem Proving
Formal automated theorem proving in logic is among the most abstract forms of reasoning materialized in the AI space. There are two major bottlenecks (Irving et al., 2016) that formal methods must overcome: (1) translating informal mathematical text into formal language (autoformalization), and (2) a lack of strong automated reasoning methods to fill in the gaps in already formalized human-written proofs. Informal methods either tackle autoformalization directly (Wang et al., 2020; Wu et al., 2022), or circumvent it through language modeling-based proof generation (Welleck et al., 2021a,b), trading formal rigor and inference control for flexibility. Transformer-based models have been proposed for mathematical reasoning (Polu and Sutskever, 2020; Rabe et al., 2020; Wu et al., 2021). Converting informal mathematical text into forms which are interpretable by computers (Kaliszyk et al., 2015a,b; Szegedy, 2020; Wang and Deng, 2020; Meadows and Freitas, 2021) can strategically impact the dialogue between knowledge expressed in natural text and a large spectrum of solvers.
Autoformalization could be addressed through approximate translation and exploration rather than direct machine translation. A long-studied and challenging endeavour (Zinn, 1999, 2003), autoformalization involves converting informal mathematical text into language interpretable by theorem provers (Kaliszyk et al., 2015b; Wang et al., 2020; Szegedy, 2020). Kaliszyk et al. (2015b) propose statistical learning methods for parsing ambiguous formulae over the Flyspeck formal mathematical corpus (Hales, 2006). Using machine translation models (Luong et al., 2017; Lample et al., 2018; Lample and Conneau, 2019), Wang et al. (2020) explore dataset translation experiments between LaTeX code extracted from ProofWiki and the formal libraries Mizar (Rudnicki, 1992) and TPTP (Sutcliffe and Suttner, 1998). The supervised RNN-based neural machine translation model (Luong et al., 2017) outperforms the transformer-based (Lample et al., 2018) and MLM pre-trained transformer-based (Lample and Conneau, 2019) models, with the performance boost stemming from its use of alignment data. Szegedy (2020) advises against such direct translation efforts, instead proposing a combination of exploration and approximate translation through predicting formula embeddings. In seq2seq models, embeddings are typically granular, encoding word-level or symbol-level (Jo et al., 2021) tokens. The method consists of learning mappings from natural language input to premise statements near the desired statement in the embedding space, then traversing the space between statements using a suitable prover (Bansal et al., 2019). Guided mathematical exploration for real-world proofs is still an unaddressed problem and does not scale well with the step-distance between the current and desired conjecture. Wu et al. (2022) directly autoformalize small competition problems to Isabelle statements using language models. In line with earlier suggestions (Szegedy, 2020), they also autoformalize statements as targets for proof search with a neural theorem prover.
The need for developing robust interactive natural language theorem provers. We discuss the closest equivalent to formal theorem proving in an informal setting. Welleck et al. (2021a) propose a mathematical reference generation task. Given a mathematical claim, the order and number of references within a proof are predicted. A reference is a theorem, definition, or a page that is linked to within the contents of a statement or proof. Each theorem x has a proof containing a sequence of references y = (r_1, ..., r_|y|), for references r_m ∈ R. Where the retrieval task assigns a score to each reference in R, the generation task produces a variable-length sequence of references with the goal of matching y, for which a BERT-based model is employed and fine-tuned on various data sources. Welleck et al. (2021b) expand on their proof generation work, proposing two related tasks: next-step suggestion, where a step from a proof y (as described above) is defined as a sequence of tokens to be generated, given the previous steps and x; and full-proof generation, which extends this to generate the full proof. They employ BART (Lewis et al., 2019), an encoder-decoder model pre-trained with denoising tasks, and augment the model with reference knowledge using Fusion-in-Decoder (Izacard and Grave, 2020). The intermediate denoising training and knowledge-grounding improve model performance by producing better representations of (denoised) references for deployment at generation time, and by encoding reference-augmented inputs. Minerva (Lewkowycz et al., 2022) is a language model capable of producing step-wise reasoning with mathematical language (LaTeX). They fine-tune a PaLM decoder-only model (Chowdhery et al., 2022) on webpages containing MathJax-formatted expressions, and evaluate on school-level math problems (Hendrycks et al., 2021; Cobbe et al., 2021), a STEM subset of problems (Hendrycks et al., 2020) of varying difficulty, undergraduate-level STEM problems, and the National Math Exam in Poland. They evaluate for generalization capabilities by generating 20 alternative evaluation problems, perturbing problem wording and numerical values in the MATH (Hendrycks et al., 2021) dataset, and comparing accuracy before and after the change. While they suggest ''minimal memorization'', the numerical intervention comparison does less to support this claim.

Datasets
Various datasets have been proposed for tasks related to identifier-definition extraction and variable typing (Schubotz et al., 2016a; Alexeeva et al., 2020; Stathopoulos et al., 2018; Jo et al., 2021), with limited adoption. The Symlink shared task (Lai et al., 2022) is an emerging solution, with training data, annotations of 102 papers, and high inter-annotator agreement. Formula retrieval data exists through NTCIR-12 (Zanibbi et al., 2016a), which has been expanded in the most recent ARQMath task (Mansouri et al., 2022b), removing formula duplicates and balancing query complexity. Premise selection datasets include PS-ProofWiki (Ferreira and Freitas, 2020a), used in the NLPS shared task (Valentino et al., 2022), and NaturalProofs (Welleck et al., 2021a). The latter is more inclusive, comprising ProofWiki, textbooks, and other sources. Modern consensus MWP datasets include (easy) MAWPS (Koncel-Kedziorski et al., 2016), (medium) Math23K (Wang et al., 2017b), and (hard) MathQA (Amini et al., 2019), comprising both Chinese and English problems. GSM8K (Cobbe et al., 2021) claims to resolve diversity, quality, and language (Huang et al., 2016) issues from previous datasets, and involves step-wise reasoning and natural language solutions with balanced difficulty. MATH (Hendrycks et al., 2021) is larger and more difficult than GSM8K. Informal theorem proving data includes NaturalProofs (Welleck et al., 2021a), and some MWP datasets involving step-wise reasoning with mathematical language, such as MATH and GSM8K. However, there is no consensus data for autoformalization or theorem proving from mathematical language input involving sequence learning. ProofNet (Azerbayev et al., 2022) aims to remedy this by providing 297 theorem statements expressed in both natural and formal (Moura et al., 2015) language, at undergraduate difficulty; some are accompanied by informal proofs. MiniF2F (Zheng et al., 2021) is a neural theorem proving benchmark of Olympiad-level problems written in multiple formal languages. Lila (Mishra et al., 2022) provides data for 23 math reasoning tasks. Key dataset information is described in Table 2.
Data Scarcity. Some datasets, such as MATH and the Auxiliary Mathematics Problems and Solutions (AMPS) (Hendrycks et al., 2021) datasets, include detailed workings at high school to undergraduate level difficulty. If we aim to use models to produce new mathematics, equivalent datasets composed of the research workings of actual mathematicians would be invaluable. Meadows and Freitas (2021) attempt to tackle this problem for a single research paper in a very limited setting.

Discussion
State-of-the-art. In identifier-definition extraction, leading performance is obtained on Symlink by Lee and Na (2022), using a SciBERT encoder and an MRC-based model (Li et al., 2019). Importantly, rather than the BERT tokenizer, they use a rule-based symbol tokenizer, evidencing the benefits of discerning natural language from math elements. VarSlot (Ferreira et al., 2022) leads in variable typing, and echoes the importance of such discrimination (see Section 3.2).
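The idea behind a rule-based symbol tokenizer can be sketched as follows. This is a minimal illustration of separating math tokens from natural-language words before encoding, not the actual tokenizer of Lee and Na (2022); the regular expression and function names are assumptions.

```python
import re

# Hypothetical rules for math tokens: LaTeX commands (\hbar), subscripted
# identifiers (E_k), and operator/bracket symbols. Illustrative only.
MATH_TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]_\{?[A-Za-z0-9]+\}?|[=+\-*/^(){}]")

def split_math_and_text(sentence):
    """Return (words, math_tokens) from a mixed math/language sentence."""
    math_tokens = MATH_TOKEN.findall(sentence)
    # Blank out the math spans, then keep the remaining alphabetic words.
    text_only = MATH_TOKEN.sub(" ", sentence)
    words = [w for w in text_only.split() if w.isalpha()]
    return words, math_tokens

words, maths = split_math_and_text(r"the energy E_k = \hbar \omega of the mode")
# words → ['the', 'energy', 'of', 'the', 'mode']
# maths → ['E_k', '=', '\\hbar', '\\omega']
```

Each stream can then be passed to its own encoder (or tokenized with modality-appropriate vocabularies), rather than letting a subword tokenizer fragment math symbols arbitrarily.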
In formula retrieval, SOTA methods generally include linear combinations of scores obtained from symbolic and neural models. On NTCIR-12, Zhong et al. (2022) show that MathBERT leads on partial bpref, and Approach0 + DPR leads on full bpref (see Section 3.2). Approach0 + ColBERT (Khattab and Zaharia, 2020) leads on ARQMath-2 (Mansouri et al., 2021b). This work reinforces the importance of including formula structure across multiple tasks. In premise selection, leading results are obtained on the shared NLPS task by a fine-tuned RoBERTa-large encoder (Liu et al., 2019b), computing similarity scores between statements with Manhattan distance (Tran et al., 2022). However, none of the competing models discern mathematical elements from natural language, or include formula structure. In MWP solving, the multi-view model (Zhang et al., 2022a) achieves state-of-the-art results on Math23K, MAWPS, and MathQA. Minerva, and the Diverse approach (Li et al., 2022) based on OpenAI code-davinci-002, lead on MATH. Minerva also beats the national 57% average by 8% on the Polish national math exam.
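A linear combination of symbolic and neural retrieval scores can be sketched minimally as below; the interpolation weight, function names, and scores are illustrative assumptions, not values from Approach0 + DPR or any published system.

```python
def interpolate_scores(symbolic, neural, alpha=0.5):
    """Linearly combine per-formula relevance scores from a structure-based
    (symbolic) retriever and a dense (neural) retriever.

    alpha is a tuned interpolation weight; 0.5 here is illustrative only.
    """
    combined = {}
    for doc_id in set(symbolic) | set(neural):
        s = symbolic.get(doc_id, 0.0)  # symbolic score, 0 if not retrieved
        n = neural.get(doc_id, 0.0)    # neural score, 0 if not retrieved
        combined[doc_id] = alpha * s + (1.0 - alpha) * n
    # Rank formula ids by combined score, best first.
    return sorted(combined, key=combined.get, reverse=True)

ranking = interpolate_scores({"f1": 0.9, "f2": 0.4}, {"f2": 0.8, "f3": 0.7})
# ranking → ["f2", "f1", "f3"]
```

The symbolic retriever rewards exact structural matches while the neural retriever captures softer semantic similarity; interpolating their scores lets each compensate for the other's misses.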
In informal theorem proving, we discuss autoformalization and theorem proving from mathematical language. In the former, code-davinci-002 leads on ProofNet. In the latter, a BART-based model leads on NaturalProofs, and Codex (Chen et al., 2021), fine-tuned on autoformalized theorems (Wu et al., 2022), leads on MiniF2F. These latter methods, particularly those that score highly on MATH, largely consist of fine-tuning generative LLMs, also without distinctly considering mathematical content or structure.
Separate Representations for Math and Natural Language. Many models benefit from processing each modality separately. The leading model on Symlink uses a special tokenizer to extract math symbols from scientific documents (Lee and Na, 2022). VarSlot improves variable typing by learning representation spaces for variables and mathematical language statements (Ferreira et al., 2022). STAR (Ferreira and Freitas, 2021) improves on a self-attention baseline encoding combined math/language statements by separately encoding math and language with the same encoder. MathBERT learns embeddings from tree and LaTeX representations of formulae, and natural language (Peng et al., 2021), and is also used in math question answering (Mansouri et al., 2021a). The Approach0 + [encoder] models linearly combine scores from entirely different methods; one designed for formulae, and one for language (Zhong et al., 2022). Multi-view learns an embedding each for words, quantities, and operations (Zhang et al., 2022a).
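A toy sketch of this design choice: encode each modality separately with the same encoder and concatenate the results, rather than encoding the interleaved sequence jointly. The hash-based bag-of-tokens "encoder" below is a stand-in assumption for a real shared neural encoder like STAR's; all names are hypothetical.

```python
def toy_encode(tokens, dim=8):
    """Stand-in encoder: a hashed bag-of-tokens vector (not a real model)."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[hash(tok) % dim] += 1.0
    return vec

def encode_statement(words, math_tokens, dim=8):
    # Encode each modality with the *same* encoder, then concatenate,
    # instead of encoding the mixed math/language sequence as one stream.
    return toy_encode(words, dim) + toy_encode(math_tokens, dim)

vec = encode_statement(["let", "f", "be", "continuous"], ["f(x)", "=", "x^2"])
# len(vec) == 16: one dim-8 vector per modality.
```

The point of the sketch is the split-then-combine structure: downstream layers receive a representation in which math and language contributions are kept distinguishable.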
Combining Complementary Representations from the Same Input. Combined use of OPTs and SLTs of the same formula has been suggested to improve formula retrieval performance (Davila and Zanibbi, 2017; Mansouri et al., 2019; Mansouri et al., 2021a). This extends to dual-modality mathematical language input. Shen and Jin (2020) use a Gaussian scoring function in this setting.

Math Word Problem Limitations. In Graph2Tree-Z, Zhang et al. (2020) suggest considering more complex relations between quantities and language, and introducing heuristics to improve solution expression generation from the tree-based decoder. In EPT, Kim et al. (2020) find that the probability of fragmentation-related errors increases exponentially with the number of unknowns, and propose generalizing EPT to other MWP datasets. The authors of HGEN (Zhang et al., 2022b) note three areas of future improvement: combining models into a unified framework by ensembling multiple encoders (similar to Ferreira and Freitas, 2021); integrating external knowledge sources (e.g., HowNet (Dong and Dong, 2003), Cilin (Hong-Minh and Smith, 2008)); and real-world dataset development for unsupervised or weakly supervised approaches (Qin et al., 2020).
Informal Theorem Proving Limitations. Meadows and Freitas (2021) note the significant cost of semi-automated formalization, requiring detailed expert-level manual intervention. They also call for a set of well-defined computer algebra operations such that robust mathematical exploration can be guided in a goal-based setting.

Figure 1: Extractive tasks tend not to require inference chains to solve them, compared to more abstractive tasks. Identifier-definition extraction assigns identifiers (e.g., ψ(x)) to their context. Formula retrieval considers the structure of formulae, and scores them based on similarity to a query formula. Premise selection selects statements most likely to be useful for solving a proof. Solving MWPs (math word problems) involves calculating solutions to arithmetic problems. Informal theorem proving involves the production of proofs and inference chains combining natural and mathematical language.

Figure 2: Formula (a) y = e^x with its Symbol Layout Tree (SLT) (b), and Operator Tree (OPT) (c). SLTs represent formula appearance by the spatial arrangement of math symbols, while OPTs define the mathematical operations represented in expressions. For more detail, see Mansouri et al. (2019).
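The two tree views of y = e^x in Figure 2 can be written down as small recursive structures; the tuple encoding and rendering function below are illustrative assumptions, not any standard serialization format.

```python
# Operator Tree (OPT) for y = e^x: nodes are (label, children) tuples.
# The equality node applies to y and exp(x).
opt = ("=", [("y", []), ("exp", [("x", [])])])

# A Symbol Layout Tree (SLT) instead records spatial relations, e.g. that
# 'x' sits in the superscript position of 'e'; sketched as labelled edges.
slt = ("y", {"right": ("=", {"right": ("e", {"sup": ("x", {})})})})

def render_opt(node):
    """Render an operator tree back to an infix-ish string (toy)."""
    label, children = node
    if not children:
        return label
    if label == "=":
        return render_opt(children[0]) + " = " + render_opt(children[1])
    return label + "(" + ", ".join(render_opt(c) for c in children) + ")"

render_opt(opt)  # → "y = exp(x)"
```

The contrast is visible in the data: the OPT names the operation (exp) while the SLT only records where symbols sit relative to one another, which is why the two views are complementary for retrieval.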

Table 2: Key datasets for the representative tasks.
Models of this kind (Tran et al., 2022) are state-of-the-art and show an advantage over baselines that do not invoke separate mechanisms. Despite this evidence, methods related to informal theorem proving and premise selection, such as Minerva, IJS (Tran et al., 2022), and others, do not discriminate math from language. This is likely true for other subfields of MLP.
Formula Retrieval Limitations. Current systems provide only basic support for wildcards. Tangent-CFT also does not evaluate on wildcard queries, and the authors suggest extending the test selection to include more diverse formulae, particularly those that are not present as exact matches. They propose integrating nearby text into learned embeddings. MathBERT (Peng et al., 2021) performs such integration, but does not learn n-gram embeddings. MathBERT evaluates on non-wildcard queries only.

Informal Premise Selection Limitations. Limitations involve a lack of structural consideration of formulae and limited variable typing abilities. Ferreira and Freitas (2020b) note that the graph-based approach to premise selection as link prediction struggles to encode mathematical statements which are mostly formulae, and suggest the inclusion of structural embeddings (e.g., MathBERT (Peng et al., 2021)) and training BERT on a mathematical corpus. They also describe value in formulating sophisticated heuristics for navigating the premises graph. Later, following a Siamese network architecture (Ferreira and Freitas, 2021) reliant on dual-layer word/expression self-attention and a BiLSTM (STAR), the authors demonstrate that STAR does not appropriately encode the semantics of variables. They suggest that variable typing and representation are a fundamental component of encoding mathematical statements. Han et al. (2021) plan to explore the effect of varying pre-training components, testing zero-shot performance without contrastive fine-tuning, and unsupervised retrieval. Coavoux and Cohen (2021) propose a statement-proof matching task akin to informal premise selection, with a solution reliant on a self-attentive encoder and bilinear similarity function. The authors note model confusion due to the proofs introducing new concepts and variables rather than referring to existing concepts.
Welleck et al. (2021a) suggest the development of high-quality datasets for evaluating translation models, including structural formula representations, and jointly embedding multiple proof assistant libraries to increase formal dataset size. Szegedy (2020) argues that reasoning systems based on self-driven exploration without informal communication abilities would suffer usage and evaluation difficulties. Wu et al. (2022) note limitations with text window size and difficulty storing large formal theories with current language models. After proposing the NaturalProofs dataset, Welleck et al. (2021a) characterize error types for the full-proof generation and next-step suggestion tasks, noting issues with: (1) hallucinated references, meaning the reference does not occur in NaturalProofs; (2) non-ground-truth references, meaning the reference does not occur in the ground-truth proof; (3) undefined terms; (4) improper or irrelevant statements, meaning statements that are mathematically invalid (e.g., 2/3 ∈ Z) or irrelevant to the proof; and (5) statements that do not follow logically from the preceding statements. Dealing with research-level physics,