Machine-Learning Arithmetic Curves

We show that standard machine-learning algorithms may be trained to predict certain invariants of low genus arithmetic curves. Using datasets of size around one hundred thousand, we demonstrate the utility of machine-learning in classification problems pertaining to the BSD invariants of an elliptic curve (including its rank and torsion subgroup), and the analogous invariants of a genus 2 curve. Our results show that a trained machine can efficiently classify curves according to these invariants with high accuracies (>0.97). For problems such as distinguishing between torsion orders, and the recognition of integral points, the accuracies can reach 0.998.

In this article we build on recent work by the present authors, namely [HLOa] and [HLOb]. In the latter, we presented experiments demonstrating the capacity of machine-learning to predict basic invariants of algebraic number fields. Many of the invariants studied there appear together in the analytic class number formula:
\[ \lim_{s \to 1} (s - 1)\,\zeta_F(s) = \frac{2^{r_1} (2\pi)^{r_2} \, \mathrm{Reg}_F \, h_F}{w_F \sqrt{|\Delta_F|}}, \tag{1.1} \]
in which $\zeta_F(s)$ is the Dedekind zeta function of a number field $F$, $(r_1, r_2)$ is the signature, $\mathrm{Reg}_F$ is the regulator, $h_F$ is the class number, $\Delta_F$ is the discriminant, and $w_F$ is the number of roots of unity in $\mathcal{O}_F$.
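Recall that the set $E(\mathbb{Q})$ of rational points on an elliptic curve $E$ defined over $\mathbb{Q}$ forms a finitely generated abelian group. We will denote the rank of this group by $r$, and the torsion subgroup by $E(\mathbb{Q})_{\mathrm{tors}}$. Associated to an elliptic curve $E$ over $\mathbb{Q}$ is its Hasse-Weil $L$-function $L(E, s)$, and the Birch and Swinnerton-Dyer (BSD) conjecture predicts that
\[ \frac{L^{(r)}(E, 1)}{r!} = \frac{\#\text{Ш}(E/\mathbb{Q}) \cdot \Omega_E \cdot \mathrm{Reg}_E \cdot \prod_p c_p}{\left(\#E(\mathbb{Q})_{\mathrm{tors}}\right)^2}, \tag{1.2} \]
in which Ш$(E/\mathbb{Q})$ is the Tate-Shafarevich group, $\Omega_E$ is the real period, $\mathrm{Reg}_E$ is the regulator, and the $c_p$ are the Tamagawa numbers. In this paper, we will apply machine-learning techniques to predict, with varying levels of success, the invariants appearing in equation (1.2).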
Much about equation (1.2) remains mysterious. For example, it is unknown whether or not the Tate-Shafarevich group appearing on the right-hand side is finite, cf. [Sil1, Chapter X, Conjecture 4.13], and there is no rigorous method known for its computation. Nevertheless, when the Tate-Shafarevich group is finite, its order is known to be a square [Sil1, Chapter X, Theorem 4.14]. As for the left-hand side, it is not yet known whether or not, as $E$ varies, the set of ranks $r$ is bounded (a heuristic model suggesting that this set might be bounded was presented in [PPVW]). Furthermore, in a suitable asymptotic sense, it is conjectured that 50% of elliptic curves have rank 0, 50% have rank 1, and 0% have rank ≥ 2; see [Gol79, Conjecture B], [KS99a, KS99b].
For a broad introduction to machine-learning, see [GBC, HTF]. In this paper we will utilise logistic regression, naive Bayes, and random forest classifiers, which are reviewed in [HTF, Sections 4.4, 6.6.3, 15]. Machine-learning algorithms require the training of classifiers using large sets of data, and we obtain our data sets from [LMFDB]. Previous elliptic curve machine-learning experiments were documented in [ABH], as part of a recent programme of machine-learning various structures in mathematics [He17, He19, HeBook]. The key difference in the present article is that, whilst [ABH] utilised Weierstrass coefficients as training data (which had enormous variation in magnitude), we will here lean more heavily on the Euler factors of L-functions. We observe this to be much more successful, allowing even for extrapolation to elliptic curves with conductors in ranges beyond those in the training dataset.
This paper also studies genus 2 curves over Q. There are some important differences between elliptic curves and those of genus 2. For example, there are known to be only finitely many rational points on a genus 2 curve. In contrast, an elliptic curve with positive rank has infinitely many rational points. On the other hand, the rational points on the Jacobian of a genus 2 curve form a finitely generated abelian group which could be infinite. By the rank of a genus 2 curve, we mean the rank of its Jacobian. For elliptic curves, classification by rank is essentially binary because curves of higher rank do not provide enough training data (as per the conjecture mentioned above). On the other hand, there is a significant proportion of genus 2 curves with rank 2, which allows for a ternary classification.
We remark that there is some comparison to be made with [HLOa], in which we studied machine-learning of a particular classification problem arising from the Sato-Tate conjecture for hyperelliptic curves. There, we found that naive Bayes classifiers could distinguish, with accuracies in the range 0.99 to 1.00, the Sato-Tate groups of hyperelliptic curves with genus 1 or 2, using a small number of Euler factors. The experimental results of this paper show that the same method is just as powerful for other invariants. It is interesting that a method as simple as naive Bayes could do so well for various invariants in number theory. This might suggest that mathematics is more amenable to machine-learning than the real world, where data sets can be obscured or distorted by noise.
We conclude this introduction with an overview of what is to come. In Section 2 notations are fixed with brief explanations of some concepts. In Section 3, the generation of training data and the experimental set-up are explained. In Section 4 we document the experimental outcomes for elliptic curves. In Section 5, we do the same for genus 2 curves. Finally, in Section 6, we offer some concluding remarks and tentative directions for further research.

Notation
We use the following notation throughout:
Elliptic curve defined over $\mathbb{Q}$ is denoted by $E$. The set $E(\mathbb{Q})$ of rational points defines a finitely generated abelian group.
Genus 2 curve defined over $\mathbb{Q}$ is denoted by $C$. The curve $C$ is assumed to be smooth, projective, and geometrically integral. The set $C(\mathbb{Q})$ of rational points is finite.
Jacobian of $C$ is denoted by $J$. The Jacobian is a two-dimensional abelian variety defined over $\mathbb{Q}$. The set $J(\mathbb{Q})$ of rational points defines a finitely generated abelian group.
Rank of $E$ (resp. $C$), denoted by $r_E$ (resp. $r_C$), is the rank of the finitely generated abelian group $E(\mathbb{Q})$ (resp. $J(\mathbb{Q})$). If the rank is 0 (resp. positive), then there are finitely many (resp. infinitely many) rational points.
Torsion subgroup of $E(\mathbb{Q})$ (resp. $J(\mathbb{Q})$) is denoted by $E(\mathbb{Q})_{\mathrm{tors}}$ (resp. $J(\mathbb{Q})_{\mathrm{tors}}$).
Cyclic group of order $n$ is denoted $C_n$. The torsion subgroup of $E(\mathbb{Q})$ is a product of cyclic groups.
Good primes of a variety $X$ defined over $\mathbb{Q}$ are those primes $p \in \mathbb{Z}$ such that $X$ has an integral model whose reduction modulo $p$ defines a smooth variety of the same dimension. A good prime for $C$ is a good prime for $J$, but the converse is not necessarily true.
Bad primes of a variety $X$ defined over $\mathbb{Q}$ are those primes $p \in \mathbb{Z}$ which are not good. The bad reduction types of an elliptic curve are reviewed in [Sil1, Section VII.5].
Conductor of $E$ (resp. $C$), denoted by $Q_E$ (resp. $Q_C$), is a positive integer of the form $\prod_p p^{e_p}$, in which $p$ varies over the bad primes for $E$ (resp. $J$). The power $e_p$ to which a bad prime $p$ appears depends on the reduction type, cf. [Sil2, Section IV.10].
Tate-Shafarevich group of $E$ (resp. $J$), denoted by Ш$(E/\mathbb{Q})$ (resp. Ш$(J/\mathbb{Q})$), is a torsion abelian group and measures the extent to which the Hasse principle fails to hold, cf. [Sil1, Section X.4].

Methodology
In this section we explain our experimental set-up. In particular, we construct the training and validation sets from appropriate data and overview the machine-learning strategies used.

Euler factors
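Let $X$ be a smooth, projective, geometrically connected curve of genus $g \in \{1, 2\}$.
For each good prime $p$ of $X$, we define the local zeta function to be:
\[ Z(X/\mathbb{F}_p, T) = \exp\left( \sum_{k=1}^{\infty} \frac{\#X(\mathbb{F}_{p^k})}{k} \, T^k \right). \]
It is well-known that the local zeta function can be written in the form
\[ Z(X/\mathbb{F}_p, T) = \frac{L_p(X, T)}{(1 - T)(1 - pT)}, \]
where $L_p(X, T) \in \mathbb{Z}[T]$ is a polynomial of degree $2g$ with constant term 1.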
EXAMPLE 1. If $X = E$ is an elliptic curve defined over $\mathbb{Q}$ and $p$ is a good prime for $E$, then:
\[ L_p(E, T) = 1 - a_p T + p T^2, \]
\[ a_p = p + 1 - \#E(\mathbb{F}_p). \tag{3.6} \]
For a bad prime $p$, we also define $a_p$ as in equation (3.6). Using SageMath [Sage], we may compute a large number of the $a_p$ quickly. For $i \in \mathbb{Z}_{>0}$, let $p_i$ denote the $i$th prime. For a positive integer $N$, we introduce the vector:
\[ v_L(E) = (a_{p_1}, \dots, a_{p_N}) \in \mathbb{Z}^N. \tag{3.7} \]
In practice, we will take $N$ to be 100, 200, 300 or 500. We note that the 100th prime is 541, the 200th is 1223, the 300th is 1987, and the 500th is 3571.
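As an illustration of Example 1, the following Python sketch computes the vector $v_L(E)$ of (3.7) by naive point counting, using $a_p = p + 1 - \#E(\mathbb{F}_p)$ as in (3.6). It is not the SageMath computation used for the experiments: the function names are hypothetical, the example curve $y^2 + y = x^3 - x^2$ (conductor 11) is chosen here only for illustration, and the $O(p^2)$ loop is practical only for small primes. The input is assumed to be a minimal Weierstrass model, so that the same count also gives $a_p$ at bad primes.

```python
def first_primes(n):
    """Return the first n primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % q for q in primes if q * q <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def a_p(coeffs, p):
    """a_p = p + 1 - #E(F_p) for y^2 + a1*x*y + a3*y = x^3 + a2*x^2 + a4*x + a6.

    Counts affine solutions modulo p by brute force, plus the point at infinity.
    """
    a1, a2, a3, a4, a6 = coeffs
    count = 1  # the point at infinity
    for x in range(p):
        rhs = (x**3 + a2 * x * x + a4 * x + a6) % p
        for y in range(p):
            if (y * y + a1 * x * y + a3 * y - rhs) % p == 0:
                count += 1
    return p + 1 - count


def euler_vector(coeffs, n):
    """The vector (a_{p_1}, ..., a_{p_n}) over the first n primes."""
    return [a_p(coeffs, p) for p in first_primes(n)]


# Illustrative curve (an assumption of this sketch): y^2 + y = x^3 - x^2.
print(euler_vector([0, -1, 1, 0, 0], 6))
```

For the sizes of $N$ used in this paper, a dedicated point-counting routine such as those provided by SageMath would be the sensible choice; the sketch is only meant to make the definition concrete.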
EXAMPLE 2. If $X = C$ is a smooth projective geometrically connected genus 2 curve defined over $\mathbb{Q}$ and $p$ is a good prime for $C$, then:
\[ L_p(C, T) = 1 + a_{1,p} T + a_{2,p} T^2 + a_{1,p}\, p\, T^3 + p^2 T^4, \quad a_{1,p}, a_{2,p} \in \mathbb{Z}. \tag{3.8} \]
For a bad prime $p$, we will simply use the convention
\[ (a_{1,p}, a_{2,p}) = (0, p). \tag{3.9} \]
Using SageMath [Sage], we may compute $(a_{1,p}, a_{2,p})$. For a positive integer $N$, we introduce the vector $v_L(C)$ (3.10), formed from the pairs $(a_{1,p_i}, a_{2,p_i})$, where we do not include $p_1 = 2$ as it is always bad. In practice, we will take $N = 200$.
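For orientation, the coefficients in (3.8) can be related to point counts. The following is a sketch under the assumption that $L_p(C, T)$ is the numerator of the local zeta function, factored as $\prod_{i=1}^{4} (1 - \alpha_i T)$ so that $\#C(\mathbb{F}_{p^k}) = p^k + 1 - \sum_i \alpha_i^k$; the symbols $\alpha_i$ are introduced here and do not appear elsewhere in the text.

```latex
\begin{align*}
a_{1,p} &= -\sum_{i} \alpha_i = \#C(\mathbb{F}_p) - p - 1, \\
a_{2,p} &= \sum_{i<j} \alpha_i \alpha_j
         = \frac{1}{2}\Bigl( \bigl(\textstyle\sum_i \alpha_i\bigr)^2 - \sum_i \alpha_i^2 \Bigr)
         = \frac{1}{2}\Bigl( a_{1,p}^2 + \#C(\mathbb{F}_{p^2}) - p^2 - 1 \Bigr).
\end{align*}
```

In particular, for a good prime the pair $(a_{1,p}, a_{2,p})$ is determined by the numbers of points of $C$ over $\mathbb{F}_p$ and $\mathbb{F}_{p^2}$.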
Given a finite set $F$ of curves $X$ and an invariant $I(X)$ for each $X \in F$, we associate the following labeled dataset:
\[ D = \{ v_L(X) \to I(X) : X \in F \}. \tag{3.11} \]
We will refer to the entries in $v_L(X)$ as the Euler coefficients of $X$. When $X$ is an elliptic curve (resp. a genus 2 curve), the Euler coefficients are integers (resp. pairs of integers).

Experimental strategy
1. Let F be a finite set of elliptic curves (resp. smooth projective geometrically connected genus 2 curves). The choice of F depends on the experiment. For example, F could be the set of elliptic curves (resp. genus 2 curves) over Q with conductor less than some bound and rank in the set {0, 1}.
2. For an elliptic curve E (resp. genus 2 curve C) in F, let I(E) (resp. I(C)) denote an invariant of interest. For example, I(E) (resp. I(C)) could be the rank of E(Q) (resp. J(Q)).

3. Generate a dataset $D$ as in (3.11)*. We will take $N$ to be one of: 100, 200, 300, 500. We stress at this point that $N$ is an absolute constant, and does not vary with the curves in $D$.
4. Choose a subset $T \subset D$ and denote its complement by $V = D - T$. We will refer to $T$ as the training dataset, and $V$ as the validation dataset. It is important that the training set and validation set have no intersection, so that over-fitting is not mistaken for genuine learning. We will not typically specify $T$, or its size relative to $D$, as the choice will not impact significantly on the results. See also step 7.
* In exceptional circumstances, we will in fact construct different datasets in place of $D$. We will do this, for example, in the investigation of particularly accurate classifiers as in Section 4.6, and in an attempt to improve on a poorly performing classifier as in Section 4.5. Such a digression from convention will always be clearly indicated.
5. Train a classifier on the set $T$ with a standard supervised-learning algorithm. In this paper we will use naive Bayes, random forests, and logistic regression; see [HTF, Sections 4.4, 6.6.3, 15]. We implement the algorithms using Mathematica [Wolf].
6. For all curves $X$ in $V$, ask the classifier to determine $I(X)$. We record the precision and confidence, which together constitute a good measure of accuracy and performance of the machine. The precision and confidence are real numbers in the interval [0, 1], and the aspiration is that both are close to 1. By precision, we mean the proportion of predictions in agreement with [LMFDB], the validity of which is discussed in [LMFDB, Reliability of elliptic curve data over Q, Reliability of genus 2 curve data over Q]. By confidence, we mean the Matthews correlation coefficient [Matt]. The confidence value is an extra check intended to minimize false positives and false negatives.
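Steps 4 and 6 can be sketched in Python as follows. This is a minimal illustration rather than the Mathematica workflow used for the experiments: the function names, the 80/20 split, and the restriction to binary labels are assumptions of the sketch. "Precision" is computed in the sense defined above (the proportion of predictions that agree with the reference labels), and "confidence" is the binary Matthews correlation coefficient.

```python
import math
import random


def split_dataset(dataset, train_fraction=0.8, seed=0):
    """Split a dataset into disjoint training and validation parts.

    The 80/20 default is an assumption of this sketch; the text does not fix it.
    """
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def precision_and_confidence(true_labels, predicted_labels, positive=1):
    """Return (proportion of correct predictions, binary Matthews coefficient)."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    precision = (tp + tn) / len(true_labels)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is taken to be 0 when the denominator vanishes.
    confidence = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return precision, confidence
```

A ternary problem, such as the rank classification for genus 2 curves, would require the multi-class generalisation of the Matthews coefficient, which is not covered by this sketch.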

Elliptic curves
In this section we describe our experimental results for elliptic curves defined over Q.
For standard algorithms used in the computation of the invariants discussed below, the reader is referred to [Cre97]. To perform the experiments in this section, we downloaded data from [LMFDB, Elliptic curves over Q], the completeness of which is discussed in [LMFDB, Completeness of elliptic curve data over Q]. We note that the Hasse-Weil L-function of an elliptic curve $E$ is an invariant of its isogeny class, and so we in fact downloaded a representative curve for each isogeny class. On the LMFDB, an isogeny class is represented by an optimal curve, and hence our data sets are generated from optimal curves only. In general, the torsion order, torsion structure and the number of integral points, considered in Sections 4.2 to 4.4, are not uniquely determined by an isogeny class.
Table 1: The above table shows the precision and confidence of a logistic regression classifier when asked to distinguish elliptic curves over Q with rank 0 from those with rank 1. The classifier is trained on $E$ with conductor $Q_E$ in the ranges specified by the first column, using the number of Euler factors given in the second column. The classifier is verified on $E$ with conductor in the ranges specified by the third column.

Rank
Recall that we denote by $r_E$ the rank of an elliptic curve $E$.
Furthermore, it is known that if $r_E \le 1$ then $r_E$ is equal to the order of vanishing of $L(E, s)$ at $s = 1$. It is therefore expedient to consider this as a binary classification problem using the vectors $v_L$ defined by Euler factors as in (3.7). For different ranges of conductor $Q_E$, we established a balanced dataset of size $\sim 2 \times 10^4$ (×2) for rank 0 and rank 1.
Trying several standard classifiers, we find that logistic regression worked best, and the results are summarized in Table 1. We see that the accuracies are in the high 0.90s, which is reassuring evidence that a machine can learn the ranks of elliptic curves. What is of particular interest is the last line in the table, where we trained on 300 Euler factors for conductors in the range from 1 to $10^4$ but validated on those in the range from $2 \times 10^4 + 1$ to $3 \times 10^4$, and still achieved a 0.92 precision.
The results show that the number of Euler factors needed for high precision is about $3\sqrt{\max\{Q_E\}}$ in the range of $Q_E$ we considered. We note that a logistic regression classifier also performed best in distinguishing the ranks of algebraic number fields [HLOb] when number fields were presented through defining polynomials.
On the other hand, when trained on Weierstrass coefficients as in [ABH], no classifier was able to accurately predict the rank of an elliptic curve.

Torsion order
We established a balanced dataset for torsion order 1 and 2 together. A naive Bayes classifier was used and the results are summarized in Table 2. We see that the accuracies are extremely good, using 500 Euler coefficients. We note that the naive Bayes classifier appeared also in [HLOa].
We will revisit this experiment in Section 4.6.

Torsion structure
Conductor range | Dataset size | Euler coefficients | Precision | Confidence
$[1, 1 \times 10^6]$ | $5.4 \times 10^3$ (×2) | 500 | 0.885 | 0.789
Table 3: The above table shows the precision and confidence of a random forest classifier when asked to distinguish elliptic curves over Q such that $E(\mathbb{Q})_{\mathrm{tors}} \cong C_4$ from those such that $E(\mathbb{Q})_{\mathrm{tors}} \cong C_2 \times C_2$. The classifier is trained on a random sample of curves with conductor in the range specified by the first column, and verified on those which remain.
Continuing with the torsion group, let us see how well the actual group structure can be distinguished. We established a balanced dataset of size $\sim 5 \times 10^3$ (×2) for $C_4$ and $C_2 \times C_2$ altogether. Using a random forest classifier, we found that the cases $E(\mathbb{Q})_{\mathrm{tors}} \cong C_4$ and $E(\mathbb{Q})_{\mathrm{tors}} \cong C_2 \times C_2$ can be separated using 500 Euler coefficients to fairly good accuracy. The results are summarized in Table 3. Note that the size of the dataset is relatively small compared to those of previous experiments. With a larger dataset, the precision might be improved.

Integral points
Conductor range | Dataset size | Euler coefficients | Precision | Confidence
$[1, 5 \times 10^4]$ | $3.2 \times 10^4$ (×2) | 500 | 0.999 | 0.998
Table 4: The above table shows the precision and confidence of a naive Bayes classifier when asked to distinguish elliptic curves over Q with no integral points from those with a single integral point. The classifier is trained on a random sample of curves with conductor in the range specified by the first column, and verified on those which remain.
It is known that an elliptic curve has only finitely many integral points [Sil1, Chapter VIII, Chapter IX, Theorem 3.1]. In contrast, it may have infinitely many rational points (this is the case when the rank $r_E > 0$), as addressed above. We set up a supervised machine-learning experiment to distinguish curves with no integral points from those with a single integral point; in total there are around 60 thousand such curves with conductor in the interval $[1, 5 \times 10^4]$. A balanced dataset of size $\sim 3.2 \times 10^4$ (×2) for "single integral point" or "no integral point" was established, and a naive Bayes classifier produced the results summarized in Table 4. One can see that the results are extremely good.
We will revisit this experiment in Section 4.6.

Tate-Shafarevich group
Conductor range | Dataset size | Euler coefficients | Precision
$[1, 10^6]$ | $2.8 \times 10^4$ (×2) | 500 | <0.6
We consider the following binary classification problem: take 500 Euler coefficients and see whether one can distinguish between a Tate-Shafarevich group of order 4 versus 9. We tried a variety of methods, such as Bayesian or logistic classifiers, as well as some feed-forward neural networks, but none performed especially well. This is in accordance with the difficulty in computing this group. The results are summarized in Table 5.
For this problem alone, we implemented Weierstrass coefficient training (as was done in [ABH]). This experimental variant did not do well with any of the standard classifiers or regressors, again yielding a precision below 0.6. Nevertheless, we briefly review this approach for completeness. Every elliptic curve over Q has a unique reduced minimal Weierstrass equation of the form:
\[ y^2 + e_1 xy + e_2 y = x^3 + e_3 x^2 + e_4 x + e_5, \quad e_1, e_2 \in \{0, 1\},\ e_3 \in \{-1, 0, 1\},\ e_4, e_5 \in \mathbb{Z}. \tag{4.12} \]
Using the coefficients in (4.12), we define the vector:
\[ v_W(E) = (e_1, e_2, e_3, e_4, e_5) \in \mathbb{Z}^5. \tag{4.13} \]
Let $F$ denote a finite set of elliptic curves, and, for each $E \in F$, let $I(E)$ be an invariant of interest. For example, $F$ could be the set of all elliptic curves over Q with conductor less than one million and, for $E \in F$, the invariant $I(E)$ could be the rank of $E$. We introduce the following labeled dataset:
\[ \{ v_W(E) \to I(E) : E \in F \}. \tag{4.14} \]
Such a labeled dataset was used in [ABH].

Interpretation of naive Bayesian models
Of the experimental results above, two instances with strikingly high accuracies are: the order of torsion subgroups in E(Q) (Section 4.2), and, the existence of integral points on E (Section 4.4). The naive Bayes classifier was found to be optimal in both cases. Below we explore possible explanations.
We first observe that these classification problems are related to one another.
Indeed, it can be shown that†:
1. If $\#E(\mathbb{Z}) = 1$, then $\#E(\mathbb{Q})_{\mathrm{tors}} = 2$. Furthermore, the unique integral point is the torsion generator.
On the other hand, we observe the following "human" procedure for distinguishing between torsion order 1 and 2 using the vectors $v_L(E)$ as in equation (3.7). Recall from equation (3.6) that $a_p = p + 1 - \#E(\mathbb{F}_p)$. When $p$ is an odd prime, it follows that $a_p$ is even if and only if $\#E(\mathbb{F}_p)$ is even. If $p$ is moreover a prime of good reduction, then a point of order 2 in $E(\mathbb{Q})$ maps to a point of order 2 mod $p$, and so $\#E(\mathbb{F}_p)$ is even. We conclude that if $\#E(\mathbb{Q})_{\mathrm{tors}} = 2$, then the vector $v_L(E)$ consists of even integers, with a few possible exceptions coming from $p = 2$ and bad primes (the exceptions are actually ±1). In the case $\#E(\mathbb{Q})_{\mathrm{tors}} = 1$, we observe that the $a_p$ are frequently odd as well as even. We are led to speculate that a naive Bayes classifier successfully distinguishes vectors whose entries are all even from those whose entries are a mixture of even and odd numbers.
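The parity argument can be checked numerically on small examples. The sketch below is an illustration only: the two curves are chosen here (they are not taken from the datasets of this paper), and the helper names are hypothetical. The curve $y^2 = x^3 - x$ has rational points of order 2 and bad reduction only at 2, so all of its $a_p$ at odd primes should be even; the curve $y^2 = x^3 + x + 1$ has no rational point of order 2, since $x^3 + x + 1$ has no rational root, and odd values of $a_p$ do occur.

```python
def odd_primes_below(bound):
    """Odd primes smaller than bound, by trial division."""
    return [p for p in range(3, bound, 2)
            if all(p % q for q in range(3, int(p ** 0.5) + 1, 2))]


def a_p(a, b, p):
    """a_p = p + 1 - #E(F_p) for y^2 = x^3 + a*x + b and an odd prime p."""
    count = 1  # the point at infinity
    for x in range(p):
        v = (x ** 3 + a * x + b) % p
        if v == 0:
            count += 1  # the single point (x, 0)
        elif pow(v, (p - 1) // 2, p) == 1:
            count += 2  # v is a nonzero square: two values of y
    return p + 1 - count


primes = odd_primes_below(100)
with_two_torsion = [a_p(-1, 0, p) for p in primes]    # y^2 = x^3 - x
without_two_torsion = [a_p(1, 1, p) for p in primes]  # y^2 = x^3 + x + 1

print(all(value % 2 == 0 for value in with_two_torsion))
print([value % 2 for value in without_two_torsion])
```

The first line printed reports whether every entry for $y^2 = x^3 - x$ is even; the second lists the parities for $y^2 = x^3 + x + 1$, which mix 0 and 1 (its bad primes 2 and 31 aside, nothing forces evenness).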
To test this, we perform the following experiment. We generate one set of 100-dimensional vectors with random integer coordinates in the range $[-10, 10]$, and another set of 100-dimensional vectors with coordinates equal to two times a random integer in the range $[-5, 5]$. A naive Bayes classifier is able to distinguish these vectors to 99.8% accuracy. By comparison, a random forest achieves around 74%. These accuracies are comparable to those observed in our experiments in Sections 4.2 and 4.4, and confirm the expectation that a Bayes classifier recognizes this difference.
† We are grateful to Álvaro Lozano-Robledo and Chris Wuthrich, who informed the authors that one can prove these statements using the Nagell-Lutz theorem and other facts about elliptic curves.
We calculate the number of zeros in $v_L$ for each curve $E \in F_i$, $i = 0, 1$, and draw the resulting histograms. This is shown in parts (a) and (b) respectively in Figure 1. Clearly, the means of the two distributions are different. Precisely, $F_0$ has mean 10.85 with standard deviation 13.21, while $F_1$ has mean 15.26 with standard deviation 14.70.
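The synthetic experiment described above (random integer vectors versus doubled random integer vectors) can be sketched in pure Python. This is not the Mathematica classifier used in the paper: the categorical naive Bayes model with Laplace smoothing, the pooling of all coordinates into a single value distribution per class, the sample sizes, and all function names are assumptions of this sketch, so the accuracy it reports need not match the 99.8% quoted above.

```python
import math
import random
from collections import Counter

VALUES = range(-10, 11)  # every integer a coordinate can take


def sample(dim, even_only, rng):
    """One vector: integers in [-10, 10], or twice an integer in [-5, 5]."""
    if even_only:
        return [2 * rng.randint(-5, 5) for _ in range(dim)]
    return [rng.randint(-10, 10) for _ in range(dim)]


def fit(vectors_by_label, alpha=1.0):
    """Categorical naive Bayes with Laplace smoothing.

    Simplifying assumption: all coordinates share one value distribution per class.
    """
    n_total = sum(len(vectors) for vectors in vectors_by_label.values())
    model = {}
    for label, vectors in vectors_by_label.items():
        counts = Counter(x for vec in vectors for x in vec)
        n = sum(counts.values())
        log_likelihood = {
            v: math.log((counts[v] + alpha) / (n + alpha * len(VALUES)))
            for v in VALUES
        }
        model[label] = (math.log(len(vectors) / n_total), log_likelihood)
    return model


def predict(model, vec):
    """Return the label with the largest posterior log-score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[x] for x in vec)
    return max(model, key=score)


def run_experiment(n_train=200, n_test=200, dim=100, seed=0):
    """Train on synthetic vectors and return the accuracy on fresh ones."""
    rng = random.Random(seed)
    train = {
        "mixed": [sample(dim, False, rng) for _ in range(n_train)],
        "even": [sample(dim, True, rng) for _ in range(n_train)],
    }
    model = fit(train)
    test = [(sample(dim, False, rng), "mixed") for _ in range(n_test)]
    test += [(sample(dim, True, rng), "even") for _ in range(n_test)]
    correct = sum(predict(model, vec) == label for vec, label in test)
    return correct / len(test)


print(run_experiment())
```

The reason such a model separates the classes so easily is that a single odd coordinate is assigned a very small smoothed probability under the "even" class, and a 100-dimensional vector of arbitrary integers almost always contains many odd coordinates.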
To check whether a naive Bayes classifier detects this difference, we define, for a positive integer $N$, the binary vector $v_B(E)$ of equation (4.15). These binary vectors are analogous to the binary vectors used in [HLOb]. Replacing $v_L(E)$ with $v_B(E)$ in Section 3.2 (Step 3) and repeating the experiment of Section 4.4, we observe that the naive Bayes classifier attains a precision of around 0.8. The result is similar if we instead use the ternary vectors of equation (4.16). Therefore, the near-100% predictions of the Bayes classifier must rely on more than merely the frequency of zeros, positives, and negatives.
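The exact definitions (4.15) and (4.16) are not reproduced in the text above. As a hedged illustration only, the sketch below assumes that the binary vector records whether each Euler coefficient vanishes and that the ternary vector records its sign; both encodings, and the function names, are assumptions made here and may differ from the authors' definitions.

```python
def zero_indicator(euler_coefficients):
    """Assumed binary encoding: 0 where a_p = 0, and 1 otherwise."""
    return [0 if a == 0 else 1 for a in euler_coefficients]


def sign_vector(euler_coefficients):
    """Assumed ternary encoding: the sign of each a_p, in {-1, 0, 1}."""
    return [(a > 0) - (a < 0) for a in euler_coefficients]


example = [-2, 0, 1, -2, 0, 4]  # made-up coefficients, for illustration only
print(zero_indicator(example))
print(sign_vector(example))
```

Either encoding discards the parity of the coefficients, which is consistent with the drop in precision reported above.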
If we do include the actual values of the Euler coefficients, it takes at least around 7 coefficients to get to more than 0.9 accuracy.
On the other hand, we should point out that the number of zeros among the Euler coefficients is the subject of the Lang-Trotter conjecture [LT76], which is a refinement of the Sato-Tate conjecture [Ta65]. We are not aware of any claims in the literature that relate the distribution of zeros in the Euler coefficients to the number of integral points on, or the torsion order of, an elliptic curve. Our experimental results suggest that such relations may exist.

Genus 2 curves
Having met with success for the genus 1 case, in this section we describe our experimental results for genus 2 curves defined over Q. Throughout, we take the

Rank
We performed an experiment analogous to that in Section 4.1. In the current context, a significant proportion of genus 2 curves have rank 2, and we consider the ternary classification problem of predicting whether the rank is 0, 1, or 2 from the Euler coefficients. A balanced dataset of size $\sim 1 \times 10^4$ (×3) was thus established, and a logistic regression classifier was found to perform well, with accuracies ∼ 0.97. We emphasize that this is a 3-way classification, and obtaining this level of accuracy is impressive. The results are summarized in Table 6.

Torsion order
As with the genus 1 case, we can try to distinguish torsion groups of order 1 from those of order 2 in a binary classification (cf. Section 4.2). A balanced dataset of size $\sim 1.5 \times 10^4$ (×2) was established, and a naive Bayes classifier was found to perform best, with results presented in Table 7.

Rational points
As mentioned in the Introduction, curves of genus > 1 have only a finite number of rational points. This allows for an experiment slightly different in nature to what was possible with elliptic curves. Indeed, one could ask for a multi-category classification in which the number of rational points is predicted from the Euler coefficients. We tried various classes after balancing the data, but no classifier performed especially well. The results are summarized in Table 8, where a 7-way classification is shown in the first row, and a binary classification in the second. We suspect that training with a larger data set would result in a better performance.

Trivial Tate-Shafarevich group
Finally, we move to the Tate-Shafarevich group. Note that, for a genus 2 curve, the order need not be a square. We performed a binary classification (having established a balanced data set of size $\sim 4 \times 10^4$ (×2)) of whether the Tate-Shafarevich group is trivial or not. Again, no classifier was found to perform particularly well, though a logistic regression classifier performed best (see Table 9), and the accuracies are comparable to those of the genus 1 case. Once again, the prediction is better than completely random.

Conclusions and Outlook
The experiments in this paper show that an ML classifier can be trained to predict the rank and the torsion order of an elliptic curve or a genus 2 curve with high precision when the curve is represented by a few hundred Euler coefficients. In particular, for elliptic curves, the torsion order and the number of integral points are determined almost perfectly by ML classifiers. Among the discrete invariants appearing in the BSD conjecture, only the order of the Tate-Shafarevich group seems to be out of reach with our approach of using a finite number of Euler coefficients.
Along with our previous work [HLOa, HLOb], this paper confirms that ML classifiers perform surprisingly well with various invariants in number theory. High accuracies attained in our experiments reflect that data sets arising from mathematics are actually "clean and clear" without any noise. Prospectively, this opens up new opportunities of developing ML techniques for mathematics which exploit mathematical structures in data sets.
With all these experimental results and evidence at hand, a compelling call to action is to understand what ML classifiers actually recognize in the data sets. Though the algorithms of standard classifiers are well-known, it does not seem straightforward to precisely analyse what a classifier does with data sets.
In another direction, we are reminded that the influential Langlands program anticipates correspondences between two kinds of data sets: arithmetic data and automorphic data. We have been experimenting with arithmetic data. In accordance with the Langlands program, we expect that a machine would learn automorphic data with high precision and efficiency. It would be very interesting to investigate whether this expectation is valid.