Machine learning on the electron–boson mechanism in superconductors

To unravel pairing mechanism of a superconductor from limited, indirect experimental data is always a difficult task. It is common but sometimes dubious to explain by a theoretical model with some tuning parameters. In this work, we propose that the machine learning might infer pairing mechanism from observables like superconducting gap functions. For superconductivity within the Migdal–Eliashberg theory, we perform supervised learning between superconducting gap functions and electron–boson spectral functions. For simple spectral functions, the neural network can easily capture the correspondence and predict perfectly. For complex spectral functions, an autoencoder is utilized to reduce the complexity of the spectral functions to be compatible to that of the gap functions. After this complexity-reduction process, relevant information of the spectral function is extracted and good performance restores. Our proposed method can extract relevant information from data and can be applied to general function-to-function mappings with asymmetric complexities either in physics or other fields.


I. INTRODUCTION
The mechanism of superconductivity has been one of hottest topics for more than one century since the first experimental evidence of superconductivity in mercury was observed by H. K. Onnes in 1911 1 .After half a century, the seminal work by Bardeen, Cooper and Schrieffer 2 , called the BCS theory, elucidating inevitable pairing attraction from the electron-phonon coupling (EPC), successfully resolve the long-standing puzzle of the origin of superconductivity.However, materials with strong EPC could not be incorporated in this scope.The strong EPC issue in normal metal was solved by Migdal using the field theoretical Green's function approach to show the perturbation series of the EPC can converge quickly due to the negligible vertex correction compared to the selfenergy 3,4 .The work was further extended to describe the superconducting states by Eliashberg on the basis of collection of states introduced by Bogoliubov 5, 6 .The resulting Eliashberg theory, also known as the Migdal-Eliashberg (ME) theory, is a modification of the BCS theory to take into account the EPC more realistically and thus the retardation effect more accurately 7 .
The success of EPC for superconductivity demonstrates the huge impact from collaboration between fermions (electrons) and bosons (phonons).Since then, scientists keep searching for possible bosonic modes behind superconductivity, including unconventional superconductors such as high-temperature superconducting cuprate 8 and iron-based superconductors 9,10 .Possible bosonic modes include the spin wave, spin fluctuation and so on 11 , each of which has specific spectrum.In this way, the original BCS theory is widely generalized to describe the superconductivity based on the interactions between electronic and bosonic modes and is applied to various superconducting systems even when the pairing glues of them are expected to be unconventional.Knowing the bosonic spectrum relevant to the superconducting pairing thus gives great insights on the mechanism of su-perconductivity.
Theoretically, it is straightforward to calculate gap functions based on the spectrum of bosonic glue of superconductivity.On the contrary, the bosonic spectrum is material dependent and requires case-dependent ab initio calculations.Since electron and boson contribute to the superconductivity through the overall effective electron-boson spectral function (EBSF) without the need to know much material details, the inference of the EBSF from superconducting properties is possible and can stand for a convincing proof of the mechanism.Our goal is to infer the EBSF from the gap function using machine learning techniques, and this strategy may be extended to other mechanisms including unconventional superconductivity.
Since the beginning of this century, machine learning (ML) related techniques have been intensively developed in the fields of the computer vision and natural language processing [12][13][14][15] .Among the methods for machine learning, the supervised learning and unsupervised learning have been largely applied to various fields other than the computer sciences.In the supervised learning (SL), data with labels are used to train the machine such that the well-trained machine can predict correct labels of data unseen before.For instance, if we have two functions F A and F B having some unknown one-to-one correspondence to each other we can train a SL machine to predict F B for given F A or vice versa.In the unsupervised learning (USL), a machine is trained by feeding data without labels to learn the distributions of the data with or without the dimension reduction.For instance, the restricted Boltzmann machines can learn the distributions of data without the dimension reduction 16 while autoencoders (AEs) can learn the distributions in the latent space for a compressed representation in reduced dimensions 17 .
For superconductivity, efforts are mainly devoted to the prediction of the transition temperature.These include the similarity search using the encoded fingerprints 56 , learning the representations of elements in the periodic table to predict superconductivity 57,58 and the estimation of the transition temperature via disparate models, such as the support vector machines 59,60 , gradient boosted model 61 , regression models 62,63 , selflearned descriptors combined with the atom table convolutional neural networks (ATCNN) 64 , equationbased model 65 , convolutional gradient boosting decision trees 66 , variational Bayesian neural network 67 and models using the text-mining-generated training datasets 68 .Effects of training data and improved measures are also investigated for materials discovery 69 .However, compared to the prediction of the superconducting transition temperatures, the mechanism of the superconductivity attracts less attention.One representative work is the estimation of normal and anomalous self-energies of superconductors from experimental ARPES data 70 .
In this work, we realize a machine which can answer the possible EBSF from the input: a gap function.Through the establishment of the relations learned by the machines, we can derive the spectral function relevant to the gap function without performing the ab initio calculations.The performance for simple EBSFs is good by the SL and what is learned by the machine during the training process is investigated.The SL alone, however, is hard to improve the performance of complex, randomlygenerated EBSFs.In order to resolve this issue, we used the AE to evaluate the complexities of both the EBSFs and gap functions.It was found that EBSFs have much higher complexity.Therefore, the complexity of EBSFs is reduced by AE to be compatible to that of the gap function, in which way smoothed EBSFs are generated to contain key information relevant to the input data.By using the AE-smoothed EBSFs as the new labels and the AE-transformed gap functions as the new inputs, the performance is greatly improved, which indicates that the new EBSFs preserve the essential information relevant to the new gap function so that they have one-to-one correspondence between them.This approach thus proposed a process of complexity reduction which can generally extract the key information of the data relevant to our input functions, and can be utilized to improve SL tasks in physics or other fields.The process is demonstrated in Fig. 1.Note that we also study the cases where the density of state (DOS) is the input.The results are similar to those for the gap functions so only results for the gap functions are presented unless otherwise specified. .If the performance is not good, we have to check if the latent dimension of output functions, dLo, is larger than that of input functions,dLi (step 3).If it is not the case, we expect the performance can be improved by changing the hyper-parameters of the neural network, such as the number of layers and the number of neurons in each layer (go back to step 1).See also the discussion section.If dLo > dLi, we firstly train an autoencoder (AE) with latent dimension dLi using the original dataset and generate new datasets from the well-trained AE (step 4).At step 5, we train a new SL machine with the newly generated dataset and the good performance is expected (step 6).

SL machine of good
The outline of the rest of this article is given in the following.In Sec.II, we introduce the ME formalism, describing the relations between the EBSFs and the gap functions, the training datasets.In Sec.III, we describe the details and results for the training with simple (Sec.III A) and complex EBSFs (Sec.III B).We discuss the relation between the performance and the compatibility of complexity as well as potential applications of our work and then make a conclusion in Sec.IV.

II. MIGDAL-ELIASHBERG FORMALISM
The key feature of the superconductivity relies on the Cooper-pair condensation, initiated by the pair of states (k ↑, −k ↓) occupied coherently.Therefore, the Nambu-Gor'kov formalism utilizing the two-component spinor ψ † k = (c † k↑ c † −k↓ ) was introduced [71][72][73] .the Green function at momentum k and (imaginary) Matsubara frequency where k is the normal-state energy dispersion and Σ is the self-energy.For the self-energy Σ, the ME theory considers the self-energy Σ comes from both the EPC and the electron-electron Coulomb interaction.Especially, the vertex corrections are argued to be small so that a bare vertex is a good approximation, which means the EPC is truncated at order of ω D /E F , the ratio between the Debye frequency and the Fermi energy.This leads to the self-energy (2) where V C (k − k ) is the screened Coulomb potential, and g k,k ,ν is the screened EPC strength which describes scattering between electron states k and k through a phonon with wave vector q = k − k and frequency ω qν .The propagator for phonons is qν ] and the self-energy is in the form of ) where Z(k, iω n ), χ(k, iω n ), φ(k, iω n ) are functions at the Matsubara frequencies to be determined.Here, the phase of the pairing potential φ(k, iω n ) is gauged so as to exclude the σ 2 component.Therefore, the Dyson equation becomes The functions of Z(k, iω n ), χ(k, iω n ) and φ(k, iω n ) can be determined.
The functions depend on k, which impedes heavily on the computational task.A standard approximation is applied that the DOS is constant of N F around the Fermi energy E F since E F is much larger than the pairing energy.Hence the energy shift χ is a constant to be omitted.Furthermore, for conventional superconductors the gap function anisotropy is usually weak or smeared out by impurities 73 .Therefore, an isotropic approximation is adopted to average k over the Fermi surface.By defining the electron-phonon (electron-boson) spectral function (EBSF) where the function is positive definite, we finally have the ME equations to solve self-consistently in the following is the interaction kernel function.T is the temperature, and the reduced Planck constant and the Boltzmann constant k B are set to unity for convenience.The electron-electron interaction is renormalized and then considered via the the Morel-Anderson pseudopotential µ * in the frequency below ω c .The µ * value is empirical and in the range of 0.1 ∼ 0.2 for most conventional superconductors 73 .Its value usually does not change the behaviour of the results much.The summation of the Matsubara frequencies is truncated in the numerical calculations.Once Z(iω n ) and ∆(iω n ) are calculated at the Matsubara frequencies, we perform analytical continuation iω n → ω + = ω + iΓ, where ω is the real frequency and Γ → 0 + an infinitesimal positive number.Numerically this can be performed under Padé approximation 74 .In Appendix C, we show the details of the numerical calculations of the ME equations and the preparation of the input (gap function ∆(ω)) and output (EBSF α 2 F (ω)) part of the training datasets.

III. INFERRING THE SPECTRAL FUNCTIONS BY SUPERVISED LEARNING
We build up a SL machine by using Keras on the Tensorflow backend to take ∆(ω) as an input function to predict the EBSF as the output function.The network structure of the machine contains the input layer (300 neurons), output layer (1000 neurons) and two hidden layers (500 and 800 neurons, respectively), sketched in Fig. 2. ∆(ω) is fed into the input layer and, after passing through two hidden layers, the predicted EBSFs, α 2 F (ω), are produced in the output layer.The activation functions for the first three layers are ReLU while a linear function is used in the output layer.We choose the mean-square error (MSE) as the loss function to be minimized by the ADAM optimizer 75 .We divide our dataset into three subsets: 10% of the data for the testing set, 9% for the validation set and the rest for the training set. A. Good performance for simple bosonic spectrum In this subsection, we study the performance of the machine when the complexity of the data is low.We prepare the data (totally 1800 data in this dataset) in which every EBSF is constructed by a combination of Gaussian variants.We show some EBSFs in Fig. 3(a) and gap functions in Fig. 3(b).Our machine nicely predicts the EBSF, as shown in the close overlap between the real (dashed orange) and predicted (solid blue) functions in Fig. 3(a).
Given the good performance, we would like to have insights on what the machine has learned from the training dataset.The details are shown in the appendix B.

B. AE-aided supervised learning for asymmetric complexities
In this part of work, we prepare diverse EBSFs (totally 10000 data in this dataset) by randomly generating the value for each frequency in one EBSF and performing some smoothing (averaging) schemes to slightly eliminate fluctuations.The performance after the training is shown in Fig. 4. The predicted EBSFs (p-EBSFs) match approximately the main profile of the ground-truth EB-SFs (gt-EBSFs) but miss fine structures.We propose that this discrepancy is caused by different complexities of EBSFs and gap functions so that they do not have oneto-one correspondence to each other.Explicitly speaking, it is close to a many(EBSFs)-to-one(gap function) mapping, which will be discussed in the appendix A. Therefore, in order to restore good performance necessary is the complexity adjustment, during which relevant information of EBSFs can be extracted.In the following, we quantify the complexity by an AE and show explicitly that the complexity of the EBSFs is much higher than that of the gap functions ∆(ω).
An AE, as illustrated in Fig. 5, is usually used to reduce the dimensions of data and to represent data as latent vectors in the latent space.We train an AE to firstly take in a data in the input layer and, after going through the whole AE network, to reconstruct the same data in the output layer.The middle layer of an AE is called the bottleneck whose number of neurons defines the dimension of the latent space, the latent dimension d L .Our strategy is to concatenate each gap function

Autoencoder
Latent space (Middle layer) Hidden layer Hidden layer Hidden layer output layer Input layer FIG.5: The sketch of an autoencoder (AE).An AE contains one encoder and one decoder.An encoder is defined as the part from the input layer to the middle layer (The left three layers.)and a decoder is defined as the part from the middle layer to the output layer (the right three layers).Therefore, an encoder maps a data into an latent vector and a decoder generates a data based on a given latent vector.We train the AE to reconstruct the input in the output layer.
(vector) and the corresponding gt-EBSF (vector) as a new vector and then use those combined vectors to train the AE.
As the first trial, we set d L = 2 and the result is shown in Fig. 6(b).This 1400-dimension vector is composed of the EBSF (the first 1000 dimensions) and the gap function (the remaining 400 dimensions), where each dimension is a discrete frequency point of the corresponding functions.The gap function is well reconstructed while the EBSF is not.This result shows that the two dimensional latent space is enough for representing the gap functions but not enough for representing the complex EBSFs.Next, in order to find the required dimension of the latent space for well representing both the EBSFs and gap functions, we study cases with the latent dimensions d L = 1, 2, 4, 8, 32, and 64.The results are shown in Fig. 6.Both EBSFs and gap functions are well reconstructed for d L ≥ 32, which indicates that the latent dimension for the EBSFs is roughly 32 (The minimal required latent dimension should be between 24 and 32 according to our simulations.) The two complexities (or d L 's) for EBSFs and gap functions are so different that many distinct EBSFs may map onto similar gap functions, consistent with the proposed many-to-one mapping.In this case, the machine has difficulties in learning the mapping and possibly end up with outputting the averaged EBSF.It is consistent with what has been observed in Fig. 4 as the prediction curves seem to serve as smoothed functions of the real curves.This also explains our bad training performance in Fig. 4: randomly generated EBSFs contain much information, and most of it is not relevant to the gap function so that the performance of the SL machine is reduced.Questions are raised: can the simplified p- EBSF be a good prediction?It is possible that this simplified p-EBSFs have contained enough information to reproduce all relevant physics in the gap functions and we do not really need the original, complex EBSFs for understanding the mechanism behind the superconductivity.Maybe all we need are those EBSFs having similar complexity to that of the gap functions.
In order to clarify ideas mentioned above, we calculate two gap functions based on the gt-EBSFs and the p-EBSFs.If both the simplified p-EBSF and complex gt-EBSF contain the relevant information, their resulting gap functions should look similar.Our results, as shown in Fig. 7, supports this idea.For each subfigure, the first panel is the EBSF and the following two panels show the calculated gap-related functions.Although the gt-EBSF has some small-scale fluctuations which do not exist in the p-EBSF, their resulting two functions look similar.It demonstrates that the extra complexity (say, for instance, those small-scale fluctuations) is not physically relevant to the gap functions.Therefore, our SL machine, in fact, does learn the physically relevant part of the EBSFs even though it does not fully reproduce the gt-EBSFs.Given that both the p-EBSFs and gt-EBSFs contain information relevant to the gap functions, we examine what the key information looks like in the following.
Our belief is that if one simplified EBSF do contain mostly the key information relevant to the gap functions, the complexity of that simplified EBSF has to be of the same degree as that of the gap functions.Furthermore, the SL machine, trained by the simplified EBSFs and corresponding gap functions, will restore the good performance.In order to systematically control the complexities, here we use an AE to generate new EBSFs as well as the gap functions with desired complexity (or latent dimension) for our SL trainings.The way to generate The performance is good for two reasons.Firstly, the latent dimension of the EBSFs is reduced to two, similar to that of the gap functions.The second and most important reason is that the EBSFs and gap functions are co-transformed so that relevant connection between them are kept by the AE.If these two functions are separately transformed by the AE the performance is lowered.In other words, via an AE, the latent dimension of the EB-SFs is reduced from 32 to 2 and the co-transformation with the gap functions makes the 2-D latent space of the EBSFs equal to the 2-D latent space of the gap functions.We also check that the AE-transformed EBSFs really lead to, by the self-consistent calculations, the new, co-transformed gap functions for any latent dimension.That is, an AE can preserve the connection between the gap functions and EBSFs and transfer this connection to the new dataset no matter what latent dimension is  chosen.In the appendix E, we provide evidences showing that the EBSFs predicted by our AE-follow-up SL machine do contain information relevant to the gap functions for any latent dimensions.In the appendix A, we use the connection preserved by the AE to numerically show the many-to-one mapping between EBSFs and gap functions claimed above.Given that all newly generated datasets {v n s} for any latent dimensions n satisfy the ME equations, we have to determine which d L is the appropriate one to construct the SL machine for predictions.The key is to find the latent dimension of the gap functions d Lg .From the inspection of Fig. 6, it is easy to find that d Lg = 2, which is the minimal dimension where the AE-transformed gap functions g 2 look similar to the original gap functions g 0 .In other words, the phase space of {g 2 }, the collection of gap functions generated by a trained AE with d L = 2, may contain most of the phase space of {g 0 }.On the contrary, {g 1 } may contain a smaller portion of the phase space of {g 0 }, which can be seen, for instance, in Fig. 6(a) as the reconstructed gap function has some mismatch with the original one.In Appendix D, we evaluate the portion of the {g 0 } phase space occupied by {g n } with several different n's to quantitatively determine d Lg = 2.
For the cases using the DOS as input functions, similar calculations and analysis have been performed and the appropriate latent dimension is one, instead of two for the gap functions.It is consistent with our observations as the difference between DOS curves is mainly the positions of the coherence peaks while the difference between gap functions can exist in much more different ways.
Given the good performance shown above, we can summarize our process in Fig. 1 for improving the performance of a general functional-mapping SL as following.First of all, we check the latent dimensions (or complexities) of both input and output functions d Li and d Lo respectively by AEs with different d L 's.Usually, the bad performance occurs when d Lo > d Li , which will be discussed next.Secondly we train an AE with the latent dimension d Li .During this second step, the complexities of input and output functions are adjusted to be the same and new dataset can be generated by the trained AE.As the final step, we use newly generated datasets to train the SL machine and use this well-trained machine to make predictions for gap functions from experiments.

IV. DISCUSSIONS AND CONCLUSIONS
The bad performance usually occurs when d Lo > d Li .In this case, the machine has to learn to infer remarkably different output functions (EBSFs in this work) based on similar input functions (gap functions in this work).This is a difficult task and the machine usually ends up with predicting the "coarse grained" functions as mentioned above.However, for the opposite situation, where d Lo < d Li , remarkably different input functions are used to infer similar output functions, which is a relatively easier task.Consider an extreme example where two different input functions (q 1 and q 2 ) correspond to the same output function (r 1 ).Then, the machine only needs to learn to infer r 1 no matter which one of q 1 and q 2 is encountered.For a consistency check, we reverse the roles of input and output functions (that is, here, we use EBSFs to predict gap functions via a SL machine) and the performance of the SL machine is good (not shown), which is consistent with our expectations.
In order to have good performance, the latent dimension of the AE has to be chosen appropriately if initially the two latent dimensions of input and output functions are largely different.For the simpler case shown in the first part of this work, two complexities (latent dimensions) are compatible to each other so that the performance is good and we do not need an extra AE.For the complex case shown in the second part of this work, the latent dimension of the output functions (d Lo = 32) is much larger than that of input functions (d Li = 2 for the gap functions and d Li = 1 for the DOS functions).Therefore the best performance occurs when the latent dimension of the AE is chosen to be two (one) if the gap functions (DOS functions) are chosen as the input functions.
Usually, it is difficult to have enough dataset directly from experimental inputs for training a SL machine.Therefore, we theoretically generate datasets for training.In order to include all possibilities encountered in the real physical world, data is generated with maximum randomness.However, this process can create extra complexity which is irrelevant to the physical properties of interest leading to bad performance as expected.Our process of complexity reduction serves as a way to extract physically relevant information of the randomly generated functions.
In this work, our method is applied to mainly BCS superconductivity but the extension to other types of order parameters of superconductivity, such as d-wave SC or other unconventional ones, should be straightforward as long as similar theoretical formalism connecting the gap functions and bosonic glue is established.Furthermore, this method can be generally applied to any function-tofunction supervised-learning tasks with asymmetric complexities to improve the performance and obtain more understanding on the data.
In summary, we predict the EBSF of a superconductor based on the measured gap functions via the supervised learning aided by the autoencoder.When the latent dimensions of input and output functions are compatible, the performance is good.When these two latent dimensions are remarkably different (d Lo > d Li ), we have to reduce the latent dimension of the complex one to be compatible to the simple one so that the good performance restores.This reduction can extract physically relevant information.Therefore, this method paves the ways to improving the performance of general function mappings with asymmetric complexities and to extracting meaningful information in data from physics or other fields.
The authors thank Prof. Ting-Kuo Lee for fruitful discussions.This work is supported by Ministry of Science and Technology (Grant No. MOST 109-2112-M-110-010 and MOST 108-2112-M-110-013-MY3 ) and National Sun Yat-sen University.In order to demonstrate the many-to-one mapping, we statistically examine the difference between two EBSFs (say, the original f 1 and the complexity-reduced f 2 ) and that between two corresponding gap functions (say, g 1 and g 2 ).As the first step we randomly generate f 1 and obtain, via ME self-consistent calculations, the gap function g 1 .We then choose an AE-follow-up SL machine with the latent dimension d L = n and feed this machine with g 1 .The output is f 2 .As the next step, we obtain g 2 self-consistently based on f 2 .Note that, both pairs ((f 1 ,g 1 ) and (f 2 ,g 2 )) are shown in the main text to satisfy the ME equations.This completes the generation of one data (f 1 , f 2 , g 1 , g 2 ) and, in the following, we define and evaluate two differences: ∆ f for EBSFs and ∆ g for gap functions using the Pearson correlation coefficient (PCC) 82 .
For each data, the PCC of f 1 and f 2 and that of g 1 and g 2 are calculated as p f and p g , respectively.The PCC will be equal to 1 if the two compared functions are exactly the same.Then two differences are defined as ∆ f = 1 − p f and ∆ g = 1 − p g .We can then calculate the average of ∆ f and ∆ g over the dataset for each latent dimension n.As shown in table I that ∆ f is roughly two orders of magnitude larger than ∆ g , which means remarkably different EBSFs correspond to similar gap functions, a many-to-one mapping.First of all we analyze how the performance is influenced by the neuron number of the layer next to the output layer (NTO layer, see Fig. 2).This neuron number turns out to be the number of basis functions whose combinations produce the output functions.We extract the weights between the NTO and output layers to find these bases.Suppose the output function f (ω i ) = j w ij a j , where a j is the value of the j-th neuron of the NTO layer.The bias vector is ignored in this discussion as it does not influence the performance.Then w ij is the i-th element of the j-th basis.Note that the weights between the last two layers can be directly interpreted as bases because the activation function of the output layer is set to be linear, instead of nonlinear functions such as ReLU.We demonstrate cases when the NTO layer has different neu- ron numbers in Fig. 9 so that each curve corresponds to one basis function the machine has learned from data.Smooth curves are proper basis found by the machine while noise-like curves, such as the lower-left panel in Fig. 9(c), indicate that the corresponding neuron is impotent.The more proper bases found by the machine, the better the performance.For instance, for the four-neuron case, the machine finds four proper bases so the performance is better than that in the six-neuron case where the machine only finds three proper bases, but worse than that in the eight-neuron case where the machine finds six proper bases.The connection between the performance and the number of proper basis can be further visualized from the training history.The four-neuron case is demonstrated in Fig. 10.When a basis is found the loss function shows a drop and totally four proper bases (See Fig. 9(b)) are found.

Appendix C: Numerical Calculation of the Eliashberg equation
The effects of the EPC are manifest in the mass renormalization factor Z(ω) and the gap function ∆(ω), which can be determined theoretically by the materialdependent EBSF α 2 F (ω).To have large enough datasets for the machine to learn, we generate various EBSFs randomly which suffice for our need of data.The ME equations ( 6) and ( 7) are then solved self-consistently by inputting these generated EBSFs.Other quantities like T and µ * are input parameters to be specified to solve the equations.
The generated EBSF must be continuous and positivedefinite.We first create random functions along the one-dimensional ω space by the geometric Brownian motion which is often used as the financial market price simulations 76,77 .The functions are then shifted to guarantee the positive-definite behaviour, and are further enforced to vanish at zero and large frequencies above the Debye cutoff.The generated functions are then numerically smoothed to make themselves continuous.Finally, the functions can be mapped to the chosen low frequency region of interest.
In the whole work here, we map the generated EBSF to the frequency range of 0 ∼ 10 meV, and set the temperature T = 1.16K.The dimensionless empirical parameter µ * is fixed to be 0.1 for simplicity since its value does not change the result much.Note that the energy can be used as the common unit and other dimensional quantities are scaled correspondingly.For example, we can still get the same result of the gap function and renormalization if the frequency range is mapped to the range of 0 ∼ 50 meV, while T is chosen to be 5.8 K simultaneously.Some randomly-generated EBSFs will not produce superconductivity at the chosen temperature.We keep only the nonzero ∆ data for training and testing.
In Eqs. ( 6) and ( 7), the Z and ∆ are coupled all together with all Matsubara frequencies.In the numerical calculations of the coupled equations, the terms are truncated at high frequencies since Z and ∆ are decaying to fixed values at high Matsubara frequencies.The summation over different Matsubara frequencies ω n is truncated to n ∈ (−M − 1, M ) terms.The M value determines the maximum Matsubara frequency included, which should be large enough to ensure convergent results, and especially high accuracies for the sake of analytical continuation to be performed later.
There always exists the solution ∆ = 0 which corresponds to the normal state.In numerical calculation we should be careful of choosing the initial values and the convergent criteria to find the superconducting solutions (∆ = 0) of interest.To solve the coupled equations selfconsistently, we solve them iteratively.However, the simplest and straightforward iteration of a previous output as the next input usually converges poorly.Typically the trivial solution ∆ = 0 might be obtained, especially when the non-trivial solution is small.To perform the numerical iteration properly, we adopt the Broyden's method, which is a quasi-Newton method to find roots of a set of nonlinear equations f (x) = 0 by updating the Jacobian matrix 78 .Here x = {∆(iω n ), n = −(M + 1), . . ., M } is the root solution to be found.The Jacobian matrix J with (i, j) entry J ij = ∂fi ∂xj specifies the first-order partial derivatives in each small displacement.The n−th iteration is written in the form of where The inverse of the the Jacobian matrix is updated directly as to minimize the Frobenius norm ||J n − J n−1 || F , as suggested by Broyden 78 .With J −1 determined, the roots of the equations are then improved by proceeding in the Newton direction: until the solution converges.
Once the Z and ∆ are solved at the Matsubara frequencies, the analytical continuation of these functions to the real axis is then executed.We adopt the Padé approximation, which requires sufficient and accurate Npoints (here 2M +2 points) data at the imaginary axis 74 .In practice, the truncated M must be large enough to ensure the maximum Matsubara frequency several times of the real frequency range under discussion 5 .According to our energy unit choice, our real frequency ω of interest is about 0 ∼ 60 meV, above which the Z and ∆ are usually saturated to 1 and 0, respectively.Only the low frequency parts below 35 meV of Z and ∆ are used as the input for the later machine learning.Our choice of M = 350 corresponds to a Matsubara frequency ∼ 220 meV, about 22 times of the EBSF domain range and several times of the input frequency range, sufficient to resolve the low-frequency information.If Z(iω n ), ∆(iω n ), or φ(iω n ) do not converge for the chosen M value, M will be doubled until they converge.After these quantities are analytically continued to the real axis, we also compare ∆(ω) and φ(ω)/Z(ω) for consistency check.The two should be mathematically equal but may differ due to the adopted Padé approximation or the numerical errors.We set the criteria to their difference and ratio to each other at each frequency point, and only those data having both criteria passed beyond ω = 35 meV will be kept, since only the Z(ω) and ∆(ω) below 35 meV will be used for later machine learning to find reasonable relations among functions.For gap functions in the AE-generated {g n }, the performance (of predicting the EBSF) can be solely determined by the loss function of the SL machine S n , which is trained using {g n }.How about gap functions other than the dataset {g n }?We expect that for gap functions close enough to any member of {g n }, the performance is also determined by the loss function of S n when S n is used to make predictions.Therefore, to determine the closeness of two data set, we can define a characteristic distance l n between two functions within {g n } and claim one gap function is close enough to {g n } when its closest distance to {g n } is less than l n .For each member g ni in {g n }, we can always find its closest partner with smallest distance l ni which is defined as the summation of the normalized absolute difference between two functions, ω |g ni −g nj |/ ω |g ni |, over all frequencies.Then l n is defined as the maximum of {l ni }.We can then expect that one gap function not belonging to {g n } can share similar performance of S n if the distance between this gap function and its closest member in {g n }, l on , is less than l n .We can then estimate the correct rate of the prediction by the portion of the original {g o } satisfying l on ≤ l n to further evaluate the performance of {g o } using S n .In Table II, we show this portion as well as the optimized value of the loss function of S n for different d L = n.Please see Fig. 8 to visualize the performance.The better the performance, the smaller the loss function.The correct rate is close to 100% for d L ≥ 2 and the performance decreases with increasing d L .Therefore, the best choice will be d L = 2, consistent with naïve inspections of Fig. 6.Since often superconducting properties, such as the critical temperature, can be estimated by some moments of the EBSFs without diving into the spectrum details [79][80][81] , we calculate the p-th moments µ p = ω p f (ω)dω carrying relevant information for comparison, where f (ω) = α 2 F (ω) and p = 2, 1, 0, and -1 is calculated.Therefore, we can check if the original and SL-predicted EBSFs share the same properties.We calculate µ p,o and µ p,n which are p-th moment of old (original) and new (SL-predicted, d L = n) EBSFs, respectively.These new EBSFs are obtained in three steps.At the first step, we obtain AE-transformed gap and EBSFs (See Appendix F).At the second step, transformed gap functions and EBSFs are used to train a SL machine S n .At the third step, we feed the old gap functions into S n to predict the new EBSFs and compare them to the old EBSFs.We then take the mean (m) and standard deviation (std) of the difference of µp,o−µp,n µp,o over the whole dataset.The results, as shown in Fig. 11, show that the old and new EBSFs do share relevant properties.We also compare the old EBSFs to the p-EBSF functions, which is consistent with Fig. 7.

Appendix F: Generating new functions
In this appendix, we provide detailed procedures of generating new EBSFs and gap functions for given d L = n, which are used to train a SL machine.Consider one original gap function g o and EBSF f o as two vectors and we concatenate these two vectors to one combined vector v o , the first part of which is the EBSF f o and the second part of which is the gap function g o .We then train a AE with d L = n, denoted as A n , by using the dataset {v o } until the reconstruction error is optimized (or minimized).After the training is completed, we again feed {v o } to this trained AE and obtain the output {v n }.Then the first part of each vector v n is the new EBSF f n and the second part of it is the new gap function g n .
Then we can use this dataset, {g n } and {f n }, to perform the SL training mentioned in the main text.

6 FIG. 1 :
FIG.1:The flowchart of the training process.We start at step 1 by training a supervised learning (SL) machine with a dataset.If the performance is good then we can use the welltrained machine for predictions (step 2).If the performance is not good, we have to check if the latent dimension of output functions, dLo, is larger than that of input functions,dLi (step 3).If it is not the case, we expect the performance can be improved by changing the hyper-parameters of the neural network, such as the number of layers and the number of neurons in each layer (go back to step 1).See also the discussion section.If dLo > dLi, we firstly train an autoencoder (AE) with latent dimension dLi using the original dataset and generate new datasets from the well-trained AE (step 4).At step 5, we train a new SL machine with the newly generated dataset and the good performance is expected (step 6).

FIG. 2 :
FIG.2:The neural network for the supervised learning to predict EBSFs based on the information from the gap functions.

FIG. 6 :
FIG. 6: The original (orange dashed curves) and reconstructed (blue solid curves) functions of AE's with different latent dimensions dL = 1 (a), 2 (b), 4 (c), 8 (d), 32 (e) and 64 (f).The first 1000 components (on the left) of each curve is the EBSF while the last 400 components (on the right) is the gap function.When dL ≥ 32, both the EBSF and gap functions can be well reconstructed.

1 FIG. 7 :
FIG. 7:The results based on the p-EBSF (a) and the gt-EBSF (b).The first panel of each subplot is the EBSF and the following two panels are gap-related functions.The difference between two EBSFs is mainly small-scale fluctuations, which turn out to be not physically relevant as the resulting gaprelated functions from these two EBSFs are similar.
Appendix A: Many-to-one mapping

25 FIG. 9 :
FIG. 9:The extracted basis functions for machines with different neuron numbers in the NTO layer.Smooth curves are proper basis functions while noise-like curves indicate that the machine fails to find proper bases.The performance increases with the number of the proper bases so that the performance of the four-neuron case (b) is better than the six-neuron case (c).

FIG. 10 :
FIG. 10: The loss function for the four-neuron case during the training process.The loss function drops when the machine finds one basis function.
Appendix D: Determination of the appropriate latent dimension

FIG. 11 :
FIG.11:The difference in different moments (p) between old (original) and newly SL-predicted EBSFs.We study the cases with dL = 1, 2, 4, 8, 16, 32 as well as the p-EBSF.The error bars are standard deviations from the whole datasets.Statistically, differences in these moments between original and new EBSFs can be viewed as zeros, indicating that original and new EBSFs share key information about the gap functions.

TABLE I :
The differences in the gap functions (∆g) and EBSFs (∆ f ) in terms of the Pearson correlation coefficient (PCC).The difference in EBSFs is much larger than that in gap functions, the many-to-one mapping.

TABLE II :
The portion of the original dataset included in the newly generated datasets with different latent dimensions (dL's).The optimized value of the loss function for each dL is also presented.For dL ≥ 2, the domains of newly generated datasets can cover almost all of the original dataset, which ensures the high correct rate of the predictions.