Self-supervised learning of materials concepts from crystal structures via deep neural networks

Material development involves laborious processes to explore the vast materials space. The key to accelerating these processes is understanding the structure-functionality relationships of materials. Machine learning has enabled large-scale analysis of underlying relationships between materials via their vector representations, or embeddings. However, the learning of material embeddings spanning most known inorganic materials has remained largely unexplored due to the expert knowledge and efforts required to annotate large-scale materials data. Here we show that our self-supervised deep learning approach can successfully learn material embeddings from crystal structures of over 120 000 materials, without any annotations, to capture the structure-functionality relationships among materials. These embeddings revealed the profound similarity between materials, or ‘materials concepts’, such as cuprate superconductors and lithium-ion battery materials from the unannotated structural data. Consequently, our results enable us to both draw a large-scale map of the materials space, capturing various materials concepts, and measure the functionality-aware similarities between materials. Our findings will enable more strategic approaches to material development.


Introduction
The diverse properties of inorganic materials originate from their crystal structures, i.e. the atomic-scale periodic arrangements of elements. How structures determine low-level material properties such as the band gap and formation energy is well studied as the structure-property relationship [1,2]. On the other hand, the materials science literature often discusses 'superconductors' [3], 'permanent magnets' [4], or 'battery materials' [5], referring to their higher-level properties, or functionality. Understanding what structures exhibit such functionality, i.e. the structure-functionality relationship, is a fundamental question in materials science. We call this functionality-level material similarity 'materials concepts'. Traditionally, materials science has sought new materials by experimentally and theoretically understanding specific functionalities of materials in a bottom-up fashion [1][2][3][4][5]. However, this labour-intensive, narrowly focused analysis has prevented us from grasping the whole picture of the materials space across various materials concepts. For next-generation material discovery based on the structure-functionality relationship, we argue here for the need of a top-down unified view of crystal structures through materials concepts. We pursue this ambition by learning a latent representation space of crystal structures. This representation space should ideally both (a) recognise materials concepts at scale and (b) be equipped with a functionality-level similarity metric between materials. We here utilise multi-modal structural attributes of materials to effectively capture structural patterns correlated to material functionality (figure 1). The underlying hypothesis is that materials concepts are the intrinsic nature of crystal structures, and therefore, deeply analysing the structural similarity between materials will lead to capturing functionality-level similarity.

Figure 1.
(a) Structural attributes of a material, showing different information of diamond in different data forms, or modalities. Since each attribute has its own advantages and disadvantages in expressing a material, using multiple attributes for a material can provide a more comprehensive view of the material. Particularly, the combination of the crystal structure and the x-ray diffraction (XRD) pattern, which we employ in this study, is known to well reflect two complementary structural features of materials: the local structure and the periodicity [1]. (b) Our goal is to represent each material as an abstract constant-size vector (embedding) whose distances to other embeddings reflect conceptual (functionality-level) similarities between materials. These embeddings allow us to visualise the materials space intuitively and also to search for conceptually similar materials given a query material. We learn embeddings from pairs of crystal structures and XRD patterns in the framework of deep metric learning. This cross-modal learning approach trains deep neural networks by teaching them that each pair should represent the same material entity. Because the XRD pattern can be theoretically calculated from the crystal structure, this learning can be performed in a self-supervised manner without any explicit human annotations for the materials dataset.

Figure 3(a) highlights key results of our representation space, which maps the crystal structures of materials to abstract 1024-dimensional vectors. For visualisation, these vectors were reduced to 2D plots in the figure using a dimensionality reduction technique called t-distributed stochastic neighbour embedding (t-SNE) [6]. We target 122 543 inorganic materials registered in the Materials Project (MP) database (amounting to 93% of the database) to capture nearly the entire space of practically known inorganic materials. These crystal structures themselves contain information about their functionalities implicitly. However, they do not explicitly tell us what structural patterns lead to specific material functionalities such as superconductivity, due to complicated structure-functionality relationships. Nevertheless, these materials form clusters of various materials concepts in the space (see annotations in figure 3(a)), showing that our representation space successfully captures structural patterns correlated to material functionality.
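For the visualisation step, reducing high-dimensional embeddings to 2D with t-SNE can be sketched with scikit-learn; the random matrix below is only a stand-in for our trained embeddings, not actual model output.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for learned material embeddings: 200 materials x 1024 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 1024))

# Reduce to 2D for plotting; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=0)
coords = tsne.fit_transform(embeddings)

print(coords.shape)  # (200, 2)
```

Each row of `coords` gives the 2D position of one material on a map like figure 3(a).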
Likewise, representation learning is gaining attention for understanding human-incomprehensible large-scale materials data [25][26][27][28][29], visualising the materials space [25][27][28][29][30], and generating crystal structures [31][32][33][34][35][36]. These material representations aim to map the abstract, comprehensive information of each material into a vector called an 'embedding'. Our work shares this purpose with embedding learning. However, to date, neither descriptor nor embedding learning explicitly learns the underlying relationships or similarities between materials. In particular, existing embeddings [25][26][27][28][29] are learned indirectly as latent feature vectors in an internal layer of a deep neural network (DNN) by addressing a surrogate training task (e.g. the prediction of physical properties [27,28], a task of natural language processing (NLP) [26], or its variants [25,29]). In such an approach, it is unclear from which layer we should obtain the latent vectors or which metric we should use to measure the distance/similarity between them.
Capturing abstract concepts of materials via learning structural similarities between them is analogous to word embedding learning [26] in NLP. Similar to materials concepts, the meanings of words in natural languages often reside in complex and abstract notions, which prevents us from acquiring precise definitions for them. Word embeddings then attempt to capture individual word concepts, without being explicitly taught, by absorbing our word notions implicitly conveyed in the contexts provided by a large-scale text corpus. Once optimised, similarities/distances between embeddings express their concepts, e.g. the embedding of 'apple' will be closer to those of other fruits such as 'grape' and 'banana' than to those of 'dog' or 'cat'. Our crystal structure embedding shares a similar spirit with word embedding in that both attempt to capture abstract concepts via learned similarities. More importantly, analogously to word embedding, we exploit a large-scale material database as a corpus of materials that implicitly conveys important structural patterns in its contexts of crystal structures. These structure instances of diverse kinds of materials, even without explicit annotations about their properties, should contain tacit but meaningful information about physics and material functionality that can guide the learning of ML models. From an ML perspective, such a learning strategy is called self-supervised learning [37], in which the data of interest themselves provide the supervision.
In this study, we demonstrate the large-scale self-supervised learning of material embeddings using DNNs. In essence, we follow the principle that structure determines properties and aim to discover materials concepts purely from crystal structures without explicit human supervision in learning. To this end, we use a collection of crystal structures as the only source of training data and do not provide any annotation regarding specific material properties (e.g. class labels such as 'superconductors' and 'magnets', or property values such as the superconducting transition temperature and magnetisation). Furthermore, unlike existing methods for material embedding learning, we explicitly optimise the relationships between embeddings by pioneering the use of deep metric learning [38]. Metric learning is an ML framework for learning a measure of similarity between data points. Unlike the common practice of metric learning performed in a supervised fashion using annotated training data [38], we allow our ML model to learn from the unannotated structural data in a self-supervised fashion.

Results
Our key idea for self-supervised learning, illustrated in figure 2(a), is to learn unified embedding representations for paired inputs expressing two complementary structural features characterising materials: the local structure and the periodicity [1]. In our model, the local structure is represented by a graph whose nodes and edges stand for the atoms and their connections. The periodicity is represented by a simulated x-ray diffraction (XRD) pattern, which can be theoretically calculated from the crystal structure using Bragg's law and the Fourier transform [1]. We simultaneously train two DNN encoders by constraining them to produce consistent embeddings across the two different input forms. This training strategy follows a simple optimisation principle: (a) for a positive pair, in which the input crystal structure and XRD pattern come from the same material, the Euclidean distance between the two embedding vectors is decreased, and (b) for a negative pair, in which these inputs come from different materials, the distance is increased. We implement this principle in the form of a bidirectional triplet loss function, as illustrated in figure 2(b). For the detailed method protocol, see section 5.
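This optimisation principle can be illustrated numerically. The following is a minimal NumPy sketch of a bidirectional triplet loss under the two rules above, with a hypothetical margin value `m`; the actual model applies such a loss to DNN encoder outputs over training batches (see section 5 for the real protocol).

```python
import numpy as np

def euclid(a, b):
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(a - b))

def bidirectional_triplet_loss(x, y, x_neg, y_neg, m=1.0):
    """Toy bidirectional triplet loss.

    x, y         : embeddings of the crystal structure and XRD pattern of the
                   SAME material (positive pair; pulled together).
    x_neg, y_neg : embeddings from DIFFERENT materials (negatives; pushed apart).
    m            : margin (hypothetical value; a tuned hyperparameter in practice).
    """
    d_pos = euclid(x, y)
    # crystal-anchored term: the paired XRD must be closer than a negative XRD by m
    loss_xy = max(0.0, d_pos - euclid(x, y_neg) + m)
    # XRD-anchored term: the paired crystal must be closer than a negative crystal by m
    loss_yx = max(0.0, d_pos - euclid(y, x_neg) + m)
    return loss_xy + loss_yx

# A well-separated configuration incurs zero loss.
x = np.array([0.0, 0.0]); y = np.array([0.1, 0.0])           # same material
x_neg = np.array([5.0, 5.0]); y_neg = np.array([-5.0, 4.0])  # other materials
print(bidirectional_triplet_loss(x, y, x_neg, y_neg))  # 0.0
```

When the negatives sit as close as the positive, both hinge terms activate and the loss becomes positive, which is what drives the encoders to separate unpaired materials.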
By design, we minimise human knowledge of specific materials concepts in both the data source and the training process, with the belief that materials concepts are buried in crystal structures. This design principle enhances the significance of the resulting embeddings highlighted earlier (figure 3(a)). They capture profound materials relationships through simple data and optimisation operations involving only general and elementary knowledge of materials, such as crystallographic data and Bragg's law. The results suggest that materials concepts can be exposed in deeply transformed abstract expressions unifying the complementary factors, i.e. the local structure and periodicity, of crystal structures.
The following analyses examine the embedding characteristics more carefully to see whether the embedding space has the two desired features mentioned above. Specifically, we qualitatively analyse (a) the global embedding distribution using t-SNE visualisation and (b) the local neighbourhoods around some important materials using the learned similarity metric between crystal structures. In the latter, a superconductor (Hg-1223), a lithium-ion battery material (LiCoO2), and some magnetic materials serve as our benchmark materials because of the high social impact and the diverse properties yet complex structures of these material classes. These analyses also demonstrate the usefulness of our materials map visualisation and similarity metric for material discovery and development.

Figure 2.
(a) To account for the respective input data forms (figure 1(b)), the crystal-structure encoder employs a DNN for graphs while the XRD pattern encoder employs a 1D convolutional neural network. (b) A schematic view of our bidirectional triplet loss. This triplet loss is used to simultaneously train the two DNN encoders to output embeddings that are close together when the input crystal structures and XRD patterns are paired (red-coloured x and y) and far from one another when the inputs are not paired (x and y vs others). More details are given in section 5.

Global distribution analysis
Careful inspection of the embedding space (figure 3(a)) reveals various clusters consistent with our knowledge of materials. Here, we note several interesting examples. A series of clusters corresponding to double perovskites (A2BB′X6) with different anions, X, exists along the left edge and at the centre of the map, forming a family of materials with the same prototypical crystal structure. This layout suggests that our model captures the structural similarity while properly distinguishing the local atomic environment at each site. At the lower left of the map, well-known 2D materials (transition metal dichalcogenides) form clusters in accordance with their atomic stacking structures [39]. At the top edge lies a cluster of imaginary unstable materials with extremely low-density structures (see also figure 5(a) for more details), representing one of the simplest cases of crystal structures governing physical properties. This cluster of unstable materials is an example showing that our embeddings capture materials characteristics solely from crystal structures without any explicit annotation given for training. One exciting finding from this map is a cluster of cuprate superconductors at the left edge. This cluster includes the first-discovered copper oxide superconductor, the La-Ba-Cu-O system, and the well-known high-transition-temperature (Tc) superconductor YBCO (YBa2Cu3O7, or Y-123), which is located close to La-Ba-Cu-O. These celebrated superconductors share a common structural feature, a CuO2 plane, that is vital to their superconductivity [3]. The formation of this cluster suggests that our embeddings recognise this hallmark structural feature. A closer look at this cluster (figure 3(b)) further reveals the presence of subclusters with structural features linking them. Y-123 and its variant Y-124 have a non-trivial structural similarity related to the CuO chain (see figure 3(c)).
In addition, we confirmed that other major cuprate superconductors containing Bi, Tl, Pb, or Hg form respective clusters in accordance with their local structures called 'block layers', a key structural concept for understanding the underlying physics of cuprate superconductors [40]. The proximity of these materials on the map further supports the claim that the embeddings capture the structural characteristics and, consequently, the structure-functionality relationships between cuprate superconductors.
These findings naturally lead us to the idea that the map might be able to identify potential superconductors or other beneficial compounds that have not yet been recognised. We leave this idea as an open question and have set up a project website where anyone can dig into the embedding map to search for, or rediscover, potential compounds with preferable functionality.
The t-SNE visualisation also provides a macroscopic perspective on the materials space based on the crystal structure. The simplest indicator of success for this model is the distribution of the elements within the materials map. Because atoms and ions with similar electron configurations compose materials with the same or similar crystal structures, we expect the element distributions to show cluster-like features if our embeddings have been trained successfully. In figure 4, we highlight each element in the map and display all elements in the form of a periodic table. As expected, figure 4 clearly shows similar distributions of blue-coloured clusters in the vertical and horizontal directions. These distributions can be analogously called the 'alkali metal plateau', the '3d transition metal district', or the 'rare-earth mountains' if we follow the map metaphor, indicating that the embeddings succeed in capturing the similarities of the roles of elements in crystal structures. Additionally, we noticed that well-known connections between physical properties and elements can also be probed using this plotting technique (see figures 5(b) and (c) for details). Although these visualisations (figures 4 and 5) are intended to confirm expected outcomes rather than to show novel findings, they demonstrate potential utility, e.g. for giving researchers new insights or helping them find materials with desired properties.
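The element highlighting in figure 4 amounts to colouring the map by a per-element membership mask. A minimal sketch, using a naive formula tokeniser and made-up formula strings rather than the MP data, is:

```python
import re

def contains_element(formula, element):
    """True if `element` appears as a whole element symbol in `formula`.
    A simple regex tokeniser; assumes conventional formula strings."""
    return element in re.findall(r"[A-Z][a-z]?", formula)

# Toy formula list standing in for the 122 543 MP materials.
formulas = ["LiCoO2", "Fe2O3", "YBa2Cu3O7", "CoO"]

# Boolean mask used to highlight one element at a time on the t-SNE map.
mask = [contains_element(f, "Co") for f in formulas]
print(mask)  # [True, False, False, True]
```

Note that the tokeniser matches whole symbols, so 'Co' does not spuriously match the 'C' and 'O' in other formulas.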

Local neighbourhood analysis
We next examine the local neighbourhoods of several benchmark areas to verify whether the learned metric recognises functionality-level material similarity. Since the embeddings were optimised with the Euclidean distance, we also used this metric to determine the neighbourhoods.
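Retrieving neighbourhoods under this metric is a simple ranking by Euclidean distance; the sketch below illustrates the idea on a toy embedding matrix (placeholder values, not our learned embeddings).

```python
import numpy as np

def nearest_neighbours(embeddings, query_idx, k=5):
    """Return indices of the k rows nearest to embeddings[query_idx],
    ranked by Euclidean distance (the query itself is excluded)."""
    dists = np.linalg.norm(embeddings - embeddings[query_idx], axis=1)
    order = np.argsort(dists)
    return [int(i) for i in order if i != query_idx][:k]

# Toy example: row 3 is deliberately placed close to row 0.
E = np.array([[0.0, 0.0],
              [10.0, 0.0],
              [0.0, 10.0],
              [0.1, 0.1],
              [7.0, 7.0]])
print(nearest_neighbours(E, query_idx=0, k=2))  # [3, 4]
```

In our analyses the same ranking is applied to the 1024-dimensional embeddings to produce neighbour lists such as tables 1 and 2.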
As the first example, we analysed the neighbourhoods of Hg-1223, a superconductor with the highest known Tc (134 K) at ambient pressure [42]. To our surprise, the first and second nearest neighbours are its close kin Hg-1234 and Hg-1212, which also have high Tc values (125 K and 90 K) but block layers [40] different from those of Hg-1223 (see figure 6(a)). Further investigation identified major Tl-based high-Tc superconductors, such as Tl-2234 (Tc = 112 K), Tl-2212 (Tc = 108 K), and Tl-1234 (Tc = 123 K) [43], and many other superconductors occupying the top-50 neighbourhoods (see table 1). The connection between the crystal structures and Tc values involves non-trivial mechanisms that are not immediately evident from the crystal structures [3,40]. The results suggest that our model effectively bridges this gap with the help of learned structure-functionality relationships that are deeply buried in the 1024-dimensional space.
Table 1.
We compare the top-50 neighbours of the Hg-1223 superconductor obtained by using our embedding and two hand-crafted descriptors (Ewald sum matrix and sine Coulomb matrix) [22]. The query material, Hg-1223 (HgBa2Ca2Cu3O8), has the highest known Tc (134 K) at ambient pressure. Quite impressively, the neighbour list obtained by our embedding appears to be completely filled with superconductors, including the well-known Hg-1234 (No. 1) and Hg-1212 (No. 2) as well as Tl-based high-Tc superconductors such as Tl-2234 (No. 8), Tl-1234 (No. 32), and Tl-2212 (No. 44). By contrast, the lists obtained by the two existing descriptors contain irrelevant materials rather than superconductors. These results clearly show that our approach captures the conceptual similarity between superconductors, which is undetectable by the existing descriptors. See also the SI (appendix A3) for the detailed procedures of the descriptor computations and more discussions.

Next, we examined lithium-ion battery materials, which substantially support our everyday lives. This technology has been developed through the discovery of new materials and the understanding of their structure-composition-property-performance relationships, and it is now bottlenecked by the cathodes (positive electrodes) in terms of energy density and production cost [5]. We therefore studied the neighbourhoods of LiCoO2, the first and still most dominant cathode material [5]. Impressively, two of the three leading cathode material groups, namely the layered and spinel families [5] (see figure 6(b) for visualisations), were identified in the neighbourhoods. Specifically, similar to LiCoO2, a family of layered LiMO2 compounds, with M being transition metals, was found within the top-10 neighbourhoods of LiCoO2 (see table 2), including the important LiNiO2 family of battery materials. The spinel family, another important group, appeared with LiNi2O4 as the 51st neighbour and LiCo2O4 ranked in the 200s. The polyanion family, the remaining one of the three major cathode families, was not placed in the vicinity of LiCoO2 but formed a distinctive cluster at the top edge in figure 3(a). Interestingly, all of these materials were developed by the group of Nobel laureate John Goodenough [5]. This fact suggests that the embeddings capture conceptual similarity among battery materials that previously required one of the brightest minds of the time to discover. Note that our method properly links substituted materials to the original material without being confused by ad hoc supercell expressions (e.g. Li4Co3NiO8 = LiCo0.75Ni0.25O2). This advantage is particularly noticeable in comparison with embeddings constructed using conventional features (table 2). This result indicates that our approach can recognise the essential structural features without being affected by superficial differences (i.e. the number of atoms or the size of the unit cell).
Additionally, we analysed the vicinities of magnetic materials, including 2D ferromagnets, which are attracting much attention for their interesting properties [41], and commercial samarium-cobalt (Sm-Co) permanent magnets. Again, the embeddings capture meaningful similarity in these material classes, as shown in figures 7 and 8, which is often not evident to non-specialists (see appendix A1 in the supplementary information (SI) for more discussions and detailed results).
These in-depth analyses across diverse materials consistently support the conclusion that our ML model recognises similar functionalities of materials behind different structures without being explicitly taught to do so. We anticipate that when a material with beneficial properties is found, we may be able to screen for new promising candidates based on the conceptual similarities captured in this embedding space.

Performance validation as a materials descriptor
Here we provide quantitative insight into the characteristics of the embeddings. In particular, we analyse the performance of predicting material properties using the trained embeddings as input. As we are more interested in predicting functional material properties, we conducted a binary classification task of materials concepts, in which an ML model predicts whether or not a material belongs to a particular material class.
We expect that our embeddings contain the information of materials concepts. If so, we can rapidly screen materials with a desired concept from a material database by combining the embeddings with an ML model. However, properly labelling materials with their concepts requires experiments or consideration by experts, and thus the number of labelled data available for a given concept is likely to be limited. Therefore, as a benchmark and a use case for our embeddings, we evaluated the materials concept classification in settings with few training data.
As benchmark materials, we used superconductors and thermoelectric materials for their complex and interesting properties. We used the Crystallography Open Database (COD) as the data source. The number of positive data used for training was 469 for superconductors and 286 for thermoelectric materials. Embeddings of these materials were obtained by the crystal structure encoder trained on the MP dataset via deep metric learning, and were used as input to a random forest classifier. As a baseline for comparison, we used latent feature vectors of crystal graph convolutional neural network (CGCNN) trained for total energy prediction, as in appendix C. We evaluated the prediction performance with leave-p-groups-out cross-validation while varying the training data size. Here, both the training and testing splits were made to contain balanced positive and negative samples.
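As a concrete illustration of this benchmark setup, a minimal sketch might pair precomputed embeddings with a random forest classifier. Everything below is synthetic: the arrays stand in for our embeddings and concept labels, and a simple balanced holdout split replaces the leave-p-groups-out protocol used in the actual evaluation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for trained embeddings and concept labels: the first
# coordinate carries a weak class signal, the rest is noise.
n, dim = 60, 16
X = rng.normal(size=(n, dim))
y = (np.arange(n) < n // 2).astype(int)  # first half: concept members
X[y == 1, 0] += 3.0                      # inject a separable signal

# Few-shot setting: train on a small balanced subset, test on the rest.
train_idx = np.r_[0:5, 30:35]            # 5 positives + 5 negatives
test_idx = np.setdiff1d(np.arange(n), train_idx)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[train_idx], y[train_idx])
acc = clf.score(X[test_idx], y[test_idx])
print(f"test accuracy: {acc:.2f}")
```

In the real experiments, `X` would be the crystal-structure encoder's 1024-dimensional embeddings for COD materials and `y` the superconductor or thermoelectric labels.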
As shown in figure 9, the classifier using our embeddings achieved good classification performance for both superconductors and thermoelectric materials. In particular, when the number of training samples is very small (around 10), our method shows significantly better performance than the baseline. We discuss these results further in the next section.

Discussions
As assumed, materials concepts were exposed spontaneously in an abstract space. As we confirmed in the numerical evaluations of the training task of metric learning (appendix B in SI), this space was shown to successfully unify the two complementary factors of crystal structures. We hypothesise that these remarkable properties of our embeddings were made possible by the following two key features of our method that distinguish it from the existing material embedding methods [25][26][27][28][29]. First, we used deep metric learning, which directly optimises the spatial arrangements of the embedding vectors via a loss in terms of the Euclidean distances between them. This procedure is critically different from the existing methods [25][26][27][28][29], which learn embeddings indirectly as a DNN's latent vectors. Although these latent vectors should encode essential information about materials, the explicit metric optimisation of embeddings is equally important for map creation and similarity learning. Second, our self-supervised learning is enabled by exploiting two forms of inputs expressing complementary structural characteristics: a set of atoms in the unit cell with their connections as the local characteristics and the XRD pattern, which is essentially a Fourier-transformed crystal structure [1], as the periodic characteristics. Representation learning is known to be generally better informed when diverse multi-modal data are used for training [44]. In contrast to approaches that rely on a single form of materials data expression [25][26][27][28][29], our model benefits from learning across two forms of expression, or cross-modal learning.

Table 2.
We compare the top-50 neighbours of LiCoO2 obtained by using our embedding and two hand-crafted descriptors (Ewald sum matrix and sine Coulomb matrix) [22]. The query material, LiCoO2, is one of the most crucial lithium-ion battery cathodes. In the list of our embedding, many of the neighbours of LiCoO2 are occupied by LiCo1−xMxO2 families with the same layered structure as LiCoO2 but partly substituted with different transition metals M. Since these partial substitutions are represented as supercells, the systems' apparent sizes are larger than the original unit cells. Our approach is unaffected by these apparent differences and can recognise the essential similarities. While most of our list is filled with lithium oxides, the other two lists obtained by the existing descriptors do not show this consistent trend. These results suggest that our model recognises the concept of lithium-ion battery cathodes, which is not captured by the existing descriptors. See also the SI (appendix A3) for the detailed procedures of the descriptor computations and more discussions.

Figure 7.
Crystal structures of the 2D ferromagnet Cr2Ge2Te6 and its neighbours in the embedding space. The double discoveries of 2D ferromagnets in 2017, after their existence had long been questioned, are attracting great interest from the magnetic materials community [41]. When we analysed the neighbourhoods of one of these 2D ferromagnets, Cr2Ge2Te6 (mp-541449), our embedding space successfully captured CrSiTe3 (mp-3779), a compound known as a potentially 2D-ferromagnetic insulator, as the first neighbour and even the other 2D ferromagnet CrI3 (mp-1213805) as the 15th neighbour among 122 543 materials. More detailed results and discussions are given in the SI (appendix A1).

Figure 8.
Crystal structures of the Sm2Co17 permanent magnet and its neighbours in the embedding space. Here we highlight two compounds in the neighbourhood list of Sm2Co17: SmCo5 and SmCo12. Sm2Co17 and SmCo5 are the two major components of Sm-Co magnets often used in high-temperature environments, whereas SmCo12 is one of the compounds with the so-called 1-12 structure, which has been drawing attention for its potential for permanent magnets. In the neighbourhoods of Sm2Co17 (mp-1200096) in our embedding space, we found SmCo12 (mp-1094061) as the 255th neighbour and SmCo5 (mp-1429) around the top 0.5% of neighbourhoods. It is well known in the community that the crystal structures of Sm2Co17, SmCo5, and SmCo12 are closely connected with each other [4]. However, without the literature context and proper visualisation, it is difficult for a human analyst to recognise these connections. More detailed results and discussions are given in the SI (appendix A1).
The results of the materials concept classification (figure 9) clearly support these hypotheses. Recall that the baseline method (CGCNN [12]) learns embeddings as latent vectors in a DNN with only crystal structures as input, whereas our method uses the same DNN but trains it along with another DNN for XRD patterns in cross-modal deep metric learning. Thus, the performance advantage of our method directly indicates the benefit of the proposed cross-modal deep metric learning approach. We believe that using both crystal structures and XRD patterns helped the ML model to capture local motifs and the lattice more effectively, which contributed to better learning of structural patterns correlated to material functionality and thus better recognition of materials concepts. We expect that incorporating more diverse structure representations of materials, such as the electronic structure, into our multi-modal learning framework will further benefit the representation learning of materials. We leave such extensions as future work.

Figure 9.
The prediction performance of materials concept classification. For superconductors and thermoelectric materials, embeddings obtained by our deep metric learning approach show higher performance, especially when the size of the training dataset is very small. The embeddings of the baseline were latent vectors of a CGCNN trained to predict the total energy from crystal structures, as done by Xie et al [12].
To provide more insight into the difference between our method and the baseline (CGCNN), we further analysed the performance of these methods for physical property prediction (see appendix C in SI for details). Similarly to the materials concept classification, we trained random forest models to predict materials properties, such as total energy, space group, and density, from learned embeddings. Our embedding outperformed the baseline in predicting density and space group and performed comparably in total energy and magnetisation (figure S1 in SI). This result confirms that our embeddings indeed capture lattice information in crystal structures more effectively than the single-modal baseline using only crystal structures. Performing comparably in total energy prediction is also notable, because the embeddings of the baseline are trained to specifically predict total energy itself using rich supervision from density functional theory calculations while our embeddings are not.
A major question about the proposed method, given its good predictive power, is whether it has potential utility for new material discovery. To investigate this possibility, we conducted a simple test to see whether our model can rediscover superconductors known in the literature but not included in the training dataset. To this end, we borrowed the COD superconductors from the concept classification (figure 9) and, after removing overlaps with the MP training dataset, mapped their embeddings onto the MP embedding distribution presented in figure 3. As shown in appendix E, these COD superconductors are most intensively concentrated around the superconductor cluster of the MP training materials, despite the fact that these materials are novel to the model. This result suggests a method of screening new candidate materials by using our model trained on a database of known materials.
Another notable strength of our method over existing material embedding methods is that it does not require costly annotations and can be trained using only primitive structural information (i.e. crystal structures and their XRD patterns). This makes our method applicable to a wide range of datasets. Even when annotations are available, our self-supervised approach will benefit many users as a means of pre-training. Pre-training is a general ML technique performed on a large-scale dataset to help an ML model for other tasks where annotated training data are limited [45]. Our self-supervised learning is suitable for this purpose, because it can be performed given only crystal structure data and can thus utilise various material databases at scale.
When compared to classic material descriptors such as the Coulomb matrix variants [22], our method has advantages in terms of its scalability and ability to capture high-level material properties. See tables 1, 2 and the SI (appendix A3) for analysis results and more discussions.
Since the focus of our study was on learning material similarity from unannotated structural data, the resulting map requires manually interpreting clusters on the basis of our knowledge of materials concepts.
Interestingly, a word2vec model [26] has been applied to text symbols appearing in the materials science literature, thus learning relationships such as the connections between 'Fe' and 'metal' and between 'Sm-Co' and 'magnet'. Use of this technique may further automate the interpretation of our results with textual knowledge from the literature.

Conclusions and broader impacts
In summary, we have demonstrated the self-supervised learning of material embeddings solely from crystal structures using DNNs. Careful inspection of the embedding space, in terms of both the global distribution and local neighbourhoods, has confirmed that the space recognises functionality-level material similarity or materials concepts. Our techniques for the materials space visualisation and the similarity evaluation between crystal structures will be useful for discovering new underlying relationships among materials and screening for new promising material candidates. Since these techniques are not strongly affected by human bias, they could give rise to a new view of materials that can stimulate efforts to break through our knowledge barriers.
Our result is also applicable to material retrieval systems that can search a database for conceptually similar materials given a query material. This approach will enable us to rediscover materials that have never been recognised to have desirable properties.
Furthermore, constructing a functionality-aware representation space of crystal structures is a first step towards the inverse design of materials [8,46], a grand challenge of materials informatics. This workflow would allow us to design materials in the functionality space and inversely map the functionality attributes to synthesisable crystal structures with the desired properties. We hope that this study will pave the way for breakthroughs in the ML-assisted discovery and design of materials.

Data acquisition and pre-processing
We used the Materials Project as the data source for this study. We collected data for up to quinary systems, excluding monatomic crystals, on 8 July 2020, using the Materials Project API, which resulted in a total of 122 543 materials (93% of the source collection) as our targets. We additionally queried material attributes related to thermodynamic stability on 14 October 2020. We used VESTA [47] for crystal structure visualisation. We calculated the XRD patterns using pymatgen [48]. The x-ray wavelength was set to 1.54184 Å (Cu Kα1), and the 2θ angle ranged from 10° to 110° with a step size of 0.02°; thus, 5000-dimensional vectors of 1D-structured XRD patterns were produced. To ease the learning process, the intensity scale of each XRD pattern was normalised by setting the maximum intensity to 1.
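The discretisation and normalisation step above can be sketched as follows. This is a minimal NumPy illustration assuming a list of peak positions and relative intensities (as a diffraction calculator such as pymatgen's would return) is already available; the function name and example peaks are illustrative, not from the study.

```python
import numpy as np

def xrd_to_vector(two_theta_peaks, intensities,
                  start=10.0, stop=110.0, step=0.02):
    """Discretise a list of diffraction peaks into a fixed-length 1D
    intensity vector (5000 bins for 10-110 degrees at a 0.02-degree step),
    then normalise so that the maximum intensity equals 1."""
    n_bins = round((stop - start) / step)  # 5000
    vec = np.zeros(n_bins)
    for angle, intensity in zip(two_theta_peaks, intensities):
        idx = int((angle - start) / step)  # bin index of this peak
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    if vec.max() > 0:
        vec /= vec.max()  # set the maximum intensity to 1
    return vec

# Illustrative peaks at 31.7 and 45.4 degrees with relative intensities 100 and 60
v = xrd_to_vector([31.7, 45.4], [100.0, 60.0])
```

Each material is thereby represented as a fixed-size vector suitable as input to the 1D convolutional encoder.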

Neural network architecture
As illustrated in figure 2(a), we used two types of DNNs as embedding encoders. For the crystal-structure encoding, we need to convert a set containing an arbitrary number of atoms (i.e. the atoms in the unit cell) into a fixed-size embedding vector in a fashion invariant to the permutation of atom indices. For this purpose, we used CGCNNs [12]. As input to the CGCNN, the 3D point cloud of the atoms in the unit cell is transformed into a graph of atoms whose edge connections are defined by their neighbours within a radius of 8 Å. The atoms in the graph are represented as atom feature vectors and are transformed into a single fixed-size feature vector via three graph convolution layers and a global pooling layer. For the XRD patterns, we used a standard feed-forward 1D convolutional neural network designed following existing studies on XRD pattern encoding [49]. At the end of each network, we used three fully connected layers to output 1024-dimensional embedding vectors. Since one of these encoders is supervised by the output of the other in our self-supervised learning approach, training them simultaneously tends to be unstable compared with standard supervised learning. To stabilise the training process, we found that batch normalisation [50] after every convolutional/linear layer in both networks, except for the final linear output layers, is essential. We discuss this further in the SI (appendix B). Further details of our network architecture are provided in the SI (tables S6 and S7 in appendix D) and our ML model codes.
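The XRD-side encoder described above can be sketched in PyTorch as follows. The layer widths and kernel sizes here are illustrative placeholders (the actual architecture is given in SI tables S6 and S7); the sketch only reproduces the structural points stated in the text: a 1D-CNN over 5000-dimensional patterns, fully connected layers ending in a 1024-dimensional embedding, and batch normalisation after every convolutional/linear layer except the output layer.

```python
import torch
import torch.nn as nn

class XRDEncoder(nn.Module):
    """Minimal sketch of a 1D-CNN encoder for 5000-dim XRD patterns.
    Layer sizes are illustrative, not the paper's exact architecture."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=8, stride=4),
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8, stride=4),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),  # fixed-size feature map
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 8, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, embed_dim),  # no batch norm on the output layer
        )

    def forward(self, x):              # x: (batch, 5000)
        h = self.conv(x.unsqueeze(1))  # add channel dim -> (batch, 1, 5000)
        return self.fc(h.flatten(1))   # -> (batch, embed_dim)

enc = XRDEncoder()
emb = enc(torch.randn(4, 5000))        # a batch of 4 patterns -> (4, 1024)
```

The crystal-structure side would analogously end in three fully connected layers producing embeddings of the same dimensionality, so that the two outputs live in a shared space.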

Training procedures
In each training iteration, we processed a batch of N input material samples. Let x_i and y_i be a pair of embedding vectors produced for the ith crystal structure in a batch and its XRD pattern, respectively. For each positive pair (x_i, y_i), we randomly drew two kinds of negative samples x′_i and y′_i, representing a crystal structure and an XRD pattern, respectively, from the batch to form two triplet losses:

L_nx^(i) = max(0, ∥x_i − y_i∥ − ∥x′_i − y_i∥ + m),  (1)
L_ny^(i) = max(0, ∥x_i − y_i∥ − ∥x_i − y′_i∥ + m),  (2)

where the negative sample x′_i was chosen from {x_k}_{k≠i} to produce a positive-valued loss, L_nx^(i) > 0, and y′_i was chosen similarly from {y_k}_{k≠i} (see also figure 2(b) for illustrations). Here, m > 0 is a hyperparameter called the margin. Equation (1) essentially requires that, for each embedding y_i, its negative samples x′_i be cleared out of the area surrounding y_i with the radius of the positive-pair distance ∥x_i − y_i∥ (red circle in the top-right part of figure 2(b)) plus the margin m (yellow area in the figure). Equation (2) is defined similarly. These losses thus ensure, given an embedding as a query, that its paired embedding is retrievable as the query's nearest neighbour. Note that the choice of the margin m is quite flexible because its value is relevant only to the scales of the embeddings, which are unnormalised and arbitrarily learnable. Here, m = 1. Our bidirectional triplet loss was then computed as the average of the losses for all samples in the batch:

L = (1/N) Σ_{i=1}^{N} (L_nx^(i) + L_ny^(i)).  (3)

This expression is similar to, but simpler in form than, a loss expression previously used in cross-modal retrieval [51]. We optimised the loss function using stochastic gradient descent with a batch size of N = 512. Using the Adam optimiser [52] with a constant learning rate of 10^−3, we conducted iterative training for a total of 1000 epochs over all target materials in the dataset. The training took approximately one day using a single NVIDIA V100 GPU.
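The bidirectional triplet loss can be sketched numerically as follows. This is a minimal NumPy illustration: for concreteness it uses the hardest in-batch negative per sample, which is one possible realisation of the in-batch negative sampling described above, not necessarily the study's exact sampling scheme.

```python
import numpy as np

def bidirectional_triplet_loss(x, y, m=1.0):
    """Bidirectional triplet loss for a batch of paired embeddings:
    x (crystal structures) and y (XRD patterns), both of shape (N, D).
    Negatives are drawn in-batch; here, the hardest (closest)
    non-matching embedding is used in each direction."""
    dpos = np.linalg.norm(x - y, axis=1)                # ||x_i - y_i||
    dxy = np.linalg.norm(x[:, None] - y[None], axis=2)  # dxy[i, j] = ||x_i - y_j||
    n = len(x)
    off = ~np.eye(n, dtype=bool)                        # mask out positive pairs
    dnx = np.where(off, dxy, np.inf).min(axis=0)        # min_{k != i} ||x_k - y_i||
    dny = np.where(off, dxy, np.inf).min(axis=1)        # min_{k != i} ||x_i - y_k||
    l_nx = np.maximum(0.0, dpos - dnx + m)              # triplet loss, negative structure
    l_ny = np.maximum(0.0, dpos - dny + m)              # triplet loss, negative pattern
    return (l_nx + l_ny).mean()                         # batch average

# Perfectly aligned pairs that are far apart incur zero loss:
loss_far = bidirectional_triplet_loss(np.array([[0.0], [10.0]]),
                                      np.array([[0.0], [10.0]]))
```

When the negatives lie closer than the positive-pair distance plus the margin, the loss becomes positive and pushes them apart, matching the geometric picture of figure 2(b).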
For details regarding our strategies for validating the trained models and tuning the hyperparameters (e.g. choices of the embedding dimensionality and training batch-size), see appendix B and table S5 in the SI.

Data acquisition for the concept classification tasks
For the materials concept classification, we collected the crystal structure data of superconductors and thermoelectric materials from the COD. To collect positive samples for each category, we retrieved material entries containing certain keywords in their paper titles. Specifically, entries including 'superconductor' or 'superconductivity' in their titles were regarded as superconductors, and entries including 'thermoelectric' or 'thermoelectricity' were regarded as thermoelectric materials. The same number of material entries without these keywords were randomly collected and used as negative samples.
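The keyword-based labelling rule above can be sketched as follows. The COD query itself is omitted; entry titles are assumed to be available as plain strings, and the case-insensitive matching is our illustrative reading of the rule.

```python
SUPERCONDUCTOR_KEYWORDS = ("superconductor", "superconductivity")
THERMOELECTRIC_KEYWORDS = ("thermoelectric", "thermoelectricity")

def label_entry(title, keywords):
    """Return True if any keyword occurs in the entry's paper title
    (case-insensitive), mirroring the positive-sample rule above."""
    t = title.lower()
    return any(k in t for k in keywords)

label_entry("Superconductivity in MgB2", SUPERCONDUCTOR_KEYWORDS)  # True
```

Note that substring matching also captures plural and derived forms such as 'superconductors'.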

Data availability statement
The materials data retrieved from the Materials Project, the trained embeddings of these materials, and the trained ML model weights are available at the figshare repository [53]. The list of the target materials used in this study, the lists of the neighbourhood search results, and interactive web pages for exploring the materials map visualisation and analysing local neighbourhoods are available in the GitHub repository (https://github.com/quantumbeam/materials-concept-learning).