The representative structure of graphene oxide nanoflakes from machine learning

In this paper we revisit the structure of graphene oxide, and determine the pure and truly representative structures for graphene nanoflakes using machine learning. Using 20 396 random configurations relaxed at the electronic structure level, we observe the presence of hydroxyl, ether, double bonds, aliphatic (cyclohexane) disruption, defects and significant out-of-plane distortions that go beyond the Lerf–Klinowski model. Based on an diverse list of 224 chemical, structural and topological features we identify 25 archetypal ‘pure’ graphene oxide structures which capture all of the complexity and diversity of the entire data set; and three prototypes that are the truly representative averages in 224-dimensional space. Together these 28 structures, which are shown to be largely robust against changes in thermochemical conditions modeled using ab initio thermodynamics, can be downloaded and used collectively as a small data set for with a fraction of the computational cost in future work, or independently as an exemplar of graphene oxide with the required oxidation.


Introduction
What is the actual structure of graphene oxide nanoflakes? There have been countless studies on the properties of graphene oxide (GO) [1][2][3][4] for electrochemical [5], bio-medicine [6], sensing [7], catalysis [8], energy [9] and environmental applications [10]. It has been well established that the performance of GO is strongly coupled to degree and type of oxidation, and virtually all of physical properties can be tuned by controlling the number and distribution of functional groups on the (upper and lower) surfaces, with far fewer studies focussing on what the representative structure look like, and none delivering atomic coordinates that can be used by others for ongoing work.
It has been known for nearly a century that GO is a hydrophillic two dimensional oxidized form of graphene, with O functional groups decorating (and disrupting) the sp 2 basal plane, ranging in size from a few nm to mm. The first model of the structure of GO was proposed by Hofmann and Holst, [11] such that oxygen was bound to a hexagonal carbon sheet by epoxy (1, 2-ether) with a formula of C 2 O. This model was revised by a series of authors over the last 80 years, taking into account the wrinkling of the sheet into cyclohexane chairs with the presence of axially bound OH groups and 1, 3-ether [12,13]; higher supersaturation of hydrogen that eliminate the ether groups and an enforced rigid linear structure [14]; the impact of fluorination [15]; and the development of oxygenation [16]. In 1998 the Lerf-Klinowski model was proposed [17,18] which has since been depicted in most GO publications (and all of the reviews cited above) since this time. While instructive this model has a number of limitations, reflecting only the chemical connectivity and not the spatial distribution of the functional groups, and assuming aromatic 'islands' intermixed with oxidized aliphatic regions containing C-OH, epoxide groups and double bonds. No defects or rings with more or less than 6-members were included; nor were out-of-plane distortions disrupting the nearly flat structure beyond slight perturbations due to tetrahedrally bonded OH groups, or intrinsic ripples [19,20]. More recently in 2006 Szabó et al [21] revised the Lerf-Klinowski model based on new evidence and reintroduced wrinkling via ribbons of cyclohexane chairs, and the possibility of larger out-of-plane distortions. The Lerf-Klinowski model still persists however, even though it is largely inconsistent with the structures obtained by computational studies at all levels of sophistication, [22][23][24][25] and observations using electron microscopy [2,21,26,27].
In this paper we revisit the structure of GO and determine the pure and truly representative structures for graphene nanoflakes using machine learning. We make no prior assumptions about the structure of GO, other than the range of concentrations of O and OH chemisorbing to the surfaces observed in experiments, and sample 20 396 random configurations at the electronic structure level. In all structures we observe the presence of ether, double bonds, aliphatic (cyclohexane) disruption, defects and significant out-of-plane distortions that depend on the size of the nanoflake. We identify 25 archetypal 'pure' GO structures which can be used to define the configuration space, capturing all of the complexity and diversity of the entire data set; and three prototypes that represent the averages of the three classes in the data set. Together these 28 structures, which are shown to be largely robust against changes in thermochemical conditions modeled using ab initio thermodynamics, can be downloaded and used together as a representative 'little' data set for future research, or individually when the profiles are matched to observations from conventional experiments.

Data set
To gather this data set a large number of virtual GO nanoflakes were systematically created to sample a wide range of sizes, topologies, oxygen concentrations, chemical groups and distributions. The data set contains 20 396 samples with areas ranging from 320 to 2457 Å 2 including four different main morphologies: hexagonal (49.5%), trigonal (14.3%), rectangular (30.5%), and rhombic (5.7%). The total number of atoms in each sample varies from 191 to 1949 including C, H, and O, and different ratios of armchair (AC) and zig zag (ZZ) edges were included. The density and distribution of oxygen groups have a significant role in deriving GO properties, so for each of the 24 primary pristine graphene nanoflakes numerous O/H concentrations were sampled, each with hundreds of random distributions (see figure S41 in the supporting information, which is available online at stacks.iop.org/NANOF/3/045001/mmedia). In each case the O/C ratio was between 4.05% and 52.08%, and the H/C ratio between 2.22% and 49.26%. In the majority of cases the edges were H-terminated (91.3%) with the remainder unpassivated to increase the structural diversity of the set. Figure 1 shows the distribution of ether and hydroxyl concentrations resulting from our sampling process. Note that the concentration of ether and hydroxyl is measured with respect to the total available oxygen, while the concentration of oxygen is measured with respect to the total number of atoms. A range of simple statistics are given in table 1 to provide an overview of the breadth of samples in our data set. The relaxed structure of each sample was determined at the electronic structure level using density functional tight binding (DFTB) method (as described in the supporting information). More than 30 million CPU hours were consumed in order to perform these calculations with this level of computational sophistication.

Feature extraction
Extracting a set of features that can well characterize the samples and explain the variance in our data set is a key factor to better understanding of the population of our ensemble. A comprehensive structural analysis was performed to extract 830 structural features. These features are grouped into four main categories: thermodynamic (1 feature), structural (19 features), chemical (20 features), and statistical (790 features). The thermodynamic feature is the probability of observation, which was obtained using a Boltzmann distribution via the enthalpy of formation from the DFTB calculations and the thermodynamic treatment described below.
The 19 structural features characterize different geometrical aspects of the flake such as its average diameter, minimum and maximum dimensions, total area of flake, the anisotropy in the flake, and the ZZ and ACedge ratios. Attributes characterizing defects and out-of-plane distortions are included in this group. The 20 chemical features include the number and distribution of atomic and chemical species on each nanoflake (counts and concentration of each species, where concentration is counts normalized by total number of atoms) to provide useful overall information about each structure. Of the 790 statistical features, 16 characterize the atomic coordination of C, H and O; 268 characterize the ring statistics; 91 describes bond lengths between different species; 132 describe bond angles; 268 describe oxygen density distributions; 2 define particle and mass density; and 13 describe the distance distributions between different oxygen groups. The 281 features characterizing the oxygen distribution capture the differences between nanoflakes with similar graphene (primary) structures and oxygen concentrations, but different patterning of oxygen groups on the surface. For this purpose the following Gaussian probability (density) distribution function (PDF) is used: where r ref is the reference location in space to evaluate the species-dependent probability distribution, r i is the distance of ith species of type s from the reference location, and α is a smoothing parameter. Here, α is set such that the Gaussian function would fade at a distance ∼4 Å (which is higher than van der Waals distance between the species on GO). This PDF can be used to define a local density at any point in space and for any type of species. However, these local densities are not normalized, which presents problems for nanoflakes of finite sizes. Lattice sites around the edges and corners of a nanoflake will have a different local environment than those in the interior; fewer carbons providing fewer opportunities for functionalization. To resolve this the local density of carbon atoms can be used to normalize each density distribution using: where ρ C is the local density of carbon atoms at a reference location (x, y, z). The density functions in equations (1) and (2) can provide local densities of carbon and oxygen atoms as well as ether and hydroxyl groups. The density functions are sampled at the location of each carbon atom, then different statistics such as mean value, standard deviation, skewness, and kurtosis are calculated in order to characterize the distribution. While, these statistics provide useful information on oxygen heterogeneity, they do not provide insights into inter-oxygen locality. For instance, a nanoflake with oxygen accumulated at the center could have similar density statistics as a nanoflake with oxygen accumulated at another location, such as near an edge or corner. To distinguish such cases and include localities the moment of inertia of the density functions are also calculated around the center of the GO nanoflakes using:  calculated. In each case (x i , y i , z i ) is measured from the geometrical centroid of all carbon atoms as it closely represents the geometrical center of each nanoflake. The summation is over sampling points which is set to the location of carbon atoms, noting that the PDF is a continuous function and it can be sampled at any location. Sampling at the location of carbon atoms provides a good resolution of density and is intuitive. Additional features can also be determined by normalizing the inertia moments as follow: Similarly, the above normalization can be applied for all other inertia components.
To characterize the distribution of species that are on or near an edge a shape (weight) function f E is introduced that becomes 1 at the edges and approaches zero at the center. By multiplying the PDF by this shape function, we can obtain the density distribution of species in proximity to edges. This shape function is given by: where, the summation is all over the carbon atoms which are located on edges and r ref denotes the location where the density is calculated. In order to have a smooth filtering effect, a Gaussian distribution is used in equation (6) but any other function with a similar weighting property could be used. Using the shape function in equation (6), a set of additional features are extracted, taking the list to 830. The complete set of features with their descriptions are provided in table S1 in the supporting information. Many of these features are highly correlated and some have significant imbalances, which must be eliminated to avoid biasing future analysis. To reduce the number of dimensions features with more than a 99% imbalance (i.e. more than 99% of samples have the same feature value) were removed. The correlation between the remaining features were analyzed using a correlations matrix with a threshold of 95%, and only one of the highly correlated features was retained. Following this feature selection process the number of features was reduced to 224, which are marked with an asterisk in table S1.

Archetypal analysis (AA)
AA is a matrix factorization method where an n×m matrix X represents a multivariate data set with n observations and m variables; in our case n corresponds to the GO nanoflakes in the data set and m corresponds to the structural features [28]. AA aims to find an p×m matrix Z that corresponds to the archetypal or 'pure patterns' in the data in such a way that each data point can be represented as a mixture of those archetypes. In other words, the AA yields the two n×p coefficient matrices α and β which minimize the residual sum of squares: with α ik 0 and i=1, K, n.
To solve (7) a general-purpose constrained nonlinear least squares algorithm can be used, but this could be timeconsuming for large data sets. Therefore we have followed the alternating optimization procedure suggested by Cutler and Breiman, in which the algorithm alternates between finding the optimal αʼs for a given set of x mixtures and finding the optimal x mixtures for a given set of αʼs. The latter problem effectively optimizes β for a given value of α [28]. In this iterative scheme each step requires the solution of several convex least squares problems of the following form: Find w 1 , ..., w q to minimize   å = w u t k q k k 1 2 for given u and t 1 , K, t q , subject to w k 0 and å = = w 1 k q k 1 . An in-house python package is developed to implement this algorithm and extract the archetypes for this study which as been made available online for download [29].

Clustering
Clustering methods are unsupervised pattern recognition techniques that group samples based on a similarity index. k-Means clustering which uses distance between the samples as a similarity index, is the most widely used technique due to its computational efficiently [30]. k-Means is a vector quantization method that partitions n observations (samples) into k clusters. Each observation belongs to the cluster with the nearest mean where the mean of each cluster (centroid) is the prototype. The observations in each cluster can then be represented by its prototype structure. The number of clusters (k) must be defined a priori, so that k-Means can group the n observations (x 1 , x 2 , K, x n ) into k(n) sets S=S 1 , S 2 , K, S k such that the within-cluster sum of squares is minimized: where a Euclidean distance is used as a measure of similarity and μ i is the mean of points in S i . The distance between the points is calculated in a d-dimensional space, which d is the number of variables (features).
A presumption in k-Means clustering is that the samples form clusters with approximately spherical shapes with similar size, which often yields poor results for dispersed and nonlinear data. Another drawback is in specifying the number of clusters k; an inappropriate choice for k can also result in poor performance. While k-Means is one of the easiest clustering algorithm to apply, it can give misleading results as, given in input k it will cluster the data, even if there is no clusters present (null case).
Density-based clustering techniques overcome this limitation and are capable of identifying clusters of various sizes, shape and dispersivities. One of the most robust density-based clustering techniques is densitybased spatial clustering of applications with noise (DBSCAN) which, unlike k-Means, does not require the number of clusters to be specified a priori. DBSCAN does require two hyper-parameters, ò and the minimum number of points to define dense region, which can significantly affect the quality of clustering [31,32]. Another important issue of DBSCAN is in dealing with data sets with large variations in densities; setting parameters to high density can result in labeling clusters of low densities as noise, while a low density estimate may result in incorrect grouping of clusters.
Recently, a new clustering method has been introduced that has the advantage of including hyper-parameter optimization [33]. Iterative label spreading (ILS) is based on a general definition of a cluster and the quality of a clustering result, and is capable of predicting the number and type of clusters and outliers in advance of clustering, regardless of the complexity of the distribution of the data. ILS can be used to evaluate the results from other clustering algorithms, or perform clustering directly. It has been shown to be more reliable than alternative approaches for simple and challenging case (such as the null and chain cases) and to be ideal for studying noisy data with high dimensionality and high variance, as we have for our GO nanoflakes.
We have tested all three clustering techniques in this study.

Thermodynamics
The thermodynamic feature mentioned above (the probability of observation) can be determined using a Boltzmann distribution, by combining the results calculated from electronic structure simulations at the ground state, and the extensive thermochemical data measured at the standard state. Under the assumption that the change in electronic energies represents the most significant contribution to the change in free energy (i.e. other contributions cancel out), the free energy change in a reaction at the standard state can be calculated as , and is usually omitted. In this method, the gap between the ground state and the standard state in the chemical potential is filled by the connection energy where T 0 is 0 K, and μ 0 is the chemical potential under the standard pressure (P 0 =1 atm). The chemical potential μ 0 (T) can be calculated using the heat capacities, C p , and entropy, S, such that: in which A to G are Shomate coefficients. Equation (12) can further be extended to include finite pressures, P, using: where P 0 is the vacuum, and R is the gas constant. By calculating the chemical potential for each constituent element, the temperature-and pressure-dependent Gibb's free energy can be obtained as: We can also see that although we have a large number of GO nanoflakes in the set, this set is by no means exhaustive, as there are regions of the archetype-space that are not populated. In addition to significantly reducing the dimensionality by transforming our feature-space into archetypespace, AA is capable of identifying archetypal structures that were not present in the original set, which can be characterized (in the 224 dimensional feature-space) to gain some insight into unseen data. By employing an empirical cumulative distribution function (ECDF), the histogram profiles of the archetypes are extracted and included for all 25 archetypes are provided in the supporting information. Due to the large number of dimensions in the feature-space each profile is split into four histograms. The histograms reveal how influential each feature is in decerning the identity of the archetype; each archetype is high in some features and low in others. The archetypes are also ordered by ranking them based on the   a å -= X Z i n i i k k 1 2 value, which is the summation of errors in approximating data points with an individual archetype.
As mentioned above, archetypes do not need to be in the original set to exist and to be identified, but in a sufficiently large set it is possible that they will be present, or that some very close matches have been captured. In our case we calculated the Euclidean distance between each archetype and the original data points in the featurespace, and used the ECDF to extract the profile of these best matching nanoflake. The profiles of the best matching nanoflake is overlayed with each archetype in the supporting information to quantify the quality of the match, which is captured more succinctly in the radar plot in figure 2(c). This plot indicates how close each matched GO nanoflake is to the corresponding archetype; highlighting that a large population of similar structure does not always guarantee finding a pure type a priori. A20 is present in our data set, but this region of the archetype-space is sparsely populated. The region around A8 is heavily populated, but the best match to A8 is not a perfect match. It would take hundreds of thousands (possibly millions) of samples to spontaneously capture the pure types without AA to identify what their profiles look like, but once identified these small collections of matched structures have been shown to be invaluable in summarizing the properties of larger data sets [34] or providing a guide as to which structures are worthy of further investigation [35,36]. These best matching structures for each archetype are illustrated in figure 3, and are available for download [37].
To provide some straightforward interpretation of the diversity of the archetypal GO nanoflakes table 2 lists a set of selected features taken from the detailed profiles. Here we can see that the archetypes span a wide range of sizes, with vastly different concentrations of hydroxyl, ether and ZZ edges (inputs into the DFTB+ simulations), and point defects or planar distortion (outputs from the DFTB+ simulations). The defect concentration appears universally low, which reflects the fact that few defects (broken C-C bonds) formed during the DFTB+ relaxation, but even there the entire range is represented. In general there is no underlying pattern (such as smaller nanoflakes having higher ether concentration, etc) since these are by definition the extreme cases around the convex hull. The only exception to this observation is that larger nanoflakes exhibit larger distortions, which is simply due to larger nanoflakes being able to accommodate distortions with a magnitude or period that is larger than the smaller nanoflakes are in size.
Beyond providing specific information on the nature of the pure types of GO nanoflakes, AA is also a powerful tool for mapping the high-dimensional feature-space to 2-dimensions, and like other visually interpretable dimension reduction methods high-dimensional patterns in the data set can be seen by color coding each point with additional information [38]. Figure 4 presents the same simplex plot from figure 2(b) encoded by the size, shape and composition of each sample. In each case the shape of the symbol reflects the shape of the nanoflake (hexagonal, rectangular, rhombic or trigonal), and the relative size of each symbol reflects the relative size of the nanoflakes. Figures 4(a) and (b) are colored by the ether and hydroxyl concentrations, respectively. Figures 4(c) and (d) are colored by defects concentration and distortion index. While these plots are highly complex (and a number of sample are mapped into similar regions in the convex hull, obscuring the detail) it is easy to see distinct groupings where structures with high similarity in 224 dimensions are adjacent in 25-dimensional archetype space, and to understand why the archetypes are important structures. For the purposes of clarity these plots are separated by shape and reproduced in the supporting information. Color encoding of this type is not limited to the features used to extract the archetypes; any of the 830 original features, or any future property label, can also be visualized in this way.

Prototypes
While archetypes are an excellent way of defining the structure space and describing all possible structures in it, they cannot average over the sample as a whole. Prototypes are the representative data points, defined by the centroids of clusters with similar characteristics in 224 dimensions, and represent the average nanoflakes in the data set. To identify the prototypes we tested three different clustering techniques (k-Means, DBSCAN and ILS), with ILS providing superior performance. Both k-Means and DBSCAN suffer from some limitations that are challenged by our highly nonlinear and imbalanced data set, and require subjective hyper-parameters to be estimated a priori (i.e. the number of clusters, epsilon and minimum number of samples, respectively). The elbow criteria and a grid search technique was used to optimize these parameters, but the simplicity and advantages of ILS verifiably identified our three clusters (which k-Means did not) without mistaking the noise (as DBSCAN did). ILS clustering does not require a priori hyper-parameter estimation or optimization, but must be initialized so the results were repeated with various starting points and always returned same three clusters. The results for the ILS clustering, and the identity of the centroid prototypes are described below (along with the profile histograms in supporting information), and the corresponding results for k-Means and DBSCAN are provided in the supporting information for completeness.
To provide insight into how the samples are clustered a simplex plot of data set is shown in figure 5 where each sample is colored by the ILS cluster, in the low dimensional archetype-space. The prototypes are highlighted by the crosses, and are separated individually in radar plots to show the contribution of the 25 archetypes to the description of the three prototypes identified using ILS ( figures 5(b)-(d)). The largest cluster (shown in red) contains 77.8% of the samples, while the other two include 15.8% (green) and 6.4% (purple) of the nanoflakes, respectively. This is similar to the results for DBSCAN (shown in the supporting information), which also recognized a large cluster containing 76.5% of the data (with two more clusters with 5.1% and 1.4% , and four tiny clusters containing only a few points) but interpretted 15.0% of data as noise. k-Means clustering was performed with five clusters (see supporting information) resulting in clusters with population of 36.5%, 40.4%, 7.2%, 9.3%, and 6.6%, which reflects its pre-assumption of a spherical distribution of 5 enforced clusters. As shown in the supporting information, k-Means splits the large cluster into two smaller clusters for no Figure 3. The closest matching archetypal graphene nanoflakes. Each structure is available for download [37]. apparent reason. This cluster-encoded simplex plot has been separated by shape and re-plotted in figure S37 in the supporting information for clarity.
To understand the characteristics of the ILS prototypes the histogram profiles are extracted and compared with their closest matching nanoflake, as shown in figures S38, S39 and S40, and a number of selected features are listed in table 3 (for both the prototypes and the closest matching nanoflakes). In contrast to the archetypes, all of the prototypes differ mainly in their total area, and the degree of out-of-plane distortion (which is related to the area, as mentioned above), which can be confirmed by visual inspection of the marker sizes in figure 5(a). This trend is remarkably consistent amongst the clustering techniques (see supporting information), even when clusters have been mistaken (DBSCAN and k-Means). As confirmation that the prototypes are averages of the nanoflakes in each cluster, we generated radar plots of the best matching prototypical nanoflakes in the 25dimensional archetype-space (figures 5(b)-(d)), where we can see that they are centrally located near the origin and require numerous archetypes to define them.
The atomic configurations of the three ILS prototypes are shown in figure 6 and provided along with the archetypes for download [37].

Discussion
One of the advantages of using GO in various applications is the ability to tune the properties indirectly via structural interventions; for example, GO can be a semiconductor or an insulator depending on the degree of oxidation. However, the indirect nature of the interventions make tuning GO challenging, since the structure is difficult to characterize due to its nonstoichiometric composition that depends on the thermochemical conditions, its strongly hygroscopic nature, and its slow decomposition above 60°C-80°C. A lack of control can turn these advantages into disadvantages. Incorrect assumptions about the concentration of oxygen or oxygencontaining groups, the distribution of oxygen on the surface, the location and nature of defects, and the degree of out-of-plane distortion can lead to inappropriate data sets being generated and used, which ultimately lead to misleading results. Using the archetypes and prototypes ensures that the most inclusive and representative structures are assumed, and dramatically increases efficiency. A computational study using the archetypes exclusively will capture all of the complexity and diversity inherent in our 20 396 nanoflake data set, effectively reducing the computational cost by three orders of magnitude (99.87%). In contrast, a computational study using the prototypes exclusively will provide an average of all of the 20 396 nanoflake data set, effectively reducing the computational cost by four orders of magnitude (99.98%).
One question that arises however, is how these results relate to the low energy nanoflakes, which are often incorrectly assumed to be 'representative' of polydispersed samples? The low energy structure, or stereotypes, are generally special cases and exist as outliers, but it is interesting to see where they reside in the 25-dimensional archetype space with respect to the prototypes. In figure 7(a) we provide a simplex plot showing the location of the 77 stereotypes in archetype-space, colored by the total electronic energy per atom. We can see from this plot that the low energy stereotypes are not distributed across the space, and are therefore not representative of the data set; instead being group in the vicinity of A5. We can also see that they are all hexagonal, lacking the diversity of the archetypes; and relatively small, lacking the diversity of the prototypes. In figure 7(b) we can see the linear contribution of the archetypes needed to describe the lowest energy nanoflake, and note the significant imbalance. Using low energy stereotypes to represent the data set would clearly be misleading.
Another question is how robust the archetypes and prototypes are with respect to changes in the data set? Firstly, AA assumes that the data set remains unchanged, and removal of samples will impact the outcome. The archetypes for one set will be different from the archetypes of another set, which is why we have used over 20 000 samples covering such a large range of configurations here. Secondly, AA uses a random seed and so it is best to use as recursive algorithm to test random seeds until consistent set of archetypes is obtained [36]. Previous work has shown that the profiles of archetypes is sensitive to the type of distribution in the data set [39], but remarkably insensitive to the size range [36]. This leaves the remaining issue of how robust the archetypes are with respect to experimentally relevant conditions, such as changes in temperature, pressure or the surrounding environment.
To extend the ground state energy to finite temperatures we have used a first principles thermodynamics method [40], with energies obtained using the original electronic structure simulations (temperature T≈0 K, pressure P=0 Pa), and calculated thermodynamic probabilities of nanoflakes at different temperatures and pressure. An advantage of this method is the ability to estimate the probability under different environmental conditions, such as aerobic or anaerobic, humid or dry, by adjusting the oxygen supersaturation or the thermodynamic reference state. It is also possible to combine results (probabilities) weighted by competing reference states to predict the impact of (for example) 40% humidity. The thermodynamic probability can also be expanded to three (or more) extra features to study how the identity of the archetypes or prototypes are influenced by different processing or storage conditions; how pure, or representative, they really are.
To investigate the effect of environmental conditions such as temperature, pressure, and degree of ambient humidity, the thermodynamic probabilities were calculated using a Boltzmann distribution and the Gibb's free energy (as described above, using the QuickThermo software [41]). These three parameters were varied in intervals (δ) in a range as follows: T={300 K, K, 750 K; δT=50 K}, P={100 kPa, K, 200 kPa; δP=50 kPa}, and humidity, H={0%, K, 100%; δH=20%}. For each combination of (T, P, H) the  thermodynamic probability was calculated for all samples in our data set and AA repeated. To track the (T, P, H)dependent changes (T, P, H)=(300 K, 100 kPa, 0%) was selected as the baseline, and each subsequent archetype mapped to the convex hull of the baseline set. Figure 8(a) presents the simplex plot of the changes in the archetypes with respect to variations in T, P and H, where each point is encoded by the temperature or humidity level as defined in figure 8(b). As we can see, the identity of the majority of the archetypes remain unaffected by changes in the environmental conditions, remaining on the vertices of the convex hull, regardless of T or H. A1, A2, and A7 however, experience a sudden and considerable displacement at higher temperatures, with A8 and A18 also exhibiting a temperature-dependent displacement. These archetypes describe regions of data set which are sparsely populated and experience high variance. In all cases changing the pressure in the specified range did not affect the archetypes. The same procedure was used to track changes in the prototypes and stereotypes, using only the extreme conditions, and it was found that they are unaffected by the changes in the environmental conditions. This was expected since the average (over all results) and the lowest energy scale with temperature, pressure or humidity.

Conclusions
The structure of graphene oxide is critical to its performance in a variety of applications, and so the configuration of model structures is also critical to predictions made in advance of synthesis and characterization. Using a large and diverse data set of over 20 000 GO nanoflakes fully relaxed at the electronic structure level, characterized by an exhaustive list of 830 chemical, structural and topological features we have isolated the pure and representative form of GO using machine learning. We find that the 830 dimensional space can be reduced to 224 dimensions, and used in conjunction with multivariate AA to identify 25 archetypal structures, and clustering analysis to identify 3 prototypes that can represent the entire ensemble set. The archetypes and Figure 6. The closest matching prototypical graphene oxide nanoflakes. Each structure is available for download [37]. prototypes bear similarity to the Lerf-Klinowski model, but include much more structural complexity (such as defects and distortions) and capture the importance of distributions of oxygen groups, in addition to the type and concentrations. They cover the entire size range of the ensemble, with different ether, hydroxyl and defect concentrations, and different oxygen distributions. By applying ab initio thermodynamics we show that these representative GO nanoflakes are largely robust against changes in the temperature, pressure and supersaturation of water in the surrounding environment. These 28 GO nanoflakes can be used collectively as a small data set with a fraction of the computational cost in future work, or independently as an exemplar of graphene oxide with the required oxidation to match experiments.