Digital Fingerprinting of Microstructures

Finding efficient means of fingerprinting microstructural information is a critical step towards harnessing data-centric machine learning approaches. A statistical framework is systematically developed for compressed characterisation of a population of images, which includes some classical computer vision methods as special cases. The focus is on materials microstructure. The ultimate purpose is to rapidly fingerprint sample images in the context of various high-throughput design/make/test scenarios. This includes, but is not limited to, quantification of the disparity between microstructures for quality control, classifying microstructures, predicting materials properties from image data and identifying potential processing routes to engineer new materials with specific properties. Here, we consider microstructure classification and utilise the resulting features over a range of related machine learning tasks, namely supervised, semi-supervised, and unsupervised learning. The approach is applied to two distinct datasets to illustrate various aspects and some recommendations are made based on the findings. In particular, methods that leverage transfer learning with convolutional neural networks (CNNs), pretrained on the ImageNet dataset, are generally shown to outperform other methods. Additionally, dimensionality reduction of these CNN-based fingerprints is shown to have negligible impact on classification accuracy for the supervised learning approaches considered. In situations where there is a large dataset with only a handful of images labelled, graph-based label propagation to unlabelled data is shown to be favourable over discarding unlabelled data and performing supervised learning. In particular, label propagation by Poisson learning is shown to be highly effective at low label rates.


Introduction
Materials informatics, namely the use of data science and informatics approaches to improve processing and performance of materials, has advanced significantly in recent years, see e.g. [2,3,4]. More recently, microstructural informatics, the process of collecting, organising, sharing and simulating microstructural data and relating microstructure to processing and properties, has garnered interest, particularly since the uptake of machine learning (ML).
A material's microstructure can be viewed as a platform to build bridges between processing and properties (processing determines the microstructure which, in turn, determines mechanical properties). Suitable representation of microstructural image data should allow for these bridges to be constructed. Such a representation is here dubbed a "microstructural fingerprint" and there are already several options for construction within the literature, although these representations are often viewed as a means of improving the accuracy of a specific ML task, rather than simply seeking a truly representative compressed form of microstructural image data (which could have universal applications). The question is then whether such representations can describe the "essence" of the microstructure itself, or just a particular image. Some potential options for such representations already exist and have been shown to enable ML predictions of various mechanical properties. A traditional approach would be to make measurements of metrics such as grain size/distribution, lamellar spacing etc. and concatenate these metrics into a vector to form a fingerprint [5]. This is shown to be useful and can predict mechanical properties such as tensile strength [6]. However, this approach is rather limited by the number of metrics at our disposal and, also, by the fact that different metrics apply in each case, depending on the microstructure. Two-point correlation functions, reduced via principal component analysis (PCA), have been shown to provide a useful representation of microstructure and can also be applied to predict elastic properties [7,8]. Prediction of elastic properties in the form of full stress-strain curves has also been shown to be possible, with the use of synthetic pole figures constructed from electron backscatter diffraction (EBSD) data as input [9]. Generative algorithms, such as generative adversarial networks (GANs), have been shown to enable construction of synthetic datasets [10,11], which is very useful in its own right, but the compressed representations defined by the latent space which initialise the GAN may also be considered as fingerprints and are yet to be fully exploited.
Finding efficient means of fingerprinting microstructural information is critical towards capitalising on data-centric learning approaches [12]. Within the last decade, there has been a drive towards population-based dimension reduction and feature extraction [13,14] and subsequent supervised learning (SL) tasks [15,16], which operate with a set of training data labelled with relevant properties. There are a growing number of works on unsupervised learning (UL), which can operate with the same features as SL, but without any labels [14,17,4,18,10]. A natural addition to these are the semi-supervised learning (SSL) methods, which operate on a set of N inputs with a set of pN labels, where p ∈ (0, 1). This provides the gamut from UL (p = 0) to SL (p = 1). A few works on SSL have appeared in recent years [19,20,21,22]. The present work adds to this by (i) presenting a unified framework for fingerprinting, which includes classical feature extraction methods from computer vision known as visual bag of words (VBOW) [23,24,25,26,13] and vector of locally aggregated descriptors (VLAD) [27,28,14], as well as more recent transfer learning with pretrained convolutional neural networks (CNNs) [29,30,31,14,17,4], and (ii) exploring the potential utility of some graph-based SSL methods [32,33] in this context.
VBOW and VLAD are based on summary statistics of what are referred to here as base features, which can be keypoint-based features, such as the scale-invariant feature transform (SIFT) [34,35] and speeded-up robust features (SURF) [36], as well as more recently adopted transfer learning approaches to feature extraction based on pretrained CNNs [37,29,14]. There are by now numerous choices of CNN to use for transfer learning; one example which has shown some success in this application, developed by the Visual Geometry Group at Oxford, is known as VGG [38,14].
The graph-based classification methods considered aim to capture the intrinsic geometry of the data manifold and construct a low-dimensional representation, if one exists. As such, they can be used for UL [39,40] and, in the case that a goal and some labelled data are identified a priori, SSL [32,33]. These methods are mature by now, and there is a growing body of mathematics literature relating to convergence of the graph Laplacian to the Laplace-Beltrami operator on the feature space and the consequences for the various related algorithms [39,40]. It is shown here that, for a priori known classification tasks, these SSL methods are able to significantly improve upon the accuracy of UL methods with very few labels (as few as 5%) and even approach the accuracy of SL methods. The methods therefore represent a useful contribution in the context of microstructure characterisation (as well as other imaging domains), where the amount of data is vast, but the process of labelling can be cumbersome and time-consuming.
In this study, we present a comparison between VBOW and CNN methods for feature extraction and fingerprint construction. We begin by outlining the general approach for fingerprint construction from base features in Section 2. Section 3 then covers several candidates for base feature extraction. A framework for performing statistical analysis of base features is proposed in Section 4; this forms the final stage in fingerprint construction.
Learning methods (namely SL, SSL and UL) are described in Section 5, and methods for assessing the accuracy of such learning methods are discussed in Section 6. Section 7 provides details of the data that was used for testing and outlines the classification task that forms the basis of our fingerprint quality assessment. Results are then presented in Section 8 (supplementary code can be found at [1]) and discussed in Section 9. A concise summary of the conclusions is provided in Section 10.

General Approach to Fingerprinting
A fingerprint of an image can be considered as a compressed numerical representation of image data, i.e. a feature mapping f : R^{m_1×m_2} → R^n, where x is an m_1×m_2 image and f(x) is the fingerprint, an n-dimensional vector, such that n ≪ m_1 m_2. Note that m_1 and m_2 can depend on x, i.e. the number of pixels can vary from image to image. However, this dependence is suppressed to simplify the notation. A microstructural fingerprint refers to an image fingerprint computed from a micrograph. Here, we consider images acquired via optical microscopy and scanning electron microscopy (SEM). It is supposed that a suitable fingerprint may be able to capture the "essence" of the microstructure, enabling new methods for material characterisation.
There exist a variety of options for construction of the fingerprint, f(x). The general approach considered here is centred around constructing base features from the image and clustering said features, before applying summary statistics to the clustered features. Base features are constructed via a map S : R^{m_1×m_2} → R^{J(x)×d}, which delivers a set of J(x) base features, {S_1(x), ..., S_{J(x)}(x)}, for each image, where each feature is represented as a d-dimensional vector. Suppose there are N images. A population, P, of all features from all images can then be constructed,

P := {S_p : p = 1, ..., P} = ∪_{i=1}^{N} {S_1(x_i), ..., S_{J(x_i)}(x_i)},

where P denotes the total number of features in P.
A clustering function is then learned from P, which assigns a label, t_p ∈ {1, ..., K}, to each base feature, S_p ∈ P, resulting in "similar" features being assigned the same cluster label. k-means clustering [41] was utilised at this step for all results presented in this paper. K here is arbitrary and refers to the number of clusters to fit to the population of features. Cluster centres, µ_k, for k = 1, ..., K, are learned to minimise the distance between cluster centres and data points. The µ_k are initially randomised and subsequently updated according to

µ_k = (1/|S_k|) Σ_{S_p ∈ S_k} S_p,     (1)

where S_k := {S_p ∈ P : t_p = k}, and clusters are assigned according to

t_p = argmin_{k ∈ {1,...,K}} ||S_p − µ_k||.

This process is repeated in a loop in which each feature is assigned a cluster label according to the current nearest cluster centre. Each cluster centre is then updated via (1), which effectively results in the cluster centre moving towards the centre of mass of those features currently belonging to that cluster. Spectral clustering [39] was considered, but was too computationally expensive to validate. The Nyström extension [42] offers an approximation to spectral clustering; however, the results were very poor during testing. We will revisit spectral clustering in Section 5.3, as it is a viable option for clustering fingerprints (rather than clustering base features) for unsupervised learning classification, provided there is a relatively low number of classes in the dataset and/or fingerprints are relatively low-dimensional (this will depend on the computational power available).
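The assignment/update loop above can be sketched in a few lines of NumPy. This is an illustrative implementation (not the paper's code); for simplicity the centres are initialised with the first K features, whereas a scheme such as k-means++ would be preferable in practice.

```python
import numpy as np

def kmeans(P, K, n_iter=50):
    """Minimal Lloyd's algorithm: cluster a population P (total features x d)
    of base features into K clusters via the update (1) and the
    nearest-centre assignment rule."""
    # Simple deterministic initialisation: the first K features
    # (k-means++ is a better choice in practice).
    centres = P[:K].astype(float)
    for _ in range(n_iter):
        # Assignment step: label each feature with its nearest centre.
        dists = np.linalg.norm(P[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centre to the mean of its cluster.
        for k in range(K):
            members = P[labels == k]
            if len(members):
                centres[k] = members.mean(axis=0)
    return centres, labels
```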
Finally, various summary statistics of {S_j(x)}_{j=1}^{J(x)} are constructed for each sample image, x. Summary statistics of order l are denoted H_l : R^{m_1×m_2} → R^{n_l} for l = 0, 1, 2, ..., where n_l depends on the order of the fingerprint and the number of clusters. One or more summary statistics are ultimately combined to form the map f(x) ∈ R^n.

Base Features
There are several options for constructing base features. Here, we consider keypoint-based features and transfer learning features from pretrained convolutional neural networks (CNNs). In either case, each feature can be considered as a local descriptor of a given patch from the image, although in some cases (for CNN-based fingerprints) the patch may be the entire image itself.

Patch-Based Maps
Patch-based methods first identify a set of patches z_i ∈ R^{m×m}, for i = 1, ..., J(x), from a given image, x. We then define S_i(x) = s(z_i), where s : R^{m×m} → R^d, i.e. each feature can be considered as a vector describing a local neighbourhood centred at each keypoint (patch centre).
The set Ω(x) := {z_1, ..., z_{J(x)}} can be constructed according to a particular span and stride, e.g. if one has square images (m_1 = m_2) and m_1 is divisible by m, then letting m′ = m_1/m and selecting vertical and horizontal strides of m produces a partition of x into (m′)^2 patches, z ∈ R^{m×m}. If one allows Ω(x) to be random, then it could consist of patches around randomly selected points, which can be viewed as a version of bootstrapping for data augmentation.
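The non-overlapping span/stride construction above can be sketched as follows (an illustrative NumPy snippet; the function name is our own):

```python
import numpy as np

def partition_patches(x, m):
    """Partition a square image x (m1 x m1, with m1 divisible by m)
    into non-overlapping m x m patches using a stride of m,
    giving (m')^2 patches with m' = m1 / m."""
    m1 = x.shape[0]
    assert x.shape[0] == x.shape[1] and m1 % m == 0
    mp = m1 // m  # m' in the text: patches per row/column
    # Reshape to (block_row, row, block_col, col), reorder, then stack.
    patches = (x.reshape(mp, m, mp, m)
                .transpose(0, 2, 1, 3)
                .reshape(mp * mp, m, m))
    return patches
```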

Keypoint Features
The set Ω(x) can also be designed using keypoint-based methodology, which has proven very valuable in the field of computer vision, particularly for object recognition. Such methods identify a set of J(x) keypoints from an image, x, together with orientation and scale information. This is used together with interpolation to define the patches. One could, in principle, ignore the scale or orientation information, or both. There are several candidates for keypoint feature extraction. Here, we consider the scale-invariant feature transform (SIFT) [35] and a lower-dimensional alternative, speeded-up robust features (SURF) [36].
The SIFT algorithm detects keypoints via the Difference of Gaussians (DoG). Gaussian blur is applied to the input image several times, each to a varying degree. The blurred images are then stacked and the DoG is computed as the difference between adjacent Gaussian-blurred images. Local extrema are identified and their locations serve as the keypoints. A 16×16 neighbourhood is then constructed, centred at each keypoint, and divided into sixteen 4×4 subregions. The scale of this neighbourhood depends on the Gaussian scale at which the keypoint was detected. An 8-bin histogram of oriented gradients (HOG) is then computed for each of the 16 subregions and concatenated to form a 128-dimensional feature vector for each keypoint. Inclusion of this orientation information ensures the features are rotation invariant. For further discussion of the finer properties of these methods, the reader is referred to [35]. The top left of Figure 1 shows an example of a micrograph annotated with SIFT descriptors, represented by coloured rings. The size of these rings corresponds to the Gaussian scale at which the keypoint was detected, and the dominant orientation is represented by a straight line emanating from the centre of the ring. The remainder of the figure provides a visual representation of feature population, P, construction and subsequent clustering.
SURF descriptors are similar to SIFT, in the sense that they also contain information regarding gradient distribution. However, they are much faster to compute and benefit (as far as compression is concerned) from being only 64-dimensional vectors, compared with SIFT, which produces features of twice that size. To achieve this, the DoG is first approximated via convolution with box filters, where the size of the filter corresponds to the amount of Gaussian blur in the DoG. Keypoints are then identified as the locations of local extrema across scales, as with SIFT. Keypoints are described locally via the Haar wavelet transform [43]. A local neighbourhood is taken around each keypoint and split into 16 subregions, which form a 4×4 grid centred at the keypoint. The Haar wavelet response is then recorded for each subregion as a 4-dimensional vector, [dx, |dx|, dy, |dy|], where dx, dy are the Haar wavelet responses in the horizontal and vertical directions, respectively, and |dx|, |dy| are the absolute values of the responses. Haar wavelet responses are computed for all subregions and concatenated into a single vector. A 4-dimensional vector for each of the 16 subregions results in the 64-dimensional descriptor. See [36] for more details.

Figure 1: Visual representation of feature population construction and clustering from SIFT keypoint features, where each feature is represented as a ring, centred at a keypoint, with size and orientation corresponding to the scale at which the feature was detected and the dominant orientation of the feature, respectively.

CNN Features
Convolutional neural networks (CNNs) have proven very successful in computer vision tasks, perhaps most notably classification of the ImageNet dataset [44]. In this study, AlexNet [45] and VGG19 [38] (pretrained on the ImageNet dataset [46]) are utilised. An extensive collection of pretrained CNN architectures is available; these two were selected somewhat arbitrarily, the main aim being to compare a relatively low-dimensional network (AlexNet) with a more complex, higher-dimensional network (VGG). Using the recommended input image size (224×224) yields a 9216-dimensional output from AlexNet and a 25088-dimensional output from VGG19.
It is important to note here that the input size of a pretrained CNN is fixed by the first fully connected layer [47,37,29,14]. The first fully connected layer delineates the beginning of the classification portion of the CNN. As we are using pretrained networks, this portion is specific to classification of the ImageNet dataset and can be discarded to leave just the feature extraction section of the CNN. Therefore, the output of any convolution layer can be computed for an input image of arbitrary size, and recommended input sizes can be safely ignored. In fact, forcing the recommended input size can cause problems. There are two options for image transformation: cropping and deforming. Cropping the image may result in a non-representative volume element, i.e. too much data is discarded. Stretching will deform the grain morphology, and at different rates across the image set, dependent on variations in scale. Again, this can mean that the preprocessed image is not representative of the microstructure itself. However, preprocessing inputs by converting to RGB and normalising relative to the ImageNet data mean and variance across each RGB channel is crucial, as the weights associated with each CNN layer are trained to extract features from image data distributed in such a way.
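The RGB conversion and channel normalisation step can be sketched as follows. This is an illustrative snippet (not the paper's code); the mean and standard deviation values are the widely used ImageNet channel statistics for inputs scaled to [0, 1].

```python
import numpy as np

# Widely used ImageNet channel statistics (RGB, for inputs scaled to [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(gray):
    """Convert a single-channel micrograph (uint8, H x W) to a normalised
    RGB array (H x W x 3) suitable for a pretrained CNN, without resizing."""
    # Replicate the grey channel three times and scale to [0, 1].
    rgb = np.repeat(gray[:, :, None], 3, axis=2).astype(np.float64) / 255.0
    # Normalise each channel relative to the ImageNet statistics.
    return (rgb - IMAGENET_MEAN) / IMAGENET_STD
```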
Inputting raw images (not transformed in any way) yields an output from the final CNN layer whose size depends on the size of the input image. This output will be a tensor of size d_1(x)×d_2(x)×d, where d_1(x), d_2(x) depend on both the size of the input image and the CNN architecture, and d depends only on the CNN architecture. We can then define the feature matrix, S, by these J(x) d-dimensional feature vectors, where J(x) = d_1(x)d_2(x), from the final convolution layer, and either max pool, resulting in a single d-dimensional output for an image, flatten into a single d_1(x)d_2(x)d-dimensional feature vector, or utilise the H_l framework of summary statistics described in Section 4, treating each row in the J(x)×d output as essentially a keypoint feature.
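The three options for summarising the convolutional output tensor can be sketched as follows (an illustrative NumPy snippet operating on an already-computed tensor; the function name and mode strings are our own):

```python
import numpy as np

def fingerprint_from_conv(T, mode="maxpool"):
    """Summarise a final-convolution-layer output T of shape (d1, d2, d)
    into one of the three forms described above."""
    d1, d2, d = T.shape
    if mode == "maxpool":
        # Global max pool: one d-dimensional vector per image.
        return T.reshape(d1 * d2, d).max(axis=0)
    if mode == "flatten":
        # Flatten: a single d1*d2*d-dimensional vector.
        return T.reshape(-1)
    if mode == "features":
        # Treat each spatial position as a keypoint-like feature:
        # a J(x) x d matrix with J(x) = d1 * d2, ready for clustering.
        return T.reshape(d1 * d2, d)
    raise ValueError(mode)
```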

Reduced Feature Population
Depending on which method is utilised for base feature extraction, the number of features extracted for each image may be relatively large, i.e. J(x) ≫ d. The larger J(x) is, the more information we have at our disposal to describe each image. Intuition says that maximising feature information for each image should result in the highest classification accuracy, at the cost of longer run times. However, it is possible to reduce each individual image's feature set from a J(x)×d array to a d×d array (i.e. reduce a set of J(x) features to a set of d features), before concatenating to form the population of features, P ∈ R^{dN×d}, where N denotes the number of images in the dataset. One approach to this is to transpose the feature set before applying principal component analysis (PCA) and transposing back again. This results in a reduction in the number of features, rather than in the dimensionality of each feature, which is the usual aim with PCA. Population compression can be large in some cases, and this approach retains important information from across the entire feature array for each image, whilst speeding up the clustering process.
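The transpose-PCA-transpose reduction can be sketched as follows. This is an illustrative SVD-based implementation (not the paper's code), assuming J(x) ≥ d:

```python
import numpy as np

def reduce_features(S):
    """Reduce an image's J x d feature array to d x d by applying PCA
    across features rather than dimensions: transpose to d x J, project
    onto the top d principal directions, and transpose back."""
    J, d = S.shape
    St = S.T                      # d x J: each feature now acts as a "dimension"
    St = St - St.mean(axis=0)     # centre before PCA
    # SVD-based PCA; rows of Vt are principal directions in R^J.
    U, s, Vt = np.linalg.svd(St, full_matrices=False)
    reduced = St @ Vt[:d].T       # d x d scores (requires J >= d)
    return reduced.T              # back to a d x d "feature set"
```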

Fingerprints
We propose a general framework, derived from statistical moments, for constructing fingerprints from clustered base features. We denote the fingerprint by H_l, where l is the order of the fingerprint. The cases l = 0, 1, 2 are defined below; higher order moments follow analogously, but are not tested here.

H_0
The zeroth order fingerprint, H_0, is constructed as a histogram of feature occurrences, where each bin represents a group of "similar" features (defined by k-means clustering). This results in a K-dimensional vector, where K is the number of clusters and each entry in the vector denotes the number of features assigned to that cluster centre (up to normalisation). This can be defined more rigorously as

H_{0,k}(x) = (1/J(x)) Σ_{i=1}^{J(x)} δ_{k,T(S_i(x))},

where T denotes the cluster assignment map, δ_{k,T(S_i(x))} is the Kronecker delta function and k = 1, ..., K. The full fingerprint is then given by

H_0(x) = [H_{0,1}(x), ..., H_{0,K}(x)] ∈ R^K.

This is referred to as the visual bag of words (VBOW) method [23] and has been considered in [13,14] with SIFT features. Figure 2 shows a schematic to help visualise fingerprint construction in this manner.
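Given per-feature cluster labels, the H_0 (VBOW) fingerprint is simply a normalised histogram; an illustrative sketch (function name our own):

```python
import numpy as np

def h0(labels, K):
    """Zeroth-order (VBOW) fingerprint: normalised histogram of the
    cluster labels assigned to one image's base features."""
    counts = np.bincount(labels, minlength=K).astype(float)
    return counts / counts.sum()  # entries sum to 1
```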

H_1
H_1 is a first order moment and considers mean features across each cluster centre. This produces higher-dimensional fingerprints than H_0 (K×d rather than K, where d is the length of each feature vector being clustered). This is a result of retaining a full-length feature vector (the mean) from each cluster, rather than just the number of occurrences of each cluster label. For k = 1, ..., K, define the 1×d fingerprint of level l = 1 associated to cluster k as

H_{1,k}(x) = (1/J_k(x)) Σ_{i=1}^{J_k(x)} S_{j_i}(x),

such that T(S_{j_i}(x)) = k for i = 1, ..., J_k(x), where J_k(x) denotes the number of features from x assigned to cluster k. The full K×d fingerprint is then given by

H_1(x) = [H_{1,1}(x); ...; H_{1,K}(x)] ∈ R^{K×d}.

This is a matrix where the k-th row corresponds to the sample mean of the k-th cluster associated to the input image, x. H_1 therefore does not contain information about the prevalence of features associated with the different clusters, but it can be combined with H_0 to form the fingerprint F(x) = [H_0(x)^T, H_1(x)] ∈ R^{K×(d+1)}, which can be written as f(x) = vec(F(x)) ∈ R^n, where n = K(d + 1).
H_1 contains information about the character of the average feature at each cluster centre for a given image. Depending on which feature extraction method is employed, some fingerprints may contain large numerical values, which cause these fingerprints to dominate over others when training classifiers. To help alleviate this, we can centre H_{1,k}(x) by subtracting the k-th cluster centre, µ_k, before L_2-normalising, i.e.

H^v_{1,k}(x) = (H_{1,k}(x) − µ_k) / ||H_{1,k}(x) − µ_k||_2.

This is then exactly the vector of locally aggregated descriptors (VLAD) [27,28]. Such fingerprints have been considered before in the context of microstructural fingerprinting in [14], and are shown to perform better than H_0. However, it is worth noting that n = Kd here, in comparison to n = K above. Therefore, some balance should be struck between complexity and accuracy. VLAD fingerprints will be denoted H^v_1.
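A sketch of the centred, per-cluster-normalised H^v_1 fingerprint described above (an illustrative implementation; the function name is our own, and empty clusters are simply left as zero rows):

```python
import numpy as np

def vlad(S, labels, centres):
    """VLAD-style fingerprint H^v_1: per-cluster mean feature, centred on
    the k-means cluster centre and L2-normalised, flattened to a
    Kd-dimensional vector. S: J x d features; labels: length-J cluster
    assignments; centres: K x d cluster centres."""
    K, d = centres.shape
    V = np.zeros((K, d))
    for k in range(K):
        members = S[labels == k]
        if len(members):
            residual = members.mean(axis=0) - centres[k]  # H_{1,k} - mu_k
            norm = np.linalg.norm(residual)
            if norm > 0:
                V[k] = residual / norm
    return V.reshape(-1)
```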

H_2
The second moment, H_2, is constructed similarly and effectively encodes the variance of features across each cluster. These fingerprints are higher-dimensional again (now K×d×d for the full H_2), due to the fact that we are now considering distances between features belonging to each cluster centre. This results in the d×d fingerprint of level l = 2 associated to cluster k, given by

H_{2,k}(x) = (1/J_k(x)) Σ_{i=1}^{J_k(x)} (S_{j_i}(x) − H_{1,k}(x)) ⊗ (S_{j_i}(x) − H_{1,k}(x)).

In general, the full fingerprint H_2(x) ∈ R^{K×d×d} is a third order tensor. It is possible to take the diagonal to reduce this, in which case H_2(x) can be summarised as a K×d matrix, like H_1(x). Nonetheless, the dimensionality quickly grows, along with the finer information in higher moments. We will also consider a VLAD-like second order moment, defined analogously with the cluster centre µ_k in place of the sample mean H_{1,k}(x).
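The diagonal reduction of H_2 amounts to a per-cluster, per-dimension variance; an illustrative sketch (function name our own):

```python
import numpy as np

def h2_diag(S, labels, centres):
    """Diagonal of the second-order fingerprint: per-cluster variance of
    the features about the cluster sample mean, giving a K x d matrix."""
    K, d = centres.shape
    H = np.zeros((K, d))
    for k in range(K):
        members = S[labels == k]
        if len(members):
            mean_k = members.mean(axis=0)            # H_{1,k}(x)
            H[k] = ((members - mean_k) ** 2).mean(axis=0)
    return H
```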

Multi-Scale H_l Fingerprints
For a fingerprint of order l with K cluster centres applied to an image x, the single-scale H_l framework provides fingerprints described by the map

H_l^{(K)} : R^{m_1×m_2} → R^{n_l}.

Multi-scale fingerprints can then be achieved through concatenation of fingerprints with different numbers of cluster centres, i.e.

x ↦ [H_l^{(K_1)}(x), H_l^{(K_2)}(x)].

To extend this approach, an arbitrary number of cluster models can be included by considering K_i for any i > 2. This sort of multi-scale fingerprint might prove useful in situations where several different descriptors are required to capture different classes of features, or where a correlative approach to microstructure characterisation provides multiple images with the same field of view via different techniques.

CNN Fingerprints
As discussed in Section 3.

PCA-Reduced Fingerprints
In the case of higher-order fingerprints, such as H_1 and H_2 (especially when applied to CNN features), the length, D, of each individual fingerprint may be much larger than the number of fingerprints, N, in the dataset, i.e. D ≫ N. If so, then PCA can be applied to the stack of fingerprints, F ∈ R^{N×D}, to reduce it from an N×D array to an N×N array. This speeds up classifier training post-fingerprinting and compresses fingerprints, whilst retaining the geometric information encoded in the features.

Learning Tasks
The ultimate goal of microstructural fingerprinting is to develop tools that produce informative and compressed representations of the microstructures associated with a particular population. These representations should be generic enough that fingerprints can be utilised for the solution of a variety of problems, e.g. the development of machine learning models for classification and/or regression tasks. In the present work, classification tasks are considered. Classification tasks can be split into three main categories, namely supervised learning (SL), semi-supervised learning (SSL) and unsupervised learning (UL).
Suppose that a fraction p ∈ [0, 1] of the data is labelled. Then, SL is the case where p = 1, UL has p = 0 and SSL covers everything in between, with p ∈ (0, 1). Of course, when p ∈ (0, 1) one could always discard the proportion 1 − p of unlabelled data and proceed with SL, which may often produce similar results, especially when p and/or the original dataset is very large. However, conventional wisdom says that if only a small proportion p of the data is labelled and a large amount of data is available, then one should try to extract information from both labelled and unlabelled data towards the ultimate learning goal. Labels can be propagated through the unlabelled images in the dataset, and this forms the basis of SSL.

Supervised Learning: p = 1
A plethora of methods are widely available for SL. Here, we compare support vector machine (SVM) [48] with random forest (RF) [49] approaches.

Support Vector Machine
SVM is one of the most widely used SL classifiers and has previously shown success for microstructure classification tasks [14]. The aim is to find the optimal hyperplane for separating the data (the hyperplane with maximum margin, where the margin is defined as the distance between the hyperplane and the nearest data point). In other words, we search for optimal values, w, b, such that

y_i (w · F_i + b) ≥ 1 for all i ∈ {1, ..., N},

where F_i ≡ f(x_i) is the fingerprint corresponding to image x_i for i ∈ {1, ..., N}, given a set of N images, w is a vector of weights, b denotes a measure of bias and y_i is a label assigned to F_i such that y_i = +1 if F_i is on one side of the hyperplane and y_i = −1 otherwise. As the data may not be linearly separable, a slack variable ξ_i is included for each training fingerprint to allow for some misclassification. We then optimise by minimising

(1/2)||w||^2 + C Σ_{i=1}^{N} ξ_i, subject to y_i (w · F_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0,

where C is a parameter to be specified and is typically set to 1, although cross-validation can be utilised to tune this. The solution can then be given by

w = Σ_{i=1}^{N} α_i y_i F_i,

where the α_i are to be determined numerically. This can be done by considering the Lagrange dual problem and solving the Karush-Kuhn-Tucker (KKT) conditions. Further details can be found in [50]. We can then define the class prediction, g(x), for a test image, x, with fingerprint F as

g(x) = sign( Σ_{i=1}^{N} α_i y_i (F_i · F) + b ).     (2)

Note that this is a binary classification, based on the sign of the solution. This can be extended to multi-class problems, as discussed at the end of this section.
The input space (the space containing our fingerprints) may not be linearly separable. To alleviate this, a map can be constructed to some feature space, φ(F_i), which is selected to transform the input space into a space which is linearly separable. Choosing the correct feature space can be challenging, and an alternative approach that removes the need for such a map is to apply a suitable kernel. If we choose our kernel, K, such that K(F_i, F_j) = φ(F_i) · φ(F_j), we can rewrite our prediction (2) as

g(x) = sign( Σ_{i=1}^{N} α_i y_i K(F_i, F) + b ),

which is free of φ.
A linear kernel results in essentially reverting back to solving equation (2). This works well for CNN fingerprints. A χ² kernel is utilised for VBOW, as recommended in [14]; however, this requires F_i ≥ 0 for all i ∈ {1, ..., N}. If F_i has been normalised in such a way that this condition is violated, a linear kernel should be applied.
As discussed above, SVM is a binary classifier, but there are several options to extend it to multi-class tasks by combining multiple binary classifiers. Here we use a so-called error-correcting output code (ECOC) [51], which assigns a unique binary string to each class label. The binary strings then allow for multiple binary classification problems, which can be combined to solve the multi-class problem.
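The ECOC construction can be sketched with scikit-learn's `OutputCodeClassifier` wrapping a linear-kernel SVM. Note this is an illustrative sketch on synthetic "fingerprints", not the paper's implementation, and the choice of scikit-learn is our own assumption about tooling:

```python
import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy "fingerprints": three well-separated classes in R^4.
X = np.vstack([rng.normal(c, 0.1, (30, 4)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 30)

# ECOC reduces the 3-class problem to several binary SVM problems,
# one per bit of the code book, and classifies by nearest code word.
clf = OutputCodeClassifier(SVC(kernel="linear", C=1.0),
                           code_size=4, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
```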

Random Forest
Random forests are an extension of decision tree classifiers. A decision tree is a set of nodes and decisions, ultimately resulting in a prediction [52]. They may be used for classification problems (as in our case), but they can also be applied to regression-based problems. The root node at the top of the tree will contain all fingerprints. The set of fingerprints is then split and propagated into a new pair of nodes. Several iterations of this process form a tree. At each node (including the root) a binary decision is made as to whether or not

(F_i)_j ≤ a

for some fixed value a (selected during training to minimise a cost function), where i ∈ {1, ..., N} indexes the images and j indexes the entries of the i-th fingerprint.
The Gini index, G, is a measure of the expected error rate at a given node [53] and is given by

G = Σ_c π_c (1 − π_c) = 1 − Σ_c π_c²,

where π_c denotes the probability that a fingerprint drawn at random from the node will be of class c. G is determined for each pair of nodes being fed and is combined as a weighted average across both nodes to provide the Gini impurity of the split. The Gini criterion can then be used to prune the decision tree by removing splits with relatively large Gini impurity [50]. Each decision splits the fingerprints into subsets, which are subsequently fed into other nodes until, finally, the nodes at the end of each branch contain only fingerprints from one class (or the specified maximum depth is reached). These final nodes are known as leaf nodes. Once trained, test data can be propagated through the decision tree to form a prediction, based on which leaf node the test data reaches.
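The Gini index and the weighted impurity of a split are a few lines of arithmetic; an illustrative sketch (function names our own):

```python
import numpy as np

def gini(class_counts):
    """Gini index of a node: G = 1 - sum_c pi_c^2, where pi_c is the
    fraction of the node's fingerprints belonging to class c."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(left_counts, right_counts):
    """Gini impurity of a split: the weighted average of the child-node
    Gini indices, weighted by the number of fingerprints in each child."""
    n_l, n_r = sum(left_counts), sum(right_counts)
    n = n_l + n_r
    return (n_l / n) * gini(left_counts) + (n_r / n) * gini(right_counts)
```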
Decision trees have a high variance. To alleviate this, several decision trees can be trained, and the ultimate prediction is given by the prediction with the most votes, i.e. the answer which the majority of the decision trees arrived at (or the average prediction for regression-based tasks). Each tree is trained on a random sample of the fingerprints, a process known as bootstrap aggregating (bagging) [54]. Decision trees trained in this way can still be correlated, due to the fact that the whole set of fingerprint dimensions is considered at every split, and this can lead to overfitting. Random forest is essentially bagging; however, during decision tree training, a random subset of the fingerprint dimensions is considered at each split (rather than the whole set). This reduces correlation between the decision trees, resulting in a random forest.

Unsupervised Learning: p = 0
For the UL task, we can use k-means clustering, as described in Section 4, as well as spectral clustering [39]. The latter is a graph-based method that requires a suitable transformation of the image data before essentially applying k-means clustering on the constructed graph.
A graph is comprised of nodes and edges. Each node denotes a fingerprint from an individual image and there are several options available for defining the edges/connections. Here, we consider k nearest neighbours [55] with k = 10, where neighbours are identified by minimising Euclidean distance. An adjacency matrix, A ∈ R^{J(x)×J(x)}, is constructed such that

A_ij = 1 if node j is among the k nearest neighbours of node i, and A_ij = 0 otherwise.

Note the directionality of this definition: A is not necessarily symmetric.
The degree matrix, D ∈ R^{J(x)×J(x)}, is also constructed. This is a diagonal matrix such that D_ii denotes the number of nodes connected to node i. The Graph Laplacian, L, is then given by L = D − A.
Eigenvalues and eigenvectors are computed for L. The eigenvalues can be used to estimate the optimal number of clusters (if not previously defined in the scope of the classification task) by plotting them and searching for the first discontinuity in the plot. This may be subjective at times, given the discrete nature of the eigenvalues. Finally, for a k-class problem with N fingerprints, the first k − 1 non-zero eigenvectors are calculated and form a k×N matrix. Each row contains information about an optimal cut to make on the Graph Laplacian to cluster the fingerprints (nodes in the graph). k-means clustering is performed on the eigenvectors, resulting in an N-dimensional vector of labels. This serves as our classification for each of the N images.
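The construction above (k-NN adjacency, L = D − A, then k-means on the leading eigenvectors) can be sketched as follows. The random two-blob fingerprints and the symmetrisation step A ← max(A, Aᵀ) are illustrative assumptions on our part; symmetrisation is one common way of handling the directionality noted above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

# Toy "fingerprints": two well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
F = np.vstack([rng.normal(0, 0.1, (30, 5)), rng.normal(3, 0.1, (30, 5))])
N, n_clusters, k = len(F), 2, 10

# Adjacency: A_ij = 1 if j is among the k nearest neighbours of i.
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(F)
_, idx = nbrs.kneighbors(F)
A = np.zeros((N, N))
for i, row in enumerate(idx):
    A[i, row[1:]] = 1.0          # skip self (first neighbour returned)
A = np.maximum(A, A.T)           # symmetrise (A alone need not be symmetric)

D = np.diag(A.sum(axis=1))       # degree matrix
L = D - A                        # graph Laplacian

# k-means on the eigenvectors of the smallest eigenvalues of L.
vals, vecs = np.linalg.eigh(L)
labels = KMeans(n_clusters=n_clusters, n_init=10,
                random_state=0).fit_predict(vecs[:, :n_clusters])
print(labels)
```

For these two cleanly separated blobs the graph has two connected components, so the two smallest eigenvalues are zero and k-means on the corresponding eigenvectors recovers the blobs exactly.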
Labels assigned in this way do not directly correspond to class labels (in theory, the true class labels don't even exist for UL). However, this is useful if labels are not known and the aim is to group "similar" data together, rather than classify each group. Regardless, we do have labels in this case and it is possible to construct a map from the predicted labels to the actual labels. This enables us to quantify the accuracy of the unsupervised learning methods.

Semi-Supervised Learning: p ∈ (0, 1)
For the SSL task, we again consider the Graph Laplacian, L, defined in Section 5.2. However, now that we have some label information, we can look to propagate those labels through the graph, rather than seeking optimal cuts as in UL. Here, we consider two methods for label propagation: Laplace learning and Poisson learning. Jeff Calder's Python package, GraphLearning [56], was used to test these learning methods.

Laplace Learning
Label propagation via Laplace learning was proposed in [57] and is based on solving Laplace's equation on the graph, i.e.

Lu(F_i) = 0 for pN < i ≤ N, subject to u(F_i) = y_i for 1 ≤ i ≤ pN,

where L denotes the Graph Laplacian, p is the fraction of data that have labels associated with them and N is the total number of fingerprints. This is solved for u(F_i), such that u(F_i) = [u_1(F_i), ..., u_k(F_i)] for the k-class problem. The labels y_i, for pN < i ≤ N, are then predicted as

y_i = argmax_{c ∈ {1,...,k}} u_c(F_i).
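A minimal numerical sketch of Laplace learning in plain NumPy: fix u to the one-hot labels on the pN labelled nodes and solve the resulting linear system for the harmonic extension on the remaining nodes. The toy fingerprints, graph parameters and choice of labelled nodes are all illustrative, and this route is a stand-in for the GraphLearning package used in this work.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Toy fingerprints: two blobs, true classes 0 and 1.
rng = np.random.default_rng(1)
F = np.vstack([rng.normal(0, 0.1, (30, 5)), rng.normal(3, 0.1, (30, 5))])
y_true = np.array([0] * 30 + [1] * 30)
N, K = len(F), 2
labelled = np.array([0, 30])                    # pN = 2 labelled fingerprints
unlabelled = np.setdiff1d(np.arange(N), labelled)

A = kneighbors_graph(F, 10, mode="connectivity").toarray()
A = np.maximum(A, A.T)                          # symmetrise
L = np.diag(A.sum(1)) - A                       # graph Laplacian

Y = np.eye(K)[y_true[labelled]]                 # one-hot labels on labelled nodes
# Partition L and solve L_uu u_u = -L_ul Y for the unlabelled block.
u = np.zeros((N, K))
u[labelled] = Y
u[unlabelled] = np.linalg.solve(L[np.ix_(unlabelled, unlabelled)],
                                -L[np.ix_(unlabelled, labelled)] @ Y)
y_pred = u.argmax(axis=1)
print((y_pred == y_true).mean())
```

With one labelled node in each connected component, the harmonic extension is constant on each component, so every node inherits its component's label.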

Poisson Learning
An alternative method for label propagation, called Poisson learning, is presented in [33]. It is proposed that this method offers a performance boost over Laplace learning at low label rates.
We define the average label vector, ȳ, over the pN labelled fingerprints as

ȳ = (1/(pN)) Σ_{i=1}^{pN} y_i.

Labels are then propagated by solving the Poisson equation

Lu(F_i) = Σ_{j=1}^{pN} (y_j − ȳ) δ_ij, for i = 1, ..., N.

Note that no boundary conditions are required for Poisson learning: label information is instead encoded into the forcing term.
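A sketch of Poisson learning on a toy chain graph. The forcing term b encodes the labels as y_j − ȳ on the labelled nodes and is zero elsewhere; since L is singular (constants lie in its null space), we take the least-squares solution and centre it, which is one simple way to handle that null space. All graph and label choices here are illustrative stand-ins for the GraphLearning implementation used in this work.

```python
import numpy as np

N, K = 10, 2
# Chain graph: node i connected to node i+1.
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(1)) - A                 # graph Laplacian

labelled = [0, N - 1]
Y = np.eye(K)[[0, 1]]                     # node 0 -> class 0, node N-1 -> class 1
ybar = Y.mean(axis=0)                     # average label vector over labelled data

b = np.zeros((N, K))
b[labelled] = Y - ybar                    # forcing term replaces boundary conditions

# L is singular, so take the least-squares solution and centre it.
u, *_ = np.linalg.lstsq(L, b, rcond=None)
u -= u.mean(axis=0)
y_pred = u.argmax(axis=1)
print(y_pred)
```

On this chain, the solution varies linearly between the two labelled end nodes, so the five nodes nearest each labelled end inherit its class.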

Fingerprint Quality Assessment
In a practical sense, the quality of a microstructural fingerprint is a balance between the amount of compression achieved and the amount of useful information that is retained. To assess this, we must first define what we mean by useful information, which depends on the intended use of the fingerprint. In this case, we construct a classification task. Classification tasks require the data to be split into a training set and a test set. The test set remains hidden from the model during training. Following training, the test set is fed into the trained model and an accuracy score is computed.
It is important that several iterations of the model are implemented to minimise random error. To this end, 10-fold cross-validation [58] is utilised. This requires the data to be split into 10 distinct subsets (with even sampling across each microstructure class). 10 iterations of the classification task are then performed, with each subset utilised as the test set exactly once. For each iteration, the remaining 9 subsets form the training set.
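A sketch of this procedure using scikit-learn's `StratifiedKFold`, which enforces the even class sampling described above. The SVC classifier and the synthetic data are placeholders for the fingerprints and classifiers considered in this work.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=8, random_state=0)

# 10 folds with class proportions preserved in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])   # test set hidden during training
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(np.mean(scores), np.std(scores))
```

Each sample is used as test data exactly once, and the mean and standard deviation of the 10 fold scores are what the tables below report.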
Following classification, the accuracy score for a task with K classes is computed as

accuracy = (Σ_{i=1}^{K} C_ii) / (Σ_{i=1}^{K} Σ_{j=1}^{K} C_ij),

where C ∈ R^{K×K} is the confusion matrix whose i-th row indicates the true class and whose j-th column indicates the machine prediction for the test data. In particular, if no test data are mislabelled, then the accuracy score will be 1, whereas, if all test data are mislabelled, then the accuracy score is 0. Note that this will vary each time the simulation is run, due to the randomness of the train/test split. One can report error bars over a series of realisations, but this may be computationally expensive.
In the case of the supervised and semi-supervised learning problems, the confusion matrix is computed on the test data. In the case of an unsupervised learning problem, predicted data labels are defined only up to a permutation of labels. In other words, the data can be split into sets based on similarity, but the labels themselves carry no real meaning. To compute an accuracy score in this case, the label mapping between actual labels and predicted labels which yields the highest accuracy is selected and the predicted labels are permuted accordingly.
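The best label mapping can be found efficiently with the Hungarian algorithm, applied to the negated confusion matrix via SciPy's `linear_sum_assignment`. A sketch with made-up cluster assignments:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])   # cluster ids, arbitrary naming

C = confusion_matrix(y_true, y_pred)
# Maximise the trace after permutation by minimising -C.
row, col = linear_sum_assignment(-C)
mapping = {c: r for r, c in zip(row, col)}

y_mapped = np.array([mapping[c] for c in y_pred])
acc = (y_mapped == y_true).mean()
print(acc)
```

Here the optimal mapping sends cluster 2 to class 0, cluster 0 to class 1 and cluster 1 to class 2, and 8 of the 9 samples agree with the true labels after relabelling.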

Dataset 1: LFTi64
Dataset 1 (LFTi64) was curated within the LightForm group at the University of Manchester and is available on Zenodo [59]. The dataset consists of 40 optical micrographs of Titanium-6Al-4V. 20 images in the dataset are classified/labelled as a bi-modal equiaxed microstructure, whilst the remaining 20 images are labelled as lamellar. Figure 3 provides some examples from the dataset.
The images in this dataset are relatively large (a small patch can be extracted as a representative volume element), benefit from homogeneous illumination and are completely devoid of any artifacts. On the surface, this may seem ideal and, in some sense, this is true. However, fingerprints constructed from this dataset were too distinct for each class to yield anything other than accuracies of 100% across the board. Hence, classification results generated from this dataset have been omitted from this report. The main benefit of applying fingerprinting techniques to this dataset is the visual interpretability of the fingerprints themselves. Values assigned to each cluster centre in the fingerprint may not be interpretable, in the sense that we can't glean anything specific about the characteristics of the microstructure; however, provided suitable ordering of the cluster centres, fingerprints from separate classes are clearly distinct by eye. Also, fingerprints from micrographs within the same class are arguably "similar", at least by eye. A distance metric could be employed to quantify this level of similarity.
Figure 4 shows some examples of SURF H 0,10 fingerprints (zeroth-order fingerprints with 10 cluster centres, constructed from SURF keypoint features). Cluster centres are ordered such that the mean fingerprint across all bi-modal micrographs is in ascending order. Hence, the bi-modal fingerprints shown here are approximately in ascending order, but not necessarily monotonically increasing.

Dataset 2: UHCS600
Dataset 2 was curated at Carnegie Mellon University and is presented in [60]. The dataset contains 961 micrographs of ultrahigh carbon steel samples with seven different morphologies (listed in Table 1). There are large variations in illumination/contrast and an abundance of defects/artifacts. Hence, this is a much richer dataset for fingerprint quality assessment. The dataset was reduced to 600 images across 3 morphologies by randomly sampling 200 spheroidite micrographs, 200 carbide network and 200 across pearlite and pearlite + spheroidite. We will refer to the pearlite + spheroidite class as just pearlite from now on. It is worth noting that the presence of spheroidite in the pearlite micrographs has the potential to cause misclassification as spheroidite. This does not cause too much of an issue but, as mentioned later in Section 9, it contributes to a bottleneck in accuracy. This reduced dataset is analogous to what was called UHCS600 in [14]. However, the images are selected randomly, so the realisation used here is likely different from that work and the results cannot be exactly/directly compared. Examples of representative microstructures from each class are shown in Figure 5.

Numerical Results
10-fold cross-validation accuracy scores were generated across various cluster sizes for fingerprints of order l, where l = 0, 1, 2, considering SIFT, SURF, AlexNet and VGG19 fingerprints. The UHCS600 dataset was utilised to pose a 3-class classification task. SVM and random forest were tested for solving the supervised learning task, k-means and spectral clustering for unsupervised learning and, finally, Laplace and Poisson learning for semi-supervised learning, with p = 0.05. The supplementary Python package can be found on GitHub [1]. Package requirements include Scikit-Learn [61] for SL and UL classifiers, GraphLearning [56] for SSL classifiers and PyTorch [62] for CNN-based feature extraction. Tables 3, 4 and 5, which can be found in the appendix, show results for SIFT, SURF and CNN fingerprints, respectively. Standard deviation is not included in the tables to save space. Regardless, standard deviation was consistent and within O(0.01) across the board. Highest accuracy scores for each learning method are emboldened to highlight the optimal fingerprint structure for solving each task.
In Table 5, subscripts "F" and "MP" refer to flattened and max pooled final convolution layers, respectively. The superscript "full" refers to full-scale images (no transformation when preprocessing) and "resize" refers to input images having been deformed to meet the recommended input size of 224×224. All cases where "full" or "resize" is not specified are to be regarded as "full". Reduced-dimensionality fingerprints are labelled with superscripts "redP" and "redF", corresponding to reduced feature population prior to fingerprint construction and reduced fingerprint dimensionality via PCA, respectively, and the superscript "diag" denotes H 2 fingerprints reduced from K×d×d to K×d via diagonalisation.

Discussion
Generally, CNN-based fingerprints offer a performance boost over keypoint-based fingerprints. However, the margin is relatively small. For the SL task, the highest classification accuracy comes from applying H v 1,30 to the final convolution layer of VGG and employing SVM for classification, which gives an accuracy score of 0.986 ± 0.0144. This is in close agreement with the findings in [14]. This accuracy score is matched by the same fingerprints with subsequent PCA reduction down to 600-dimensional vectors (VGG H v,redF 1,30), which has a score of 0.986 ± 0.0150, and is closely followed by H v,redF 2,10 applied to AlexNet features, which has an accuracy score of 0.984 ± 0.0155. It is important to note here the inherent randomness in the cross-validation train/test splits. To ensure reliability of these scores, 100 iterations of the 10-fold cross-validation for SVM were computed and averaged. The bottleneck in accuracy at around 98.5% can be attributed to certain images in the dataset. In fact, the same 6 images were misclassified 100% of the time across these 100 iterations for VGG. These were either images from the "pearlite + spheroidite" set that were given a true class label of "pearlite", but actually consisted of mainly spheroidite and were consistently misclassified as such; or they were relatively low-magnification images from the spheroidite set that were consistently misclassified as pearlite, due to individual grains not being resolved at the low magnification and appearing to merge, resembling lamellae. The other SL classification method tested (random forest) deals well with lower-dimensional fingerprints, but breaks down slightly for high-dimensional fingerprints and, unlike SVM, becomes very computationally expensive.
Image preprocessing is crucial to obtaining such accuracies. If the input image is resized via deformation, the cross-validation score drops (in our case, from 0.967 ± 0.0224 to 0.922 ± 0.0289 for max pooled AlexNet). Therefore, CNN results in Table 5 (unless labelled "resize") were generated using full-scale input images, and resizing to match the recommended input size is best avoided. This is perfectly fine, as we are only interested in utilising the feature extraction portion of the pretrained CNNs in this case.
As suggested in [33], Poisson learning offers an improvement over Laplace learning for SSL at low label rates (p = 0.05), in most cases. Simply max pooling the final convolution layer provides the highest accuracy scores for SSL (0.913 ± 0.0248 for AlexNet and 0.940 ± 0.0098 for VGG). This is very close to the SL accuracy scores for the same fingerprinting method in each case (0.960 ± 0.0238 and 0.977 ± 0.0170, respectively). One of the largest difficulties currently faced in machine learning for materials science is in the curation and distribution of suitably large datasets. There are several decades' worth of microstructural image data that have been collected, but the majority are not suitably labelled and it is a laborious and costly task to label such massive amounts of data. The fact that SSL performs so well with just 5% of the data being labelled suggests that perhaps we do not need to label data on such a large scale: a handful of representative images may be sufficient.
For the UL task, VBOW fingerprints offer the highest accuracy scores, with a maximum of 0.835 ± 0.0598 achieved by applying spectral clustering to SIFT H 0,50 fingerprints. Spectral clustering outperformed k-means almost entirely across the board. The upshot here is that graph-based methods can provide an incredibly useful representation of micrograph data.
Increasing the order of the H l fingerprints from 0 to 1 can boost performance. Centring and L 2 -normalising (VLAD) can offer a further performance boost. Generally, H l appears to reach the point of diminishing returns at H 2 in this context. This may be due to the fact that the dataset is relatively small and the H 2 fingerprints are high-dimensional. However, H 2 does offer a performance boost when applying SVM to AlexNet fingerprints (provided suitable PCA reduction), for spectral clustering applied to VGG fingerprints for UL, and when Poisson learning is applied to SIFT fingerprints for SSL. H 2 is computationally expensive, though, so to reduce the computational cost and enable H 2 for CNN-based fingerprints, either the diagonal was taken from each K×d×d fingerprint to form a K×d fingerprint (results with the superscript "diag"), or PCA was applied in two different ways. Firstly, feature arrays generated from each image individually were reduced prior to feature population construction, not in dimensionality, but in the number of features, by transposing the feature array before and after applying PCA. This resulted in accuracy scores approaching the highest performing methods applied to the whole feature population (0.982 ± 0.0189 for H v 2,20 fingerprints constructed from a reduced population of VGG features, compared to a maximum of 0.986 ± 0.0144 for the non-reduced feature population). Performing H v 1 on the same reduced features results in a drop in accuracy, so H v 2 should be considered if vast amounts of features are extracted. Secondly, PCA reduction of the fingerprint space to reduce dimensionality (as opposed to reduction of the number of features present in the population P) can offer a performance boost for the SL task, as discussed above, but this enhancement seems to be restricted to SL in this setting and fails to yield results much better than random for SSL and UL tasks.
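The transpose-based reduction described above can be sketched as follows: PCA applied to the transposed feature array reduces the number of rows (features), while leaving the feature dimension d untouched. Note that scikit-learn's PCA then requires the reduced feature count to be at most d; all array sizes here are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 5000 extracted features of dimension d = 64 from a single image.
feats = rng.normal(size=(5000, 64))

n_keep = 32   # reduced feature count (must be <= d for this trick)

# Transpose so that PCA reduces the number of features, not their
# dimensionality, then transpose back.
reduced = PCA(n_components=n_keep).fit_transform(feats.T).T
print(reduced.shape)   # feature dimension 64 is preserved
```

The reduced array can then be pooled into the feature population P exactly as the full array would be, at a fraction of the clustering cost.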
SIFT was shown to outperform SURF overall (the highest accuracy achieved by SIFT is 0.980 ± 0.0170 with H v 1, compared to 0.977 ± 0.133 for SURF). However, this is a relatively small difference and the added compression afforded by SURF, alongside the reduced run time, could be considered favourable in this case. It takes 336 seconds for SIFT dictionary creation from all 600 input images, compared to 237 seconds for SURF. This has a knock-on effect on clustering and, therefore, fingerprint construction (584 seconds for k-means clustering SIFT features with k = 20 and only 305 seconds for SURF). This difference would become more significant for larger datasets and larger values of k.
Introducing the multi-scale VBOW fingerprints can offer a further improvement in cross-validation accuracy score, whilst maintaining compression. For example, H 0,20+30+50 outperforms H 0,100 for random forest applied to SURF-based fingerprints (with accuracy scores of 0.947 ± 0.0379 and 0.938 ± 0.0252, respectively), but both are 100-dimensional representations of the input images.
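A multi-scale zeroth-order fingerprint such as H 0,20+30+50 can be sketched as concatenated, normalised visual-bag-of-words histograms built at several cluster sizes. Random vectors stand in for the SIFT/SURF descriptor population here; the cluster sizes follow the example above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
population = rng.normal(size=(2000, 64))   # pooled descriptors from all images
image_feats = rng.normal(size=(300, 64))   # descriptors from one image

fingerprint = []
for k in (20, 30, 50):
    # Visual dictionary at scale k, built from the whole population.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(population)
    # Histogram of this image's descriptors over the k visual words.
    counts = np.bincount(km.predict(image_feats), minlength=k)
    fingerprint.append(counts / counts.sum())   # normalised histogram

fingerprint = np.concatenate(fingerprint)       # 100-dimensional, like H 0,100
print(fingerprint.shape)
```

The concatenated histogram has the same 100-dimensional footprint as a single-scale H 0,100 fingerprint, but mixes coarse and fine dictionary resolutions.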
Reducing the training dataset size for SL to p times its initial size allows us to quantify the performance boost offered by label propagation via SSL over training an SL classifier on the available labels and discarding unlabelled data. Figure 6 provides plots of the classification accuracy for SVM, RF, Laplace learning and Poisson learning applied to SURF H 0,20 and max pooled AlexNet features. Both plots confirm a performance boost offered by label propagation at low label rates. This is most apparent with the AlexNet max pooled fingerprints, but both methods benefit from label propagation via Poisson learning for p ≤ 0.1. This value is likely dependent on the size of the dataset in question. Poisson learning does offer an improvement over Laplace learning for low label rates (as proposed in [33]), but Laplace learning overtakes at approximately p = 0.1, alongside the SL methods.

Conclusions
We have compared various options for image-based feature extraction on microstructural image data, namely SIFT, SURF and transfer learning from two CNNs (AlexNet and VGG) pretrained on the ImageNet dataset. A statistical framework for constructing fingerprints from said features was proposed, which incorporates some classical computer vision approaches (VBOW and VLAD). The quality of these fingerprints was assessed via a range of classification tasks: supervised learning (SL), semi-supervised learning (SSL) and unsupervised learning (UL). The main conclusions are summarised below.
• Fingerprints constructed from CNN features offered a performance boost over those constructed from keypoint features for SL and SSL.
• Zeroth-order fingerprints (VBOW) constructed from SIFT features work best for UL.
• VGG offered an improvement over AlexNet, but it is much more expensive to compute and the improvement is marginal. The same applies to SIFT relative to SURF.
• H 1 fingerprints generally outperformed H 0 fingerprints, except for UL and some cases for SSL.
• Centring and normalising fingerprints (VLAD, when considering H 1 ) generally offered a further performance boost.
• H 2 offered a performance boost over H 0 , H 1 in some cases, but it is much more expensive to compute and improvement is marginal.
• SVM performs best on higher-dimensional fingerprints, whereas random forest is best for low-dimensional.
• Spectral clustering offered a performance-boost over k-means clustering for UL in almost all cases.
• SSL at low label rates (0 < p < 0.1) can outperform SL trained on an equivalent subset of the training data with unlabelled data discarded, i.e. label propagation to unlabelled training data is more beneficial than discarding unlabelled data and performing SL.
• Preprocessing images to satisfy the recommended resolution for pretrained networks resulted in a reduction in accuracy and is only required if utilising the classification portion of a pretrained CNN.
• PCA reduction of fingerprint dimensionality can increase classification speed without compromising classification accuracy for SL.

Future Work
The current work has shown that CNN-based fingerprints generally offer a performance boost over keypoint-based fingerprints for solving a microstructure classification task. There are a variety of other options available for fingerprint construction that are not discussed above, as well as a variety of other options for fingerprint quality assessment. The main focus going forward will be on CNN-based fingerprints.
Variational autoencoders (VAEs) [63] will be investigated, alongside generative adversarial networks (GANs). In both cases, a pair of CNNs is trained in tandem. The main difference is that a GAN aims to generate synthetic images that are different from the input image, but have inherently the same properties/morphology, whereas a VAE aims to encode image data into a vector and then reconstruct the input image exactly from the encoded representation. The encoded space that is learned during training of a VAE (which can be thought of as a space containing valid image fingerprints) is somewhat akin to the so-called latent space, which describes the space of suitable GAN inputs. This parity could enable VAEs to be used as a tool for furthering our understanding of the GAN latent space and provide some element of control over synthetic images generated by a suitably trained GAN.

Figure 2 :
Figure 2: Schematic of H 0 fingerprint construction from SIFT features.

Figure 4 :
Figure 4: Example fingerprints from the LFTi64 dataset, constructed from SURF features clustered with 10 cluster centres. These are ordered such that the mean of the bi-modal microstructural fingerprints is in ascending order.

Figure 6 :
Figure 6: Reduced training set of size pN for SL compared with SSL label propagation with pN labels, where N is the total number of images (600 in this case) and 0.01 ≤ p ≤ 0.5 denotes the fraction of data that have labels.

Table 2 :
Table of parameters utilised to generate results.