Quantification of 3D spatial correlations between state variables and distances to the grain boundary network in full-field crystal plasticity spectral method simulations

Deformation microstructure heterogeneities play a pivotal role during dislocation patterning and interface network restructuring. Thus, they affect indirectly how an alloy recrystallizes if at all. Given this relevance, it has become common practice to study the evolution of deformation microstructure heterogeneities with 3D experiments and full-field crystal plasticity computer simulations including tools such as the spectral method. Quantifying material point to grain or phase boundary distances, though, is a practical challenge with spectral method crystal plasticity models because these discretize the material volume rather than mesh explicitly the grain and phase boundary interface network. This limitation calls for the development of interface reconstruction algorithms which enable us to develop specific data post-processing protocols to quantify spatial correlations between state variable values at each material point and the points' corresponding distance to the closest grain or phase boundary. This work contributes to advance such post-processing routines. Specifically, two grain reconstruction and three distancing methods are developed to solve above challenge. The individual strengths and limitations of these methods surplus the efficiency of their parallel implementation is assessed with an exemplary DAMASK large scale crystal plasticity study. We apply the new tool to assess the evolution of subtle stress and disorientation gradients towards grain boundaries.

Most commonly, such computer simulation studies report descriptive statistics of tensorial state variable values. These are either presented as histograms or exemplary as colorful renditions of the entire RVE domain or specific sections thereof. For characterizing how state variable values are spatially distributed in the RVE volume, though, one can call such characterization strategies unnecessarily qualitative if not even insufficient. Instead, one should better quantify the distance of each material point in the RVE to its closest interface and evaluate the distributions of state variable values as a function of these distances, i.e. as spatial distributions.
We acknowledge that there are practical challenges associated with such quantification task. Especially when one aims for highest possible statistical significance and when full-field spectral method models are used [38][39][40][41][42][43][44][45]: firstly, simulations for heterogeneity build-up need sufficiently more finely resolved RVE domains than are typically used for flow curve predictions. The former create higher computational costs because statistical significance demands to include not only as many grains as possible but also an as fine as possible spatial resolution for each of them. In fact, dislocation density or orientation gradients build-up in three dimensions and evolve as a function of grain orientation and neighborhood [11]. For simulations which demand fine discretization, spectral method models with periodic boundary conditions provide an advantage because of comparable lower numerical costs than similarly resolved finite element crystal plasticity models [44].
Nevertheless, this practical advantage comes with a second challenge: the spectral method typically uses volume sampling approaches instead of tracking explicit material points on the grain or phase boundaries. In other words, the advantage of the spectral solver for flow curve computations turns into a disadvantage because evolution of the interface geometry must be determined during post-processing.
The requirement to analyze distances in the deformed configuration, i.e. distances as they are measured in an experiment where the sample is typically prepared after having been deformed, defines a third challenge: successive shape distortion of the RVE domain has to be accounted for. This is a particular challenge when executing above simulations with periodic boundary conditions. In such case, grains in contact with and wrapping around the RVE domain boundary should ideally get analyzed as well as to not make compromises in statistics. These grains, though, are not only frequently physically fragmented as a result of deformation but show up as eventually multiple logical fragments on opposite RVE domain wall sides. Another challenge to master during any quantification of spatial correlations is to collect ideally results from all material points, again to get highest statistics. For simulations with several million material points and eventually dozens of strain steps, though, this calls for efficient post-processing tools to machine off datasets in the order of possibly several hundred gigabyte volume. In other words, the post-processing tools should be parallelized. As a fifth challenge, any post-processing of state-variable-tointerface-distance correlations for spectral method models calls for grain reconstruction methods because grains accumulate internal misorientation as they get deformed.
At least for the grain reconstruction challenge, post-processing tools were developed in the SEM/EBSD community [46][47][48][49][50][51]. Consequently, we assessed these [49,[52][53][54] with respect to their potential for immediate application to solve our reconstruction and distancing task. We found not a single tool, though, which combined grain reconstruction functionality for arbitrarily shaped, three-dimensionally distorted surplus periodically constrained point cloud data and documented application for multi hundred million point cloud datasets.
This gap motivated our research contribution. Therein, we develop methods which cope with all above listed challenges. Specifically, we implement statistical quantification methods which characterize how state variable values are distributed as a function of material point distance to grain boundaries. We implement two grain reconstruction methods. Next, we apply these in three different distancing methods to compile a comparative assessment of long range spatial distributions for state variable values. In the case studies we focus on single-material-point-resolved quantification of the Cauchy stresses and quantify also the disorientation with respect to the mean orientation of the grain. Cauchy stress and disorientation are quantities of frequent interest and controversy with respect to their tendency to form spatial gradients at interfaces. Furthermore, we report how such methods can be implemented into an efficient strong and weak scaling open source tool for a workstation or computer cluster. Therewith, we supplement the Düsseldorf Advanced Material Simulation Kit (DAMASK [38]) with a tool which encourages the crystal plasticity community to explore also their spatially resolved data in further quantitative detail.

Defining the data analysis task
This work presents a specific analysis pipeline to post-process DAMASK [38,43,44,55] simulation results. Two grain reconstruction, three material point to interface distancing methods, and additional post-processing algorithms were developed. The pipeline enables the quantification of spatial correlations between specific material point state variable values and each material points' closest projected distance to an interface. The individual steps of the pipeline are detailed in the appendix. The pipeline was implemented into a parallelized original C/C++ program. Given that the entire source code is provided as open source, we report only the key steps of the pipeline and the strategies we employed for an efficient implementation.
The pipeline answers the following data mining question: how are state variable values distributed spatially as a function of the projected distance to the nearest interface for a given collection of strain steps? We define a strain step as a state variable value dataset which is probed from an RVE crystal plasticity simulation at a specific true strain value. From now on, it is assumed that this dataset contains results for every material point which supports the RVE domain volume.
Interfaces are either grain boundaries or phase boundaries. In this paper, we focus on single phase microstructures, and thus study grain boundaries. Our methods are formulated general enough, though, to be applicable to phase boundaries, provided their geometry is implicitly spatially resolved with material points. We report our methods for three-dimensional applications, the simplification to two dimensions is straightforward.
Under above provision the input data to the pipeline consists, for each strain step, of the full set of N material points. These points, identified as p i , define positions x i in the deformed configuration. We define and analyze our results in the laboratory Cartesian coordinate system ∈ R 3 . As such, all local displacements of a material point versus its initial position in the unloaded case are accounted for [43]. This makes the results comparable to the situation in experiments where a plastically deformed microstructure is probed typically by a posteriori sample preparation. For this coordinate system convention the initial grid of integration points becomes successively and irregularly distorted (Fig. 3a). This calls for the development of point cloud processing methods which back out the geometry of each grain and consider the fact that typical DAMASK simulations use full three-dimensional periodic boundary conditions. Each material point i has an associated set of state variable values {s i }. Members of this state variable set are for instance the local deformation gradient F i or the orientation q i . Although, not all these variables are state variables in the microstructural sense, we have opted to refer to these computed quantities as state variables for simplicity.
Depending on which full-field model of the crystal plasticity community is employed, different strategies are in place to define an initial grid of integration points to represent the microstructural volume inside the RVE domain. Cubic grids are the most commonly employed strategy. This results in an implicit representation of the grain or phase boundaries. Full-field spectral method crystal plasticity simulation tools like DAMASK [38,44] also use such volume sampling strategy. This avoids explicit bookkeeping of the interface network geometry and thus cuts numerical costs. As a disadvantage, though, the interfaces have to be reconstructed from the point cloud data using post-processing methods, like the ones described in the following, to enable a quantification of above distance correlations. The present pipeline provides an original solution to solve this task. The pipeline is implemented in a supplemental DAMASK post-processing data mining tool which we call damaskpdt for short, where the acronym pdt stands for post data treatment.

Pre-processing stage
The processing of the strain step data is distributed across the nodes of a cluster computer. Each process handles complete strain steps (at least one). The processes independently read all material point and respective state variable values from the simulation results file. In this work the input is parsed from the traditional binary container format file. An extension of our tool to use the recently implemented HDF5 container format [38,45] remains as a straightforward task for the future.
After loading the strain step data, each process evaluates the data independently. This pre-processing stage includes the calculation of the material point positions x i in the deformed configuration and of the equivalent stress and strain tensors for each material point plus an RVE-averaging of these quantities reimplementing methods previously described [38,43,44,55]. Once accomplished, three point clouds for each strain step n are defined to simplify all subsequent processing steps: the first point cloud is P 0 n . It specifies the ensemble of the original N material points in the deformed configuration. The second point cloud P 0+ n specifies the ensemble P 0 n surplus all its 26 periodic image points inside a bounding box with thickness about the bounding box to P 0 n . Here, defines a small fraction of the largest edge length of the box. P 1 n , the third point cloud, specifies all members of P 0 n and all their 26 periodic image points. Subsequent processing involves region queries on all three point clouds. Special data structures were used to accelerate these queries (section 2.5).

Grain reconstruction
Remark on classical grain reconstruction with DAMASK In the past, attempts were made to reconstruct the grains via executing a clustering analyses on P 0 n . Specifically, the grains were built as a cluster of material points with points within a critical distance of a few point-to-point distance units and less than an a priori defined disorientation among each other. However, such method cannot rigorously merge all periodic images of a grain. Especially not those on opposite sides of the RVE domain unless a very large critical distance is chosen. Such choice, though, will occasionally merge second-or even higher-order neighboring grains into a single object. Especially so when these neighbors have similar orientations. Therefore, in this work two different approaches are proposed to reconstruct the grains. The methods borrow conceptually from achievements made in the SEM/EBSD community [46,49,54] but implement these concepts into a tool taking into account the periodic boundary conditions and showing a much better scaling in parallelization.
For any implicit representation of an interface network, here exemplified for grain boundaries, a reconstruction of the interfaces is a necessary step to enable us to build correlations between state variable values at the material points and the distances to the closest interface. Only in the special case when stress to distance correlations are sought and disorientation based distancing used, grain reconstruction is an optional step in our approach. In what follows, two methods for partitioning the material points into a polycrystalline aggregate are specified: Modified grain reconstruction methods developed in this work The first grain reconstruction method interprets the texture ID, or texture index, initially assigned upon microstructure synthesis, as the grain ID. This has been the common strategy so far when studying in-grain orientation gradient build-up: either explicitly [9,12] or implicitly by reporting orientation spread via pole figures [11]. This grain reconstruction method is tagged TEX in the results section.
The second grain reconstruction method addresses the challenge that when grain fragmentation becomes significant, referring still to the initially instantiated grains is no longer correct. Then, the individual grain fragments should be distinguished. Three requirements should be fulfilled for this purpose: the method should work for irregular point cloud data; it should be able to handle full, i.e. three-dimensional periodic boundary conditions; the method should be sensitive to gradual orientation gradient build-up.
Graph clustering is one method which fulfills all these criteria. Application examples for characterizing microstructures with graph clustering methods, though, remain few. One is the Fast Multiscale Clustering (FMC) [56,57] method known in the SEM/EBSD community. Other examples, on which this work settles, are community detection algorithms [58,59] that find frequent application in analyzing human communication and interaction patterns for social media platforms. Exemplarily, this work uses the community detection method employed in the study of Dancette et al. [59]. Specifically, we use the open source implementation of the Louvain community detection method by Blondel et al. [60,61]. This grain reconstruction method is tagged LOU in the results section.
Extraction of a single periodic image per grain Once all material points are tagged with a grain ID, in principle one can compute the distance correlations. For arbitrarily deformed domains with periodic boundary conditions, though, practical challenges remain: first, macroscopic and microscopic shape distortions of the RVE domain walls have to be accounted for. Second, grains in contact with the RVE domain walls are fragmented into multiple pieces, possibly laying on opposite sides of the RVE domain.
It is useful to merge these pieces into a single grain object with simpler geometry. To place this object into a cuboidal box is even better. Thereby, all downstream processing operations specific to this grain can be easier handled numerically. Consequently, a geometry simplification step is executed. Therein, first the individual periodic fragments are fused using a clustering algorithm. Thereafter, a single representative periodic image is chosen. Once chosen, all material points associated to the merged grain are packed into a grain-local bounding box. Given the a priori labeling of the material points with grain IDs, the geometry simplification step is a trivial parallel task. This allows us to employ efficient multithreading approaches developed in the grain growth modeling community to speed up the analyses [62][63][64].
In detail the simplification step works as follows: each grain k is processed independently. First, all material points in P 0 n are periodically replicated to define the superset P 1 n . Using the grain reconstruction, points are tagged with their grain ID k. Subsequently, a clustering algorithm is applied to merge the individual fragments into a number of complete grain periodic images. In this study, the DBScan [65] algorithm was used. It assigns all points which are connectable by spherical regions with radius R to the same cluster. Specifically, a radius of √ 3 times the initial point-to-point distance was used. This strategy is applicable as long as no grain wraps periodically around the entire domain or, similarly, periodic fragments in P 0 n on opposite sides of the RVE domain come closer to one another than the above defined critical distance. Next, one representative cluster is picked. Using this strategy the entire point cloud P 1 n is constructed using multithreading. Finally, we compute the global bounding box to all grains' local bounding boxes.
Rediscretizing the global bounding box Next, we define a rediscretization of this global bounding box volume. The actual rediscretization, however, was applied to volume inside the local bounding boxes only to improve efficiency. Cubic voxels of edge length 0.5 times the initial point-to-point distance were used. Once rediscretized, the information contained in P 1 n is used to identify the nearest neighboring material point to each voxel.

Distance quantification
Disorientation based For this distancing method, tagged DIS in the results section, the challenge of grain reconstruction is ignored completely. Instead, position and orientation data for points in P 0+ n are used. In this work a guard zone thickness of 0.1 times the initial RVE domain edge length was used. With these definitions the distance values are computed by finding, for all material points in P 0 n , all their respective neighbors in a spherical region of radius R = 0.1. Again R is in multiples of the initial RVE domain edge length.
Next, the identified neighbors are sorted in increasing order with respect to their Euclidean distance to the inspected material point. Finally, the distance value for the closest, above a critical threshold Θ c disoriented pairing, is taken as the distancing result. In this work, a threshold of Θ c = 15 • was used. If no closest neighbor among the candidates was found, the material point made no contribution to the correlation statistics.
In a nutshell, the advantage of the DIS distancing method is that no grain reconstruction is required as long as no disorientation-to-distance correlations are desired. The independence of each material point is another advantage because calculations are trivial parallel. Without taking the spatial arrangement of neighboring points into account, though, a clear disadvantage of the DIS distancing method is that only the distance but not the displacement is accessible.
Signed distance/ voxelization based Enabling the identification of normal distances to the grain boundary is the motivation for and key strength of the second distancing method. It is tagged as SDF in the results section. The key idea is to combine attractive mathematical properties of signed distance level set functions with latest method developments achieved for scalable grain coarsening simulations [64,[66][67][68].
Level set function methods parameterize the position of interfaces implicitly as the iso-contour of a real-valued level set function Φ(x) to an a priori defined iso-value. A signed-distance level set function (SDF) is a particular useful definition to choose as a level set function because when making this choice surplus defining the grain boundary contour for each grain as Γ := {x ∈ Ω | Φ(x) = 0} with Γ ∈ Ω ∈ R 3 , a distance level function defines the (normal) distance of continuum points to the the grain boundary contour. This choice, i.e. Φ(x) ∈ Ω ∈ R 3 , has the key advantage that ∇Φ(x) = 1 holds for most points ∈ R 3 one practically encounters inside and outside the contour. Consequently, a consistent outer unit normal vector can be computed to each point on the contour by evaluating n( numerically. In effect, this allows us to calculate arbitrarily projected distances to the contour. In other words the SDF provides access to a distance vector. A two step procedure was employed to compute a signed distance function for each grain. The procedure builds on the previously defined discretization of the bounding boxes about each merged grain. First, Φ(x) was initialized to positive values (+h) for all voxel assigned to the grain under inspection and negative values (−h) for points outside. The scalar h was set equal to the cell width used for the global rediscretization of the domain. In a second step, the distance information was propagated with the fast sweeping method [64,69].
Tessellation based Also the third method, tagged VOR in the results section, delivers vectorial distances. Furthermore, it explicitly reconstructs a contour hull for each grain.
The key idea of the method is to build each grain volumetrically from a collection of Voronoi polyhedra. These can be directly obtained from a Voronoi tessellation of the respective material point cloud for each grain. Formally, the tessellation based method is based on the approach detailed by Bachmann et al. [54]. We introduce two improvements, though: first, we generalize the method for applications on distorted point clouds with periodic boundary conditions. Second, we introduce a hybrid parallelized solution which executes substantially faster and might provide an avenue to explore for MTEX as well.
The parallelization alleviates much of the higher numerical costs, which the VOR method has compared to the DIS or SDF methods. It is not necessary within VOR to rediscretize the bounding boxes. Instead, we directly operate on the specific grain-local subset of material points from P 1 n , specifically those points inside the bounding box to each grain k.
In a first step, this local point cloud is extracted and each Voronoi cell computed. We tag each Voronoi cell with the grain ID k associated to the material point. Next, we identify which of the cell facets have first-order neighboring Voronoi cells with a different associated grain ID than the currently processed grain tag. Such difference is a signature that the currently processed Voronoi cell is in contact with the grain boundary. The resulting collection of all such facets of Voronoi cells about points tagged with the grain ID under inspection is used to construct the shape of the exterior contour hull (Fig.  1b). The contour hull extraction algorithm is executed in parallel for each grain k.
We used the Voro++ tessellation library [70] to process the individual grain-local tessellations. Specifically, we re-implemented the library wrapping approach of Peterka et al. [71] around calls to this library. To avoid truncating Voronoi cells of the inspected grain, we enlarged the individual grain-local bounding boxes using a guard zone of three times the initial point-to-point distance. Furthermore, possible Voronoi cell contact with the local bounding box domain walls was detected using Voro++ library functionalities. Thereby, we cured any possibly incorrectly computed Voronoi cell, eventually through fattening the guard zone.
Once the exterior contour hull has been defined, the material point to contour hull distances can be computed. Given that projections on the polyhedron facets can be used, the VOR methods computes projected distances. The implementation is tricky, though, in detail because consistence and speed are desired: Algorithm to compute vectorial distances within VOR First, a consistent outer unit normal vector for each contour hull facet is computed using the edge circulation conventions of the Voro++ library. Next, the normal and geometry of the facet is used to compute a consistent right-handed local Cartesian coordinate system for each contour hull facet. Such coordinate system can be computed because each Voronoi cell is a convex polyhedron [72]. By virtue of construction each facet is a convex polygon in 3D space. With the aid of the local coordinate system, though, the geometric description can be reduced to two dimensions. This accelerates point to polygon inclusion tests.
It remains to compute the actual distances. First, the general approach used is explained. Second, we detail how the processing of the facets has been accelerated using bounded volume hierarchy querying algorithms.
The actual distances are Voronoi cell interior-point-to-exterior-contour-hull-projected point distances. Distances were evaluated for each interior point. It suffices to sketch the procedure for a single interior point. The main task is to probe which facet of the contour hull lays closest to the point. Given that the contour hull was constructed from Voronoi cells, each facet is a polygon. While probing the facets for the absolute shortest projected distance we proceed as follows: firstly, each interior point gets normal projected onto the facet plane. Secondly, we probe whether the projected point lays inside the facet polygon on this plane or outside. If the point is enclosed or on the boundary, the point is considered inside. In this case the distance value is eventually used to update the currently closest distance.
If, though, the projected point is outside the polygon, we proceed by normal projecting the currently tested interior point on each edge segment of the polygon. If such a projected point lays on the edge, the computed distance eventually updates the current closest distance.
If all these projections, though, do not map on a polygon edge segment, we evaluate the Euclidean distance of the currently tested interior point to each vertex of the polygon. Eventually thereby also the current closest distance gets updated. Once finished with the vertices, we proceed with the next facet of the contour hull until no further facets remain for processing.
To the best of our knowledge advanced procedures like above are necessary to handle cases where, due to the possible non-convexity of the contour hull, a consistent normal distance is not immediately defined. In fact, such cases are frequently possible. One particular simple example is illustrated with the grayed-out region in Fig. 1a. Evidently, there is not necessarily a projected point on a contour facet for every position inside an arbitrarily shaped non-convex polygon (here for the brighest yellowish region). At least not if one projects exclusively parallel to the direction of the facets' outer unit normals.
While getting access to a defined contour hull for each grain surplus projected distances is a clear advantage, the VOR method has also disadvantages: most importantly, the method is computationally more costly. More subtle, the edge-on compositing of a contour hull from a collection of polygons faces discontinuities of the curvature at the polygon edges. These can be cured, though, with state of the art SEM/EBSD grain reconstruction software [49,52] which is another possible avenue to explore and use for reconstruction of the interface network in DAMASK in the future.
A strategy for accelerating the distance computations Without additional tricks, the algorithm defines a quadratic time costly computational geometry task -for each interior point each contour facet needs inspection. Therefore, a pruning strategy was designed to reduce the total number of facets to test. The key idea is to use the fact that while we iteratively narrow down the search for the absolutely closest distance also the total number of facets which could at all provide even shorter distances reduces. Therefore, a bounded volume hierarchy (BVH) was built from the contour hull facets [73][74][75][76]. This utility data structure organizes the locations of the facet polygons into spatial regions, such that with narrowing down our search fewer and fewer candidates need testing. The  querying structure was used during above distance computations as follows: first, an arbitrary facet was probed. Next, the resulting distance value was used to query the BVH to narrow down which candidate facets to probe next. Thereafter, the algorithm works iteratively: once the current distance value gets lower than 0.9 times the value after which the last candidate list update was queried, the candidate list gets updated. This procedure is repeated until the absolute shortest distance per interior point is found.
Distilling spatial distributions of disorientation and stress The results of each damaskpdt post-processing tool run are two ensembles of Cauchy stress tensor/distance and orientation quaternion/distance value pairs per material point. These data were finally processed into spatial distributions of stress and disorientation. Scripts for MATLAB (v2017a) and the MTEX [77,78] texture toolbox (v5.0.3) were used.
A two stepped protocol was executed to quantify the disorientation of each material point to the respective mean orientation of the reconstructed grain: first, a mean orientation was computed for each grain. For this task documented methods for inferential quaternion statistics [79] were used. Secondly, these mean orientations were evaluated against the individual orientation of each material point.
The so generated value/distance pairs were binned into distance classes on the interval [0.0, 24.0] using 0.2 step. Length units are reported in multiples of the initial [100]-direction point-to-point distance between neighboring material points. Quantile values of the resulting sub-distributions were quantified with MATLAB . Grain size distributions report how many material points are assigned per grain.

Parallel implementation and orchestrated hierarchical data placement
Above pipeline was implemented into a hybrid parallelized tool. Inherent parallelism in the data mining task was exploited: specifically, each strain step is independent. We distribute the processing of the strain step ensemble using process data parallelism. Specifically, calls to the Message Passing Interface (MPI) library [80,81] were implemented. Strain steps were distributed across the MPI processes in round-robin fashion. In addition, the processing of each strain step was accelerated via Open Multi-Processing (OpenMP) [64,[82][83][84][85][86][87]. Instead of using the commonly employed referencing of data items via global arrays, we actively enforce a partitioning of the material point positions and state variable values. Specifically, we split all derived quantities into thread-local data chunks. These chunks are placed preferentially in memory locations with fast connection to the execution core. This has two advantages: first, it places grain-local data closer in memory which facilitates faster reading from memory. Second, the partitioning reduces cache coherency traffic. We admit that our implementation currently uses a static load partitioning scheme. A possible disadvantage to investigate in the future is that load imbalances may be stronger.
To efficiently execute all linear algebra operations of the pipeline, fast Fourier transformations (FFT), singular value decomposition, and eigenvalue decomposition functionalities of the Intel Math Kernel Library (IMKL) were used.
The subsequent grain segmentation and characterization of the spatial descriptive statistics involves range querying tasks. Specifically, they demand the identification of all neighboring points inside a spherical volume of radius R. To accelerate these queries, we employed 3D box binning of P 0+ n , and P 1 n . Cubic buckets of width R were used. This allows us to execute constant (O(1)) computational time complex queries. A disadvantage of this technique compared to other established fast position querying techniques [73,88,89] is the larger memory footprint. This detail remains as future work for a possible code optimization to reduce memory consumption.

Verification and validation simulation setup
Microstructure instantiation We applied our method for studying the evolution of the spatial distribution of stresses and disorientation in full-field crystal plasticity simulations. Simulations were executed with the DAMASK spectral solver [38,44,55]. We used the DAMASK Poisson-Voronoi tessellation microstructure synthesis and orientation sampling routines. Specifically, a three-dimensional single phase cubic face centered polycrystalline representative volume element (RVE) domain was instantiated. Full three-dimensional periodic boundary conditions were used. The RVE domain contained 500 grains. These were discretized by 256 3 grid points, i.e. material points. In  processes are assigned always complete strain steps. These are machined off using OpenMP multithreaded data parallelism. Each process partitions the heavy data further to allow for a mapping of thread-local data into specific memory sections to improve spatial and temporal memory locality. effect, each grain has an average spherical equivalent radius of 20 material points. Grain orientations were sampled randomly and assigned in a spatially uncorrelated manner.
Constitutive model We used a phenomenological crystal plasticity model [90]. The tensor of elastic constants was parameterized based on experimental data for Aluminium [91]. The tensor components were C 11 = 106.75 GPa, C 12 = 60.41 GPa, and C 44 = 28.34 GPa, respectively. Dislocation slip on the 12 primary slip systems was assumed as the exclusive deformation mechanism. The phenomenological model was parameterized with a reference strain rate ofγ 0 = 0.001 s −1 and a stress exponent of n slip = 20. Stress parameters of this constitutive model were set to τ 0 = 31 MPa, τ sat = 63 MPa, the hardening parameters were assumed as a slip = 2.25 and h 0 = 75 MPa, respectively. Isothermal conditions and initially stress free grains were assumed.
Load case The RVE was subjected to uni-axial compression along the z-axis, simulated by a deformation gradient rateḞ (Eq. 1). First Piola-Kirchhoff P stress boundary conditions (Eq. 1) were set. Stars identify unconstrained values.
The resulting deformed RVE domain is visualized in Fig. 3a exemplified for the deformed configuration in the lab coordinate system at a final equivalent von Mises strain vM = 0.21.  Simulation settings and monitoring Default settings of the DAMASK spectral solver were used. At least 4 and at most 40 iterations per strain step were allowed. The simulation was executed on an in-house workstation (Tab. 1) using it exclusively and employing one MPI process with 36 OpenMP threads without pinning the threads. Microstructure snapshots were written at 28 selected strain values which the black dots in Fig. 3b detail. The following quantities were logged for each material point i at each strain step point: the volume, the deformation gradient F , its plastic F p and elastic part F e , the first Piola-Kirchhoff stress tensor P , orientation quaternion q, and a reference ID, the so-called texture ID T i . The latter integer specifies the discrete orientation ID which we assigned initially to each material point. The resulting binary spectralOut file occupied 162 GB disk space.
Simulation data post-processing Post-processing was executed on TALOS, a cluster computer (Tab. 1). Different levels of hybrid, i.e. process and thread parallelization were benchmarked. Computing nodes were used exclusively and threads pinned (OMP PLACES=cores). Explicit calls to the MPI Wtime and omp get wtime functions were used to monitor how much time the individual pipeline steps took. I/O and non-I/O operations were distinguished. The main virtual and resident set size memory consumption was probed at the node level by parsing on the fly from the /proc/self/stat system file.
The strong multithreading speed of the tool was benchmarked with runs on a single TALOS node. Specifically, quasi-sequential runs with one process spawning one thread were compared to repetitive runs of the same study with one process spawning 40 threads. The multi node weak scalability was probed by comparing the single node results with runs employing 28 MPI processes each of which spawning 40 threads. I/O operations were performed using a GPFS parallel file system built on top of 12 logical disks. Files were striped across a RAID6 array with 10 disks each (in 8+P+Q [92] configuration). The low level stripe size was 4 MB. Supplemental MTEX post-processing was performed sequentially using a desktop PC. Given the processing consumed only a few core hours in total, these analyses were not benchmarked.

Quantifying the spatial distribution of stresses towards grain boundaries
The first application of above methods is the statistical quantification of how stresses are distributed in the RVE volume with respect to the distance of each material point to the nearest grain boundary. Using the two developed grain reconstruction methods in combination with the three distancing methods allowed for rigorous and comparative analyses on the same dataset. The results are summarized in Fig. 4. They document the statistics at final strain when the stress distributions are broadest and highest in terms of absolute values. All distances are reported in multiples of the unit distance between two DAMASK material points with respect to the unloaded start configuration. Specifically, all methods identify with very similar significance and quantile values that the average σ 33 compressive Cauchy stress is 140 MPa. Only an at most 10 % absolute variation from this average value is observed when probing deeper into the grain interior. Distances larger than approximately 15 distance units from the boundary should not be interpreted because the corresponding numerical support is finite counting limited, i.e. only very few points belong to this group due to an average grain radius of 20 distance units.
Comparing the distributions for each distance class and method combination (Fig.  4a) conveys that all methods capture the flattening of the grains in the compression direction. Consequently, higher counts for the same distance class are observed for the strain step at yield versus the one at final strain.
The results for the lower distance classes document a rigorous quantification of the numerical effects inherent to spectral based methods in general, and those specific for the DAMASK spectral solver: stresses at material points in the vicinity of numerical discontinuities, here the grain boundaries, show consistently different mean stresses (Figs. 4b/4c) (10 % to 14 %). Finding this difference consistently for material points at the discontinuities suggests that this is attributable to the Gibbs phenomenon [94,95]. Latter is known as a systematic intensity overshooting in frequency space when attempting to Fourier expand at discontinuities.

Quantifying grain orientation spread accumulation towards boundaries
Motivation for this second application exercise comes from frequent literature findings which reported that material volume in the vicinity of grain [96] and phase boundaries [97] is differently disoriented than volume in the grain interior. Possible presence of such localized volume with different orientation has been identified as one reason why static and dynamic recrystallization nuclei [98] nucleate frequently on the faces or junctions of the grain boundary network [16,17,99].
The results in Fig. 5 deliver quantitative evidence that spatial gradients of disorientation are detectable with all grain reconstruction and distancing methods: material points in the proximity of a grain boundary show on average larger disorientation to the mean orientation of the grain than do points in the grain interior. It can be excluded that these differences are the result of averaging orientations from multiple material points, as it is the case when measuring e.g. kernel average misorientations (KAM) at boundaries using the SEM/EBSD technique. In fact, when quantifying KAM values a kernel with multiple neighboring material points is evaluated. Upon probing with the kernel into neighboring grains this could contribute high words, no kernel was used. It can also be excluded that the observed gradients are random spatial correlations for the following reasoning: if a grain contains multiple material points in different orientations in the fundamental zone, it is expected that non-directionally correlated point-to-point disorientations are measured if one eventually compares the disorientation for two arbitrarily picked material points. If the orientation variation is strong and the grain small, i.e. the grain has few points support, eventually larger disorientations are measured on average. However, this scatter should also be correlated if there are orientation gradients with a strong component normal to the interface within the grain. Vice versa this scatter should remain practically uncorrelated if such gradients are absent or the grain contains practically uncorrelated small regions with spurious higher disorientation. Given the strength of the gradient and mean disorientation value, one can conclude that the grains are in an incipient stage of fragmentation. They have not yet accumulated localized regions of significant point-to-point disorientation in the highangle boundary regime (15 • ), except for a minority population of material points in an at most 2 pixel wide zone at the boundaries. Qualitatively, their higher disorientation is expected as in this zone also the absolute Cauchy stress values are also slightly higher (Figs. 4).
There are two additional observations to make with practical significance for characterizing orientation gradients: first, only the methods with normal distancing capability (SDF, VOR) report similar gradient slope (≈ 4 • 15px ). Compared to disorientation based distancing, the slope is moderately lower. Second, disorientation gradient characterization with the graph clustering method shows a strong parameter sensitivity (Fig. 5b).

Comparing the two grain reconstruction methods
We identified that one practical challenge of using the graph clustering method for grain reconstructing is the strong parameter sensitivity. Therefore, it was worth studying the grain size distributions (Fig. 6a) and consistency with which individual grains were (re-)identifiable (Fig. 6b) as a function of the penalization parameter K L (algorithm B). Figure 6 summarizes the key findings.
The larger the penalization parameter K L is set the stronger every accumulation of high disorientation between nodes of the same community gets penalized. In effect, more grains with lower in-grain orientation variation are found on average for a stronger penalization (K L = 1000).
Neither the values for a low nor for a high penalization can avoid systematic challenges when using the reconstructed grain ensemble for characterizing the distance correlations. In fact, if the penalization is weak, the algorithm reconstructs different grains and comes not even close to the initially synthesized number. Figure 6b proofs that this is a systematic consequence of the method's tendency to merge neighboring grains with low disorientation at low K L values.
In effect, the mean orientation of many grain pairs is an average of at least two orientation ensembles with low intra ensemble but possible noticable inter ensemble disorientation. With respect to spatial disorientation gradients this has two consequences: first, such gradients are flatter and second they are shifted to larger disorientation on average. This explains why the gradients for weakly penalized graph clustering (Fig. 5b) are shallower.
One could avoid the possible bias introduced from the systematic grain merging by using strong penalization, such that eventually the initial grains are re-identified (Fig. 6a). Figure 6a shows the detrimental effect of such procedure when applying it to analyze the (highly) deformed configurations: oversegmentation with an associated qualitative change of the grain size distribution from unimodal to bimodal is observed. A detailed inspection of the respective material points identified that this spreading into eventually bimodality is a consequence of the fact that the strongest disoriented material points at the grain boundary get now segmented into individual small grains.
These findings have a methodological and a practical implication. The methodological implication is that graph clustering based grain reconstructions should (a) Number of reconstructed grains versus their size for a weak (WP) versus a strong (SP) penalization (b) A single grain resolved assessment of the graph clustering method [59] identifies that for a weak penalization K L = 75 preferentially neighbors with low disorientation are merged, thereby reducing the total number of grains. Figure 6: Texture ID versus graph clustering grain reconstruction: compared in terms of how many grains were reconstructed and how individual grains were (re-)identifiable for different penalization. Initial versus the final strain step results are compared. always be backed up by rigorous quantitative parameter sensitivity studies. One strategy could be to report always unnormalized distributions of grain sizes as a function of the segmentation parameter. Otherwise, different physical conclusions might be drawn even though one uses the same method. One example, disorientation accumulates towards the boundaries versus it does not, is exemplified with Fig. 5b in this work.
The practical implication reads as follows: grain fragmentation should be better characterized ideally via the evolution of the grain boundary network rather than to continue insisting logically on the initial grain as the decisive object. In fact, not the homogeneous regions within a deformed grain but the heterogeneities are the key microstructure locations and first descriptors to investigate where and how annealing microstructure evolution initiates.

Implications of numerics-induced scatter for non-local and coupled crystal plasticity/interface migration spectral method models
Evincing such numerics-induced scatter suggests to apply above quantification protocol in the future regularly. Especially, it should be applied in so-called non-local crystal plasticity models. These predict the local material point values via evaluating a kernel of neighboring integration / material points. Examples are continuum scale crystal plasticity solvers with incrementally and locally coupled dislocation flux sub-models, such as the one from [31,100] or full-field crystal plasticity models with incremental coupling to phase-field solvers.
One should always start inspecting such solvers from an in-depth monitoring of the (state variable) values and the non-local flux terms for each material point as a function of strain. Methods, such as the ones developed in this paper, could be used to assess via such monitoring in how far numerical noise remains uncorrelated and low to avoid a systematic biasing of the results towards larger strains.
Another example in which rigorous quantitative monitoring of state variable values is useful are incrementally coupled grain boundary migration / crystal plasticity models. These typically evolve microstructures along complex transients. As such, any permeation of correlated numerical noise into the flux or interface migration terms should as best as possible be reduced to keep rigorously controlled conditions, physical accuracy and precision.
Knowledge in the discrete dislocation dynamics community teaches us that uncontrolled stress spikes should better remain numerically controlled, for instance by supplementing the integration with some time trajectory before making choices [101,102]. For these future challenges the work delivers an approach based on which to build further tools for assessing the numerical and physical quality of such predictions.

Comparing computational efficiency
Strong scaling multithreading performance To document the strong scaling efficiency of the multithreading capabilities, Fig. 7 summarizes wall clock timings. Furthermore, the figure documents key computing time contributions for all combinations of above postprocessing methods applied to the same data, using the same computer, and exclusive execution. The left column of each method column pair in Fig. 7 documents the sequential execution time. The results identify that signed distance / voxelization based distancing (SDF), in combination with either of the two grain reconstruction methods, is the least costly combination. Results for a single strain step and sequential execution are available after 2 h approximately. Given the capability to identify normal distances is another argument to choose the SDF method over disorientation based distancing. Figure 7: Key results from benchmarking all three methods in combination with both grain reconstruction protocols using the same strain step ( vM = 0.21) and the same cluster computer. The numbers above the right column of each column pair denote the respective speedup achieved for the most costly method specific processing step when executing in parallel. Abbreviations identify the disorientation based (DIS), the signed distance/voxelization based (SDF), and the tessellation based distancing (VOR). Grain reconstruction was based either on the initial assignment (i.e. texture ID based TEX) or graph clustering (LOU). Sequential execution is compared with threaded execution. Analyses using the LOU method have, different than those for TEX, a different sequential bottleneck (indicated e.g. for DIS/LOU, 40 by the green column portion): the community detection stage, which was in this work not parallelized. Figure 7 documents that all methods concentrate at least 87.5 % to 99.3 % of the total execution time in not more than three algorithmic code sections. For the disorientation based distancing these steps are the querying of neighboring points and computing point-to-point disorientation angles. For the signed distancing method it is the identification of periodic images, volume rediscretization, and the remapping of every voxel to a closest unique material point. For tessellation based distancing the evaluation of the projected normal distances for each material point demands contour facet querying.
In this work, all these most costly steps are parallelizable. Thus, specific care was taken to implement them in parallel. In addition, data placement was orchestrated such that GB-sized data chunks were placed close in the memory of the executing core taking into account the memory hierarchy of the cluster computer.
In effect, this speeds up the execution of every method combination by 13.6 to 24.4 times the sequential baseline. With 40 OpenMP threads required to achieve this, a strong scaling efficiency between 61 % to 34 % is documented. The main obstacle to achieve higher efficiency in this study was load imbalance. Such was strongest for the tessellation (VOR) method because grains were distributed round-robin, i.e. statically to the threads. When building tessellations, however, the grain size critically affects how many facets the contour hull contains; and thus how dissimilar the distancing costs are between different grains. In light of this splitting a population with only 500 grains across 40 threads will equip the threads with possibly too few grains to compensate for the large variety of grain sizes, i.e. cost and facet count differences in the contour hull processing stage.
A possible improvement for the future is to change the implementation and employ dynamic work scheduling. A possible solution could be to use the OpenMP task construct to delegate the processing of the grains to a queue which the threads then machine off dynamically. However, this eventually demands for a compromise: the current round-robin grain distributing procedure allows one to place them in a controlled memory location [84]. A dynamic scheduling, in turn, will typically keep the cores more frequently busy at the cost of typically more memory traffic.
Another potential improvement of this work, with immediate practical benefit, is to parallelize also the graph clustering algorithm for which CPU-and GPU-parallelized community detection solvers have been developed in the past [103][104][105][106].
The results document additional practical improvements in which the present work advances the field: the first is improved speed, when compared to hitherto reported values for DAMASK post-processing [38]. This was achieved with sequential optimization of the file accessing strategy surplus employing parallelized I/O to cut I/O costs by at least an order of magnitude.
As an additional strategy to improve sequential performance, state of the art linear algebra libraries were used to compute point-wise tensor quantities. Such improvement is of practical use not only as DAMASK employs it for solving an RVE but also during post-processing to compute e.g. a flow curve. As the second improvement, this work details how to combine these strategies with multithreaded execution surplus trivial data parallel processing of strain steps on multiple nodes of a cluster computer.
Having the possibility to construct contour hulls to each grain in the deformed configuration is the third improvement. On the one hand because it allows to quantify a volume for each grain not only by counting the number of material points it contains but by accumulating the volume of its supporting Voronoi cells. The resulting polygon mesh allows for volumetric rendering of the grain as a polyhedron, as exemplified in Fig. 1b. These may be useful to allow further analyses using other microstructure characterization tools like DREAM3D [53] or QUBE [52].
Hybrid execution performance During hybrid execution every MPI process evaluates a different strain step and spawns an own group of threads. Data are read independently from a priori known sections of the DAMASK results file. In effect, this enabled to evaluate all 28 strain steps at once using 1120 cores. Figure 8 summarizes the key results for this hybrid execution mode. Specifically, two sets of common analysis cases were probed: a flow curve extraction versus analyses of aforementioned state variable values to boundary distance correlations, six of them in total. Figure 8: Key results from benchmarking all three methods in combination with both grain reconstruction protocols on the entire strain step ensemble. Different computing costs are quantified via the wall clock time. The results identify that the amount of computational work is different across the strain step ensemble. Taking the first strain step as the reference, relative complexities were established through monitoring e.g. how many voxel were processed for the ensemble of grain bounding boxes and how many Voronoi facet polygon inclusion and edge comparison tests were required to find the closest distances.
There are two key findings of practical importance. First, hybrid processing enables to cut wall clock time by an additional order of magnitude. In effect, the evaluation of a flow curve, which took damaskpdt 3.2 h sequentially, was solvable in 76 s when using parallelism. Similarly, distancing results for all strain steps were analyzed after 1.5 h when using tessellation (VOR), the most costly method. Distances were even faster accessible, namely in 8 min, using the signed distance method.
The practical benefit of these findings is put in perspective by the following observation. While the SDF method was already finished for the entire ensemble, the classical DAMASK post-processing routines had not even completed the processing of the flow curve for the first strain step.
The second key finding is that parallel performance gains are limited. Respective arguments for the multithreaded analyses were already identified (Fig. 7). Not parallelizing all post-processing computational steps is another apparent reason for limited scalability [107]. The most significant obstacle to cut execution time further, though, are microstructure-induced differences. Specifically, differences which are caused by dissimilar interface geometry networks across the strain step ensemble. Consequently, the individual algorithms show different solving time resulting in workload differences across the strain step ensemble.
Two observations are important to add here. First, these inter strain step work load differences would remain even if the multithreaded execution of each strain step gets better work load partitioned. Second, it is in fact the successive reduction of the relative numerical complexity for a strain step with increasing strain which limits primarily the hybrid performance of MPI parallelization. Provided there were more strain steps to process very likely even more processes could be employed. This is one of the key advantages to our true scientific computing performance solution for post-processing DAMASK simulations.
One example is visible for signed distance based distancing. Given that the RVE shape change results in a moderate increase in the total bounding box volume, more voxel have to be probed for the higher than the lower strain steps. Another, by far stronger, example for microstructure induced work load imbalances is visible for tessellation based distancing (Fig. 8).
When using the strongly penalized grain reconstruction, more grains are detected at higher strains. Therefore, also more boundary area is generated on average, as a consequence of which the material points lay closer to a boundary, providing for faster distancing operations. In addition also different contour hull shape and facet compositions are generated. This modifies the composition of the individual bounded volume hierarchies, and thus their individual response and pruning effectiveness during the querying stage.
It is useful to compare the computational costs of post-processing to the actual simulation costs. The DAMASK deformation simulation occupied 36 cores for 354 h wall clock time, i.e. 12744 core hours were spent in total. The post-processing of the entire strain step ensemble using the most costly distancing method kept 1120 cores busy for 4822 s wall clock time, i.e. 1500 core hours in total, or at most 11.8 % of the simulation costs.
Lastly, the memory consumption of the simulation and post-processing should be reported. DAMASK allocated a total of 230 GB virtual memory, i.e. 14.3 kB per material point. Post-processing demands less virtual memory per process when evaluating the flow curve only (21 GB). However, virtual memory consumption peaks at 134 GB for the most costly strain step and using the VOR/LOU distancing method. One reason for the larger memory consumption are the storage demands for position values for a portion of the 26 periodic images to all material points. Another reason is that Voronoi polyhedra and their facets have to be stored for tessellation based grain reconstruction.
Another contribution is due to the conservative strategy we used to implement the querying structures. This could be optimized in the future with potential for a memory footprint reduction of at least a factor two, thereby making possible also to post-process as large RVE simulation data on cluster computers with individually smaller node main memory than we used in this work.

Conclusions
A set of strong and weak scaling post-processing methods were developed to quantify the accumulation of state variable values at grain boundaries in 3D full-field spectral method crystal plasticity simulations. Exemplified for the DAMASK spectral solver and a phenomenological constitutive crystal plasticity model the results substantiate: • The methods successfully reconstruct the grain boundaries and operate on arbitrarily deformed and periodically confined RVE domains with respect to the deformed configuration. This allows for direct comparison to experiments and rigorous numerical assessment of the spectral method.
• Long range gradients of material point disorientation, with higher values to the boundary than in the grain interior, were quantified. They are a signature of incipient grain fragmentation.
• As an alternative method to capture the individual fragments, also a graph clustering grain reconstruction method was assessed. Compared to the classical method of assigning grain IDs based on the initial conditions, though, the quantification of the gradients is very parameter dependent.
• Three methods to quantify the distance of a material point to the boundary were detailed. In effect, a combination of signed distance function representation and volume rediscretization delivered the lowest method affected quantification differences. Furthermore, such approach has the lowest numerical costs of all methods tested.
• All performance critical computational tasks were implemented distributed and shared memory parallel. Therewith, processing the entire set of 28 strain steps from a 256 3 RVE cube DAMASK simulation, worth 162 GB, was possible in 470 s wall clock time using 28 processes each spawning 40 threads. Compared to exclusive multithreaded (11 063 s) or sequential execution (150 000 s), this hybrid approach allowed for 320 times faster post-processing. The efficiency could be improved further if heterogeneous work load across the individual strain step data, which is method dependent, gets better balanced and remaining sequential code portions parallelized.

Data and code availability
The source code of the tool and its supplementary MATLAB scripts with which we post-processed all results are open source software. They are publicly accessible [108].
All DAMASK and post-processing parameter settings files, including the SLURM batch system submission scripts are available for download [109] ‡ . The entire repository of compressed post-processed data occupies 844 GB. It is available from the authors upon serious request.

Work distribution
MK designed and implemented the tool surplus performed and post-processed the simulations. MK and FR wrote the manuscript and discussed the results on a regular basis.

Appendices
A. Detailed specification of the post-processing pipeline Algorithmic details In what follows, the individual steps of the post-processing pipeline are detailed. For all steps with a larger than linear time complexity O(N x N y N z := N ), additional details are commented. In every step, N denotes the total number of unique material points supporting the RVE. Indices n specify the strain step. Three point cloud sets are distinguished: the first contains the N unique material points of the original domain P 0 n ∈ R 3 . The second, P 0+ n , contains P 0 n surplus all periodic images of P 0 n which lay inside a cuboidal bounding box about a cuboidal tight bounding box to P 0 n . The outer bounding box is fattened by = 0.1 in length units of the initial RVE domain. The third point cloud set, P 1 n contains N 1 n points, i.e. all members of P 0 n surplus all their 26 periodic copies. The members i of the material point ensemble P 0 n have positions p 0 i,n which are defined by their initial locations in the RVE in the reference configuration [38,44] x 0 i,n surplus a deformation-induced displacement which maps into the deformed configuration in the laboratory coordinate system ∆x 0 i,n . Every material point has associated state variable values s i,n . In addition, a reference ID and periodic image variant ID is stored to track which periodic image points are assigned to which unique material point of P 0 n . This allows dereferencing of state variable values ‡ They will be made available upon acceptance of the article s i,n rather than duplicating them. IMKL specifies in which steps the Intel Math Kernel Library is used. Threaded stresses and strains ∀i ∈ P 0 n via IMKL 10: RVE averaging F , P , σ vM , vM

12:
Compute periodic images P 0+ n 13: Build a spatial index B 0+ n that partitions P 0+ n 14: Threaded grain reconstruction 15: if Initial assignment / texture ID based then 16: RECONSTRUCT GRAINS TEXTUREID (see B) 17: if Community detection based then 18:

GRAIN GEOMETRY SIMPLIFICATION (see B)
20: Threaded state variable/distance to boundary quantification 21: if Disorientation thresholding based scalar distances then 22: ANALYZE DIS (see C) 23: if Signed distance / voxelization based normal distances then 24: ANALYZE SDF (see C) 25: if Tessellation based contour hull normal distances then Texture ID based grain reconstruction T i specifies the initial texture ID assigned during microstructure synthesis DAMASK [38,44].
Algorithm 2 Texture based grain reconstruction 1: procedure RECONSTRUCT GRAINS TEXTUREID 2: for i ← 1, N do 3: Assign point i a grain ID G i,n := T i Graph clustering based grain reconstruction N mv specifies the number of executed node relabeling operations in the current iteration step, Q the Newman-Girvan modularity [58] using a critical quality value of Q c = 0.01. K L specifies the penalization strength, i.e. whether a weak K L = 75 or a strong K L = 1000 penalization was used. [q i,j,n ] 0 denotes the rotation angle argument of the disorientation quaternion. The search radius during graph edge construction was chosen as R lv c = 2 256 distance units. Find all neighboring points p j,n ∈ P 0+ n | { x i,n − x j,n ≤ R lv c }

6:
Add one graph edge with weight w i,j,n for every position pair p i,n , p j,n 7: do 8: Iterative graph clustering according to [59][60][61]110] 9: while N mv > 0 and Q > Q c 10: for i ← 1, N do

11:
Assign i the final label of the top-level community as grain ID Grain geometry simplification N 0 k,n and N 1 k,n , respectively are the number of material points supporting the k-th grain in P 0 n and P 1 n respectively. A k,n specifies a tight axis-aligned bounding box about the positions N 1 k,n . A n is the tight cuboidal global bounding box which contains all A k,n . In this work a cell size of half the normalized initial material point point-to-point distance was used, i.e. d cell := 1 2·256 .
Algorithm 4 Threaded extraction of a single periodic image per grain 1: procedure GRAIN GEOMETRY SIMPLIFICATION 2: Re-organize members of P 0 n into disjoint sub-sets P 0 k,n per grain k In what follows, multithreaded processing ∀k grains 3: Threaded memory initialization 4: Threaded periodic images P 1 k,n and bounding boxes A k

5:
Threaded building of sparse spatial index B 1 k,n via B 1 n 6: Threaded merging of possible grain fragments via DBScan [65] O(N k log(N k )) 7: Threaded picking of one representative per grainÂ k,n

8:
Identify global bounding box A n enclosing allÂ k,n 9: Define a global voxelization L n of volume A n with cell size d cell for i ← 1, N do Multithreaded, dynamic scheduling 3: Find all N j neighbors p 0+ j ∈ P 0+ n | {x i,n − x j,n ≤ R lv c }

4:
Sort by distances d i,j,n := x i,n − x j,n O(N j log(N j )) 5: for j ← 1, N j do 6: Compute disorientation angle Θ i,j,n 7: if Θ i,j,n ≥ Θ c then 8: Report (d i,j,n , s i,n ) 9: break ignore i if no pair found Signed distance / voxelization based normal distances Abbreviations denote a signed distance function (SDF) and the fast sweeping method (FSM) [64,69].
Algorithm 6 Signed distance / voxelization based normal distances procedure ANALYZE SDF 2: Threaded voxelizing of k according to L n , d cell using B 1 n Specifically, ∀ voxel ∈ B 1 k,n identify closest member of P 0 n 4: Threaded initialization of SDF Φ k,n (x) 0 Threaded spreading of SDF via FSM Φ k,n (x) 1 as in [64] 6: Threaded report (Φ k,n (x), s j,n ) ∀ voxel with Φ k,n (x) ≥ 0

D. Tessellation based contour hull normal distanceŝ
A k,n specifies an axis-aligned bounding box about the identified single periodic image of grain k. The box is fattened by a guard zone of thickness 3 256 of the normalized initial material point-to-point distance. The fattening assures that the volume about each material point belonging to the grain k can be tessellated and no Voronoi cell gets arbitrarily cut off by the bounding box walls. The query structure of the Voro++ library was configured to include five points per spatial bin on average. For each grain k interior points are distinguished from exterior points with respect to their logical identity when building the exterior contour hull of the grain. Specifically, interior points ip originate from a material point in P 0 n with grain id k. BVH is short for bounded volume hierarchy. Threaded pulling of all pointsÂ k,n from B 1 n 3: Threaded mapping periodic image point to unique material point

4:
Threaded Voro++ build spatial index to accelerate tessellator

8:
Threaded construction of a BVH with all facet polygon 9: for ip ← 1, N k ip do 10: Pool all nf facets f of the contour as initial candidates 11: while f < nf do Normal projection x in on e 16: if x e proj on the edge = true then 17: Eventually update current shortest distance 18: Eventually re-query facet candidate pool 19: for v ← 1, N f v do Circulate facet vertices v of f 20: Compute distance d in,v = x in − x v

21:
Eventually update current shortest distance 22: Eventually re-query facet candidate pool 23: else 24: Eventually update current shortest distance 25: Eventually re-query facet candidate pool 26: Threaded report value pairs (d ip,n , s ip,n ) ∀ip