Machine learning models of plastic flow based on representation theory

We use machine learning (ML) to infer stress and plastic flow rules using data from repre- sentative polycrystalline simulations. In particular, we use so-called deep (multilayer) neural networks (NN) to represent the two response functions. The ML process does not choose ap- propriate inputs or outputs, rather it is trained on selected inputs and output. Likewise, its discrimination of features is crucially connected to the chosen input-output map. Hence, we draw upon classical constitutive modeling to select inputs and enforce well-accepted symmetries and other properties. With these developments, we enable rapid model building in real-time with experiments, and guide data collection and feature discovery.


Introduction
Our effort to produce viable models of plasticity from trusted data draws upon traditional constitutive modeling theory and newly developed machine learning techniques.
The application of machine learning (ML) to engineering dates back to at least the 1980's and covers a wide variety of problems. For instance, Adeli and Yeh [16] applied ML to the design of steel beams; Hajela and Berke [17] used a ML model as a surrogate for the exact response of structures to enable fast optimization; Cheu and Ritchie [18] applied ML to traffic modeling; and Theocaris and Panagiotopoulos [19] used it to model fracture behavior and identification. For further bibliography along these lines, see a review of neural network applications in civil engineering that appeared in 2001 [20].
Research on applying ML to constitutive modeling dates back to roughly the same time period. In solid mechanics in particular, Ghaboussi et al. [21] applied a neural network (NN) to data from experiments of beam deflection. They created a model which acquired increasing fidelity as experiment progressed via hierarchical learning and adapting new hidden layers. Furukawa and Yagawa [22] constructed an "implicit" model of linear viscoplasticity with a NN based on a state space formulation, where the NN provided the driving term for plastic evolution and the elastic response was assumed to be known. Notably, they expressed a need for variety in the training data.
More recently, a number of studies have appeared comparing NN plasticity models to other models calibrated on experimental data for specific materials. Lin et al. [23] built a NN model of the flow stress of low alloy steel based on only experimentally observable quantities. Bobbili et al. [24] constructed a NN model of high strain rate Hopkinson bar tests of 7017 aluminium alloy and compared it to a Johnson-Cook model. For T24 steel, Li et al. [25] compared a NN model to a modified Zerilli-Armstrong and strain-compensated Arrhenius-type model. They remarked on the opacity of the NN model and the need for extensive data. Desu et al. [26] made flow stress prediction of austenitic 304 stainless steel with support vector machine construct and compared it to a NN model. Asgharzadeh et al. [27] modelled the flow stress behavior of AA5086 aluminum using a NN with two hidden layers. (Also, in the realm of fluid mechanics, Ling et al. [28,29], Duraisamy et al. [30,31], and Koumoutsakos et al. [32] have been particularly active in applying machine learning techniques to model turbulence [33].) Unlike traditional models based on physical mechanisms and intuition, these ML models are purely data-driven and phenomenological. Recently, mathematical analysis has been applied to understanding the training and response structure of NNs, which have traditionally been treated as black boxes. The work of Tishby and co-workers [34] (and Koh and Liang [35]) is particularly illuminating and explores the trade-offs between information compression and prediction accuracy in the training process.
In the wider context of data-driven modeling, a number of recent developments [36,37,38,39,40] are also noteworthy. Alharbi and Kalidindi [36] constructed a database of Fourier transformed microstructural data and used this spectral information to drive evolution of crystal plasticity simulation. Kirchdoerfer and Ortiz [37] sought to subvert the traditional empirical model in the data-to-model-to-prediction chain and replace it with a penalization of the prediction response by its distance to closest experimental observation/data point. This approach of directly using a database is commendable (but lacked data interpolation which appears, for example, in Ref. [41]). The optimization was constrained by conservation principles like a Newtonian force balance and was applied to truss and elasticity problems. The authors explored the technique's robustness to noise and convergence. Versino et al. [39] applied a genetic/evolutionary algorithm and a symbolic regression to model Taylor impact test data. The symbolic regression machine learning technique selects a best model composed of given analytic building-blocks and is especially attractive since the resulting tree structure leads to a physically intepretable model based on the physics embedded in the building-block sub-models. Lastly, Bessa et al. [40] integrated design of experiments, simulation, and machine learning in materials discovery and design. It should be noted that the Materials Genome and similar material discovery and selection efforts [42,43,44] are a deep and active field of research but this classification problem has minor bearing on the constitutive modeling task at hand.
In the vein of designing the architecture NN suit to specific tasks, the method we adopt and generalize, the Tensor Basis Neural Network (TBNN) [45], is not simply a feed-forward, deep neural network. Unlike other NN mechanics models of components of output quantities, e.g. stress, TBNN models have built-in invariance properties. The TBNN formulation shifts the basis for the unknown coefficient functions from the (arbitrary) Cartesian basis of the training data to an objective basis made up of powers of the selected inputs, as representation theory [9,13] suggests. This comes with the cost that the coefficient functions and basis are not linearly independent i.e. they must be trained simultaneously. This representation is akin to the Gaussian Approximation Potential (GAP) with the Smooth Overlap of Atomic Positions (SOAP) basis [46] that is gaining popularity in molecular dynamics, in that this machine learning constitutive function uses a spectral basis to preserve rotational and permutational invariance. It also has goals in common with image transforms that embed invariance properties [47,48].
Motivated by the goal of achieving on-the-fly model construction, directed sampling/experiments, and discovery of features/trends in large datasets, in this work we show how classical constitutive modeling is needed to obtain viable ML models of constitutive behavior. In Sec. 2, we provide the fundamentals of representation and plasticity theories and connect them with our NN formulation of the components of plasticity, namely the stress and flow rules. In Sec. 3, we discuss how the data to train the models is obtained, the specifics of the learning algorithm, and the time integration algorithm used to predict the plastic evolution. One of the data sets is obtained from the elasticplastic response of an ensemble of oligo-crystalline aggregates, and so the resulting NN model can be considered a form of homogenization. The results of these developments are discussed Sec. 4 and include comparisons of various model architectures and inputs based on cross-validation errors and evaluations of stability and prediction accuracy. Finally, in Sec. 5, we discuss results and innovations, such as the generalized tensor basis architecture, the novel ways of embedding physical constraints in the formulation, and the exploration of data sufficiency, robustness, and stability.

Theory
In this section we provide a concise overview of representation theory and how we apply it in the context of constitutive modeling by (artificial) neural networks (NNs). Specifically, we employ a generalization of the Tensor Basis Neural Network (TBNN) [45] concept based on an understanding of classical representation theory. With it we construct models that represent the selected output as a function of inputs with complete generality and compact simplicity. This construction is distinct from the predominance of component-based NN constructions, for example those mentioned in the Introduction, in that basic symmetries, such as frame invariance are built in to the representation and do not need to be learned.

Representation theory
Representation theorems for functions of tensors have a foundation in group theory [49,50,51,52] with the connection being that symmetry is described as functional invariance under group action. In mechanics, the relevant invariance under group action are rotations (and translations) of the coordinate system, which is known as material frame indifference, invariance under super-posed rigid body motions or simply objectivity. 1 This is a fundamental and exact symmetry. Practical applications of representation theory to mechanics are given in Truesdell and Noll's monograph [13, and Gurtin's text [14,Sec. 37] and address complete, irreducible representations of general functions of physical vector and tensor arguments. For example, the scalar function f (A) of a (second order) tensor A is invariant if and a (second order) tensor-valued function M(A) is objective if for every member G of the orthogonal group.
Underpinning the representations of f and M are a number of theorems. The spectral theorem states that any symmetric second order tensor A has spectral representation : composed of its eigen-values {λ i } and eigen-vectors {a i } where i = 1, 3. The spectral representation of A makes powers of A take a simple form: A n = i λ n i a i ⊗ a i (and in particular A 0 ≡ I). The equally important Cayley-Hamilton theorem states that the tensor A satisfies its characteristic equation : where {J i } are the principal (scalar) invariants of A. The (generalized) Rivlin's identities [54,55] provide similar relations for multiple tensors and their joint invariants. Scalars that respect Eq. (1), such as {J i }, are called scalar invariants and are formed from (polynomials or, more generally, functions of) the eigenvalues of A. Hence, f (A) reduces to f (A) = f (I) (5) where I is a set of scalar invariants of A, and hence f is also an invariant. A set of invariants I is considered irreducible if each of its elements cannot be represented in terms of others and conveys a sense of completeness and simplicity. 2 Since the eigenvalues {λ i } are costly to compute, typically traces such as {tr A, tr A 2 , tr is a good starting point. The coefficient functions c i are represented in terms of scalar invariants as in Eq. (5). This power series representation can be reduced by application of the Cayley-Hamilton 1 Frame indifference is a special case of the more general principle of covariance with changes of the metric tensor [53,Sec.3.3]. 2 In some sense, a complete set of invariants are coordinates on the manifold induced by symmetry constraints and hence are clearly not unique in their ability to coordinatize the manifold.

theorem (4), in the recursive form
The transfer theorem (as referred to by Gurtin [14,Sec. 37]) states that isotropic functions such as M(A) inherit the eigenvalues of their arguments and implies the fact that these functions are co-linear with their arguments. Also Wang's lemma (I, A, A 2 span the space of all tensors co-linear with A) is a consequence of Eq. (3) and Eq. (4), and gives a sense of completeness of the representation: Eq. (7) evokes the general representation for a symmetric tensor function of an arbitrary number of arguments in terms of a sum of scalar coefficient functions multiplying the corresponding elements of the tensor basis. The general methodology for constructing the functional basis to represent scalar functions is given in Rivlin and Ericksen [56], and the corresponding methodology to construct tensor bases is developed in Wang [57,58]. Representation theory, like machine learning, does not determine the appropriate arguments/inputs and output for the constitutive functions. In mechanics, there is a certain amount of fungibility to both. For instance, the (spatial) Cauchy stress can easily be transformed into the (referential) first Piola-Kirchhoff stress, and left and right Cauchy-Green stretch have same eigenvalues but different eigen-bases. Also, any of the Seth-Hill/Doyle-Ericksen strain family [59,60,61] provide equivalent information on deformation, and any of the objective rates formed from Lie derivatives [62,63,64,65] provide equivalent measures of rate of deformation; however, some choices of arguments and output lead to greater simplicity than others.
Lastly, it is important to note that isotropic functions are not restricted to isotropic responses. The addition of a structure tensor characterizing the material symmetry to the arguments allows isotropic function theory to be applied so that the joint invariants encode anisotropies [66,67,68,69,70,12].

Plasticity models
Briefly, plasticity is an inelastic, history-dependent process due to dislocation motion or other dissipative phenomena. We assume the usual multiplicative decomposition [71,72] of the total deformation gradient F into elastic (reversible) F e and plastic (irreversible) F p components As a consequence, the velocity gradient in the current configuration, l ≡ḞF −1 , can be additively decomposed into elastic and plastic components : refer to [73,Sec. 8.2]. The assumption that F p is pure stretch (no rotation) reduces L p to D p = sym L p . The elastic deformation determines the stress, for instance the Cauchy stress T: and the evolution of the plastic state is determined by a flow rule, e.g. : where F p quantifies the plastic state and T the driving stress. Invariance allows the reduction of the argument of T to, for example, the objective, elastic Almansi strain e e = 1 2 I − b −1 e based on the left Cauchy-Green/Finger stretch tensor b e = F e F T e . Similarly, the state variable in the flow rule can be reduced by applying invariance, for example, b p = F p F T p . The driving stress can be attributed to the deviatoric part of the pull-back of the Cauchy stress T: σ = dev F −1 e TF −T e which is also invariant and also coexists in the intermediate configuration with D p . Furthermore, a deviatoric tensor basis element, such at σ, generates an isochoric flow which respects plastic incompressibility det F p ≡ 1. Other choices of the inputs and outputs of the stress and flow functions are discussed in Results section. Typically both the stress and flow are derived potentials to ensure elastic energy conservation for the stress and associative flow for the flow rule; however, in this work we to allow for a more general flow and non-differentiable NN model. (Experiments typically cannot measure potentials directly). 3 A few basic properties are built into traditional empirical models that need to be learned in typical NN models. First, zero strain, e e = 0, implies zero stress : and, likewise, zero driving stress should result in zero plastic flow : Also there is a dissipation requirement for the plastic flow. Generally speaking, the Coleman-Noll [74] argument, together with the first and second law of thermodynamics, applied to a free energy in terms of the elastic deformation and a plastic history variable results in: (a) the stress being conjugate to the elastic strain rate, and (b) the internal, plastic state variable, when it evolves, reduces the free energy via M · L p ≥ 0 where M is the Mandel stress This reduces to refer to Ref. [73,Sec. 8.2]. Also, given the physics of dislocation motion, it is commonly assumed that the plastic deformation is incompressible, det F p = 1, which implies the flow is deviatoric For more details see the texts Refs. [73,75,76].

Application to neural network constitutive modeling
We generalize the Tensor Basis Neural Network (TBNN) formulation [45] to build NN representations for the stress relation, Eq. (10), and the plastic flow rule, Eq. (11), that embed a number of symmetries and constraints. Both T and D p are required to be isotropic functions of their arguments by invariance. As discussed, classical representation theorems give the general form where A ≡ {A 1 , A 2 , . . .} are the pre-supposed dependencies/arguments of function f , I ≡ {I j } is an (irreducible) set of scalar invariants of A, and B = {B j } is the corresponding tensor basis. In Eq. (17), only the scalar coefficient functions are {f i } are unknown once the inputs have been selected and hence they are represented with a dense NN using the selected scalar invariants I as inputs embedded in the overall TBNN structure. In the TBNN framework, the sum the NN functions {f i (I)} and the corresponding tensor basis elements {B i } in Eq. (17) is accomplished by a so-called merge layer, and the functions {f i } are trained simultaneously (refer to Fig. 1 and more details will be given in Sec. 3.2). This formulation is in contrast to the standard, component-wise NN formulation: which is based on components of both the inputs {A 1 , A 2 , . . .} and the output f . For the stress, we assume a single symmetric tensor input selected from the Seth-Hill/Doyle-Ericksen elastic strain family, in particular e e , is sufficient, so that representation Eq. (7): is appropriate. Despite this formulation being based on strain, versus stretch, it does not embed the zero stress property, Eq. (12), and, hence, σ 0 (I) will need to learn that zero strain implies zero stress. Since we prefer to impose, rather than learn, physical constraints such as Eq. (12) since this reduces the necessary training data [45] and the exact satisfaction leads to conservation and other properties necessary for stability, etc. Exact satisfaction of Eq. (12) can accomplished a few different ways: (a) shifting the basis with the Cayley-Hamilton theorem (4) refactoring (b) some T = (I 2 σ 0 ) I + σ 1 e e + σ 2 e 2 e , or (c) all T = I 2 σ 0 I + σ 1 e e + σ 2 e 2 e of the coefficient functions {σ i } with I 2 = tr e 2 e . In general, any of these representations can be expressed on the spectral basis so there is a (weak) equivalence between coefficient functions of the various representations. Here, e e = i λ i a i ⊗ a i . As mentioned, we assume that the inputs to the flow rule are (a) a history variable b p , and (b) driving stress σ. A general function representation from classical theory for an isotropic function of two (symmetric) tensor arguments requires ten invariants [54] (see also [11, Ch.3, Eq. 9 and 11]): and eight tensor generators/basis elements where sym A ≡ 1 2 (A + A T ). To satisfy the zero flow condition, Eq. (13), we can shift basis for the second, stress argument and eliminate all basis elements solely dependent on the first, plastic state argument: Plastic incompressibility, in the form of deviatoric plastic flow, Eq. (16), can imposed by applying the linear operator dev, dev A = A − 1 3 tr(A)I, p σ + f 12 dev sym b p σ 2 Dissipation of plastic flow can be strictly imposed by requiring that the flow be directly opposed to the stress in Eq. (15) which implies: and f 1 (I) > 0 and f 3 (I) > 0. In this study we will rely on the learning process to ensure the positivity of the coefficient functions f 1 and f 3 but this could be accomplished exactly with the Macauley bracket (ramp function) applied to f 1 and f 3 , for example.

Methods
We train the NN models of plasticity with data from two traditional plasticity models. In this section we give details of (a) the traditional models, (b) the training of the NNs, and (c) numerical integration of the TBNN plasticity model.

Plasticity models
In an exploration of the fundamental properties of NNs applied to plasticity, we seek to represent responses of two models: (a) a poly-crystalline representative volume element (RVE) with grainwise crystal plasticity (CP) response (an unknown closed form model since the poly-crystalline aspect of the CP model obscures its closed form), and (b) a simple visco-plasticity (VP) material point (a known closed form model). Both are finite deformation models so that invariance and finite rotation are important; and both are visco-plastic in the sense of lacking a well-defined yield surface and strictly dissipative character. Briefly, crystal plasticity (CP) is a well-known meso-scale model of single crystal deformation. Here we use crystal plasticity to prescribe the response of individual crystals in a perfectly bonded polycrystalline aggregate. The theoretical development of CP is described in Refs. [77,78,79,80,81] and the computational aspects in reviews [82,83].
Specifically, for the crystal elasticity, we employ a St. Venant stress rule formulated with the second Piola-Kirchhoff stress mapped to the current configuration where the elastic modulus tensor C = C 11 J+C 12 (I−J)+C 44 (I⊗I−J) has cubic crystal symmetries with C 11 , C 12 , C 44 = 204.6, 137.7, 126.2 GPa, and E e = 1 2 F T e F e − I is the elastic Lagrange strain.
and δ ij is the Kronecker delta. Plastic flow can occur on any of 12 face-centered cubic (FCC) slip planes. Each crystallographic slip system, indexed by α, is characterized by Schmid dyads P α = s α ⊗n α composed of the allowed slip direction, s α , and the normal to the slip plane, n α . Given the set {P α }, the plastic velocity gradient is constructed via: which is inherently volume preserving in the (incompatible) intermediate/lattice configuration. Finally, the slip rateγ α is related to the applied stress through the resolved shear (Mandel) stress τ α = M · P α , for that slip system. We employ a common power-law form for the slip rate relatioṅ whereγ α0 = 122.0 (MPa-s) −1 is a reference strain rate, m = 20 is a rate sensitivity exponent, and g α = 355.0 MPa is a hardness value. These parameters are representative of steel.
With this model in Albany [84], we simulate the polycrystalline response using a uniform mesh 20 × 20 × 20 with the texture assigned element-wise (via Dream3d [85]) and strict compatibility enforced at the voxelated grain boundaries. Ten realizations with 15,15,17,18,18,19,19,20,20,21,22 grains were sampled from an average grain size ensemble and each grain was assigned a random orientation. Minimal boundary conditions to apply the various loading modes ( tension, shear, etc. ) were employed on the faces and edges of the cubical representative volumes. Also, we limit samples to a single, constant strain rate 1.0 1/s. The simple visco-plastic (VP) model consists of a St. Venant stress rule in the current configuration with Almansi strain: where C = λI ⊗ I + 2µI isotropic parameters λ = Eν (1+ν)(1−2ν) and µ = E 2(1+ν) with Young's modulus E = 200 GPa and Poisson's ratio ν = 0.3, together with a simple (associative) power law for the flow rule: where c = 0.001 MPa −1−p -s −1 and p = 0.1 are material constants.

Neural network representation and machine learning algorithm
A typical NN, such as the representation of Eq. (18), is a two-dimensional feed-forward, directed network consisting of an input layer, output layer and L intervening hidden layers where neighboring layers are densely connected. Each layer i consists of N nodes (ij). The vector of outputs, y i , of the nodes (ij), j ∈ (1, N ) of layer i is the weighted sum of the outputs of the previous layer i−1 offset by a threshold and passed through a ramp-like or step-like activation function a(x): where W i is the weight matrix for (hidden) layer i of the state/output of nodes of the previous layer x i−1 and b i is the corresponding threshold vector. In our application the input layer consists of the N I invariants I and the N B elements of the tensor basis B. The elements of I form the arguments of the coefficient functions, each having a L × N neural network representation, while the elements of B pass through the overall network until they are combined with the coefficient functions according to Eq. (17) to form the output via a merge layer that does the summation. After exploring the C0 step-and ramp-like rectifying activation functions commonly used, we employ the ramp-like (C1 continuous) Exponential Linear Unit (ELU) [86] activation function: to promote smoothness of the response and limit the depth of the network necessary to represent the response relative that necessary with saturating step-like functions. Training the network weights W i and thresholds b i is accomplished via the standard backpropagation of errors [87,88] which, in turn, drives a (stochastic) gradient-based descent (SGD) optimization scheme to minimize the so-called loss/error, E. We employ the usual root mean square error (RMSE) where D is the set of training data composed of inputs x k = {I k , B k } and corresponding output d k .
The gradient algorithm relies on: (a) the change in E with respect to each weight W i where Here a is the derivative of weight function, [a ⊗ b] ij = a i b j is the tensor product, and [a b] i = a i b i element-wise Hadamard-Schur product. The recursion seen in Eq. (36) gives back-propagation its name. The gradient defined by these expressions is evaluated with random sampling of subset of training data D called minibatches. Also, search for a minimum along this direction is governed by a step size called the learning rate in the ML community. These standard constructions are trivially generalized to the TBNN structure since the inputs B are not directly related to W i nor b i , and are merely scaled by the coefficient functions to form the output y, refer to Fig. 1. For more details of the SGD algorithm, see Ref. [89,Ch.2].
To begin the training, the unknown weights, {W i }, and thresholds, {b i }, are initialized with normally distributed random values to break the degeneracy of the network and enable local optimization. Since multiple local minima for training are known to exist, choosing an ensemble of initial weights which are then optimized improves the chances of finding a global minimum and the distribution of the solutions indicates the robustness of the training. Also, the full set of input data is divided into a training set D, used to generate the errors for the back-propagation algorithm; a test set T , for assessing convergence of the descent algorithm; and a third set V for cross-validation, to estimate the predictive capability of the trained network. Ensuring that the errors based on T are comparable to those on V reduces the likelihood over-fitting data with a larger than necessary NN. We chose to divide the available data in a T : D : V = 20:72:8 ratio. In addition, we sample individual stress-strain curves produced by the CP and VP simulators so as to maintain approximate uniform density of data based on curve arc-length (vs. based on strain) to capture high-gradient (elastic) and transition (yield) regimes. Also, it should be noted that we allow ourselves to train on inputs derived from the plastic deformation gradient, F p , despite the fact that this quantity is Basis:

Inputs:
Outputs: Merge: difficult to observe directly in experiments. A critical part of the training algorithm is normalizing the data so that the NN maps O(1) inputs to O(1) outputs since having W i , b i ∼ O(1) will achieve better SGD convergence. We also shift and scale the scalar invariants I so that they have a mean zero, variance one distribution. We normalize the other set of inputs, the tensor basis B, using the maximum Frobenius norm of the basis generators, e.g. b p and σ, over the training set D. During training, the output tensors are normalized similarly based on their maximum norms over D, so that Convergence is assessed by averaging the error with respect to T over previous iterations of the SDG (in this work we average over the last 4-10 iterations) and terminating when this average converges, but not before performing a minimum number of iterations (1000 in this work). More discussion of the training approach can be found in [45], although in that work the learning rate was held fixed rather than decaying as the training proceeds, as in this study.

Integration algorithm
We need a time-integration scheme to solve the differential-algebraic system Eq. (10) and Eq. (11). We assume it is deformation driven so that F = F(t) is data. To form a numerical integrator, we rely on the well-known exponential map which is an explicit/approximate solution to Eq. (11). In Table 1 we outline an adaptive scheme based on a midpoint rate at t n+α and interpolation of the deformation gradient: with ∆F = F n+1 F −1 n so F n+α = exp(α log ∆F) F n . Since we do not rely on the NN models of stress Eq. (10) and flow (11) being directly differentiable, 4 we use a simple relaxation scheme to enforce consistency: for [F p ] n+1 given [F p ] n and F ≡ F n+1 = F(t n+1 ). Here we have simply substituted stress and flow rules into Eq. (38) with the particular arguments T(e e ) and D p (b p , s). If any step has an increase in error formed from the residual of Eq. (40) the step size is cut; and, conversely, when a sub-step converges, the remainder of the interval is attempted.

Results
In this section we cover our investigations of: (a) optimal network size, inputs, and representation basis; (b) influence of training data on error and stability; and (c) the robustness and accuracy of the model predictions. As mentioned, we employ data from an unknown-form CP model (Eq. (26) and Eq. (27)) and known-form VP model (Eq. (29) and Eq. (30)). Training with the data from the CP model illustrates the NN model's ability to represent and homogenize the response of a complex system and the VP model is particularly useful for exploring NN representations since we know the true response and generating samples is computationally inexpensive.

Constructing and training the neural networks
We begin our numerical investigations with: (a) a survey of the possible representations for the models of stress and flow, (b) optimizing the structure and meta-parameters of the NN representations. To assess improvements in performance we used the traditional metric for evaluating NN performance, cross-validation error, where the training dataset D replaced by the validation dataset V in evaluating the RMSE formula, Eq. (33).
For this study we use data from the CP model to train the stress and flow TBNNs. In particular, we collect data using 3 tension and 6 simple shear loading modes averaged over 570 random textures for each of the 10 polycrystalline realizations. As mentioned in the Methods section, we give ourselves access to the (average) plastic state variables of the CP simulations and so we train the stress and flow TBNNs independently (and not simultaneously). Fig. 2 and Fig. 3 shows typical training data for the I3 stress and IF flow representations (refer to Table 2 and Table 3) with 3 × 4 and 5 × 8 NNs, respectively. The left column of Fig. 2 and Fig. 3 show the tension response and the right columns show the shear response. The upper panels show the (input) invariants and the (output) coefficient functions. In general, the inputs and outputs are smooth and correlated, and all coefficient functions contribute. The notable exception is the stress model in shear in which only the coefficient function of the linear basis element e e appears to contribute. Note that all invariants are arguments to each coefficient function. Note in Fig. 3 the zero invariant, tr σ ≡ 0, that becomes noise upon the input scaling described in Sec. 3.2. Apparently, the NN training learns to ignore this input since the outputs are smooth and regular. The lower panels show: (a) the correspondence of the model (lines) and the data (points), and (b) the error as a function of strain. The errors for each of the components are of comparable magnitude and tend to have an irregular pattern in the elastic region of the loading. Note that with a C0 activation function (e.g. , the Rectifying Unit a(x) = max(0, x)) we observed distinct scallops and cusps in error curves (not shown for brevity). Also it is remarkable that the errors of the flow model in shear are distinctly linear, which, perhaps, is related to the fact only the linear basis element is active.
These results are typical for a wide range of NN structures and (meta) training parameters. Fig. 4 shows the cross-validation errors of the stress and flow (scaled by s T and s Dp , respectively) using the full representations I3 and IF (refer to Table 2 and Table 3, respectively). As Schwartz-Ziv and Tishby [34] remark, trying to interpret the behavior of network from a single training tends to be meaningless; hence, we evaluate parametric and structural changes with an ensemble of at least 30 replicas models M k ∈ M in this and the following studies. (The replicas are obtained by using different random seeds to produce the initial weights and thresholds.) The insets show that the (initial) learning rate can have a strong effect on the errors, but once a small enough (< 10 −3 ) rate is selected the final errors are relatively insensitive to this parameter. The main panels show the typical trends in errors ranging from under-representation (too small a network) to over-fitting (too large a network). 5 For the stress TBNN, N = 4 nodes appears to be an optimum even for relatively shallow networks (N < 4) but the optimal number of nodes is relatively insensitive for L > 4. The flow TBNN shows analogous behavior but with a trade-off between nodes and layers, e.g. for L > 6, N = 4 appears to be best, while N > 4 is better for shallower networks. These findings are somewhat obscured by the noise in the trend lines, which persists despite using the average of 150 replica networks. Also, the convergence window (described in Sec. 3.2) is an important meta parameter. We obtained these results with a 4 iteration convergence window; a longer convergence window (e.g. 10 iterations) shifts the best cross-validation to smaller networks (but also induces larger variance in error between replicas). Since we want reliable error from each replica, we use a 4 iteration convergence window throughout the remainder of this work. Lastly, we do not believe cross-validation is sufficient for determining completeness of network; however, these results indicate that optimal number of nodes is less than the number of input invariants for flow but greater than this matrix rank-based criterion for the stress representation. Apparently, with respect to the training data, the NN is compressing the input for the flow network; and, hence, we conjecture that the NN is forming lower dimensional set of (alternate) invariants internally. Fig. 5 shows cross-validation error for the CP training data for various basis representation of the stress and flow functions (refer to Table 2 and Table 3 for the definition of the labels). For the stress TBNNs, all (overall) errors are comparable with the exception of the component-based representation and the one term E1 representation (with tensor basis B = {e e }). Clearly, the E1 basis is not sufficient since it is akin to a one parameter Navier model of stress. From the results of the two truncated bases, I2 and ID, it appears that correlated inputs, B = {I, e e } (I2), train comparably to linearly independent inputs, {I, dev e e } (ID, which uses a volumetric/deviatoric split). Also, the upper panel of Fig. 5a shows that the representations without embedded satisfaction of the zero-stress at zero-strain constraint, Eq. (12), generally violate this constraint by about 1% of the maximum stress. For the flow TBNNs, all (overall) errors are comparable with the exception of the reduced scalar and tensor basis representation S1. Also the other reduced representations (R3,R1,T3,T1) have slightly higher average errors than the full representations (UF,IF,IR,SF,DS,DS,DR,DZ) albeit with reduced variance. The consistency of the representations with the zero-flow-at-zero-stress condition Eq. (12) generally follows whether powers of e e are included or not. Clearly, cross-validation based on this limited dataset is not sufficient for decisive model selection but it does eliminate some representations. By comparison, the component-based representations, E ij and CM, display higher errors, larger variance in the performance and poor zero-input-zero-output results. Beyond the fundamentally different functional representation, these models are likely suffering from an insufficiency of data to learn the necessary properties (the ones embedded in the generalized TBNN framework) accurately, as demonstrated in Ref. [45]. Lastly, as discussed in the Theory section, we have embedded a number of properties in the representations, e.g. symmetry, deviatoric flow, dissipation, and, generally, the violation of the learned properties is on par with what we illustrate with the zero-stress and zero-flow conditions.
In preliminary studies we also trained networks with different inputs and outputs. In general, the cross-validation errors were comparable over a variety of choices, for example using a symmetrized Mandel stress for the driving stress input to the flow rule. We considered the rateĊ −1 p of the inverse of the plastic right Cauchy-Green deformation tensor C p = F T p F p (as in Simo and Hughes [75,Ch. 9]) as the output of the flow rule and obtained similar cross-validation (and prediction) performance. Also noteworthy, we employed both the elastic and the full left Cauchy-Green stretch tensors as history inputs and the full Cauchy stress as a driving stress input. These inputs resulted in similar cross-validation except when we paired the elastic Cauchy-Green stretch with the highly correlated Cauchy stress we observed slightly higher errors (and less variance among the errors). Fig. 6 and Fig. 7 show the response of the NN coefficient functions to tension and shear, for stress and flow, respectively. In these plots, each coefficient is scaled according to Eq. (37) so that coefficient functions of higher order terms can be plotted on par with those of lower order terms. First, we notice that the E3 basis achieves zero-stress at zero-strain satisfaction exactly at the expense of a more complex, larger magnitude per component response than I3, as the higher order term e 3 e apparently needs compensation by the component functions, refer to Eq. (21). Also, we see more evidence that the truncated representations, I2 and ID, have almost indistinguishable response despite ID having a linearly independent tensor basis. For the flow representation we only compare the DZ and T1 representations for clarity. Note that T1 is much simpler in form (one tensor basis element versus ten) than DR, which has a complete basis, and its response is simpler while achieving comparable cross-validation error to DZ. Also evident from both tension and shear response, DR builds a similar response to T1 by letting all/most components contribute. Also significant, the coefficient of σ 2 , C2, is essentially zero throughout the shear trajectory but not the tension, which implies that the NN may not be learning dissipation is an important property. This is in contrast with the DR representation (not shown) where the corresponding coefficient is essentially zero for both tension and shear. For both the stress and flow models, the coefficient responses generally resemble the trends in the stress and flow data, with large changes up to the elastic-plastic transition at strain > 0.002 followed by relatively constant values in the plastic regime. This is consistent with the expectation that in fully developed plastic flow (in a constant direction with negligible hardening) the elastic state and the plastic flow are constant.    Table 3: Flow representations sym has I (zero error) is dev has tr σ (noise invariant) has b p in tensor basis full scalar basis

Validation
Our validations studies include tests of: (a) completeness of representation and training data, and (b) robustness to perturbation/continuity. As we already have indications that a training set composed of only tension and shear may be insufficient, we computed the (E3) TBNN and (E ij ) component-based NN stress models' response  I1  I2  I3  I4  I5  I6  I7  I8  I9  I10   - INVARIANTS   I1  I2  I3  I4  I5  I6  I7  I8  I9  I10   -   Errors are scaled by s T and s Dp , respectively. Refer to Table 2 for stress representations and Table 3 for the flow representations. Fig. 8b, which is to be expected since each component is independently (and imperfectly) trained. Both models have regions of negative stress and large stresses outside the training limits. Fig. 8c shows expectation for the stress from the crystal elasticity underlying the response of the polycrystalline aggregate. The TBNN response has the most complex trends likely due to its formulation on invariants and the 11-stress response of the component model arguably represents the expected stress best, albeit with a distinct error in the gradient. This result illustrates that acceptable crossvalidation errors along limited training data does not necessarily lead to comparably acceptable interpolation, nor extrapolation, with NNs. To further investigate how much data and what variety of data is needed sufficiently train the . .

C1 C2
(b) shear Figure 6: Stress tensor basis coefficients in shear and tension using different bases.  constitutive models we employed the simple VP underlying model to generate data for loading modes that are symmetric and monotonic: F(t) = t of trajectories that each contain 100 points) but the variability in response is also decreasing with more data.
To measure of how much information has been gained by training the NN (relative to its untrained state), we use the Kullback-Leibler (KL) divergence: evaluated with the assistance of standard kernel density estimators. Here D n is a training set, T is the independent test data, and p(T |D n ) is the probability density function (PDF) of the predictions y j using an ensemble of models M k ∈ M and the (fixed) data inputs x j , j indexes the state and prescribed strain, and p(T ) ≡ p(T |D 0 ). In Fig. 9c,d we see that: (a) both the stress and flow models are steadily differentiating themselves from their untrained state with increased training data, (b) the largest changes appear to occur in the initial increases in training data and yet convergence of the KL divergence is not reached even with D 64 , and (c) the stress is gaining more information from the low strain data and the flow model is gaining the most information from the post-yield data, which is physically intuitive. As a prelude to studying the dynamic stability of our plasticity TBNN model, we test the TBNN formulations' sensitivity to noise by randomly perturbing the inputs by 1% using the CP training data. The response to the perturbations is fairly uniform over the range strains (not shown). As Fig. 10 shows, the output variance for most of the models is on-par with the input variance, the exceptions being tied to the presence of the noisy invariant tr σ. Clearly, pruning ill-conditioned invariants is crucial for stability.

Prediction
Generally speaking, errors in the predictions of the proposed TBNN plasticity models come from: errors in the elastic model, those in the flow rule, and those engendered by the integration scheme. In preliminary work, we integrated the rate given by training data to tune the tolerances of the integration scheme and ensure the integrator error is negligible.
In Fig. 11 we show Lyapunov-like stability tests using a E3(3×4)/T1(5×8) TBNN model trained on a D 64 dataset from the known closed form VP model. First, we perturb the initial conditions of state F p (0) for a random (monotomic) loading mode F(t) and compare the response of the underlying model (gray lines) to that of an analytic stress (Eq. (29))/TBNN flow model hybrid (colored) for this ensemble of initial conditions. The TBNN response is on par with that of the true model, albeit with a distinct bias toward higher stress. Second, we compare the same models with a F p (0) = I initial condition but with an imperfect stress model enacted by perturbing the Youngs' modulus E of the analytic stress model. Here again, the TBNN response is on par with the true model and yet artifacts in the trajectories are clearly present. Third, we repeated the first investigation with at TBNN with both a ML flow and a ML stress model. The results are largely similar to the response of the TBNN with the true stress model albeit with additional artifacts in trajectories. Lastly, we explore the sensitivity of the trajectory errors to flow model network size. Fig. 11d shows that the trajectory errors for the flow model trained on VP data are relatively insensitive to the NN dimensions. Also, since the variance of the results does not increase with strain, the errors are apparently primarily due to the stress representation. The inset of Fig. 11d demonstrates the necessity of sufficient variety of training data. Here we plot the fraction of the models that reach double the training strain stably. Apparently, in this application, training on at least 25 trajectories is necessary to achieve robust predictions outside the training data. Lastly, we return to models trained on the tension and shear CP data. Fig. 12 shows the predictions of a TBNN 3×4 E3 stress model with NN flow models of various sizes. Fig. 12a,b demonstrate that the predictions are essentially self consistent with the training data. Also the fanning out of the trajectories is generally consistent with accumulation of errors from integrating  For these modes the results are considerably less stable, especially in the mixed tension-shear mode which points to the stress model being the main issue (as discussed in the previous section). Even the tension phase of the non-monotonic loading leads to decreased stability and accuracy compared to the tension only case, apparently due to the change in strain rate. Lastly, in these modes none of the larger 5×12 network flow models tested were stable, which gives more evidence that the main issue is a lack of sufficient variety in the training data.

Discussion
In this work we generalized the TBNN framework to fully take advantage of classical representation theory. By embedding constraints and properties directly in the structure and formulation of the NNs for stress and plastic flow, we were able to reduce the amount of training required for valid models compared to current component-based NN models. The constraints of plasticity phenomenology and the trade-offs between learning and embedding the properties lead to a variety of models and hence to a model selection process. We showed that traditional cross-validation errors are not sufficient for the down-selection process and, for example, stability with respect to perturbation needs to be considered for a viable model. We also illustrated the facts that: given limited data, the formulations are insensitive to a number of meta-parameters and variables, in particular the selected functional inputs; and, there are trade-offs between model complexity and property preservation, such as preserving zero-stress-at-zero-strain. Using a known underlying data model, we demonstrated that the enhanced TBNN framework can provide robust and accurate predictions given sufficient data. Lastly, we demonstrated that the tension (and shear) experiments traditionally used in model calibration are likely insufficient to fully train a NN model, as formulated in the TBNN framework or via a component formulation that generally displayed worse performance. In future work we will develop an implicit time integrator based on derivatives of the neural network and explore means of obtaining sufficient variety of training data from experiments, for example using digital image correlation to obtain full-field data.