Sparse deep neural networks for modeling aluminum electrolysis dynamics

Deep neural networks have become very popular in modeling complex nonlinear processes due to their extraordinary ability to fit arbitrary nonlinear functions from data with minimal expert intervention. However, they are almost always overparameterized and challenging to interpret due to their internal complexity. Furthermore, the optimization process for finding the learned model parameters can be unstable, as it may get stuck in local minima. In this work, we demonstrate the value of sparse regularization techniques for significantly reducing model complexity. We demonstrate this for the case of an aluminium extraction process, which is a highly nonlinear system with many interrelated subprocesses. We trained a densely connected deep neural network to model the process and then compared the effects of sparsity-promoting ℓ1 regularization on generalizability, interpretability, and training stability. We found that the regularization significantly reduces model complexity compared to a corresponding dense neural network. We argue that this makes the model more interpretable, and show that training an ensemble of sparse neural networks with different parameter initializations often converges to similar model structures with similar learned input features. Furthermore, the empirical study shows that the resulting sparse models generalize better from small training sets than their dense counterparts.


Introduction and motivation
Aluminum is extracted through the electrolytic Hall-Héroult process. In this process, alumina (Al2O3) is dissolved in cryolite (Na3AlF6) and then reduced to aluminum. The reaction is driven by a line current that is sent through the electrolytic bath [1]. The dynamics of the aluminum electrolysis process contain nonlinearities and interrelated subprocesses, which are challenging to model. Accurate dynamic models are crucial for optimizing product quality and energy consumption. Furthermore, the harsh environment of aluminum electrolysis cells requires extra effort to ensure safe operation. To this end, research efforts have focused on modeling the Hall-Héroult process accurately. Most of the models developed until recently fall into the class of first-principles physics-based models (PBMs). For example, the authors of [2] developed a model of mass and energy balance based on the first law of thermodynamics. The resulting model includes, among other things, a complete control volume analysis, an extensive material balance, a 3D finite element model (FEM) of the resistance in the cell lining and shell, and a Computational Fluid Dynamics (CFD) model for calculating gas velocity streamlines. In [3], a multi-scale, multi-physics modeling framework including magnetohydrodynamics, bubble flow, thermal convection, melting, and solidification, based on a set of chemical reactions, was developed. These kinds of PBMs, although highly interpretable, cannot capture all the complex dynamics of aluminum electrolysis because they rest on numerous assumptions, incomplete physics, and uncertainties in the input parameters. For example, the magnetohydrodynamic phenomena or the reactivity and species concentration distribution are difficult to model [4]. Thus, the predictions of these first-principles models might deviate significantly from observations.
In recent years, owing to the availability of copious amounts of data, cheap computational resources, and major advancements in algorithms, Data-Driven Modeling (DDM) is emerging as an attractive alternative to PBM. DDM has the potential to accurately model even the poorly understood complex phenomena mentioned earlier directly from data. As a result, a broad array of scientific communities have explored its applicability in many engineering applications. Some examples are the use of neural networks to model the simulated dynamics of a pressurized water nuclear reactor [5], identification of the dynamics of the production and purification process of bioethanol [6], and prediction of chemical reactions [7]. The authors of [8] demonstrated convincingly how DDM outperforms PBM in the absence of a full understanding of the physics. However, despite their advantages, data-driven models suffer from certain shortcomings [9]: they are difficult to interpret, difficult to generalize to previously unseen problems, and unstable to train. These are important shortcomings to overcome before the models can be used in high-stakes applications. We discuss each of them briefly.
Interpretability: This can be defined as the ability of a model to express itself in a human-interpretable form [10]. A simple model like linear regression, with very few trainable parameters, is a good example of an interpretable model. However, a DNN, which can be seen as a complex extension of linear regression constituting millions of trainable parameters, can be extremely difficult or almost impossible to interpret. Reducing the complexity of DNN models can lead to more interpretable models.
Generalizability: This refers to a model's ability to predict outcome values for unseen data from the same distribution as the training data. Highly complex models are prone to overfitting, meaning that they do not generalize to areas of the input space that the training data do not cover. DNN models have shown promising prediction accuracy by processing large amounts of data. However, in many complex physical systems, such as aluminum electrolysis, the cost and challenges of data acquisition limit the training data to partial information about the system dynamics. As a result, many DNN models fail to generalize from limited data and provide inaccurate predictions in regions poorly represented by the training data. Reducing the complexity of the DNN can increase the generalizability of the model.

Stability:
The training of a DNN requires solving an optimization problem in which a cost function dependent on thousands or even millions of parameters is minimized. Depending on the complexity of the problem, multiple minima can be encountered during the optimization, and the task of converging to the global minimum becomes increasingly difficult with the number of parameters. By stability, in the context of the current work, we mean that the optimization leads to a reasonable minimum which, even if not global, is close to it. On the contrary, an unstable DNN is one for which the optimization process gets stuck in a bad local minimum.
Combining PBM and DDM in a hybrid modeling approach to address these shortcomings is emerging across a wide range of scientific applications [9]. In [8], the equations known from PBM are augmented by a corrective source term generated by a DNN. In [11], the estimate calculated by a PBM is subtracted from coarse measurements of the metal height in an aluminum electrolysis simulator to find residual measurements. These residual measurements are used by a compressed sensing algorithm to model the unknown disturbance signal of the dynamics. In [12], a novel hybrid modeling approach called Physics-Informed Neural Networks (PINN) is proposed. PINN utilizes known first-principles knowledge, expressed in partial differential equations and their corresponding boundary conditions, to regularize a neural network that is trained to estimate certain variables. The authors of [13] introduce a method called Physics-Guided Machine Learning (PGML). The main idea is to augment the learning process of the resulting hybrid DNN model with simplified theories relevant to the dynamics of the modeled system. The simplified theories are treated as features in the resulting model. These features are added to intermediate layers in the neural network architecture rather than to the input layer, in order to emphasize their importance. As most of the examples above illustrate, neural networks are often central components of hybrid models. While hybrid models themselves address the shortcomings mentioned above, the DNNs utilized in hybrid models still suffer from them. This article focuses on addressing the interpretability, generalizability, and training stability of DNNs used in a purely data-driven manner; this can also greatly benefit hybrid models.
In the aluminum electrolysis process, DNNs have been applied to predict essential variables that are difficult to measure continuously. In [14] and [15], neural networks were used to predict the temperature and the concentrations of alumina, aluminum fluoride, and calcium fluoride in the bath. In [16], neural networks were used to model carbon anode properties as a function of physical and chemical properties. Despite the cited works and many more available in the existing literature, issues related to the interpretability, generalizability, and training stability of DNN models in the context of the aluminum electrolysis process remain unaddressed.
To this end, in the current article, we apply the sparsity-promoting ℓ1 regularizer to the weights of each layer in DNN models that predict the states of the aluminum electrolysis. The regularization pushes most of the weights to zero, thus reducing the complexity drastically. We demonstrate that the resulting sparse models have increased interpretability, generalizability, and training stability.
The article is structured as follows. Section 2 presents the relevant theory for the case study. Section 3 presents the method applied in the paper and the experimental setup of the simulator used for data generation. In Section 4, the results are presented and discussed. Finally, in Section 5, conclusions are given and potential future work is presented.

Deep neural networks (DNN)
A DNN is a supervised machine learning algorithm [17] that can be denoted by

y = f(x; θ),

where y ∈ R^s is the output vector of the network model and s is the length of the output vector. x ∈ R^d is the input vector to the network model, and d is the input dimension. θ ∈ R^p denotes all trainable parameters in the network model, where p is the number of parameters. Each layer j + 1 operates on the output vector Z_j ∈ R^{L_j} from the previous layer and outputs a vector Z_{j+1} ∈ R^{L_{j+1}}:

Z_{j+1} = σ(W_{j+1} Z_j + b_{j+1}).

W_{j+1} ∈ R^{L_{j+1} × L_j} is called the weight matrix, and b_{j+1} ∈ R^{L_{j+1}} is called the bias vector of layer j + 1. θ = {θ_1, ..., θ_{j+1}, ...}, and θ_{j+1} = {W_{j+1}, b_{j+1}}. σ is a non-linear activation function. That is,

Z_{j+1, i} = σ(W_{j+1, i} · Z_j + b_{j+1, i}), i = 1, ..., L_{j+1},

where W_{j+1, i} are the row vectors of the weight matrix W_{j+1} and b_{j+1, i}, i = 1, ..., L_{j+1}, are the entries of the bias vector b_{j+1}. Thus, the activation function calculates the output of each neuron in layer j + 1 as a nonlinear function of the weighted sum of outputs from the neurons in the previous layer plus a bias. Each neuron outputs one value, and the weights in the consecutive layer determine the importance of the output of each neuron in the current layer. The nonlinear activation function σ can, for example, be the sigmoid function, the hyperbolic tangent function, or the binary step function, to mention a few. Over the last decade or so, the popularity of the piecewise linear (PWL) activation function Rectified Linear Unit (ReLU) has grown rapidly. This is in part due to its computational simplicity, representational sparsity, and non-vanishing gradients. The ReLU activation function is given by

ReLU(z) = max(0, z).

ReLU is the only activation function used in the current work.
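For concreteness, the layer-by-layer computation above can be sketched in a few lines of NumPy (a generic illustration with our own function names, not the paper's implementation):

```python
import numpy as np

def relu(z):
    # ReLU(z) = max(0, z), applied elementwise
    return np.maximum(0.0, z)

def forward(x, params):
    """Forward pass of a fully connected ReLU network.

    params is a list of (W, b) pairs; layer j+1 computes
    Z_{j+1} = relu(W_{j+1} @ Z_j + b_{j+1}), and the output
    layer is kept linear, as is common for regression.
    """
    z = np.asarray(x, dtype=float)
    for W, b in params[:-1]:
        z = relu(W @ z + b)
    W, b = params[-1]
    return W @ z + b

# Tiny example: 2 inputs -> 3 hidden neurons -> 1 output
rng = np.random.default_rng(0)
params = [(rng.standard_normal((3, 2)), np.zeros(3)),
          (rng.standard_normal((1, 3)), np.zeros(1))]
y = forward([1.0, -2.0], params)
```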

Sparse neural networks and regularization
Dense neural networks are often overparameterized models, meaning that they have more parameters than can be estimated from the data, and they thus often suffer from overfitting. In [18], it is shown empirically that randomly initialized dense neural networks contain subnetworks that can improve generalization compared to the dense networks. These subnetworks, characterized by significantly fewer non-zero trainable parameters than their dense counterparts, are called sparse neural networks. Their utility can further be seen in terms of increased computational performance for inference and training, and increased storage and energy efficiency. Typically, large-scale models that require millions to billions of parameters and arithmetic operations can benefit greatly from such sparsification. To conclude, sparsification of complex models leads to simpler models which are relatively easier to interpret, generalize, and train.
Among the methods that can be used to sparsify a complex network, regularization techniques are the most popular. In regularization, a penalty term R(w) defined on the weights is added to the cost function C:

C(θ) = (1/N) Σ_{i=1}^{N} L(y_i, f(x_i; θ)) + λ R(w).

The vector θ = {w, b} denotes the adaptable parameters, namely the weights w and biases b in the network model. Furthermore, {(x_i, y_i)}_{i=1}^{N} is the training data, L(·, ·) is a proper loss function, and the positive scalar coefficient λ is a hyperparameter weighting the terms L(·, ·) and R(·). The standard choice of loss function L(·, ·) for regression tasks is the Mean Squared Error (MSE). In the training process, the cost function is minimized to find optimal values of the parameters:

θ* = argmin_θ C(θ).

The most intuitive sparsity-promoting regularizer is the ℓ0 norm, often referred to as the sparsity norm:

||w||_0 = Σ_i 1(w_i ≠ 0),

which counts the number of nonzero weights. Unfortunately, the ℓ0 norm has several drawbacks that make it less suitable for optimization: it is nondifferentiable and sensitive to measurement noise, and in terms of computational complexity, ℓ0 minimization is shown to be NP-hard [19]. The ℓ1 norm is a convex relaxation of the ℓ0 norm and is given by

||w||_1 = Σ_i |w_i|.

Due to its geometrical characteristics, ℓ1 minimization is sparsity promoting. However, ℓ1 regularization usually does not reduce the weights exactly to zero but rather to very small magnitudes. Thus, magnitude pruning can be applied after ℓ1 minimization to achieve truly sparse models. Fig. 1 illustrates how ℓ1 regularization constrains the parameters to a sparse solution; in this case, β_1 is pushed towards zero.
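A minimal sketch of the ℓ1 penalty and the subsequent magnitude pruning step (the threshold value and helper names are our own choices, not the paper's):

```python
import numpy as np

def l1_penalty(weights):
    # R(w) = ||w||_1 summed over all weight matrices
    # (biases are excluded, as only the weights are regularized)
    return float(sum(np.abs(W).sum() for W in weights))

def magnitude_prune(weights, threshold=1e-3):
    # l1 training drives weights toward, but rarely exactly to, zero;
    # magnitude pruning zeroes every entry below a small threshold.
    return [np.where(np.abs(W) < threshold, 0.0, W) for W in weights]

def sparsity(weights):
    # fraction of exactly-zero parameters (meaningful after pruning)
    total = sum(W.size for W in weights)
    zeros = sum(int((W == 0).sum()) for W in weights)
    return zeros / total

W = [np.array([[0.5, 1e-4], [-2e-4, -0.8]])]
pruned = magnitude_prune(W)
```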

Region bounds for PWA neural networks
The complexity of neural networks with Piecewise Affine (PWA) activation functions such as ReLU can be analyzed by looking at how the network partitions the model's input space into an exponential number of linear response regions [20,21]. For each region in the input space, the PWA neural network has a linear response for the output. The authors of [22] present asymptotic lower and upper bounds for the maximum number of regions of a DNN with ReLU activations:

max. number of regions = Ω((n/d)^{(L−1)d} n^d) and O(n^{dL}),   (9)

where d is the input dimension, L is the number of hidden layers, and n is the number of activation functions or neurons in each layer. The bounds in Eq. (9) are valid for networks with the same number of neurons in each layer. The bounds for networks with an uneven number of neurons per layer show similar exponential behavior and are thus omitted for brevity. Eq. (9) illustrates the exponential relation between the input dimension, the number of neurons, and the depth of the network. For realistic amounts of data sampled from a physical system, the number of linear regions into which a relatively small dense neural network partitions the input space exceeds the sampled data by several orders of magnitude. Thus, in order to generalize to larger areas of the model's input space, the number of regions needs to be reduced drastically. This motivates sparsifying the model.
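To see the scale of the problem, the upper bound can be compared against a realistic sample count with a back-of-the-envelope computation (the width and depth below are illustrative choices, not the paper's architectures):

```python
def region_upper_bound(n, d, L):
    """Asymptotic upper bound O(n^(d*L)) on the number of linear
    response regions of a ReLU network with L hidden layers of
    width n and input dimension d (constant factors omitted)."""
    return n ** (d * L)

# The case study has d = 13 inputs (8 states + 5 controls); even a
# modest dense network with 3 hidden layers of 20 neurons each may
# partition the input space into up to ~20^39 regions, while one
# simulated time series contributes only 1000 samples.
bound = region_upper_bound(n=20, d=13, L=3)
samples = 1000
```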

Simulation model
The simulation model used in this article is based on the mass and energy balance of an aluminum electrolysis cell.
The derivation of the model is found in Appendix A. In this section, the simulation model is put in state space form, where constants, system states, and control inputs are denoted by the symbols k_i, x_i, and u_i, respectively. The simulation model can be expressed as a nonlinear system of ODEs with 8 states x ∈ R^8 and 5 inputs u ∈ R^5 on the form

ẋ = f(x, u),   (10)

where ẋ ∈ R^8 is the time derivative of the states x, and f(x, u) is a nonlinear function. Table 1 presents the mapping from physical quantities to variables. The nonlinear functions in Eqs. (11)-(15) that partly describe the dynamics of the system states are defined in advance of presenting the system dynamics, in order to simplify the expressions in Eq. (10): g_1 is the liquidus temperature T_liq, g_2 is the electrical conductivity κ from Eq. (A.33), g_3 is the bubble coverage φ from Eq. (A.37), g_4 is the bubble thickness d_bub from Eq. (A.36), and g_5 is the bubble voltage drop U_bub. With the nonlinear functions g_1, ..., g_5 described, the state space equations in Eq. (10) are given by Eqs. (17)-(24).

Method and experimental setup

Sparsity promoting regularization
In this article, sparse DNN models are utilized to predict state variables in an aluminum electrolysis simulator. All the weight matrices in a DNN model are ℓ1 regularized to impose sparsity on the model. Fig. 2 illustrates how weights are enumerated according to their input and output nodes. Layer j has i nodes and layer (j + 1) has r nodes. Layer j = 0 corresponds to the input layer and consists of the measured or estimated states x(t) and the control inputs u(t) at time step t. The output layer consists of the estimated time derivatives of the states ẋ(t) at time step t. The weight matrix W_{j+1} maps the weights from layer j to layer j + 1. Regularization terms R_{ℓ1, j+1} are defined for each weight matrix W_{j+1}:

R_{ℓ1, j+1} = Σ_{i,k} |w_{i,k}|,

where w_{i,k} are the model weights from layer j to j + 1, or equivalently, the elements of W_{j+1}. The regularization terms for each layer are added to the cost function:

C(θ) = (1/N) Σ_{i=1}^{N} ||ẋ_i − f̂(x_i, u_i; θ)||² + Σ_{j=0}^{3} λ_{j+1} R_{ℓ1, j+1},

where the first term is the MSE and λ_{j+1}, j = 0, ..., 3, is a layer-specific hyperparameter that determines how strongly the weights in W_{j+1} are penalized. Data for the aluminium electrolysis process is generated by integrating the nonlinear ODEs given by Eqs. (17)-(24), with a set of chosen initial values for the state variables x(t_0), using the fourth-order Runge-Kutta (RK4) algorithm. The initial conditions of each variable x_i for each simulation are randomly chosen from a given interval of possible initial conditions.
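The layer-wise penalized cost described above can be sketched as follows (helper names are ours; the per-layer λ weighting follows the description in the text):

```python
import numpy as np

def cost(pred, target, weights, lambdas):
    """MSE loss plus layer-wise l1 penalties.

    weights[j] holds the weight matrix W_{j+1}, and lambdas[j]
    holds the corresponding layer-specific hyperparameter
    lambda_{j+1}.
    """
    mse = float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))
    reg = float(sum(lam * np.abs(W).sum()
                    for lam, W in zip(lambdas, weights)))
    return mse + reg

# With a perfect prediction, only the penalty term remains.
c = cost(pred=[1.0, 2.0], target=[1.0, 2.0],
         weights=[np.array([[1.0, -1.0]])], lambdas=[0.1])
```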

Experimental setup and data generation
Table 3 shows the intervals for the initial conditions of each variable. For x_2 and x_3, the concentrations c_{x_2} and c_{x_3} are given.
Data-driven models depend on a high degree of variation in the training data to be reliable and valid over a large area of the input space. The input signal determines how the system is excited and thus what data is available for modeling and parameter estimation. Operational data from a controlled, stable process is generally characterized by a low degree of variation. Even large amounts of data sampled over a long period from a controlled process cannot guarantee that the variation in the training data is large enough to ensure that the trained model generalizes to unseen data. Therefore, random excitations are added to the input signals to increase the variation in the sampled data. The intuition is that the random excitation pushes the dynamics out of the standard operating condition so that the variation in the training data increases. In general, each control input i is given by the sum of a deterministic term and a random term. The control inputs u_1, u_3, and u_4 are impulses; for these, the random term is zero whenever the deterministic term is zero, and the deterministic term is a proportional controller. The control inputs u_2 and u_5 are always nonzero. These control inputs have constant deterministic terms and a random term that changes periodically. The random term stays constant for a certain period ∆T_rand before changing to a new randomly determined constant. Choosing the period ∆T_rand is a matter of balancing different objectives. On the one hand, it is desirable to choose a large period ∆T_rand so that the system can stabilize and evolve under the given conditions and thereby reveal the system dynamics under those conditions. On the other hand, it is desirable to test the system under many different operational conditions. By empirically testing different periods ∆T_rand and observing how the dynamics evolve in simulation, ∆T_rand = 30∆T turns out to be a fair compromise between the two. Table 4 gives the numerical values of the deterministic term of each control input, the interval of values for the
random terms, and the duration ∆T_rand for which the random term stays constant before either becoming zero (u_1, u_3, u_4) or changing to a new randomly chosen value (u_2, u_5). One simulation i with a given set of initial conditions is run for 1000 time steps, with step length ∆T = 30 s. The simulation generates a data matrix as in Eq. (29). The number k within the parentheses of variable i indicates the time step at which x_i(k) is sampled. The target values are then calculated from the sampled data. The training set consists of smaller training sets S_k, where n is the number of simulated time series X. Each training set S_k from simulation k is put into pairs for regression. The number of time series simulations n varies between experiments to evaluate model performance as a function of training set size. The test set also consists of several time series simulations generated in the same way as described above. {(x_k, u_k)}_{k=1}^{1000} is one simulated time series, and {x_k}_{k=1}^{1000} is forecast by the models. p = 20 is the number of simulated time series in the test set. In all experiments, 20 models each of the dense and sparse networks, with different initializations, are trained on a training set and evaluated on a test set.
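The data-generation pipeline described above (a piecewise-constant random excitation held for ∆T_rand, RK4 integration, and derivative targets) can be sketched as follows; the toy one-state dynamics here stand in for the 8-state electrolysis model, and all numeric values are illustrative:

```python
import numpy as np

def rk4_step(f, x, u, dt):
    # One fourth-order Runge-Kutta step of dx/dt = f(x, u),
    # holding the control input constant over the interval.
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def piecewise_constant(n_steps, base, low, high, hold, rng):
    # Random term redrawn every `hold` steps and held constant
    # in between (the u2/u5 excitation pattern described above).
    u = np.empty(n_steps)
    for s in range(0, n_steps, hold):
        u[s:s + hold] = base + rng.uniform(low, high)
    return u

def simulate(f, x0, U, dt=30.0):
    """Roll out one time series X and regression targets Y = f(X, U)."""
    X = [np.asarray(x0, dtype=float)]
    for k in range(len(U) - 1):
        X.append(rk4_step(f, X[-1], U[k], dt))
    X = np.stack(X)
    Y = np.stack([f(x, u) for x, u in zip(X, U)])
    return X, Y

rng = np.random.default_rng(42)
U = piecewise_constant(1000, base=1.0, low=-0.5, high=0.5, hold=30, rng=rng)
f = lambda x, u: -0.01 * x + 0.01 * u  # toy stand-in dynamics
X, Y = simulate(f, x0=[10.0], U=U, dt=30.0)
```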

Performance metrics
In aluminum electrolysis, data is generally sampled at rare instants during operation. Thus, a model needs to accurately forecast several time steps without feedback from measurements to ensure safe and optimal operation. Therefore, the models' capability to estimate the states x over a given time horizon without measurement feedback becomes an important measure of performance. The initial conditions x(t_0) are given to the models. Then, the consecutive n time steps of the states are estimated, {x̂(t_1), ..., x̂(t_n)}. This is called a rolling forecast. The model estimates the time derivatives of the states dx_i/dt based on the current control inputs u(t_i) and the initial conditions x(t_0) if t = t_0, or the estimate of the current state variables x̂(t_i) if t > t_0. Then, the next state estimate x̂(t_{i+1}) is calculated as

x̂(t_{i+1}) = x̂(t_i) + ∆T · f̂(x̂(t_i), u(t_i)).

The rolling forecast can be computed for each of the states x_i for one set of test trajectories S_test. However, presenting the rolling forecast of multiple test sets would render the interpretation difficult. By introducing a measure called the Average Normalized Rolling Forecast Mean Squared Error (AN-RFMSE) that compresses the information about model performance, the models can easily be evaluated on a large number of test sets. The AN-RFMSE is a scalar defined as

AN-RFMSE = (1/p) Σ_{i=1}^{p} (1/n) Σ_{j=1}^{n} ((x̂_i(t_j) − x_i(t_j)) / std(x_i))²,

where x̂_i(t_j) is the model estimate of the simulated state variable x_i at time step t_j, std(x_i) is the standard deviation of variable x_i in the training set S_train, p = 8 is the number of state variables, and n is the number of time steps the normalized rolling forecast MSE is averaged over. Hence, for every model f̂_j and every test set time series S_test(i), there is a corresponding AN-RFMSE. This generates a matrix AN-RFMSE_mat, where each row represents an individual model instance and each column represents one test set simulation S_test(i). Each entry in the matrix is the AN-RFMSE for a given model instance and a given S_test(i). In the matrix, k is the number of models, and n is
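A minimal sketch of the rolling forecast and the AN-RFMSE metric (a forward-Euler state update, with function names of our own choosing):

```python
import numpy as np

def rolling_forecast(model, x0, U, dt=30.0):
    # Forecast the whole horizon from the initial condition only:
    # each step integrates the model's own previous state estimate,
    # with no measurement feedback along the way.
    X_hat = [np.asarray(x0, dtype=float)]
    for k in range(len(U) - 1):
        x = X_hat[-1]
        X_hat.append(x + dt * model(x, U[k]))
    return np.stack(X_hat)

def an_rfmse(X_hat, X_true, stds):
    # Squared error normalized per state by its training-set standard
    # deviation, averaged over time steps and over the p states.
    err = (np.asarray(X_hat) - np.asarray(X_true)) / np.asarray(stds)
    return float(np.mean(err ** 2))
```

A perfect forecast yields an AN-RFMSE of exactly zero, and the normalization makes errors comparable across states with very different physical scales.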
the number of time series simulations in the test set S_test. There are two AN-RFMSE_mat matrices, one for the sparse models and one for the dense models. Averaging over all columns at each row, that is, averaging over all test set time series for each model instance, generates a vector AN-RFMSE_vec. The elements of AN-RFMSE_vec are the average AN-RFMSE over all test set time series for every model instance.

Results and discussion
In this section, we present and discuss the main findings of the work. In doing so, we analyze the results from the perspectives of interpretability, generalizability, and training stability.

Interpretability perspective
As discussed earlier, the interpretability of a model is key to its acceptability in high-stakes applications like the aluminum extraction process considered here. Unfortunately, highly complex dense neural networks with thousands to millions of parameters, which were the starting point for the modeling here, are almost impossible to interpret. Refer to Fig. 3, which shows the model structure of a dense DNN model learned from the generated data.
Figure 3: Model structure of each output function {f̂_1, ..., f̂_8} for one of the trained dense neural network models. Blue circles represent neurons, inputs, and outputs in the model. X_i represents system variable i in the input layer, U_i represents control input i in the input layer, and Z_i represents latent variable i in the layer where it is visualized. The directed edges indicate weights in the model.
Fortunately, through regularization it was possible to significantly reduce the model complexity, resulting in a drastically reduced number of trainable parameters (see Figs. 4-11). It can be argued that the reduced complexity of sparse neural networks increases model interpretability. With domain knowledge about the aluminum electrolysis process, the sparse models can be evaluated, as we will do in the remainder of this section. The results related to the interpretability aspect are presented in the form of model structure plots, which can be used to explain the input-output mapping of the models. If the model structures were very sensitive to the initialization, their interpretation would not make sense; therefore, 100 DNNs with different initializations are trained independently, and their common trends are emphasized in the discussion. We now present each of the model outputs {f̂_1(x, u), ..., f̂_8(x, u)}. It is worth mentioning that these outputs are estimates of the time derivatives of the states {ẋ_1, ..., ẋ_8}, respectively.

Model output f1
The simulation model for the first output, f_1, defined in Eq. (17), is a function of the features {x_1, x_2, x_3, x_4, x_6, x_7}. f_1 can further be divided into three additive terms, among them h_2 = −k_2 x_6 and h_3 = g_1(x_2, x_3, x_4), where g_1 is a nonlinear function defined in Eq. (11). Fig. 4 shows the three most common learned structures of the first output of the neural network, f̂_1(x, u). In total, these structures account for 86% of all learned structures of f̂_1. The top structure is the resulting structure for 63% of the models. It is a function of seven inputs, f̂_1 = f(x_1, x_2, x_3, x_4, x_6, x_8, u_1). All input features are connected to the same neuron in the first layer. Moreover, there is only one neuron in each hidden layer. The upper bound in Eq. (9) states that this model structure has only n^{dL} = 1^{7·3} = 1 region. This is equivalent to stating that the model collapses to one linear model. The middle and bottom structures of Fig. 4 have more than one neuron in the hidden layers. However, these structures can be divided into two disconnected subnetworks, since the hidden neurons are not connected before they are added in the output layer. Hence, these models also collapse to linear models with a single region. This means that the neural networks do not capture the nonlinear dynamics of the simulation model. All structures in Fig. 4 erroneously include x_8 and u_1 in their feature basis. Moreover, x_7, which is present in the simulation model f_1, is not found by the top and bottom structures in Fig. 4.
In fact, x_7 is found in only 22% of the models f̂_1 according to Table 5. The exact cause of this erroneous feature selection is not trivial. However, x_8, which is the wall temperature, correlates highly with the side ledge temperature x_7. Thus, x_8 may have been learned as a feature of f_1 instead of x_7. The alumina feed u_1, on the other hand, directly affects the time derivative of the mass of alumina, ẋ_2, the time derivative of the mass of aluminum fluoride, ẋ_3, and the time derivative of the mass of cryolite, ẋ_4. All these variables affect ẋ_1 through the liquidus temperature g_1(x_2, x_3, x_4). To understand how this might cause the learning algorithm to find u_1 as a feature of f̂_1, consider the following: let u_1 be zero until time t. Then, {x_2, x_3, x_4} will be updated due to u_1 at the next sampled time step t + 1. However, the fourth-order Runge-Kutta solver splits the sampling interval into four smaller intervals {t + 0.25, t + 0.5, t + 0.75, t + 1} and solves the ODE equations at all these time steps. Thus, the state variables {x_2, x_3, x_4} are updated already at time t + 0.25. Since ẋ_1 depends on these variables, x_1 will be updated at t + 0.5. Therefore, at time t + 1, when data is sampled, x_1 will also have changed. Hence, the learning algorithm finds u_1 to affect the time derivative ẋ_1. This might have been avoided by shortening the sampling interval. Furthermore, x_1 is not included as a feature in 14% of the models. This might be due to a combination of parameter initialization and the fact that x_1 is multiplied by the small constant k_0 = 2·10^{−5}.
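This mechanism can be reproduced with a toy two-state chain in which u drives x_2 and x_2 drives x_1, so that dx_1/dt has no direct u-dependence; a single RK4 step over the sampling interval nevertheless changes x_1 when u is nonzero (an illustrative sketch, not the simulator's equations):

```python
import numpy as np

def rk4_step(f, x, u, dt):
    # One RK4 step: the intermediate evaluations k2-k4 use states
    # that u has already moved within the sampling interval.
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Toy chain: dx1/dt = x2 (no direct u-dependence), dx2/dt = u.
f = lambda x, u: np.array([x[1], u])

x0 = np.array([0.0, 0.0])
without_u = rk4_step(f, x0, 0.0, dt=1.0)
with_u = rk4_step(f, x0, 1.0, dt=1.0)
# x1 already differs after a single sampled step, so a model fitted
# on sampled data can attribute changes in x1 to u.
```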

Model output f2
Fig. 5 shows the most common learned structure among the models f̂_2 that model the time derivative of alumina, f_2 = ẋ_2; 97% of the structures end up as the structure in Fig. 5. The simulation model f_2 in Eq. (18) is a linear model dependent on {u_1, u_2}. The learned models f̂_2 only find u_1 as a relevant feature. The reason for this might be that u_2 is proportional to a very small constant, k_3 = 1.7·10^{−7}. Variations in the line current u_2 are not big enough to be of significance for the learning algorithm. The dynamics caused by u_2 are instead captured in a bias in the models.

Model output f3
The simulation model f_3 in Eq. (19) is a linear model depending on the features {u_1, u_3}, and in all structures in Fig. 6, only these two features are found. As shown in Table 5, these features are found in 100% of the trained models. The structures found are mainly linear models; however, in the second and fourth structures, some weights connect the features in intermediate layers.

Model output f4
Fig. 7 shows the four most common learned structures among the models f̂_4 that model the mass rate of liquid cryolite Na3AlF6 in the bath, namely ẋ_4. ẋ_4 is simulated by the simulation model ẋ_4 = f_4 in Eq. (20). f_4 consists of the features {x_1, x_2, x_3, x_4, x_6, x_7, u_1}. Table 5 shows that {x_2, x_3, x_4, x_6, x_8, u_1} are included in 100% of the learned models, x_1 is included in 86% of the models, and x_7 is included in only 21% of the models. As for f̂_1, x_8 is erroneously included in the basis of the model. This might be explained by the fact that x_7 and x_8 are highly correlated, so that the wrong feature is included. The structures in Fig. 7 all form linear models. However, the simulation model ẋ_4 = f_4 is partly nonlinear. Thus, the approximation f̂_4 oversimplifies the dynamics. This might be caused by a high weighting of the sparse regularization term; if the loss function is penalized less, there is room for more weights in the model and therefore also more nonlinearities.

Model output f5
Fig. 8 shows the most common learned structure among the models f̂_5, covering 98% of the learned model structures.
f̂_5 models the mass rate of produced aluminum in the cell, ẋ_5. The x_5 time series are produced by the simulation model ẋ_5 = f_5 in Eq. (21). f_5 is a linear model dependent on the features {u_2, u_4}. However, most of the model structures depend only on u_4. This can be caused by the fact that u_2 is proportional to a very small constant, k_6 = 4.43·10^{−8}. Thus, variations in u_2 might not be large enough for the learning algorithm to find u_2 significant as a basis for f̂_5.

Model output f6
Fig. 9 shows the most common model structures of f̂_6. f̂_6 models the bath temperature time derivative, ẋ_6. The bath temperature x_6 is simulated by the ODE in Eq. (22). It is a nonlinear equation depending on {x_1, x_2, x_3, x_4, x_6, x_7, u_2, u_5}. The most common structure, learned by 57% of the models f̂_6, is illustrated in the top plot of Fig. 9. This structure has the basis {x_1, x_2, x_3, x_4, x_6, x_8, u_1, u_2, u_5}. Hence, it finds u_1 and x_8, which are erroneously found in many of the structures above; a possible explanation for this trend is given above. The structure has two neurons in the first layer, one with the basis {x_1, x_2, x_6, x_8, u_1, u_2, u_5} and one with the basis {x_1, x_2, x_3, x_4, x_6, x_8, u_1}. Since the model collapses to a linear model, all terms are summed in the end, and the separation of the basis is thus of little importance. The second plot from the top in Fig. 9 shows the second most common structure, which accounts for 11% of the models f̂_6. It has the same feature basis as the most common structure, but the features are arranged differently in the first layer. However, since both model structures are linear models, this arrangement is of minor importance. The third structure in Fig.
9 which account for 7% of the structures of f6 has the basis {x 1 , x 2 , x 6 , u 1 , u 2 , u 5 }.Compared to the other structures, {x 3 , x 4 , x 8 } is not present.Since it only happens rarely, this is maybe partly caused by bad parameter initialization.The fourth most common structure, which accounts for 6% of the structures, has the same feature basis as the first and the second most common structures.This structure is plotted in the bottom of Fig. 9.While all other structures in Fig. 9 have one linear response region, the fourth most common model structure models some nonlinearities.That is, neurons in the first hidden layer are connected in the second hidden layer.Hence, the input space must be divided into several linear response regions for this structure.Fig. 10 show the two most common structures for the models f7 .f7 models the time derivative of the side ledge temperature ẋ7 .ẋ7 = f 7 is simulated by the ODE in Eq. ( 23).ẋ7 is depending on the feature basis {x 1 , x 2 , x 3 , x 4 , x 6 , x 7 , x 8 }.Table 5 states that the basis {x 2 , x 3 , x 4 , x 6 , x 8 , u 1 } is present for 100% of the models, x 1 is present for 99% of the models and x 7 is present in 89% of the models.The top plot in Fig. 10 show the structure that accounts for 63% of the models.The bottom plot account for 10% of the model structures of f7 .These two structures collapse to linear models, and have the same feature basis {x 1 , x 2 , x 3 , x 4 , x 6 , x 7 , x 8 , u 1 }, but have minor differences in how weights are connected between input layer and first hidden layer.u 1 is also for this model output erroneously found as a basis, and a possible explanation is mentioned above.Fig. 11 show the two most common model structures for the last model output f8 .f8 models the time derivative of the wall temperature ẋ8 = f 8 , which is simulated in Eq. 
(24).ẋ8 depends on the feature basis {x 1 , x 7 , x 8 }.However, the most common learned structure for f8 has the basis {x 1 , x 2 , x 3 , x 4 , x 6 , x 7 , x 8 , u 1 }.This is the exact same structure as the most common learned structure for f7 .Therefore, a possible explanation is that f8 adapts the same parameters as f7 in some cases as they highly correlates.The bottom plot in Fig. 11 show the second most common learned structure of the model output f8 .This structure is learned by 11% of the models f8 .The feature basis for this structure is {x 1 , x 2 , x 7 , x 8 }, and reminds more of the actual basis.In this structure, there is only one erroneous learned feature, namely x 2 .Fig. 4-11 and Table 5 show that the sparse learning is quite consistent in finding the same feature basis and structure with similar characteristics.However, some differences that affect the models are present.
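The feature bases discussed above can be read directly off a trained sparse network by checking which input features keep at least one non-pruned outgoing weight in the first layer. A minimal sketch (the function name and tolerance are our own, not from the paper):

```python
import numpy as np

def active_features(W1, tol=1e-6):
    """Return indices of input features with at least one surviving
    (non-pruned) outgoing weight in the first layer.

    W1 has shape (n_inputs, n_hidden): row i holds the weights from
    input feature i to the first hidden layer.
    """
    return [i for i in range(W1.shape[0]) if np.any(np.abs(W1[i]) > tol)]

# Toy first-layer weight matrix for 4 inputs and 2 hidden neurons;
# feature 2 has been pruned away entirely.
W1 = np.array([[0.8, 0.0],
               [0.0, -1.2],
               [0.0, 0.0],
               [0.3, 0.5]])
print(active_features(W1))  # -> [0, 1, 3]
```

Applying this to each of the 100 trained models per output and counting occurrences yields feature frequencies of the kind reported in Table 5.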
It is clear that a similar analysis for the models in Fig. 3 is impossible, as all the interconnections make those models a black box. On average, 93% of the weights in the sparse DNN models are pruned. For the outputs f̂1, f̂4, f̂6, f̂7, and f̂8, approximately 40% of the input features are pruned from the model structures; for f̂2, f̂3, and f̂5, 85-95% of the input features are pruned. For all outputs of the sparse DNN models, around 85-95% of the neurons are pruned at each layer. In a neural network, the number of matrix operations only decreases if neurons are pruned: removing a neuron in layer j is equivalent to removing a row in the weight matrix W_j and a column in the weight matrix W_{j+1}. The dense models have a compact model structure where most of the weights are nonzero. The dense DNN models in the case study have the shape 13-15-14-12-8, where the first number is the number of features, the middle three numbers are the numbers of neurons in the hidden layers, and the last number is the number of outputs. This shape gives 669 matrix operations in a forward pass. An average sparse DNN has the shape 13-6-6-6-8, which gives 198 matrix operations. Thus, the number of matrix operations in the forward pass of a sparse DNN model is reduced by approximately 70%.
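The operation counts above follow directly from the layer widths: each weight matrix of shape (n_in, n_out) contributes n_in · n_out multiplications to a forward pass. A short check reproducing the numbers reported in the text:

```python
def forward_pass_matrix_ops(shape):
    """Number of weight-matrix entries (multiplications) in one forward
    pass of a fully connected network with the given layer widths."""
    return sum(n_in * n_out for n_in, n_out in zip(shape[:-1], shape[1:]))

dense_ops = forward_pass_matrix_ops([13, 15, 14, 12, 8])
sparse_ops = forward_pass_matrix_ops([13, 6, 6, 6, 8])
print(dense_ops, sparse_ops)                  # -> 669 198
print(round(1 - sparse_ops / dense_ops, 2))   # -> 0.7
```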

Generalizability perspective
This section focuses on the models' performance on test data in terms of accuracy and uncertainty. Furthermore, we investigate the impact of training data quantity and prediction horizon on the performance, measured in terms of AN-RFMSE.
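As a rough illustration only: AN-RFMSE is defined earlier in the paper; the sketch below assumes a plausible form in which each state's rolling-forecast error is normalized by that state's standard deviation before averaging over time steps and states, which may differ from the exact definition used in the case study.

```python
import numpy as np

def an_rfmse(x_pred, x_true):
    """Hedged sketch of an average normalized rolling-forecast error.

    x_pred, x_true: arrays of shape (timesteps, n_states). Each state's
    squared error is normalized by that state's standard deviation in
    the true trajectory, then averaged over steps and states.
    """
    sigma = x_true.std(axis=0) + 1e-12   # guard against constant states
    err = ((x_pred - x_true) / sigma) ** 2
    return float(np.sqrt(err.mean()))
```

Such a normalization puts states with very different physical scales (masses, temperatures) on a comparable footing before averaging.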

Comparison of sparse and dense rolling forecast
Fig. 12 and Fig. 13 show the performance of 20 sparse and 20 dense DNN models with different parameter initializations forecasting the state variables x in one of the time series in the test set, S_test(i) = {X_i}, as defined in Eq. (33). The models are trained on a data set. Fig. 12 and Fig. 13 indicate that the forecasts of the sparse and dense models show similar performance for the first time steps after they are given the initial conditions. However, while the forecasts calculated by sparse models show a consistent, slow drift from the simulated values of x, the mean and standard deviation of the forecasts calculated by dense models suddenly drift exponentially. The narrow-banded standard deviation of the sparse neural networks indicates that these models converge to models with similar characteristics during training despite different parameter initializations. Furthermore, the consistent, slow drift between the sparse DNN model forecasts of x and the true values of x indicates that the sparse models generalize better, as they show good forecasting capabilities in a broader region than the dense DNN models. In short, Fig. 12 and Fig. 13 indicate better generalization of sparse DNN models than of dense DNN models, and that the convergence of model parameters for sparse DNN models is more robust to random initialization.
Figs. 14a to 14d report median and extreme values of AN-RFMSE_vec over four different time horizons for five groups of models trained on five data sets of different sizes. Hence, the results in Fig. 14 show how dense and sparse DNN models perform with varying amounts of training data over varying time horizons. Figs. 14a to 14d show that the groups of sparse models trained on data sets of varying size show similar results, both in terms of median AN-RFMSE and extreme values. However, as Fig. 14d shows, there seems to be a small trend that groups with more training data perform slightly better over longer forecasting horizons. Furthermore, the band between the minimum and maximum values of AN-RFMSE is overall relatively small for all groups of sparse models and all forecast horizons. The converging behavior of the performance of groups of sparse models as a function of the amount of training data indicates that only small amounts of data are required for the model parameters to become significant. While the sparse models show stable performance across groups with different amounts of training data, with AN-RFMSE increasing slowly and roughly proportionally to the length of the forecasting horizon, the same cannot be said about the dense models. For the dense models, Figs. 14a to 14d indicate a trend where the median, minimum, and maximum values of AN-RFMSE decrease significantly as the size of the training set increases. This expected trend indicates that performance improves with increasing data set size. However, the trend is not consistent for all groups of dense DNN models over all forecasting horizons. Furthermore, the maximum values of AN-RFMSE for groups of dense DNN models over longer forecasting horizons, such as in Fig. 14c and Fig. 14d, show that AN-RFMSE increases exponentially for some of the models within the groups. This indicates that the dense DNN models are likely to have some input regions where the model output is not sound: if the model estimate enters a region with poorly modeled dynamics, it might drift exponentially. For short-term prediction, that is, in Fig. 14a and Fig. 14b, the trend is that dense models show better performance in terms of median and minimum values, especially within the groups with large training sets. This may be because dense models have more flexibility due to their larger number of parameters. Furthermore, the sparse regularizers defined in Section 2.2 guide the structure of the network; they are based on assumptions about the model structure that might ignore some of the dynamics in the system. However, for longer forecasting horizons (Fig. 14c and Fig. 14d), sparse models always show better performance than dense models in terms of median AN-RFMSE. This is a typical example of a bias-variance trade-off. For all forecasting horizons and within all groups of training set sizes, the sparse models always show a smaller maximum value of AN-RFMSE.
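The ensemble forecasts behind Figs. 12-14 can be sketched as follows. This is a hedged illustration, not the paper's code: the explicit Euler integration and the f(x, u) model interface are assumptions.

```python
import numpy as np

def ensemble_rolling_forecast(models, x0, u_seq, dt=1.0):
    """Roll every model forward from the same initial condition and
    return the ensemble mean and standard deviation at each step.

    models: callables f(x, u) returning an estimate of dx/dt
    x0: initial state; u_seq: sequence of control inputs
    """
    trajs = []
    for f in models:
        x = np.asarray(x0, dtype=float)
        traj = [x.copy()]
        for u in u_seq:
            x = x + dt * f(x, u)      # explicit Euler step (assumption)
            traj.append(x.copy())
        trajs.append(np.stack(traj))
    trajs = np.stack(trajs)           # shape (n_models, T + 1, n_states)
    return trajs.mean(axis=0), trajs.std(axis=0)
```

The mean corresponds to the dotted line and the standard deviation to the shaded band plotted for the 20-model ensembles in Figs. 12 and 13.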

Training stability perspective
Sparsification of a large ensemble of neural networks gives similar sparse structures, as shown in the structure plots in Figs. 4-11. Furthermore, Table 5 shows that the sparse models to a large extent find the same feature basis for each of the model outputs {f̂1, ..., f̂8}.
Moreover, Fig. 14 clearly indicates, for all forecasting horizons, that the uncertainty bounds for dense models are much larger than those for sparse models. For the longer horizons, some of the dense models tend to blow up. This is probably because the model enters a region of the input space where it overfits, which can be seen as poor generalization to that specific area. For shorter prediction horizons, such as in Fig. 14a, the band between minimum and maximum AN-RFMSE is not of an order that indicates a blow-up; still, these uncertainty bounds are larger for dense models than for sparse models. This indicates that the sparse models are more likely to converge to similar minima than the dense models. The results in Fig. 14 also indicate that the sparse models are likely to converge already for small amounts of data, which is not the case for the dense models.
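One way to quantify the claim that independently initialized sparse models converge to similar structures is to compare their sparsity masks, e.g. with a Jaccard similarity over the sets of surviving weights. This metric is our own illustration, not one used in the paper:

```python
import numpy as np

def mask_jaccard(W_a, W_b, tol=1e-6):
    """Jaccard similarity between the sparsity patterns of two weight
    matrices: |A intersect B| / |A union B| over surviving weights."""
    a = np.abs(W_a) > tol
    b = np.abs(W_b) > tol
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else np.logical_and(a, b).sum() / union
```

Averaging this score over all pairs in a 20-model ensemble gives a single number for structural agreement; a value near 1 supports the training-stability observations above.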

Conclusions and future work
This article presents a sparse neural network model that approximates a set of nonlinear ODEs based on time series data sampled from the system variables included in the ODEs. The set of nonlinear ODEs represents an aluminum electrolysis simulator based on the mass and energy balance of the process, including nonlinear and interrelated models of electrochemical and thermal subprocesses. The sparsity in the model is achieved by imposing sparsity-promoting l1 regularization. The main findings are:
• The sparse neural network was more interpretable using domain knowledge of the aluminum electrolysis process, whereas the dense neural networks were completely black-box.
• Sparse neural networks were consistently more stable than their dense counterparts. This was reflected in the model uncertainty estimates based on a large ensemble of models.
• For short prediction horizons, the median accuracy of dense models is better than that of sparse models. However, the uncertainty bounds of dense models are much larger.
• For longer prediction horizons, sparse models outperform dense models both in terms of higher median accuracy and lower uncertainty bounds.
While these sparse models show promising results for interpretability and generalizability, there is still high potential for improvement. It is desirable to increase the prediction accuracy and decrease the bias of the sparse models. This might be addressed by investigating other sparsity structures at different layers that better balance the bias-variance trade-off. One possible direction is to inject simplified theories known from first principles into the neural network to possibly increase accuracy.
Hence, the resulting temperature-specific energy equation for component i in a control volume is given by: The latter equation states that the time derivative of the temperature in the control volume depends on the composition of species in the control volume. It is assumed that the temperature is equal for all components in the control volume. Furthermore, it is assumed that there is a common heat loss Q from a control volume to other control volumes, and that the electrical power Ẇel is performed on the whole control volume instead of on the individual components. When different components are mixed in a control volume, the enthalpy of the mix is more complex than the sum of the enthalpies of the individual species; however, this complexity is left out of the simulation model. The heat capacity of a mix of components in a control volume, c_p,cv, is simplified to be constant even though it varies with the composition and temperature in the control volume. The values for different species and control volumes are taken from [26] and [27]. Thus, the simplified simulation equation for the temperature derivative in a control volume is given by: where m_cp is the sum of masses in the control volume.
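A hedged sketch of the simplified temperature ODE described above. The argument names and the exact grouping of terms are assumptions on our part; the paper's equation may group the enthalpy and reaction terms differently.

```python
def temperature_derivative(W_el, Q_loss, enthalpy_flows, masses, cp_cv):
    """Hedged sketch of the simplified control-volume temperature ODE:
    electrical power in, common heat loss out, and net enthalpy carried
    by mass flows, divided by the total heat capacity of the mix.

    W_el: electrical power on the control volume [W]
    Q_loss: common heat loss to other control volumes [W]
    enthalpy_flows: net enthalpy rate leaving with mass flows [W]
    masses: species masses in the control volume [kg]
    cp_cv: (assumed constant) heat capacity of the mix [J/(kg K)]
    """
    m_cp = sum(masses) * cp_cv   # total heat capacity of the mix
    return (W_el - Q_loss - enthalpy_flows) / m_cp
```

The division by the summed heat capacity reflects the stated assumption that all components in the control volume share one temperature and one constant c_p,cv.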

Figure 1 :
Figure 1: Illustration of l1 regularization. β1 and β2 are the model parameters. β̂ is the MSE estimate. The ellipses show the contours of the error from the MSE estimate. The blue diamond illustrates the l1 constraint region.

Figure 2 :
Figure 2: Enumerated weights according to their input and output nodes between layer j and ( j + 1)

Figure 5 :
Figure 5: Most common learned structure for f̂2(x, u). 97% of the models f̂2 end up with this structure.

4.1.3. Model output f̂3
Fig. 6 shows the four most common learned structures of f̂3. f̂3 models the time derivative of the aluminum fluoride mass in the cell, ẋ3. The simulation model ẋ3 = f3 is given in Eq. (

Figure 12:
Figure 12: Sparse rolling forecast of the state variables {x1, ..., x8} at each time instant. The true values of x and the control inputs u are taken from one simulated set of test trajectories X_i ∈ X_test. The orange dotted line shows the average of 20 forecasts calculated by 20 sparse neural network models with different parameter initializations. The orange band shows the standard deviation of the same 20 forecasts.

Figure 13:

Fig. 14 shows the median, maximum, and minimum elements of the AN-RFMSE_vec vector. Subfigure captions: (14a) median, maximum, and minimum AN-RFMSE for 200 timesteps of forecasting; (14b) for 300 timesteps; (14c) for 500 timesteps; (14d) for 1000 timesteps.

Figure 14 :
Figure 14: Median, maximum, and minimum AN-RFMSE. There are five groups of models of both sparse and dense DNNs trained on data sets of different sizes. For each group of models, there is a corresponding colored bar indicating median values and an error bar indicating the maximum and minimum values of the vector AN-RFMSE_vec. The sizes of the training sets are indicated on the x-axis of the subplots. The blue bar shows the median AN-RFMSE among 20 sparse DNN models over 20 test sets. The orange bar shows the median AN-RFMSE among 20 dense DNN models over 20 test sets. For each subfigure, AN-RFMSE is calculated for the number of timesteps reported in the subfigure's caption (14a-14d). Notice the logarithmic scale of the y-axis.


Table 1 :
Table of states and inputs

Table 2 :
Constants in the simulator

Table 3 :
Initial conditions for system variables

Table 4 :
Control functions

Table 5 :
Frequency of learned features for each output {f̂1, ..., f̂8}. Each column i corresponds to an output f̂i of the neural network. Each row element j corresponds to one of the features {x1, x2, ..., x8, u1, u2, ..., u5}. The value of table element (i, j) is the percentage of the one hundred models of output f̂i in which feature j occurs.