Deep Learning, Explained: Fundamentals, Explainability, and Bridgeability to Process-based Modelling

Deep learning (DL) is an artificial intelligence (AI) technique that has already served the vast majority, if not all, of everyday society in tasks such as image recognition and language processing through smartphones. The recent unprecedented performance of DL in those tasks has accelerated applications in non-native areas such as earth and environmental sciences, where knowledge-based modelling has dominated to date. A major challenge, however, is that DL and knowledge-based modelling are rooted in different worldviews towards problem solving. This paper explains the 'whats' and 'whys' of DL from first principles, with an eye on applications in environmental problems since their inception. An experiment is run to illustrate the fundamental differences between the two worldviews, and to shed light on some critical, but often ignored, issues DL may face in practice. These issues largely arise from the fact that earth and environmental systems are complex, with behaviors changing in ways that are physically explainable but not seen in the period of record due to uncertain factors such as climate change. Such issues must be addressed at the heart of the endeavor to develop DL techniques that embrace the available knowledge base, in anticipation of breakthroughs in an age of big data and computational power.

The rise of deep learning

The last decade has witnessed a tremendous rise in techniques called 'deep learning' (DL), under the umbrella of artificial intelligence (AI) and machine learning (ML), and their unprecedented performance in areas such as computer vision (Krizhevsky et al., 2017), natural language processing (Young et al., 2018), and gaming (Silver et al., 2018). These successes have motivated the application of DL across a wide range of disciplines, including medicine (Hosny et al., 2018), earth sciences (Reichstein et al., 2019), robotics (Torresen, 2018), engineering (Panchal et al., 2019), and finance (Lee et al., 2019). DL owes its exemplary success to the boom in computational power and the emergence of big data sources and associated data storage and sharing technologies.

Earth and environmental sciences appear to be positioned to benefit profoundly from DL, as big data sources on a range of in situ and remotely sensed variables are becoming increasingly available with advances in sensing technologies (Reichstein et al., 2019). The storage volume of remote sensing data for earth observations is already well beyond dozens of petabytes, with transmission rates exceeding hundreds of terabytes per day. Datasets based on model outputs are also growing; for example, the climate assessment dataset provided by the Coupled Model Intercomparison Project Phase 6 may reach 40 petabytes (Eyring et al., 2016). Reanalysis climatic datasets have grown as well; for example, NASA's Modern-Era Retrospective Analysis for Research and Applications version 2 (MERRA-2) is ~400 terabytes (Gelaro et al., 2017). In addition, datasets generated via tens of thousands of citizen science projects are providing large and rich sources of ground-based data.

This potential is shifting the attention of earth and environmental scientists and relevant funding agencies towards ML, as evidenced, for example, by the shift in research work presented at the American Geophysical Union (AGU)'s fall meetings, the largest assembly of earth and environmental scientists, with more than 27,000 people in attendance and 25,000 presentations in 2019. The number of ML-related presentations has risen consistently, from 0.2% of total presentations in 2015 to 4.2% in 2020. In particular, this shift has been astonishing in the 'non-linear geophysics', 'earth and space science informatics', 'natural hazards', 'hydrology', and 'seismology' sub-fields, where 28% (2.1%), 18% (5.1%), 9% (1.3%), 7.5% (1.4%), and 6.7% (0.9%) of total presentations, respectively, were related to ML in 2020 (2015).
Notably, most DL algorithms, formerly known as artificial neural networks (ANNs), have been around and widely applied in earth and environmental sciences since the early 1990s, with the birth of domains such as Hydroinformatics (Abbott, 1991). These applications are documented in reviews by Gardner and Dorling and others. Nevertheless, the uptake of DL to facilitate and advance earth and environmental sciences has not kept pace with data availability and computational power over the past three decades.

But why? The challenges impeding the widespread application of DL to earth and environmental problems to date may be rooted in the fact that convincingly casting those problems, for which an extensive knowledge base is usually available, within the DL framework is often not straightforward. Moreover, the lack of interpretability and explainability of DL has been a major hindrance, as model developers need to be able to make sense of why a model functions the way it does, and to explain that to model users. These challenges can be further complicated in the absence of a solid understanding of the fundamentals of DL and how they differ from theory-driven, mechanistic modelling and prediction. Mechanistic modelling, also called process-based or knowledge-based modelling in this paper, has traditionally been the cornerstone of scientific advancement and policy support.

And why this paper? Motivated by the recent breakthroughs by DL in its original areas of application, namely computer vision and natural language processing, this paper aims to address the persistent challenges facing DL applications in non-native areas related to earth and environmental sciences. With this overarching aim, this paper addresses 10 questions regarding the fundamentals of DL and its explainability and bridgeability to earth and environmental systems modelling:

(1) What is DL and how did it evolve from ANNs?
(2) How can we interpret the internal functioning of DL?
(3) How can the complexity of DL be justified in light of the principle of parsimony?
(4) Why is DL considered superior to other types of ML?
(5) How can DL account for memory and time dependency?
(6) How may DL and process-based models behave differently in out-of-sample prediction?
(7) What can be the often ignored value of domain knowledge in DL?
(8) Why is DL essentially different from process-based modelling?
(9) What are the existing approaches to bridging DL and process-based modelling?
(10) What can we learn from prominent DL applications such as gaming and the stock market?

The structure of this paper is such that it best serves the reader when all sections are followed sequentially. However, an advanced reader could directly refer to a section designated to address a question of interest. Sections 2 through 7 address questions 1 through 6, and the remaining questions are addressed in the sub-sections of Section 8.

ML has been broadly described as the science of getting computers to learn from data without being explicitly programmed (Jordan and Mitchell, 2015). According to Goodfellow et al. (2016), however, the early efforts to generate AI were based on a knowledge base paradigm to instruct machines with a formal set of step-by-step mathematical and if-then rules. Those efforts focused on carrying out tasks that were intellectually difficult for humans but straightforward for computers. Goodfellow et al. (2016) argue such efforts led to no major successes, and the AI of today is about enabling machines to perform tasks that humans perform intuitively and rather easily but have difficulty formally describing how they do so. Examples of such tasks include recognizing faces in a photo or comprehending spoken words.

Not only did state-of-the-art AI divorce from the knowledge base, but it also completely separated from classic data-driven modelling rooted in statistics, such as regression.
This separation was a response to the need for models that are not constrained by the many assumptions typical statistical models hold. For example, traditional statistical modelling requires a formalization of relationships between variables and assumptions about functional shapes, distributions of variables, and their inter-dependencies, which enables hypothesis testing and the generation of confidence bounds. Conversely, in the ML context, the underlying relationships in data may have any complex form, which is typically unknown a priori, and the data used may have any size and distributional properties (see Dangeti, 2017, pp. 10-11).

Because of these characteristics, ML is deemed suitable for pursuing the longstanding ambition to build machines that work with minimal or no human supervision and imposed assumptions. As a result, ML techniques nowadays, and in particular DL, provide flexible tools that can adapt to a wide range of data and applications.

Evolution of DL and major milestones
It was 1957 when Frank Rosenblatt invented the first algorithm, termed the 'perceptron' (Rosenblatt, 1957), which today forms the smallest computational unit of DL. A perceptron, alternatively termed a 'neuron' because of its resemblance to the basic working unit of the brain, is shown in Figure 1a and formulated as:

y = f(w_1 x_1 + w_2 x_2 + ... + w_D x_D + b)        (Eq. 1)

where D is the dimension of the input space, x is the input vector, w is a set of weights corresponding to the input vector, b is a bias, and f is an 'activation' function. A perceptron has D+1 tunable parameters (i.e., D weights and one bias) and is basically nothing but a multiple linear regression augmented by a non-linear output function (f). The form of the activation function was originally a step function, but now a range of monotonic functional forms, such as the 'sigmoidal', are used.

The invention of the perceptron created significant excitement in the AI community and beyond. But it soon became clear that a perceptron would not be able to map input spaces that are not linearly separable, such as the XOR problem (Minsky and Papert, 1969), rendering perceptrons of limited use in real-world applications. The reason for this inability is that the core of the perceptron is a linear regression.

Efforts to overcome this barrier could have followed two different avenues. Perhaps the most intuitive avenue was to employ non-linear regression, by allowing the terms inside the parentheses in Eq. 1 to be of other algebraic forms, such as quadratic. However, this was not a viable option, in part because the user would then need to specify the form of non-linearity, which is not typically known a priori, requiring possibly extensive trial-and-error.
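Eq. 1 can be sketched in a few lines of code (a minimal illustration; the sigmoidal default, the step activation, and the AND example below are illustrative choices, not from the paper):

```python
import math

def perceptron(x, w, b, f=lambda a: 1.0 / (1.0 + math.exp(-a))):
    """A perceptron (Eq. 1): weighted sum of D inputs plus a bias,
    passed through an activation function f (sigmoidal by default)."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(a)

# With a step activation and suitable weights, a single perceptron can
# realize a linearly separable mapping such as logical AND, but no choice
# of w and b can reproduce XOR, which is not linearly separable.
step = lambda a: 1 if a >= 0 else 0
AND = lambda x1, x2: perceptron([x1, x2], w=[1, 1], b=-1.5, f=step)
```

Here AND(1, 1) fires while AND(0, 1) does not; the impossibility of doing the same for XOR with a single unit is what motivated the multi-layer architectures described next.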
The second avenue, which led to today's DL, was to combine perceptrons both in parallel and in series to create so-called 'multi-layer perceptrons' (MLPs), as shown in Figure 1b, with the hope that this more complex system could overcome the barrier. An MLP would then have many more tunable parameters than the perceptron.

But MLPs on their own did not go far, and the field stagnated for many years because of the absence of an algorithm that could automatically derive the network weights and biases from data, a process referred to as 'training' in the AI community. It took until the mid-1980s for the first 'back-propagation' (BP) algorithm to be invented, enabling the training of MLPs with any network structure (Rumelhart et al., 1986). This invention marked the beginning of the 'second wave' of popularity of ANNs. BP is essentially an optimization algorithm, based on non-linear programming, that minimizes a loss function representing the goodness-of-fit of predictions to observations, such as the 'sum of squared errors':

E = Σ_{k=1}^{M} Σ_{j=1}^{N} (y_jk - t_jk)^2        (Eq. 2)

where y_jk is the output of neuron j in the output layer when the network is forced with input data sample k, and t_jk is the respective desired target. Also, M is the size of the training data, and N is the number of neurons in the output layer.

Different variations of BP rooted in first-order (e.g., gradient descent) or second-order (e.g., Newton's method) optimization, or a combination thereof, now exist; see, e.g., the Levenberg-Marquardt algorithm as implemented by Hagan and Menhaj (1994). These algorithms are fundamentally the same as the optimization algorithms used nowadays for calibration of process-based models. The only difference is that, in the case of ANNs, and unlike most process-based models, the partial derivatives of the loss function with respect to weights and biases are analytically available and obtained through the 'chain rule of differentiation'.
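A minimal numerical sketch of BP for a single-hidden-layer MLP, using gradient descent on the loss of Eq. 2, is given below (the XOR task, network size, learning rate, and epoch count are illustrative choices, not from the paper):

```python
import numpy as np

# An MLP with one sigmoidal hidden layer and a linear output neuron, trained
# by back-propagation (gradient descent on the sum of squared errors, Eq. 2).
rng = np.random.default_rng(0)
sig = lambda a: 1.0 / (1.0 + np.exp(-a))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs
T = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # hidden layer (4 neurons)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # linear output layer

def forward(X):
    H = sig(X @ W1 + b1)
    return H, H @ W2 + b2

def sse(Y, T):
    return float(np.sum((Y - T) ** 2))          # Eq. 2

_, Y0 = forward(X)
loss0 = sse(Y0, T)
lr = 0.1
for epoch in range(10000):
    H, Y = forward(X)
    dY = 2 * (Y - T)                            # dE/dY
    dW2, db2 = H.T @ dY, dY.sum(axis=0)         # output-layer gradients
    dH = dY @ W2.T * H * (1 - H)                # chain rule through the sigmoid
    dW1, db1 = X.T @ dH, dH.sum(axis=0)         # hidden-layer gradients
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

After training, the loss drops far below its initial value: the MLP learns XOR, the very task that defeats a single perceptron.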
More recently, derivative-free and metaheuristic optimization algorithms have also shown promise in ANN training (e.g., Dengiz et al.). In addition, training samples may be presented to the optimizer all at once ('batch' training) or one at a time or in small subsets (Hagan et al., 1996). The latter approaches, commonly referred to as 'stochastic gradient descent', are useful when the size of the training data is large (Bottou, 1998; Bottou, 2010).

In the late 1980s, after the invention of BP, MLPs were proven to be 'universal approximators' (Hornik et al., 1989). This proof indicated that MLPs with only a single hidden layer possessing a sigmoidal activation function, and a linear output layer, are able to approximate any function to any desired level of accuracy, provided the number of hidden neurons is sufficient. Since then, the 'universal function approximation theorem' has been the fundamental driver of interest in MLPs across a variety of disciplines and applications.

Despite all of these advances, investments in ANNs, and therefore their popularity in the AI community, began to decline in the mid-1990s. This was perhaps triggered by failures to fulfill overly ambitious or unrealistic promises by prominent AI scientists (Goodfellow et al., 2016) that brought about a somewhat negative reputation for ANNs (Duerr et al., 2020), as historically observed in 'AI winters' (Hendler, 2008). There was (and perhaps still is) no consensus about a proper network depth, because identifying the optimal network configuration for a given problem and dataset is challenging. Historically, some researchers favored ANNs with more than one hidden layer, arguing that they require fewer hidden neurons to approximate the same function (see e.g., Tamura and Tateishi, 1997).

One might ask how unsupervised learning can be of any help in supervised learning. A common method for this purpose uses 'autoencoders', which are a class of ANNs historically used for dimensionality reduction and feature learning (Bourlard and Kamp, 1988).
An autoencoder is an MLP, typically trained by BP, with one or more hidden layers, that receives an input and aims to produce the same input as its output. In a typical autoencoder, the middle layer has fewer neurons than the dimension of the input, thereby acting as a bottleneck that encodes the input data in a lower-dimensional space. The signals in the middle layer preserve the information contained in the inputs, which is decoded back to the original space in the following layers. Autoencoders can pre-train some layers of a deep ANN such that the weights of those layers capture the main features in the input data before passing them to the next layers. After the pre-training phase by unsupervised learning, the ANN needs to be further trained in the conventional supervised manner, using the actual output data and algorithms such as BP.

Geometrical Interpretation of DL

This section utilizes a geometrical interpretation of ANNs to illustrate their internal functioning and to explain why deeper ANNs can be more powerful than 'shallower' ANNs in learning representations in data. This interpretation is adopted in part from the work of Razavi and Tolson (2011), in which they recast ANNs with respect to a new set of more interpretable variables based on the network's functional geometry.

A perceptron
An MLP is in principle made of a number of perceptrons. Consider an MLP with a single hidden layer with a sigmoidal activation function, as shown in Figure 2a. Each hidden neuron, e.g., the r-th neuron, is a perceptron whose output is multiplied by its associated weight before entering the output neuron. This hidden neuron, when having only one input, forms a functional relationship such as that shown in Figure 2b. This 'sigmoidal unit' can be characterized by three variables: 'slope', 'location', and 'height'. There is a one-to-one mapping between these variables and the original network weights and biases, as shown in the figure. As such, one can directly control the shape of the sigmoidal unit through slope, location, and height, and, where needed, map them onto the network's original variables. The benefit of doing so is that, unlike the original variables, the new variables are geometrically interpretable and therefore more intuitive.

Figure 2c shows the geometry of a perceptron with two inputs. In this case, the resulting sigmoidal unit forms a plane that can be characterized by slope, location, and height, plus an additional variable called 'angle' that specifies the direction toward which the sigmoidal unit is facing. This geometry can be extended to perceptrons with three or more (say D) inputs, where the sigmoidal unit becomes a higher-dimensional surface.

For example, suppose the objective in a two-input problem is to approximate the dome-like feature shown in Figure 4a. A single-hidden-layer ANN with four sigmoidal hidden neurons and one linear output neuron would be able to approximate the dome part of the surface, as shown in Figure 4b. This ANN would basically superimpose four sigmoidal units with equal heights, equal slopes, and equal locations, but different angles (90° apart). The performance of this ANN, however, is unacceptable, as it creates erroneous features on the tails.
But can we rectify this issue by using more sigmoidal neurons? Figure 4c shows the performance of a network with eight sigmoidal units, all having the same heights, slopes, and locations, but different angles, 45° apart. With more sigmoidal units at work, the performance at the tails is improved, producing fewer erroneous features. Almost 40 hidden neurons are required, as shown in Figure 4d, to generate smooth tails similar to the original function shown in Figure 4a. This example provides a geometrical proof of the universal function approximation theorem of Hornik et al. (1989) because, in principle, any function could be approximated by a combination of such dome-like (i.e., basis) functions. The challenge, however, is that many (possibly an excessively large number of) hidden neurons may be required for a given problem to attain a desired level of approximation accuracy.

Recall that only four sigmoidal units were required, as seen in Figure 4b, to reproduce the dome-like feature at the center. One might ask: can we stick to these four sigmoidal units and somehow smooth the tails? Yes; all that is needed is a second layer with a non-linear activation function (e.g., sigmoidal) to deactivate any feature that is under a threshold. In other words, in this process, the geometry formed by the sigmoidal units in the first layer filters through another sigmoidal unit that bounds that geometry. Figure 4e shows how adding the second non-linear layer enables the network to reproduce the original function with only four neurons in the first hidden layer. Similar to single-hidden-layer ANNs, those with two hidden layers can approximate any function by putting such dome-like functions side by side.

In general, shallower ANNs are special cases of deeper ANNs. As shown in the example, deeper ANNs can provide more flexibility while requiring fewer hidden neurons across the network for representation learning.
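The dome construction and the tail-smoothing effect of a second non-linear layer can be reproduced numerically (a sketch; the slope, offset, and threshold values are illustrative, not those used for Figure 4):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dome(x1, x2, slope=6.0, offset=1.0):
    """Superimpose four sigmoidal units with equal heights, slopes, and
    locations but angles 90 degrees apart (facing +x1, -x1, +x2, -x2).
    Their sum peaks near the origin but leaves erroneous ridge features
    along the axes, as in Figure 4b."""
    s = (sigmoid(slope * (x1 + offset)) + sigmoid(-slope * (x1 - offset)) +
         sigmoid(slope * (x2 + offset)) + sigmoid(-slope * (x2 - offset)))
    return s - 2.0  # shift so the far corners sit near zero

def dome_two_layer(x1, x2):
    """A second sigmoidal layer thresholds the first-layer geometry,
    deactivating any feature below the threshold and smoothing the tails."""
    return sigmoid(8.0 * (dome(x1, x2) - 1.5))
```

Evaluating these functions shows the effect: dome() is near 2 at the center and near 0 at the corners, but leaves a ridge of height ~1 along each axis tail; dome_two_layer() keeps the central dome while suppressing those ridges to nearly zero.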
However, the training of deeper ANNs has historically been much more difficult because of the now well-known problem of 'vanishing and exploding' gradients. This problem relates to the fact that the partial derivatives of a loss function (Eq. 2) with respect to weights and biases in the first layers, obtained via the chain rule of differentiation, tend to become very small (i.e., close to zero) or very large (i.e., exponentially growing or fluctuating). Improved algorithms, along with higher computational power, have now eased that difficulty and made possible the training of very deep ANNs (Schmidhuber, 2015b).

Lastly, a related consideration about the proper number of hidden layers is the fact that, in many problems, only a small part of the input space is active. In other words, some combinations of the different inputs might not occur in reality, and therefore the accuracy of the ANN might not matter much in the regions of the input space containing those combinations. For example, consider a case similar to the one shown in Figure 4b, where the corners of the input space do not show up in the available data. A hydrological example is where snowfall and temperature are two inputs to an ANN. Because snowfall would never occur along with high temperature, the respective part of the input space always remains inactive.

Relevance of Occam's razor and equifinality?
Occam's razor, or the principle of parsimony, indicates that simpler hypotheses or models should be preferred over more complex ones. In other words, those models that serve the purpose with as few parameters as possible should be chosen. However, many data-driven modellers, in particular in the field of ML, work with highly complex models and manage the risk of overfitting through strategies such as 'early stopping' and 'regularization'.

In the 'early stopping' strategy, the quality of fit to the 'validation' dataset is evaluated after each 'epoch', that is, an optimization iteration trying to minimize the loss function on the 'training' data (see Section 2.2). Empirically speaking, as the training error decreases over time, the validation error decreases as well for a while. However, at some particular epoch, the validation error may begin to increase while the training error keeps decreasing (see Figure 5). This epoch is deemed to mark the beginning of overfitting; thus, the user stops the training process. This strategy is called 'early stopping' in the sense that the training stops early, before it can further improve the fit to the 'training' dataset (for a review, see Prechelt, 1998). When the training process stops, the generalizability of the trained network is assessed via out-of-sample prediction on the 'testing' dataset.

The other strategy, 'regularization', promotes smoothness in the network response, removing features, which are unsupported by data, from the overall network response.

But how can one balance the goodness of fit and the smoothness of the network response? In practice, this is a bi-objective optimization problem, where one objective is to minimize the error function and the other is to minimize the regularization function. These two objective functions are commonly integrated into one loss function via weighting schemes. Figure 6 shows how the two objectives compete in a real example. Ideally, one may wish to achieve a performance such as that shown in Figure 6e.
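The weighting scheme just described can be sketched as a composite loss (a generic form; the squared-weight penalty as a proxy for non-smoothness and the value of alpha are illustrative assumptions):

```python
def composite_loss(errors, weights, alpha=0.01):
    """Combine goodness of fit (sum of squared errors, as in Eq. 2) with a
    regularization term into a single loss via a weighting scheme.
    Larger alpha favors smoother network responses over a tighter fit."""
    sse = sum(e ** 2 for e in errors)       # fit objective
    reg = sum(w ** 2 for w in weights)      # regularization objective
    return sse + alpha * reg
```

With alpha = 0 the loss reduces to pure goodness of fit; increasing alpha trades fit for smoothness, which is exactly the competition between the two objectives shown in Figure 6.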
Achieving such a balance is not trivial, however, because in practice the underlying function is unknown, the available data are limited, and response surfaces are multi-dimensional and cannot be easily visualized. The Bayesian regularization method developed by MacKay (1992), and extended by Foresee and Hagan (1997), has proven useful for adaptively assigning the weights associated with each function during training.

Another popular heuristic against overfitting is 'dropout', in which randomly selected parts of an ANN are temporarily inactivated during training. When a part of an ANN is inactivated in this process, the resulting network is called a 'thinned' network. The ultimate prediction after training with dropout is viewed as an approximation of the ensemble average of predictions by many independent ANNs. Basically, the many different thinned networks created throughout the process are assumed to represent ANNs with different configurations and parameters. This heuristic discourages neurons from co-adapting too much and, as such, is believed to avoid overfitting.

Fundamental differences from other ML methods

Many other ML methods learn through 'local representations' based on kernel functions; examples include radial basis function networks (Broomhead and Lowe, 1988), Gaussian emulator machines (Kennedy and O'Hagan, 2000), and support vector machines (Vapnik, 1998). Each kernel typically has a limited radius of influence in the input space, and therefore only responds to inputs located in its local neighborhood.

Conversely, a unique feature of ANNs is their ability to learn through 'distributed representations' (Hinton et al., 1986). They typically represent an entity via collective efforts distributed among multiple processing units (e.g., sigmoidal units). Unlike kernel functions, sigmoidal units typically have large regions of influence (see e.g., Figure 2c) that overlap each other in the input space (see e.g., Figure 4b). The former figure shows that a sigmoidal unit influences the entire input space by dividing it into three zones: lower tail, upper tail, and slope. The latter figure shows how the influences of four such sigmoidal units are superimposed to generate the network response.
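The contrast between local and distributed units can be illustrated numerically (a sketch; the Gaussian kernel and sigmoidal unit below are generic one-dimensional examples, not from the paper):

```python
import math

def gaussian_kernel(x, center=0.0, radius=1.0):
    """A local unit: responds only near its center and decays to zero
    elsewhere, so it has a limited radius of influence."""
    return math.exp(-((x - center) / radius) ** 2)

def sigmoidal_unit(x, slope=1.0, location=0.0):
    """A distributed unit: influences the entire input space, dividing it
    into a lower tail, a slope region, and an upper tail."""
    return 1.0 / (1.0 + math.exp(-slope * (x - location)))
```

Far from the origin (e.g., x = 10), the Gaussian kernel's response is essentially zero, while the sigmoidal unit still responds at full height; this is why kernel units act only in a local neighborhood, whereas overlapping sigmoidal units collectively shape the response everywhere.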

Implications for users
The use of distributed representations has several practical implications. To the author's knowledge, these include:

 Transparency: The internal functioning of methods based on local representations is more transparent. Local representations are the most straightforward and easy-to-interpret way of learning, whereas distributed representations can be complex, often leading to emergent properties that cannot be easily explained by local representations.

 Learning difficulty: Distributed representations are more difficult and time-consuming to learn. In local representations, the role of each processing unit may be assigned independently of the other units, but in distributed representations, many processing units may be configured together in complex ways to represent a feature in the data. See Razavi et al. (2012a, Section 2.6.2) for a discussion on this issue.

In addition, ANNs are essentially multi-output models, because they can have as many output neurons as required for a given problem. This means a single ANN can simultaneously predict different variables while accounting for their possible cross-correlations. Many other ML methods are, however, single-output models. For example, in the case of support vector machines, one needs to develop two independent models to be able to predict two different variables in a system. Refer to Razavi et al. (2012a, Section 2.6.5) for an extensive discussion on this matter.
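The multi-output property can be illustrated with a single forward pass: the output layer simply has as many neurons as target variables, all sharing one hidden representation (a sketch with illustrative sizes and random weights):

```python
import numpy as np

def forward_multi_output(x, W_hidden, b_hidden, W_out, b_out):
    """One ANN predicting several variables at once: every output neuron
    reads the same hidden layer, so cross-correlations between targets can
    be captured by the shared hidden representation."""
    h = np.tanh(W_hidden @ x + b_hidden)
    return W_out @ h + b_out

rng = np.random.default_rng(1)
x = np.array([0.3, 0.7])                          # two inputs
W_h, b_h = rng.normal(size=(4, 2)), np.zeros(4)   # shared hidden layer
W_o, b_o = rng.normal(size=(3, 4)), np.zeros(3)   # three output neurons
y = forward_multi_output(x, W_h, b_h, W_o, b_o)   # three predictions at once
```

A kernel-based single-output method would instead need three separately trained models to produce the same three predictions.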

MLPs provide a static mapping from inputs to outputs. However, many applications require mappings with a formal representation of time evolution and memory. To enable MLPs to do so, two general sets of tools, and combinations thereof, have been used in the literature: (1) tapped delay lines and (2) recurrent connections. These tools are explained in the following.

Recurrent connections
TDLs, as described in Section 6.1, explicitly represent time with a memory unit of limited length. Unlike TDLs, recurrent connections, first introduced by Jordan (1986), enable ANNs to account for time evolution based on an implicit memory concept, which is theoretically of unlimited length and is highly context dependent (Elman, 1990). Recurrent connections receive the outputs of a layer at every time step and feed them back to the same or some other layer at the next time step. Technically, they do so via a 'context unit' that stores those outputs in a set of delay boxes (Figure 7c). Recurrent connections can be installed on one or more layers (e.g., Jordan, 1986; Elman, 1990) or locally on some select neurons (e.g., Frasconi et al., 1992).

An MLP enabled with recurrent connections is commonly called a 'recurrent neural network' (RNN). An RNN can possess many more tunable parameters than an MLP with the same number of layers and neurons. Using the example given in Section 6.1, an MLP with three inputs and 10 neurons in the first hidden layer would have 30 weights in that layer, whereas adding recurrent connections to that layer (e.g., Figure 7c) would add 100 more weights (130 in total).

Unlike TDNNs, which possess a short-term memory, RNNs can in theory represent long-term dependencies in the input sequence as well. In practice, however, recurrent connections have difficulty representing long-term memory, because they can easily become dominated by short-term memory. In other words, even very small features arising from short-term dependencies tend to mask features arising from long-term dependencies. In addition, RNNs are prone to the 'exploding and vanishing' gradients problem in their training (Bengio et al., 1994). This is because RNNs, even with a single hidden layer, are in principle deep networks implicitly possessing an infinite number of recursive layers.
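The weight counts above can be verified with a short sketch (a generic fully connected layer; the function name is illustrative):

```python
def layer_weights(n_inputs, n_neurons, recurrent=False):
    """Number of weights in a fully connected layer (biases excluded).
    A recurrent layer adds n_neurons x n_neurons feedback weights that
    carry the layer's own outputs back from the previous time step."""
    w = n_inputs * n_neurons
    if recurrent:
        w += n_neurons * n_neurons
    return w
```

For three inputs and 10 neurons, this gives 30 weights for the feed-forward layer and 130 once recurrent connections are added, matching the example in the text.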

Gate layers to forget or preserve over time
To explicitly account for and balance both short- and long-term dependencies in input sequences, Hochreiter and Schmidhuber (1997) introduced the 'long short-term memory' (LSTM) network, which augments the recurrent architecture with 'gate' layers. Each gate layer produces a response between zero and one using a logistic function. These responses are then multiplied by their respective signals flowing through the context, which means a value of zero would kill a signal, whereas a value of one would fully preserve it. Due to the additional weights and biases in the gate layers, an LSTM typically has many more tunable parameters than a conventional RNN.

LSTMs are now perhaps the most popular and widely used type of ANNs with memory. However, LSTMs took a long time (more than a decade) to become known and mainstream, particularly beyond their core computer science community. Their widespread application nowadays owes much to recently developed software tools, such as Python's TensorFlow, that efficiently implement variations of LSTMs for a range of problems.
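A single step of an LSTM cell can be sketched to show how the gates act multiplicatively on the context (a minimal textbook formulation; the weight shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One time step of a standard LSTM cell. Each gate squashes its input
    to (0, 1) with a logistic function and then multiplies the signal it
    controls, so 0 kills the signal and 1 fully preserves it."""
    z = np.concatenate([x, h])                  # current input + previous output
    f = sigmoid(Wf @ z + bf)                    # forget gate: keep old context?
    i = sigmoid(Wi @ z + bi)                    # input gate: admit new signal?
    o = sigmoid(Wo @ z + bo)                    # output gate: emit context?
    c_new = f * c + i * np.tanh(Wc @ z + bc)    # updated context (cell state)
    h_new = o * np.tanh(c_new)                  # emitted output
    return h_new, c_new
```

With a strongly positive forget-gate bias, f is near 1 and the old context is preserved indefinitely; with a strongly negative one, f is near 0 and the context is erased. This is how LSTMs balance short- and long-term dependencies.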

Training considerations when the order of data matters
The training of memory-enabled ANNs, such as TDNNs, RNNs, and LSTMs, differs from that of standard ANNs in the way time-ordered data are presented to the network. To train standard ANNs, the data entries can be presented in any order, even randomly, for example through stochastic gradient descent (Bottou, 2010). In memory-enabled ANNs, however, the data entries should be presented in order of occurrence so that the structure of the time dependency is preserved. While this point might seem trivial, it requires careful attention in practical applications.

Another point to consider in the training of memory-enabled ANNs is that all data entries are typically viewed as having equal importance, regardless of their location in the sequence. When used in an online operational forecast, however, the 'forgetting factor' approach can be used to discount older samples. This approach allows the network to adapt to non-stationary environments, where more recent data are more representative of the underlying processes than older data (Razavi and Araghinejad, 2009).

Lastly, elements of TDNNs and RNNs can be combined in a variety of ways. A well-known combination is 'time-delay recurrent neural networks', developed by Kim (1998).

This section provides an experiment that runs and compares both types of models for the same problem and walks the reader through all of the steps involved. In particular, the processes around calibration and validation, the role of physics, and interpretations of out-of-sample prediction are discussed. This experiment is performed in the context of hydrologic modelling, which has seen tremendous progress over the years with respect to both ML and process-based modelling.

Data and models
The case study aims to model the hydrologic system of the Oldman River watershed in Alberta, Canada. This watershed has an area of 1,434.73 km² at Waldron's Corner, with a long-term average temperature of 2.2 °C. On average, this watershed receives 611 mm of precipitation (rainfall + snowfall) annually and generates 11.7 m³/s of river flow. Figure 8 shows the 30-year-long daily time series data used. The first 22 years were used for model 'calibration' (i.e., the 'seen' data in model development) and the last eight years for model 'validation' (i.e., the 'unseen' data in model development). The first three months of the calibration period were used for model spin-up. In the case of DL, the calibration period was further broken into 'training' (17 years) and 'testing' (5 years) periods, the latter for early stopping of the training process to avoid overfitting. Note that, as explained in Section 4.2, the naming convention in the DL context for the 'validation' and 'testing' periods is often the other way around.

To model this system, an LSTM configuration was chosen here as a state-of-the-art DL model that accounts for time dependency and memory. The inputs to the LSTM model are daily precipitation and temperature (Figures 8a and d) and the output is the concurrent flow (Figure 8e). The LSTM structure was rather arbitrarily chosen to have one hidden layer with five neurons, resulting in 166 calibration parameters. For benchmarking purposes, a classic hydrologic model called HBV (Lindström et al., 1997), as implemented in HBV-SASK (Razavi et al., 2019), was used. HBV-SASK is based on a conceptualization of the physical principles governing water movement in a watershed, using 12 calibration parameters. Each of these parameters has a physical interpretation and a physically justified feasible range (see Figure 9 and Table 2). The full calibration period was used for HBV calibration.
The validation period was used to evaluate the performance of both LSTM and HBV in out-of-sample prediction.
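A minimal sketch of the data split described above, using placeholder data rather than the Oldman River observations (years are approximated as 365 days for simplicity):

```python
import numpy as np

# Illustrative split of a 30-year daily record; the synthetic series here
# is a placeholder, not the actual flow data from the case study.
n_days = 30 * 365
flow = np.random.default_rng(0).random(n_days)

cal_end = 22 * 365                        # first 22 years: calibration ('seen')
calibration, validation = flow[:cal_end], flow[cal_end:]   # last 8 years 'unseen'

spin_up = 3 * 30                          # first ~3 months used for model spin-up
calibration_effective = calibration[spin_up:]

# DL only: calibration is further split into training and testing periods,
# the latter reserved for early stopping.
train_end = 17 * 365
training = calibration[:train_end]        # 17 years
testing = calibration[train_end:]         # 5 years
```

Because order matters for memory-enabled models, all splits are contiguous slices in time rather than random partitions of the record.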

Model performance in calibration
The model calibration problem was cast as an optimization problem that tries to maximize the goodness of fit to data by tuning the model parameters, with the Nash-Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970) as the objective function. NSE is essentially a normalized version of the mean squared error, computed as 1 − [VAR(errors)/VAR(observations)]. As such, an NSE of one indicates a perfect fit, and an NSE of zero indicates the model prediction is no better than the average of the observations. As a rule of thumb, hydrologists often call an NSE of 0.7 or higher an acceptable fit.

The LSTM model was calibrated using BP with the early-stopping strategy to avoid overfitting. In each epoch, the training period data were used to update the network parameters, while the testing period data were used to detect possible overfitting. Five independent replicates of LSTM calibration (with different initial random seeds) were conducted to account for possible variability in model performance. Figure 9a shows the training results of the five replicates compared to a case where the training would not have stopped. As expected, the LSTM performance keeps improving in training, whereas in testing it begins to significantly degrade at some point. The objective function in training came very close to one after many more epochs but with very poor performance in testing (not shown).

The HBV-SASK model was calibrated by a multi-start Newton-type optimization algorithm. Similar to LSTM, five independent replicates of HBV-SASK calibration were run. Note that the calibration performance of HBV-SASK shown herein is almost the best the author has achieved so far for this watershed. Based on these results, the superiority of LSTM over HBV-SASK in calibration is quite significant from a hydrologic modelling point of view.
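The NSE objective can be written directly from its definition; the implementation below uses the standard sum-of-squares form, which matches the variance-ratio form above when the errors are unbiased. The short series are made up for illustration.

```python
import numpy as np

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 minus the sum of squared errors over the
    sum of squared deviations of observations from their mean.
    1 indicates a perfect fit; 0 means no better than the mean of observations."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    errors = observed - simulated
    return 1.0 - np.sum(errors ** 2) / np.sum((observed - observed.mean()) ** 2)

# Illustrative (made-up) observed and simulated flows.
obs = np.array([1.0, 2.0, 3.0, 4.0])
sim = np.array([1.1, 1.9, 3.2, 3.8])

score = nse(obs, sim)   # close to 1, indicating a good fit on this toy example
```

Predicting the observed series exactly yields an NSE of one, while predicting the mean of the observations for every time step yields an NSE of zero, consistent with the interpretation given above.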
The performance of the two models in validation is discussed in Section 7.4, but before that, let us discuss what information the two models contained prior to calibration.

An LSTM contains no prior knowledge of the system; whatever it comes to know must be learned from patterns represented in the data. Figure 10a shows the performance of arbitrarily chosen LSTM replicates before and after calibration. The model response to inputs before calibration seems completely random but, after calibration, the model has learned to closely follow the underlying system response.

Unlike LSTM, HBV-SASK encodes the expert knowledge available in the field of hydrology. This model is a collection of conservation-of-mass equations and process parametrizations that represent how hydrologists conceptualize the way a watershed works. This 'physically based' modelling structure is presumably able to emulate the behavior of any watershed by tuning only 12 parameters. Figure 10b shows how the model performs before calibration, with parameter values chosen to be at the midpoint of their ranges, and after calibration. The figure shows that the 'uncalibrated' model responds reasonably to the inputs; it generally captures the timing of flows and emulates the low-flow segments well but is overly responsive to large precipitation events, generating spurious spikes in flows. Calibration, either manual by expert knowledge or automatic as done here via optimization, can fix these discrepancies and fit the model output to observations.

So, a fundamental difference between the two approaches is now clearer: using a process-based model means directly using a wealth of expert knowledge available in a scientific field, while using DL means learning everything from scratch directly from data. This difference is manifest in the number of parameters that need to be tuned to achieve reasonable performance.
Notably, the LSTM model achieved better performance in emulating observations after calibration, as evident in a comparison of Figures 10a and b. However, in any modelling exercise, one needs to ensure the model gives the right answer for the right reasons (Kirchner, 2006). That is why proper model evaluation in out-of-sample prediction is critically important, as discussed in the next section.

Model validation: Standard versus true out-of-sample prediction
In general, validation and verification of mathematical models are very challenging in some scientific disciplines, if possible at all (Oreskes et al., 1994). The standard practice, however, is to test the performance of the model under investigation in terms of reproducing some historical record not seen during model calibration (Klemeš, 1986a), a process called 'out-of-sample prediction' in this paper. Figure 9b shows the results of such practice over the validation period set in Figure 8. In this case, both the LSTM (standard) and HBV models do reasonably well from a hydrologic point of view, with LSTM outperforming HBV across all replicates. In addition, and as expected, both models produced slightly lower NSE values in validation than in calibration.

The above so-called 'model validation' is inherently partial (Oreskes et al., 1994). While the performance of LSTM appears to be better than that of HBV in a 'relative' sense, one needs to take extra care before drawing such a conclusion. As argued by Klemeš (1986a) more than three decades ago, a strong assumption in this type of validation is that the conditions under which the model will be used will be similar to the conditions under which the model has been developed and calibrated. It is now well recognized that such an assumption may not hold, as many natural systems are essentially non-stationary (Milly et al., 2008; Razavi et al., 2015). Despite such recognition, this standard model validation practice has arguably remained unchanged (Beven, 2018).

Under the new conditions, however, the two models show the two distinct behaviors shown in Figure 11b. According to LSTM, peak summer flows would decline by about 25% on average, and the peak would shift about a week earlier, from the beginning of June to the fourth week of May. According to HBV-SASK, however, the changes would be more pronounced.
The peak flows would decline by about 35% on average, and the flows might show two modes: the higher one at the beginning of May and the other at the beginning of June, at about the same time as the peak in the historical observations. Are such differences not large enough to make the user skeptical about the modelling process?

One potential remedy is to inform the DL model with available process knowledge, particularly under new conditions. Let us give it a try by recasting the modelling problem based on some understanding of the governing physics in hydrology. For example, physics tells us that the freezing point of water is around 0 °C and, therefore, this threshold could be used as an approximation to differentiate rainfall from snowfall on a daily basis; i.e., if the temperature on a day is above/below 0 °C, the precipitation on that day, if any, is considered to be rainfall/snowfall (see Figures 8b and c). This differentiation is actually a part of process parameterization in HBV, similar to many other hydrologic models, via a parameter called the 'temperature threshold' (TT) for melting/freezing and separating rain and snow, with a feasible range from −4 to +4 °C (see Razavi et al., 2019).

Perhaps the most straightforward way of introducing the TT concept to LSTM is via pre-processing of the inputs. Therefore, a new LSTM model was developed and calibrated, called the 'process-informed LSTM' in this paper, with three inputs: rainfall, snowfall, and temperature, as shown in Figures 8b, c, and d.

What is worrisome is the large divergence in the behavior of the models in response to expected, but yet-to-be-seen, perturbations, whereas those models produce comparable results in standard out-of-sample prediction. Broadly speaking, one might say any known consistency of a model with the known underlying physics can improve the model's explainability and interpretability, thereby helping us better explain the model behavior in response to such perturbations.
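A minimal sketch of the temperature-threshold pre-processing described above, with TT = 0 °C; the daily values are illustrative, not observations from the case study.

```python
import numpy as np

def partition_precipitation(precip, temp, tt=0.0):
    """Split daily precipitation into rainfall and snowfall using a
    temperature threshold TT (deg C): above TT -> rain, at/below TT -> snow."""
    precip = np.asarray(precip, dtype=float)
    temp = np.asarray(temp, dtype=float)
    rain = np.where(temp > tt, precip, 0.0)
    snow = np.where(temp <= tt, precip, 0.0)
    return rain, snow

# Illustrative daily precipitation (mm) and temperature (deg C).
p = np.array([5.0, 0.0, 3.0, 10.0])
t = np.array([4.2, -1.0, -3.5, 1.1])

rain, snow = partition_precipitation(p, t)
```

The two derived series (plus temperature itself) would then serve as the three inputs of the process-informed LSTM; note that because every unit of precipitation is assigned to exactly one of the two classes, rain + snow always reproduces the original record.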
Explainability and interpretability are fundamental assets in building trust in a model, and of course, physically based models are advantaged in that respect. I would say mistrust in some data-driven modelling paradigms such as ANNs is a long-standing issue in parts of our research community and among stakeholders. I have heavily struggled with this issue as a researcher who started his research career with ANNs almost 20 years ago and has also had the privilege of working extensively with process hydrologists and physically based modellers. I believe that, with the current momentum and excitement, an opportunity is arising to bring the two worldviews together and promote the dialogue between the champions of process-based modelling and those of machine learning, as already discussed by many authors (e.g., Reichstein et al., 2019).

An example of such knowledge is the causal relationship between the water available in the soil and its contribution to flows; more water means more flows due to gravitational forces. Mechanistic models directly account for such knowledge of causal relationships. This knowledge is encoded in process parametrizations, typically in the form of deterministic, monotonic functions, or rarely in hysteretic forms, with a limited number of parameters to be calibrated to the specific case study at hand (Gharari and Razavi, 2018).

A representation of such mechanisms is often built into mechanistic models, using differential equations (ordinary or partial) to describe the system dynamics. The representation of such dynamics in the making of models is important, particularly for long-term predictions and over long time scales. DL models are often unable to account explicitly for such long-term dynamics. If a particular dynamical behavior is present in the training data, then DL can capture that behavior in its mapping from input onto output. DL, however, has no explicit mechanism to represent that dynamic under perturbed conditions beyond what has been recorded in the training data.
Is mechanistic modelling immune to issues with extrapolation? Certainly not. While a discussion of the limitations and prospects of mechanistic modelling is beyond this paper, one solution for improving the extrapolability of mechanistic modelling over time, which is also relevant to ML, is 'space-for-time substitution'. This strategy investigates multiple or many sites simultaneously, instead of one, to infer a temporal trend for a site based on information from other sites that have different properties and/or have experienced different conditions, assuming spatial and temporal variations are equivalent. For example, refer to Pickett (1989).

The bottom line is that mechanistic models are generally expected to be less prone to generating spurious behaviors in true out-of-sample prediction. Therefore, many domain experts may be inclined to trust physically based models, as their behavior is constrained by physical laws that are perceived as unchanging with time. The points made in this section will become clearer in the next section, where the essential differences between DL and mechanistic modelling are discussed.

Why is DL essentially different from process-based modelling?
In the author's view, the first principles of ANNs are rooted in connectionism, hyper-flexibility, and vigorous optimization. These characteristics are fundamentally different from the guiding principles of developing and calibrating mechanistic models, as described in the following:

• Connectionism is an approach that orchestrates a set of simple algebraic operations in a massively parallel manner to create a model able to carry out complicated tasks. Following this approach, ANNs represent the response of a system under consideration to an input by summing the collective efforts of many neurons, whose roles cannot be easily attributed to individual processes involved in that system. This is unlike mechanistic modelling, where each part of a model is designed to be responsible for a specific process.

• Hyper-flexibility is a characteristic of a model with excessive degrees of freedom, which can literally fit any dataset and is not constrained by the many assumptions held by typical statistical models. ANNs are known to be hyper-flexible. Mechanistic models, however, have limited degrees of freedom, depending on the knowledge base available about the processes being modelled. Ideally, mechanistic models tend to have just as many degrees of freedom as can be supported and constrained by the available knowledge and data.

• Vigorous optimization here refers to the practice of manipulating model parameters at any cost to maximize the goodness of fit to calibration data. The training of ANNs is all about minimizing an error function; that is, between two competing ANNs, the one producing smaller errors in calibration and validation is the winner. Optimization is also often an essential part of mechanistic modelling to calibrate model parameters. However, in mechanistic modelling, minimizing the errors is not the goal but a means to improve the realism of the model.
In other words, unlike with ANNs, the physical feasibility of a parameter, its identifiability, and equifinality are key considerations in mechanistic modelling.

Recognizing these fundamental differences is critically important when one aims to choose the right modelling paradigm for a purpose, compare the two paradigms in a case study, or attempt to bridge the two paradigms, possibly for improved modelling performance. The following section outlines the status quo for bridging the two paradigms and some emerging trends.

Reichstein et al.'s (2019) influential article in Nature re-introduced and proposed the notion of 'hybrid modelling' and the above three approaches as the next steps in earth science. In the following, these three approaches are explained, and then more modern approaches arising from research fields beyond earth and environmental sciences are discussed.

Integrating differential equations into ANNs. This approach is very recent and perhaps the most …

Deep learning has been referred to as a form of modern "alchemy"; see Rahimi and Recht (2017) for the sentiment, LeCun (2017) for a rebuttal, and Hutson (2018) for a summary. This point is not meant to undermine the benefits of AI technology, particularly for earth and environmental applications. Instead, it calls for improved rigor and a better appreciation of the available knowledge base. After all, it has long been known in the environmental sciences that complex models can be made to produce virtually any desired behavior given their large degrees of freedom, as articulated by Hornberger and Spear (1981) decades ago.

With such risks in mind, the new potential afforded by AI for earth and environmental sciences is great. To realize this potential, we need to reconcile data-driven AI techniques with the theory-driven knowledge base.
The knowledge base is at the heart of 'traditional programming', which is still a major building block of process-based, or mechanistic, modelling in earth and environmental sciences. Clearly, traditional knowledge-based programming and AI embody two fundamentally different worldviews on problem solving and, therefore, their reconciliation will not be straightforward. This paper has tried to address some critical questions in this regard and to provide some perspective on this important endeavor, in anticipation of new breakthroughs in earth and environmental sciences in an age of big data and computational power.