Untenable nonstationarity: An assessment of the ﬁ tness for purpose of trend tests in hydrology

The detection and attribution of long-term patterns in hydrological time series have been important research topics for decades. A signi ﬁ cant portion of the literature regards such patterns as ‘ deterministic components ’ or ‘ trends ’ even though the complexity of hydrological systems does not allow easy deterministic explanations and attributions. Consequently, trend estimation techniques have been developed to make and justify statements about tendencies in the historical data, which are often used to predict future events. Testing trend hypothesis on observed time series is widespread in the hydro-meteorological literature mainly due to the interest in detecting consequences of human activities on the hydrological cycle. This analysis usually relies on the application of some null hypothesis signi ﬁ cance tests (NHSTs) for slowly-varying and/or abrupt changes, such as Mann-Kendall, Pettitt, or similar, to summary statistics of hydrological time series (e.g., annual averages, maxima, minima, etc.). However, the reliability of this application has seldom been explored in detail. This paper dis-cusses misuse, misinterpretation, and logical ﬂ aws of NHST for trends in the analysis of hydrological data from three di ﬀ erent points of view: historic-logical, semantic-epistemological, and practical. Based on a review of NHST rationale, and basic statistical de ﬁ nitions of stationarity, nonstationarity, and ergodicity, we show that even if the empirical estimation of trends in hydrological time series is always feasible from a numerical point of view, it is uninformative and does not allow the inference of nonstationarity without assuming a priori addi- tional information on the underlying stochastic process, according to deductive reasoning. This prevents the use of trend NHST outcomes to support nonstationary frequency analysis and modeling. We also show that the correlation structures characterizing hydrological time series might easily be underestimated, further compro-mising the attempt to draw conclusions about trends spanning the period of records. Moreover, even though adjusting procedures accounting for correlation have been developed, some of them are insu ﬃ cient or are applied only to some tests, while some others are theoretically ﬂ awed but still widely applied. In particular, using 250 unimpacted stream ﬂ ow time series across the conterminous United States (CONUS), we show that the test results can dramatically change if the sequences of annual values are reproduced starting from daily stream ﬂ ow records, whose larger sizes enable a more reliable assessment of the correlation structures.


A B S T R A C T
The detection and attribution of long-term patterns in hydrological time series have been important research topics for decades. A significant portion of the literature regards such patterns as 'deterministic components' or 'trends' even though the complexity of hydrological systems does not allow easy deterministic explanations and attributions. Consequently, trend estimation techniques have been developed to make and justify statements about tendencies in the historical data, which are often used to predict future events. Testing trend hypothesis on observed time series is widespread in the hydro-meteorological literature mainly due to the interest in detecting consequences of human activities on the hydrological cycle. This analysis usually relies on the application of some null hypothesis significance tests (NHSTs) for slowly-varying and/or abrupt changes, such as Mann-Kendall, Pettitt, or similar, to summary statistics of hydrological time series (e.g., annual averages, maxima, minima, etc.). However, the reliability of this application has seldom been explored in detail. This paper discusses misuse, misinterpretation, and logical flaws of NHST for trends in the analysis of hydrological data from three different points of view: historic-logical, semantic-epistemological, and practical. Based on a review of NHST rationale, and basic statistical definitions of stationarity, nonstationarity, and ergodicity, we show that even if the empirical estimation of trends in hydrological time series is always feasible from a numerical point of view, it is uninformative and does not allow the inference of nonstationarity without assuming a priori additional information on the underlying stochastic process, according to deductive reasoning. This prevents the use of trend NHST outcomes to support nonstationary frequency analysis and modeling. We also show that the correlation structures characterizing hydrological time series might easily be underestimated, further compromising the attempt to draw conclusions about trends spanning the period of records. Moreover, even though adjusting procedures accounting for correlation have been developed, some of them are insufficient or are applied only to some tests, while some others are theoretically flawed but still widely applied. In particular, using 250 unimpacted stream flow time series across the conterminous United States (CONUS), we show that the test results can dramatically change if the sequences of annual values are reproduced starting from daily stream flow records, whose larger sizes enable a more reliable assessment of the correlation structures.
Faced with a sample of unknown origin, many applied statisticians working in economics, meteorology and the like, hasten to decompose it into a trend and an oscillation (and added periodic terms). They assume implicitly that the addends are attributable to distinct generating mechanisms, and are statistically independent. This last implicit assumption is quite unwarranted, except when the sample is generated by Brownian motion. (B.B. Mandelbrot, The Fractal Geometry of Nature, p. 352, 1982)

Introduction
Due to the complexity of hydrological systems, their analysis and modeling heavily rely on historical records as theoretical reasoning and deduction are often inadequate. This analysis is even more problematic when we depart from the hypothesis of stationarity to embrace that of nonstationarity. Even though the two notions of stationarity and nonstationarity should apply to models and not to the real-world processes themselves (see Section 4 below), considerable literature assumes that the observed time series generated by the real world seldom appear to be stationary but exhibit more complicated nonstationary behavior. In many cases, conclusions on nonstationarity are based on the outcome of trend tests applied to finite-size time series covering relatively short periods of record.
A change of paradigm from stationary to nonstationary can be claimed to account for human activities producing predictable changes, such as land-use and land-cover changes, and water resources exploitation, or more complex but less predictable phenomena such as the worldwide hydrologic change ascribed to anthropogenic climate change (ACC) (Milly et al., 2015). In this respect, in the last three decades, a huge number of studies have investigated possible humandriven changes in the form of slowly-varying trends or abrupt changes in time series of hydrological variables across different regions of the world. Broadly speaking and taking for granted unavoidable differences, the aim of these studies has been to understand if these changes are detectable, what is their pattern, and ultimately, to infer nonstationarity, thus promoting the implementation of nonstationary models to support new design and planning strategies (e.g., Ouarda and El-Adlouni, 2011;Rootzén and Katz, 2013;Cheng et al., 2014, among many others).
Therefore, we believe there exists a need for careful inspection of the basic concepts of null hypothesis statistical tests (NHSTs) for trends and their application to hydrological problems. Following Serinaldi and Kilsby (2015), this paper is an attempt to meet this need. In fact, the purpose of our work is neither to review the state of the art of the research related to trend analysis, nor to give examples of the problems discussed thereto. Rather, we summarize and attempt to shed some light on the reasons for contradictory results encountered in the literature, and discuss widespread practices that can easily be identified in many studies. Therefore, the conceptual perspective of this study should be seen as a guideline in agreement with the general but scientifically based and widely applicable statement by Mandelbrot in the opening motto of this paper. On the other hand, we attempt to confute a certain mechanistic approach often characterizing the literature on the topic. We highlight that conceptual arguments and mathematical definitions are necessary to provide practical advice to identify trends, to interpret results, and to avoid misleading usage and conclusions.
The paper is structured as follows. By using a simple example, Section 2 introduces the discussion and research questions, and summarizes our conclusions in order to provide the reader with a clear outline of what will follow. Section 3 reviews the role of trend testing and some problems related to historical derivation, epistemological reasons, and detection and attribution of changes under temporal persistence. Then, in Section 4, we discuss the importance of clear terminology corresponding to well-defined concepts to avoid misunderstandings relying on different interpretation of the same terms. Section 5 gives an overview of the properties of some commonly used trend tests, namely, Mann-Kendall (MK) and Pettitt. In Section 6, we analyze 250 unimpacted stream flow time series across the conterminous United States spanning the period 1950-2011. Discussion and final remarks are given in Section 7.
2. NHST for trends: overview of key ideas 2.1. Setting the scene with a simple example We start our discussion by a simple example of typical trend testing exercise familiar to practitioners as we want to highlight the basic concepts behind trend testing procedures. Fig. 1(a) and (b) shows the average annual discharge of two nearby rivers that we will call the Nera River and Velino River, referring to the following discussion for an indepth description of these data. Both time series, ranging from 1916 to 2015, show an apparent change point around the years 1974-1975 as well as statistically significant and similar autocorrelation functions ( Fig. 1(c) and (d)); the Kendall correlation coefficient between the two time series is = τ 0.32 K . Since we do not know if the autocorrelation is a consequence of a possible deterministic change of regime or the effect of the dependence structure, we use statistical tests accounting for the latter. Therefore, we apply both the classic Pettitt tests and four additional versions accounting for possible first-order Markov autocorrelation structure and fractional Gaussian noise (fGn; also known as Hurst-Kolmogorov process) (Serinaldi and Kilsby, 2016a). Following the common interpretation of trend tests, the tests unanimously lead to the conclusion that a possible deterministic change around [1974][1975] at the 5% significance level. After splitting the series into two sub-series (before and after the change point) we find that their autocorrelation is not significant (Fig. 1(e)-(h)). Therefore, we next apply standard Pettitt and MK tests to the two sub-series for both rivers (Villarini et al., 2009a). Since no significant trend or change point was found in the subseries, we can conclude that autocorrelation is reasonably the effect of an abrupt change occurred around [1974][1975]. Based on these results, the common approach attempts to explain such changes by some anthropogenic activities, including some more easily recognizable (i.e. river training, water abstraction, dam construction, etc.) and some less (i.e., ACC, climate teleconnections, etc.). In the latter case, the attribution is performed by some further statistically-based analysis (see e.g., Merz et al., 2012;Viglione et al., 2016, and references therein, for an overview of the attribution problem and examples). However, the nature of the time series analyzed above is completely different. In fact, the true nature of these data is that they are nothing more than artificially generated series designed to be a complete contrast. The Nera River sequence is a step-wise signal superimposed on a sample of independent pseudo-random realizations drawn from a Gaussian distribution ( Fig. 1(i)), whereas the Velino River sequence is a sub-set extracted from a longer time series of size 2000, which is a realization of a discrete-time fGn with unit variance and Hurst parameter = H 0.8 ( Fig. 1(j)). This synthetic experiment highlights that even procedures specifically devised to account for the interplay between possible deterministic trends and/or change points are not able to discriminate, and can easily lead to incorrect conclusions. In fact, persistence generates local trends and abrupt changes, and deterministic changes result in artificial persistence. Notice that sequences similar to Velino time series can easily be extracted by a quick visual inspection of the entire time series, since these local trends and step changes are a characteristic of persistent processes.
Since abrupt changes can be seen as a special (limit) case of monotonic slowly-varying trend, in the following discussion we generally indicate both types with the term 'trend', unless otherwise specified. The null hypothesis is defined as {H 0 : there is no deterministic trend}, while the alternative hypothesis as {H 1 : there is deterministic trend}. Further specifications are given according to the specific context throughout the text.

Forgotten questions whose answers are often taken for granted
Some research questions arise from the example in the previous section: 1. What is the origin of slowly-varying trends (or step changes) in hydro-meteorological time series? Are they due to external drivers (e.g., well-defined human interventions) or are they related to intrinsic persistence or other causes? Alternatively, is measured persistence a spurious effect of trends induced by external forcing, or are observed trends spurious effects of persistence or other generating mechanisms? 2. Can null hypothesis statistical tests (NHSTs) for trends answer the above questions? Which information can trend NHSTs provide? 3. Under the assumption that trend NHSTs can provide information about trends in recorded series, can one draw conclusions about nonstationarity, thus justifying, for instance, the use of nonstationary modeling in hydrological frequency analysis?

Resetting some beliefs concerning NHST for trends
Since answers to the above questions require an extensive discussion of arguments often neglected in hydro-meteorological applications, in this section we firstly summarize the main conclusions, and then we present in full detail the reasoning leading to them in the remainder of the paper. Broadly speaking, searching for answers to the above questions reveals critical aspects related to trend NHST that can be classified as empirical, methodological, and theoretical. The first refer to the nature of the data, the second to the models used to make inference, and the latter to logical foundations and semantics (i.e. the link between symbol, concept, and referent; Eco (1976)). All these aspects are already known and discussed but spread out in various research areas; however, they are often overlooked, and their impact on results dramatically underestimated in many hydro-meteorological studies.
Following the structure of the subsequent sections, these arguments can be summarized as follows: 1. NHSTs have a logically flawed rationale coming from ill-posed and theoretically unfounded hybridization of Fisher significance tests and Neyman-Pearson hypothesis tests; they do not provide the information that scientists need (i.e., the likelihood of H 0 given the data and/or physical significance), do not allow conclusions about the truth of falsehood of any hypothesis, and do not apply to exploratory non-randomized studies. Trend tests share the general problems of NHST procedures. Such issues are concerned with the inverse probability problem, the confusion between substantive and statistical hypotheses, and the fact that NHSTs are not devised for exploratory studies. In fact, hydro-meteorological data are unique in the sense that every record is the only available realization or trajectory of the underlying process. Since alternative experiments cannot be performed, these observations do not provide the type of independent information that would be obtained by observing the same variables over a period of similar length at another point in time. Even though hypothesis testing falls in the realm of so-called confirmatory analysis, its nature is basically dissenting as its outcomes can only be rejection or no rejection, and both cases reflect lack of knowledge about the null hypothesis H 0 and the alternative hypothesis H 1 : formal acceptance is not a contemplated option. Moreover, statistical significance does not imply physical significance because the former depends on the sample size, and almost every test assigns statistical significance to physically negligible differences for very large samples (see Sections 3.1). 2. Hydro-meteorological data are commonly characterized by spatial and temporal dependence. This property can greatly help to interpret and account for many features of hydro-meteorological records such as apparently unexpected variability. Dependence is usually incorporated into the null hypothesis H 0 in order to compare the assumption H 1 of deterministic trend with a more realistic H 0 relaxing the assumption of independence. Nevertheless, dependence can be strongly underestimated due to the limited extent and uniqueness of the hydro-meteorological data, which should therefore be carefully taken into account. For example, this study highlights that even more refined statistical techniques accounting for dependence can be not enough. In fact, we show that the nature, quantity, and quality of some annual summary statistics are not sufficient to infer the dependence (and thus the variability) resulting from the entire daily process (see Sections 3.2 and 6). 3. Trend tests are widely used to assess the effect of known external forcings (e.g., land cover change) on hydro-meteorological records (e.g., flow peaks) in order to explore inhomogeneity or trends or nonstationarity (e.g., McCuen, 2003). Such procedures often result in circular reasoning because if we assume that the forcing process is changing according to some deterministic function of time -and thus it is nonstationary -and it affects in some way a target process of interest (e.g., flood intensity or frequency), then we already know that the target process is nonstationary. In these cases, we are interested in the size of the effects and not in the presence/absence of generating mechanism, which is already known (see Section 3.3). 4. The outcome of trend NHSTs cannot support and justify the use of nonstationary models. As a deterministic trend is a systematic change reflecting a time-dependent process, the mathematical rule describing the evolution of this change should be established by deductive reasoning (a priori; see e.g., Poppick et al., 2017) or assumed as a working hypothesis but cannot be inferred solely from the data without external information, because, without attribution, new data might easily change the nature and shape of the supposed trend. Therefore trends cannot result for instance from fitting arbitrary parametric curves or applying smoothing filters to observed records. Despite the possible goodness of fit, these pseudo-trends might yield completely unreliable predictions. This lack of reliability reveals the actual nature of such data-driven trends, i.e. that they refer to the time series and not to the underlying process, and thus are affected by sampling uncertainty and can change as additional data become available (Luke et al., 2017;Serinaldi and Kilsby, 2015). Therefore, even though nonstationary modeling is legitimate, every step should be approached with great care in order to be logically and scientifically correct, bearing in mind the underlying assumptions of procedures, methods, and models used in each stage of the analysis. Beside possible ill-posed selection of nonstationary models yielding unreliable predictions, overlooking theoretical assumptions generates misconceptions such as the incorrect belief of the existence of temporally varying return periods and corresponding return levels and their confusion with time varying probabilities of exceedance and quantiles, whereby the mathematical definitions of return period yield unique and comparable values in stationary and nonstationary contexts (Cooley, 2013;Olsen et al., 1998;Salas and Obeysekera, 2014;Serinaldi, 2015;Serinaldi and Kilsby, 2015;Volpi et al., 2015) (see Sections 3.3 and 4.2). 5. Another consequence of the limited extent and uniqueness of the hydro-meteorological data is that one needs to make a number of implicit but strong assumptions in order to treat these records as outcomes of deterministic, stochastic, or some mixed processes. In this respect stationarity and ergodicity play a key role in statistical inference. Ergodic theory deals with the relationship between statistical averages and sample averages, which is a central problem in the estimation of statistical parameters in terms of real data. For example, empirical summary statistics (e.g., moments) are informative only under the assumption that the process is stationary and ergodic. For example, even if the sample mean of an observed time series can always be estimated and does not change irrespective of stationary or nonstationary assumptions, in the first case it is assumed to be representative of the process thanks to ergodicity, while in the latter it is not (if one does not account for the source of nonstationarity). This means that other realizations of the same nonstationary process can have completely different sample averages, none of which can give insight into the actual population mean of the process, if any. Therefore, assuming nonstationarity requires great care in order to understand what we can really infer from data under lack of ergodicity. Without supporting the nonstationary choice with deductive (top-down) arguments identifying the mechanism generating the time-dependent behavior of the process, the modeling procedure reduces to a mechanistic numerical exercise attempting to minimize some performance criterion with the aim to follow local patterns of fitting data sets. As mentioned above, this approach yields models that reveal the weakness of their derivation and justification when predictions are compared with new (future) observations in validation data sets (see Sections 4.1 and 4.2). 6. Nonstationarity is a very stringent assumption as it implies that one or more characteristics of the distribution of a system depend on time by a deterministic function d t . As the term deterministic implies being free of uncertainty, nonstationarity cannot be claimed from the data only without an attribution identifying the source of the deterministic dependence on time. Therefore, Koutsoyiannis and Montanari (2015) noted that "Because it explains in deterministic terms part of the variability, a nonstationary description is associated with reduced uncertainty. Hence unjustified or inappropriate claim of nonstationarity results in underestimation of variability, uncertainty and risk". Here, uncertainty does not refer to specific parametrization but to the existence and overall behavior of time-dependent deterministic processes. For example, the existence and general evolution of seasonal behavior is deduced from arguments independent from data (i.e., planetary dynamics), while its parametrization varies for each specific data set. Excluding spurious local trends characterizing stationary stochastic processes, trends of interest in hydro-meteorology are those related to mechanisms generating departures from the so-called natural variability (which is implicitly assumed to be stationary). Such trends are therefore a form of nonstationarity, which implies the existence of a deterministic function of time d t requiring detection and attribution by combining deductive reasoning, which supports and justifies the existence of d t , and inductive inference, which provides preliminary knowledge and quantication/parametrization of d t . The definition of deterministic trend has direct practical consequences (see Sections 4.3 and 4.4): (a) The commonly used approach of comparing nested models with time-varying and constant parameters by using some performance criterion is not sufficient to infer nonstationarity if d t does not result from deductive reasoning, but results from simple fitting procedures. General poor performance in prediction confirms the weakness of such a bottom-up procedure (Serinaldi and Kilsby, 2015).
(b) Replacing the dependence on time (i.e., d t ) with dependence on teleconnection indices or other environmental variables showing clear stochastic behavior does not make models nonstationary, but simply doubly stochastic stationary. This replacement makes models nonstationary only if such auxiliary environmental variables are themselves nonstationary, and thus time-dependent according to a well-defined function d t . This is particularly important for a correct application and interpretation of frequency analysis based on generalized linear models (GLMs), generalized additive models (GAMs), or similar. 7. Trend NHSTs suffer logical flaws and some of them are also incorrect or incorrectly applied. For example, the still widely applied so-called trend-free prewhitening (TFPW)  was shown to be theoretically flawed (Serinaldi and Kilsby, 2016a), as its original version does not address the variance inflation related to dependence, which can be even exacerbated. This explains the contradicting results reported in the literature concerning the outcome of this test compared with alternative procedures. The correctness of applied tests (or methodology in general) should not be taken for granted, and a preliminary check of their performance under H 0 (controlled conditions) should be performed (by simulation) before their application, especially if the methodology was not developed by statisticians (see Section 5).
The choice between stationary and nonstationary depends on a stringent process of attribution supported by deductive arguments, which come before and go beyond statistical inference techniques (see Section 6).
3. The (non)logic of trend hypothesis tests: what they cannot say about trends and nonstationarity

The consequences of a difficult birth: NHSTs logical flaws and misinterpretations
In several fields of applied science, NHSTs have been widely discussed and criticized for a long time (Cohen, 1994;Gill, 1999;Johnson, 1999;Levine et al., 2008;Beninger et al., 2012;Ellison et al., 2014;Nuzzo, 2014;Briggs, 2016;Greenland et al., 2016;Wasserstein and Lazar, 2016, and references therein), but to our knowledge, the problems concerning NHSTs received little attention in hydrological sciences (McBride et al., 1993;Nicholls, 2001;Clarke, 2010). NHST is a synthesis of the Fisher test of significance, developed as a general approach to scientific inference, and the Neyman-Pearson hypothesis test, designed for applied decision making and quality control (Levine et al., 2008). These methods are conceptually different and imply different interpretations of their outcomes. Neyman and Pearson believed they had made Fisher's theory of significance testing more complete and consistent, whereas Fisher never perceived the emerging Neyman-Pearson theory as correcting and improving his own work on tests of significance (Gigerenzer et al., 1989, pp. 98 and 102). A heated controversy followed, and "although the debate continues among statisticians, it was silently resolved in the 'cookbooks' written in the 1940s to the 1960s, largely by non-statisticians, to teach students in the social sciences the 'rules of statistics' " (Gigerenzer et al., 1989, p. 106). The result was a so-called 'hybrid system', i.e. NHST (Beninger et al., 2012), merging "Fisher's easy-to-calculate P value into Neyman and Pearson's reassuringly rigorous rule-based system" (Nuzzo, 2014). Overlooking the great differences in conceptual interpretation, this seemed perfectly acceptable to statistics end-users, partly because often the same formulas were used and the same numerical results obtained (Gigerenzer et al., 1989, p. 106). This has led to an enormous confusion about, for instance, the meaning of a significance level, coining the well-known expression 'the null hypothesis is rejected at the α level', which occurs neither in Fisher nor in the writings of Neyman and Pearson. Moreover, the neglect of controversial issues and alternative theories, and the anonymous presentation of an apparently monolithic body of statistical techniques often turned the hybrid theory into a mechanical ritual, even though Fisher, and Neyman-Pearson had all warned against drawing inferences from tests without judgment (Gigerenzer et al., 1989, p. 107 and 209). This historical digression confirms how damaging a mechanistic approach can be through overlooking subtle, but fundamental, theoretical concepts.
The differences between Fisher and Neyman-Pearson systems highlight their incompatibility and the problems affecting the NHST synthesis. With Fisher significance testing, no explicit alternative hypothesis H 1 to the null H 0 is identified, and the p-value that results from the model and the data is evaluated as the strength of the evidence for the research hypothesis. Therefore there is no notion of 'power of test' nor of accepting alternative hypothesis H 1 in the final interpretation. Conversely, Neyman-Pearson tests identify complementary hypotheses, H 0 and H 1 , in which rejection of one implies acceptance of the other, and this rejection is based on a predetermined significance level α. Neyman-Pearson hypothesis test defines the significance level α a priori as a function of the test (i.e., before even looking at the data), whereas Fisher's test of significance defines the significance level afterwards as a function of the data. The NHST synthesis pretends to select α a priori, but actually using a posteriori p-values to evaluate the strength of the evidence. This allows inclusion of the alternative hypothesis but removes the search for a more powerful test (Gill, 1999). The power of a test is actually a problematic issue in hybrid NHST as it is most often undefined. The sampling distributions of both H 0 and H 1 are specified in Neyman-Pearson theory and an effect size or point prediction must be specified for H 1 in order for the concept of power to be meaningful and for defining the sample size required to obtain the required power. Conversely, in hybrid NHST, H 1 is simply specified to be not H 0 and vice versa (e.g., {H 0 : there is no deterministic trend} and {H 1 : there is deterministic trend}), i.e., H 0 and H 1 are written such that they are mutually exclusive and exhaustive (Levine et al., 2008). Moreover, one of the NHST hypotheses is always labeled as the null hypothesis as in the Fisher test, whereas Fisher intended the null hypothesis simply as something to be 'nullified' or falsified in agreement with (and influenced by) the contemporary 1935 Karl Popper's Logic of Scientific Discovery. NHST partially uses the Neyman-Pearson decision process except that failing to reject the null hypothesis is treated as a 'modest' support for the null hypothesis (Gill, 1999).
Leaving aside problems related to some abuses and misinterpretations that can be partially corrected, the hybrid NHST suffers some logical flaws that cannot be overcome: D corresponds to switching, for instance, the statements: "(1) Most people who face a firing squad die from bullet wounds, and (2) Most people who die from bullet wounds have received them from a firing squad!" (Beninger et al., 2012). The (flawed) logic of NHST is as follows: (i) if H 0 is true then the data are highly likely to follow an expected pattern, (ii) the data do not follow the expected pattern, (iii) therefore H 0 is highly unlikely. This can translate to statements such as: (i) if a person is an American then it is highly unlikely she/he is a member of Congress, (ii) the person is a member of Congress, (iii) therefore it is highly unlikely she/he is an American. In other words, a low p-value, i.e. P[Congress|American], does not imply a low P [American|Congress] (Pollard and Richardson, 1987;Gill, 1999). Thus, p-value says nothing about the truth or otherwise of H 0 or H 1 or the strength of evidence for or against either one. In this respect, Neyman and Pearson were very clear "...as far as a particular hypothesis is concerned, no test based on the (objective) theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis" (Neyman and Pearson, 1933). 2. Substantive theories vs. statistical hypotheses: In hybrid NHST, the statistical null hypothesis and the statistical alternative hypothesis are written such that they are mutually exclusive and collectively exhaustive. Therefore, if we accept the incorrect assumption that one could reject H 0 on the basis of a small p-value, then H 1 is inferred to be probably true since no other alternatives (besides H 0 and H 1 ) are logically possible (Levine et al., 2008). However, H 1 can result from multiple and conflicting substantive theories. For example, local step changes in Fig. 1(a) and (b) can result from fluctuations of a persistent process or the superposition of uncorrelated random noise and a deterministic stepwise signal. Accepting a substantive theory on the basis of results concerning a statistical hypothesis relies on the formal fallacy of 'affirming the consequent' (i.e., 'If p, then q; q; therefore, p'), which is the form of all scientific inference aimed at supporting a theory by verifying its observational consequences. Statistical hypotheses are numerical consequences of the substantive theories, not their semantic equivalents (Meehl, 1997).
It should be noted that Fisher did not distinguish between substantive hypotheses and statistical hypotheses (Gigerenzer et al., 1989, p. 97). However, the p-value was intended simply as an informal way to judge whether evidence is worthy of a second look (Nuzzo, 2014), and 'rejecting H 0 ' does not mean a categorical adoption of the belief that it is false. In fact, according to Fisher, "in learning by experience, ... conclusions are always provisional and in the nature of progress reports, interpreting and embodying the evidence so far accrued" (Fisher, 1935, p. 25). On the other hand, Neyman and Pearson introduced their hypothesis test as a 'rule of behavior' to make decisions accounting for possible consequences "without hoping to know whether each separate hypothesis is true or false" (Neyman and Pearson, 1933). In other words, both Fisher and Neyman and Pearson were well aware of the fallacy of 'affirming the consequent' and the impossibility for inductive inference to make conclusions about the truth or falsehood of a scientific hypothesis. Statistical considerations alone cannot lead to a decision. 3. Classic NHST does not apply to exploratory studies: Most research studies can be generally classified as either experimental or observational (Flueck and Brown, 1993). The major distinction is that the former requires the ability of the scientist to control the principal inputs in order to assess the effects on the outputs. Therefore, studies of trends in hydro-meteorological variables can be classified as observational because there is no scope for controlling the inputs (e.g., the researcher cannot control the amount of rainfall), thus making such studies more difficult to plan and analyze than experimental ones. Both experimental and observational studies usually have three stages denoted as preliminary, exploratory, and confirmatory, even if the third stage can or should actually aim to falsify/disprove the scientific hypothesis, according to the so-called 'modus tollens' of deductive inference (i.e., 'If p, then q; no q; therefore, no p') (Meehl, 1997). Leaving aside the preliminary stage concerning general insights into questions about the research topic (e.g., which measurements are useful and can be made, amount of available or collectable data in the study period, etc.), exploratory studies aim to define claims about foreseen or unforeseen relations on the basis of a plausible conceptual model (i.e., a researcher's description of the process of interest) and appropriate scientific evidence, whereas confirmatory studies are specifically defined processes focused on replicating or disproving a result while minimizing sampling and non-sampling errors (Flueck and Brown, 1993) with small probability that results can come from causes different from the tested theory.
Exploratory studies are flexible in their research of evidence (e.g., variables to be included), but this flexibility should not be confused with a superficial treatment of the data and methods. Focusing on data, analyses can rely on randomized or non-randomized samples. NHST requires randomized samples as it involves three steps often overlooked but fundamental: (i) the choice of the probabilities of occurrence, α and β, of Type I and Type II errors (not only the significance level α); (ii) random selection of only n samples from the designed population, whereby n is related to the sampling distribution of the test statistic, α, and β, in order to guarantee the desired test's significance and power; and (iii) the test must be performed only once. All these steps should be performed before collecting data. Therefore, logical flaws apart, NHST yields 'valid' results only if these steps are followed, thus justifying the definition of 'design based' inference. When the above steps do not apply, NHST is out of context because of lack of a priori basis.
In the case of research related to ACC detection, for instance, "most detection studies apply NHST to a sample of data, and determine whether to reject the null hypothesis of zero trend in the atmospheric variable under consideration" (Nicholls, 2001). These studies typically use all the available records and these data are far from being randomly selected samples with size fulfilling the requirements in terms of α and β (Prosdocimi et al., 2014). Moreover, hydro-meteorological observations usually exhibit serial and spatial correlation, and other properties that can be accounted for but make the outcomes further uncertain. Based on the above remarks, it follows that such studies fall in the non-randomized exploratory family, thus excluding confirmatory tools such as NHST (assuming that this is a logically coherent procedure), and requiring in turn the use of split samples or future data subsets to provide confirmatory/ disproving information (Flueck and Brown, 1993). Moreover, in the exploratory stage, one is really not interested in finding a statistically significant effect or trend (which can always emerge by increasing the sample size) but in physically significant effects.
Assuming that one overlooks these aspects, how should the outcome of NHSTs be interpreted? Rejection of H 0 does not necessarily imply the acceptance of H 1 , as the discrepancy of the observations from the conditions corresponding to H 0 can actually result from factors not included in the formulation of H 0 (e.g., larger variability due to lack of independence) and different from H 1 . It is also less legitimate to accept substantive hypotheses owing to the formal logic fallacy of 'affirming the consequent'. On the other hand, if H 0 is not rejected, then this does not mean that it can be concluded that H 0 is true, but only that experimental evidence does not support the rejection of the null hypothesis. Unfortunately, the intricacy of such reasoning is once again a result of the hybrid nature of NHST. In fact, Fisher intended significance tests as tools for screening situations worthy of deeper study (without H 1 ), while Neyman-Pearson hypothesis tests were proposed as rules of action implicitly accounting for the consequences (quantified a priori by α and β) of choosing between two competing alternatives.
Therefore, even overlooking logical flaws, trend NHSTs can only reveal possible changes which are not compatible with random fluctuations corresponding to very specific reference processes (e.g., independent and identically distributed (iid) random variables), thus requiring further investigation. We can then extend to general trend NHSTs what Busuioc and von Storch (1996) recommended for Pettitt test: trend/change tests should be used not as tests but as mere tools for preliminary screening. Small (large) values of the test statistics should be taken as indications for possible upward or downward changes. Such changes should be accepted as physically meaningful if they can be related with a predictable process based on theoretical models (e.g., logistic models describing population growth under limited resources) and/or well identified physical dynamics justifying causality (e.g., dam building, river training, etc.).

Hidden dependence: the limits of short time series and the role of reference models
Once clarified what conclusions can(not) be drawn from trend NHSTs and in which context, we can better discuss the role of persistence in trend detection. The example in Section 2.1 shows that the underestimation of persistence plays a key role and should be accounted for. The estimation of Hurst parameter involved in Pettitt tests adapted for fGn yields H values of 0.5 and 0.66, for the Nera River and Velino River, respectively. However, while = H 0.5 is consistent with the fact that the Nera River data are actually independent, the value = H 0.66 underestimates the actual value 0.8 even though the estimator is corrected for the bias by the method described in Appendix A. This confirms that model identification under scarce observations (i.e. short time series) is a difficult statistical task, subject to large uncertainty and bias. Koutsoyiannis and Montanari (2007) have already investigated this aspect showing that very long time series (thousands of observations) are required to correctly recognize fGn. Therefore, underestimation of persistence is an aspect that should not be overlooked when using whatever trend NHST involving a correction procedure for this property. Note that the underestimation of persistence might lead to consider data approximately independent, and thus applying standard tests with no corrections. This choice can inflate the number of detected significant deterministic trends.
One should also account for the underestimation of variance, which is another well-known phenomenon related to the persistence of some stochastic processes (e.g., Koutsoyiannis, 2011;Tyralis and Koutsoyiannis, 2011;Koutsoyiannis and Montanari, 2015) arising from the fact that the process stays in a given subset of the state space for several time steps, thus requiring much time to explore the entire state space. For example, even though the fGn time series in Fig. 1(j) has theoretical unitary variance, its 100-size subsets have an average variance equal to 0.9. Note that the larger uncertainty related to persistence should not be confused with "the assumption of a deterministic temporal change in the pdf (specifically, the second moment) of a random process (evidently, from historical observations to future analysis period)" (Milly et al., 2015) (see Section 4 for further details). In other words, persistence inflates the overall variance, which is larger than that corresponding to the independent case. Of course, as for persistence, estimating the variance from short time series can yield substantial underestimation, with similar consequences on the outcome of trend NHSTs (i.e., inflated number of detected deterministic trends).
Dependence is introduced in trend NHSTs to build a more realistic H 0 relaxing the assumption of independence in the iid model, whereas deterministic trends relax the hypothesis of identical distribution in the iid model in the H 1 side. Since long-term patterns in finite samples can result from (be effects of) both persistence and deterministic changes in distribution, in trend NHSTs we attempt to compare two hypotheses that can produce comparable effects, knowing a priori that they can. Thus, why is dependence considered a 'null' condition, while deterministic changes in distribution are assumed to produce an effect? The difference between the two schemes is that the deterministic trends require attribution whereas persistence is compatible with pure stochastic processes, implicitly assuming that persistence provides a realistic description of natural systems.
As mentioned above, NHSTs cannot tell us which model is the most credible, and they cannot be used for such exploratory studies but only in a confirmatory/disproving stage by using independent data and a well-specified model reproducing properties that are unlikely to be reproduced by other competitors. Therefore, in absence of physical theory, both options (persistence and deterministic changes) are legitimate, but the main issue concerns their ability to describe variations in the wider population. This is usually only achievable when there are additional sources of data against which each model can be judged.

Distinguishing processes and time series: a matter of attribution
Often trend NHST is the first step to infer systematic changes of the studied process over time and thus its nonstationarity, eventually justifying the adoption of nonstationary models. However, this procedure is logically flawed, and the opposite should be done. Exploratory analysis should suggest a set of theories/models. These models should be used to reproduce challenging properties of the observed data, and then confirmatory/disproving analysis in terms of prediction should be applied. A successful model/theory can provisionally be retained until disproved by further applications on new data. The common inversion of reasoning is partly related to the confusion between processes and time series.
The problem of contaminated data series with trends and seasonal effects has been a matter of common experience for hydrologists. The traditional way of dealing with such an issue is to produce a new time series (the output of a certain filter or adjusting procedure) which represents in some sense an estimate of what the real series would be if the contaminating effect were absent. According to Jaynes (2003, p. 536): "Then choice of the 'best' physically realizable filter is a difficult and basically indeterminate problem; fortunately, intuition has been able to invent filters good enough to be usable if one knows in advance what kind of contamination will occur". When a data set is filtered according to incorrect assumptions, detrending may introduce spurious artifacts that distort the information that statistics and probability theory could have extracted from the raw data; so, caution is advisable especially with refined filters giving a false sense of reliability, whereby this can come only from reasoned judgment. Hence, testing trends on finite and short time series can easily be inconclusive and/or misleading because of the intrinsic difficulty, if not impossibility, of detecting nonstationarity (of a process) solely from data without exogenous information, as is discussed later (leaving aside the logical arguments discussed above).
Let us suppose that a dam was built along a river, thus influencing its regime according to the dam operation rules. If we know a priori the existence of the dam, we do not need to perform a trend analysis because we already know that the flow regime has been changed by dam construction; we can study how the dam impacted on specific characteristics of the flow regime (i.e., the effect size), if this information is not already included in dam design specifications. On the other hand, if we do not have a priori information on the dam existence, trend NHST can only tell us that some source of discrepancy from pure randomness is present; however, this does not allow one to infer nonstationarity of the underlying process without additional information identifying a clear causality rule. In fact, as shown in Sections 2.1, 3.1, and 3.2, multiple factors can generate such discrepancy in finite time series, and trend NHST does not allow one to draw conclusions on the substantive causes. In these circumstances we should propose a set of theories based on plausible reasoning, develop suitable models, and compare their prediction performance with independent observations. However, these models will be credible only if they incorporate rules describing the dynamics of the process (e.g., dam's effects), thus making its evolution predictable (i.e., river flow will follow a given regime until the dam operates).
Therefore, without clear attribution via exogenous information, trend NHST can only provide a generic indication that further investigation is required (according to the rationale of Fisher and Neyman-Pearson original methodologies). In this respect, such attribution cannot be vague or based on some kind of statistical analysis affected by its own uncertainty, because what is needed is not some sort of statistical correlation but a (substantive) causal physical relationship that should be general and valid beyond the period of the observed records. Therefore, even sophisticated regression models (e.g., GLMs, GAMs, etc.) do not fulfill these requirements as they fall in the class of analog models (Flueck and Brown, 1993) for which extrapolation is not advisable (Cooley, 2013) and easily leads to physically inconsistent predictions (Serinaldi and Kilsby, 2015;Luke et al., 2017). Nonstationarity requires the postulation of a law of temporal evolution of the process, and this law should be based upon substantive hypotheses in order to be general and valid for prediction of still unobserved data (Poppick et al., 2017). Using the example of the dam, GLMs fitted on the data can incorporate time dependent terms but these data-driven regression laws do not say anything about the dam operation rules and their effect, and their extrapolation in time is not supported by any reasoned judgment about the causes of the observed patterns (are they real or spurious? how will they evolve?). On the other hand, additional information on the dam existence and operation and its mathematical formalization can justify the introduction of nonstationary models (e.g., Ayalew et al., 2017). Thus, nonstationarity and corresponding modeling strategies are allowed only if we make (a priori) assumptions about the processes, and the causes of nonstationarity are clearly identified and formalized via deductive reasoning about e.g., the effects of a dam on the river regime. Nonstationarity cannot result from inductive inference from data only, as the observed patterns can be the effect of various unknown causes (persistence, nonlinearity, nonstationarity, etc.), which cannot be discriminated in exploratory studies or misusing questionable confirmatory tools.
It should be noted that these remarks are well-known in climatology (Hasselmann, 1997), for instance, but seem to be overlooked in many hydro-meteorological studies relying almost exclusively on trend testing to draw conclusions. Indeed, according to Mitchell et al. (2001, p. 700), "Detection is the process of demonstrating that an observed change is significantly different (in a statistical sense) than can be explained by natural internal variability. However, the detection of a change in climate does not necessarily imply that its causes are understood... from a practical perspective, attribution of observed climate change to a given combination of human activity and natural influences requires another approach. This involves statistical analysis and the careful assessment of multiple lines of evidence to demonstrate, within a pre-specified margin of error, that the observed changes are: unlikely to be due entirely to internal variability; consistent with the estimated responses to the given combination of anthropogenic and natural forcing; and not consistent with alternative, physically plausible explanations of recent climate change that exclude important elements of the given combination of forcings... Detection (ruling out that observed changes are only an instance of internal variability) is thus one component of the more complex and demanding process of attribution".
These recommendations are fully general and not restricted to the problem of ACC detection and attribution. They highlight the importance of defining the magnitude of internal variability (space-time covariance and dependence structure (Hasselmann, 1993;Poppick et al., 2017)), which is a challenging task (as discussed in Section 3.2 and further in Section 6), as well as the need of jointly using deductive and inductive methods, and excluding other physically reasonable explanations before arriving at a clear attribution.
4. Voces significant res mediantibus conceptis 1 : missing interpretant generates a hiatus between sign and object

Stationarity
In the previous sections, we discussed some theoretical and practical limits of trend testing, including the problems posed by the intrinsic nature of the hydro-meteorological data, the misuse of confirmatory tools in exploratory analyses, and the influence of dependence, as well as the basis and logic of NHSTs and their interpretation. All these aspects raise serious questions regarding the actual information and conclusions that can be drawn from trend NHSTs and exploratory studies relying on them. However, to better understand why nonstationarity cannot be inferred from this analysis we need to go back to basic concepts and definitions.
As the title of this section suggests, we put the emphasis on the meaning of common terms in the context of trend testing, as the semantics of those terms is often confusing in the literature. Although there is ongoing debate about this issue (Koutsoyiannis and Montanari, 2015;Milly et al., 2015), we believe it is worth recalling and expanding it, where necessary, because of its importance for correctly setting up and interpreting data analysis. Unless stated otherwise, throughout this paper we use upper case letters for random variables and lower case letters for values, parameters, or constants.
Referring to Koutsoyiannis and Montanari (2015) for a rigorous presentation of the formal definitions of stationarity and nonstationarity, we recall that "... a stationary stochastic process in the sense of Khinchin ... is a set of random variables X t depending on the parameter t, − ∞ < < +∞ t , such that the distributions of the systems coincide for any n, t 1 , t 2 , ...,t n , and τ". This definition has been translated in various ways such as: "Stationarity means that hydrological variables fluctuate randomly within an unchanging envelope of variability" (Bayazit, 2015), or "... stationarity, or temporally stable probability distribution functions (PDF)," (Rice et al., 2015). Even though such definitions are acceptable in informal discussions, the actual meaning of Khinchin's definition merits some further discussion to avoid misunderstandings. Assuming that t denotes time, Khinchin's definition means that the n-dimensional joint distribution of n random variables is identical independently of their location along the time axis. However, since the mathematical definition refers specifically to random variables X t , the sets of reali- are unavoidably different. By the way, Mandelbrot (1982) (p. 384) emphasized that: "When mathematicians first encountered stationary processes having extremely erratic samples, they marvelled that the notion of stationarity could encompass such wealth of unexpected behavior. Unfortunately, this is a kind of behavior that many practitioners insist is not stationary".
So, the actual problem in inductive exploratory analysis of trends is to understand if such fluctuations are consistent with a unique n-dimensional joint distribution or they come from different distributions. Given the uniqueness of observed hydro-meteorological records and the well-known uncertainty in making inference from very short time series (the most common case in hydro-meteorology), the problem is challenging. In order to make the problem easier to treat, one often focuses only on the first moments (actually, up to second order because of the high uncertainty in estimating higher-order moments (Lombardo et al., 2014)), thus introducing the concept of weak stationarity, where Khinchin's definition reduces to identity of population means E[X i ], population variances Var[X i ], and population covariances over n time steps independently of their location along the time axis.

Ergodicity
We cannot emphasize too strongly the clear distinction between the population properties that are deduced logically from the theory and the sample properties that are determined empirically from observations. Sample estimates are derived from time averages whose relationship to the statistical parameters of the theoretical process must be established only in the form of ergodicity. In order to highlight the importance of ergodicity, it is worth recalling that a stochastic process X(t, ζ) is a collection of time functions depending on the outcome ζ of an experiment , L or a collection of random variables over a parametric support t (time, space, etc.) (Papoulis, 1991, pp. 285-286). Fig. 2 helps to clarify these definitions. For a fixed outcome ζ* (i.e., a fixed coordinate along the 'Event space' axis), X(t, ζ*) is a single time function (or trajectory) describing a sample (or realization) of the given process. One of these realizations can be the sequence of the truly observed records, while the others are possible outcomes that did not occurred 1 Signs correspond to objects through interpretants (Eco, 1976). but could. On the other hand, if t is fixed as t* and ζ varies, then X(t*, ζ) is a random variable describing the state of the given process at time t*. If both t and ζ are fixed, X(t*, ζ*) is a number, i.e. the specific value assumed by the process at the specific time.
In real world applications, we often know only a single finite-size sample of X(t, ζ) (e.g., a sequence of daily stream flow values between year t and + t τ). So, a central problem is to infer the parameters of the underlying stochastic process from such sample. This is possible only if the process is ergodic, meaning that the time average of any (integrable) function g(X(t, ζ)) equals the true (ensemble) expectation E[g(X(t, ζ))], as the size of the available sample tends to infinity (Papoulis, 1991, pp. 427-428). Clearly, this is not possible if E[g(X(t, ζ))] depends on t. Therefore, we must assume a stationary underlying process. Focusing on the mean of the process, ergodicity implies that More generally, ergodicity allows the use of the empirical probability density function f obs  (or f , ) of a sample of X(t, ζ) as an estimate of the probability density functions f Xi ( f , Xj ...) of the random variables X(t i , ζ) (X(t j , ζ), ...) describing the state of the process at time t i (t j , ...). If a process is nonergodic, then statistical inference from data is not allowed because sample averages, variances, and distributions are not representative of their population counterparts. Moreover, we should consider that stationarity is a necessary (but not sufficient) condition to ergodicity of stochastic processes (Koutsoyiannis and Montanari, 2015). Therefore, a nonstationary process is nonergodic; thus, estimates from data are not representative of the process when we claim nonstationarity. In fact, nonstationarity implies that the population distributions f f , , X X i j and f Xk in Fig. 2 are not identical to each other, and thus f obs  is no longer representative of any of them. In fact "The histogram is assumed (at least implicitly) to be an estimate of the marginal STATIONARY distribution. Note that second-order stationarity, or one of the other forms of weak stationarity, is not sufficient; strong stationarity at least must exist for the special case of n = 1 (the number of points, not the dimension of the space). If the random function is not stationary, at least to this extent, then the histogram is not an estimate of a distribution related in a known way to the random function" (Myers, 1989).
Similarly, if f f , , X X i j and f Xk have different moments (e.g., mean and/or variance changing in time), the empirical sample moments are not representative of any of these local population moments.
This can be surprising in light of the extensive use of nonstationary models such as GLMs and GAMs with time-dependent parameters in hydro-meteorological frequency analysis, for instance. Of course, the problem is not about these models by themselves, but their misuse. In fact, these models actually fit local trends (observed in the period of record) that can be due to multiple factors (anthropogenic activity, persistence, nonlinearity, etc.), which in turn cannot be identified by data alone. Moreover, they are often justified owing to the better performance compared with iid versions (which are not challenging competitors), and overlooking more realistic options that can yield patterns close to the observed ones. Nonstationary models are legitimate when there is additional information on the cause of time-dependent behavior. The identification of the cause of local trends is paramount for extrapolation in order to be sure that the nonstationary effects continue beyond the period of record. Without this additional information, prediction based on pure data-driven time-dependent patterns easily yields physically inconsistent results when extrapolating into the future or past (Luke et al., 2017;Serinaldi and Kilsby, 2015;Villarini et al., 2009b). Actually, low reliability and high uncertainty in predictions of evolution of nonstationary patterns might be an index of the little evidence supporting the nonstationary choice (Serinaldi and Kilsby, 2015).

Nonstationarity
At this point, a definition of nonstationarity is required. In order to illustrate nonstationary processes, Koutsoyiannis and Montanari (2015) considered the decomposition = + X d , E where t E is a stationary stochastic process and d t is a deterministic function of time d t ≡ d(t). Milly et al. (2015) proposed a similar representation, namely, E where a t and b t are deterministic and t E is stochastic. We slightly generalize the decomposition suggested by Koutsoyiannis and Montanari (2015) as follows where G[ · ] is a generic operator (some examples are given below). According to Koutsoyiannis and Montanari (2015), a deterministic function of time is "precisely known and perfectly predictable" meaning that a system input corresponds to a single system response, contrasting stochastic dynamics where a single input could result in multiple outputs. Since every inductive analysis based on observed data is always affected by uncertainty, a deterministic function cannot be inferred from the data only, but it should result from deductive reasoning and be validated by data which were not used in the model construction. Notice that this definition is consistent with the idea which became famous as Laplace's demon, i.e. the classical definition of strict physical Fig. 2. Explanatory sketch of a stochastic process X(t, ζ). For a given outcome ζ* of an experiment, the trajectory X(t, ζ*) denotes a sample of the process. For example, the 'obs' case refers to a possible observed time series, while the other patterns are alternative samples associated with other possible results of the experiment. For fixed t*, the set of values along the 'Event space' axis describe the state space of a random variable X(t*, ζ), i.e. the given process at time t*. For fixed t and ζ, bullet points denotes the specific value assumed by the process at a specific time. f ,  are the empirical probability density functions of various samples of X(t, ζ) corresponding to some given values of ζ, while f , X i f , X j and f X k are the probability density functions of the random variables X(t i , ζ), X(t j , ζ), and X (t k , ζ) describing the state of the process at time t i , t j , and t k , respectively. See Section 4.2 for further details.
determinism. According to Laplace, the demon is indeed a superhuman intelligence that could know and model all details of the universe to infinite precision: "for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes" (Laplace, 1814). In other words, if all changes in nature are expressible through mathematical functions of time, complete and precise knowledge of the initial conditions at a certain moment allows one to perfectly predict the conditions at all later (and earlier) times.
However, predictability and determinism are also easy to disentangle in practical applications. As shown in many studies on deterministic chaos, the approximate character of scientific knowledge renders dynamical systems unpredictable even though they are fully governed by underlying deterministic laws (Sivakumar, 2016;Yevjevich, 1974). Actually, determinism is a matter of spatio-temporal scales; in fact, even if the process (i.e., vector function representing the deterministic dynamics) is perfectly known, perfect predictability beyond a given temporal horizon can completely be lost owing to very small uncertainty in initial conditions (Berliner, 1992;Koutsoyiannis, 2010;Lorenz, 1963) that is magnified by possible nonlinearity leading to emergence of (deterministic) chaotic behavior (e.g., von Storch and Zwiers, 2003, pp. 1-2). While chaos theory explains unpredictability of a deterministic system in practice, Laplace's demon assumes perfect predictability under ideal, complete and precise knowledge of the system (including initial conditions). Therefore, the two ideas are compatible with each other, as already recognized by Laplace himself, who wrote "All these efforts in the search for truth tend to lead it (human mind) back continually to the vast intelligence which we have just mentioned, but from which it will always remain infinitely removed" (Laplace, 1814).
The decomposition in Eq.
(2) is a rather general description of nonstationarity. In fact, if G[ · ] is the identity operator, Eq. (2) describes the (simplest) decomposition of the process itself, depends on time by a deterministic function, which can however refer to one or more characteristics of the distribution (e.g., mean, variance, higherorder moments, autocorrelation, etc.). For Bm, as well as fractional Bm and autoregressive integrated moving average processes (ARIMA), nonstationarity refers to deterministic functions of statistical moments, and inference is performed on the increment process = + − Y t X t X t ( ) ( 1) ( ) because taking the first difference yields a stationary process by removing the dependence of the moments on time. Whatever is the specific form of nonstationarity (in mean, variance, etc.), statistical inference (e.g., calculation of moments) only applies to a corresponding stationary process obtained by suitable transformations (e.g., differencing) under the assumption that the original process has a specific form of nonstationarity. However, without an attribution identifying the source of the deterministic dependence on time, nonstationarity cannot be claimed from the data only. Claiming nonstationarity (i.e., the existence of a deterministic function of time for some statistical properties of X t ) on the basis of the outcome of NHSTs such as MK, Pettitt, or unit root NHSTs such as Dickey and Fuller (1979) and Kwiatkowski-Phillips-Schmidt-Shin (Kwiatkowski et al., 1992) is simply not possible owing to the problems discussed in Section 3 and the definitions given above.

What is a trend?
We argue that some widespread misconceptions concerning trend detection and attribution result from a lack of definition of 'trend'. Thus, attempting a reasonable definition and opening the debate about this point seems useful. First of all, the concept of trend should be related to nonstationarity. This seems a reasonable assumption, as we are not interested in local but long-term patterns spanning for instance the entire series of records and resulting from e.g. persistence of underlying stationary processes. In this case, the stochastic nature of persistent (quasi-periodic or monotonic) patterns make their magnitude, onset, and end unpredictable, and a pure stochastic stationary description is sufficient. Thus, we should conclude that a trend of true interest (which is the focus of the largest part of hydro-meteorological literature on the topic) should strictly be related to a form of nonstationarity.
If so, recalling Khinchin's definition of stationarity and the discussion in Section 4.3, a stochastic process has a trend if one or more of its statistical properties vary in time according to a deterministic law of time d t . The function d t can be monotonic, non-monotonic, periodic, and can refer to the average, variance, or other statistical properties of the process (see Section 4.3). In this respect, there is no difference, for instance, among (i) trends defined as smooth, long-range changes in some moment/parameter of the time-varying distribution (as used e.g., in GLM/GAM modeling), (ii) stochastic trends captured by random walktype processes (i.e., Bm, fBm, ARFIMA, etc.), or (iii) trend described by physical equations in processes involving stochastic differential equations or different types of physical-statistical models. In all of these cases, for the parameters of GLMs/GAMs, for the variance of Bm and fBm processes, and for the specific characteristics described by the physical part of physical-statistical models, there is a function d t accounting for deterministic time-dependent evolution of the system. Concerning d t , a misconception widespread in GLM/GAM-based hydrological frequency analysis is the belief that replacing t with other variables makes the model nonstationary. This is true only if such variables are themselves nonstationary, and thus time-dependent according to a well-defined function. Replacing t with teleconnection indices (e.g., North Atlantic Oscillation index) or other variables showing clear stochastic behavior simply yields stationary doubly stochastic models.
If a trend is identified with the existence of a deterministic function of time d t , and thus with nonstationarity, remarks on detection and attribution provided in Section 4.3 apply, and in particular we should exclude the possibility to make inference from data only. For example, seasonal cycles are forms of d t resulting from a fundamental deductive reasoning (exogenous information) concerning planetary dynamics and corresponding mathematical theory. This deductive step allows for the choice of inference tools and then a quantitative evaluation of seasonal components from data for each specific case. Seasonal cycles are predictable with negligible uncertainty as we are almost sure of the occurrence of equinoxes and solstices in the next decades or even centuries, unless the occurrence of unpredictable catastrophic events acting at the solar system or galaxy scale. Here, lack of uncertainty refers to the existence of the seasonal d t , and not to its contingent parametrization, which varies in each specific case.
Under these premises, introducing other forms of trend should rely on the same approach, merging deductive and inductive reasoning. For example, a commonly used approach of comparing nested models with time-varying and constant parameters by using some performance criterion is not enough if the time-dependent function d t does not result from deductive reasoning, but results from simple fitting procedures. In these cases, higher parametrized (time-dependent) models might simply account for local apparent trends, giving very poor performance in prediction owing to lack of identification of substantial causes acting beyond the period of record (Serinaldi and Kilsby, 2015).
Trends of interest in hydro-meteorology are often monotonic or lowfrequency type spanning over the period of record (possibly related to anthropogenic activity). We argue that 'frequency' is actually the main difference between seasonal trends and other forms of deterministic time-dependent trends. Seasonal trends look like monotonic or halfwave trends if the focus is on sub-annual time windows, because the process is not fully developed at such a time scale, and one cannot retrieve signal components with frequencies lower than half period of record, as described by the Nyquist-Shannon sampling theorem. Therefore, even the use of effective filtering methods such as singular spectrum analysis, wavelet analysis, or empirical mode decomposition cannot help in trend identification if we are not able to arrive at a clear attribution of the patterns described by the lowest frequency components resulting from filtering.
Being a form of nonstationarity or its expression, trends are allowed only if they rely on exogenous knowledge involving theoretical arguments or empirically well-defined processes (in agreement with McCuen (2003)). Without such an additional information, trends cannot be inferred from the data only, because they refer to the underlying process … X X X ( , , , ) (i.e., observed time series; in agreement with the starting point of Chandler and Scott (2011)). Without attribution to unique substantive cause and exclusion of any other possible cause, exploratory tools, filtering, or model selection can only highlight local (low-frequency, persistent) fluctuations but they do not allow one to make conclusions on stationarity or nonstationarity. This is in agreement with von Storch and Zwiers (2003) (p. 9) who stated: "Trends in the largescale state of the climate system may reflect systematic forcing changes of the climate system (such as variations in the Earths orbit, or increased CO 2 concentration in the atmosphere) or low-frequency internally generated variability of the climate system. The latter may be deceptive because low-frequency variability, on short time series, may be mistakenly interpreted as trends. However, if the length of such time series is increased, a metamorphosis of the former 'trend' takes place and it becomes apparent that the trend is a part of the natural variation of the system". These remarks are general and hold true not only for climate but also in every context exhibiting large uncertainty about the number, type, and effect of the acting physical processes.
Based on the discussion above, we can then provide an unambiguous definition of trend as: time-dependent deterministic and therefore predictable change d t of the properties of a process X t , where the term "deterministic" implies prediction variance equal to zero (oneto-one relationship). This definition highlights that trends (and nonstationarity) refer to the underlying process, and attempting to infer nonstationarity requires both detection and attribution based on a combination of deductive reasoning, which supports and justifies the existence of a time-dependent deterministic function (i.e., trends and nonstationarity) and inductive reasoning, which provides (i) preliminary knowledge by exploratory data analysis, and (ii) quantification/parametrization of d t by confirmatory/disproving analysis and modeling. We stress that 'prediction variance equal to zero' does not refer to the specific parametrization of d t but its existence and its overall evolution. For example, the parametrization of seasonal trends (components) obviously varies even for the same process observed at different locations, but the existence of the seasonal cycle and its effects in terms of alternation of wet/dry and cold/warm conditions along the calendar year are predictable with (almost) no uncertainty. Other forms of trend/nonstationarity are allowed only if they are supported by the same kind of deductive and inductive arguments.

Trend and abrupt change tests: an overview of overlooked critical aspects in practical applications
In this section, we discuss some problems concerning the practical application of two NHSTs for trends, i.e. the Mann-Kendall (MK) (Mann, 1945;Kendall, 1970) and Pettitt tests (Pettitt, 1979). The following remarks apply under the assumption that we disregard logical arguments in Sections 3 and 4, still apply these tests for exploratory analysis, and use them to make conclusions on trends/nonstationarity. Among many available statistical testing procedures devised for assessing the significance of a change (e.g., Kundzewicz and Robson, 2004), the MK and Pettitt tests are widely used rank-based nonparametric tests to check the presence and timing of slowly-varying and abrupt changes in the mean or median of hydro-meteorological variables such as rainfall, runoff, and temperature (e.g., Villarini et al., 2009a;2011b;Ferguson and Villarini, 2012;Rougé et al., 2013;Tramblay et al., 2013;Guerreiro et al., 2014;Sagarika et al., 2014;Rice et al., 2015;Mallakpour and Villarini, 2015;Ahn and Palmer, 2016;Archfield et al., 2016;Do et al., 2017, among others).
The popularity of these tests is related to their simplicity in terms of implementation, their robustness against outliers or measurement errors (as they are rank-based), and the availability of exact or asymptotic distributions of their test statistics under null hypothesis {H 0 : no trend/ no change} and independence, i.e. iid conditions. Moreover, being based on the so-called Mann-Whitney statistic, the Pettitt test and the MK test are formally related to each other, thus highlighting that distinguishing between slowly-varying and abrupt changes is only a matter of scale and time of evolution of the change (Rougé et al., 2013;Serinaldi and Kilsby, 2016a).
Even though these tests are used to check changes in the mean or median, the first myth to dispel is that MK and Pettitt tests are devised to detect changes in the central tendency summary statistics. They actually check a wider hypothesis called stochastic ordering. Given a sequence of random variables where F b (F a ) is the common distribution of the m ( − − n m 1) random variables before (after) the change point (Pettitt, 1979). Therefore, even though these hypotheses are commonly restricted to a shift in the location parameter, these tests are sensitive to all possible conditions resulting in stochastic ordering (Serinaldi and Kilsby, 2016a).
Based on the above belief, when a change point or a monotonic trend is detected, often the magnitude of the abrupt change is quantified by the difference in mean or median between the sub-series before and after the change, while the trend by the so-called Sen's slope (e.g., Khaliq et al., 2009b;Rice et al., 2015;Nilsen et al., 2016;Tananaev et al., 2016). In light of the actual nature of MK and Pettitt tests, such quantification is not justified, especially for MK. In fact, even if we assume that MK only checks for changes in mean/median, it refers to monotonic changes that can be linear or nonlinear (stepwise, Sshaped, or abrupt as a limiting case) resulting from more general changes in the overall shape of the distribution. Of course, the choice of linear trends reflects practical requirements; indeed, assuming more complicated (higher-parametrized) patterns can be unjustified for usually short time series because of the additional uncertainty affecting the estimation. Since Sen's estimator for the slope of a linear trend is rank-based (nonparametric), it is considered more robust than classical mean square error (MSE); however, its nonparametric nature does not make it more coherent with MK outputs than MSE estimates of linear trends. Even though the need of quantifying a possible change is understandable, reducing the indication of possible monotonic trends given by MK to that of a linear trend is too restrictive, and does not reflect the rationale and outcome of MK test. Since perfectly linear trends rarely describe realistic evolution patterns of complex hydrometeorological processes (even under actual deterministic forcings), such a kind of quantification should be at most purely qualitative, and possibly avoided in order to provide a correct communication. In any case, it cannot be considered an actual trend in light of discussion on detection and attribution in Section 4.4.
Of course, trend tests can only detect inhomogeneities within the time interval covered by the observed records. This also explains why they cannot be used to infer nonstationarity: stationarity is a property of the theoretical process X t , for − ∞ < < +∞ t , and concerns the identity of population statistical properties for every subset of random variables in every point of the time line, while trend tests can only check possible changes in finite and usually short time windows where observed fluctuations might easily be spurious. Since we cannot extrapolate conclusions beyond the period of records without identifying a deterministic and predictable cause of such inhomogeneities, the outcome of trend tests cannot be used to justify the application of nonstationary models for frequency analysis. Such a usage is inappropriate and might lead to unrealistic predictions (Serinaldi and Kilsby, 2015).
The previous remarks have important consequences on the procedures used to account for the effects of the temporal correlation. The effect of the autocorrelation on tests devised for independent data is a general increase of the rejection rate of the null hypothesis {H 0 : no trend} of the statistical test, even if the underlying process is stationary. This is due to the information redundancy that makes the effective sample size smaller than the observed size, thus implying that the effective variance of the test statistics to be used in the testing procedure under serial dependence is larger than that provided by standard results obtained under the hypothesis of independence (e.g., Bayley and Hammersley, 1946;Koutsoyiannis and Montanari, 2007). This phenomenon is known as variance inflation and has been accounted for using three general approaches: the explicit calculation of the inflated variance (e.g., Hamed and Rao, 1998;Koutsoyiannis, 2003;Matalas and Sankarasubramanian, 2003;Yue and Wang, 2004;Hamed, 2008;2009b), prewhitening procedures (e.g., Katz, 1988;Kulkarni and von Storch, 1995;von Storch, 1999;Yue and Wang, 2002;Bayazit and Önöz, 2007;Hamed, 2009b), and bootstrap techniques (Khaliq et al., 2009a;Kundzewicz and Robson, 2004).
Referring to Khaliq et al. (2009b) and Bayazit (2015) for a review, we focus on some aspects that are generally overlooked: 1. Firstly, all tests involving the iid hypothesis should be corrected for the effect of autocorrelation. Neglecting this aspect might lead to contradictory results further discussed in Section 6.3. 2. Since some procedures involve trend removal (e.g., Hamed, 2008), this is usually supposed to be linear. As mentioned above this choice is understandable from a practical point of view but less defensible if it is interpreted as a deterministic evolution of some physical (hydro-meteorological) process. Linear trends cover a very limited subset of the actual hypothesis tested by MK and Pettitt tests as well. 3. The procedures proposed in the literature consider corrections based on the autocorrelation values estimated on the data themselves. This poses two problems: (i) for short time series, autocorrelation is generally underestimated (e.g., Koutsoyiannis, 2003;Koutsoyiannis and Montanari, 2007), where the bias is larger if the underlying process exhibits long range dependence (LRD); thus, when the correction procedure involves a specific dependence structure (often Markovian), autocorrelation should be adjusted (see Serinaldi and Kilsby (2016a) and Appendix A); and (ii) it is taken for granted that the dependence structure of the underlying process can be retrieved by the analyzed summary statistics (usually, annual minima, averages, maxima, etc.). The point (ii) is subtle but critical; in fact, the behavior of summary statistics can be strongly influenced by the nature of the underlying process; for example, processes with LRD yields maximum values over blocks of observations that tend to cluster in time (e.g., Bunde et al., 2005;Eichner et al., 2011). This results in apparent trends in terms of frequency and magnitude if the analysis relies on short series of such maxima, even though these summary statistics might easily show no or very weak autocorrelation. Since this behavior is found in stream flow time series (Serinaldi and Kilsby, 2016c), we show in the case study that it might have a dramatic effect on trend NHST outcomes. 4. In some cases, correction procedures are flawed, failing to provide any adjustment. As an example among others, the so-called trendfree prewhitening (TFPW)  was shown to be theoretically flawed (Serinaldi and Kilsby, 2016a), as its original version does not address the variance inflation problem, which can be even exacerbated. Since it has been widely applied thanks to its relative simplicity, results of several analyses reported in the literature should be taken with great care and possibly revised.
Two aspects characterizing several published trend analyses based on NHST need to be mentioned: (i) the correctness of applied tests (or methodology in general) is almost always taken for granted, while a preliminary check of their performance under H 0 (controlled conditions) should be performed (by simulation) before their application, especially if the tests were not developed by statisticians and result from some empirical reasoning without a necessary mathematical proof (as shown for the hybridization of Fisher and Neyman-Pearson methods in Section 3); (ii) empirical results are often interpreted without the necessary rigor, thus resulting in misleading conclusions, confusing artifacts with meaningful results.

Case study
In this section, we investigate the consequences of the above discussion on data analysis and its interpretation. To this aim, we use data already analyzed in the literature to show how results and conclusions can remarkably change if we account for logical, methodological, and practical issues discussed in previous sections. Note that our analysis is not exactly a study of reproducibility because data and some methods are not precisely equal to those applied in previous studies. However, the use of MK and Pettit is justified for sake of comparison with previous studies, and key general results are reproduced and then compared with new findings relying on more realistic null hypotheses.

Observational data
Long term trends in stream flows over the conterminous United States (CONUS) have been extensively studied. Referring to Sagarika et al. (2014) for a recent review, we recall that the past studies focused on various summary statistics and/or data sets, including peak discharge records (Barrett and Salis, 2016;Hirsch and Ryberg, 2012;Lins and Cohn, 2011;Mallakpour and Villarini, 2015;Villarini et al., 2009a;Villarini and Smith, 2010;Villarini et al., 2011a;Vogel et al., 2011), monthly data (Kalra et al., 2008;Lettenmaier et al., 1994), and mean daily observations (Ahn and Palmer, 2016;Lins and Slack, 1999;McCabe and Wolock, 2002;Rice et al., 2016;Sagarika et al., 2014). The interest for such an area is not only practical, but is also related to the great variety of hydrologic regimes/conditions across CONUS, as well as the availability of data and metadata, which allow for more accurate studies than in other parts of the globe.
Since the trend analysis described below (Section 6.2) requires both daily data and summary statistics (i.e., maxima or averages) on a seasonal and annual basis, in this study, mean daily flow records are used. The data set is extracted from the Hydro-Climatic Data Network (HCDN-2009) (Lins, 2012), which comprises 743 stations maintained by the U.S. Geological Survey (USGS). HCDN-2009 is a subset of the wider USGS GAGES-II (Geospatial Attributes of Gages for Evaluating Streamflow, version II) reference stations providing geospatial data and classifications for 9 322 stream gages. HCDN-2009 provides a stream flow data set suitable for analyzing hydrologic variations and trends in a climatic context, as it includes quality-controlled time series from stations that were screened to exclude sites where human activities or other activities affect the natural flow, and with sample size sufficiently large for analysis of patterns in stream flow over time (Lins, 2012). A list of HCDN-2009 stations along with basic attributes can be found at the web site http://water.usgs.gov/osw/hcdn-2009/, while the data set is freely available at http://waterdata.usgs.gov/nwis/sw. This study focuses on 250 stations having continuous and simultaneous observations with no missing values between the water years 1951 and 2011 included (i.e., October 1950 to September 2011). Following Sagarika et al. (2014), the data set comprises only one station on a particular stream within each U.S. hydrologic unit code (HUC), to reduce spatial bias in the results. Moreover, even though some stations have continuous data spanning longer periods, we selected only simultaneous data from 1951 to 2011 to guarantee temporal homogeneity and to allow some remarks on spatial correlation discussed later in Section 6.3. For seasonal analysis, seasons are defined as autumn (October-December), winter (January-March), spring (April-June), and summer (July-September).

Testing local significance
In this study, possible slowly-varying trends and abrupt changes of some stream flow properties are analyzed by MK and Pettitt tests in four different settings: 1. Classical versions devised for independent random variables.
Hereinafter they are denoted as 'standard MK' and 'standard Pettitt'.

A corrected and unbiased TFPW (TFPWcu) version of both tests
accounting for first-order autoregressive AR(1) dependence (i.e., classic Markovian dependence) and bias correction for ACF underestimation (denoted as AR(1)-TFPWcu MK and AR(1)-TFPWcu Pettitt). TFPWcu procedure is applied to show that a correct TFPW procedure yields results different from those of the classical setting, highlighting that the similarities of results often recognized in the literature are actually artifacts (see Section 5). The reader is referred to Serinaldi and Kilsby (2016a) for further details. 3. A prewhitening version accounting for fGn dependence proposed by Hamed (2008) for MK test and adapted by Serinaldi and Kilsby (2016a) for Pettitt. This version allows one to account for long range dependence (LRD) and improves the original prewhitening procedure by introducing bias corrected estimates of the Hurst parameter H (characterizing the fGn ACF) based on the formulas provided in Appendix A. These tests are denoted as fGn-CPW MK and fGn-CPW Pettitt, where CPW indicates 'conditional prewhitening', meaning that the prewhitening procedure is applied only if H is found significantly different from 0.5 at the 5% significance level. These versions are detailed in Hamed (2008) and Serinaldi and Kilsby (2016a). 4. The last version is based on Monte Carlo simulation of daily stream flow sequences in order to check the impact of daily dynamics on the annual/seasonal statistics and trend test outcomes. This way, we introduce a more realistic null scenario in terms of dependence structures that is built by exploiting the whole available information instead of few tens of annual/seasonal summary statistics. In more detail, each daily stream flow time series is deseasonalized (following the procedure described by e.g., Serinaldi and Kilsby (2016c) and Serinaldi and Kilsby (2016b)), and residuals are resampled by the iterative amplitude adjusted Fourier transformation (IAAFT) method (Schreiber and Schmitz, 1996;Kugiumtzis, 1999;Schreiber and Schmitz, 2000;Venema et al., 2006a;2006b;Serinaldi and Lombardo, 2017), which allows for simulation of surrogate data preserving almost exactly the empirical distribution function and power spectrum (ACF) of the observations. IAAFT surrogates are stationary (Franzke, 2013) by construction because of randomization of Fourier phases. Combining surrogate residuals and seasonal components yields synthetic daily stream flow time series under the null hypothesis preserving almost exactly both the marginal distribution and correlation structure of the observed ones. Small discrepancies in marginal distributions do not matter as the used trend tests are rank based. Therefore, summary statistics of interest (here, averages and maxima on a seasonal and annual basis) are extracted and standard MK and Pettitt tests are applied. The procedure is repeated many times to obtain the sampling distribution function of MK and Pettitt test statistics accounting for the dependence properties of the daily process under stationary conditions. In other words, our new null hypothesis is 'observed trends in annual/seasonal values are consistent with patterns coming from a stationary daily process with given (observed) marginal distribution and dependence structure'.

Testing field significance
When data from multiple stations are analyzed, one can ask whether the results imply that there is a significant effect when considering the entire group of stations, i.e. the so-called field significance (Daniel et al., 2012;Katz and Brown, 1991;Livezey and Chen, 1983;Wilks, 1997;2006). This recognizes that, when performing multiple tests, it is more likely to detect significant changes by chance. This probability increases if the data are spatially correlated. In this case, spatial correlation acts similarly to temporal correlation, introducing information redundancy owing to possible similar patterns across spatially correlated sequences. Referring to Khaliq et al. (2009b) for an overview of methods to treat field significance, we recall that they fall in two categories (Daniel et al., 2012). One controls the false discovery rate (FDR, i.e., the expected fraction of local null hypothesis rejections that are incorrect) (Benjamini and Hochberg, 1995;Benjamini and Yekutieli, 2001) and is nearly equivalent to the Walker test (Fisher, 1929;Katz, 2002;Wilks, 2006). The other relies on counting the number of rejections at local level and then comparing these values with the selected critical values obtained from the empirical distribution of number of rejections resulting from bootstrap procedures preserving (approximately) spatial or spatio-temporal correlation (Khaliq et al., 2009b;Wilks, 1997). Daniel et al. (2012) highlighted that the choice of method depends on the spatial nature of the studied effect (e.g., trend). If it is expected that the effect is widespread (local), it would be preferable to use the count-based (FDR approach), while a combination of both approaches could be applied if there is no a priori expectation. In this study, we focus on the Walker test because (i) it is easy to implement and robust to cross correlations (Khaliq et al., 2009b;Wilks, 2006), and (ii) the counting method requires intensive and different simulation procedures for IAAFT-based trend tests and the other tests (Standard, AR(1)-TFPWcu, and fGn-CPW). We recall that the Walker test consists of comparing the smallest of p-value corresponding to the test statistics computed on k time series with the critical value = − − p α 1 (1 ) k W global 1/ : the global null hypothesis {H 0 : no global trend} may be rejected at the a global level α global if the smallest of k independent local p-values is less than or equal to p W (Wilks, 2006). The counting method is only applied to check the variability of field significance for the lag-1 ACF term ρ 1 , and the Hurst coefficient H in the series of annual/seasonal summary statistics (see Khaliq et al., 2009b, p. 121, for a detailed description). All tests are performed at the 5% local and global significance level.

Temporal dependence of maximum and mean flows
Given the influence of autocorrelation on detection of possible trends, the dependence structure of annual and seasonal maximum and mean flows are preliminarily investigated. Owing to the limited sample size of such a type of data, only parsimonious models such as AR(1) and fGn are usually considered, even though the latter would require very long time series for a reliable inference. Figs. 3 and 4 show the spatial distribution of the sites where ρ 1 and H are found to be locally and globally significant, along with the sites where both models are locally significant. Global significance is assessed at the HUC scale. Of course, the number of sites showing possible significant persistence are higher for mean values than for maxima owing to the higher variability of the latter. No particular spatial patterns are evident.
These results are not surprising if we consider the persistence of the underlying flow process and recognize that maxima are values sampled from the daily process, while mean values result from averaging all daily data in each season or year. Nonetheless, Figs. 3 and 4 allow some methodological remarks. When we test significance for ρ 1 and H, we are implicitly assuming an underlying model (AR(1) or fGn) as alternative hypothesis H 1 ; this implies that the estimates should be corrected accordingly for possible bias. For example, testing ρ 1 under the assumption that the underlying process is fGn requires a bias correction procedure different to that used under the assumption of AR(1) process. This aspect is often neglected and ρ 1 is commonly tested using biased estimates yielded by default estimators working under iid assumptions (Koutsoyiannis, 2016).
Another remark concerns the output of the Walker test and counting method for global significance. When the two approaches yield different results, the Walker test tends to identify more areas with significant results; according to the nature of the Walker test (which is sensitive to local effects), in some of these cases, field significance results from a single locally significant station falling in the HUC area, especially if the number of stations in that area is small. Daniel et al. (2012) provided some insight into selecting suitable sub-regions to assess field significance. To avoid bias, sub-regions should be identified a priori (before performing the analysis) and domain limitation should be made as a result of some physical insight. In this respect, HUCs are a credible choice. However, Daniel et al. (2012) also stress that "If some region in a domain is seen to contain a large number of significant stations, it is certainly inappropriate to apply the field significance test over just this limited domain without a physically based justification". Also in this case, the clustering of significant trends might be due to the spatial correlation of (large scale) meteorological variables driving the flow processes in a specific area. In other words, local clusters in space are the natural effect of spatial correlation, as local clusters of high/low values in a time series might reflect temporal correlation. Since it is easy to confuse spatio-temporal correlation with deterministic trends, these remarks highlight the importance of using clear definitions in order to distinguish between stochastic fluctuations and deterministic changes whose attribution should be based on a priori information and causality. Fig. 3. Spatial patterns of significant values (at the 5% significance level) of ρ 1 and H for the annual flow maxima on an annual and seasonal basis. Left column of panels refers to ρ 1 , middle column to H, while the right column highlights the sites where both ρ 1 and H are found significant. Maps in left and middle columns also show hydrologic units where ρ 1 and H are found to be field significant according to Walker test, bootstrap test, and both. See text for further details.

Trend analysis of maximum and mean flows
Results of trend analysis are shown in Figs. 5-8. According to the aim of this study, as for correlation, the following remarks focus on methodological aspects. In fact, the spatial patterns of significant trends and their sign are similar to those published in the literature (e.g., Sagarika et al., 2014), while their interpretation deserves some remark. For example, AR(1)-TFPWcu MK and fGn-CPW MK tests on maxima (Fig. 5) tend to give similar number of rejections, which is smaller than that of standard MK. This contrasts with results often reported in the literature and confirm that they are related to the ineffectiveness of the original TFPW procedure. As a variety of hypotheses has been proposed to explain the disagreement between TFPW and other prewhitening procedures in terms of possible physical causes, it is therefore instructive to discover that they are pure speculations, and results are simply artifacts.
Another important result is the strong decrease of significant trends obtained by IAAFT MK. The annual/seasonal maxima extracted from (stationary) series reproducing the observed persistence of the daily sequences show apparent trends that are stronger than those resulting from persistent models directly fitted on annual/seasonal maxima, thus reducing the evidence for deterministic trends in such summary statistics. In other words, focusing on annual/seasonal maxima might lead to underestimate variability and temporal clustering of high/low values.
Similar remarks hold for annual/seasonal average values (Fig. 6). It should be noted that the residual clusters of positive trends in the New England (north west) for summer maximum and mean values (Figs. 5(t) and 6(t)) are coherent with the spatial correlation and climate dynamics of that area (Kingston et al., 2007). This does not mean that the New England rivers have not witnessed a possible increase in the last 60 years; however, it might be explained in terms of spatial correlation as a shared behavior depending on a common meteorological driver acting in an area characterized by quite a uniform response. Obviously, from a practical point of view, such an increase raises management problems, whose solutions however change if we assume that these changes are deterministic (in the sense specified in Section 4) or stochastic. In fact, F. Serinaldi et al. Advances in Water Resources 111 (2018) 132-155 in the first case we should identify a predictable evolution law (making attribution), which is unlikely linear or polynomial, and cannot be deduced from the data themselves in the form of some arbitrary smoothing function.
Results for the Pettitt test in Figs. 7 and 8 highlight another aspect often overlooked in the literature and related to the fact that both tests rely on the Mann-Whitney statistic and they are theoretically related to each other (Rougé et al., 2013). This implies that standard MK and Pettitt often yield similar results in terms of significant changes in a given direction (upward/downward), even if such results are obviously interpreted in a different manner (slowly varying monotonic changes and abrupt changes) (e.g., Sagarika et al., 2014;Pathak et al., 2016). The link between the two tests also implies that both are sensitive to autocorrelation (Serinaldi and Kilsby, 2016a), and they should yield coherent results when autocorrelation is accounted for. A comparison of Figs. 5 and 6 with Figs. 7 and 8 supports this conclusion showing that both tests yield similar spatial patterns in terms of significant upward/ downward changes. In this context, distinguishing and interpreting slowly varying and abrupt changes can be secondary as it is simply a matter of scales (e.g., Rougé et al., 2013). This highlights once again the need for attribution based on exogenous information which is not derived exclusively on purely statistical non-causal relationships. Results for IAAFT-based tests also show the dramatic decrease of evidence for deterministic changes when fluctuations of seasonal/annual averages and maxima are influenced by the entire flow process at daily scale. Therefore, even if the autocorrelation of the observed seasonal/annual averages and maxima is properly accounted for in AR(1)-TFPWcu and fGn-CPW trend tests, the above results reveal that it is not sufficient and might strongly underestimate the actual persistence of the underlying processes. When this is taken into account, apparently strong trends/ changes might become coherent with the intrinsic variability characterizing stationary but persistent processes.
Recalling the discussion about the natural local clustering of significant results due to spatial correlation and testing multiplicity (Daniel et al., 2012), we studied how the number of significant trends globally changes across the CONUS for the different types of tests, i.e. the different treatment of the autocorrelation. Fig. 9 summarizes the global number of rejections for all combinations of data and tests. Standard tests (especially Pettitt) generally yield higher number of rejections for maximum values. For mean flows, there is not much Fig. 5. Spatial distribution sites showing significant monotonic trends (at the 5% significance level) according to MK test for the annual flow maxima on an annual and seasonal basis. Panels in each row, from left to right, show results for standard, AR(1)-TFPWcu, fGn-CPW, and IAAFT versions of the MK test. Hydrologic units exhibiting field significance according to Walker test are also highlighted. difference between standard tests and AR(1)-TFPWcu and fGn-CPW versions. In all cases, IAAFT version yields a dramatic decrease of significant outcomes. Fig. 9 also reports the 95% prediction bands (continuous red lines) of the number of rejections expected over 250 independent trials (i.e., performed tests). Prediction limits correspond to the 2.5th and 97.5th percentiles of a binomial distribution with parameters 250 and 0.05 (i.e., the number of trials and the rate of successes). The diagrams show that IAAFT-based tests yield a number of occurrences which is almost always consistent with what is expected when a purely random experiment with 5% of probability of success is performed. In some cases the outcome is also close to the expected value ⌊ ⌋ = 250·0.05 12. However, the binomial distribution describes the outcome of independent experiments, whereas stream flow time series are spatially correlated, especially in some specific areas reacting in a similar way to common (large-scale) meteo-climatic dynamics. Cross-correlation can be accounted for by simulation; however, a fast computation, which is very accurate in several cases, can be performed by using the beta-binomial distribution (see Appendix B). Considering the 2.5th and 97.5th percentiles of the beta-binomial distribution (red dot-dashed lines in Fig. 9), the number of significant outcomes always fall within the prediction bands for IAAFT-based tests, while the rate of rejection for AR(1)-TFPWcu and fGn-CPW tests is less unexpected than under spatial independence. In this respect, it is worth recalling that the pairwise spatial correlation terms involved in the beta-binomial parametrization refer to the spatial correlation of seasonal/annual averages and maxima; therefore, stronger correlation and over-dispersion is expected if the spatial correlation of daily series is taken into account. It should also be noted that the estimation of the spatial (cross) correlation is affected by temporal correlation (Katz and Brown, 1991;Hamed, 2009a;. Accounting for these aspects results in stronger spatial correlation and further increase of over-dispersion. This implies wider prediction intervals of the number of significant outcomes, and thus even less evidence for global field significance. These results confirm the dramatic impact of the spatial correlation on field significance (Douglas et al., 2000) and the double effect of the autocorrelation on local trend detection and estimation of spatial correlation. If we also consider that autocorrelation is a non optimal measure of dependence implying systematic underestimation of the actual intensity of persistence (as shown above), it can be concluded that we often strongly underestimate the actual spatio-temporal variability of the processes generating the analyzed data.
We stress once again that all the above analyses do not allow any conclusion about the actual stochastic or deterministic nature of the F. Serinaldi et al. Advances in Water Resources 111 (2018) 132-155 observed trends. Results only tell us that a stochastic stationary representation cannot be excluded, and several results reported in the literature simply depend on the choice of an unrealistic stationary option (iid) as a null hypothesis. In other words, observed trends are consistent with both a stationary and nonstationary assumptions, when we choose a suitable and more realistic stationary benchmark. The choice between the two modeling options depends upon a stringent process of attribution supported by additional information and the clear identification of a cause excluding any reasonable alternative explanation. This process goes beyond the application of trend tests or detection of statistically significant correlation with other hydro-meteorological variables by NHSTs, which, we recall, suffer logical flaws and are not devised for exploratory studies.

Discussion and conclusions
Published trend analysis in hydro-meteorology often consists of a superficial application of statistical tools, such as NHSTs, as cookbook recipes. This attitude is further spread by the availability of powerful statistical software implementing the state of the art of statistical methodologies developed by statisticians but also questionable procedures developed by practitioners. As already highlighted by von Storch and Zwiers (2003), this approach is particularly dangerous for anyone who is not sufficiently acquainted with the basic concepts of statistics. Moreover, software availability also promotes the tendency to jointly apply combinations of sophisticated techniques (often obscure, redundant or even contrasting with each other) that compound and amplify the problems caused by the indiscriminate use of recipes (von Storch and Zwiers, 2003, p. 97).
Since every hydro-meteorological record is the only available realization or trajectory of the underlying process, this prevents the application of confirmatory analysis because any statement, or null hypothesis, cannot be contested with a statistical test since independent data are unavailable. No statistical test, regardless of its power or complexity, can overcome this problem because it is not a matter of methodology but of available information. The only possible solution is extending this information by additional data going backward (i.e. collecting paleo data sets) or forward (i.e. awaiting the availability of new data to test theories). In this respect, the use of physics-based models can partly help (e.g., Poppick et al., 2017). However, even such Fig. 7. Spatial distribution sites showing significant abrupt change (at the 5% significance level) according to Pettitt test for the annual flow maxima on an annual and seasonal basis. Panels in each row, from left to right, show results for standard, AR(1)-TFPWcu, fGn-CPW, and IAAFT versions of the Pettitt test. Hydrologic units exhibiting field significance according to the Walker test are also highlighted. models can never be fully validated/disproved as we do not know if they capture all the important properties of the physical processes, and thus the answers given by such models could simply be spurious. A rigorous attribution is required to attempt the identification of unique causes and exclusion of any other plausible alternative.
Based on the above discussion, the actual meaning, interpretation, and limits of trend tests should be recovered. As every statistical test, trend tests can be valuable tools in appropriate contexts, while they cannot be an appropriate tool to infer nonstationarity. They can at most be used as mere tools for preliminary screening whose outcome should be carefully checked and complemented with exogenous information. If a clear physical mechanism related to a predictable evolution of the properties of the process at hand is not identified, we cannot make conclusions about the reason of rejection or lack of rejection, since multiple factors not included in the null and alternative hypotheses can actually play a role.
The case study presented in this paper shows the dramatic difference resulting from the use of trend tests involving different dependence structures, namely, (unrealistic) independence, AR(1) and HK dependence estimated from the target annual summary statistics, and empirical dependence of target variables resulting from that of the parent daily process. We stress that our discussion and these results do not support stationarity versus nonstationarity. Dependence is only used as a more realistic and challenging alternative to deterministic trends in order to show that NHSTs are actually inconclusive when we compare two options yielding similar observations, and a decision concerning the generating mechanism of the studied process and its modeling cannot rely on data only. Since trend tests are also used as automatic criteria to justify the use of nonstationary models for frequency analysis, we discourage such a kind of cookbook recipe usage, as it is inappropriate and might lead to unreliable and paradoxical predictions (Serinaldi and Kilsby, 2015). Of course, nonstationary modeling is still legitimate when it is justified by preliminary attribution relying on additional deductive information on the cause of timedependent behavior.
It should also be noted that this study does not focus on the interpretation of the physical meaning of probability (e.g., frequentist or Bayesian approach to data analysis) but on the role of inductive and deductive information and reasoning in the scientific inferential procedure. Following Christakos (2011, pp. 177-181), the knowledge bases can be classified between two major categories: general (or core), denoted by KB-G, and specificatory (or site-specific), KB-S, whereby the former may include scientific theories, natural laws, phenomenological models, cultural relations, and long-established worldviews, while the F. Serinaldi et al. Advances in Water Resources 111 (2018) 132-155 latter considers different sources of evidence that are tied to the particular local situation and may be not of general validity (i.e., exact and inexact (uncertain) measurements and records). This study shows how a mechanistic application of tests (and models) involving only KB-S is not sufficient to draw conclusions on stationarity or nonstationarity (regardless of the complexity or refinement of the analysis tools) but seem to be the only source of information used in much literature. On the other hand, we emphasize the fundamental role of KB-G that, being based on wisdom of the past, seems to be "irretrievably lost in the postmodern world" (Christakos, 2011, p. 179). In this respect, how KB-G and KB-S are blended is a secondary aspect, although Bayesian framework offers attractive tools to perform a synthesis (Christakos, 2011, pp. 375-380). Trend tests may be tools for very preliminary screening, for example, in large scale analyses (involving e.g., large number of time series) concerning data quality control, where we are interested in detecting time series affected by systematic instrumental errors. In these cases, we know a priori the mechanism generating 'nonstationarity' (i.e., instrumental malfunction) and its possible effects, and an unambiguous attribution can be made from the knowledge of the instrument specifics. In other cases, rejections can be used to identify primary sub sets deserving further investigations for a clear attribution of the origin of the detected inhomogeneities. However, lack of rejection does not authorize the conclusion that nothing is happening in the remaining subset, which should be further analyzed in any case. Other conclusions in this exploratory context go beyond what the trend tests can tell us. can be found at the web site http://water.usgs.gov/osw/hcdn-2009/, while the data set is freely available at http://waterdata.usgs.gov/ nwis/sw. The authors wish to thank three anonymous reviewers for their remarks and constructive criticisms, and Prof. Hans von Storch (Helmholtz-Zentrum Geesthacht, Germany) for his useful comments on an earlier version of this paper. The analyses were performed in R Development Core Team (2016).  (MK or Pettitt) applied to a given variable (maxima or averages) in a given scale (annual or seasonal). Continuous lines denote the 95% prediction bands for the binomial distribution, while dashed lines indicate the 95% prediction bands for the BB distribution accounting for the spatial correlation.

MK
p ∈ [0, 1] represents the constant probability of a success in n trials, when p is assumed to be no longer constant but fluctuating according to a beta where B denotes beta function, and α and β are two positive shape parameters. The BB density function can be written as (Skellam, 1948) = ⎛ ⎝ ⎞ ⎠ + − + f k n k k α n k β α β ( ) B( , ) B( , ) , (B.1) Letting = + π α α β /( ), the mean and variance of beta-binomial can be written as (Ahn and Chen, 1995 where ρ jk denotes the pairwise correlation between experiment (site) j and k. In the context of this study, ρ BB is therefore the average cross-correlation between the time series recorded across the CONUS area, which is ≈ 0.04. This is the value used to compute the prediction limits reported in Fig. 9. It is worth noting the impact of ρ BB despite its apparently small value.