Environmental concentrations as ratios of random variables

Human-induced environmental change increasingly threatens the stability of socio-ecological systems. Careful statistical characterization of environmental concentrations is critical to quantify and predict the consequences of such changes on human and ecosystems conditions. However, while concentrations are naturally defined as the ratio between solute mass and solvent volume, they have rarely been treated as such, typically limiting the analysis to familiar distributions generically used for any other environmental variable. To address this gap, we propose a more general framework that leverages their definition explicitly as ratios of random variables. We show that the resulting models accurately describe the behavior of nitrate plus nitrite in US rivers and salt concentration in estuaries in the Everglades by accounting for heavy tails potentially emerging when the water volume fluctuates around low values. Models that preclude the presence of heavy tails and the related high probability of extreme concentrations could significantly undermine the accuracy of diagnostic frameworks and the effectiveness of mitigation interventions, especially for soil contamination characterized by a water volume (i.e. soil moisture) frequently approaching zero.


Introduction
Anthropogenic activities and climate change are responsible for deteriorating water, soil, and air quality, calling for improved diagnostic frameworks to assess primary contamination threats and devise appropriate remediation interventions [1][2][3][4]. As increased pollutant concentrations pose substantial threats to human health and, more generally, to the Earth's system resilience [5][6][7], environmental concentrations are key variables to quantify the consequences of environmental changes [8,9].
The impact of pollutants on humans, plants, and animals can be characterized by either a threshold or non-threshold response [10]. The majority of carcinogenic contaminants are generally described by a non-threshold behavior [11,12], meaning that their adverse effects occur at all levels of exposure with linearly increasing risk associated with increasing concentration. On the contrary, most environmental contaminants result in threshold response [13,14], with non-linear exposure-response curves, where vulnerability suddenly escalates as critical thresholds are exceeded [15,16], growing exponentially for concentrations larger than a threshold value [17][18][19][20].
Threshold response behaviors arise, for example, from the response of ecological communities to contaminants [14,21]. Multi-species assemblies of plants and animals are known to be resilient to biotic and abiotic stress, with species diversity being positively correlated with ecosystem stability [13]. Yet, abrupt changes and regime shifts take place when critical concentrations are crossed [14]. These events tend to be related to the presence of ecological tipping points [22][23][24]. Sea-level rise, for instance, has caused significant salinization of coastal wetlands due to seawater intrusion [25,26]. Vegetation can cope with increasing salinity up to a species-specific salt-tolerance threshold, after which mortality rates soar and regime shifts to salt-tolerant vegetation occur [27][28][29][30][31]. Such non-linear responses make it crucial to accurately describe elevated concentrations and their propensity to cross critical thresholds.
Because of this highly fluctuating and unpredictable nature, environmental concentrations have been studied through the lens of probabilistic tools [47,48]. However, unlike any generic random variable, they possess very specific properties due to their mathematical structure [49]. By definition, the concentration of an environmental quantity is the ratio between solute mass and solvent volume. This aspect has been largely overlooked in the literature, potentially causing inaccurate estimates of the probability of harmful concentrations. In fact, as solvent volumes decrease for various reasons, it is not uncommon that these ratios tend to very large values, resulting in heavy-tailed probability distributions [49,50], which are characterized by sub-exponential tails linked to elevated probabilities of extreme values [51,52].
One of the most intuitive examples of such behavior is represented by soil contaminants. Although the mass of a contaminant in the soil varies at times scales order of magnitude slower than soil water content, frequent variations in soil moisture generate extreme fluctuations in the concentration [29,53] (see figure 1). During a drydown, soil moisture can approach extremely low values, and, as a result, the concentration can quickly spike to critical levels [29]. Similar patterns could be observed for atmospheric aerosols that are known to absorb pollutants and facilitate their transport [55,56]. They also transport pathogens like viruses, which concentration (and associated hazard) increases dramatically as the droplet volume and diameter are reduced by evaporation [57,58]. The increasing pathogen concentration and reduced droplet size increase the hazard and biological uptake efficiency until the droplet fully evaporates and the pathogen becomes inactive [59,60]. Hence, an accurate prediction of extreme concentrations necessitates accounting for heavy tails related to ratio distributions, and specific probability density functions (pdf) should be employed for this purpose [50].
In this work, we analyze general probabilistic models that can be obtained by treating environmental concentrations as the ratio of two random variables, thus accounting for their specific mathematical properties. While, a variety of distributions has been already recommended in the literature (figures 2(A) and (B)) [61][62][63], these models are not tailored to ratios of random variables. Among the most widely used, we can recall Exponential [64][65][66], Log-normal [67][68][69], Gamma [70][71][72][73], Beta [74][75][76], Pareto [75][76][77], and Extreme Value Distributions (e.g. the Weibull distribution) [76,[78][79][80][81]. Most of these distributions can be successfully adopted in scenarios where the volume of solvent never approaches zero or when mass fluxes are strongly correlated with discharge (see section 2.2) [82][83][84][85]. This is often the case for atmospheric plumes characterized by turbulent diffusion, and rapid mixing [64,84,86,87]. However, they fall short in the more critical cases in which the volume fluctuates around small values. Extreme Value and Pareto distributions can account for the asymptotic behavior of extreme conditions and, therefore, tend to perform better than other classic distributions in predicting critical values in water, soil, and biota. However, their applicability remains limited, as they do not consider the possibility that the concentration distribution may be strongly skewed to the right (or even diverge) when the volume of solvent fluctuates near zero [8], and they do not typically result from ratios of random variables.
Based on these premises, we propose a more general framework to interpret concentration data and account for right tails heavier than the exponential decay. We apply our framework to the concentration pdf obtained as the ratio of two beta distributions. While beta distributions can reproduce the behavior of most environmental variables constrained by an upper limit [88,89], our approach applies to the ratio of different distribution families [49]. The theoretical pdf 's are tested for nitrate plus nitrite and salt concentration in US rivers and coastal streams. To further emphasize the importance of accounting for heavy tails, we use a stochastic soil contamination model that reproduces the dynamics of a generic contaminant in a semi-arid environment. Our ratio distribution shows good agreement with data, especially in the right tails, of interest in many applications. The role of statistical dependence is also discussed, as the mass of contaminants and volume of solvent may be correlated. Finally, the broad implications of the properties of ratio random variables on data analysis and signal processing are briefly discussed.

Probabilistic description of concentration fluctuations
The solute concentration C of an environmental quantity is defined as the ratio between solute mass M and solvent volume V: The distribution of C is obtained as the ratio distribution of two continuous non-negative random variables [49,90]: where φ(CV, V) is the joint probability density function of M and V. As discussed in the following sections, the shape of the p(C) depends on the properties of the M and V distributions as well as the strength of their statistical dependence.

Ratio distribution for statistically independent mass and volume
In the case of independent random variables, the joint distribution of M and V reduces to the product of their marginal pdf: This hypothesis allows for an exact analytical derivation of p(C) for several marginal M and V distributions [49] and is remarkably realistic when different stochastic processes drive the dynamics of mass and volume. Anthropogenic point-input of a contaminant in a natural water body or soil salinizationwhere soil moisture evolves at time scales orders of magnitude faster than salt mass [91]-are exemplary cases of largely independent solute and solvent dynamics.
As a typical and important case, we solved equation (2) for marginal pdf 's derived from a Beta distribution. Beta distributions have been used to characterize a number of environmental variables, as they can accurately describe most quantities that are bounded by a minimum and a maximum value [88,89]. Solute mass and solvent volume are nonnegative random variables often constrained by a maximum volume V max and mass M max , which can be related to the maximum concentration C max (i.e. solubility) through M max = C max V max .
Since the Beta distribution is defined over the interval [0, 1], mass and volume were first re-scaled so that the derived variables vary between 0 and 1: and with M min and V min being the minimum values of mass and volume, respectively. Assuming X min = Y min = 0 and considering that m and v are described by a classic Beta distribution, the pdf of M and V can be obtained as derived distributions [92] with M = mM max and V = vV max : where B[·, ·] is the Beta function, and α, β, γ, δ shape parameters. Finally, solving equation (2) for (5) and (6), the concentration distribution reads: where N is a constant, 3F2 [·, ·, ·; ·, ·; ·] is the regularized generalized hypergeometric function, and ϵ = α + γ + δ.

Impact of statistical dependence of mass and volume
Environmental concentrations may result from the ratio of statistically dependent variables. For instance, the assumption of statistical independence may not hold for diffuse contamination processes. Events that lead to substantial percolation of contaminants from the soil to rivers can feature simultaneous peaks in solute mass and water volume (see mass and volume dynamics in figure 3). This raises the question of how statistical dependence affects concentration fluctuations and its pdf. We investigate the impact of statistical dependence on the ratio distribution by extracting two random M and V vectors from a Beta copula [95][96][97][98]. The ratio C = M/V was then obtained for a given set of shape parameters of the marginal Beta distributions and different correlation matrices ( figure 4). This analysis suggests that strong positive statistical dependence between mass and volume reduces the probability of encountering elevated concentrations (figure 4(C)), as it is unlikely that high solute mass cooccurs with low solvent volume. Hence, in the case of a strong positive correlation, the absence of heavy tails  [93], while salinity was estimated from specific conductance, typically accurate to ±5% [94]. The map of the major rivers in the contiguous USA was reproduced from Esri (2010). Sources: Esri; Rand McNally; Bartholemew and Times Books; Digital Chart of the World (DCW); U.S. National Geospatial-Intelligence Agency; i-cubed. allows treating environmental concentrations as regular random variables, and classic pdf 's may be used to estimate exceedance probabilities.
On the contrary, when mass and volume are negatively correlated, p(C) tends to have heavier tails (figure SI1 available online at stacks.iop.org/ ERL/17/024011/mmedia). Although negative statistical dependence seems unlikely to occur in the environment, it could cause underestimates of high concentration return periods. Therefore, when strong negative correlations are observed, the joint pdf should be used to solve equation (2).

Case studies
Nitrogen and salt concentration in streams, rivers, and estuaries are excellent examples of environmental concentrations forced by stochastic hydroclimatic processes [46,99,100], raising growing concern. Over the past decades, excessive nitrogen fertilization has resulted in increasing nitrogen losses to leaching [101]. Rising concentration of NO − 3 and NO − 2 in rivers in turn has caused eutrophication and contamination of surface-and ground-water resources [102][103][104]. Analogously, aridification and increasing seawater intrusion into estuarine ecosystems due to sea-level rise have exacerbated rivers, and soil salinity [46,105,106]. As a result, salinization reduces the productivity of a natural and managed ecosystem, threatening to become a global environmental issue [46,[106][107][108]. For these reasons, and given the extensive network of measurements available, the concentrations of salts and nitrogen compounds in rivers and estuaries are chosen here as case studies to test our probabilistic framework

Nitrate concentration in rivers
Historical hourly optical measurements [109] of NO − 3 plus NO − 2 over the period 2014-2021 were obtained from the U.S. Geological Survey (USGS) at three station along the Mississippi (Site 1; USGS 07289000), Missouri (Site 2; USGS 06934500), and Ohio river (Site 3; USGS 03612600, figure 3). For each site, concentration time series were combined with discharge data to derive water volumes (figures 3(C) and (F)) and solute mass (figures 3(B) and (E)) via integration under the assumption of well-mixed conditions. Finally, all the data were aggregated at the daily timescale and the empirical concentration distributions were used to fit the theoretical pdf (equation (7)).
A good agreement with the empirical pdf was observed across all the sites (figures 5(A)-(C)). While nitrate mass and discharge volume are correlated (ρ ⩾ 0.8), our distribution is particularly effective in describing the right-tail and the slow decay of p(C). Because of discontinuous fertilizer applications and seasonality in plants nitrogen uptake, nitrates concentration peaks occur at the beginning of the growing season (April-June, figure 3(D)). Such high NO − 3 plus NO − 2 values do not necessarily depend on large discharge volumes (see figure 3(C)), but rather on the amount of nitrate available in the basin soil. Under these conditions, the Beta distribution of the solute mass tends to collapse to an Exponential distribution with frequent values near zero. Rivers discharge, on the contrary, could be well described by a Log-normal distribution [110,111], of which the Beta distribution represents a more general and viable alternative [112].

Salinity in coastal wetlands
Salinity data at 15 minutes temporal resolution for the 2007-2021 period were retrieved from three sites in the Florida Everglades: Taylor River at Mouth (Site 4; USGS 251127080382100), Trout Creek (Site 5; USGS 251253080320100), and East side creek (Site 6; USGS 250802081035500). The data were processed similarly to the nitrogen data as described in section 3.1. Also for salinity, the model shows excellent agreement with observations and captures the high probability of values greater than seawater salinity (≃35 g L −1 ; figures 5(D)-(F)). While salinity is usually lower than the average seawater concentration, high evapotranspiration and scarce freshwater inputs from rainfall and rivers can occasionally reduce the volume of water and result in high salinity. In the Everglades, the increase in salinity is particularly evident during the dry season, when lowering water levels enhance seawater intrusion and inundations [113,114]. The frequency of these peaks in salinity may increase in the near future due to sea-level rise and consequent coastal ecosystems salinization [27,115].
Compared to nitrate concentration, salt mass and water volume at Sites 4 and 5 show weaker positive statistical dependence (ρ = 0.4 for Site 4 and ρ = 0.5 for Site 5). Part of the reason is that salt mass encroaches into coastal streams as a result of periodic inundations and seawater intrusion [116], and it is not transported by upstream river discharge. On the contrary, strong positive correlation is observed at Site 6 (ρ = 0.98), where water fluxes are dominated by tides, which carry most of the salt mass. Although the salt mass distribution is still skewed to the left, the discharge volume tends to display a more symmetrical distribution due to tidal inundations. Similarly to the nitrate case study, the Beta distribution thus represents a good general fit for both M and V.

Soil contamination
While nitrate and salt concentration represent an important example of environmental concentrations that should be generally treated as a ratio of random variables, large and interconnected water bodies tend to have relatively stable properties [117,118], and their water volume rarely approaches zero. A more compelling case of ratio concentration is the one of conservative soil contaminants, such as heavy metals and salt (NaCl). Unlike water volume in rivers (figures 3(C) and (F)), the volume of solvent in the soil, namely soil moisture, can quickly and frequently approach extremely low values during drydowns [119]. As a result, the concentration occasionally reaches critical values, and heavy tails in the probability distribution emerge. If concentration rises to the solubility limit, solute precipitation could take place and in this case the approach needs to be modified accordingly. However, this limit is unlikely reached for NaCl (C max ≈ 360 g L −1 at 20 • C) because maximum concentration is normally limited by the ratio between salt mass input and output [120] or the salinity at which vegetation water uptake is completely impaired [29,121].
Unfortunately, long-term high-frequency soil quality data are much less available than nitrate and salt concentration in rivers. Here, this limitation is overcome by using a parsimonious stochastic model of coupled soil moisture and concentration dynamics based on previously proposed soil moisture dynamics [54] and salinization models [53], to which we refer for a complete description of the dynamical system. In this context, balance equations for the vertically averaged relative soil moisture s ≡ V and salt mass M dynamics can be written as: where n is the soil porosity, Z r the root depth, P(t) the stochastic precipitation input, ET [s(t)] the evapotranspiration rate depending on soil moisture, L [s(t), t] the deep percolation rate, Υ is the average salt input from precipitation and dry deposition [53,122], and C(t)L [s(t), t] represents leaching salt removal driven by percolation. The concentration of salt C is calculated as C = M/(nZ r s). Stochastic precipitation is modeled as a compound Poisson process [123], and all the rain that cannot be allocated in the soil water storage (nZ r s 1 , with s 1 being the effective soil moisture at field capacity) is assumed to be lost to deep percolation [54] at a frequency equal to the frequency of soil moisture crossing the threshold s 1 [124]. The model also assumes an instantaneous mixing of salt within the active soil depth [29,53,91]. Figure 1 shows relative soil moisture ( figure 1(A)), solute mass (figure 1(B)) and solute concentration (figure 1(C)) resulting from equations (9) and (10), solved for a semi-arid region. It can be noticed that the dynamics of s and M are largely decoupled, and C trajectories closely follow those of s. Drydowns result in exponentially increasing salinity, even when the mass of salt remains almost constant. As a consequence, the salinity probability distribution (figures 1(C.1) and (D)) exhibits a heavy tail with subexponential decay ( figure 1(D)).
As shown by Suweis et al [53], the pdf of C can be analytically obtained by assuming complete statistical independence between s and M. The resulting pdf, obtained as the ratio between the truncated Gamma distribution of s [54] and the Gamma distribution of M [120], displays a power-law decay. However, the hypothesis of statistical independence between s and M may not hold in highly vegetated and salt-affected ecosystems. There, salinity reduces ET depending on vegetation salt tolerance [29,46]. Yet, even accounting for the salinity-induced reduction of ET, s variability time scales are much shorter than those of M, and heavy tails in the concentration distribution still emerge (see [29] for details).

Discussion
The analysis of environmental concentrations concerns applications spanning from ecotoxicology to ecology, biogeochemistry, and ecohydrology [125][126][127][128][129][130]. In particular, the prediction of the response of socio-ecological systems to environmental change and the design of proper interventions and mitigation strategies requires an improved probabilistic characterization of extreme concentrations [131]. Classic probability distributions widely used in the literature (figures 2(A) and (B)) do not capture the extreme nature of environmental concentrations fluctuations related to their definition as ratios between random variables.
Our alternative formulation of the concentration pdf is specifically derived from the ratio distribution of solute mass (M) and solvent volume (V) treated as random variables with Beta distributions. Similar formulations could be obtained as the ratio of variables extracted from different distribution families [49] and can accommodate statistical dependence when important. The ratio distribution proposed here accurately described observed concentrations of nitrate plus nitrite and salinity patterns. In particular, it accounts for the possibility that the pdf is particularly skewed to the right with fat tails (see figures 2(C), (D) and 5). Equation (7) can be used to estimate the probability of exceeding extreme concentration values and, therefore, can provide a direct estimate of the average return period of a contamination event.
Although it was obtained initially under the assumption that solute mass and solvent volume are independent, this framework can conservatively reproduce also the behavior of weakly positively correlated variables (figure 4). A positive correlation between mass and volume leads to less frequent extreme events, with faster decay of the distribution's right tail. When positive correlations are particularly strong, however, the heavy tail tends to disappear (figures 4(C) and (F)), and neglecting statistical dependence could lead to significant overestimates of the probability of crossing critical values. Under these circumstances, simple Gamma or Log-normal distributions may capture the probabilistic behavior of concentration dynamics. Classic pdf 's could be reasonably employed also to analyze concentration data in situations in which the solvent volume remains conspicuously above null values at all times. This is the case with many atmospheric pollutants and water quality indicators. Yet, ratio distributions should be used in the circumstances like soil contaminants, characterized by highly variable solvent volume (i.e. soil moisture) and concentrations growing exponentially during a dry period (figure 1).
The emergence of heavy tails may also have implications for the analysis of tracers to understand chemical and ecohydrological processes at the catchment scale. For example, the transport of contaminants in the whole catchment can be characterized by comparing the power spectra of concentration at the input (e.g. in the rainfall or at a point source) to the spectrum of concentration in the catchment discharge [132][133][134]. Because environmental concentrations are defined as the ratio of two variables, their variance can be extremely high (when the volume fluctuates near zero), calling for extreme care in evaluating power spectral scaling and any other time series analysis based on variance decomposition of concentrations.
Predicting harmful concentrations statistics will also imply dealing with nonstationarity of extremeevent probabilities due to climate change and anthropogenic activities [84,[135][136][137][138]; alternative methods should be implemented to relax the ergodicity assumption [139][140][141], while still taking advantage of the special mathematical properties of the concentration of random variables.

Data availability statement
The data that support the findings of this study are openly available at the Geological Survey's (USGS) National Water Information System (NWIS) website.