Analyzing pathogen suppressiveness in bioassays with natural soils using integrative maximum likelihood methods in R

The potential of soils to naturally suppress inherent plant pathogens is an important ecosystem function. Usually, pathogen infection assays are used for estimating the suppressive potential of soils. In natural soils, however, co-occurring pathogens might simultaneously infect plants complicating the estimation of a focal pathogen’s infection rate (initial slope of the infection-curve) as a measure of soil suppressiveness. Here, we present a method in R correcting for these unwanted effects by developing a two pathogen mono-molecular infection model. We fit the two pathogen mono-molecular infection model to data by using an integrative approach combining a numerical simulation of the model with an iterative maximum likelihood fit. We show that in presence of co-occurring pathogens using uncorrected data leads to a critical under- or overestimation of soil suppressiveness measures. In contrast, our new approach enables to precisely estimate soil suppressiveness measures such as plant infection rate and plant resistance time. Our method allows a correction of measured infection parameters that is necessary in case different pathogens are present. Moreover, our model can be (1) adapted to use other models such as the logistic or the Gompertz model; and (2) it could be extended by a facilitation parameter if infections in plants increase the susceptibility to new infections. We propose our method to be particularly useful for exploring soil suppressiveness of natural soils from different sites (e.g., in biodiversity experiments).


The Ordinary differential equations
The aim is to fit rather complex Ordinary Differential Equation systems (ODE) to data. The first equation we will use is the one pathogen model. Here the change of infections over time dIdt −1 , is described by: (1) with r [time -1 ] being the infection rate and I max [Infected (Plants) Area -1 ] being the maximum number of potentially infectable plants, an t 0 is the resistance time.
The two-pathogen model can be written as: where I p is the number of infected plants due to the treatment pathogen, I c are the number of infected plants in the control; r p and r c are the infection rates of the pathogen and the control treatment, respectively; and t 0 p and t 0 c are the resistance time of the pathogen and the control treatment, respectively. Please see the main article for further details.

Parameter estimation
First, the control treatments ( Fig. 1a) must be fitted using the monomolecular model including a lag phase (eqn (1)) using the mle2() function for general maximum likelihood fits from the package bbmle (Bolker and Team, 2016). The mle2() is used to minimize the negative likelihood function mon.inf.lag.1p.nll() from the infection.nll.r source file (see Bolker and Team, 2016, for detailed information on using the mle2() function). We need to supply the arguments start, containing a list of r and t0 (the start values of the optimization process); data, a list with the headers nI (the number of infected plants in the experiment), tps (the timepoints from the experiment), and Imax (if the number of total plants differs between experimental units); fixed, a list containing the headers steps (the length of a time step that should be simulated by the integration routine; this is set to 0.1 as default and does not have to be supplied), Imax, (a single value for the maximum of potentially infectable plants in the experiment, only needed if not supplied as data), and tracing (if set to 1, the trace including the negative likelihood and the parameters is displayed in the console, useful for error checking, default to 0): R> fit.control = mle2(mon.inf.lag.1p.nll, + start = list(r = 1, t0 = 1), + data = list(nI = y, tps = x1, Imax = x2), + fixed = list(steps = .1, Imax = x2, tracing = 0) + ) Note that the initial values are just placeholders and might be adapted, but see the section "Examples" for details. Imax must only appear once, either in the data list or in the fixed list in dependence if Imax is only a fixed single value or a set of data of the same length as y and x1. Second, after estimating the infection parameters r and t0 for the control, the two-pathogen monomolecular infection model can be fitted to the data with the experimentally added pathogen (equation 2). Again we use the mle2() function, but now we minimize the mon.inf.lag.2p.nll() function. The estimates of the control treatment are assigned to rc and t0c in the fixed argument list. The target infection parameters are still placed in the start argument but now called rp and t0p: One can investigate the fitting results and statistics using the generic summary() function applied on the object containing fitting results (here: fit.control & fit.treatment).

EXAMPLES
The supplemental information contains R-code files. The files infection.frontend1.r and infection.frontend2.r are the front-end files containing the adaptable examples we will discuss below. These files load the source files, infection.models.r and infection.nll.r, containing all underlying functions. We begin with a detailed description of the front-end files which should be sufficient to apply our fitting method to data. Moreover, we kindly invite you to continue reading the descriptions of the source files later on (section "Underlying Functions"). Please extract all required files from the zip-folder into one folder and keep the provided folder structure. If you prefer different organization of data and files you have to adapt the paths described below in the code. Before getting started, use e.g. install.packages() to install the required packages deSolve (Soetaert et al., 2010) and bbmle (Bolker and Team, 2016), but see introductory R Books and manuals for details (e.g. Bolker, 2008;Crawley, 2012). We first discuss the case of a constant number of plants in the experimental units (infection.frontend1.r).
The data comprises three variables, (1) the factorial variable treatment determining if the data belongs to the control or the experimental treatment. The variable time.days contains the time of the measurement in days (you can switch the temporal resolution, the unit here determines the unit of the infection rate), and the variable number.infected containing the information of how many plants are infected (note that this must be an integer as we apply a binomial distribution later on; but see Crawley (2012) and Bolker (2008) for a detailed description on this topic). The data represents single experimental units (independent replicates, each data point represents the last measurement of a time series). Before continuing, we separate the data into two sub data sets containing just the control or just the treatment data using the subset function.
R> data.control = subset(sample.data, treatment == "control") R> data.treat = subset(sample.data, treatment == "treatment") Subsequently, we investigate the data graphically. For a better overview we set the graphical device to display two plotting regions using the par() function (the argument mfrow is set to c(1,2), generating 2 horizontal adjacent plotting regions). Within the plot() functions we fix the y-axes ranges to create comparable plots (ylim = c(0,10)) and display the number of infections (y-axis) depending on the experimental time (x-axis): R> par(mfrow=c(1,2)) R> plot(data.control$time.days, + data.control$number.infected, + ylim=c(0,10), + xlab="days", + ylab="infections control") R> plot(data.treat$time.days, + data.treat$number.infected, + ylim=c(0,10), + xlab="days", + ylab="infections treatment") The control without an experimentally added pathogen shows an early increase (day 3) of infected plants over time (Fig. 1a), but not all plants are infected during the experimental trial. The treatment with the experimentally added pathogen shows the first infections at day 4, but the increase in new infections is much steeper than in the control, and all plants might be infected (Fig. 1b).  (1) and (2) using the functions mon.inf.lag.1p() and mon.inf.lag.2p(). After simulating the numeric average we applied a random number generator (rbinom()) using Imax as size of the binomial distribution and the floating average of the simulation divided by Imax as probability. The control data (a) was simulated using r = 0.025, Imax = 10 and t0 = 1 and the experimental treatment (b) was simulated using rp = 0.19, rc = 0.025, t0p = 5.5, t0c = 1 and Imax = 10.

Analysis of the data -standard approach
First, we analyze the experimental data using the standard mono-molecular growth model (Raaijmakers et al., 2009;Paine et al., 2012). This model ignores the fact that not only the experimentally added pathogen but, in addition, other soil inherent pathogens may infect plant individuals and will be named "wrong method" (fit.treatment.wrong(), note that this method is valid if the medium chosen for the bioassay is sterile or does not contain any alternative pathogens). To fit the model to the data we use the mle2() function from the bbmle package by Ben Bolker Bolker and Team (2016); Bolker (2008). The mle2() function requires (1) a negative log-likelihood function, here the function mon.inf.lag.1p.nll() discussed later (see section -"Underlying functions"); (2) a list containing the model parameters that should be estimated (here the infection rate, r, and the resistance time, t0); (3) a list containing the data the model should be fitted to, here the first element of the list must be named nI (the number of infected plants) and the second element must be named tps (timepoints, the time of the experimental data); (4) we also submit a fixed value to mle2(), again in a list object, containing the maximum number of potentially infected plants (all plants in the experimental unit). Note that Imax does not have to be a fixed value but can also be assigned in the data list if the different replicates contain different numbers of plants (see section "Analysing data with multiple Imax"). Here, we start the fitting optimization with the initial guessed values of r = 1 and t0 = 1 (note that these values are just place-holders and might be adapted by the user,see section "What to do if" below, moreover these values will be changed during the fitting algorithm by the optimizer mle2(); note also that the initial t0 must be equal or smaller than the experimental resistance time, which according to figure 1 is at day 4); we set Imax = 10 (the maximum number of plants in the experiment).

Analysis of the data -corrected approach
Next, we apply the two-pathogen approach to the data, taking infections of the control treatments into account. In the first step we analyze the control data similarly to the treatment data shown above only with exchanging the data supply from the experimental to the control data and decreasing the starting values for the model parameters (we already saw that the first infections occurred earlier and the slope is less step than for the treatment data).

Adding lines to the plot
The graphical representation of data and model fits is common practice. Simple linear regressions can be added as line to an existing plot using e.g. the generic abline() function. Our example is slightly more complex, but not much. First we have to create a vector containing values for the x-axis (time in days in our case) which will later be used to display a line in the plot. Note that non-linear lines need many values to create a smooth appearance of the line. Here we choose to create 100 x-values ranging from t=0 to t=20, the end of our experiment using the seq() function with the third argument set to length = 100 (which creates 100 evenly distributed values ranging from the minimum to the maximum value). To simulate the corresponding y-values we use the lsoda() function from the package deSolve (Soetaert et al., 2010). The lsoda() function builds a complex object including background information on the simulation run not needed for our purpose. To get rid of this background information, we save the object created by lsoda() as a data frame by applying the data.frame() function on the lsoda() function. lsoda() requires the starting density of the infected plants as first argument, y, here c(I = 0); the second argument is the sequence of time points the number of infected plants should be calculated for; the third argument requires the ordinary differential equation model that should be applied (mon.inf.lag.1p(), see below for detailed discussion); and fourth, the parameters of the model must be supplied, here the results of our model fit. To simulate the regression line for the correct treatment fit, we must extend the code described above to incorporate both the control as well as the treatment pathogen parameters. We again use the lsoda() function, but now call the mon.inf.lag.2p() function that allows to model two pathogens. Also, we have to provide two starting densities for the infected plants (zero infected plants by the treatment pathogen, Ip = 0, and zero infected plants by the natural occurring (control) pathogens, Ic = 0). Moreover, the parameter list must now contain five parameters, both infection rates (rp and rc), both resistance times (t0p and t0c) and again the maximum reachable number of infected plants, Imax.  To add the lines to the plot, we again call plot() from above and subsequently lines(). Note that we have to sum both densities of infected plants (due to the treatment pathogen and the natural occurring (control) pathogen) to plot correct line (Fig. 2) for the treatment data.

... I want to see the trace of the integrative fitting process?
We continue the example by adding the element tracing to the fixed list. This allows to track the iteration steps in which the NaNs occurred and the values of the infection parameters that have been used in these steps: R> fit.control.tracing = mle2(mon.inf.lag.1p.nll, + start = list(r = 3, t0 = 0.5), + data = list(nI = data.control$number.infected, + tps = data.control$time.days), + fixed = list(Imax = 10, tracing = 1) + ) Running this code a list of negative likelihood values and infection parameters appears in the console, getting longer the more the fitting advances (to save space here in the text we truncated the output, you might have to scroll up a little in your console output to see the same): ... Clearly, the NaNs are produced when extremely high infection rates (r ∼ 50.3 and r ∼ 10.88) are tested by the optimization algorithm. This causes the lsoda() to fail as the increase in new infected plants is much larger than the step size of the numeric integration routine (we set the default to 0.1, which corresponds here to 0.1 days). Therefore (1) the warning message can be ignored as it applies only to unlikely high infection rates; a change in the starting parameters may avoid that the optimization algorithm randomly picks these high rates (the main example did not show any warnings); or the step size of the numerical integration routine can be decreased to increase the temporal resolution of the simulation.

Negative Likelihood Functions (infection.nll.r)
The fitting algorithm used above, mle2(), requires a negative log-likelihood (nll) function that will be minimized (see Bolker, 2008, for a detailed introduction). The nll()-function for the monomolecular infection model can be written as: if(tracing == 1) print(paste("negLL: ",negLL,"r: ",r,"t0: ",t0)) + return(negLL) + } The function includes seven arguments: the experimentally measured number of infected plants, nI; the corresponding time estimates from the experiment, tps (timepoints); the infection rate, r; the resistance time, t0; the maximum number of infections possible, Imax, (i.e. the maximum number of host plants in the experimental unit); the desired step length of the numerical integration, steps, by default set to 0.1; and the tracing argument which is deactivated by default. First the function checks if either r or t0 are below zero or t0 is greater than or equal to the minimum time of an infection in the real data. If any of these queries is true, the function returns infinity. Next, the model predictions for each experimentally measured value are simulated using sim.inf.1p() (see section "ODE functions"). Finally, the function calculates the negative likelihood using the binomial density function dbinom() and returns it. We choose the binomial distribution as our data is binomially distributed (integer values for infections have a clear defined minimum and maximum number of infections, but see Bolker (2008) for a detailed introduction on this topic). The two-pathogen variant, mon.inf.lag.2p.nll(), is similar to the above described function and we will only discuss the differences. As this function is created to estimate the negative log-likelihood of a two pathogen system, the model parameters consist of rp, rc, t0p and t0c. Moreover, the function sim.inf.2() is used to calculate the number of infections from the model simulation. Also a more complex if-statement is added to ensure that at least one of t0p or t0c falls below the time the first infection occurred in the experiment (otherwise NaNs may be produced): R> mon.inf.lag.2p.nll = function(nI, tps, rp, t0p, rc, + t0c, Imax, steps = 0.  = sim.inf.2p(tps,rp,rc,Imax,t0p,t0c, steps) + negLL = -sum(dbinom(x = nI, + prob = Isim / Imax, + size = Imax, + log = TRUE)) + if(tracing == 1) print(paste("negLL: ",negLL, "r: ",r, "t0: ",t0)) + return(negLL) + } The function sim.inf.1() first creates an empty numeric vector (Iout) to store the calculated values for infections. If Imax is only a single value we apply the lsoda() function (Soetaert et al., 2010) to simulate a single time series of plant infections according to the assigned parameter values (parms = c(r = r, Imax = Imax, t0 = t0)) and a starting density of infected plants of unity (y = c(I = 0)). The numerical simulation of the integration process needs a vector of consecutive points in time. This vector consists of a sequence of consecutive values from zero to the maximum time value (seq(0, max(tps), steps)), and the experimental time values, tps. As no duplicate values should appear in the vector and the time vector should increase consecutively, we first apply the unique() function on the vector to delete duplicates and second, sort the vector by the sort() function. The ordinary differential equation system that should be integrated is given by the function mon.inf.lag.1p(). Please read into Soetaert and Herman (2009) to get a general introduction into the topic "solving ordinary differential equation systems in R using deSolve". After the integration the number of estimated infections are saved according to their appearance in the experimental time series using a for loop. If the experimentally data consists of more than one single value for Imax, the else 12/15 part of the if/else-statement is activated. First, we create three empty numeric vectors to store the the the number of infected plants, the time, and the maximum number of infectable plants in (mres.I, mres.time, mres.Imax). Second, we use a for-loop to calculate the number of infected plants for each Imax. Third, we save the results to Iout as described above using a for-loop, with additionally separating for each Imax. R> sim.inf.1p = function(tps, r, Imax, t0, steps) { + Iout = numeric() + if(length(Imax) == 1){ + mres = data.frame(lsoda(y = c(I = 0), + times = sort(unique(c(seq(0, max(tps), steps), tps))), + func = mon.inf.lag.1p, + parms = c(r = r, + Imax = Imax, + t0 = t0))) + for ( The function sim.inf.2() is similar to the function sim.inf.1() but models the two-pathogen system. The differences are: model parameters consist are rp, rc, t0p and t0c; lsoda() needs two starting values for infections at time = 0 (c(Ip = 0, Ic = 0)); the ordinary differential equation system is given by the function mon.inf.lag.2p(). The results for the total infected plants, estimated by the model, are now calculated by (mresIc + mresIp).

ODE Functions (infection.models.r)
The monomolecular infection model (Raaijmakers et al., 2009;Paine et al., 2012) with an additional lag phase can be written as: The function contains three arguments, the first argument is the time t, the second argument, x, is a list of densities that occur in the differential equation system and the third argument, parms, is a list of constant parameters. We activate the headers of the parms list by using the with() function. Within the with() function the change of infections over time is calculated. We use an if/else-statement to discriminate between zero growth (before the first infections occur, t < t0) and positive new infections above this boundary. Lastly, the function returns a list containing the change of infected plants, dI. The two pathogen monomolecular infection model can be written as: The differences to mon.inf.lag.1p() are: Within the with() function, the change of infections over time is calculated following the two pathogen infection model (see main manuscript); the function returns a list containing the change in infected plants over time of the experimentally pathogen, dIp, and the naturally (control) occurring pathogen, dIc.