Responsible modelling: Unit testing for infectious disease epidemiology

Highlights • Unit testing can reduce the number of bugs in code but is rarely used in our field.• We present a worked example of adding unit tests to a computational model.• Specific issues such as stochastic code are common in infectious disease modelling.• Unit testing can handle particular quirks of infectious disease models.• We hope to increase the use of unit testing in infectious disease epidemiology.


A more realistic modelling example
Here we define a simple model that is less contrived compared to the main text. We again demonstrate how to effectively write unit tests for it in R code. We aim to follow the same structure as in the main text but in this example we do not need to worry about the code being short.
We will implement a stochastic, continuous time, SIR model using code from EpiRecipes (Frost 2020) as a starting point. The number of individuals in the three classes, susceptible, infectious and recovered, are described by the equations In order to simulate continuous time dynamics we use the Gillespie algorithm. In each iteration of the model, exactly one event happens, either an infection or a recovery. The time step in between events is a continuous, positive number that is drawn from an exponential distribution with a rate equal to the sum of two event rates (i.e. infection and recovery).
Code S1: Base example of the multi-pathogen re-infection model.

Basic unit testing Write small functions
To ensure the unit tests are evaluating the exact code as run in the analysis, code should be structured in functions, which can be used to both run unit tests with and to generate results as part of a larger model codebase. Make your functions compact with a single clearly-defined task. We have split the code into a number of functions that do relatively well defined tasks such as initialising the simulation, handling a single transmission event or a single recovery event as well as a large function (runSim()) that combines these functions. (Code S2).
Code S2: Organising code into small functions. As an informal first check we can plot a single simulation and check that it gives similar results to the earlier plot and behaves broadly how we would expect an SIR model to behave. For example, we expect the number of infectious individuals to increase and then decrease again while we expect the susceptible and recovered classes to monotonically decrease and increase respectively.
Code S3: Run a full simulation with the code organised in functions.  Figure S2: Example simulation output from the refactored code.

Test simple cases first
As there are quite a few functions already in our code we will not write unit tests for all of them here. However, in Code S4 we will write some tests for simple aspects of the code. For example we can confirm that the infection() function works as expected. We expect this function to increase the number of infectious individuals by one, decrease the number of susceptible individuals by one and leave the rest of the object unchanged. If we start with 1 susceptible and 0 infectious or recovered individuals, this function should result in 0 susceptible individuals, 1 infectious individual and 0 recovered individuals. We can see that the way we have written the code has separated the process of finding out what event happens, from the book keeping of making the change once an event has been selected. With 0 infectious individuals we would not expect to ever have an infection event, but separating the functionality of the code allows us to test that the bookkeeping works without worrying about other aspects.
Code S4: Using simple parameter sets we can work out beforehand what results to expect.

Test all arguments
calcRates() has three arguments to check. If we consider a baseline population with one susceptible, one infectious and zero recovered individuals, with both beta and gamma set to one we can easily work out the expected outputs. In Code S5 we will focus our tests on the two rates: the transmission rate pf1 and the recovery rate pf2. Given that the SIR transmission rate is βSI and given that β, S and I are all one, pf1 should be one. Similarly, given a recovery rate of γI and with γ set to one, pf2 should also be one.
Code S5: Baseline before testing all function arguments in turn.
From there we can alter each of the three parameters in turn and check the behaviour is as expected (Code S6). First we will set beta to 2 and so we expect pf1 to be 2 while pf2 should not change. Second we will set gamma to 2 and so we expect pf2 to be 2 while pf1 should not change. Finally, we will initialise a new simulation with I0 = 2 and we will then expect pf1 and pf2 to be equal to 2.
Code S6: Test all function arguments in turn.

Does the function logic meet your expectations?
We can also cover cases that expose deviations from the logical structure of the system. The function addNew() changes the object ready for a new iteration in the simulation. Therefore the length of a number of elements should increase by one. Instead of testing their numerical values, we verify logical statements of the results within our macro understanding of the model system (Code S7).
Code S7: Test logical structure of the code rather than the specific numerical values.

Combine simple functions and test them at a higher-level
In the end an entire model only runs when its functions work together seamlessly. So we next check their connections; achieved through nesting functions together, or defining them at a higher level and checking the macro aspects of the model. This step could be considered integration testing depending on how precisely you define unit testing versus integration testing. We have already defined a function runSim() that runs a few different smaller functions to perform one full simulation. We would expect the output from runSim() to be a dataframe with four columns and no NAs and we would expect no values of the time column to be greater than tf. We would also expect all of the values in the S, I and R columns to be integers greater than or equal to zero. These are all general properties that should be true regardless of the exact realisation of the simulation. Furthermore, checking the maximum value in the time column is, to an extent, an aggregate property that could not be tested at a lower level.

Split stochastic and deterministic parts
Isolate the stochastic parts of your code. For example, nextEvent performs stochastic and deterministic operations (Code S9).
Code S9: Isolation of the deterministic and stochastic elements.
The function takes a random uniform number to determine whether an infection or recovery event takes place. It then performs that event, although we note that we have already pulled much of the code out into the two functions infection() and recovery(). However, we can split this code into two functions, one completely deterministic (nextEvent()), and one very small function (chooseEvent()) that carries out the stochastic component. The deterministic function can be tested using the guidelines above while in the following sections we will look at methods to test the stochastic function. We note that perhaps a better method here would be to make the first argument of nextEvent() take a random uniform number. In that case nextEvent() would be deterministic and the only stochastic code would be a single call to runif(). Given that the runif() function is a base R function that is well-tested, further testing is unnecessary. However, for demonstration purposes we have split the function as below so that we can subsequently demonstrate how to test stochastic functions.
Code S10: Isolation of the deterministic and stochastic elements. We now need to redefine runSim() to account for these modified functions.

Pick a smart parameter for a deterministic result
In the same way that we used simple parameters values in Code S4, we can often find simple cases for which our stochastic functions become deterministic. For example, samples from X ∼ Bernoulli(p) will always be zeroes for p = 0 or ones for p = 1. In the case of the function chooseEvent(), this function is deterministic if pf1 (the infection rate) is zero in which case the function will return FALSE i.e. if the infection rate is zero, the next event will not be an infection event. The function is also deterministic if pf1 is equal to pf. This occurs when pf2, the recovery rate, is zero and therefore the next rate will certainly be an infection rate.
Code S12: A stochastic function can output deterministically if you can find the right parameter set.

Test all possible answers (if few)
Working again with a simple parameter set, there are some cases where the code is stochastic, but with a small, finite set of outputs. So we can run the function exhaustively and check it only returns possible outputs. For the function chooseEvent() this is trivially true; for any valid values of pf1 and pf it should return either TRUE or FALSE.

Use very large samples for the stochastic part
While the previous example worked well for a small set of possible outputs, testing can conversely be made easier by using very large numbers. This typically involves large sample sizes or numbers of stochastic runs. For example, if pf1 is positive but smaller than pf, we expect the chooseEvent() to sometimes return TRUE and sometimes to return FALSE. If we run the function only twice, we might get two TRUEs or two FALSEs. However, if we run the function thousands of times, and pf1 is approximately half of pf (i.e. infection and recovery events are equally likely) then it is almost certain that we will get at least one of each logical value.
Code S14: Testing that the code does allow one individual to infect multiple individuals.

# Check that both TRUE and FALSE are returned # and simultaneously check that ONLY TRUE or FALSE # is returned. expect_true(setequal(out3, c(TRUE, FALSE)))
If we have an event that we know should never happen, we can use a large number of simulations to provide stronger evidence that it does not stochastically occur. However, it can be difficult to determine how many times is reasonable to run a simulation, especially if each run of the simulation is time consuming. This strategy works best when we have a specific bug that occurs relatively frequently (perhaps once every ten simulations or so). If the bug occurs every ten simulations and we have not fixed it, we would be confident that it will occur at least once if we run the simulation 500 or 1000 times. Conversely, if the bug does not occur even once in 500 or 1000 simulations then we can be fairly sure we have fixed it. As an example of this, the number of susceptibles in an SIR model should never increase. So we can run the simulation a number of times and check that it does not occur. However, without a previously observed bug, it is impossible to know how many runs of the simulation we would need to run before finding the bug, so in this example we will simply run 100.
Code S15: Assessing if a bug fix was a likely success with large code runs, when the bug was appearing relatively frequently.

# Does S ever increase? #
This returns TRUE if every element of the S vector is # the same size or smaller than the previous element. decreasing <sapply(out, function(x) all(diff(x$S) <= 0)) expect_true(all(decreasing))

Further testing Test incorrect inputs
As well as testing that functions work when given the correct inputs, we must also test that they behave sensibly when given wrong ones. This typically involves the user inputting argument values that do not make sense. This may be, for example, because the inputted argument values are the wrong class, in the wrong numeric range or have missing data values. Therefore it is useful to test that functions fail gracefully if they are given incorrect inputs. This is especially true for external, exported functions, available to a user on a package's front-end. However, it is not always obvious what constitutes an 'incorrect value' even to the person who wrote the code. In some cases, inputting incorrect argument values may cause the function to fail quickly. In other cases code may run silently giving false results or take a long time to give an error. Both of these cases can be serious or annoying and difficult to debug afterwards.
Often for these cases, the expected behaviour of the function should be to give an error. There is no correct output for an epidemiological model with a negative number of susceptible individuals. Instead the function should give an informative error message. Often the simplest solution is to use defensive programming and include argument checks at the beginning of functions. We then have to write slightly unintuitive tests for an expression where the expected behaviour is an error! If the expression does not throw an error, the test should throw an error (as this is not the expected behaviour). Conversely, if the expression does throw an error, the test should pass and not throw an error. We can use the expect_error() function for this task. This function takes an expression as its first argument and reports an error if the given expression does not throw an error as expected.
We can first check that the calcRates() function sensibly handles the user inputting a string instead of a number for the beta parameter. Because this expression throws an error, expect_error() does not throw an error and the test passes.
The call to calcRates() did not throw an error and therefore expect_error() did throw an error. The easiest way to protect against this would be to use defensive programming and check the arguments. We demonstrate this in Code S18. Here we have added two checks that confirm that beta and gamma are positive. Given that calcRates() is run thousands of times (once per iteration) it might be more sensible to run these checks only once at the top of runSim(), but we are showing them here to avoid pasting the longer function again and for consistency with the previous two examples.
Code S18: Add defensive programming so that calcRates fails when it should do.

Test special cases
When writing tests it is easy to focus on standard behaviour. However, bugs often occur at special caseswhen the code behaves qualitatively differently at a certain parameter value of a parameter or for certain combinations. For example, in R, selecting two or more columns from a matrix e.g. my_matrix[,2:3] returns a matrix while selecting one column, e.g. my_matrix[,2] returns a vector. Code that relies on the returned object being a matrix would fail in this special case. These special cases can often be triggered with parameter sets that are at the edge of parameter space. This is where understanding of the functional form of the model can help. Consider a function divide(x, y) that divides x by y. We could test this function by noting that y * divide(x, y) should return x. If we write code that tests standard values of x and y such as (2 * divide(3, 2)) == 3 we would believe the function works for nearly all values of division, unless we ever try y = 0.
In our example we might want to test whether the code works with β = 0 and γ = 0 (Code S19). We again run the simulation and test that the macro properties of the output are correct i.e. that the function returns a data frame with four columns and no NAs. While setting beta = 0 works fine, setting gamma = 0 does not.
Code S19: Demonstrate a failing test.
> sim <-initialiseSim(100, 4, 0) > simulation1 <-runSim(sim, beta = 0, gamma = 1, 10) > # Check the shape of the output > expect_true(inherits(simulation1, data.frame )) > expect_equal(ncol(simulation1), 4) > # Test for NAs > expect_true(all(!is.na(simulation1))) > simulation2 <-runSim(sim, beta = 1, gamma = 0, 10) Error in while (sim$time < tf) { : missing value where TRUE/FALSE needed In general, requesting a simulation with a recovery rate of zero is not without meaning; this is not an objectively wrong use of the function. The error occurs because once all the individuals are in the infectious class, both the infection rate is zero (βSI = 1 × 0 × 100 = 0) and the transmission rate is zero (γI = 0 × 100 = 0). Therefore in the rexp(1, rate = rates$pf) in calcRates, we are trying to draw a random number from an exponential distribution with a rate of zero, which is undefined. Therefore rates$dt is set as NaN, sim$time subsequently becomes NaN and the expression while (sim$time < tf) errors because NaN < 10 returns an NA rather than either a TRUE or a FALSE. For special cases like this, it may be rather subjective what the correct behaviour should be. It might be appropriate to simply declare that this parameter value is not allowed; setting gamma as 0 means this is no longer an SIR model but is instead an SI model. This should be stated in the documentation and the function should throw an error. This is what we will do here (Code S20). We change our defensive lines in calcRates() so that gamma must be greater than 0 rather than greater or equal than zero. We then add a test that checks that this argument is checked properly.
Code S20: Use defensive programming to check for gamma = 0.
Code S21: Check very large values of beta and gamma.