Malaria Incidence Rates from Time Series of 2-Wave Panel Surveys

Methodology to estimate malaria incidence rates from a commonly occurring form of interval-censored longitudinal parasitological data—specifically, 2-wave panel data—was first proposed 40 years ago based on the theory of continuous-time homogeneous Markov Chains. Assumptions of the methodology were suitable for settings with high malaria transmission in the absence of control measures, but are violated in areas experiencing fast decline or that have achieved very low transmission. No further developments that can accommodate such violations have been put forth since then. We extend previous work and propose a new methodology to estimate malaria incidence rates from 2-wave panel data, utilizing the class of 2-component mixtures of continuous-time Markov chains, representing two sub-populations with distinct behavior/attitude towards malaria prevention and treatment. Model identification, or even partial identification, requires context-specific a priori constraints on parameters. The method can be applied to scenarios of any transmission intensity. We provide an application utilizing data from Dar es Salaam, an area that experienced steady decline in malaria over almost five years after a larviciding intervention. We conducted sensitivity analysis to account for possible sampling variation in input data and model assumptions/parameters, and we considered differences in estimates due to submicroscopic infections. Results showed that, assuming defensible a priori constraints on model parameters, most of the uncertainty in the estimated incidence rates was due to sampling variation, not to partial identifiability of the mixture model for the case at hand. Differences between microscopy- and PCR-based rates depend on the transmission intensity. Leveraging on a method to estimate incidence rates from 2-wave panel data under any transmission intensity, and from the increasing availability of such data, there is an opportunity to foster further methodological developments, particularly focused on partial identifiability and the diversity of a priori parameter constraints associated with different human-ecosystem interfaces. As a consequence there can be more nuanced planning and evaluation of malaria control programs than heretofore.


Introduction
Estimation of incidence, and in some cases recovery, rates for malaria infection is a central objective of ongoing community surveillance programs [1]. By incidence rate we mean the number of new infections acquired in a small interval of time per person at risk (i.e. uninfected) at the beginning of the interval. Analogously, a recovery rate is the number of terminations of infection in a small interval of time per person at risk (i.e. infected) at the beginning of the interval.
It is important to observe that incidence rate, as defined above, is not the same as incidence, as commonly used in the contemporary malaria literature [e.g., see reference 2]. In the latter case, incidence is defined as [Number of positive species-specific clinical cases observed during the duration of a survey]/[(Number of people observed over the survey duration) Ã (duration of survey)]. We quantitatively compare and contrast these notions in the Discussion section.
Ideally one would like to have continuous histories of infection status on designated populations so that incidence rates could be directly ascertained over short intervals starting at any designated time. This would facilitate showing the impact of local-in-time weather events, seasonal variation, and the impact of intervention strategies on these rates at arbitrary times of interest to health service workers, government personnel setting malaria control policy, and research investigators.
Continuous infection status histories are virtually never available at a community level. The most common longitudinal data collection plan is a time series of 2-wave panel data sets (or observations of infection status on the same individuals taken at two time points separated by an interval of length Δ), with spacing between waves varying from a few weeks [3,4] to several months [5]. Thus, there are unobserved transitions between states (uninfected and infected) that create a challenge for estimation of incidence and recovery rates. To address this challenge, we require specification of a class of continuous-time 2-state stochastic process models of infection status dynamics that must be shown to be consistent with observed data, and within which estimation of incidence and recovery rates is feasible.
The first attempt to carry out this program was by Bekessy et al. [6] using data from the Garki malaria surveys [3] in Kano State, Nigeria from [1970][1971][1972][1973][1974][1975]. They introduced the 2-state continuous time Markov chains as candidates to represent the unobserved infection dynamics. The empirical question associated with this choice was whether or not a transition matrix P(Δ) generated by a member of this class of models could represent a transition matrixPðDÞ arising from field data. Here Δ is the time interval between observations collected at a survey date T 1 and a later survey date T 2 . P(Δ) is a 2x2 transition matrix associated with a continuous time Markov chain with entries p i,j (Δ) = conditional probability that an individual has infection status j at the end of a time interval of length Δ given that his/her infection status at the beginning of the interval is i. Here, i and j can take on the values 1 = uninfected or 2 = infected.PðDÞ is a 2x2 stochastic matrix with entries n i,j /(n i,1 + n i,2 ), where n i,j counts the number of individuals observed to be in state i at the beginning of interval Δ and in state j at the end of the interval. The entries are interpreted as conditional frequencies of observing a transition i ! j between the consecutive survey dates.
For the Garki surveys, Δ % 10 weeks. The transition matrices P(Δ) have a representation in terms of incidence and recovery rates given by: Here q 1 is the incidence rate at any time t in the interval Δ, and q 2 is the corresponding recovery rate. Bekessy et al. [6] showed that a 2x2 stochastic matrix, P Ã , has a representation of the form Eq (1) if and only if Thus, the empirically determined stochastic matricesPðDÞ can be generated by observations taken at two points in time on a continuous time Markov chain provided tracePðDÞ > 1. Formal statistical tests of this hypothesis were put forth by Singer & Cohen [7].
Interestingly, Bekessy et al. [6], Singer & Cohen [7], and Molineaux & Gramiccia [3] found that all pairs of consecutive surveys during the baseline period of data collection in the Garki project satisfied eq (2). However, there were pairs of consecutive surveys conducted during the intervention phase of the project where tracePðDÞ < 1. In these instances, Molineaux & Gramiccia [3] claimed that incidence and recovery rates were not estimable. While this assertion is correct for the class of model eq (1), the unanswered question as of 1980 was: what alternative and substantively defensible models could generate transition matrices satisfying tracePðDÞ < 1 and within which incidence and recovery rates could be estimated? Surprisingly, to the best of our knowledge, this question has not been taken up in the past 40 years. Nevertheless, its importance stems from the fact that 2-wave panel data in a diversity of malaria surveys/surveillance projects have a majority of their transition matrices,PðDÞ, satisfying tracePðDÞ < 1. In addition, incidence and recovery rates are important quantities for evaluation of malaria intervention programs.
The purposes of this paper are to: (i) present a class of models with associated 2x2 transition matrices, P Ã , within which those satisfying eq (1) are nested, some members of which satisfy trace P Ã < 1, and for all of which incidence and recovery rates can be calculated; (ii) exhibit identifiability and/or partial identifiability criteria arising from specific malaria contexts that ensure uniqueness, or highly constrained non-uniqueness of incidence and recovery rates; and (iii) apply the models and methods in (i) and (ii) to a time series of 2-wave panel data sets from Dar es Salaam, Tanzania as an example of the applicability of the proposed method [5,8,9]. To facilitate dissemination and utilization of the methodology by the malaria community, we developed a code to calculate incidence rates from longitudinal data utilizing the R package v.2.15.1 [10], which we make available as Supporting Information in a documented version (S1 Code) and as a R file (S2 Code). To allow replication of our results, we provide the Dar es Salaam data utilized in this paper (Tables 1 and 2).
An important feature of our considerations is allowance for non-identifiability in situations where subject-matter-based constraints are too weak to ensure unique parameter identification from input information. Depending upon the particular setting, the extent of non-identifiability-or partial identifiability-may be sufficiently limited that the unidentified parameters are restricted to a narrow range of values, with accompanying incidence rates also varying over a small range. This is, indeed, the situation in Dar es Salaam, the site of our empirical data. As a consequence, we show how to explicitly incorporate variability due to a small degree of underidentification together with sampling variability to produce composite uncertainty intervals for incidence rates. Useful discussions of partial identifiability, including in the setting of mixture models, are given by Manski [11], Gustafson [12], and Henry et al. [13].

Ethics statement
Ethical approval to use the UMCP data was provided by the Harvard T.H. Chan School of Public Health Institutional Review Board (Protocol # 20323-101).

Two-component mixtures of continuous time Markov chains
We begin with the observation that every 2x2 stochastic matrix, P, with non-zero entries can be a transition matrix for at least one 2-component mixture of continuous-time Markov chains. Thus, P has a representation of the form where S is a diagonal matrix with entries s i , i = 1,2 in the unit interval, [0,1]. U and V are each stochastic matrices having representations of the form eq (1). Hence they satisfy the condition eq (2): trace U > 1 and trace V > 1. It will be convenient to represent the matrices P, U, and V as points in the unit square; namely p = (p 1 ,p 2 ), u = (u 1 ,u 2 ), and v = (v 1 ,v 2 ). The coordinates p i , u i , and v i , i = 1,2 are the diagonal entries in P, U, and V respectively. Depending on the empirical setting, we will also have occasion to consider data of the form p = (p 1 ,0) and p = (0,p 2 ); i.e. points on the boundary of the unit square. In the first of these conditions, P has a representation of the form eq (3), but with u = (1,0) (hence, trace U = 1) and trace V > 1. For p = (0,p 2 ), the representation eq (3) also holds but now with u = (0,1) and trace V > 1. Data of the form p = (p 1 ,0) occur frequently in the Dar es Salaam data, as discussed later. In these boundary cases, U still has a representation of the form eq (1) but with q = (0,−1) when p = (p 1 ,0), and q = (−1,0) when p = (0,p 2 ). In the first case, the interpretation of q 2 = 1 is that there is an infinitely fast recovery rate. This would be associated with a population where everyone is on prophylaxis, or where effective anti-malarial drugs are administered immediately following a diagnosis of infection. In the second case, q 1 = 1 corresponds to an infinitely fast incidence rate. This would be a situation where there is instantaneous new infection of any exposed individual. For data corresponding to p in the interior of the unit square-i.e.PðDÞ with non-zero entries-we calculate the incidence and recovery rates for the mixture eq (3) via: In terms of U and V, the rates q ðUÞ i and q ðVÞ i , i = 1,2 can be expressed as (note that these formulas are entries in matrix logarithms of U ¼ e Q ðUÞ D and V ¼ e Q ðVÞ Dsee e.g. [7]): When p = (p 1 ,0), we have The recovery rate is 1 for every s 2 > 0. When s 2 = 0, we have An analogous argument yields r 1 and r 2 when p = (0,p 2 ).

Identification
We first re-express Eq (3) via the system of equations where (s i ,u i ,v i ) 2 [0,1] 6 for i = 1,2 and u 1 +u 2 >1, v 1 +v 2 >1. Given p = (p 1 ,p 2 ), eq (6) is an under-identified system. Additional subject-matter motivated constraints must be imposed to either identify (s,u,v) uniquely or restrict this vector to a small set in [0,1] 6 \((u,v):u 1 +u 2 >1, v 1 +v 2 >1). In the context of malaria in Dar es Salaam, we impose the constraints: A full rationale for the above restrictions will be given later when we present the Dar es Salaam case study. However, the central point here is that a system of constraints such as these is essential for parameter identification or partial identification. The conditions that v 1 − v 2 be 'large' and v 1 < 1 require additional comment. First, it is a matter of judgment about what is a high probability of being observed uninfected at consecutive surveys of the V-process, while still not being a sure thing-i.e. v 1 = 1. In identifying parameters, we first select v 1 2[0.9,1) and secondarily choose v 2 as small as possible consistent with the other constraints. Two examples will serve to illustrate the issues.
Example 1. Suppose p = (p 1 ,p 2 ) = (0.7,0.2). This is a case where trace P = p 1 + p 2 < 1. Now, set v 1 = 0.95. Then (s 1 ,u 1 ) is a point on the curve 0.95s 1 − s 1 u 1 − 0.25 = 0, subject to the constraints u 1 0.2 and s 1 < 0.5. This is a curve of admissibility for (s 1 ,u 1 ). Then, from 0.20 = The left hand equality represents a curve of admissibility for (s 2 ,v 2 ). The two curves represent the extent of non-identifiability of the parameters in eq (6) given the a priori constraints for Dar es Salaam. This variation could be further reduced if, for example, we specified s 1 based on additional empirical evidence.
Some judgment enters in choosing v 2 > 0.05. We want v 2 close to 0.05 to make v 1 − v 2 'large', but not so close that v 1 + v 2 = 1 + ε yields an unreasonably small value of ε. The concern here is that the contribution to the overall incidence rate from the V-process is given by q ðVÞ where Δ is the time between consecutive surveys. As ε ! 0, q ðVÞ 1 ! 1. This is, of course, physically impossible. Thus, we choose ε = 0.008 based on the fact that larger values would not have much influence on q ðVÞ 1 , while smaller values, by another power of 10 or more, would yield unrealistically high conversion rates from negative to positive parasitological status. Now, ε = 0.008 implies that v 2 = 0.058 and s 2 = 0.15. This leaves (s 1 ,u 1 ) on the curve 0.95s 1 − s 1 u 1 − 0.25 = 0 as the only source of non-identifiability. Table 3 shows the impact of this source of variation on the incidence rate r 1 .
In Table 3, We set Δ = 40 weeks, which is roughly the average spacing between consecutive surveys in the Dar Table 3. Influence of variation in (s 1 ,u 1 )on the incidence rate r 1 . es Salaam data, as described later. The point (s 1 ,u 1 ) = (0.263, 0.006) corresponds to a very low probability of being observed uninfected at two consecutive surveys of the U-process. Thus, we retain the other 4 points as plausible values and include the variation in r 1 corresponding to them in our overall assessment of variability in the incidence rate. We again emphasize that this is variability induced by non-identifiability. It is an important feature of the methodology introduced herein. Overall variation in incidence rates is a composite of sampling variability and non-identifiability variation, detailed later.
if we set ε = 0.008 as in example 1, then r 1 is given by eq (5) and the model is identified. The values of v 1 and v 2 are determined by the choice of μ > 0 in the specification v 1 = (1 − μ)p 1 with μ small. As a numerical example, suppose p 1 = 0.9 and μ = 0.05. Then v 1 = 0.95 x 0.9 = 0.855. Since 0.9 = s 1 + (1 − s 1 )v 1 , we have s 1 = 0.31. Then v 2 = 1.008 − 0.855 = 0.153 and v 1 − v 2 = 0.702. Finally, we have Then r 1 = 0.01217 using Δ = 40 weeks. Although it does not arise in the Dar es Salaam example, we briefly indicate identification of v when p = (p 1 ,0) and p 1 < 0.5. Here we must have v 2 > v 1 as seen in the following simple, but generic, example. Suppose p 1 = 0.4 and v 1 = 0.95

Assessing variability of incidence rates
When p = (p 1 ,p 2 ) is in the interior of the unit square, we generate 1,000 tables by doing binomial sampling for row 1 with probability p 1 and for row 2 with probability p 2 with sample sizes n 11 + n 12 and n 21 + n 22 , respectively. If, for a particular table, the system of eq (6) has a unique solution (s,u,v), subject to context-specific constraints, then we compute an incidence rate, r 1 , for that table. If there is a zone of non-identifiability, as previously exemplified by the equation 0.95s 1 − s 1 u 1 − 0.25 = 0 in Example 1, then we compute r 1 for each of 100 values u 1 (which then determines s 1 ) subject to the a priori constraints on s 1 and u 1 . This yields a set of incidence rates that reflect variation due to non-identification. We used the minimum, median, and maximum values of r 1 from each such set of 100 values and viewed them as the summary rates for the particular table. Finally we take the summary rates, for tables where non-identifiability is an issue, and the unique rate for tables when the system eq (6) is identified, and rank this full set of rates. We designate the 2.5 th percentile and the 97.5 th percentile of the ranked list as the upper and lower bounds on a 95% variation interval for the incidence rate of the original table. This takes both sampling variability and variation due to non-identifiability into account.
When p = (p 1 ,0), we treat the 0 as a structural zero-in the case of Dar es Salaam-and only do binomial sampling on the first row to generate 1,000 tables having this same structure. We then describe the variation in r 1 in the same manner as indicated above. In applications where p = (p 1 ,0) does not have a structural zero, we perturb the second coordinate to a small valuee.g. 10 −5 or less-and do binomial sampling for the second row with this value. Then we proceed as in the above paragraph to calculate a confidence interval for r 1 .

Submicroscopic infection
Light microscopy has limitations as a technology for diagnosing Plasmodium falciparum infections, particularly in low transmission settings [14][15][16]. In a recent study of Okell et al. [16], the supplementary information for the paper contains an especially interesting and useful table comparing prevalence estimates using microscopy and PCR on the same blood samples. The data come from a wide variety of settings, and exhibit considerable variation in prevalence rates as ascertained via microscopy. The prevalence ratio, p = [prevalence rate from microscopy]/[prevalence rate from PCR] provides a basis for adjusting empirical microscopy rates to what you would expect to find if PCR had actually been done on the same blood samples. This calculation will, of course, only yield adjusted prevalence rates. For our longitudinal data, it would have been desirable to have microscopy and PCR based estimates of p 12 ,p 21 and p 22 , from which we could directly recover p 11 . However, the lack of identifiability of transition probabilities from prevalence rates can still be dealt with in particular settings, such as Dar es Salaam, by invoking an additional, and obviously context dependent, constraint. We exhibit the methodology on p = (0.7, 0.2)-the transition rates in example 1-augmented by a table of counts with entries n ij consistent with these values. Using an adjusted table of counts, n Ã ij , and thereby an adjusted vector, p Ã ¼ ðp Ã 1 ; p Ã 2 Þ, we calculate the incidence rate, r Ã 1 , that represents what we might have expected to find if PCR had been done on the blood samples in Dar es Salaam.
We introduce the table of counts {n ij ,1 i, j 2} with n 11 = 100, n 12 = 43, n 21 = 90, and n 22 = 23. For this table, p = (0.7, 0.2). It is one of a myriad of tables that could have been selected to illustrate our points about submicroscopic infection. However, it is comparable in size to many of the sub-ward tables in the Dar es Salaam data, and thus especially useful for illustrating an adjustment methodology.
The prevalences at the initial and final rounds of data collection for the above  Table S1 in Okell et al. [16], we find the prevalence from microscopy that is closest to the prevalence at initial survey in Dar es Salaam given by 0.4414. This is the prevalence of 0.481 based on data from Guinea Bissau. The corresponding prevalence ratio is p = 0.551. Thus our estimate for a corresponding PCR-based prevalence rate at the initial survey is 0.4414/ 0.551 = 0.8011. In contrast to the initial survey situation, there are four nearby microscopybased prevalence rates to associate with the prevalence rate for the final survey given above by 0.2578. These values, their associated prevalence ratios, and our estimate for the corresponding PCR-based prevalence rates are shown in Table 4. Each PCR rate is equal to 0.2578/p.
For our analysis, we use the average of these PCR rates, namely 0.5173. To obtain an associated table of counts n Ã ij , we first observe that n Ã 11 þ n Ã 12 þ n Ã 21 þ n Ã 22 ¼ 256 ¼ total count from the microscopy-based table with entries n ij . From PCR prevalence at initial survey = 0.8011, we obtain n Ã 21 þ n Ã 22 ¼ 205. From PCR prevalence at final survey = 0.5173, we obtain n Ã 12 þ n Ã 22 ¼ 132. Adding and subtracting n Ã 22 to the equation for total count, we can rewrite it as ðn Ã Then we have that n Ã 11 À n Ã 22 ¼ À81. To be consistent with the microscopy-based vector, p = (0.7, 0.2), where obviously p 1 > p 2 , we choose n Ã 11 to ensure that p Ã 1 > p Ã 2 . Table 5 shows some choices for estimated PCR-based tables. Here we use as an example p Ã 1 ¼ 0:784 and p Ã 2 ¼ 0:590 (third line of Table 5) for subsequent calculations. To obtain what is interpreted as a PCR-based incidence rate, we proceed as in the previously described microscopy-based rate calculations. First set and r Ã 1 as indicated in Table 5.

Case study: Malaria in Dar es Salaam
Dar es Salaam. Dar es Salaam is the largest city and economic capital of the United Republic of Tanzania with a population of 4.4 million as recorded by the 2012 Population Census. The city is divided into three municipalities: (i) Ilala, the administrative area, where almost all government Offices and ministries are located; (ii) Temeke, the industrial area, which holds the main manufacturing centers and the Dar es Salaam Port, and has the largest concentration of low income dwellers; and (iii) Kinondoni, the most populated municipality that concentrates most of the high income housing in the city. Municipalities are divided into wards, and the smallest administrative component is the ten-cell unit.
Analogous to other urban areas in sub-Saharan Africa, fast urban growth was observed in Dar es Salaam in recent decades, but not accompanied by effective provision of infrastructure. It is estimated that in Dar es Salaam 70% of the population lives in unplanned settlements [17], with precarious access to improved drinking water, sanitation [18], and waste collection. Although an extensive network of drains is available [19][20][21], many of the drains lack proper and regular maintenance resulting in waste accumulation, flooding, water stagnation, and the proliferation of breeding sites for disease vectors [22,23].
The dominant malaria vectors in Dar es Salaam are Anopheles gambiae s.s. and An. funestus [24], and they breed mostly in aquatic habitats found in drains, ditches, construction pits, other human-made roles, and habitats associated with urban agriculture [23]. Plasmodium falciparum is responsible for more than 90% of malaria infections, and transmission is perennial [25] with seasonal variations related to the rainy seasons.
Dar es Salaam Urban Malaria Control Program (UMCP). The methodology here proposed was applied to data from the UMCP, a large-scale larviciding intervention conducted in urban Dar es Salaam [8]. Reasons to choose the UMCP were three-fold. First, the availability of longitudinal data, specifically time series of two-wave panel data sets; six survey rounds were collected over a period of 4 years and 7 months (May-2004 to Dec-2008). Second, the intervention was shown to be successful [5,24,26], with the prevalence of malaria infection decreasing from 20.8% (95% CI: 16.8-24.9%) in the first survey round to 1.7% (95% CI: 1.4-2.1%) in the last [5]. Thus, the data describe a scenario of declining transmission to very low levels, similar to what is currently observed in countries/regions reaching malaria pre-elimination stages. Third, the calculation of incidence rates was a necessary input for the cost-effectiveness analysis of interventions [9]. The UMCP started in 2004 targeting 15 wards (Fig 1), five in each of the three municipalities, covering approximately 56 km 2 and more than 610,000 habitants [8]. After a period of baseline data collection and local capacity building, the program introduced the use of larvicides in three phases: in March 2006 it was introduced in 3 wards; in April 2007 it was expanded to 6 additional wards; and in April 2008 all 15 wards were being treated with larvicides. The time period of the survey rounds did not exactly match the launching of the intervention, so wards were exposed to larval control for different periods of time (Fig 2).
The UMCP wards revealed some variability regarding behavioral, socioeconomic, and geographical characteristics. While individual knowledge about malaria transmission was similar across wards, an index of household assets and, especially, housing conditions (quality of walls, screening, and source of water) showed important differences. For example, 47% of houses interviewed in Ilala had access to a good source of water, while in Temeke this number was only 19%. Use of a mosquito net, however, was relatively high, but declined significantly following the successful implementation of larval control [27]. The organization of the urban space was very heterogeneous, combining patches of planned areas (both higher and lower housing quality), squatter dwellings, urban agriculture, open spaces, industry, commerce, and areas of recent expansion (where foundations of houses, usually unfinished, often accumulate rainwater and offer conditions for mosquito breeding).
Data structure. Data available for estimating incidence rates consist of a time series of 2-wave panels covering six survey rounds collected over a period of 4 years and 7 months (May-2004 to Dec-2008), in the 15 UMCP wards [5,8,24]. Each survey round had five stages, and in each stage interviews were conducted in 3 out of the 15 UMCP wards (one in each municipality)-the order of the municipalities interviewed in each stage remained the same throughout the study (Fig 2). Therefore, five 2-wave panel data sets are available including baseline and intervention phases. A total of 5,223 participants were followed up twice, 2,349 three times, 1,236 four times, 472 five times, and 99 individuals participated in every survey round. Intervals between consecutive rounds of data collection varied from 6-12 months. Thus, repeated observations on the same individual can be separated by one or more unobserved changes in infection status (interval-censored data).
Individuals in the survey were tested for a malaria infection, upon consent. Finger-pricked blood samples were analyzed using Giemsa-stained thick smear microscopy. Quality check was conducted on a 10% sample of blood slides at the Muhimbili University of Health and Allied Sciences-MUHAS (a center of excellence in laboratory analysis), indicating a 94.5% specificity rate and 95.7% sensitivity rate [22]. All individuals found to be infected with malaria were treated with appropriate front-line regimens-sulphadoxine-pyrimethamine until August 2006, after which it was replaced with artesunate-amodiaquine.
Rationale for model constraints. We assume that there are two types of individuals governed, respectively, by the Markov chains with transition probabilities U and V. When p 2 (0,1) 2 = interior of the unit square, U represents people who do not follow anti-malarial drug regimens if they become infected, who do not sleep under bed nets, and whose houses lack adequate screening and tend to be in close proximity to larval habitats. Once infected, they remain so-also experiencing some superinfection-for sustained periods of time. This is reflected in the a priori assumptions that u 1 0.2 and u 2 = 1.
The V process represents individuals who go for a diagnosis at the first sign of possible malaria symptoms and follow a prescribed drug treatment regimen if they are found to be infected. The quantification of this behavior is to have v 2 small (and certainly < 0.5), which implies a high frequency, 1 − v 2 , of infected to uninfected transitions between surveys. Since all infected individuals are treated, a large value of 1 − v 2 implies that the treatment regimen is usually followed. Further, reinfection rates will be low (corresponding to high v 1 , and hence Finally, we assume that the U-process people are a minority in a locality where a control program is operative and local knowledge about malaria tends to be high. Thus, we assume, as a minimal condition, that s 1 < 0.5. When p = (p 1 ,0), we assume that p 2 = 0 is a structural zero. Thus, anyone found infected is transferred to the uninfected state via a regimen of anti-malarial drugs. Then U takes on the special form u = (1,0), and q ðUÞ 1 ¼ 0, q ðUÞ 2 ¼ þ1. This means that there are no transitions from uninfected to infected, and immediate transition from infected to uninfected. The V population operates under the same assumptions discussed above.
Estimated incidence rates in the context of the UMCP (based on microscopy). Estimates of incidence rates (cases per week) based on the UMCP data, detailed by pairs of consecutive survey rounds (Fig 3 and Table 6) and by ward (Fig 4 and Table 7) were obtained utilizing the methodology here proposed. Estimated rates ranged from 0.1202 cases per week in R12 (0.1537 in Azimio ward)-baseline period when no larviciding had been launched yetto 0.0010 cases per week in R56 (a third of the wards had a rate of zero)-the intervention achieved full coverage in R6. An overall declining trend was observed, even before larviciding was introduced in three wards in March 2006 [5]. As the coverage of the intervention expanded, fluctuation in the rates was reduced, and rather stable rates were observed during R56. Just for illustration, we also plotted rainfall information lagged by one month [5] in Fig 3. Although seasonality in malaria transmission following the rains is often observed, rainfall patterns per se cannot fully explain fluctuations in incidence and prevalence rates, since many other factors can potentially overcome or augment the impact of rains [5].
Disaggregating by individual wards (Fig 4) raises an important issue: the lack of a perfect match between the time and duration of each stage/survey round, and the time when each phase of the intervention was launched. Phase 1 started in March 2006 (see Fig 2), slightly after the onset of the main rainy season, and at the end of data collection in round 3 (R3). Considering that there is a biological time lag between reducing larval survival and reducing malaria transmission from human to human, and the initial programmatic challenges of launching the intervention [28,29] (augmented by the heavy rains), we expect that incidence rates estimated for R34 (in the three wards targeted with the intervention) should be the first to potentially reflect changes due to the larval control. However, each one of these three wards (first line of three graphs in Fig 4) was interviewed at a different stage, and thus exposed to the intervention for different periods of time. Between March 2006 and the beginning of interviews there were about 180, 240, and 300 days in Buguruni, Mikocheni, and Kurasini wards, respectively. Except from Buguruni, incidence rates stabilized at lower levels after R34. One peculiarity of Buguruni is the intense migration from northern areas of the country. Phase 2 started in May 2007, a month after interviews for R5 had started, expanding larviciding application to six additional wards (second and third lines of graphs in Fig 4). Here, between May 2007 and the beginning of the interviews there were about 30 days for Mzimuni and Keko, 60 for Miburani, and 90 for Ilala. In the case of Vingunguti and Mwananyamala, however, R5 interviews started in April 2007, before the introduction of larviciding. Thus, changes in incidence rates observed in these wards for the period R45 cannot be associated with larval control.
Phase 3 started in April 2008, a month after interviews for R6 had started, expanding larviciding application to the remaining six wards (fourth and fifth lines of graphs in Fig 4). The number of days between the onset of phase 3 and R6 interviews was about 180 for Magomeni and Kipawa, 150 for Ndugumbi and Mtoni, and 60 for Mchikichini. Azimio, however, had R6 data collection between March-May 2008, and thus data for this ward were not suitable to capture any potential contribution of the larval control intervention.
A crude analysis of the estimated incidence rates by wards is also difficult due to the large diversity in social, economic, and environmental conditions. In addition, changes in the urban landscape are dynamic, reducing or augmenting the risk of malaria transmission. As a result, wards interviewed at the same period (thus subjected to same rainfall pattern), targeted by the same intervention phase, may show distinct trends in incidence rates. For example, Ndugumbi and Mtoni were both targeted during the 3 rd phase of larviciding, and interviewed during stage IV of each survey round (see Fig 2), but while Ndugumbi stabilized at very low incidence rates even before larval control, Mtoni did show a steep decline in rates after the introduction of larviciding (see Fig 4). Also, Mwananyamala and Vingunguti were targeted during the 2 nd phase of the larval control intervention, and interviews conducted during stage I of each survey round. Estimated incidence rates for both wards were among the highest for R12, reducing significantly in R23. However, while Mwananyamala maintained low rates (even before the introduction of larviciding), Vingunguti observed an increase in incidence rates in R34 and R45. Since interviews in Stage I started before the launching of phase 2 of the intervention, any  Variability intervals consider the uncertainty due to sampling and partial identifiability of the mixture model. Prevalence rates were calculated using two distinct sets of data. First, the longitudinal data also utilized for the calculation of incidence rates (Prevalence-Follow-Up). Second, crosssectional data collected at the same time and wards of the longitudinal data, but in different tencell units (Prevalence -Cross-Section). Each column of graphs shows wards from one municipality: the first column has wards from Kinondoni, the second column those from Temeke, and the third column shows wards from Ilala. Wards in the first line of the Figure were included in the first phase of the intervention; second and third line of graphs include wards targeted during the second phase of the intervention; and the last two lines show the wards included in the third and last phase of the intervention. The line indicating the onset of the larval control was placed on the pair of survey round at which any impact of each phase of the larval control could be observed. The Roman number after the name of the ward indicates the stage when interviews were conducted in each survey round (refer to Fig 2). For ease of visualization, estimates and variability intervals for Azimio and Vingunguti in R12 were truncated, as well as the variability interval for Mwananyamala in R12.
doi:10.1371/journal.pcbi.1005065.g004 impact would only be detected in R56, and a decline was indeed observed in Vingunguti by then. Distinct patterns were also observed for Magomeni and Kipawa-both interviewed in stage V of each survey round and targeted in phase 2 of the intervention. In such a complex scenario, even the consideration of a large number of variables to define control wards (to be compared with intervention wards) is a near to impossible task. Important to note, however, are the large uncertainty intervals observed in some wards (Fig 4), mainly due to sampling variability (as detailed next). Sensitivity analysis. Sensitivity analysis of the incidence rate estimates by survey round (Table 6) and by ward (Table 7) was assessed through variation intervals that reflect uncertainty from non-identifiability of the mixture model, and variation intervals that combine both non-identifiability and sampling variation. Results indicate that variation from non-identifiability of the mixture model here proposed was rather small, as we would expect considering the rigorous context-based constraint conditions that we assumed. Thus, the bulk of the uncertainty in r 1 could be attributed to sampling variability. Considering the level of detail used for the estimation of incidence rates by ward (15 wards Ã 5 pairs of consecutive survey rounds = 75 estimates), the fact that individuals who tested positive for a malaria infection were treated with anti-malarials, and the decline in transmission observed in the UMCP (even before the larviciding intervention was launched), implied that the small counts for n 21 , n 22 , and n 12 ( Table 1) that were observed in some pairs of survey rounds were not unexpected.
Thus, sampling variation becomes rather important, especially for p 2 . The magnitude of uncertainty was smaller for the estimates by stage of the survey round since data were more aggregated (5 stages Ã 5 pairs of consecutive survey rounds = 25 estimates), and thus larger counts for n 21 , n 22 , and n 12 ( Table 2) obtained.
The largest variation intervals in estimated rates were observed in three wards in R12 that recorded particularly high counts for n 12 and thus lower values for p 1 ; Azimio had p 1 = 0.368 (and thus p 2 > p 1 and v 1 − v 2 is small), Vingunguti had p 1 = 0.542 (and v 1 − v 2 is small), and Mwananyamala had p 1 = 0.603 (and v 1 − v 2 is small).
Comparison of incidence and prevalence rates. We compared estimated incidence rates with prevalence rates calculated directly from survey data. Since the UMCP collected both longitudinal and cross-sectional data [5], we were able to calculate two sets of prevalence rates: one utilizing the follow-up data (the same data used to calculate incidence rates), and another utilizing the cross-sectional data collected at the same time of the follow-up data, but in different ten-cell units. Fig 5 shows that, overall, prevalence rates followed the same pattern of the incidence rates, becoming much lower and more stable after phase 3 of the intervention. The same was observed when analyzing the rates disaggregated by wards (Fig 4). The two prevalence rates were similar in magnitude, with only three exceptions: Azimio, Vingunguti, and Mwananyamala, at the time when incidence rates were also extremely high.
It was reassuring that prevalence rates calculated from follow-up and cross-sectional data had similar values (with a few exceptions when the intensity of transmission was high). Despite possible intra-ward variability in malaria transmission, the ten-cell units sampled for follow-up and during cross-sectional surveys seem to be representative of the transmission pattern in the ward, which is reflected in similar prevalence rates for both data types.
Estimated incidence rates based on PCR adjustments. The same pattern seen for estimated incidence rates based on microscopy data was observed for PCR-adjusted incidence rates (S1 Fig). Prior to the final survey rounds (R56) of data collection, the estimated PCRadjusted rates were slightly higher than those estimated based on microscopy data. However, as larviciding was scaled-up, estimated PCR-adjusted incidence rates near the end of the program did not differ qualitatively from what was calculated based on microscopy. Nevertheless if microscopy were to indicate an incidence rate of zero-consistent with elimination-the PCR rates being positive but small would not support such a claim.

Discussion
The methodology we introduced for estimating incidence rates from 2-wave panel data satisfying tracePðDÞ < 1 utilized 2-component mixtures of continuous-time Markov chains as a class of models to represent such data. This was accompanied by an unavoidable imposition of inequality constraints to facilitate parameter identification or, in some instances, partial identification. The application of this strategy for estimation of incidence rates in an urban malaria control initiative in Dar es Salaam represented a practical instance where such tools played a decisive role in quantifying reductions in malaria acquisition rates in a low endemicity setting. With this general methodology and particular example in hand, several further issues regarding incidence rate estimation require clarification.

a) Alternative specification of waiting time distributions
The continuous-time Markov chains all have exponentially distributed waiting time distributions in each state, which implies that their hazard rates are constant. Two-wave panel data, assumed to be generated by some 2-state continuous-time stochastic process, is not sufficiently rich to provide a basis for testing this hypothesis. However, several basic facts about malaria in diverse ecosystems make this assumption untenable. For example, in the Garki study [3], persons who survive repeated episodes of malaria in infancy and childhood have antibody titers that ensure an increasing hazard rate in the infected state-i.e. the longer an individual has detectable parasites, the more likely he/she is to clear parasites free of any intervention, and return to the uninfected state. For uninfected individuals at the end of a dry season, the hazard rate for onset of a new infection is also increasing, corresponding to the propensity for rain and, thereby, standing water. One of many possible parameterizations of these qualitative ideas is given by the waiting time distributions F (i) (t), t > 0, where i = 1,2 designate the states Estimated incidence rates, variability intervals, and prevalence of malaria infection by stages of the survey rounds. Variability intervals of the incidence rates consider the uncertainty due to sampling and partial identifiability of the mixture model. Prevalence rates were calculated using two distinct sets of data. First, the longitudinal data also utilized for the calculation of incidence rates (Prevalence-Follow-Up). Second, cross-sectional data collected at the same time and wards of the longitudinal data, but in different tencell units (Prevalence-Cross-Section). Each survey round of the UMCP conducted interviews in three wards at a time, or stages. Thus, each round had five stages in order to cover all 15 wards. uninfected (i = 1) and infected (i = 2), and having probability density functions and hazard rates h i ðtÞ ¼ f i ðtÞ 1ÀF ðiÞ ðtÞ ; i ¼ 1; 2: Here h i (t) is: increasing if α i > 1, constant if α i = 1, and decreasing if α i < 1.
The family of Gamma densities eq (7) are the basis for obtaining an expression for P(Δ) within the class of semi-Markov models by solving the backward integral equation system [30].
where δ ij = 1if i = j, δ ij = 0 if i 6 ¼ j, and 1 i, j r with M = km ik k an r x r stochastic matrix having m ii = 0 for 1 i, j r. Specification eq (8) holds for general r-state process and waiting time densities f i (t), 1 i r. For our purposes, r = 2, m 12 = m 21 = 1, and we focus on the 2-parameter family eq (7). Here we set t = Δ and identify p ij (0,Δ) with the (i,j) entry in P(Δ).
The equation system eq (8) is amenable to the following interpretation: (i) when i 6 ¼ j, p ij (0,t) consists of the sum of products of three factors: the probability of a first departure from state i at time s, the probability of a transition from state i to state k at that instant, and the probability of transferring to state j by some combination of moves in the interval t−s The summation is over all intermediate states k and all time divisions s in (0, t); (ii) when i = j, in addition to the preceding term, there is the probability of not transferring out of state i during (0, t). This is given by the first term.
With empirical data, solving the system of equationsp i;j ðDÞ ¼ p ij ð0; DÞ; i ¼ 1; 2 for parameter values as in the Gamma specification above, requires a priori context-dependent constraints on the parameters-to secure identification or partial identification-and numerical inversion calculations in eq (8). Going back to the empirical analyses in the Garki baseline surveys [3], the disconcerting issue that now arises is that the entire set of 2-wave panel data sets shown in Singer & Cohen [7], with incidence and recovery rates computed within the class of continuous time Markov chain models, could just as well have been used to estimate incidence and recovery rates within the class of 2-state semi-Markov models with Gamma distributed waiting times. The same can, of course, be said for the rates computed in the prior section for Dar es Salaam now using, in some of the trace P < 1 cases, 2-component mixtures of semi-Markov models with analogous inequality constraints facilitating identification or partial identification of parameters. b) In low-endemicity settings, do we need to take formal account of the possibility of many (i.e., 3 or more) unobserved transitions?
If we use a crude incidence rate given by r 1 ðcrudeÞ ¼p 12 ðDÞ D , this essentially assumes that there is at most one unobserved transition in the inter-survey interval, Δ. For the Garki surveys where Δ = 10 weeks, we find, not surprisingly, that r 1 (crude) <r 1 (Markov) in the baseline surveys. In the lower endemicity setting of Dar es Salaam, the same inequality holds when traceP > 1, but now Δ % 40 weeks. If Δ had been approximately 10 weeks for Dar es Salaam, we would anticipate very little difference between r 1 (crude) and r 1 (Markov) .
We also find that r 1 (crude) < r 1 (Mixture) when traceP < 1 in Dar es Salaam, but this is decidedly influenced by the long inter-survey intervals. As two examples, consider the ward Buguruni, for survey intervals R23 (with Δ = 42.2 weeks) and R56 (with Δ = 49.2 weeks). For R23, r 1 (crude) = 0.0031 and r 1 (Mixture) = 0.0081. For R56, r 1 (crude) = 0.0007 and r 1 (Mixture) = 0.0039. Thus, during the last survey interval, R56, when the larvicide intervention was operating effectively, we still have r 1 (crude) and r 1 (Mixture) differing by a factor of 5.6. Taking formal account of the possibility of multiple unobserved transitions clearly makes a difference, and is preferable to the crude incidence rate, generally. For the 2-wave panel data in Dar es Salaam, we have: where N = n 11 +n 12 +n 21 +n 22 . Using eq (9) we compare Inc with r 1 (Mixture) for two wards, Buguruni and Kurasini, early in the larviciding program, R23, and at the end of it, R56 ( Here there is no apparent transmission between survey rounds 5 and 6. The infected individual at survey is, in accordance with the study protocol, treated with anti-malarial drugs. The general inequality Inc < r 1 (Mixture) is basically a consequence of the fact that unobserved transitions are not taken into account in Inc.
In moderate to high transmission areas, we anticipate that Inc will be substantially downward biased as a result of lack of formal consideration of unobserved transitions. d) Estimating incidence from single and multiple prevalence surveys Different methods have been proposed to generate incidence (as defined in this paper, or the rate of occurrence of an event) from prevalence data [31][32][33][34]. As more applications of the methodology here introduced are made, analysis that would produce both incidence and prevalence rates could provide a unique opportunity to evaluate the best approach to obtain incidence rate estimates from prevalence rates-currently, it is unclear what is the best strategy. Such an exercise could lead to clear recommendations that would have both a wide applicability in malaria endemic countries and a crucial importance for National Malaria Control Programs (e.g., planning and evaluation of the cost-effectiveness of interventions).
In conclusion, this paper introduced new methodology for estimating incidence rates from 2-wave panel data with interval censoring, which is applicable in the many cases where the extant Markovian models are inapplicable. The methodology is suitable to settings with any malaria transmission level, given the availability of longitudinal data. In addition, we present a strategy for quantifying the uncertainty in estimation of incidence rates. It is hereby distributed with a well-documented programming code that allows the use of the method in R software.
Supporting Information S1 Code. R-code to estimate malaria incidence rates from interval-censored longitudinal data, and to calculate variability intervals for the estimated rates. This file contains a commented version of the code written in R. The.R file is also available for download (S2 Code). (DOCX) S2 Code. R-code to estimate malaria incidence rates from interval-censored longitudinal data, and to calculate variability intervals for the estimated rates. This file contains the.R file (which can be directly read and used in R). (R) S1 Fig. Estimated incidence rates and PCR-adjusted incidence rates, and partial identifiability intervals by wards and consecutive pairs of survey rounds. Each column of graphs shows wards from one municipality: the first column has wards from Kinondoni, the second column those from Temeke, and the third column shows wards from Ilala. Wards in the first line of the Figure were included in the first phase of the intervention; second and third line of graphs include wards targeted during the second phase of the intervention; and the last two lines show the wards included in the third and last phase of the intervention. The line indicating the onset of the larval control was placed on the pair of survey round at which any impact of each phase of the larval control could be observed. The Roman number after the name of the ward indicates the stage when interviews were conducted in each survey round. Assumptions for the calculation of PCR-based rates were extracted from Okell et al. [36]. (TIFF)