An algorithm for data reconstruction from published articles – Application on insect life tables

Abstract: Data collection in life table experiments is generally time-consuming and costly, so reconstructing data from published information provides an avenue to access the original data for further investigation. In this paper, we present an algorithm that reconstructs life table raw data using a summary of results from published articles. We present the steps of the development and implementation (in the R computer language) of the algorithm, its scope of application, assumptions, and limitations. The statistical background of the algorithm is also presented. The developed algorithm was then applied to the reconstruction of life table data of two insect species, Chilo partellus and Busseola fusca, from published information. Welch's two-sample t-test was applied to test the difference between the original and reconstructed data of the insect life stages. C. partellus results were not significantly different, but, for B. fusca, pupa development time and larva and pupa development rate were significantly different at the 95% confidence level. It is concluded that the algorithm could be used to reconstruct original data sets from cohort life table data sets of insects, given published information and sample sizes.


ABOUT THE AUTHOR
D.N. Kareithi
Dorcas Kareithi is a trained biometrician who has been handling data in research centers and organizations for over 5 years. She is an expert in data management and analysis, and in using computational tools to improve data collection and analysis. Her current interests and works include exploring methods to use publicly available data for biological and medical research. These methods include data mining and machine learning methods and how such methods can contribute to effective and accurate research in science.

PUBLIC INTEREST STATEMENT
Data collection and analysis in all scientific studies are very time-consuming and expensive. In previous cases, if a researcher wanted to run more tests or find out more information about a species, they would have to set up an experiment to enable them to do the calculation. This is done even for species whose information has been publicly published. Instead of setting up expensive experiments to run basic calculations, we provide a way for researchers to reconstruct original life tables from published information. In this paper, we test the algorithm and compare the results for two species.

Introduction
Insect life tables are a convenient method for summarizing the amount of mortality in each generation of an insect population. Generally, life table data provide a detailed description of the survivorship, development, and expectation of life and give the researcher an opportunity to assess and evaluate the impact of various factors on the population (Carey, 2001). In the study of populations, life tables are usually used to highlight the various growth parameters in the life history of the species. The construction of a life table is an important component in understanding the population dynamics (demography and biology) of a species, as explained by Deevey (1947). Life table data, unlike other data collected for research, are collected in stages, starting from birth and continuing with frequent observations on the demographic processes of the population of the species under study. These demographic processes include births, life stages, deaths, emigration, evolving, and any other process that affects the sample size and composition of the population. The timing and frequency of these processes are the quantities of interest in a life table. Life table data collected over a specific time period are called a time-specific life table (Carey, 2001). A longitudinal follow-up of a generation of a population from birth through the consecutive ages until the death of all individuals in the generation is known as a cohort life table (Carey, 2001). Both categories can be either complete or abridged: a complete life table computes the functions for each day of life, while an abridged life table deals with age intervals greater than 1 day, such as a complete stage (Bellows, Van Driesche, & Elkinton, 1992; Deevey, 1947). The distinction between complete and abridged has solely to do with the length of the time period considered.
Once a population of interest is chosen and an observation time period is defined, the construction of a life table generally involves the following five phases (Deevey, 1947; Dublin, Lotka, & Spiegelman, 1950): (1) select a sample size, depending on the factor the species is exposed to; (2) observe and record the current stage of the population; (3) observe and record the number of individuals alive at each stage; (4) calculate the number dead at each stage; and (5) observe and record the duration of each stage.
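As a minimal illustration of the bookkeeping in phases (3) and (4), the sketch below derives the number dead in each stage from the number alive at the start of each stage. The stage names follow this paper, but the counts and the function name `stage_deaths` are hypothetical and are not taken from any of the cited studies.

```python
def stage_deaths(alive_at_start):
    """Build abridged life table rows from stage entry counts.

    alive_at_start is an ordered list of (stage_name, number_entering).
    A row is produced for every stage except the last one, because the
    deaths in a stage are computed from the count entering the next stage.
    Each row is (stage, entered, died, stage_mortality).
    """
    rows = []
    for (stage, entered), (_, next_entered) in zip(alive_at_start, alive_at_start[1:]):
        died = entered - next_entered
        mortality = died / entered if entered else 0.0
        rows.append((stage, entered, died, mortality))
    return rows


# Hypothetical cohort counts (illustration only)
cohort = [("egg", 100), ("larva", 80), ("pupa", 60), ("adult", 45)]
table = stage_deaths(cohort)
for stage, entered, died, mortality in table:
    print(f"{stage:6s} entered={entered:4d} died={died:4d} mortality={mortality:.3f}")
```

The adult stage gets no row here because its deaths would require a further count (for example, the number alive at the end of the observation period).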
Preparation of life tables, therefore, requires considerable resources and time, as several subjects are followed over time. The availability of data obtained from research activities is a challenge that continues to persist. Data reconstruction techniques provide an opportunity to obtain information from published studies. Existing studies on data reconstruction have focused largely on geographical information systems, such as reconstruction of maps, image reconstruction (Ediev & Gisser, 2007), genotype data (Stephens et al., 2001; Stephens & Donnelly, 2003), and dose response for patients (Kahneman, Krueger, Schkade, Schwarz, & Stone, 2004). Methodological difficulties have been pointed out as one of the challenges in reconstructing data from multiple published articles (Kita, 1993). Sometimes the source articles from which the data are to be reconstructed may not have complete information. Another challenge is that published articles report data in intervals, for example, 6-month intervals, an age range, or a mean age, and breaking these down or re-estimating the values to fit a time step of one day requires much time and skill (Kita, 1993). Another difficulty in using published or reported data is the problem of dealing with gross errors, misreporting, and coverage errors (Ediev & Gisser, 2007). Further, the inability to confirm the findings by comparing results from the reconstruction with real collected information poses a challenge.
In pest management, life tables provide an important tool for understanding the changes in the population of insect pests during the different developmental stages of their life cycle. Life tables may reveal when a pest population suffers high mortality. Such knowledge can be used to time interventions for the management of the insect pest. Data collection in life table studies is generally time-consuming and costly, so if such data are available in publications and one wishes to carry out further research on them, reconstruction of the original data is advantageous and economical compared with repeating the data collection. This paper, therefore, presents an algorithm that has been developed for the reconstruction of published life table data and that can be used by researchers in pest management science.
We describe how the algorithm can be applied in reconstructing life tables on published mortality rates, development rate, and total oviposition per female of an insect species. We assess the effectiveness and accuracy of the algorithm by replicating the original analyses using the reconstructed data. The results are discussed in light of opportunities, limitations, and implications of the use of the algorithm in life table data.

Life stages of invertebrates
Life tables of vertebrates such as fishes, amphibians, reptiles, birds, and mammals have no well-structured or defined life stages, while invertebrates such as arthropods and worms have specific and defined life stages from inception until death. Some of the most common stages are discussed in the following sections.

Egg
The egg stage begins with the females of the population under study laying eggs after fertilization. For insects and other animals, the eggs can be laid either internally or externally (eggs laid on leaves, in water, or on the ground, for example). Usually, the development and survival of eggs depend highly on local temperature, oxygen, and water (Potter, Davidowitz, & Woods, 2009;Woods, 2010). During this stage, the life table characteristics that can be observed include mortality, development rate, development time, stage duration, and the number of individuals that transit to the next stage, usually the larva stage (Diamond & Kingsolver, 2009).

Larva
The eggs hatch into larvae, which start feeding, and some either become locomotive or remain in a dormant state (Gordh & Headrick, 2001). This is the stage where an insect grows most in size. Some insects have other substages within the larval stage called instars, whereas other species go straight to pupa. Just as with eggs, the life table characteristics observed for larva include mortality, development rate, development time, stage duration, and the number of individuals that transit to the next stage, the pupa stage.

Pupa
When the larvae develop, they move to the pupa stage. In this stage, the insects are known to rest, form their wings or other internal organs, and develop to form adults (Gordh & Headrick, 2001). Life table characteristics observed for pupa include mortality, development rate, development time, stage duration, and the number of individuals that transition to adult.

Adult
Adults constitute the last stage of development. All living organisms have this stage, and it is the end of a generation and also the beginning of another generation through the eggs laid by the female of the species. The life table characteristics observed for adults include mortality, senescence rate, and the life span.
The following assumptions are made about the data collected at each insect stage: (1) the daily survival rate is constant within stages, (2) the duration of a life stage is identical for all individuals that move to the subsequent stage, (3) the unit of observation or time step is a day, and (4) all model parameters are presented in the published article, so that the data reconstruction can be adequately applied.

Algorithm for life table data reconstruction
The life table algorithm for insect population reconstruction presented here adopts an improvised approach to estimate the number of individuals entering a stage during development. The number of eggs at the beginning of the study and the results presented in the published information are key elements for successful reconstruction, as these variables are used as starting values of the algorithm. The approach also requires knowledge of the duration of all stages of the life cycle of the insect. Using the model parameters, the sample size, and the approximation of the life stage duration, we computed the total number of insects expected at the end of each stage of the experiment, which are then distributed using a uniform distribution according to the number of days. The uniform distribution approach was adopted, as it contains values that are between two limits α and β, which, in the developed methodology, refer to the beginning of the stage and the last day of the stage. To assess the performance and accuracy of the algorithm, values analyzed from the reconstructed data were compared to the originally published information (Table 2).
To reconstruct data based on published scientific articles, books, or papers, the procedure is as follows:
(1) Identify the information published and ensure that the prerequisites of reconstruction are obtained.
1. Identify the sample size, n, the starting value. This can be obtained from the 'Materials and methods' section.
2. Identify the models used and their estimated parameters. This can be obtained from the 'Materials and methods' section. The model parameters and stage duration are obtained from the 'Results' section. If parameters are not published, re-estimate the parameters by solving a system of simultaneous equations.
3. Recognize and categorize the assumptions used by the author and the levels of the factors used in the study (e.g. temperature). This is to make sure that the assumptions of the original study fit the assumptions of the presented methodology. This can be obtained from the 'Materials and methods' section.
4. Identify the stage duration. This will be obtained from the 'Results' section.
(2) Identify the initial day of the egg hatch. This can be obtained from books on the species of interest. Let $T_{50}$ be the median number of days.
(3) Enter the identified information in the algorithm, i.e. temperature ranges, the median number of days, $T_{50}$, and the model parameters based on the model reported.
(4) Write the inverse of the model identified.
The reconstruction algorithm was implemented in the R programming language, version 3.2.1 (R Core Team, 2014), with detailed steps to make it easier for the end user to follow and apply.

Application of the developed algorithm on insect life table reconstruction
To illustrate the steps, the algorithm presented was applied to two populations of insects to reconstruct the life tables and the results obtained from the reconstructed life tables were compared with those published. The following sections describe in detail the application of the algorithm.

Data description
The species used for reconstruction are the noctuid lepidopteran stem borer Busseola fusca (Khadioli et al., 2014b) and Chilo partellus (Swinhoe) (Lepidoptera: Crambidae) (Khadioli et al., 2014a) (Table 1). These data were selected because they represent the varied scenarios for the application of the developed algorithm and further satisfy the data assumptions on the published information. Where studies reported the parameters obtained when fitting the various models, these were input directly to the algorithm. For those whose parameters were not provided but whose models were given, we estimated the parameters by solving simultaneous equations, as recommended by Broyden (1965), Haavelmo (1943), and Zellner and Theil (1962). This is possible given the number of parameters to be estimated and values along the line of best fit.

Reconstruction
For both species, the models used in the analysis, and consequently in the reconstruction, are as follows: development time, logit; development rate, Sharpe-DeMichele; mortality, second-order exponential polynomial; and total oviposition, polynomial regression. For development time, the probability that an insect has completed its stage by a fixed day, given the i-th temperature, is used to estimate the median development time. For the development rate, the Sharpe-DeMichele model (Sharpe, Curry, DeMichele, & Cole, 1977) was used (Equation (1)). In Equation (1), the dependent variable $r$ is the development rate (1/day). The independent variable, $T$, is the temperature in °C and is considered a continuous variable in this case. $T_0$ is a constant temperature, usually taken to be 25; $T_L$ is the temperature at which the rate-controlling enzyme is half active and half low-temperature inactive; $H_A$ is the enthalpy of activation of the reaction catalyzed by a rate-controlling enzyme; $\Delta H_L$ is the change in enthalpy associated with low-temperature inactivation of the enzyme; $T_H$ is the temperature at which the rate-controlling enzyme is half active and half high-temperature inactive; $\Delta H_H$ is the change in enthalpy associated with high-temperature inactivation of the enzyme; $R$ is the universal gas constant; and $\rho$ is the developmental rate assuming no enzyme inactivation (Schoolfield, Sharpe, & Magnuson, 1981; Sharpe et al., 1977; Wagner, Wu, Sharpe, Schoolfield, & Coulson, 1984).
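Equation (1) itself is not reproduced in this text. As a hedged sketch, the function below implements the form of the Sharpe-DeMichele model commonly cited from Schoolfield et al. (1981), which matches the parameter definitions above. It assumes that temperatures are converted to Kelvin inside the formula, that enthalpies are in cal/mol with R = 1.987 cal/(K mol), and that $T_L$ and $T_H$ are supplied in Kelvin. The function name and every parameter value are hypothetical and are not taken from Table 1.

```python
import math

R_CAL = 1.987  # universal gas constant, cal / (K * mol); unit choice is an assumption


def sharpe_demichele_rate(temp_c, rho, h_a, t_l, h_l, t_h, h_h, t0_c=25.0):
    """Development rate (1/day) at temp_c (deg C), Schoolfield et al. (1981) form.

    rho : rate at the reference temperature assuming no enzyme inactivation
    h_a : enthalpy of activation (cal/mol)
    t_l, h_l : half-inactivation temperature (K) and enthalpy change, low side
    t_h, h_h : half-inactivation temperature (K) and enthalpy change, high side
    """
    t = temp_c + 273.15   # absolute temperature
    t0 = t0_c + 273.15    # reference temperature (25 deg C by default)
    numerator = rho * (t / t0) * math.exp((h_a / R_CAL) * (1.0 / t0 - 1.0 / t))
    low = math.exp((h_l / R_CAL) * (1.0 / t_l - 1.0 / t))
    high = math.exp((h_h / R_CAL) * (1.0 / t_h - 1.0 / t))
    return numerator / (1.0 + low + high)


# Hypothetical parameter values, for illustration only
params = dict(rho=0.03, h_a=12000.0, t_l=283.15, h_l=-60000.0, t_h=308.15, h_h=100000.0)
rate_15 = sharpe_demichele_rate(15.0, **params)
rate_25 = sharpe_demichele_rate(25.0, **params)
print(rate_15, rate_25)
```

At the reference temperature the numerator reduces to `rho`, so the returned rate is slightly below `rho` whenever the two inactivation terms are small but nonzero.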
The algorithm as applied to re-estimate development rate and development time was as follows: (i) Identify the information published. The methodology involves simplifying Equation (1) to Equation (2), where $RT_i$ is the development rate at the i-th temperature; $Y$, $\rho$, and $V$ are parameters to be estimated; and $T_{max}$ is the maximum temperature for development.
Let $DT_i$ be the median development time at the i-th temperature, which is then given by Equation (3). For example, for C. partellus (larva): $Y = 0.03$, $\rho = 0.17$, $T_{max} = 37.58$, and $V = 5.51$. Substituting these values in Equation (3) of Section 3.2.1 yields Equation (4). Substituting the values of Table 1 in Equation (4) yields the reconstructed $RT_i$ values shown in Table 3. The respective development rate results for each temperature and species are shown in Table B1.

Cohort life table
To reconstruct the cohort life table, the mortality rate was used. If $p$ is the observed mortality rate and $q$ the survival rate, the species mortality rate is given by Equation (5) and survival by Equation (6). Mortality for Busseola fusca (Khadioli et al., 2014b) and Chilo partellus (Swinhoe) (Lepidoptera: Crambidae) (Khadioli et al., 2014a) was estimated from a second-order exponential polynomial, which is the simplified Gompertz-Makeham model (Gompertz, 1825), given as Equation (7),
where $MT_i$ is the mortality rate and $b_1$, $b_2$, and $b_3$ are parameters to be estimated (Gompertz, 1825). The restriction $b_1 + b_2 T_i + b_3 T_i^2 \le 0$ means that $b_1$, $b_2$, and $b_3$ are constrained.
From Equation (6) in Section 3.2.2, Equation (7) can be used to estimate the survival rate, $ST_i$. Let $n_i$ be the sample size for each temperature $T_i$; $DT_i$ the median development days for each temperature $T_i$; $IN_{i1}$ the day the first egg hatches for each temperature $T_i$; $DS_i$ the number that survived daily; and Total $D_i$ the total number of insects that developed for any specific stage. According to the method of Richards et al. (1960), estimation of the total number that survived daily is essential for any life table construction or reconstruction. They state that this number is bounded between two values [a, b], where a is the total number surviving in the last stage and b the number that will have survived by the end of the stage (Richards et al., 1960). This is a uniform distribution with limits [a, b].
Using the uniform distribution with limits [a, b], where a and b are the number of days the stage lasts, the total number of insects that survived the entire stage is then computed, where a is the initial number of days from the last stage and b is the last day of each stage.
Given the daily number that survive over the period, $DS_i$, the number that survived per day is then obtained. Example: for i = 25, the parameters obtained in the previous example are $b_1 = 6.27$, $b_2 = -0.59$, and $b_3 = 0.01$, with $T_i = 25$, $n_i = 165$, $DT_i = 33$, and $IN_i = 10$. Simulating 33 random numbers from the uniform distribution that lie between 0 and 165, with seed set to 879 2, the number that survives each of the 33 days = 147. Equation (13) yields estimates of the number that survived each day as shown in Table 4. The respective mortality table results for each temperature and species are shown in Table C1.
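The figure of 147 in the worked example can be checked with a short sketch. It assumes that the second-order exponential polynomial has the form MT = exp(b1 + b2 T + b3 T^2) and that survival is 1 - MT; both forms are inferred from the description and the example, since the equations are not reproduced in this text. The allocation of survivors to days is only one possible reading of the uniform distribution step (each survivor is assigned a day drawn uniformly between the first hatch day and the median development day). It is not the authors' R code, the function names are hypothetical, and because Python's random number generator differs from R's, seed 879 will not reproduce Table 4.

```python
import math
import random


def mortality_rate(temp, b1, b2, b3):
    """Assumed second-order exponential polynomial: exp(b1 + b2*T + b3*T^2)."""
    return math.exp(b1 + b2 * temp + b3 * temp ** 2)


def allocate_uniform(total, first_day, last_day, seed):
    """Assign each of `total` individuals to a day drawn uniformly from
    the integer days first_day..last_day (inclusive); return counts per day."""
    rng = random.Random(seed)
    counts = {day: 0 for day in range(first_day, last_day + 1)}
    for _ in range(total):
        counts[rng.randint(first_day, last_day)] += 1
    return counts


# Values from the worked example: T = 25, n = 165, DT = 33, IN = 10
mt_25 = mortality_rate(25, 6.27, -0.59, 0.01)
survivors = round(165 * (1 - mt_25))   # assumed survival rate is 1 - mortality
per_day = allocate_uniform(survivors, first_day=10, last_day=33, seed=879)

print(f"mortality rate at 25: {mt_25:.4f}")
print(f"survivors out of 165: {survivors}")
```

Under these assumptions the mortality rate at 25 is about 0.108, so about 147 of the 165 individuals survive the stage, which agrees with the value reported in the example.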
In cases where $b_1$, $b_2$, and $b_3$ are not provided, but specific $MT_i$ have been reported, $b_1$, $b_2$, and $b_3$ can be estimated by solving a system of three simultaneous equations in three unknowns based on the reported values.
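A minimal sketch of this step follows, assuming the mortality model has the form MT = exp(b1 + b2 T + b3 T^2). Under that assumption, taking logarithms of three reported mortality rates gives three linear equations in b1, b2, and b3. The function names are hypothetical, and the three "reported" rates are synthetic values generated from the example parameters at illustrative temperatures, not published data.

```python
import math


def solve_3x3(a, y):
    """Solve the 3x3 linear system a * x = y by Gaussian elimination
    with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, y)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mortality_parameters(temps, rates):
    """Recover (b1, b2, b3) from three (temperature, mortality rate) pairs,
    using log(MT) = b1 + b2*T + b3*T^2 (assumed model form)."""
    a = [[1.0, t, t * t] for t in temps]
    y = [math.log(r) for r in rates]
    return solve_3x3(a, y)


# Synthetic check: generate rates from known parameters, then recover them
temps = [18.0, 25.0, 35.0]
true_b = (6.27, -0.59, 0.01)
rates = [math.exp(true_b[0] + true_b[1] * t + true_b[2] * t * t) for t in temps]
b1, b2, b3 = fit_mortality_parameters(temps, rates)
print(b1, b2, b3)
```

With exactly three reported temperatures the system has a unique solution as long as the temperatures are distinct; with more than three, a least-squares fit would be needed instead.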

Reproduction
The oviposition data are recorded as described above for life table data. The number of eggs oviposited should be retrieved for the cohort of females included in the experiment. The total oviposition represents the expected total number of eggs laid per female during her whole life span, and it is expressed as a function of temperature. This relationship is modeled with a nonlinear function, just as in mortality.
The algorithm was also applied to reconstruct the female oviposition file. Mathematically, the methodology modeled total oviposition using the quadratic model, where $b_1$, $b_2$, and $b_3$ are parameters to be estimated and $FT_i$ is the average number of eggs laid at temperature $T_i$. Demographically, this is the gross reproduction rate (Deevey, 1947).
Let $FD_i$ be the total number of eggs laid by all females and $Fn_i$ the number of females in the experiment for each temperature $T_i$.
The total number of eggs, $FD_i$, is then computed, and Microsoft Excel was used to distribute the total number of eggs laid, $FD_i$, among the $Fn_i$ females over the median female development days, $FDT_i$.
In cases where $b_1$, $b_2$, and $b_3$ are not provided, but specific $FT_i$ have been reported, $b_1$, $b_2$, and $b_3$ can be estimated by solving a system of three simultaneous equations in three unknowns. For C. partellus, $b_1$, $b_2$, and $b_3$ were not provided, but $FT_{18}$, $FT_{25}$, and $FT_{35}$ were provided, and the parameters were estimated using the set of Equations (17). Therefore, for $T_i = 25$, $Fn_i = 43$, and $FDT_i = 10$, distributing the 16138 eggs over 43 females in 10 days yields the oviposition Table 5. The respective oviposition results for each temperature and species are shown in Table D1.
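The text does not state the rule used in Excel to spread the eggs over females and days. The sketch below therefore shows only one simple possibility, an even split with the remainder assigned one egg at a time, applied to the figures of the example (16138 eggs, 43 females, 10 days). The function name is hypothetical and the resulting table is not expected to match Table 5.

```python
def distribute_evenly(total, n_females, n_days):
    """Split `total` eggs over an n_females x n_days grid as evenly as
    possible. Every cell gets the integer quotient; the remainder is
    handed out one egg per cell, filling cells in row-major order."""
    cells = n_females * n_days
    base, extra = divmod(total, cells)
    table = []
    filled = 0
    for _ in range(n_females):
        row = []
        for _ in range(n_days):
            row.append(base + (1 if filled < extra else 0))
            filled += 1
        table.append(row)
    return table


eggs = distribute_evenly(16138, n_females=43, n_days=10)
grand_total = sum(sum(row) for row in eggs)
print(grand_total, eggs[0], eggs[-1])
```

An even split preserves the total and the per-female mean, but it removes all variation between females and days, so any analysis that depends on that variation would need a different allocation rule.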

Accuracy tests
The reconstructed data were analyzed using the ILCYM software (Tonnang et al., 2013), which was the same software used by the original authors to model the original insect behavior. The output from the reconstructed data was then compared with the published data. The Shapiro-Wilk normality test was conducted, and the results showed that the p-value was greater than the alpha level of 0.05; thus, the null hypothesis that the data came from a normally distributed population could not be rejected. The two-sided Aspin-Welch-Satterthwaite two-sample t-test (Welch, 1937) was used to test whether there are any statistically significant differences between the results obtained from the reconstructed data and the published results for development time, development rate, mortality, and total oviposition for all immature life stages (Table 2). The test assumes that the two populations being tested are normally distributed with unequal variances, and it tests the hypothesis that the true difference in means between the published and reconstructed estimates is equal to zero.
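For readers unfamiliar with the test, the sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom for two samples. The function name and the two samples are hypothetical; they are not values from Table 2 or Table 6. The p-value is omitted because it requires the cumulative distribution function of the t distribution, which is usually taken from a statistics library.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's two-sample t-test with unequal variances.

    t  = (mean(x) - mean(y)) / sqrt(s_x^2/n_x + s_y^2/n_y)
    df = Welch-Satterthwaite approximation
    """
    nx, ny = len(x), len(y)
    vx = variance(x) / nx   # sample variance of x divided by its sample size
    vy = variance(y) / ny
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Hypothetical published and reconstructed estimates (illustration only)
published = [1.0, 2.0, 3.0, 4.0, 5.0]
reconstructed = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, df = welch_t(published, reconstructed)
print(f"t = {t_stat:.4f}, df = {df:.3f}")
```

A two-sided test at the 0.05 level rejects equality of means when the absolute value of t exceeds the 0.975 quantile of the t distribution with the computed degrees of freedom.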

Results
Results from the two-sided two-sample t-test show that none of the C. partellus results were significantly different, but for B. fusca, pupa development time and larva and pupa development rate were significantly different at the 95% confidence level (Table 6).

Discussion
The algorithm as described in Section 2 yielded accurate estimates, as shown in Table 6. However, the algorithm assumed that (1) the data sets analyzed in the published paper are from laboratory-reared cohorts, so the data used for publication had little or no error from external factors and are only affected by the variables under study; (2) the interval of measurement in the original data is 1 day (information regarding this assumption is indicated in the 'Materials and methods' part of the published paper); (3) any censoring is interval censoring, with the interval limits retrieved from the previous row; (4) all eggs laid were fertile, and failure to hatch was due to natural mortality and not accidental damage during handling; (5) mortality rates in all stages were equally applicable to males and females; and (6) individuals of a species are reared at a series of constant temperatures.
From our two examples, it is apparent that it is possible to re-estimate development rate and development time, to reconstruct a species' life table, and to reconstruct the female file based on published information, without having to set up the whole experiment. The differences between B. fusca's data and the corresponding reconstructed data could be attributed either to chance or to inaccuracies in the reported results. The algorithm proposed in this paper is simple and flexible yet realistic. It has few technical requirements. However, the use of the algorithm is limited to laboratory-reared species. This is because species in the natural environment are dynamic and the factors that affect their development, survivorship, and ability to lay eggs are varied, whereas in the laboratory, the researcher can limit and control for these factors.
The data in many reports and published papers, although obtained and presented properly for their original purposes, need to meet the stated assumptions for the reconstruction to be successful. Many articles present part or all of the data graphically. This becomes a challenge in the reconstruction process, unless the models are listed in the 'Materials and methods' section of the legacy information. It is also a challenge for the algorithm when it is unable to re-estimate some of the parameters, which results in computational difficulties (NAs) and makes the reconstruction of some values impossible. During initial versions of the algorithm, we attempted distributing the total number at the end of the experiment using the Fibonacci sequence (Falcon & Plaza, 2007; Falcón & Plaza, 2007; Horadam, 1961; Zhang, 1997), using the theory in Er (1984) on sums of Fibonacci numbers by matrix methods. This methodology, however, yielded an output with values that were skewed to the left, implying a significant percentage of zeros in the first few days of the larva and pupa stages of both insects. This became problematic when the tool used for verification of the data produced, ILCYM, failed to converge. This could be attributed to the fact that the tool uses maximum likelihood estimation to estimate the parameters. As a result, we had to take another approach to the reconstruction of the life table data and in the end used the uniform distribution. The algorithm is also limited when key data such as sample sizes and the total number of the species are not reported. This limitation made it impossible to reconstruct the life tables of the mealybug, Phenacoccus solenopsis Tinsley (Hemiptera: Pseudococcidae) (Fand, Tonnang, Kumar, Kamble, & Bal, 2014), the noctuid lepidopteran stem borer, Sesamia calamistis, and the potato tuberworm, Phthorimaea operculella Zeller (Sporleder, Kroschel, Quispe, & Lagnaoui, 2004).
The algorithms we developed here are therefore recommended for any user who is limited in resources or time and needs to develop a comprehensive life table based on published data for studies in which temperature variations are of interest.
The algorithms discussed in this paper were specifically generated for the complete life table and made use of these parameters. This opens up further areas to be explored, to find out whether the same algorithms can be used for other types of life tables. This research also points to a gap that future researchers can address by establishing whether these and other algorithms can be used for natural populations, and whether the same algorithms hold for factors other than temperature. Where parameters are missing, constructing and solving the system of simultaneous equations guides the estimation.
Computer code written in the R programming language is available as an R package (pending publication of the package) and from the author's GitHub account (https://github.com/DeeKareithi/Insect-LifeTable-Reconstruction-ILTR) to carry out the calculations described in this paper.
