A novel nonparametric mixture model for the detection pattern of COVID-19 on Diamond Princess cruise

The outbreak of COVID-19 on the Diamond Princess cruise ship has attracted much attention. Motivated by the PCR testing data on the Diamond Princess, we propose a novel cure mixture nonparametric model to investigate the detection pattern. It combines a logistic regression for the probability of susceptible subjects with a nonparametric distribution for the detection of infected individuals. Maximum likelihood estimators are proposed. The resulting estimators are shown to be consistent and asymptotically normal. Simulation studies demonstrate that the proposed approach is appropriate for practical use. Finally, we apply the proposed method to PCR testing data on the Diamond Princess to show its practical utility.


Introduction
The epidemic of the novel coronavirus disease  outbroke in December 2019 in Wuhan, China. Since its outbreak, the epidemic has progressed rapidly and has emerged in more than two hundred countries. It has become an unprecedented global epidemic crisis. The transmissibility patterns in open spaces like households, offices and public places are quite different from those in confined spaces such as aeroplanes, trains and cruise ships. Among the outbreaks of COVID-19 all over the world, one of the most well-known eruptions is the one on the Diamond Princess cruise ship. The high contagiousness of COVID-19 on the Diamond Princess cruise has attracted much attention (Mizumoto et al., 2020;Sekizuka et al., 2020;Zhang et al., 2020). Its speciality can be observed through Johns Hopkins' daily released confirmed cases over the world, where the infected number of cases on Diamond Princess ship is reported separately adjacent to that of Japan.
The Diamond Princess cruise ship started on January 20, 2020 in Yokohama, Japan, visited five places including Hong Kong and returned Yokohama on February 3 (Sekizuka et al., 2020). During this period, an 80-year-old passenger who disembarked on January 25 in Hong Kong, was confirmed for COVID-19 on February 1. After the disembarkation of Diamond Princess at Yokohama, Japanese government asked 3711 individuals, including 2666 passengers and 1045 crew members, to stay onboard to carry out a 14-day quarantine period from February 5 to February 19. The health status of all individuals on board was investigated, making daily time series of PCR testing data, including number of tests and number of patients testing positive each day, publicly available (Mizumoto et al., 2020). Table 1 reports the daily time series data with number of tested individuals and individuals with positive results.
The vessel with confined spaces offered a rare opportunity to understand features of the COVID-19 that are otherwise hard to investigate. This is different from studying the spread in a wider population, where only some people, typically with severe symptoms, are tested and monitored. Closed confines like cruise ship are an ideal place to study how COVID-19 behaves, since almost the whole population and the PCR testing result for everyone are known (Mallapaty, 2020). Testing almost all of the passengers and crews helps us to understand a key blind spot in COVID-19 outbreaks. The comprehensive key information on the Diamond Princess allows us to investigate the infection patterns, including infections with no symptoms. Outbreaks seed easily on cruise ships because of the close environments and high proportions of older people, who tend to be more vulnerable to the disease. Since the Diamond Princess, at least 25 other such vessels and aircraft carriers have confirmed a high number of COVID-19 cases (Mallapaty, 2020).
Hence, studying the extent of transmission of COVID-19 in encompassed spaces like Diamond Princess cruise is of great importance to understand the disease progression and to manage the epidemic. It has major implications Table 1. Number of tests and number of individuals testing positive for passengers and crews on the Diamond Princess cruise ship, Yokohama, Japan, February 2020 (n = 3711).

Reported
Number of  Number of individuals  The cumulative number of  date  tests  testing positive  individuals testing positive   Feb 5  31  10  10  Feb 6  71  10  20  Feb 7  171  41  61  Feb 8  6  3  64  Feb 9  57  6  70  Feb 10  103  65  135  Feb 12  53  39  174  Feb 13  221  44  218  Feb 15  217  67  285  Feb 16  289  70  355  Feb 17  504  99  for controlling and anticipating the trajectory and impact of the pandemic. Precise knowledge of the infection distribution is crucial for the prevention and control of these diseases. Correct understanding of the virus transmission pattern might give some guidelines when designing the passenger cabins and making the cruise travel more safe in the future. The available data on the Diamond Princess have unique features because at the beginning, the upper-respiratory specimens were collected from symptomatic individuals and their close contacts for PCR testing. Starting from February 11, due to the expansion of laboratory capacity, quarantine officers systematically collected respiratory specimens from all passengers by age group, starting with those aged 80 years and older as well as individuals with comorbidities, such as diabetes or a heart condition. This means that a non-random sampling was implemented in the selection for PCR test. In addition, the individual data are not observed. The only available data are the aggregated data, which provides weak information and brings difficulty in statistical inference.
Taking the feature of selection bias and the incomplete aggregated data into account, the main purpose of this paper is to propose a novel mixture model to fully characterize the data structure. We introduce a cure mixture model that combines the nonparametric distribution for the detection time with logistic regression modelling the cure fraction, where the detection time is defined as the time the infected individual begins to be detected by PCR test. The maximum likelihood approach is introduced to jointly estimate the probabilities in nonparametric infection distribution and parameters in logistic regression. The proposed model can also estimate the distribution of detection time and total numbers of infection that can be detected after 14 days of quarantine based on PCR test data performed on the Diamond Princess cruise.
The rest of this paper is organized as follows. Section 2 introduces the proposed cure mixture model and the maximum likelihood estimation approach. The large sample properties, including consistency and asymptotic normality, of the proposed estimator are given in Section 3. Finite sample performances of the proposed estimator are investigated via simulation studies in Section 4. In Section 5, we apply the proposed method to the Diamond Princess cruise ship PCR testing data to illustrate its practical utility. Finally, some remarks are concluded in Section 6. All the technical proofs are relegated to the Appendix.

Model
The COVID-19 data collected on the diamond princess cruise were very limited. All information was summarized in Table 1. The information only includes the number of tests and number of individuals testing positive each day during the quarantine. There were 3711 individuals, including 2666 passengers and 1045 crew members, on the cruise. However, according to Table 1, the total number of tests is 3063. Therefore, we assume each individual was only tested once and the sensitivity of the test was 100%. We will discuss the limitation in Section 6.
Suppose there are n * subjects (including passengers and crew members) on the Diamond Princess cruise ship who have experienced a quarantine period that lasted 14 days. Each day a number of subjects were chosen for PCR testing. Let x i and y i be the number of testing positive cases and number of tests at day i, respectively. This means, n = 14 i=1 y i (n n * ) subjects have PCR testing results, but n * − n individuals do not have.
Denote the detection time as the time the infected individual begins to be detected by PCR test. Let ξ ij be the detection time of the j-th subject who was tested at day i, j = 1, 2, . . . , y i , i = 1, 2, . . . , 14. Let G(x) be the cumulative distribution function of detection time ξ calculated from February 4. Therefore G i = G(i) is the probability of the detection time occurring before day i starting from February 4. That is, G i is the probability that an infected individual can be detected by PCR test at day i. For example, G 1 represents the probability of testing positive on February 5. According to the non-decreasing property of the distribution function, G i should satisfy the constraint G 1 G 2 · · · G 14 .
In the real data, instead of observing the exact detection time ξ ij , we observe the number of testing positive individuals x i , which is equal to If there is no selection bias, it is a standard current status data problem discussed extensively in the statistical literature, for example, (Sun, 2006). The nonparametric likelihood method can be used directly to estimate G i . The observed likelihood function is However, Figure 1 shows that the observed frequency and estimated probability on each day have a large discrepancy, especially in the first week. This demonstrates that the random selection process was violated. Next, we provide a novel mixture modelling strategy that fully utilizes the non-random sampling. Before the quarantine, people were unaware of the existence of virus on the Diamond Princess cruise ship. The cruise had shows and dance parties and opened public facilities that attracted large crowds, including fitness clubs, theatres, casinos, bars and buffet-style restaurants. During this period, all passengers and crew members were susceptible to the COVID-19. After the quarantine, people gradually realized the high contagiosity of the virus and began to take actions to avoid the infection. The anti-epidemic measures became stricter as time went on. Passengers with confirmed cases were reported to be taken ashore for treatment. Some individuals even left the cruise in advance. Therefore, it is reasonable to assume some individuals were insusceptible at this stage. Taking the above facts and the incubation period into account, we suppose the detection patterns of the first week and second week of the quarantine period are different. We divide subjects into susceptible and insusceptible individuals and suppose the two weeks have different compositions. All the people in the first week are susceptible, and the proportion of insusceptible individuals in the second week grows.
We assume a cure mixture model (Farewell, 1982) for the detection time ξ ij . Specifically, the mixture modelling of the cure rate assumes a decomposition of the detection time, where ξ * ij < ∞ denotes the detection time of a susceptible subject, and η ij indicates, by the value 1 or 0, whether the sampled subject is susceptible or not.
It is worthy notifying that the observed data of most infectious diseases, including COVID-19 on the Diamond Prince cruise ship, are aggregated data. The individual data are unavailable. Therefore, models on specific subjects are impossible to be identified. We have the aggregated data on each testing day. The detection results of each day should follow different patterns, because the anti-epidemic measures became stricter and the insusceptible proportions increased as time went on. Thus, we assume the susceptible proportion and the distribution function of detection time depend only on testing day i. Let Pr(η ij = 1) = λ i , the proportion of susceptible patients among tested subjects at day i. At each day i, we suppose that tested individuals are a mixture of a proportion of λ i susceptible individuals who eventually get infected and a proportion of 1 − λ i who are not susceptible to COVID-19 and will never get infected. Let F(x) be the distribution function of detection time of a susceptible subject, that is, For notation simplicity, we write F i . = F(i). F i is the probability that a susceptible infected individual can be detected by PCR test on day i. The proposed model is useful since a proportion of tested subjects will never be infected by COVID-19. This model is like survival models with cure rate, which have arisen in many disciplines (e.g. biomedical sciences, economics, sociology, engineering science, etc) and have received much attention (Lu & Ying, 2004;Wang et al., 2020).
Motivated by the priority in choosing symptomatic or high-risk groups, all chosen people in the first week were likely to be infected and detected, and the susceptible probabilities maintained a high level nearly 1. Since symptomatic and vulnerable individuals were tested first and some sick individuals disembarked at the end of first week, it is expected that the proportions of non-susceptible individuals became larger as time went by. In other words, starting from the second week, λ i , i = 8, 9, . . . , 14, decrease. We suppose the mixture proportion λ i varies across i in the logistic form to add model flexibility. Specifically, we assume λ i = 1, 1 i 7, and with unknown parameters θ 1 and θ 2 . It is easy to see that suspectable proportions in the first week are supposed to be the same, and the suspectable probabilities λ i , 8 i 14 in the second week have a logistic regression form and decrease as i increases. Different forms of λ i are designed to account for the data collection difference between the two weeks. Under the proposed model, the true detected number during the quarantine period, N, is In summary, we formulate a cure model by assuming that the underlying population on the Diamond Princess cruise ship is a mixture of susceptible and non-susceptible subjects. All susceptible subjects are vulnerable to be infected and detected by COVID-19, while the nonsusceptible ones are never infected and detected. Thus, we model separately the detection distribution for susceptible individuals and the fraction of nonsusceptible ones.

Estimation
The proposed mixture model uses a nonparametric approach to estimate the detection distribution F and a parametric approach to estimate the suspectable proportion λ. Since Pr(δ ij = 1) = λ i F i , the conditional expectation of x i is E(x i ) = y i λ i F i . The observed data are summarized as {(x i , y i ); i = 1, . . . , 14}, which are constituted by n = 14 i=1 y i independent and identically distributed random replications. The observed likelihood is then written as We view F as a piecewise constant non-decreasing nonparametric function that only jumps at i = 1, . . . , 14. So far we have 16 unknown parameters F i , i = 1, . . . , 14, θ 1 and θ 2 but only have 14 pairs of observed data. To ensure identifiability, we impose the constraints F 2 = (F 1 + F 3 )/2 and F 13 = F 12 . Then The maximum likelihood estimators (MLEs) (F i ,θ k ) are derived by maximizing n (F, θ), that is, Then, we can estimate N byN Remark 2.1: We only have 14 days aggregated data, but 28 unknown parameters, λ i , F i , i = 1, . . . , 14. To overcome the non-identifiability, we carefully account for the data characteristics, estimate F i nonparametrically with two constraints, and impose a parametric model on λ i . Logistic regression is the most common model for the proportion. The logistic model provides an approximation for the susceptible proportion. We set one regression coefficient as negative to describe the decreasing trend. One may also use other parametric models, for example, the probit model. We have fitted the probit model λ i = (θ 5 + θ 6 i), 8 i 14, θ 6 < 0 to the Diamond Princess cruise ship data and found the estimated number of cumulative infection cases was 1072, which is almost the same with the estimated number 1074 using the logistic model. This shows the robustness and rationality of the imposed assumptions. If we impose strict assumptions on F i and constraints on some λ i , one may also estimate λ i nonparametrically.

Remark 2.2:
The two constraints F 2 = (F 1 + F 3 )/2 and F 13 = F 12 are imposed according to the preliminary data analysis. We can impose other alternative constraints. For example, we assume a Weibull distribution for F which is commonly used in epidemical modelling. Suppose F i = F(i) = 1 − exp{−(θ 4 i) θ 3 }, i = 1, 2, . . . , 14. Maximum likelihood is used for parameter estimation. Applying this parametric approach to the Diamond Princess cruise data, the estimated total infected number at the end of quarantine is 1036, which is quite close to the estimated number using the proposed nonparametric method. This shows the robustness and rationality of the imposed constraints.
We impose the following regularity conditions.
(C1) The parameter space B is a bounded and closed subset of (0, 1) 12 × R × R − . Based on the theoretical results forβ established in Theorems 3.1 and 3.2, we can follow the delta method to easily get the asymptotic properties ofF 2 andF 13 .

Simulation studies
In this section, we conduct simulation studies to assess the finite sample performance of the proposed method. We generate data to mimic the PCR testing data on Diamond Princess cruise ship. Specifically, suppose the total number of people (include passengers and crew members) on the cruise for quarantine is n * = 3711. There are 14 pairs 31,71,171,6,57,103,53,221,217,289,504,681,607,52) is consisted by the number of total tests each day. The susceptible probabilities with θ 1 = −0.518, θ 2 = −0.232. Given y i , x i is generated from Binomial distribution B(y i , λ i F i ) with success probability λ i F i , where F i , 1 i 14 are set as F = (F 1 , . . . , F 14 ) = (0. 208, 0.208, 0.208, 0.208, 0.208, 0.631, 0.696, 0.696, 0.850, 0.900, 0.950, 0.950, 1.000, 1.000) . Under this configuration, the true number of infections is N = 1042. We simulate 500 datasets and use bootstrap to derive estimated standard errors and confidence intervals of the unknown parameters. 100 bootstrap samples are generated based on the nonparametric mixture model with estimated parameters. The 95% confidence intervals (CI) are derived through normal approximation, where the estimated standard errors are calculated as the standard deviation of bootstrap sample estimators. Table 1 summarizes the simulation results, where the true parameters, the empirical biases (Bias), the empirical standard deviations (SD), the estimated standard errors (SE) and the empirical coverage probabilities (CP) are given. We can conclude that the empirical biases are negligible, the empirical standard deviations and estimated standard errors match each other quite well, and the empirical coverage probabilities are close to the nominal one 95%.

Real data analysis
In this Section, we apply the proposed method to the Diamond Princess cruise data to show its practical utility. We use the time series daily report PCR testing data, including the number of tests and number of patients testing positive each day during the quarantine period, to estimate the distribution of detection time, and the varying proportions of susceptible individuals, along with the total number of infections that can be detected. The data are publicly available, for example, in Table 1 of Mizumoto et al. (2020).  In the real data, 14 pairs of (x i , y i ) are observed. As stated in Section 2, we impose the constraints F 2 = (F 1 + F 3 )/2 and F 13 = F 12 to solve the potential identifiability problem. Under the proposed nonparametric mixture modelling framework, the maximum likelihood estimators (MLEs) are derived by maximizing the joint log-likelihood with time series daily report data. The nlm() function in R is employed to estimate F i , 1 i 14, i = 2, 13, θ 1 , and θ 2 . LetF 2 = (F 1 +F 3 )/2 andF 13 =F 12 . The proportion of susceptible individuals incorporated in the PCR test at day i,λ i , can be derived from the logistic regression form with estimatedθ 1 andθ 2 . We use the estimated parameters to simulate bootstrap samples, which include time series daily data of number of tests and confirmed cases. 200 bootstrap samples are generated to estimate the standard errors, and the confidence intervals are based on normal approximation. Table 3 lists the estimatorsF i , 1 i 14,θ i , i = 1, 2,N,λ i , 8 i 14, the corresponding estimated standard errors and confidence intervals. The estimated F 1 is about 0.2 and F 9 is close to 1 (Table 3), which means that about 20% of susceptible individuals will be detected at the beginning of quarantine. And 9 days later, all the susceptible individuals on the board will be detected. Figure 2 presents the observed detection rates x i /y i (scatter points), along with the fitted detection rates λ i F i (black solid line). We use different colours and different symbols to demonstrate x i /y i with y i > 100 or y i < 100, respectively. For example, red circles represent scatter points x i /y i with y i > 100, while blue squares describe scatter points x i /y i with y i < 100. Figure 2 suggests that the estimated nonparametric distribution F i and the parametric susceptible proportion λ i characterize the pattern of detection quite well. This shows the plausibility of the assumption that λ i decreases with i in the logistic regression form.
In contrast to the officially reported 634 individuals with PCR-positive results after the 14 days quarantine, which as of April 27, 2020 had increased to 712 as released by the Johns Hopkins University, we conclude that the estimated total number should be 1064. Zhang et al. (2020) used a completely different method to estimate the reproductive number (R0) of the novel virus in the early stage of the outbreak and estimate the cumulative cases on the ship. They estimated the cumulative cases as 1514 (1384-1656) if the R0 value remained 2.28 as the early stage on the ship. If R0 value was reduced by 25% and 50%, the estimated total number of cumulative cases would be reduced to 1081 (981-1177) and 758 (697-817), respectively. A great deal of the transmission on the ship had occurred before the quarantine when people were even not notified about the virus. As the containment measures became stricter, it is expected that the R0 value reduced. We estimated the total number as 1064 (984-1144), which is almost in accordance with the number when the R0 value was reduced by 25%.

Concluding remarks
In this paper, motivated by the real PCR testing data on the Diamond Princess cruise ship, we propose a novel mixture model to estimate the distribution of detection time among susceptible subjects and the susceptible proportion among tested people each day. As a by-product, the total number that can be detected after the quarantine period is estimated as 1064, which means that 42.5% of infected cases were undetected on the cruise. The estimated number 1064 is larger than the released 712. The discrepancy might be caused by the false-negative result of the PCR test (Kucirka et al., 2020) or the occurrence of infection after the test. Some asymptomatic cases may be missed due to the imperfect sensitivity of the PCR test, and they had the high transmissibility. We conclude that COVID-19 spread in the cruise ship is easier and faster than in open spaces. Strict containment efforts should be scaled up prior to local outbreak.
Like all medical papers, we have to acknowledge the possible weakness in our approach. The COVID-19 data collected on the Diamond Princess cruise were very limited. All information was summarized in Table 1. We assume that each selected individual was tested by PCR only once and assume that the sensitivity of the test was 100%. This might be not true because small proportion of individuals may be tested twice or more, and there may be false positives. Our method should be modified if additional relevant testing information was available. Nevertheless, we believe our approach has at least reduced the possible bias in the data collection process, though our solution may not be a perfect one. We would be happy to read other innovative approaches from other authors in the future. During the outbreak of a pandemic, it would be useful to make quick statistical inference based on very limited information, though, it may not be a very accurate one. Besides, in the second week of the quarantine, the number of symptomatic and asymptomatic patients testing positive was publicly available. However, we did not take these information into consideration. Incorporating such data in statistical modelling warrants future research.