Weakly unsupervised conditional generative adversarial network for image-based prognostic prediction for COVID-19 patients based on chest CT

Because of the rapid spread and wide range of the clinical manifestations of the coronavirus disease 2019 (COVID-19), fast and accurate estimation of the disease progression and mortality is vital for the management of the patients. Currently available image-based prognostic predictors for patients with COVID-19 are largely limited to semi-automated schemes with manually designed features and supervised learning, and the survival analysis is largely limited to logistic regression. We developed a weakly unsupervised conditional generative adversarial network, called pix2surv, which can be trained to estimate the time-to-event information for survival analysis directly from the chest computed tomography (CT) images of a patient. We show that the performance of pix2surv based on CT images significantly outperforms those of existing laboratory tests and image-based visual and quantitative predictors in estimating the disease progression and mortality of COVID-19 patients. Thus, pix2surv is a promising approach for performing image-based prognostic predictions.


Introduction
The rapid global spread of the coronavirus disease 2019 (COVID-19) has placed major pressure on healthcare services worldwide.
During 2020, over 70 million COVID-19 infections and over 1.6 million deaths due to COVID-19 were reported worldwide (WHO, 2020). Because of the wide range of the clinical manifestations of COVID-19, fast and accurate estimation of the disease progression and mortality is vital for the management of patients with COVID-19.
Chest computed tomography (CT) is the most sensitive chest imaging method for COVID-19 (Harmon et al., 2020; Mei et al., 2020). Recently, several computer-assisted image-based predictors have been reported for the prognostic prediction of COVID-19 patients based on chest CT images. The basic idea of these predictors has been to extract various features from CT images and to subject these features to a classifier (logistic regression) or a survival prediction model. Most studies have used a small number of well-understood, manually defined size, shape, or texture features extracted from regions of interest, such as those of segmented ground-glass opacities, semi-consolidation, and consolidation (Colombi et al., 2020; Huang et al., 2020; Lanza et al., 2020; Liu et al., 2020; Matos et al., 2020; Wang et al., 2020; Yu et al., 2020; Zhang et al., 2020). Other studies performed a radiomic analysis by extracting a large number of radiomic features from a segmented complete lung region, followed by feature selection to determine a manageable set of key features (Homayounieh et al., 2020; Wu et al., 2020). After the calculation of the prognostic features, the prognostic prediction has usually been performed by use of logistic regression, which limits the analysis to a binary prediction of the disease severity or survival at a specific time point (Colombi et al., 2020; Homayounieh et al., 2020; Lanza et al., 2020; M. D. Li et al., 2020; Matos et al., 2020; Xiao et al., 2020). Instead of logistic regression, some methods performed a traditional survival analysis by subjecting the features to a Cox regression analysis for the calculation of the time-to-event information that is needed to perform a complete survival analysis for clinical tasks (Francone et al., 2020; Wu et al., 2020; Zhang et al., 2020).
These previous studies had several limitations. Semi-automated quantification of CT images requires manual guidance and suffers from inter- and intra-observer variability. Extraction of features from segmented regions of interest is vulnerable to segmentation errors and can exclude important information from non-segmented lung regions, and manually or mathematically defined features may not be ideal for the construction of optimal prognostic predictors. Furthermore, few studies have made use of traditional survival analysis, which enables the calculation of the survival probability at any time point for performing important clinical tasks, such as the calculation of survival curves. Finally, the role of deep learning in these methods has been limited to the segmentation of CT images, and such deep-learning models were trained with a supervised learning approach that requires an annotated training dataset. Therefore, there is an unmet need for an objective survival analysis system that would automatically extract, select, and combine image-based features for the calculation of complete time-to-event information for optimally predicting the prognosis of patients with COVID-19, without the laborious and expensive annotation effort required by supervised learning schemes.
Recently, an adversarial time-to-event model based on a conditional generative adversarial network (GAN) has been shown to be able to generate predictions for the survival analysis of epidemiologic data at a higher accuracy than those of traditional survival methods ( Chapfuwa et al., 2018 ). The model focused on the estimation of time-to-event distribution rather than event ordering, and it also accounted for missing values, high-dimensional data, and censored events. Such an approach has the advantage that the distribution of the survival time can be estimated directly from the input predictor. However, to the best of our knowledge, no conditional GAN-based methods have been proposed for estimating the survival time directly from images.
In this study, we developed a weakly unsupervised conditional GAN, called pix2surv, which enables the estimation of the distribution of the survival time directly from the chest CT images of patients. The model avoids the technical limitations of the previous image-based COVID-19 predictors discussed above, because the use of a fully automated conditional GAN makes it possible to train a complete image-based end-to-end survival analysis model for producing the time-to-event distribution directly from input chest CT images without an explicit segmentation or feature extraction efforts. Also, because of the use of weakly unsupervised learning, the annotation effort is reduced to the pairing of input training CT images with the corresponding observed survival time of the patient.
We show that the prognostic performance of pix2surv based on CT images compares favorably with those of existing laboratory test results computed by the traditional Cox proportional hazard model (Cox, 1972) and those of image-based visual and quantitative predictors in estimating the disease progression and mortality of patients with COVID-19. We also show that the time-to-event information calculated by pix2surv based on CT images enables stratification of the patients into low- and high-risk groups with a wider separation than do those of the other predictors. Thus, pix2surv is a promising approach for performing image-based prognostic prediction for the management of patients.

Background
The intent of a time-to-event model is to perform a statistical characterization of the future behavior of a subject in terms of a risk score or time-to-event distribution. A time-to-event dataset can be formulated as $\mathcal{D} = \{(x_i, t_i, l_i)\}_{i=1}^{N}$, where $x_i = [x_{i1}, \ldots, x_{ip}]$ are the predictors, $t_i$ is a time-to-event of interest, $l_i$ is a binary censoring indicator, and $N$ is the size of the dataset. A value of $l_i = 1$ indicates that the event is observed, whereas a value of $l_i = 0$ indicates censoring at $t_i$.
Let $T$ denote a continuous random variable (or survival time) with a cumulative distribution function $F(t)$. The survivor function of $T$ is defined as the fraction of the population that survives longer than some time $t$ (Kleinbaum and Klein, 2012),

$$S(t) = P(T > t) = 1 - F(t) = \exp\!\left(-\int_0^t h(u)\,du\right), \tag{1}$$

where $h(t)$ is a hazard function that describes the rate of the occurrence of an event over time. Given a set of predictors, $x$, the relationship between the corresponding conditional hazard and conditional survival functions can be expressed as

$$h(t \mid x) = \frac{f(t \mid x)}{S(t \mid x)}, \tag{2}$$

where $f(t \mid x)$ is the conditional survival density function. Time-to-event models typically characterize the relationship between the predictors $x$ and a time-to-event $t$ by estimation of the conditional hazard function of Eq. (2) by use of the relationships

$$S(t \mid x) = \exp\!\left(-\int_0^t h(u \mid x)\,du\right), \tag{3}$$

$$f(t \mid x) = h(t \mid x)\,S(t \mid x). \tag{4}$$

The Cox proportional hazard model (Cox, 1972) is a popular time-to-event model, which is based on the assumption that the effect of the predictors is a fixed, time-independent, multiplicative factor on the value of the hazard function (or hazard rate). The estimation depends on event ordering rather than on the time-to-event itself, which is undesirable in applications where the prediction is of the highest importance. The accelerated failure time (AFT) model (Wei, 1992) is another time-to-event model, which is based on the assumption that the effect of the predictors either accelerates or delays the event progression relative to a parametric baseline time-to-event distribution. The parametric time-to-event distribution is represented by use of a limited parametric form, such as the exponential distribution, an assumption that is often violated in practice because of the inability of the model to capture unobserved variation.
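As a quick numerical illustration of the survivor-hazard relationships above, the following sketch (toy values, not data from this study) verifies that integrating a constant hazard reproduces the closed-form exponential survivor function:

```python
import numpy as np

# Numerical check of the survivor/hazard relationships for a constant
# (exponential) hazard h(t) = lam, where S(t) = exp(-lam * t) in closed form.
# All variable names here are illustrative.
lam = 0.2                                   # constant hazard rate
t = np.linspace(0.0, 30.0, 3001)            # time grid (days)

h = np.full_like(t, lam)                    # hazard function h(t)
# cumulative hazard, i.e. the integral of h(u) from 0 to t (trapezoidal rule)
ch = np.concatenate([[0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * np.diff(t))])

S_numeric = np.exp(-ch)                     # S(t) = exp(-cumulative hazard)
S_closed = np.exp(-lam * t)                 # closed-form survivor function
f = h * S_numeric                           # density f(t) = h(t) * S(t)
```

For the constant hazard, the trapezoidal integral is exact, so the numerical and closed-form survivor functions agree to machine precision, and the density equals the negative slope of the survivor function.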
A deep adversarial time-to-event (DATE) model is yet another type of time-to-event model, which makes use of a conditional GAN to estimate the time-to-event distribution, $p(t \mid x)$, where $t$ is a non-censored time-to-event from the time at which the predictors $x$ were observed (Chapfuwa et al., 2018, 2021). This makes it possible to specify the time-to-event distribution implicitly via sampling, rather than by learning the parameters of a pre-specified distribution. Also, the use of a GAN penalizes unrealistic samples, which is a known issue in likelihood-based models (Karras et al., 2018). For censored events, the likelihood of $p(t > t_i \mid x_i)$ should be high, whereas for non-censored events, the pairs $\{x_i, t_i\}$ should be consistent with the data generated by $p(t \mid x)\,p_0(x)$, where $p_0(x)$ is the (empirical) marginal distribution of the predictors from which we can sample but whose explicit form is unknown.
The generator function of the conditional GAN of the DATE model can be modeled as

$$t = G_\theta(x; \varepsilon), \quad \varepsilon \sim p(\varepsilon), \tag{5}$$

where $p(\varepsilon)$ is a simple distribution such as the uniform distribution, and $\theta$ denotes the parameters of the generator. The generator defines an implicit non-parametric approximation $q_\theta(t \mid x, l = 1)$ of the non-censored samples of $p(t \mid x)$. Ideally, the pairs $\{x, t\}$ generated by Eq. (5) should be indistinguishable from the observed data. Given a discriminator function $D_\phi(x, t)$ with a parameter set $\phi$, the cost function of the conditional GAN for non-censored data can be expressed as

$$\mathcal{L}_1(G_\theta, D_\phi) = \mathbb{E}_{\{x,t\} \sim p_{nc}(t,x)}\!\left[D_\phi(x, t)\right] + \mathbb{E}_{x \sim p_{nc}(x),\, \varepsilon \sim p(\varepsilon)}\!\left[1 - D_\phi\!\left(x, G_\theta(x; \varepsilon)\right)\right], \tag{6}$$

where $p_{nc}(t, x)$ is the empirical joint distribution for the non-censored subset $\mathcal{D}_{nc} \subset \mathcal{D}$. The expectation terms are estimated through the samples $\{x, t\} \sim p_{nc}(t, x)$ and $\varepsilon \sim p(\varepsilon)$ only.
To leverage the censored subset $\mathcal{D}_c \subset \mathcal{D}$ for updating the parameters of the generator, a second cost function is introduced as

$$\mathcal{L}_2(G_\theta) = \mathbb{E}_{\{x,t\} \sim p_c(t,x),\, \varepsilon \sim p(\varepsilon)}\!\left[\max\!\left(0,\; t - G_\theta(x; \varepsilon; l = 0)\right)\right], \tag{7}$$

where the role of $\max(0, \cdot)$ is to incur no loss from $G_\theta(x; \varepsilon; l = 0)$ as long as the sampled time is larger than the censoring point.
For cases where the proportion of the observed events is low, the cost functions of Eqs. (6) and (7) do not account for mismatches between the time-to-events and the ground truth, $t$. To penalize $G_\theta(x; \varepsilon; l = 1)$ for not being close to the event time $t$ for non-censored events, a third cost function, or distortion loss, is introduced as

$$\mathcal{L}_3(G_\theta) = \mathbb{E}_{\{x,t\} \sim p_{nc}(t,x),\, \varepsilon \sim p(\varepsilon)}\!\left[\left|\, t - G_\theta(x; \varepsilon; l = 1) \,\right|\right]. \tag{8}$$

The conditional GAN of the DATE model is trained by optimizing the combination of the cost functions of Eqs. (6)-(8) with respect to $\phi$ and $\theta$. It has been demonstrated that the use of the DATE model yields a significant performance gain in the survival analysis of epidemiologic data over those of traditional methods (Chapfuwa et al., 2018), such as the Cox-Efron model (Efron, 1974), the random survival forest (Ishwaran et al., 2008), or a deep regularized AFT model (Chapfuwa et al., 2018).
In Section 3 , we describe how we generalized the concepts of the DATE model to convolutional neural networks for performing prognostic prediction for COVID-19 based on the CT images of patients in this study.

Methods and materials
3.1. pix2surv

Fig. 1 shows a schematic structure of the pix2surv survival prediction model for CT images. The training of the model involves the optimization of a time generator (Fig. 1a) and a time discriminator (Fig. 1b). The time generator, G = G_θ, is used to convert an image into an estimated survival time by converting the feature maps of a fully convolutional encoder network into a scalar time value by use of a fully connected network. The details of the implementation of G are discussed in Section 3.3. During training, the estimated survival time, t_est, is converted into an estimated survival time image (orange rectangle in Fig. 1a), which contains t_est as a scalar value at each pixel and is provided as input to the time discriminator.
The time discriminator, D = D φ , is trained to differentiate "real pairs" of an input image and the corresponding observed (true) survival time image (blue rectangle in Fig. 1 b), which is based on the observed true survival time, t obs , from "estimated pairs" of an input image and a corresponding estimated survival time image (orange rectangle in Fig. 1 b) generated by G . The implementation details of D are described in Section 3.3 .
The training of pix2surv involves the optimization of G and D based on the images of a training dataset. The cost function is a modified min-max objective function,

$$\mathcal{L}(G, D) = \mathcal{L}_{cGAN}(G, D) + \lambda_c \mathcal{L}_{censor}(G) + \lambda_n \mathcal{L}_{non\text{-}censor}(G), \tag{9}$$

which contains three distinct loss functions adapted from Eqs. (6)-(8). The first of these loss functions,

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,t \sim p_{data}(x,t)}\!\left[\log D(x, t)\right] + \mathbb{E}_{x \sim p_{data}(x),\, z \sim p_z(z)}\!\left[\log\!\left(1 - D(x, G(x, z))\right)\right], \tag{10}$$

is the standard loss function of a conditional GAN (Isola et al., 2017; Mirza and Osindero, 2014), where $p_{data}$ denotes the empirical joint distribution of an input image $x$ and a survival time image $t$, $p_z(z)$ denotes a Gaussian distribution, and $z$ is a latent variable. The loss function of Eq. (10) encourages D to identify incorrect survival times (or survival time images) generated by G, whereas G is encouraged to generate survival times $t_{est}$ that have a low probability of being incorrect, according to D. The two other loss functions of Eq. (9), adapted from Eqs. (7) and (8),

$$\mathcal{L}_{censor}(G) = \mathbb{E}_{x,t \sim p_{data}(x,t),\, z \sim p_z(z)}\!\left[\max\!\left(0,\; t - G(x, z)\right)\right] \tag{11}$$

and

$$\mathcal{L}_{non\text{-}censor}(G) = \mathbb{E}_{x,t \sim p_{data}(x,t),\, z \sim p_z(z)}\!\left[\left|\, t - G(x, z) \,\right|\right], \tag{12}$$

further constrain G to generate survival times that are similar to the observed true survival times of censored and non-censored patient images, respectively. The trade-off between $\mathcal{L}_{censor}(G)$ and $\mathcal{L}_{non\text{-}censor}(G)$ relative to $\mathcal{L}_{cGAN}(G, D)$ is controlled by the parameters $\lambda_c$ and $\lambda_n$ of Eq. (9).
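As a toy numerical sketch (not the actual network training), the three loss terms described above can be computed on scalar stand-ins. The arrays d_real, d_fake, and t_est below are illustrative placeholders for discriminator and generator outputs, and the log-based term assumes the standard conditional-GAN formulation:

```python
import numpy as np

# Toy sketch of the three pix2surv loss terms on scalar stand-ins.
# In the real model, d_real/d_fake come from the PatchGAN discriminator
# and t_est from the time generator; here they are fixed toy arrays.
t_obs = np.array([5.0, 12.0, 20.0, 9.0])          # observed survival times (days)
censored = np.array([False, True, False, True])   # l_i = 0 for censored patients
t_est = np.array([6.0, 10.0, 18.0, 11.0])         # generator outputs

d_real = np.array([0.9, 0.8, 0.85, 0.7])          # D(x, t_obs), toy values in (0, 1)
d_fake = np.array([0.2, 0.3, 0.25, 0.4])          # D(x, t_est)

nc = ~censored
# Standard conditional-GAN term on non-censored data:
l_cgan = np.mean(np.log(d_real[nc])) + np.mean(np.log(1.0 - d_fake[nc]))

# Censored term: penalize only if t_est falls short of the censoring time.
l_censor = np.mean(np.maximum(0.0, t_obs[censored] - t_est[censored]))

# Non-censored distortion term: absolute error against the observed time.
l_non_censor = np.mean(np.abs(t_obs[nc] - t_est[nc]))

lam_c, lam_n = 10.0, 10.0                         # trade-off weights from Section 3.3
total = l_cgan + lam_c * l_censor + lam_n * l_non_censor
```

With these toy values, the censored term incurs a loss of 2 days for the patient whose estimate (10 days) falls short of the censoring point (12 days) and no loss for the patient whose estimate exceeds it.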

Prognostic prediction for patients based on their CT images
In this study, the image-based prediction by pix2surv for estimating the survival time of a patient was performed based on an analysis of the 2D CT image slices of the patient. For this purpose, pix2surv was first trained by use of the individual 2D CT images of patients, where the CT images were paired with the observed survival time of the corresponding patient.
After the training, the survival time of a patient was estimated by subjecting the CT images of the patient to the time generator (see Section 3.1), which yielded an estimated survival time for each CT image. The survival time of the patient was then calculated as the median of the estimated survival times of the CT images of the patient. We used the median value because, in our experiments, this yielded more accurate predictions than the use of other first-order statistics for estimating the image-based survival time.
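The per-patient aggregation described above can be sketched as follows (toy slice-level estimates; in practice, the values come from the trained time generator):

```python
import numpy as np

# Per-patient aggregation: the generator yields one estimated survival
# time per CT slice; the patient-level estimate is their median (toy values).
slice_times = np.array([14.0, 9.0, 11.0, 30.0, 12.0])  # days, one per slice
patient_time = float(np.median(slice_times))
# The median is robust to outlier slices (e.g., the 30-day estimate above),
# which is consistent with it outperforming other first-order statistics
# such as the mean in the authors' experiments.
```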

Implementation of pix2surv
To reduce the computation time of the training step, we subsampled the input CT images from their original 512 × 512-pixel matrix size to a 256 × 256-pixel matrix size. Also, we constrained the number of CT images per patient to a maximum of 100, by random selection of the CT images whenever the CT acquisition series of a patient contained more than 100 CT image slices. Together, these two steps reduced the training time by 80% with essentially no change in the performance of the prognostic prediction (see Appendix A for an ablation study); thus, they substantially improved the throughput without compromising the prognostic performance.
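A minimal sketch of these two throughput steps, using a zero-filled stand-in for a CT series; the 2 × 2 mean pooling used for the downsampling here is an assumption, since the paper does not state which resampling method was used:

```python
import numpy as np

# Throughput steps sketch: cap the series at 100 randomly chosen slices,
# then downsample each 512x512 slice to 256x256 (here by 2x2 mean pooling).
rng = np.random.default_rng(42)

series = np.zeros((130, 512, 512), dtype=np.float32)  # stand-in 130-slice CT series

max_slices = 100
if series.shape[0] > max_slices:
    # random selection without replacement, keeping the original slice order
    keep = np.sort(rng.choice(series.shape[0], size=max_slices, replace=False))
    series = series[keep]

# 2x2 average pooling: (n, 512, 512) -> (n, 256, 256)
n, h, w = series.shape
series_small = series.reshape(n, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
```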
The architectural details of G (the time generator) which we used in our experiments are shown on the right margin of Fig. 1 a. There were four convolution layers and three fully connected layers. The architectural details of D (the time discriminator) are shown on the right margin of Fig. 1 b. We implemented D as a patch-based fully convolutional neural network (PatchGAN), similar to that of the pix2pix GAN model ( Isola et al., 2017 ;Li & Wand, 2016 ). There were five convolution layers. The PatchGAN is designed to penalize unrealistic structures at the scale of small image patches by averaging the outputs of the network convolutions across the input image into an aggregate output likelihood that is used to determine if the input image is considered as real or synthetic.
We implemented the pix2surv model by use of PyTorch 1.5 (Paszke et al., 2019). The calculations were performed on Linux graphics processing unit (GPU) servers equipped with 48-GB RTX 8000 GPUs (NVIDIA Corporation, Santa Clara, CA) and 10-core 3.7-GHz Core i9-10900X CPUs (Intel Corporation, Santa Clara, CA). No data augmentation was performed. The values of the free parameters of pix2surv were determined during training by use of a grid search, where the time generator and time discriminator of pix2surv were trained by use of the Adam optimizer with β1 = 0.5 and β2 = 0.999. The dropout ratio was set to 0.3, the batch size was 64, the learning rate was 2.0 × 10⁻⁴, and the trade-off parameters of Eq. (9) of the censored and non-censored loss functions of Eqs. (11) and (12) with respect to the standard loss of Eq. (10) were set to λc = 10 and λn = 10.

Materials
This study was approved by our institutional review board (IRB). All procedures involving human participants were performed in accordance with the ethical standards of the IRB and with the 1964 Declaration of Helsinki and its later amendments. The informed consent of the patients was waived for this study.
We established a retrospective multi-center database of COVID-19 cases with the associated CT image acquisitions, where the cases were collected between March 1 and June 28, 2020, from the medical records of the Massachusetts General Hospital and the Brigham and Women's Hospital through the Research Patient Data Registry and the COVID-19 Data Mart at the Mass General Brigham (Boston, MA), and they were followed up until July 28, 2020. The medical records were reviewed by an expert pulmonologist to include patients who (1) were at least 18 years old, (2) had been diagnosed as COVID-19 positive based on a positive result for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) by reverse transcriptase-polymerase chain reaction (RT-PCR) with samples obtained from the nasopharynx, oropharynx, or lower respiratory tract, and (3) had a high-resolution chest CT examination available. The resulting cohort consisted of 302 patients. After excluding the patients whose CT examinations had been performed for diseases other than COVID-19, we established a database of 214 COVID-19 patients for this study. All these patients were included in the study regardless of the diagnostic quality of their CT images.

[Table 1. The demographics, clinical characteristics, and CT parameters of the progression and mortality analysis patient cohorts. IQR = interquartile range.]

Table 1 summarizes the demographics, clinical characteristics, and CT acquisition parameters of the two types of patient cohorts used in this study. All 214 patients were considered for mortality analysis, whereas only 141 patients were considered for progression analysis, because patients who had their CT examination after the intensive care unit (ICU) admission were excluded from the progression analysis.
The chest CT images of the patients were acquired by use of a single-phase low-dose acquisition with a multi-channel CT scanner (Canon/Toshiba Aquilion ONE; GE Discovery CT750 HD and Revolution CT/Frontier; Siemens SOMATOM Definition AS/AS+/Edge/Flash, SOMATOM Force, Biograph 64, and Sensation 64) that used automatic tube-current modulation and the parameter settings shown in Table 1. The CT images were reconstructed by use of a neutral or medium-sharp reconstruction kernel.
The 214 patients generated a total of 84,971 CT images for the study. As a pre-processing step, the intensity values of the CT images were clipped to the Hounsfield unit (HU) range of −1024 to 1024 HU and mapped linearly to the range of −1 to +1.
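This pre-processing step can be sketched as follows (toy HU values; dividing the clipped values by 1024 realizes the linear map):

```python
import numpy as np

# Pre-processing: clip intensities to [-1024, 1024] HU,
# then map linearly to [-1, 1].
ct_slice = np.array([[-2000.0, -1024.0],
                     [0.0, 1500.0]])       # toy raw HU values

clipped = np.clip(ct_slice, -1024.0, 1024.0)
normalized = clipped / 1024.0              # [-1024, 1024] HU -> [-1, 1]
```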
Because not all values of some of the reference predictors were available for all patients, we evaluated the comparative performance of the predictors both in terms of the maximum number of patients that were available individually for each predictor and in terms of specific subcohorts of patients, called "common cases", in which the values of all the reference predictors were available for every patient. In the progression analysis, there were 105 such common cases, whereas, in the mortality analysis, there were 171 common cases.

Reference predictors
We compared the prognostic performance of pix2surv with those of reference predictors that had been reported in the peer-reviewed literature for COVID-19 by the time our experiments were carried out. These reference predictors included (1) a combination of the laboratory tests of lactic dehydrogenase, lymphocyte, and high-sensitivity C-reactive protein (abbreviated as Lab) (Ji et al., 2020; Yan et al., 2020), (2) visual assessment of the CT images in terms of a total severity score (TSS) (K. Lyu et al., 2020), (3) visual assessment of the CT images for the total severity score for crazy paving and consolidation (CPC) (Lyu et al., 2020), and (4) semi-automated assessment of the CT images in terms of the percentage of well-aerated lung parenchyma (%S-WAL) (Colombi et al., 2020).
The results of the laboratory tests (Lab) were collected from the patient records. The TSS (value range: 0-20) was estimated by an internist with over 20 years of experience (C.W.) based on the descriptions of previously published studies (Lyu et al., 2020), as the sum of the visually assessed degrees of acute lung involvement of the five lung lobes on the chest CT images. The CPC was assessed as the summed extent of crazy paving and consolidation in terms of the TSS criteria, where the summed involvement of the five lung lobes was taken as the total lung score (value range: 0-20) (Lyu et al., 2020). The %S-WAL was calculated by use of previously published image-processing software (Kawata et al., 2005), based on the descriptions of a previously published study (Colombi et al., 2020), as the relative volume of the well-aerated 3D lung region, determined by the density interval of −950 HU to −700 HU, with respect to the volumetric size of the complete segmented 3D lung region on the chest CT images.
The predictions by Lab were calculated by use of the elastic-net Cox proportional hazard model (Simon et al., 2011). To calculate the time-to-event distributions provided by the image-based reference predictors, each predictor was subjected to the conditional GAN of the pix2surv model of Section 3.1, except that, for each predictor, there was only one input image per patient, where the input image was constructed by storing the feature value of the predictor in the channel dimension at each pixel. The other computations were performed as described in Section 3.1. Previously, we demonstrated that, when the time-to-event distribution of a single-valued predictor is estimated by use of pix2surv as described above, the resulting prognostic performance is similar to or even higher than if the predictor had been subjected to a traditional Cox proportional hazards model (Uemura et al., 2020). This observation is consistent with the previously reported result that the predictions generated by the DATE model (see Section 2), the inspiration behind our pix2surv model, are more accurate than those generated by traditional survival models (Chapfuwa et al., 2018).
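A minimal sketch of this one-image-per-patient construction; the image size and the helper name `predictor_to_image` are illustrative, not from the paper:

```python
import numpy as np

# Sketch of turning a single-valued predictor (e.g., %S-WAL) into the
# constant "image" input described above: the scalar feature value is
# stored at every pixel of a one-channel image.
def predictor_to_image(value, size=256):
    """Broadcast a scalar predictor value into a (1, size, size) image."""
    return np.full((1, size, size), float(value), dtype=np.float32)

swal_image = predictor_to_image(0.62)   # e.g., 62% well-aerated lung parenchyma
```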

Training and validation with bootstrapping
To obtain an unbiased estimate and 95% confidence intervals of how well our model would generalize to external validation patients, we performed the evaluations by use of the bootstrap-based procedure recommended by the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) consensus guideline (Moons et al., 2015). The bootstrap evaluations were performed with 100 bootstrap replicates on the pix2surv model as well as on the reference predictors described in Section 3.5. The associated statistical analyses were performed by use of R 4.0.2 (R Core Team, 2020).
The bootstrap evaluations were performed by use of per-patient bootstrapping, i.e., when a patient was assigned to a training or test set in the bootstrap procedure, all the CT images of the patient were assigned to that set. See Appendix B for details about the implementation of the per-patient bootstrap procedure. It took approximately 276 hours (11.5 days) to perform 100 bootstrap replicates for 214 patients on a single GPU by the use of the architecture and parameter settings of pix2surv described in Section 3.3 .
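The per-patient split can be sketched as follows (hypothetical patient IDs; the actual procedure, including how replicates are scored, is detailed in Appendix B):

```python
import random

# Per-patient bootstrap sketch: patients (not slices) are resampled with
# replacement; out-of-bag patients form the test set, and every CT image
# follows its patient into the same set.
random.seed(0)

# 10 toy patients with 3 slice identifiers each
patients = {f"p{i:03d}": [f"p{i:03d}_slice{k}" for k in range(3)] for i in range(10)}

ids = sorted(patients)
boot_ids = [random.choice(ids) for _ in ids]                 # bootstrap training sample
test_ids = [pid for pid in ids if pid not in set(boot_ids)]  # out-of-bag patients

train_images = [img for pid in boot_ids for img in patients[pid]]
test_images = [img for pid in test_ids for img in patients[pid]]
```

Because assignment happens at the patient level, no CT image of a test patient can leak into the training set.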

Prediction performance
We measured the performance of the prognostic prediction in terms of survival time. For the analysis of COVID-19 progression, the survival time was defined as the number of days from the baseline CT image acquisition to that of either ICU admission or death (for uncensored patients), or to the most recent follow-up date (for censored patients). For the analysis of COVID-19 mortality, the survival time was defined as the number of days from the baseline CT image acquisition to the death of the patient (for uncensored patients), or to the most recent follow-up date (for censored patients).
We used the concordance index (C-index) ( Harrell et al., 1996 ) as the primary metric of the performance of the prognostic prediction. The C-index is technically similar to the area under the receiver operating characteristic (ROC) curve (AUC) that is used for evaluating classification performance for binary outcomes, except that the C-index estimates the concordance between predicted and observed outcomes in the presence of censoring. The C-index is focused on the estimation of usable pairs, in which one patient is known to have an outcome before the other patient, who may have an outcome later or who may be censored. The C-index has a value range of 0%-100%, where 50% indicates random prediction and 100% indicates perfect prediction.
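A minimal sketch of the C-index on toy data; giving ties in the predictions half credit is a common convention that the paper does not spell out, so it is an assumption here:

```python
# Minimal C-index sketch (Harrell et al., 1996): among "usable" pairs,
# i.e. pairs where the patient with the earlier time had an observed event,
# count the fraction whose predicted times preserve that ordering.
def c_index(times, events, predicted):
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair is usable if patient i has an observed event before time j
            if events[i] and times[i] < times[j]:
                usable += 1
                if predicted[i] < predicted[j]:
                    concordant += 1.0
                elif predicted[i] == predicted[j]:
                    concordant += 0.5   # ties get half credit (assumed convention)
    return concordant / usable

# Toy data: times in days, events (1 = observed, 0 = censored), predictions.
score = c_index([2.0, 5.0, 7.0, 9.0], [1, 1, 0, 1], [1.0, 4.0, 3.0, 6.0])
```

In this toy example, 4 of the 5 usable pairs are ordered correctly, so the C-index is 0.8 (80%).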
As a secondary metric of the prognostic performance, we calculated the relative absolute error (RAE) of the predictions with respect to the range of the events. The RAE is defined as

$$\mathrm{RAE} = \frac{1}{N_{nc}} \sum_{i:\, l_i = 1} \frac{\left| t_i^{est} - t_i \right|}{t_{\max} - t_{\min}},$$

where $t_i^{est}$ is the predicted survival time, the sum runs over the $N_{nc}$ non-censored patients, and $t_{\max} - t_{\min}$ is the range of the observed event times. It should be noted that we did not include the Lab reference predictor in the RAE or survival-time estimation results of Section 4. As noted in Section 2, the estimate of the Cox proportional hazard model that was used for calculating the prediction of Lab (Section 3.5) is based on event ordering rather than on time. Thus, it does not provide the time-to-event distribution necessary for the calculation of the RAE or the distribution of the survival time.
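A toy sketch of an RAE computed as the mean absolute prediction error on non-censored patients, normalized by the range of the observed event times (this normalization is our reading of "with respect to the range of the events"; the paper's exact formula may differ):

```python
# RAE sketch: mean absolute prediction error over non-censored patients,
# normalized by the observed event-time range.
def relative_absolute_error(t_obs, events, t_est):
    observed = [t for t, e in zip(t_obs, events) if e]      # event times only
    t_range = max(observed) - min(observed)
    errors = [abs(te - to) for to, te, e in zip(t_obs, t_est, events) if e]
    return sum(errors) / (len(errors) * t_range)

# Toy data: one censored patient (event = 0) is excluded from the error.
rae = relative_absolute_error([2.0, 10.0, 22.0], [1, 0, 1], [4.0, 9.0, 18.0])
```

Here the two observed events are missed by 2 and 4 days over a 20-day event range, giving an RAE of 0.15.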
We also quantified the uncertainty in the predictions of the progression and mortality across the predictors by use of the coefficient of variation as a metric (Chapfuwa et al., 2020, 2021). The details and results of this analysis are provided in Appendix C.

Risk stratification
We evaluated the performance of the pix2surv model in risk stratification by use of the Kaplan-Meier estimator (Kaplan and Meier, 1958). The Kaplan-Meier estimator is a non-parametric statistic for estimating the survival probability function of a population as a function of time from time-to-event data. A plot of the Kaplan-Meier estimator, called a Kaplan-Meier survival curve, results in a series of declining horizontal steps which, with a large enough sample size, approaches the true survival function of that population.
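The estimator itself can be sketched in a few lines (toy times in days; censored subjects reduce the at-risk count without producing a step):

```python
# Minimal Kaplan-Meier estimator sketch: at each distinct event time,
# multiply the running survival probability by (1 - d/n), where d is the
# number of events at that time and n the number of subjects still at risk.
def kaplan_meier(times, events):
    """Return [(time, survival probability)] steps from (time, event) data."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = n_t = 0
        while i < len(order) and times[order[i]] == t:   # group tied times
            n_t += 1
            d += events[order[i]]
            i += 1
        if d > 0:                                        # events cause a step down
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= n_t                                   # censored subjects leave silently
    return curve

curve = kaplan_meier([1.0, 2.0, 2.0, 3.0, 5.0], [1, 1, 0, 1, 0])
```

With this toy cohort of five subjects, the curve steps down to 0.8, 0.6, and 0.3 at days 1, 2, and 3, while the censored subjects at days 2 and 5 only shrink the at-risk set.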
For each patient in a cohort, the predicted survival time from pix2surv was calculated by use of the per-patient bootstrapping (Section 3.6.1). Then, the median of the predicted survival times of all the patients in the cohort was used as a cut point (Harrell et al., 1996) for stratifying the patients into low- and high-risk groups: the patients whose predicted survival times were shorter than the cut-point time value were categorized into the high-risk group, whereas those whose predicted survival times were longer than the cut-point value were categorized into the low-risk group. For each group, the Kaplan-Meier survival curve was generated by use of the Kaplan-Meier estimator, and the difference between the survival curves of the two risk groups was evaluated by use of the log-rank test (Mantel, 1966), which is a non-parametric test of the null hypothesis that there is no difference between the populations in the probability of an event at any time point.

[Fig. 4. The Kaplan-Meier survival curves, stratified into low- and high-risk patient groups, of the common cases in the progression analysis cohort included in Fig. 2. The estimated survival curves for the low-risk group (n = 52) and high-risk group (n = 53) are shown in blue and red, respectively, with shaded areas representing the 95% confidence intervals. The P values were obtained by application of the log-rank test to the two survival curves.]

[Fig. 5. The Kaplan-Meier survival curves, stratified into low- and high-risk patient groups, of the common cases in the mortality analysis cohort included in Fig. 3. The estimated survival curves for the low-risk group (n = 85) and high-risk group (n = 86) are shown in blue and red, respectively, with shaded areas representing the 95% confidence intervals. The P values were obtained by application of the log-rank test to the two survival curves.]
The log-rank test is based on the same assumptions as those of the Kaplan-Meier survival curves ( Harrington, 2005 ;Harrington and Fleming, 1982 ).

Equivalence of estimated versus observed survival curves
We evaluated the equivalence of the estimated Kaplan-Meier survival curve $S_1(t)$ with that of the patient cohort, $S_2(t)$, by use of a non-parametric equivalence test (Möllenhoff and Tresch, 2020), where the equivalence margin was set to $\epsilon = 0.15$. The null hypothesis on the difference of the two survival curves over the entire period, $\max_t |S_1(t) - S_2(t)| \geq \epsilon$, was tested at a significance level of 0.05. If the null hypothesis was rejected, the survival curves were considered equivalent.

Results

Fig. 2 shows the comparative performance of pix2surv and the reference predictors in the prediction of COVID-19 progression, as measured by the C-index and RAE with the 100 bootstrap replicates. This progression analysis of common cases (see Section 3.4) included 105 patients.

Fig. 3 shows the comparative performance of pix2surv and the reference predictors in the prediction of mortality, as measured by the C-index and RAE with 100 bootstrap replicates. This mortality analysis of common cases (see Section 3.4) included 171 patients, of whom 40 had expired. For the C-index, the prediction performance of pix2surv was the highest among the predictors.

Table 2 shows the comparative performance of pix2surv and the reference predictors in the prediction of the COVID-19 progression (left) and mortality (right) for the subcohorts of patients in which the maximum numbers of patients that were available individually for each predictor, as indicated in the second and fifth columns for progression and mortality, respectively, were used to calculate the result for the predictor. For pix2surv, the C-index values for both progression and mortality were increased by 2.7% from those shown in Figs. 2 and 3.
The RAE values of pix2surv for progression and mortality were decreased by 3.1% and 0.5%, respectively. Similar to the trend shown in Figs. 2 and 3 , pix2surv statistically significantly ( P < 0.0 0 01) outperformed the other predictors.

Prognostic prediction performance
The above results indicate that pix2surv outperforms the reference predictors by a large margin in prognostic prediction. The results of the quantification of the associated uncertainties that are provided in Appendix C also show that pix2surv is at least as precise in prognostic prediction as the reference predictors.
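The C-index reported in these comparisons is Harrell's concordance index. For a model such as pix2surv that outputs predicted survival times, it can be illustrated with a minimal sketch (this is a standard pairwise definition, not the authors' implementation):

```python
def c_index(pred_times, obs_times, events):
    """Harrell's concordance index for predicted survival times.
    A pair (i, j) is comparable when the shorter observed time ends in an
    event; it is concordant when the predictions are ordered the same way.
    Tied predictions count as half-concordant."""
    concordant, comparable = 0.0, 0
    n = len(obs_times)
    for i in range(n):
        for j in range(n):
            if obs_times[i] < obs_times[j] and events[i]:
                comparable += 1
                if pred_times[i] < pred_times[j]:
                    concordant += 1.0
                elif pred_times[i] == pred_times[j]:
                    concordant += 0.5
    return concordant / comparable

# Perfectly ordered predictions give a C-index of 1.0 (1 = event, 0 = censored):
print(c_index([1, 2, 3, 4], [10, 20, 30, 40], [1, 1, 1, 0]))  # 1.0
```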

Risk stratification performance
Figs. 4 and 5 show the Kaplan-Meier survival curves, stratified into low- and high-risk groups, of the common cases of COVID-19 patients included in Figs. 2 and 3, respectively. In both the progression and mortality analyses, both visual assessment and the P-values of the log-rank test indicated that the separation between the two curves was largest with pix2surv, indicating that pix2surv was the most effective predictor in the stratification of the progression and mortality risk of COVID-19 patients.
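The log-rank comparison underlying these P-values can be sketched in plain Python. This is the standard two-sample log-rank statistic, not the authors' implementation; the p-value uses the closed form for a chi-square distribution with 1 degree of freedom:

```python
import math

def logrank(times1, events1, times2, events2):
    """Two-sample log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times1, events1)] + \
           [(t, e, 1) for t, e in zip(times2, events2)]
    o_minus_e, var = 0.0, 0.0
    for t in sorted({ti for ti, ei, _ in data if ei}):
        n = sum(1 for ti, _, _ in data if ti >= t)                  # at risk, pooled
        n1 = sum(1 for ti, _, g in data if ti >= t and g == 0)      # at risk, group 1
        d = sum(1 for ti, ei, _ in data if ti == t and ei)          # events at t
        d1 = sum(1 for ti, ei, g in data if ti == t and ei and g == 0)
        o_minus_e += d1 - d * n1 / n                                # observed - expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)  # hypergeometric variance
    chi2 = o_minus_e ** 2 / var if var else 0.0
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square, 1 degree of freedom
    return chi2, p

# Well-separated groups yield a small p-value:
chi2, p = logrank([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(p < 0.05)  # True
```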

Equivalence of survival curves
Figs. 6 and 7 show the Kaplan-Meier survival curves estimated by pix2surv and the three image-based reference predictors for the progression and mortality of the common cases of COVID-19 patients included in Figs. 2 and 3, respectively, in comparison with the actual (baseline) survival curves of these patient cohorts. The non-parametric equivalence test described in Section 3.6.4 showed that, in both progression and mortality predictions, the survival curves estimated by use of pix2surv were identified as equivalent to the actual survival curves over the period of 0 to 30 days, whereas those estimated by the reference predictors were not. Also, visual assessment indicates that pix2surv approximates the actual survival time better than do the reference predictors.

Fig. 9. An example of the predicted overall survival time (mortality) of a 67-year-old male who expired 10 days (red dotted line on the plot on the right) after the chest CT examination. The image on the left shows a representative example of the CT images. The plot on the right shows the predicted survival times (circles) by pix2surv and the image-based reference predictors, with 95% confidence interval bars superimposed on the boxplots that represent the bootstrap results. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
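The test statistic of the non-parametric equivalence test, max over t of |S1(t) − S2(t)|, can be sketched in plain Python. This is a minimal illustration only: the full test of Möllenhoff and Tresch additionally requires a bootstrapped critical value, which is omitted here.

```python
def km_curve(times, events):
    """Return the Kaplan-Meier survival estimate as a {time: S(t)} step map."""
    surv, s = {0.0: 1.0}, 1.0
    for t in sorted(set(times)):
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei)  # deaths at t
        n = sum(1 for ti in times if ti >= t)                          # at risk at t
        if d:
            s *= 1.0 - d / n
            surv[t] = s
    return surv

def max_km_distance(t1, e1, t2, e2):
    """max over event times of |S1(t) - S2(t)|, treating the curves as step functions."""
    s1, s2 = km_curve(t1, e1), km_curve(t2, e2)
    def step(surv, t):
        keys = [k for k in surv if k <= t]
        return surv[max(keys)] if keys else 1.0
    grid = sorted(set(list(s1) + list(s2)))
    return max(abs(step(s1, t) - step(s2, t)) for t in grid)

# Toy cohorts (1 = event, 0 = censored):
d = max_km_distance([2, 4, 6, 8, 10], [1, 1, 0, 1, 0],
                    [2, 5, 6, 9, 10], [1, 1, 0, 1, 0])
print(d)  # about 0.3 here, exceeding a 0.15 equivalence margin
```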
For the cases in Figs. 8 and 9, the two-one-sided t-test (TOST) ( Schuirmann, 1987 ) with an equivalence margin of 15% and a confidence interval of 95% showed that the survival times predicted by pix2surv were equivalent to the observed survival time (P < 0.0001), whereas those of the reference predictors were not, indicating the potential usefulness of the pix2surv model for the prediction of survival times of COVID-19 patients. For the case shown in Fig. 10, all the predictors yielded a longer survival time than what was observed, possibly because the involvement of the consolidation is limited to the posterior and peripheral lung on the CT images. However, pix2surv still approximated the observed survival time more accurately than did the reference predictors.
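A minimal sketch of the TOST check on bootstrap-predicted survival times follows. This is not the authors' implementation: a normal approximation replaces the t-distribution, which is a simplification that is reasonable only for large bootstrap samples, and the 15% margin is applied relative to the observed survival time.

```python
import math

def tost_equivalent(pred, observed, margin_frac=0.15, alpha=0.05):
    """Two one-sided tests (Schuirmann, 1987) of whether the mean of the
    bootstrap-predicted survival times lies within +/- margin_frac of the
    observed time; large-sample normal approximation."""
    n = len(pred)
    mean = sum(pred) / n
    sd = (sum((x - mean) ** 2 for x in pred) / (n - 1)) ** 0.5
    se = sd / n ** 0.5
    lo, hi = observed * (1 - margin_frac), observed * (1 + margin_frac)
    z_lower = (mean - lo) / se          # test of H0: mean <= lower bound
    z_upper = (hi - mean) / se          # test of H0: mean >= upper bound
    def p_one_sided(z):                 # P(Z >= z) for a standard normal
        return 0.5 * math.erfc(z / math.sqrt(2))
    # Equivalence is claimed only if BOTH one-sided nulls are rejected:
    return max(p_one_sided(z_lower), p_one_sided(z_upper)) < alpha

# Bootstrap predictions tightly clustered around an observed time of 10 days:
print(tost_equivalent([9.8, 10.1, 10.0, 9.9, 10.2] * 20, observed=10.0))  # True
```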

Discussion
Fast and accurate clinical assessment of the disease progression and mortality is vital for the management of COVID-19 patients. Although several computer-assisted image-based predictors have been proposed for prognostic prediction of COVID-19 patients based on chest CT, those previous predictors were limited to semi-automated schemes with manually designed features and supervised learning, and the survival analysis was largely limited to logistic regression. To the best of our knowledge, the weakly unsupervised conditional GAN model (pix2surv) that we developed in this study is the first prognostic deep-learning model that can be trained to estimate the distribution of the survival time of a patient directly from the CT images of the patient without image segmentation. The use of deep learning as an integral part of pix2surv makes it possible to train a complete image-based end-to-end survival analysis model for estimating the time-to-event distribution directly from input images without explicit segmentation or feature extraction. Also, our weakly unsupervised approach eliminates the time, cost, and uncertainty of the manual image annotation efforts that are still required by traditional supervised learning approaches and that can slow down the development of solutions for addressing new diseases such as COVID-19 ( Greenspan et al., 2020 ).
We demonstrated that the prognostic performance of pix2surv based on chest CT images for estimating the disease progression and mortality of patients with COVID-19 is significantly better than those based on established laboratory tests or existing image-based visual and quantitative predictors. The time-to-event information calculated by pix2surv from chest CT images also enabled stratification of COVID-19 patients into low- and high-risk groups by a wider margin than that calculated by the reference predictors. The nominal performance of pix2surv could be improved in a number of ways. One approach would be to use data augmentation for enhancing the training dataset ( Shorten and Khoshgoftaar, 2019 ). Another approach could be to expand our COVID-19 dataset by use of public imaging repositories that are currently being constructed. However, it is not clear if such repositories will include chest CT examinations and the kinds of specific clinical information that were available to our study.
For this study, we implemented pix2surv as a 2D deep learning model, where the prediction is based on an analysis of a stack of 2D CT image slices of a patient, rather than on a volumetric analysis of the chest CT volume. At present, effective use of 3D deep learning is constrained by the limitations of the currently available datasets and GPUs, which introduce several obstacles. First, the use of a 3D volume as the basic unit would reduce the amount of training data relative to using 2D images. Because of the 2D implementation, we were able to perform the training and evaluations by the use of up to 84,971 CT images, whereas with a 3D implementation we could have used only up to 214 CT image volumes, thus introducing a convergence problem in the training phase. Second, in clinical practice, chest CT studies are still being acquired at an anisotropic image resolution, which makes their volumetric analysis less meaningful than an independent analysis of the image slices. Third, because of the memory limitations of the currently available GPUs, it is not straightforward, and sometimes not even possible, to fit an isotropic high-resolution chest CT volume and a 3D deep learning model into a single GPU, at least not without compromising performance. However, in the future, we anticipate that a 3D pix2surv model trained with a large enough dataset of isotropic chest CT volumes could yield an even higher performance than that reported in this study.
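The memory argument can be made concrete with back-of-envelope arithmetic. The 256×256 slice size below matches the 2D configuration used in this study; the 512³ isotropic volume is an illustrative assumption for a high-resolution 3D input.

```python
# Rough per-sample memory footprint: a subsampled 2D CT slice vs. a
# hypothetical isotropic high-resolution 3D CT volume, both as float32.
bytes_per_voxel = 4  # float32

slice_2d = 256 * 256 * bytes_per_voxel         # one subsampled CT slice
volume_3d = 512 * 512 * 512 * bytes_per_voxel  # illustrative isotropic volume

print(f"2D slice:  {slice_2d / 2**20:.2f} MiB")   # 0.25 MiB
print(f"3D volume: {volume_3d / 2**30:.2f} GiB")  # 0.50 GiB
print(f"ratio:     {volume_3d // slice_2d}x")     # 2048x per training sample
```

Note that this covers the input alone; intermediate activations of a 3D network multiply the footprint further, which is why fitting such a model into a single GPU is often impractical.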
It should be noted that most of the COVID-19 data of this study were collected during the first six months of the pandemic outbreak, at a time when relatively little was known about COVID-19. Since then, rapid developments in COVID-19 treatments and vaccinations have substantially improved the patients' survival, and survival models that have been trained only on previously collected COVID-19 data may have limited relevance in today's context. Thus, topics such as generalization of previously developed prediction models to more recently collected COVID-19 data, including issues such as "Long COVID" ( Sudre et al., 2021 ), provide ideas for future studies.
The reference COVID-19 predictors of this study were limited to those that had been published in peer-reviewed literature at the time our experiments were carried out. The purpose of this study was to develop and to demonstrate the feasibility of a weakly unsupervised pix2surv model for performing prognostic prediction for COVID-19 based on chest CT images, rather than to perform an exhaustive evaluation with any potentially available predictors. This is a topic to be explored in a future study.
The main limitations of this study include that this was a retrospective study based on early COVID-19 data, and that the evaluation was limited to an internal validation with bootstrapping. The proposed method considers only image-based information; therefore, integration of non-imaging clinical data into the model could improve the accuracy of the predictions. Potential future topics include the application of the pix2surv model to more recently collected COVID-19 data and to other diseases that are manifested in medical images, as well as an external validation with prospective cases.

Conclusions
We developed a weakly unsupervised conditional GAN, called pix2surv, that can be used to calculate time-to-event information automatically from images for performing prognostic prediction. We showed that the prognostic performance of pix2surv based on chest CT images compares favorably with those of currently available laboratory tests and existing image-based visual and quantitative predictors in the estimation of the disease progression and mortality of COVID-19 patients. We also showed that the time-to-event information calculated by pix2surv from chest CT images enables stratification of the patients into low- and high-risk groups by a wider margin than those of the other predictors. Thus, pix2surv is a promising approach for performing image-based prognostic prediction for the management of patients.

CRediT authorship contribution statement
Tomoki Uemura and Janne J. Näppi contributed to methodology, software development, experiments, formal analyses, and writing the manuscript. Chinatsu Watari contributed to clinical data collection and formal analyses. Toru Hironaka contributed to data collection and software development. Tohru Kamiya provided technical consultation. Hiroyuki Yoshida conceptualized and supervised the study, developed methodology, carried out formal analyses, and reviewed and edited the manuscript.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. During the conduct of this study, Janne J. Näppi received NIH grants R21EB024025 and R21EB22747, as well as Massachusetts General Hospital (MGH) Executive Committee on Research (ECOR) Interim Support Funding, as the PI of the grants; Hiroyuki Yoshida received NIH grants R01EB023942 and R01CA212382 as the PI of the grants; Tomoki Uemura was partly supported by the NIH grant R01EB023942; and Toru Hironaka was partly supported by the NIH grants R21EB024025, R21EB22747, R01CA212382, and R01EB023942.

Appendix A. Ablation study

Table A.1 provides an ablation study regarding the prediction performance of the pix2surv model when it was trained using the method of Section 3.3 ("256×256×100"), i.e., by subsampling of the original input CT images to a 256×256-pixel matrix size and by constraining the number of CT images per patient to a maximum of 100, in comparison to using all the available CT images of the patients ("256×256×All") or using the original 512×512-pixel matrix size of the CT images ("512×512×100"). The results in Table A.1 indicate that the method of Section 3.3 for reducing the training time yields a reasonable approximation of the prediction performance.

Appendix B. Per-patient bootstrap procedure
Let N be the number of patients included in a patient cohort (i.e., the progression or mortality analysis cohort in Table 1), and let x_i denote the CT images of patient i (i = 1, …, N). Let x = (x_1, x_2, …, x_N) be the set of all CT images for the cohort of the N patients. Let C(x_train, x_test) denote the value of a C-statistic (e.g., C-index or RAE) that is obtained when pix2surv is trained on the training set x_train and tested on the test set x_test. Here, the training and testing of pix2surv are performed as described in Sections 3.1 and 3.2. The per-patient bootstrap evaluation of pix2surv for obtaining a bias-corrected estimate of the C-statistic value is performed as follows ( Efron and Tibshirani, 1993 ; Sahiner et al., 2008 ): (1) First, we initialize pix2surv with random weights, and then calculate a resubstitution estimate of the C-statistic value, C(x, x), by training pix2surv on the cohort of patients x and by testing it on the same cohort x.
(2) Next, we generate B bootstrap replicates (also called bootstrap samples) x̂_b (b = 1, …, B), each of which is obtained by randomly drawing N patients, with replacement, from x. It can be shown that, for a large N, each of these bootstrap replicates contains, on average, approximately 1 − e⁻¹ ≈ 63.2% of all the patients ( Efron and Tibshirani, 1997 ).
(3) We then train pix2surv on each bootstrap replicate x̂_b, and test it on both x̂_b and x to obtain the bias of the resubstitution estimate: w_b = C(x̂_b, x̂_b) − C(x̂_b, x). Here, the first term of w_b can be regarded as the resubstitution C-statistic value in a so-called "bootstrap world" ( Boos, 2003 ), whereas the second term of w_b can be regarded as the test C-statistic value in the bootstrap world. (4) The average of w_b over the B bootstrap replicates provides an estimated bias of the resubstitution estimate. Thus, the bias-corrected bootstrap estimate of the C-statistic is obtained by subtracting this average from the resubstitution estimate: C_bc = C(x, x) − (1/B) Σ_{b=1..B} w_b.
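The per-patient bootstrap procedure above can be sketched as follows. This is a minimal illustration, not the authors' code; `train_and_score` is a hypothetical stand-in for training pix2surv on one set of patients and evaluating a C-statistic on another.

```python
import random

def bias_corrected_bootstrap(patients, train_and_score, B=100, seed=0):
    """Per-patient bootstrap bias correction of a resubstitution C-statistic.
    `train_and_score(train, test)` stands in for C(x_train, x_test)."""
    rng = random.Random(seed)
    c_resub = train_and_score(patients, patients)        # C(x, x)
    biases = []
    for _ in range(B):
        boot = [rng.choice(patients) for _ in patients]  # draw N patients with replacement
        # w_b = C(x_b, x_b) - C(x_b, x): resubstitution minus test value,
        # both computed in the "bootstrap world"
        biases.append(train_and_score(boot, boot) - train_and_score(boot, patients))
    return c_resub - sum(biases) / B                     # bias-corrected estimate

# Toy stand-in: pretend resubstitution always scores 0.9 and independent
# testing 0.7; the corrected estimate then recovers ~0.7.
toy = lambda train, test: 0.9 if train is test else 0.7
print(bias_corrected_bootstrap(list(range(10)), toy))  # approximately 0.7
```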

Appendix C. Quantification of uncertainty
Predictions made by artificial intelligence suffer from various uncertainties, such as those related to the input data or the correctness of the underlying prediction model ( Ghoshal et al., 2020 ). One of the metrics to measure such uncertainties is the coefficient of variation (CoV), which characterizes the dispersion of predictions around the mean of a distribution. In practice, it is desirable for a time-to-event prediction model to generate concentrated predictions. Thus, a low value of CoV indicates that a prediction is more precise than one obtained with a large value of CoV. Table C.1 shows the CoV of pix2surv and those of the reference predictors in the prediction of the COVID-19 progression (left) and mortality (right). It should be noted that predictions based on the Cox model (Lab) have been excluded from this analysis, because the Cox model estimates a risk score, and thus its predictions cannot be evaluated on CoV.
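The CoV itself is straightforward to compute from a predictor's bootstrap predictions; a minimal sketch (the sample data are hypothetical):

```python
def coefficient_of_variation(predictions):
    """CoV = sample standard deviation / mean of a predictor's predictions;
    lower values indicate more concentrated (more precise) predictions."""
    n = len(predictions)
    mean = sum(predictions) / n
    sd = (sum((x - mean) ** 2 for x in predictions) / (n - 1)) ** 0.5
    return sd / mean

# Two hypothetical predictors estimating the same survival time:
tight = [9.5, 10.0, 10.5, 10.0, 10.0]   # concentrated predictions
loose = [5.0, 10.0, 15.0, 8.0, 12.0]    # dispersed predictions
print(coefficient_of_variation(tight) < coefficient_of_variation(loose))  # True
```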