Dataset of tugHall simulations of cell evolution for colorectal cancer

The dataset contains the results of multiple parallel calculations using the tugHall simulator. The output data of the simulations are variant allele frequencies for four genes (APC, KRAS, TP53, and PIK3CA) related to colorectal cancer. During each simulation, tugHall stochastically reproduces Darwinian evolution of cancer cells and calculates clonal heterogeneity. The probabilities of the stochastic processes depend on a correspondence matrix between genome information and cancer hallmarks. As a result, tugHall records variant allele frequencies for the final stage of evolution. The number of trials is several million, which gives rich statistics of the stochastic processes. These data can be used for approximate Bayesian computation and other statistical methods to obtain personalized coefficients for patients with colorectal cancer. The procedure for using the data is explained in our paper [Bioinformatics, 36, 11 (2020) 3597], in which part of these data was used.


Specifications
Subject: Bioinformatics
Specific subject area: Colorectal oncology and mathematical modeling
Type of data: Statistical simulation of variant allele frequencies for the clonal evolution of colorectal cancer
How data were acquired: The dataset is a set of results of 9.6 million simulations using the tugHall simulator. To get a large number of simulations we used the resources of the supercomputer "SHIROKANE" of the Human Genome Center of the University of Tokyo [ https://supcom.hgc.jp ].
Data format: Simulation data (tabular format), modeling workflow (figure)
Parameters for data collection: Identification number of simulation, names of models, initial conditions, the format of input parameters, weights for hallmarks and genes, compaction factors, probabilities of stochastic processes, as well as results of simulations (variant allele frequencies, numbers of primary tumor cells and metastatic cells, last time-step, number of clones).
Description of data collection: Data were collected as a result of multiple simulations. Parallel calculations with 40 nodes and 960 cores were used to perform 9,600,000 trials for 4 models, 3 types of initial clones, and 2 types of input data. In total there are 24 combinations with 400,000 trials for each.

Value of the Data
• The dataset can potentially be used to predict the target gene for the treatment of colorectal cancer in personalized medicine.
• The dataset can be useful in bioinformatics and biostatistics for choosing the target gene, and it is complementary to survival analysis for improving a patient's probability of survival using genome information.
• Approximate Bayesian computation allows us to extract the weights of the relations between driver genes and cancer hallmarks in the tugHall model for a particular patient. Using the personalized weights, it is possible to predict which gene should be blocked to stop cancer development.
• The accuracy of the prediction depends on the size of the dataset. For this reason the dataset contains the results of several million simulations, and it will continue to grow in the future.

Data Description
The dataset provides the results of 9.6 million calculations using the tugHall simulator. The output data of the simulations are variant allele frequencies (VAF) for four genes (APC, KRAS, TP53, and PIK3CA) related to colorectal cancer [1]. During each simulation, tugHall stochastically reproduces Darwinian evolution of cancer cells and calculates clonal heterogeneity [2]. The VAFs were calculated at the last time-step of each simulation, together with statistical data such as the numbers of primary tumor and metastatic cells, the final time-step, and the number of clones. The VAFs were calculated for two cases: the first is the VAF for all cells in the simulation pool, and the second is the VAF for primary tumor cells only. These results are stored in two files with the same structure (data_base_MODELS_ALL.txt and data_base_MODELS_PRIMARY.txt, respectively). The file data_base_STATISTICS.txt contains the statistical data of the simulations. In total, the dataset has 8 files: 3 files with output data, 4 files with input data, and 1 file with analytic data (Table 1). Table 1 gives a short description of each file as well as its size. Table 2 shows analytical data for the different types of simulations. Each simulation can finish in one of two ways: either all cells die and there is no output data (unsuccessful), or the simulation produces output data (column "success" in Table 2). The successful simulations consist of two subsets: those with zero output for all VAFs and those with a non-zero VAF for at least one gene (column "non_Zero" in Table 2). Hereinafter, VAF means the VAF for primary tumor cells. In total, the dataset has 1,706,179 records with 284,674 non-zero outputs from 9,600,000 trials of 24 types of simulations. Input and output data are linked by the identification number of the simulation (ID_Simulation).

Table 3 The results of simulations in the file "data_base_MODELS_PRIMARY.txt".

Table 3 shows the first several rows of the dataset for the results of simulations. It has information about the names of models, initial cells, the format of input parameters, and the identification number of the simulation, as well as the results of the simulations. The results are summary statistics such as variant allele frequencies that are used in bulk-cell signaling data [1], and they are listed in descending order of VAF for each driver gene. The dataset has the 5 largest values of VAF for each gene (APC, KRAS, TP53, and PIK3CA), but Table 3 shows only the first VAF value for each gene. The structure of the data in the file "data_base_MODELS_ALL.txt" is the same.
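As a quick check of the record counts quoted above, the following sketch computes the rates they imply. It assumes that each of the 1,706,179 records corresponds to one successful simulation; the variable names are illustrative and do not come from the dataset.

```python
# Counts quoted in the text.
TOTAL_TRIALS = 9_600_000
SUCCESSFUL_RECORDS = 1_706_179   # assumption: one record per successful simulation
NON_ZERO_RECORDS = 284_674       # records with a non-zero VAF for at least one gene

# Fraction of trials that produced any output.
success_rate = SUCCESSFUL_RECORDS / TOTAL_TRIALS
# Fraction of successful simulations with at least one non-zero VAF.
non_zero_given_success = NON_ZERO_RECORDS / SUCCESSFUL_RECORDS
# Fraction of all trials with at least one non-zero VAF.
non_zero_rate = NON_ZERO_RECORDS / TOTAL_TRIALS

print(f"successful trials:            {success_rate:.1%}")
print(f"non-zero among successful:    {non_zero_given_success:.1%}")
print(f"non-zero among all trials:    {non_zero_rate:.1%}")
```

Under that assumption, roughly one trial in six produces output and only about 3% of all trials give a non-zero VAF, which is consistent with the need for several million trials.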
Table 4 The dataset of initial parameters from the file "Initial_parameters_Continuous_ALL.txt".

Table 4 shows the first several rows of the dataset of initial parameters for each simulation. It has information about the probability of environmental death, the parameters of the sigmoid function for apoptosis death, etc., as well as the weights between cancer hallmarks and driver genes [2, 3]. The file "Initial_parameters_Discrete_ALL.txt" has the same structure, but the data are discrete.

Table 5 Keys for the weights between hallmarks and genes.

Table 5 shows how the weights from the input file correspond to the hallmarks of cancer. The weights are a quantitative representation of qualitative dependencies from the COSMIC database of somatic cancer genetics at high resolution [3]. For each simulation, the values of the weights from Table 5 are written as a row vector in Table 4. The files "Compaction_Factor_Discrete_ALL.txt" and "Compaction_Factor_Continuous_ALL.txt" have the same structure and contain the input data of compaction factors (Table 6). The hallmarks are denoted by abbreviations, for example, Ha is the apoptosis hallmark (see Table 5).

The fields are as follows:
• name_init - name of initial clones: "Mutated_cell", "Thousand_cells" or "Mutated_cell_in_Thousand_cells".
• ID_Simulation - identification number of a simulation.
• Mutated_Gene - the name of a driver gene for an initial cell.
• coefficients and initial probabilities in the simulator tugHall [2]:
  • E0 - the environmental variable, which gives the maximum number for logistic growth as 1/E0,
  • F0 - this parameter serves to extend the maximum cell number defined by E0, through angiogenesis,
  • m0 - parameter to define the probability of point mutation,
  • uo, us - the probabilities that oncogenes and suppressor genes are impaired by point mutations, respectively,
  • s0 - coefficient in the sigmoid function,
  • k0 - the probability of environmental death,
  • d0 - the initial probability of division.
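Since input and output records are linked through ID_Simulation, a result row can be matched with its initial parameters by that key. The sketch below illustrates the idea on small in-memory samples. It assumes tab-separated files with a header row, and the columns "VAF_APC" and the sample values are hypothetical; the real files should be inspected before adapting it.

```python
import csv
import io


def read_rows(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_inputs(output_rows, input_rows, key="ID_Simulation"):
    """Add the input parameters with the same simulation ID to each output row."""
    inputs_by_id = {row[key]: row for row in input_rows}
    merged = []
    for out in output_rows:
        params = inputs_by_id.get(out[key])
        if params is not None:  # skip outputs without matching inputs
            merged.append({**params, **out})
    return merged


def make_table(rows):
    """Build tab-separated sample text from a list of rows."""
    return "\n".join("\t".join(r) for r in rows)


# Hypothetical samples; the column layout of the real files may differ.
OUTPUT_SAMPLE = make_table([
    ["ID_Simulation", "name_init", "VAF_APC"],
    ["1", "Mutated_cell", "0.25"],
    ["1", "Thousand_cells", "0.0"],
    ["3", "Mutated_cell", "0.1"],
])
INPUT_SAMPLE = make_table([
    ["ID_Simulation", "E0"],
    ["1", "0.001"],
    ["2", "0.002"],
])

merged = attach_inputs(read_rows(OUTPUT_SAMPLE), read_rows(INPUT_SAMPLE))
for row in merged:
    print(row)
```

The sketch looks up the inputs for each output row instead of indexing the outputs by ID, because one simulation ID may appear in several output rows (for example, for different initial clones).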

Experimental Design, Materials and Methods
Fig. 1 shows the flowchart of the simulation procedure, which uses 4 models, 3 initial conditions, and 2 types of input data. Therefore, there are 24 types of simulations (Table 2). The procedure includes 4 models with a common part, which are fully described in the supplementary materials of the manuscript [2]. The models differ in a few conditions for the hallmarks and the initial clones. Models are divided by two criteria:

- The presence of compaction factors: with compaction factor or without compaction factor. The hallmark variables are built from the products $g_k \cdot w_k$, where $g_k = 1$ when the function of gene $k$ is destroyed and $g_k = 0$ for the normal state, and $w_k$ is the weight of the related gene. The index $x$ denotes the hallmark [4].
- The invasion/metastasis transformation condition: im = 1 (strong condition) or im > 0 (weak condition).

Table 7 Relations between models' names and parameters of models: the condition of invasion/metastasis transformation and presence or absence of compaction factors.
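The role of the products of gene state and weight can be illustrated with a minimal sketch. Two points here are assumptions rather than statements from the text: that the products are summed over the genes related to a hallmark, and that a compaction factor enters as a multiplier of that sum. The function name and the weight values are hypothetical.

```python
def hallmark_value(gene_states, weights, compaction_factor=1.0):
    """Combine gene states g_k (0 or 1) with weights w_k for one hallmark.

    Assumption: the products g_k * w_k are summed, and the compaction
    factor scales the sum (1.0 stands for "without compaction factor").
    """
    total = sum(gene_states.get(gene, 0) * w for gene, w in weights.items())
    return compaction_factor * total


# Hypothetical weights of three genes for one hallmark.
weights = {"APC": 0.3, "KRAS": 0.5, "TP53": 0.2}
# g_k = 1 when the function of gene k is destroyed, 0 for the normal state.
gene_states = {"APC": 1, "KRAS": 0, "TP53": 1}

print(hallmark_value(gene_states, weights))
print(hallmark_value(gene_states, weights, compaction_factor=0.5))
```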
So, there are 4 models: with and without compaction factors, and with the strong or weak condition of invasion/metastasis transformation (Table 7). The type of input data, discrete or continuous, defines the value type of the weights (Fig. 1). First, the initial parameters are generated and saved to the files "Initial_parameters_Discrete.txt" and "Initial_parameters_Continuous.txt"; the files "Compaction_Factor_Continuous.txt" and "Compaction_Factor_Discrete.txt" contain the values of the compaction factors for each simulation.
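The 24 types of simulations follow from crossing the two model criteria, the three initial clones, and the two weight types. The sketch below enumerates them; the names of the initial clones are taken from the dataset description, while the labels for the model criteria and the weight types are placeholders, since the actual model names are given in Table 7.

```python
from itertools import product

# Placeholder labels for the two model criteria (see Table 7 for real names).
COMPACTION = ["with_compaction_factor", "without_compaction_factor"]
INVASION_CONDITION = ["strong", "weak"]
# Names of initial clones as used in the dataset.
INITIAL_CLONES = ["Mutated_cell", "Thousand_cells", "Mutated_cell_in_Thousand_cells"]
# Placeholder labels for the two types of input data.
WEIGHT_TYPES = ["Discrete", "Continuous"]

TRIALS_PER_TYPE = 400_000

simulation_types = list(
    product(COMPACTION, INVASION_CONDITION, INITIAL_CLONES, WEIGHT_TYPES)
)
total_trials = len(simulation_types) * TRIALS_PER_TYPE

print(len(simulation_types), "types of simulations")
print(total_trials, "trials in total")
```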
The dataset has three cases for initial clones:

CASE I: "Mutated_cell". With few exceptions, the tumor cell populations in a human, including metastatic ones, originate from only one cell (clonal mutations) [5], so the clones usually have a single common ancestor. That is why one option is to start from just one primary cell. If the simulation starts from 1 cell, however, the cell population becomes extinct in most cases. In the case of extinction, the simulation is automatically restarted (by default, the number of restarts is set to 100). The restarting function is implemented as an additional part of this case. To accelerate the simulation, the initial primary cell has a driver mutation in one gene (the column "Mutated_Gene" in Table 4).

CASE II: "Thousand_cells". Another option is to start from 1000 primary cells in order to increase the probability of mutation in one simulation and decrease the computational cost. In this case, however, several tumors may originate from different normal cells, and these tumors do not share any mutations.

CASE III: "Mutated_cell_in_Thousand_cells". The third case is a combination of the two previous cases: the simulation starts with 1000 primary cells, one of which has a driver mutation.
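The restart logic of CASE I can be sketched with a toy branching process. This is not the tugHall model: the division and death probabilities, the step limit, and the population cap are hypothetical, and interpreting the default of 100 as the maximum number of restarts after the first run is an assumption.

```python
import random


def simulate_once(rng, p_divide=0.55, p_die=0.45, steps=50, cap=1000):
    """Toy branching process starting from one cell (not the tugHall model)."""
    cells = 1
    for _ in range(steps):
        next_cells = 0
        for _ in range(cells):
            r = rng.random()
            if r < p_divide:
                next_cells += 2          # the cell divides
            elif r >= p_divide + p_die:
                next_cells += 1          # the cell survives unchanged
            # otherwise the cell dies and contributes nothing
        cells = next_cells
        if cells == 0 or cells >= cap:   # extinction or large population
            break
    return cells


def run_with_restarts(rng, max_restarts=100, **kwargs):
    """Repeat the simulation after extinction, up to max_restarts restarts."""
    for attempt in range(1, max_restarts + 2):  # first run plus restarts
        cells = simulate_once(rng, **kwargs)
        if cells > 0:
            return cells, attempt
    return 0, max_restarts + 1


cells, attempts = run_with_restarts(random.Random(1))
print(f"{cells} cells after {attempts} attempt(s)")
```

With these toy probabilities most single runs end in extinction, so the restart loop is what makes a run from one cell practical.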
The flowchart in Fig. 1 shows the procedure of each simulation. tugHall gets the initial parameters and, if needed, the compaction factors related to the simulation ID. The weights for the hallmark-gene relations are generated in accordance with statistical data from the Catalogue of Somatic Mutations in Cancer [3]. Then it chooses the case of the initial cell and a model. After the simulation, tugHall saves the VAF to the file. Finally, approximate Bayesian computation (ABC) uses the dataset of VAFs to obtain personalized weights related to the VAF of a patient; for example, in our works [2, 6] we used the VAFs of patients from the open datasets of the Cancer Genome Atlas [7].
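The last step can be illustrated with a generic rejection-style ABC sketch: simulations whose VAFs are closest to the observed VAFs are accepted, and their weights are averaged. The procedure used in [2, 6] may differ in its distance, acceptance rule, and summary statistics; the function name, the Euclidean distance, and all numbers below are assumptions for illustration only.

```python
import math


def abc_rejection(simulations, observed_vaf, accept_fraction=0.01):
    """Return the mean weights of the simulations closest to the observation.

    simulations: list of (weights, vaf) pairs, both sequences of floats.
    """
    # Rank simulations by Euclidean distance between simulated and observed VAFs.
    ranked = sorted(simulations, key=lambda s: math.dist(s[1], observed_vaf))
    n_keep = max(1, int(len(ranked) * accept_fraction))
    accepted = [weights for weights, _ in ranked[:n_keep]]
    n_weights = len(accepted[0])
    # Posterior mean estimate: average each weight over the accepted set.
    return [sum(w[i] for w in accepted) / n_keep for i in range(n_weights)]


# Hypothetical toy data: (weights, VAFs for APC, KRAS, TP53, PIK3CA).
simulations = [
    ([0.1, 0.9], [0.50, 0.00, 0.00, 0.00]),
    ([0.8, 0.2], [0.10, 0.10, 0.00, 0.00]),
    ([0.6, 0.4], [0.12, 0.10, 0.00, 0.00]),
]
observed_vaf = [0.10, 0.10, 0.00, 0.00]

print(abc_rejection(simulations, observed_vaf, accept_fraction=0.67))
```

Because accepted simulations are a small fraction of all records, the quality of such an estimate depends directly on the number of simulations with non-zero VAFs.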
For the calculations, the supercomputer SHIROKANE was used [8]. To obtain a large number of simulations, we designed a new version, tugHall v.2.1 [6], which accelerates the calculations by up to 10^4 times in comparison with version 1.0. The R script uses the parallel library, which allows parallel simulation on one node/computer with many cores. To use multiple nodes, an array job was used (one job per node, with parallel simulations on each node). The number of jobs was 40, with 10,000 simulations per job for each type of simulation. The computational cost was around 44-47 h per job, that is, per node with 24 cores. In total, 960 processors were used for 47 h for 9.6 million trials of 24 types of simulations.
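The figures in this paragraph are mutually consistent, as the following arithmetic sketch shows. The core-hour total uses the upper bound of 47 h for every job, so it is an upper estimate; the variable names are illustrative.

```python
NODES = 40                      # one array job per node
CORES_PER_NODE = 24
SIMULATIONS_PER_JOB_PER_TYPE = 10_000
SIMULATION_TYPES = 24
MAX_HOURS_PER_JOB = 47          # upper bound of the quoted 44-47 h

total_cores = NODES * CORES_PER_NODE
trials_per_type = NODES * SIMULATIONS_PER_JOB_PER_TYPE
total_trials = trials_per_type * SIMULATION_TYPES
core_hours_upper_bound = total_cores * MAX_HOURS_PER_JOB

print(total_cores, "cores")
print(trials_per_type, "trials per type of simulation")
print(total_trials, "trials in total")
print(core_hours_upper_bound, "core-hours at most")
```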

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.