Modeling metastatic progression from cross-sectional cancer genomics data

Abstract Motivation Metastasis formation is a hallmark of cancer lethality. Yet, metastases are generally unobservable during their early stages of dissemination and spread to distant organs. Genomic datasets of matched primary tumors and metastases may offer insights into the underpinnings and the dynamics of metastasis formation. Results We present metMHN, a cancer progression model designed to deduce the joint progression of primary tumors and metastases using cross-sectional cancer genomics data. The model elucidates the statistical dependencies among genomic events, the formation of metastasis, and the clinical emergence of both primary tumors and their metastatic counterparts. metMHN enables the chronological reconstruction of mutational sequences and facilitates estimation of the timing of metastatic seeding. In a study of nearly 5000 lung adenocarcinomas, metMHN pinpointed TP53 and EGFR as mediators of metastasis formation. Furthermore, the study revealed that post-seeding adaptation is predominantly influenced by frequent copy number alterations. Availability and implementation All datasets and code are available on GitHub at https://github.com/cbg-ethz/metMHN.


Introduction
Metastasis is the primary cause of cancer-related death.It occurs as tumors evolve, when the primary lesion extends beyond its initial boundaries, invading adjacent healthy tissues, lymph nodes, and blood vessels.Cancer cells can then enter the bloodstream and spread to different locations within the body.At these new sites, the disseminated cells face novel selective pressures, leading to the elimination of many, but not all, cells.The survivors adapt and eventually colonize these foreign tissues, forming metastases (Lambert et al. 2017).This last step, the establishment of a (detectable) metastasis at a distant site, is what is commonly referred to as metastatic seeding.The development of cancer, or tumorigenesis, is predominantly driven by the progressive accumulation of genomic alterations, including somatic mutations and copy number alterations in cancer driver genes (Weinberg 2014).These alterations often result in divergent genotypes between a primary tumor and its associated metastasis.Extensive clinical sequencing efforts like the MSK-MET study (Nguyen et al. 2022) recently compiled genomic data from primary tumors and metastases.In principle, such datasets may inform about the timing and genetic mechanisms of metastasis formation, but revealing these pieces of information is challenging.
Cancer progression models aim to infer interactions between genomic alterations based on their co-occurrence patterns in cross-sectional data.Such models can then be used to both predict the future progression of tumors as well as to explain the past by inferring the order in which observed alterations accumulated.These models have their roots in the pioneering work of Fearon and Vogelstein (1990).Since then, a variety of models and algorithms have emerged to refine and expand upon this concept.They include Conjunctive Bayesian Networks (Beerenwinkel et al. 2007), CAPRI (Ramazzotti et al. 2015), Network Aberration Models (Hjelm et al. 2006), HyperTraPS (Greenbury et al. 2020), and Mutual Hazard Networks (Schill et al. 2020).All of these models only consider the progression of a single sequence and thus cannot capture the divergent, branching patterns characteristic of metastatic disease progression.Therefore none of the above mentioned models can leverage the information provided by matched primary tumor and metastasis samples from the same patient.Methods like REVOLVER (Caravagna et al. 2018) or TreeMHN (Luo et al. 2023) can account for this branching behavior as they model evolution of tumors on a clonal level.However, they require phylogenetic data and are not explicitly designed to model metastatic branching.
Here, we present Mutual Hazard Networks for metastatic disease (metMHN), a cancer progression model that captures the branching progression observed in primary tumors and their metastatic offshoots.The model is designed to infer

Materials and methods
metMHN extends the Mutual Hazard Network (MHN) framework, originally introduced by Schill et al. (2020) and further developed by Schill et al. (2023), which models the progression of primary tumors.We first establish the notation employed by MHNs and then introduce metMHN.

Mutual hazard networks
MHNs (Schill et al. 2020) model the progression of primary tumors as a continuous-time Markov chain (CTMC) fXðtÞ; t ≥ 0g on the binary state space f0; 1g n .Specifically, a state x 2 f0; 1g n represents a cancer genome, where x i ¼ 1 indicates that event i 2 f1; . . .; ng (e.g. a somatic driver mutation or copy number alteration) was detected in the cancer genome, whereas x i ¼ 0 indicates that it was not detected.metMHN thus models the progression of consensus mutational profiles, without accounting for subclonal structure.Let pðtÞ 2 ½0; 1� 2 n denote the probability distribution over states at time t, where the states are ordered lexicographically.The evolution of the probability distribution over time is governed by the Kolmogorov forward equation Here pð0Þ denotes the distribution over states at the start of the progression.It is assumed that all tumors start in a wild type state, where no event has occurred yet, thus pð0Þ ¼ ð1; 0; . . .; denotes the transition rate matrix on the state space.Events are assumed to accumulate irreversibly and one at a time.Therefore, the only non-zero off-diagonal entries of Q are the transition rates from states that differ by exactly one event i.The transition rates are parameterized by a much smaller matrix Θ 2 R n × n ≥ 0 as Here Θ i;i denotes the base rate with which event i spontaneously occurs in a tumor and Θ i;j the multiplicative effect of the presence of event j on the rate of event i.No assumption is made about the biological mechanisms underlying such rate changes.However, within the context of this specific analysis, rate changes between mutational events may represent evolutionary dependencies (Mina et al. 2022) and positive rate changes between copy number events progressively increasing levels of chromosomal instability (Potapova et al. 2013).The age of a tumor at the time of its diagnosis is unknown.In Schill et al. (2020), it is assumed to be exponentially distributed with mean 1 and independent of the state of the tumor.Marginalizing over t in the solution of Equation ( 1) yields the time-marginal distribution (3) where I denotes the identity matrix.Let p x denote the probability of observing a tumor in state x.Then the average loglikelihood for a dataset D of tumor states is defined as The matrix Q does not need to be stored explicitly, because it can be written as a sum of tensor products.By using tensor operations, p can be calculated efficiently and Θ can be learned with a time and space complexity only exponential in the number of events that have occurred for each tumor in the dataset, rather than exponential in 2n (Buis andDyksen 1996, Schill 2022).Recently (Klever et al. 2022, Georg 2022, Pfahler et al. 2023) reduced the complexity further to n 3 using modern tensor formats and thus made MHN applicable to even larger state spaces.
Clearly, a tumor can only appear in a dataset after it has been clinically detected.This detection, in turn, is influenced by the tumor's genotype, as certain mutations can induce growth or alter the tumor's morphology.Such changes may result in symptoms that lead to the tumor's discovery, followed by its diagnosis, surgical removal, and eventual sequencing.Therefore the rate of observation should be dependent on the state of the tumor.In Schill et al. (2023), the observation of a tumor was introduced as a separate event with its own set of parameters Ω 2 R n > 0 .The observation of a state x happens at a rate u x ¼ Q xj¼1 Ω j , where Ω j is a multiplicative effect of the presence of event j on the rate of observation.On the other hand multiplicative effects of the observation on other events are set to 0. Thus, as soon as the observation event occurs, progression is halted.States where the observation occurred are thus absorbing states of the Markov chain.Then the probability distribution at observation is equal to the stationary distribution pð1Þ and given by pð1Þ and Q and pð0Þ defined as in Equation (3) (Schill et al. 2023).

metMHN
With metMHN, we model the joint progression of primary tumors and metastases as a Markov process on the combined event space of both tumor entities (see Fig. 1b).Formally, we consider a CTMC fXðtÞ; t ≥ 0g on the state space S :¼ ff0; 1g × f0; 1gg n × f0; 1g.A state x 2 S is represented by a bit string of length 2n þ 1.Each of the n progression events is encoded by two bits.The first bit x iP indicates the status of event i 2 f1; . . .; ng in the primary tumor, and the second bit x i M indicates the status of event i in the metastasis.We use the notations PTðxÞ ¼ ðx i P Þ and MTðxÞ ¼ ðx i M Þ for i in f1; . . .; ng to refer to the genotypes of the primary tumor and the metastasis respectively.The ðn þ 1Þ th event is encoded by one bit only and indicates the status of the metMHN seeding event.In the model context, the seeding event denotes that the progression of the metastasis has become decoupled from the progression of the primary tumor.Analogous to MHN we parameterize all transition rates by a lowdimensional set of parameters Θ 2 R ðn þ 1Þ × ðn þ 1Þ , where Θ i;i refers to the base rate of event i and Θ i;j to the effect of event j on the rate of event i.Before and after the seeding of a metastasis we assume different transition dynamics, which we describe in the following paragraphs.
Prior to seeding, the (soon-to-be) metastasis is identical to the primary tumor.Thus, events occur simultaneously in the primary tumor and the metastasis.Formally, we can describe these dynamics by a CTMC on the subspace S 0 :¼ ff0; 1g × f0; 1gg n × f0g � S with transition rate matrix . Let x ¼ ð. . .; x ði − 1Þ M ; 0; 0; x ði þ 1Þ P ; . . .; 0Þ and x þ i P þ i M :¼ ð. . .; x ði − 1Þ M ; 1; 1; x ði þ 1Þ P ; . . .; 0Þ be states that differ by exactly one event i. Transitions from states x to states In the top-left section, we show the types of input data that metMHN processes.Each row corresponds to a patient, each column to an event in the primary tumor (blue) or the metastasis (red).Events are represented by symbols and their status is encoded with a '1' for present, '0' for absent, or left blank if a tumor is unobserved.On the right, we present the primary output of metMHN: A network of interactions between events in matrix form.In the lower section, we show the most probable chronological ordering in which events accumulated in observed data points as inferred by metMHN.The progression trajectory of the primary tumor is indicated by blue arrows, while the trajectory of the metastasis is marked by red arrows.(b) The metMHN process and its state space: Black-bordered squares represent full states: the two compartments on the left detail the status of the primary tumor, the two on the right correspond to the metastasis, and the central diamond symbolizes the seeding event.The diagram is divided into two subspaces, with the left half constituting the subspace S 0 and the right half comprising the subspace S 1 .Transitions between states that occur at non-zero rates are shown as solid black arrows.Transitions that are not possible in S 0 but are possible in S 1 are indicated by greyedout arrows.Dotted arrows highlight transitions that influence the seeding event specifically.
All other transitions within S 0 are prohibited (rate 0).After seeding, the primary tumor and the metastasis are separate tumors and we assume that both accumulate mutations independently of each other.Formally, we describe the post-seeding dynamics by a CTMC on the subspace S 1 ¼ ff0; 1g × f0; 1gg n × f1g � S. We introduce two transition rate matrices Q P and . Q P holds the rates for transitions that change only the primary tumor part of a state x: x ði þ 1Þ P ; . . .; 1Þ occur at rate Note that transition rates in Q P only depend on the primary tumor genotype PTðxÞ and not on the full state x.Since events must occur one at a time, all other transitions on S 1 that affect the primary tumor occur at rate 0. Q M holds the rates for transitions that change only the metastasis part of a state x.We assume that metastatic tumors spread to foreign sites and face novel selective pressures that can differ drastically from the original site.We account for this by explicitly modeling effects from the seeding event on the progression events.Progression events occur in the metastasis at a rate given by the product of their base rates, the effects of events that are present in the metastasis and the effect of the new environment.Hence, transitions from states x ¼ ð. . .; x ði − 1Þ M ; x i P ; 0; x ði þ 1Þ P ; . . .; 1Þ to states x þ i M ¼ ð. . .; x ði − 1Þ M ; x i P ; 1; x ði þ 1Þ P ; . . .; 1Þ occur at rate All other transitions on S 1 that affect the metastasis are prohibited (rate 0).The full transition rate matrix on S 1 is then given by the sum of Q P and Q M .
By construction, the last event that occurs jointly and at the same time in a primary tumor and metastasis is the seed- denote the transition rate matrix that holds the rates for all transitions from states x ¼ ðx 1M ; . . .; x nM ; 0Þ in S 0 to their corresponding states x þ S ¼ ðx 1M ; . . .; x nM ; 1Þ in S 1 .Such transitions occur at rate See Fig. 1b for an illustration of the state space for n ¼ 2. The transition rate matrix on the full state space S is then and we denote the probability distribution over states at time t by pðtÞ.Following Klever et al. (2022), we also provide formulas for the matrices Q 0 ; Q S ; Q P ; Q M as sums of tensor products in Supplementary Section S1.By using these tensor structures in conjunction with the methods outlined in Schill (2022, Appendix A), the model parameters can be learned with a time and space complexity only exponential in the number of events that have occurred for each sample in the dataset, rather than exponential in 2ð2n þ 1Þ.

Modeling consecutive observations
Following Schill et al. (2023), we model the observation of tumors explicitly as events.Since we model two tumors that at some point evolve independently and can also be observed separately, we have to include two distinct observation events.Thus we now model a CTMC on the extended state space S D :¼ S × f0; 1g 2 .Let � pðtÞ denote the probability distribution over states on the extended state space at time t.We assume that each event has a multiplicative effect on the rate of observation of the tumor it occurred in.Since the events that lead to the detection of a primary tumor can be vastly different from the effects that lead to the detection of a metastasis, we introduce two separate parameter vectors > 0 that contain the effects of progression events in the primary tumor and the metastasis on the rates of their respective observation event.The primary tumor and the metastasis observation rates are defined as denote the diagonal matrices that hold the observation rates for primary tumors and metastases respectively and U S ¼ U P þ U M .We define that a metastasis is not observable prior to the seeding.Therefore, we set the rates of observation of metastases for such states to 0. We are interested in the distribution of the full system at the time of first observation, which can be triggered by either primary tumor or metastasis.We calculate this analogously to Schill et al. (2023) as the stationary distribution � p on the extended state space S D where each of the observation events halts the progression of the entire system.Each state where either observation occurred becomes an absorbing state.Thus the entire probability mass is located on the sets of states O P ¼ S × ð1; 0Þ (primary tumor is observed) and O M ¼ S × ð0; 1Þ (metastasis is observed).Analogous to Equation (5), we therefore have (13) In most cases, there is a considerable time lag between the observation of a primary tumor and the observation of its metastatic offspring.To account for this, we model two consecutive observations.Consider the case where the primary tumor is observed first with genotype x P 2 f0; 1g n and the metastasis is only observed at a later point in time with genotype x M 2 f0; 1g n .In this case the metastasis is unobservable metMHN i143 at the time of primary tumor observation, and thus we are interested in the metastasis marginal probability � p Po of only observing a primary tumor x P , given by Note that each tumor in a dataset is observed exactly once and no information about its subsequent progression is available.Therefore we do not track the progression of the primary tumor after its observation.Instead from here on, we only model the progression of the still unobserved metastasis.
To do so, we first calculate the distribution of metastasis genotypes at the time of primary tumor observation conditioned on the observed primary tumor genotype, which is given by In words, we set the probability of all states where the primary tumor genotype is not equal to the observation to 0, and then renormalize the resulting vector to obtain the desired conditional distribution.Next analogously to Schill et al. (2023) we propagate the distribution of unobserved metastases forward in time, until the metastasis is observed.This yields Finally, the probability to observe a primary tumor and metastasis pair in state x, given that the primary tumor was observed first is By an analogous calculation, the probability to observe a primary tumor and metastasis pair in state x, given that the metastasis was observed first is given by If the order of observation is not recorded, then we evaluate the total probability to observe state x as Equations (18-20) give the probabilities of observing pairs of genotypes.However, often only a single genotype is available, whereas the other is missing.Such individual data points are incorporated by first calculating the full joint distributions over all states and then by marginalizing over the missing genotypes.First consider the case, where only a primary tumor is observed with genotype x P , then marginalization over the unobserved metastasis genotypes yields If a metastasis was observed but not sequenced, then we do not need to sum over all states, but only over states in S 1 .
Conversely, if evidence for the complete absence of metastases is available, then we only need to sum over states in S 0 .Next, consider the case where only a metastasis is observed with genotype x M , then marginalizing over the unobserved primary tumor genotypes yields Since a metastasis is observed, we know that seeding must have occurred and therefore we only need to sum over states in S 1 .

Parameter estimation
The average log-likelihood of a dataset D containing primary tumor and metastasis pairs as well as single genotypes is given by where We infer the parameters Θ; Ω M ; Ω P from data via maximum likelihood estimation.We follow (Schill et al. 2023) and utilize the penalization penalðΘ; ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi with The penalty promotes sparsity as the logarithmic parameters are shrunk to 0. Additionally, it promotes symmetry as effects between events i and j are grouped and selected together.We then optimize via gradient ascent.The hyper parameter λ is selected via 5fold cross validation.

Results
We first assessed metMHN's ability to recover parameters in simulations.The exact simulation setup and the results are shown in Supplementary Section S2.Next, to further our understanding of metastatic spread in lung adenocarcinomas, we trained metMHN on 4852 paired and unpaired samples from the LUAD cohort of the MSK-IMPACT study.In the i144 Rupp et al.
following section we describe the dataset and then present our key findings.

Data preparation
We retrieved the AACR GENIE 14.1 data release (Pugh et al. 2022) through synapse.org(The AACR Project Genie Consortium 2023).Our selection included all samples assayed at the Memorial Sloan-Kettering Cancer Center annotated with the ONCOTREE code 'LUAD' (Lung Adenocarcinoma).For primary tumors without corresponding metastasis samples, we retrieved information about their metastatic status from Nguyen et al. (2022) and excluded samples where the status of the metastasis was unknown.
The final dataset consisted of 453 matched primary tumor (PT)/metastasis (MT) samples, 2127 unpaired MT samples, 595 PT samples without corresponding metastases (seeding ¼ 0), and 1677 PT samples with metastases that were not sequenced (seeding ¼ 1).The three most highly mutated paired samples were excluded from our analysis due to computational challenges in processing them with metMHN.
In total, our study included 2725 PT and 2580 MT samples from 4852 patients.Metadata for each sample also included the age of the corresponding patient at which the sample was reported (see Supplementary Fig. S7).These data inform the model about the order of observation of primary tumors and metastases in the same patients.When multiple PT or MT samples were present, we chose the PT sample with the youngest sampling age and the MT sample with the oldest sampling age.Genomic data consisted of somatic mutation data and segmented log R ratio (LRR) copy number data derived from single-region bulk sequencing using the targeted MSK-IMPACT panel (Cheng et al. 2015).We annotated mutation data using OncoKB (Chakravarty et al. 2017) and filtered for variants likely to be functional, as outlined in Schill et al. (2023).Our analysis was restricted to genes consistently included in all versions of the MSK-IMPACT panel (The AACR Project Genie Consortium 2023).Specifically, we examined mutations in the top 20 most frequently mutated genes.In the case of copy number alterations, we initially normalized segmented copy number data using mecan4CNA (Gao and Baudis 2020).Amplifications were identified with LRR values corresponding to relative copy number gains ≥ 0:5.Conversely, deletions were marked by LRR values corresponding to relative copy number losses ≤ − 0:5.We determined the precise minimal intervals necessary for a copy number event classification in 8 instances, based on the minimal commonly altered regions per chromosome arm and gene extents.For amplifications, we required full gene extents to be covered by an alteration, whereas for deletions we allowed for shorter intervals.In total, our study considered 28 distinct genomic events, including mutational events ('M'), copy number amplification ('Amp') and deletion ('Del') events.Binary event input data, alongside exact interval definitions for copy number events, records of the selected patients and samples and preparation scripts are accessible at https://github.com/cbg-ethz/metMHN.

Effects between genomic events and seeding
On the dataset described above, we fit metMHN and tuned the hyperparameter λ in a 5-fold cross-validation (Fig. 2).Reassuringly, the LUAD model confirms several interactions well-documented in the literature.Specifically, it identifies the strongly mutually suppressive relationship (evidenced by a bidirectional negative edge) between the principal drivers KRAS (M) and EGFR (M) (Unni et al. 2015;Skoulidis and Heymach 2019).Our model infers that EGFR suppresses further mutational co-drivers, which suggests that it might often be sufficient for progression.Instead, EGFR-driven LUADs frequently exhibit disruption of cell cycle regulation through copy number losses in RB1 and CDKN2A, two patterns also described in Nahar et al. (2018).
The model further highlights synergistic interactions that reflect established oncogenic processes, such as the rate increases observed between STK11 (M) and KEAP1 (M), and between TP53 (M) and RB1 (M) (Offin et al. 2019, Wohlhieter et al. 2020, Cai et al. 2022).metMHN also infers multiple positive interactions between gene mutations and corresponding copy number alterations, exemplified by the interaction between EGFR (M) and amplification of EGFR/ 7p, as well as between STK11 (M) and deletion of STK11/ 19p-a pattern commonly seen across various cancers (Becchi et al. 2023).Additionally, the model reflects that several mutational events capable of activating the (RTK)-RAS-RAF-MEK signaling pathway-namely, KRAS (M), EGFR (M), NF1 (M), BRAF (M), and MET (M)-tend to promote the observation of primary tumors and suppress each other's occurrence (Imperial et al. 2019).

metMHN identifies drivers of metastasis
We next examined the interactions between genomic events and metastatic seeding.The outgoing edges from the seeding event (rightmost column in Fig. 2) represent the cancer cell's adaptive response to the changing selective pressures encountered during its journey from the primary tumor to the metastatic site.The incoming edges into the seeding event (bottom row in Fig. 2) indicate how particular mutations within the primary tumor may accelerate or impede the metastatic seeding rate, thereby pinpointing genetic elements that either drive or hinder metastasis development.
metMHN identifies mutations and amplifications in EGFR, along with TP53 mutations and deletions, and MET mutations, as accelerators of metastasis formation, as indicated by positive edges (ie, promoting effects) from these events to the seeding event (Fig. 2).These findings are substantiated by experimental evidence which indicate that activation of EGFR (Che et al. 2015, Tsai et al. 2015), inactivation of TP53 (Wang et al. 2009, Powell et al. 2014), and activation of MET (Chang et al. 2015, Yin et al. 2019) enhance the metastatic capacity of lung cancer cells.Beyond these events, metMHN also revealed that various other copy number alterations positively influence the seeding process.Although widespread aneuploidy is typically regarded as a hallmark of advanced cancer stages (Ben-David and Amon 2019), specific copy number changes, like CDKN2A deletions, have been documented to sometimes occur early in lung adenocarcinoma development (Nahar et al. 2018, Watkins et al. 2020).In this context we also note metMHN's inference that copy number events generally do not substantially affect the primary tumor observation rate but indeed promote metastasis observability.
Interestingly, the effects promoting metastasis were relatively modest when compared to the base rate of seeding.This observation suggests that certain genetic or non-genetic drivers of the metastatic process might not be accounted for in the model.Alternatively, this could also indicate that metMHN i145 primary tumor cells may inherently possess a propensity to metastasize, as suggested by Klein (2020).Lastly, metMHN suggests that upon the seeding of metastases, the accumulation rates of many mutational events tend to decrease.This pattern could imply that once the metastatic process is initiated and in progress, there is diminished pressure for further mutational driver alterations, compared to the initial stages of primary tumorigenesis (Dymerska and Marusiak 2024).

Relative timing of progression events and seeding
We computed the most likely chronological sequences of events for 313 paired data points and 2,127 unpaired metastases, where we limited our analysis to cases where calculations were feasible.For the paired data points the orderings are branched, as exemplified in Fig. 3a.Prior to seeding, events happen jointly in the primary tumor.Upon seeding, the trajectory splits into a primary tumor branch and a metastasis branch (blue lower and red upper branches in Fig. 3a, respectively).The unpaired metastases' orderings are linear.
Next, we analyzed the distribution of event positions in metastasis genotypes, relative to trajectory lengths (Fig. 3b): The plots show for every event how often it occurred for each relative time point, where the left end of the axes corresponds to the beginning and the right to the end of progression.Well-established and highly frequent mutational drivers of LUAD progression, such as KRAS (M), EGFR (M), and TP53 (M) appear consistently early as initiating events.We find similar patterns for less frequent mutational events, such as MET (M) and SETD2 (M).Some events rarely appear as initiators, but still mostly occur in the early half of any sequence, such as STK11 (M) and BRAF (M).For example, RB1 (M) rarely happens spontaneously, which is reflected by its low base rate.However, it is promoted by both EGFR (M) and TP53 (M) and thus tends to happen subsequently, see Fig. 2 and Supplementary Fig. S5.Crucially, metastatic seeding was observed to happen at varying stages, with the majority of trajectories showing genomic progression both before and after seeding.On the late end of the spectrum we mainly find copy number events.After the first such event happens, it usually promotes other copy number events (see Fig. 2), leading to compounding rate increases for copy number events towards the end of a typical trajectory, possibly Figure Interactions between progression events in lung adenocarcinomas.The log-effects on observation (clinical detection) of the primary tumor and metastasis ω P and ω M are plotted in the first two rows, the remaining matrix shows the log-interaction strengths among genomic events θ.Promoting effects are colored in green and suppressive effects are colored in orange.The base rates of all events are plotted on the left (in blue).The effects an event i exerts on other events j are collected in the ith column (outgoing edges).Vice versa, the effects that events j exert on event i are collected in the ith row (incoming edges).Effects of genomic events on seeding are shown in the bottom row.Vice versa, effects from seeding on genomic events are shown in the rightmost column.(c) Empirical evidence from paired samples and pre-seeding probabilities estimated by metMHN through simulation.The first and second column show the pre-seeding probabilities estimated by metMHN conditioned on the event being observed in the primary tumor (column 1) or the metastasis (column 2).Column 3 shows the number of occurrences for each event in the paired data, column 4 shows the proportions of shared versus private occurrences for each event in the paired data.Columns 5 and 6 show the mean variant allele frequencies in the primary tumor and the metastasis, respectively.
metMHN i147 reflecting genomic instability in late stage cancers (Ben-David and Amon 2019).Next, we stratified the inferred metastasis trajectories by the 3 most prevalent initial events.Specifically, trajectories starting with TP53 (M), KRAS (M), and EGFR (M) at the first position accounted for 1766 patients or 72.38% of the analyzed metastases (see Supplementary Fig. S5).Remarkably, the subset of trajectories initiated by TP53 (M) included a significant number of tumors which seeded immediately after.These tumors then predominantly acquired copy number events.In a minority of cases, additional mutation events such as STK11 (M) and KEAP1 (M) occurred before seeding.Trajectories that began with KRAS (M) generally showed later seeding, frequently after the accumulation of other mutational co-drivers, including TP53 (M), STK11 (M), KEAP1 (M), RBM10 (M), and ATM (M).These trajectories too typically concluded with a series of copy number events.Conversely, trajectories initiated by EGFR (M) (right side) exhibited distinctly different progression patterns.Contrary to those beginning with KRAS (M), these trajectories rarely accumulated additional mutational events before seeding, with TP53 (M) being an exception.Post-seeding, the progression was once again dominated by copy number changes.However, these events followed characteristic sequences, often starting with EGFR/7p (Amp) and CDKN2A/9p (Del), then proceeding to TP53/17p (Del) and STK11/19p (Del), and culminating with the clinical detection of the tumor.

metMHN is consistent with clonality information
A key quality of metMHN is its ability to quantify the timing of seeding relative to other progression events.To validate this, we compared it with an orthogonal readout of metastatic development relative to mutational events: A mutation that predates the seeding of a primary tumor clone is expected to be clonal, i.e. exhibit a high variant allele frequency (VAF, close to 0.5) in subsequent metastases (Birkbak and McGranahan 2020).In contrast, mutations arising postseeding in metastases are more likely to be subclonal and thus exhibit lower VAFs.Therefore, we used per-gene mean VAFs in metastasis samples as a proxy for the relative timing (pre-or post-seeding) of the occurrence of mutations in the respective gene.To account for a bias in VAF distributions, we restricted VAF measurements to cases in which the respective gene was not copy number altered.We compared for each mutation its mean VAF with the model-derived probability that the event occurred prior to seeding.To this end, we approximated this probability through simulations using Gillespie's algorithm (Gillespie 1977).We found that mutational events with high pre-seeding probabilities in metastases corresponded to elevated VAFs in metastasis samples as evidenced by a Pearson correlation coefficient of 0.55 (P ¼ .01)see Fig. 3c and Supplementary Fig. S6.For a more detailed analysis of timing trends between pairs of genomic events, we provide a comparison of metMHN inferences with trends in phylogenetic analyses of TCGA primary tumors (Raynaud et al. 2018) in Supplementary Section S3.In summary, while metMHN builds on co-occurrence patters and does not leverage VAF information, they nevertheless produce results consistent with clonality information.

Discussion
We have presented metMHN, an efficient analytical model for cancer progression, specifically designed to investigate the forking progression paths of primary tumors and their metastatic offspring.metMHN capitalizes on the extensive crosssectional data available from clinical targeted sequencing and is able to infer relationships between events that are shared across individual samples.Our comprehensive analysis, encompassing data from nearly 5000 lung cancer patients, corroborates well-established relationships among key genomic drivers.In addition, metMHN successfully identifies specific events in primary tumors that may accelerate the development of metastases and quantifies how the dynamics of event accumulation change upon metastatic branching.Moreover, metMHN allows for the reconstruction as well as for the simulation of disease histories yielding further insight into the dynamics of metastatic cancers.
Every model's efficacy is inherently tied to the quality of its training data.While metMHN uses comprehensive crosssectional data from bulk tissue, this approach has its limitations, particularly in resolving the clonal structures of heterogeneous tumors.In metMHN, binary states represent the tumor as a whole.Consequently, two tumors with identical mutations will be interpreted identically by the model, even if, in one case, the mutations exist within the same clone, and in the other, they are in separate clones.In terms of what we define as seeding event, the most accurate biological interpretation would be the onset of genetic divergence between the metastasis-seeding cell and its most recent detectable ancestor in the primary tumor (Sun and Nikolakopoulos 2021).Phylogenetic methods which use multi-regional samples have an advantage in accurately timing this event.Furthermore, this notion of seeding does not necessarily correspond to the seeding cell leaving the primary tumor, nor does it necessarily correspond to the establishment of the seeding cell at its metastatic site (Sun and Nikolakopoulos 2021).
Another challenge arises when the training data does not accurately represent the patient population.For instance, an under-representation of metastatic tumors in the training data could lead to an underestimation of the base rate for the seeding event, falsely suggesting they occur later in the progression than they actually do, while an over-representation of these cases would have the opposite effect.In contrast, phylogenetic methods, which reconstruct tumor evolution on an individual basis, are less susceptible to biases in datasets.These methods also offer the advantage of resolving clonal structures, presenting a more detailed picture of tumor evolution.However, the scarcity of data, especially in multi-region sequencing studies, limits their ability to represent patient populations comprehensively.
The complexity of cancer progression can exceed the capabilities of metMHN, for example, when patients present with numerous metastatic lesions or harbor disseminated cells that have yet to form detectable metastases.Various factors, including treatment modalities, genetic predispositions, age, inflammation, and other comorbidity conditions may further influence disease progression.
In summary, metMHN is a cancer progression model providing a quantitative and dynamic description of tumor development and metastatic seeding.It can be learned from currently available large clinical genomic datasets comprising cross-sectional bulk sequencing data.Rupp et al.

Figure 1 .
Figure1.(a) Workflow of metMHN.In the top-left section, we show the types of input data that metMHN processes.Each row corresponds to a patient, each column to an event in the primary tumor (blue) or the metastasis (red).Events are represented by symbols and their status is encoded with a '1' for present, '0' for absent, or left blank if a tumor is unobserved.On the right, we present the primary output of metMHN: A network of interactions between events in matrix form.In the lower section, we show the most probable chronological ordering in which events accumulated in observed data points as inferred by metMHN.The progression trajectory of the primary tumor is indicated by blue arrows, while the trajectory of the metastasis is marked by red arrows.(b) The metMHN process and its state space: Black-bordered squares represent full states: the two compartments on the left detail the status of the primary tumor, the two on the right correspond to the metastasis, and the central diamond symbolizes the seeding event.The diagram is divided into two subspaces, with the left half constituting the subspace S 0 and the right half comprising the subspace S 1 .Transitions between states that occur at non-zero rates are shown as solid black arrows.Transitions that are not possible in S 0 but are possible in S 1 are indicated by greyedout arrows.Dotted arrows highlight transitions that influence the seeding event specifically.

Figure 3 .
Figure 3. (a) Event orders for five patients as inferred by metMHN.Events accumulate from left to right.Blue edges represent the primary tumor development, red edges the one of the metastasis.Distances between events do not correspond to real or estimated time.(b) Distribution of relative positions in trajectories of metastasis genotypes.The left end of the axes corresponds to the beginning, and the right to the end of progression.(c)Empirical evidence from paired samples and pre-seeding probabilities estimated by metMHN through simulation.The first and second column show the pre-seeding probabilities estimated by metMHN conditioned on the event being observed in the primary tumor (column 1) or the metastasis (column 2).Column 3 shows the number of occurrences for each event in the paired data, column 4 shows the proportions of shared versus private occurrences for each event in the paired data.Columns 5 and 6 show the mean variant allele frequencies in the primary tumor and the metastasis, respectively. i148 a single primary tumor; d ; if d is paired; primary obs: first; � p Mo > Po d ; if d is paired; metastasis obs: first; � p tot d ; if d is paired; obs: order unknown: