Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability

Current expression quantification methods suffer from a fundamental but undercharacterized type of error: the most likely estimates for transcript abundances are not unique. This means multiple estimates of transcript abundances generate the observed RNA-seq reads with equal likelihood, and the underlying true expression cannot be determined. This is called nonidentifiability in probabilistic modeling. It is further exacerbated by incomplete reference transcriptomes, where reads may be sequenced from unannotated transcripts. Graph quantification is a generalization of transcript quantification, accounting for reference incompleteness by allowing exponentially many unannotated transcripts to express reads. We propose methods to calculate a "confidence range of expression" for each transcript, representing its possible abundance across equally optimal estimates for both quantification models. This range informs both whether a transcript has potential estimation error due to nonidentifiability and the extent of the error. Applying our methods to the Human Body Map data, we observe that 35%–50% of transcripts potentially suffer from inaccurate quantification caused by nonidentifiability. When comparing the expression between isoforms in one sample, we find that the degree of inaccuracy of 20%–47% of transcripts can be so large that the ranking of expression between the transcript and other isoforms from the same gene cannot be determined. When comparing the expression of a transcript between two groups of RNA-seq samples in differential expression analysis, we observe that, after considering the ranges of the optimal expression estimates, the majority of detected differentially expressed transcripts are reliable, with a few exceptions.


INTRODUCTION
Despite the improvements of transcript expression estimation methods based on RNA-seq data (Li and Dewey, 2011; Hensman et al., 2015; Bray et al., 2016; Patro et al., 2017), the estimated transcript expression can still be inaccurate and uncertain. One source of uncertainty in expression estimation is that multiple sets of expression estimates can optimally explain the observed RNA-seq reads; therefore, the "best" estimate cannot be uniquely identified. The state-of-the-art methods to quantify transcript expression are based on probabilistic models, and, in probabilistic model inference terminology, the phenomenon of nonuniqueness of optimal parameters under infinite data is called model "nonidentifiability." In this work, we relax the concept and use this term to refer to the nonuniqueness of optimal parameters under a given finite data set. See Figure 1 for a toy example of model nonidentifiability in expression quantification. The two main problems in evaluating the accuracy of transcript expression estimates under nonidentifiability are (1) detecting the transcripts whose expression estimates are nonidentifiable and (2) bounding the range of the uncertain expression of those transcripts.
Expression of transcripts is used in many analyses, and understanding the accuracy and uncertainty of the estimated expression helps us evaluate the confidence of the conclusions of such analyses. Transcript expression estimates are used for detecting splicing isoform switching (Nowicka and Robinson, 2016; Guo et al., 2017; Vitting-Seerup and Sandelin, 2017), for identifying differential expression (McCarthy et al., 2012; Love et al., 2014; Ritchie et al., 2015; Soneson et al., 2015; Pimentel et al., 2017), and for predicting disease status and treatment outcome (Morán et al., 2012; Hoadley et al., 2014). Ranges of uncertain expression estimates provide useful insights into the reliability of these studies.
We focus on the uncertainty of expression estimates due to model nonidentifiability, but there are other causes of estimation uncertainty or inaccuracy. The small sample sizes (low sequencing depth of RNA-seq) also lead to estimation errors in transcripts' expression. Statistical methods such as bootstrapping and Gibbs sampling (Turro et al., 2011;Glaus et al., 2012;Al Seesi et al., 2014) provide an estimate on the error in expression levels due to the sample size. This estimation error can be reduced by increasing the sequencing depth. In contrast, the estimation uncertainty due to model nonidentifiability is more fundamental because it cannot be addressed even under infinite sample size.
In RNA-seq data, exon sequences are usually shared among multiple transcripts, and RNA-seq reads usually cannot be mapped to a unique transcript. Due to the high rate of multimapping events in RNA-seq data, the best set of transcripts' expression cannot be uniquely resolved, and the phenomenon of nonidentifiability occurs. Multimapped reads are prevalent in RNA-seq data regardless of the sequencing depth. Thus, the uncertainty of expression due to model nonidentifiability cannot be easily addressed.
Previous works have analyzed the nonidentifiability phenomenon in transcript expression quantification. However, they mainly focused on the first problem, to identify the transcripts for which the expression estimates are nonidentifiable, but not the second, which is to bound the ranges of the true expressions for the transcripts with uncertain expression estimates. Lacroix et al. (2008) and Hiller et al. (2009) developed methods to list the transcripts that have nonunique expression estimates, but their methods do not provide information about the values of optimal abundances. Roberts and Pachter (2013) incorporated the detection of nonidentifiable abundances into their quantification method, and designed a tool to output an identifiability flag for each transcript along with a single-expression estimate.
In this work, we develop methods that address the range of optimal expression estimates for transcripts under model nonidentifiability. For each transcript, we calculate the minimum and maximum values across all optimal expression estimates. That is, for any expression value between the computed minimum and maximum, there exists a set of expressions of the other transcripts such that the estimate of all transcripts' expression (combining both the expression of this transcript and the other transcripts) attains the largest likelihood in the probabilistic model of expression quantification. Compared with a list of names of the transcripts whose expressions are nonidentifiable, the range of optimal expression of a transcript provides more information on the accuracy (or inaccuracy) of the estimate.

FIG. 1. Example of a nonidentifiable quantification model. Transcripts are the paths in the splice graph, denoted by the concatenation of exon indices. The number on each edge indicates the number of observed reads mapped to this splice junction. A set of transcript abundances is optimal if it perfectly explains the observed reads; that is, for each junction, the abundances of transcripts containing that junction sum to the number displayed on the edge. The right side of the figure shows two co-optimal sets of expression abundances. It can be verified that both solutions explain the observed reads perfectly, as both predict 10 reads on each junction.
Most widely used quantification software (Li and Dewey, 2011; Hensman et al., 2015; Bray et al., 2016; Patro et al., 2017) takes a set of reference transcript sequences as input and assumes that the reference transcripts are the complete set of sequences that can be expressed. This is called reference-transcript-based expression quantification. Another line of expression quantification models (LeGault and Dewey, 2013; Bernard et al., 2014; Ma et al., 2020), called "graph quantification," assumes that the current reference transcriptome database is incomplete. Instead, the splice graph that encodes the exon-exon connection relationships is assumed to be correct and complete, meaning every possibly expressed transcript corresponds to a path in the splice graph and vice versa. These models infer the splice graph edge expression or edge selection propensities in their quantification probabilistic models.
We provide a more detailed overview in Section 2.3. Many transcript assembly methods also adopt a similar setup, assuming that a mixture of reference transcripts and novel isoforms is expressed (Trapnell et al., 2010; Tomescu et al., 2013; Pertea et al., 2015; Liu and Dickerson, 2017; Gatter and Stadler, 2019).
We develop methods to bound the range of optimal expression estimates for both reference-transcript-based and graph quantification models. Our method for reference-transcript-based quantification is based on linear programming over sufficient statistics (Section 2.2), and our method for graph quantification is based on max-flow formulations to "introspect" the graph quantification model (Section 2.4). Our introspection algorithm not only bounds the uncertain expression of full transcripts, but also extends to graph structures. For example, given a set of edges, our method computes the range of the optimal total expression of transcripts that cover any edge in the set.
Combining our methods for quantification models and interpolating between the complete and incomplete reference transcriptome assumption, we can additionally compute the range of optimal expression estimates under the assumption that a given percentage of the expression comes from the reference and the remaining expression comes from the full paths in splice graphs (Section 2.5).
Applying our method to 16 Human Body Map samples, we analyze to what degree the expressions of transcripts are estimated inaccurately due to nonidentifiability. We observe that around 35%-50% of transcripts potentially suffer from expression estimation error across the 16 samples. Most of these transcripts (or 20%-47% of total transcripts) have very uncertain expression estimates such that the ranking of expression between the transcript and its sibling isoforms from the same gene is inconclusive. Around half of the transcripts with uncertain expression estimates due to nonidentifiability are different from those due to finite sample size.
Applying our method on sequencing data sets of an MCF10 cell line and of CD8 T cells, we use the ranges of optima to evaluate the reliability of detected differentially expressed (DE) transcripts within each data set. A DE detection is unreliable if the ranges of optimal expression between the DE groups largely overlap. We observe that the majority of the DE calls are reliable and robust to the uncertain expression estimation due to nonidentifiability when the reference transcriptome contributes to >40% of expression. However, there are 5 unreliable DE calls (out of 257 detections) in the MCF10 data set, and 19 unreliable DE calls (out of 3152 detections) in the CD8 T cell data set. It requires further investigation to determine whether these transcripts are actually DE, and analyses based on the DE status of these transcripts require extra caution.

METHODS
This study does not collect new biological data, and thus IRB approval is not required. We start with relevant definitions in Section 2.1. Section 2.2 provides a high-level overview of probabilistic modeling of transcript quantification, and the linear programming to derive a range of optimal abundance estimates for this setup. Section 2.3 provides an overview of graph quantification, and Section 2.4 describes our introspection algorithm to derive the ranges under this setup.

Definitions
A "splice graph" is a directed acyclic graph representing alternative splicing events in a gene. The graph has two special vertices: S represents the start of transcripts, and T represents the termination of transcripts. Every other vertex represents a (partial) exon. Edges in the splice graph represent "splice junctions," that is, potential adjacency between exons in transcripts. Each transcript corresponds to a unique S-T path in the splice graph, and the set of known transcripts is called the "reference transcriptome." S-T paths that are not present in the reference transcriptome correspond to "unannotated transcripts." We use the phrase "quantified transcript set" to denote a set of transcripts with corresponding abundances.

Ranges of optimal estimates for reference-transcript-based quantification
Recall the problem of reference-transcript-based expression quantification. We focus on paired-end reads for now; however, the formulation naturally extends to other read types. We also focus on a particular probabilistic model, which lies at the heart of most modern transcript quantifiers (Li and Dewey, 2011; Hensman et al., 2015; Bray et al., 2016; Patro et al., 2017). Assume that the paired-end reads from an RNA-seq experiment are error-free and uniquely aligned to a reference genome as fragments. We denote the set of fragments as F, and the set of transcripts as T = {T_1, T_2, ..., T_n} with abundances (copies of molecules) c_1, c_2, ..., c_n. The probability of observing F is as follows:

P(F) = ∏_{z ∈ F} Σ_{i ∈ A_T(z)} P(T_i) P(z | T_i).

To generate a fragment, first a transcript is sampled, then the fragment is sampled from the selected transcript, and we only observe the resulting fragments. P(T_i) denotes the probability of sequencing a fragment from transcript T_i, and P(z | T_i) denotes the probability of sampling the fragment z given that it comes from T_i. A_T(z) is the set of transcript indices onto which z can map. Usually, P(T_i) ∝ c_i, and P(z | T_i) is a known quantity derived from the fragment length distribution, bias correction, and similar factors. Recent transcript quantifiers (Bray et al., 2016; Patro et al., 2017) use sophisticated techniques to ensure accurate estimation of P(z | T_i) and fast inference of {c_i}.
We now introduce the idea of reparameterization. Assume, for now, that each fragment spans exactly one junction. If two sets of quantified transcripts result in exactly the same total abundance on each junction, they yield the same generative model; in other words, the two sets are indistinguishable from each other. Let {g_i} be the set of total abundances for each junction J_i, and write J_i ∈ T_j to denote that junction J_i is in transcript T_j. The following condition is sufficient to guarantee that a set of transcript abundances {c′_j} is valid and indistinguishable from {c_j}:

Σ_{j: J_i ∈ T_j} c′_j = g_i for every junction J_i, with c′_j ≥ 0 for every transcript T_j.

As this defines a linear system, we can calculate the maximum and minimum possible values of c′_j for each transcript T_j, which form the confidence interval for this transcript, assuming {c_j} is an optimal solution for the inference (readily obtained using existing software). More specifically, for each transcript of interest, we use either max c′_j or min c′_j as the sole optimization objective with the linear system as the constraints, and solve the resulting linear program. The result is an interval because, for every abundance value between the two extremes, there exists an optimal solution that allocates exactly this abundance to the transcript.
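To make the linear program concrete, the following sketch brute-forces the range of optimal abundances on a hypothetical four-transcript toy instance (a real implementation would call an LP solver; the transcript and junction names here are invented for illustration).

```python
# Sketch: brute-force the confidence ranges for a toy
# reference-transcript-based instance. Transcript/junction names
# are hypothetical; in practice an LP solver replaces the grid search.
from itertools import product

# Which junctions each transcript covers.
transcripts = {"T1": {"a", "b"}, "T2": {"a", "c"}, "T3": {"b"}, "T4": {"c"}}
# Observed total abundance g_i on each junction.
g = {"a": 10, "b": 6, "c": 8}

names = list(transcripts)
lo = {t: float("inf") for t in names}
hi = {t: float("-inf") for t in names}

for c in product(range(11), repeat=len(names)):
    abundance = dict(zip(names, c))
    # Keep only assignments that reproduce every junction total exactly.
    if all(sum(abundance[t] for t in names if j in transcripts[t]) == gj
           for j, gj in g.items()):
        for t in names:
            lo[t] = min(lo[t], abundance[t])
            hi[t] = max(hi[t], abundance[t])

for t in names:
    print(t, (lo[t], hi[t]))
```

On this instance, T1's abundance can be anywhere in [2, 6] across co-optimal solutions, even though every solution explains the junction counts perfectly, which is exactly the nonidentifiability the ranges capture.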
This argument no longer holds when fragments can span multiple junctions. However, as we have shown previously (Ma et al., 2020), to ensure that two sets of quantified transcripts yield the same fragment distribution, it suffices that for each phasing path (i.e., the set of junctions in a fragment, as a path on the splice graph), the total abundance of transcripts containing the phasing path in full is equal. In other words, the form of the linear program remains unchanged, with the only change being that J i is now a phasing path instead of a single junction.

Bird's-eye view: graph quantification
We now provide a high-level overview of graph quantification and draw similarities with its transcript-based counterpart. The problem of splice graph expression quantification is very similar to the transcript version of the problem, but with the variables associated with transcripts, {T_i} and {c_i}, replaced by a graph G = (V, {e_i}) and a network flow {f_i} on the graph. We start by assuming that G is the splice graph in the usual sense, and that every fragment z is a junction read, which can be mapped to an edge in G. Under an analogous probabilistic model, the probability of observing F is as follows:

P(F) = ∏_{z ∈ F} Σ_{e_i ∈ A_G(z)} P(e_i) P(z | e_i).

To generate a fragment, first an edge in G is sampled, then the fragment is sampled from the selected edge, and we only observe the resulting fragments. Again, P(e_i) denotes the probability of sampling a fragment from edge e_i, and P(z | e_i) denotes the probability of sampling the fragment z given that it comes from e_i. A_G(z) denotes the set of edges from which z can be generated. We also have P(e_i) ∝ f_i up to a normalization factor, and P(z | e_i) is a known quantity.
After inferring the splice graph flow values from the set of observed fragments, we obtain an optimal splice graph flow that explains the observation. While the graph flow itself finds use in certain cases, for most analyses it is imperative that the flow be decomposed into a weighted set of S-T paths, corresponding to a quantified transcript set. Importantly, the flow decomposition is not unique in many cases, and while different decompositions lead to different sets of transcripts, they define the same generative model. This is a direct manifestation of nonidentifiability if we consider the transcript abundances to be the variables of the probabilistic model.
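The nonuniqueness of flow decomposition can be checked directly. Below, a hypothetical diamond-shaped splice graph (vertex names invented for this example) admits two decompositions that induce identical edge flows, and hence the same generative model, while assigning different weights to individual paths.

```python
# Sketch: two decompositions of the same splice-graph flow.
# Paths are written as vertex strings over S, A, B, C, D, E, T.
from collections import Counter

def edge_flow(decomposition):
    # Sum each path's weight onto the edges it traverses.
    f = Counter()
    for path, w in decomposition:
        for e in zip(path, path[1:]):
            f[e] += w
    return f

d1 = [("SACDT", 10), ("SBCET", 10)]
d2 = [("SACET", 10), ("SBCDT", 10)]

print(edge_flow(d1) == edge_flow(d2))  # identical edge flows
```

Yet the path SACDT carries weight 10 in the first decomposition and 0 in the second; this ambiguity is precisely what the ranges of optima quantify.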
In the next section, we describe methods that consider all possible flow decompositions of a given graph flow. As a particular use case, these methods are able to determine the minimal and maximal possible abundance for a given transcript, across all possible flow decompositions.
Similarly to the transcript-based quantification, the conclusion no longer holds when fragments span multiple junctions, and it is much harder to adjust the graph quantification model to properly take these into account. However, for graph quantification, G can be any supergraph of the splice graph, as long as each S -T path in G corresponds to a valid transcript to generate fragments, and each valid transcript corresponds to a unique S -T path in G. Several approaches are possible, and we will use the one outlined in our previous work (Ma et al., 2020), which properly handles phasing reads.

Subgraph quantification
Nonidentifiability manifests itself in graph quantification as different flow decompositions of the inferred splice graph flow. Specifically, for a transcript, the corresponding S-T path can be assigned different weights by different flow decompositions. Similar to Section 2.2, we only need the minimal and maximal possible weights to construct the confidence interval.
In addition, we are also interested in the related problem of "local quantification," that is, estimating the abundance of a specific splicing pattern by aggregating the expression of full-length transcripts containing the pattern. This is natural in the analysis of complex gene loci, where certain splicing patterns within a region may be of larger interest than the patterns outside. Our reasoning for deriving a "confidence interval" applies similarly to splicing patterns; that is, we are interested in the total weight of S-T paths containing a specific splicing pattern as a confidence interval. Since a transcript consisting of s exons is a combination of s - 1 consecutive junctions, we are interested in splicing patterns that are defined by the co-occurrence of k disjoint junctions. This motivates the following formal definition of the subgraph quantification problem in graph theory language, generalizing the transcript confidence interval calculation to splicing patterns:

Definition 1 (AND-Quant). Let G = (V, E) be a directed acyclic graph with an edge flow, and let {E_k} be a list of edge sets with the well-ordering property: if a path visits e_i ∈ E_i and then visits e_j ∈ E_j at a later step, it must be that i < j. An S-T path is "good" if it intersects each E_k. For a flow decomposition of G, the total "good flow" is the total flow from good S-T paths. AND-Quant(G, {E_k}) asks for the minimum and maximum total good flow over all decompositions of G.
For quantifying a full transcript, {E_k} consists of single-edge sets, each corresponding to an edge in the S-T path representing the transcript. The range of optima for this transcript is exactly AND-Quant(G, {E_k}). For the aforementioned "local quantification," the definition of {E_k} depends on the specific construction of the graph quantification instance, but in general it is easy to construct one E_i for each junction that satisfies the well-ordering property.
To solve AND-Quant, the first step is to define a similar problem called OR-Quant:

Definition 2 (OR-Quant). Let G = (V, E) be a directed acyclic graph with an edge flow, and let E′ be an arbitrary subset of E. An S-T path is "good" if it intersects E′. The total "good flow" and the objective are defined in the same way as in AND-Quant.
The OR-Quant problem is complementary to AND-Quant and is interesting on its own, as it can represent analyses where we are interested in the aggregated expression that includes any of several junctions or exons. We now convert an instance of AND-Quant to OR-Quant by constructing a graph G_B, called the block graph, that contains every edge that is either in one of the E_k, or lies between an edge in E_{k-1} and an edge in E_k for 1 ≤ k ≤ m + 1, where m is the number of sets in {E_k} (with some abuse of notation, assume E_0 consists of a virtual edge ending at S, and E_{m+1} consists of a virtual edge starting at T). We claim the following for AND-Quant:

Lemma 1. An S-T path is good if and only if it is fully contained in G_B.

If we know the minimum and maximum total "bad flow" (the complement of the good flow), we can obtain the answer to AND-Quant by complementing the result with U, the total flow of G. From the lemma, an S-T path is bad if and only if it intersects G - G_B, which turns the problem into an instance of OR-Quant.
We now solve OR-Quant. Recall that E′ is the set of edges that a "good" path needs to intersect. To that end, we define two auxiliary graphs. G⁻ is a copy of G without the edges in E′. G⁺ is a copy of G with an extra vertex T′, which replaces T as the flow sink, and all edges in E′ have their destination moved to T′. We claim that running MaxFlow on both graphs yields the solution:

Theorem 1. Let G⁺ and G⁻ be constructed as described above. Then OR-Quant(G, E′) = [U - MaxFlow(G⁻), MaxFlow(G⁺)], where U is the total flow of G.
Constructing G⁺ and G⁻ takes linear time, so the time complexity of OR-Quant depends only on the time to solve MaxFlow. Combining the arguments leads to the solution to AND-Quant (recall that U is the total flow on G):

Theorem 2. Let G_B be the block graph, and let [l, r] = OR-Quant(G, G - G_B). Then AND-Quant(G, {E_k}) = [U - r, U - l].

See Figure 2 for an example of both problems. In the following three sections, we provide proofs of Theorems 1 and 2, as well as an algorithm to construct G_B in linear time.
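A minimal sketch of the two max-flow calls behind Theorem 1, assuming the diamond-shaped toy graph and edge set E′ below (both invented for illustration). Max flow is computed with a plain Edmonds-Karp routine rather than an optimized solver.

```python
# Sketch: OR-Quant bounds via two MaxFlow calls on a toy instance.
from collections import deque, defaultdict

def max_flow(edges, s, t):
    # Edmonds-Karp max flow; `edges` maps (u, v) -> capacity.
    cap = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), c in edges.items():
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # residual arc
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:  # BFS for a shortest augmenting path
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(cap[e] for e in path)  # bottleneck capacity
        for u, v in path:
            cap[(u, v)] -= b
            cap[(v, u)] += b
        flow += b

# Hypothetical flow of 20 units through a diamond-shaped splice graph.
G = {("S", "A"): 10, ("S", "B"): 10, ("A", "C"): 10, ("B", "C"): 10,
     ("C", "D"): 10, ("C", "E"): 10, ("D", "T"): 10, ("E", "T"): 10}
E_prime = {("A", "C"), ("C", "E")}  # a path is "good" if it uses either edge
U = 20  # total flow of G

# Lower bound: U minus the largest flow that avoids E' entirely.
G_minus = {e: c for e, c in G.items() if e not in E_prime}
low = U - max_flow(G_minus, "S", "T")

# Upper bound: redirect E' edges into a new sink T'.
G_star = {((u, "T'") if (u, v) in E_prime else (u, v)): c
          for (u, v), c in G.items()}
high = max_flow(G_star, "S", "T'")
print(low, high)
```

Here the good flow ranges over [10, 20]: one decomposition routes both paths through E′, another routes only one. Note that the dictionary-comprehension redirect would merge parallel edges into T′ in general; a full implementation would keep the edge labels described later in Section 2.4.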
2.4.1. Algorithms for OR-Quant. In this section, we state the algorithms for OR-Quant from a purely graph-theoretical perspective. Recall that our setup consists of a directed acyclic graph (DAG) G with predetermined source S and sink T, a flow F = {f_e} on the graph, and an edge subset E′. We use C(F) to denote the total flow, and in this way we write U = C(F).
We start with several basic definitions.
Definition 3 (Flow). Given DAG G with predetermined source S and sink T, a flow F is a set of edge weights of G satisfying non-negativity on every edge weight and flow balance on every vertex except S and T.
Definition 4 (Decompositions). Given a graph G and a flow F = {f_e}, a decomposition R is written as {(T_i, c_i)}, where each T_i is an S-T path on G and c_i is a nonnegative number, satisfying Σ_{i: e ∈ T_i} c_i = f_e for every edge e. We use |R| to denote the number of S-T paths in R, and C(R) = Σ_i c_i to denote the total flow/capacity of the decomposition.
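A decomposition in the sense of Definition 4 can be built greedily: repeatedly peel off an S-T path carrying the bottleneck weight, which zeroes at least one edge per step, so the process terminates with at most |E| paths (the same observation underlies the finite-decomposition lemma below). A sketch on a hypothetical flow:

```python
# Sketch: greedy path-peeling decomposition of a flow {(u, v): value}.
def decompose(flow, s, t):
    f = dict(flow)
    paths = []
    while True:
        path = [s]
        while path[-1] != t:
            # On a balanced flow, any vertex with positive in-flow
            # also has a positive out-edge, so walking forward reaches t.
            nxt = next((v for (u, v) in f
                        if u == path[-1] and f[(u, v)] > 0), None)
            if nxt is None:
                break
            path.append(nxt)
        if path[-1] != t:
            return paths  # no flow left to peel
        w = min(f[e] for e in zip(path, path[1:]))  # bottleneck weight
        for e in zip(path, path[1:]):
            f[e] -= w
        paths.append((path, w))

# Hypothetical diamond-shaped flow (names invented for this example).
F = {("S", "A"): 10, ("S", "B"): 10, ("A", "C"): 10, ("B", "C"): 10,
     ("C", "D"): 10, ("C", "E"): 10, ("D", "T"): 10, ("E", "T"): 10}
R = decompose(F, "S", "T")
print(len(R), sum(w for _, w in R))
```

The peeled paths reproduce the original edge flows exactly, and C(R) equals the total flow U = 20.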
Definition 5 (Partial Decompositions). A partial decomposition of a set of non-negative edge weights W = {w_e} on a graph G is written as {(T_i, c_i)}, where each T_i is again an S-T path on G and c_i is a nonnegative number, satisfying Σ_{i: e ∈ T_i} c_i ≤ w_e for every edge e. Alternatively, if W is a flow F, a partial decomposition can simply be defined as a subset of a (full) decomposition of F.
We define partial decompositions on arbitrary non-negative edge weights, as required for a later proof. We present the following lemma without proof.
Lemma 2 (Finite Decomposition). Every flow on a graph of finite size has a decomposition of finite size.

We now restate the definition of OR-Quant slightly more formally.

Definition 6 (OR-Quant, restated). Given a DAG G with flow F and an edge subset E′, an S-T path is good if it intersects E′. For a decomposition R = {(T_i, c_i)} of F, let Q(R) = Σ_{i: T_i is good} c_i, that is, the total flow from good paths in the decomposition. OR-Quant(G, E′) asks for the minimum and maximum Q(R) over all possible decompositions R of F.
We now state our algorithms and prove the correctness of the algorithms, starting from the lower bound.
Theorem 3 (Diff-Flow). The auxiliary graph G⁻ is built with edge set E - E′. Then Z = U - MaxFlow(G⁻) is the minimum total good flow Q(R) over all possible decompositions R of F.
Proof. We prove the theorem in two parts: first, we show there is a decomposition R such that Q(R) = Z; then, we show that any decomposition satisfies Q(R) ≥ Z.

The key observation is that all bad S-T paths lie fully in G⁻. For the first part, fix a maximum flow F⁻ of G⁻. Let R⁻ be an arbitrary decomposition of F⁻, and let R⁺ be an arbitrary decomposition of F - F⁻, so that C(R⁺) = U - MaxFlow(G⁻) = Z. Every path in R⁺ must intersect E′: assuming otherwise, we could remove that path from R⁺ and add it to R⁻, which leads to a larger C(R⁻). This is impossible, as it implies F⁻ is not the MaxFlow of G⁻, because the flow of R⁻ would be a strictly better solution. Having proved Q(R⁺) = C(R⁺) = Z, we know R⁻ + R⁺, a (full) decomposition of F, has exactly Z total good flow.

Now, for any decomposition R, we split it into two parts: R⁺ contains all good paths in R, and R⁻ contains all bad paths in R. Since every bad path lies fully in G⁻, the flow of R⁻ is a feasible flow on G⁻, so C(R⁻) ≤ MaxFlow(G⁻). Therefore Q(R) = C(R⁺) = U - C(R⁻) ≥ U - MaxFlow(G⁻) = Z. ∎
Theorem 4 (Split-Flow). The auxiliary graph G* is built by adding a vertex T′ to G and, for each edge in E′, changing its destination to T′ (we change the notation from G⁺ to G* for technical reasons here). T′ replaces T as the sink of flows. Then Y = MaxFlow(G*) is the maximum total good flow over all possible decompositions of F.
Proof. We prove the theorem in a similar style: first, we show that any decomposition satisfies Q(R) ≤ Y; then, we show that there is a decomposition R of F such that Q(R) = Y.
For the first part of the theorem, for a decomposition R of F, we can write R = R⁺ + R⁻, where R⁺ contains all good paths in R and R⁻ contains all bad paths in R. We can map R⁺ to G* and denote the resulting partial decomposition R* (R* is indeed a partial decomposition, by noting there is a one-to-one mapping between edges of F* and edges of F). We have Q(R) = Q(R⁺) = C(R⁺) = C(R*) ≤ MaxFlow(G*).
The second part of the proof, that is, to show there is a decomposition R of F such that Q(R) = Y, is more involved. Thus, we provide an overview before proceeding to the actual proof.

Overview
This part of the proof involves actually constructing a decomposition R such that Q(R) = Y. We only know that there exists a MaxFlow with the correct flow value on G*, and we can decompose this flow. However, this only gives us a "truncated" decomposition, where all paths end in an edge of E′ instead of reaching T. Thus, our task is to complete each of these paths in the decomposition by extending it to T, which yields the desired decomposition. We cannot arbitrarily extend the paths, as we also need to respect the flow capacity of F.

So, we may need to further split the weight of a path in the truncated decomposition when trying to complete it. Our constructive proof follows a strict protocol for these splits and completions, one path at a time, so that the flow capacity is respected at every step.

We start by building an arbitrary decomposition R* from a maximum flow of G*. However, R* is not a valid partial decomposition of F, because G* and G have different edge sets. Our goal is to obtain a partial decomposition R⁺ of F that is a "reconstruction" of R*. We define several terms for the convenience of our proof.

Path mapping for G*

A good S-T path on G is mapped to an S-T′ path on G* by finding the first edge of the path that is in E′, changing the destination of that edge to T′, and discarding the edges after it so that the path ends at T′. (We never map a bad S-T path on G this way.) Similarly, an S-T′ path on G* is mapped to an (incomplete) path on G by moving the destination of its last edge (which was T′) back to the original node before the transformation. We assume each edge that ends at T′ keeps a label containing its original destination; this implies that multiple edges can exist between a node and T′, and the move-back follows the label. The resulting path is guaranteed to intersect E′, but it is not a complete S-T path.
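The forward direction of this mapping is straightforward to state in code. A small sketch (the path and E′ below are hypothetical), returning both the truncated path and the label of the redirected edge:

```python
# Sketch: map a good S-T path on G to an S-T' path on G*.
def map_to_gstar(path, e_prime):
    # Truncate at the first edge in E', redirecting it to the new sink T'.
    for i, edge in enumerate(zip(path, path[1:])):
        if edge in e_prime:
            # Return `edge` as the label so the move can later be reversed.
            return path[:i + 1] + ["T'"], edge
    raise ValueError("bad path: does not intersect E'")

truncated, label = map_to_gstar(list("SACDT"), {("A", "C")})
print(truncated, label)
```

For the path S-A-C-D-T with E′ = {(A, C)}, the mapped path is S-A-T′, and the retained label (A, C) is what makes the reverse mapping well defined.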
We apply an induction argument, formally defined as follows. Figure 3 provides a concrete example for the induction argument.

Flow reconstruction instance
An instance of flow reconstruction has two inputs: (F, R*). F is a flow on G, and R* is a partial decomposition of F* (on G*), where F* is constructed in the same way as G*, by moving the endpoints of edges in E′ to the new node T′. F* is not actually a flow, because flow balance is violated at the vertices that have an incoming edge moved away; nevertheless, a partial decomposition of F* is still well defined, as R* contains S-T′ paths. The output of the instance is R⁺, a partial decomposition of F, satisfying that (1) each path in R⁺ is good, and (2) if we map each path in R⁺ back to G*, the resulting partial decomposition has the same flow as R*.
We apply induction on |R*|, the number of paths in R*. The base case is |R*| = 0, where we can simply return an empty R⁺. Assume now that the flow reconstruction can be solved for |R*| ≤ k - 1; we solve an instance with |R*| = k. We first state the algorithm, and then prove its correctness.

Induction algorithm
We first pick the path (P, c) in R* whose second-to-last vertex (the one right before T′) is the last among all paths in the topological ordering of G. Denote the last edge of this path, mapped back to G, as (u, v). P without its last edge is then a path from S to u. We let P′ denote P with its last edge changed back to (u, v).
We next run a MaxFlow on G from v to T with total flow c. This call always returns a full flow (with total flow c), because there is at least c incoming flow to v from the existence of edge (u, v) alone. This flow is then decomposed, and each path in the decomposition is prepended with P′ (which is a path from S to v) so that it becomes an S-T path. We call the resulting partial decomposition R′, and we next show that it is valid, meaning the total flow of R′ on each edge does not exceed F. For edges in P′, the total flow on the edge in R′ is exactly c. As (P, c) is a path in R*, each of the edges in P has at least c flow in F*, which immediately means each edge in P′ has at least c flow in F. For every other edge, its flow comes from the v-T MaxFlow, so the flow value never exceeds the capacity of the edge.
We continue the process by recursively calling the procedure on (F′, R′*), where F′ is F minus the flow of R′, and R′* is R* without (P, c). We return the partial decomposition from the recursive call merged with R′.

Proof of induction correctness
The new instance to be called has size k - 1, since one path in R* has been removed. We first prove that the recursive call is valid, that is, that R′* is indeed a partial decomposition of F′* (where F′* is constructed from F′ by moving the endpoints of edges in E′ to T′).
If the call were invalid, some edge in F′* would have less flow than the total flow from R′*. Since R* is a partial decomposition of F*, the condition can only be invalidated at an edge with positive flow in R′, as these are the only edges where the flow on F′* is less than the flow on F*. This edge can either be an edge in P, or an edge whose starting point is later than v in the topological order of G, by the construction of R′ (with a v-T MaxFlow). If the edge is in P and is not the last edge, exactly c flow is subtracted on this edge from F to F′ and from R* to R′*, so a violation here would contradict the condition that R* is a partial decomposition of F*. If this is the edge (u, T′) that is mapped from (u, v), again, exactly c flow is subtracted both from F to F′ and from R* to R′*.
If it is an edge not in P, we claim the edge has zero flow in R'*. This is because of the way we pick (P, c): the path with the latest second-to-last vertex is picked. If the edge had positive flow in R'*, there would be a path in R'* that includes this edge, and the ending point of that path would be later than v, meaning the path would have been chosen in place of P during the process.
Given that the recursive call is valid, we can prove the correctness of the algorithm by showing that the R^+ output by the algorithm indeed satisfies the two output conditions. For (1), each path in R^+ is good because it comes from an R' at some iteration, and all paths in R' have P' as a prefix, whose last edge (u, v) is in E_0. For (2), at each recursive call, R' mapped to F* becomes one path P with flow c, which matches (P, c) in R*, so the condition is maintained by an induction argument. This finishes the correctness proof for the induction. □

Completing the proof
With the induction algorithm, we can reconstruct a partial decomposition R^+ of G from R* (a decomposition of F*); then, similar to the previous proof, we construct another partial decomposition R^- from F minus the flow of R^+. By construction of R^+, we have Q(R^+) = MaxFlow(G*). If some path in R^- intersected E_0, we could remove that path from R^- and add it to R^+, then map the new R^+ to G* to obtain a flow better than MaxFlow(G*), a contradiction. This means Q(R^-) = 0, and the decomposition R = R^+ + R^- satisfies Q(R) = MaxFlow(G*), completing the whole proof. □
With both bounds proved, we can formally finish the proof of the main theorem.

Theorem 1. Let G^+ and G^- be constructed as described above. Then OR-Quant(G, E_0) = [MaxFlow(G^-), MaxFlow(G^+)].

Proof. This is implied by Theorem 3 for the lower bound and Theorem 4 for the upper bound. □

2.4.2. Algorithms for AND-Quant.
In this section, we prove that the complementarity argument mentioned above is correct. Our setup consists of a DAG G with predetermined source S and sink T, a flow F = {f_e} on G, and a list of edge sets {E_k}. We define flows and decompositions in the same way as in the last section (Definitions 3 and 4). Now recall the definition of AND-Quant, written slightly differently with the new notation.

Definition 7 (Well-Ordering Property). A list of edge sets {E_k} satisfies the well-ordering property if a path visiting e_i ∈ E_i and then e_j ∈ E_j at a later step implies i < j.
Definition 8 (AND-Quant). Let G = (V, E) be a directed acyclic graph with an edge flow, and let {E_k} be a list of edge sets satisfying the well-ordering property. An S-T path is good if it intersects each E_k. For a decomposition of F, the total good flow is the total flow carried by good S-T paths. AND-Quant(G, {E_k}) asks for the minimum and maximum total good flow over all possible decompositions of F.
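To make Definitions 7 and 8 concrete, both path-level conditions can be checked directly given the edges of one S-T path. A small sketch (function names are ours, not from the paper):

```python
def is_good_path(path_edges, edge_sets):
    """Definition 8: a path is good if it intersects every E_k."""
    return all(any(e in E_k for e in path_edges) for E_k in edge_sets)

def respects_well_ordering(path_edges, edge_sets):
    """Definition 7, restricted to one path: the indices of the edge
    sets visited along the path must be strictly increasing."""
    visited = [k for e in path_edges
               for k, E_k in enumerate(edge_sets) if e in E_k]
    return all(a < b for a, b in zip(visited, visited[1:]))
```

The well-ordering property of Definition 7 demands that `respects_well_ordering` holds for every S-T path of G, not just one.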
We start by discarding every edge e in any E_i such that no good S-T path uses e. By definition, removing these edges does not change the answer to AND-Quant, as no good S-T path is excluded. Afterward, every edge e in any E_i is included in at least one good S-T path. Now, given that G is a DAG and the edge sets satisfy the well-ordering property, we can define the following:

Definition 9 (Start/End-Sets and Natural Order). Define U_i = {u : (u, v) ∈ E_i}, V_i = {v : (u, v) ∈ E_i}, and for convenience, V_0 = {S}, U_{m+1} = {T}.
We define the natural order u ⪯ v if there is a directed path starting at u and ending at v. We write U_i → x for the condition that there is some u_i ∈ U_i such that u_i ⪯ x (intuitively, x can be reached from U_i), and x → U_i for the condition that x ⪯ u_i for some u_i ∈ U_i (intuitively, x can reach U_i). The conditions V_i → x and x → V_i are defined similarly.
Definition 10 (The Block Graph). Define B_i, the i-th block subgraph, as the set of edges (u, v) that satisfy V_{i-1} → u and v → U_i, for 1 ≤ i ≤ m + 1. The full block graph G_B is the union of all B_i and E_i.
The filtering steps and construction of the block graph can be done in linear time, as discussed in Section 2.4.3.
The rest of this section is dedicated to proving the following lemma, which directly implies the main theorem:

Lemma 3 (Main Lemma for AND-Quant). An S-T path in G is good if and only if it is a subset of G_B.
We now provide a quick overview before proceeding to the actual proof.

Overview
The proof of the above lemma proceeds in several parts. The if direction of the statement is straightforward; the only-if direction is harder to prove. We start by deriving some helpful corollaries of the well-ordering property. Next, we show that, for any vertex x in the block graph and any edge set E_i in the query list, x is either strictly before E_i, meaning there is a path starting at x and ending at an edge in E_i (this includes the case x ∈ U_i, that is, an edge in E_i originates from x), or strictly after E_i, meaning there is a path starting at an edge in E_i and ending at x. This implies that each E_i is a cut of G_B, which yields the only-if part of the statement.
Lemma 4 (Well-Ordering Property Extended). If u ∈ U_i, v ∈ V_j, and v ⪯ u, then i > j.
Proof. We derive this from the well-ordering property of {E_k} by constructing an S-T path. First, as v ∈ V_j, there is an edge in E_j that ends at v, and by our previous assumption there is a path from S to v that uses this edge. As v ⪯ u, there is a path from v to u. As u ∈ U_i, there is an edge in E_i that starts at u, and similarly there is a path from u to T that uses this edge. Combining the three segments, we obtain an S-T path that first visits an edge in E_j and then an edge in E_i later, so by the well-ordering property of {E_k}, i > j. □

Lemma 5. For each v_i ∈ V_i, we have v_i → U_{i+1}.

Proof. Recall that we removed all edges in E_i that do not belong to any good path. v_i ∈ V_i implies there is an edge (u_i, v_i) ∈ E_i contained in a good path, and later in this good path there is an edge in E_{i+1} that starts at some vertex in U_{i+1}. This implies v_i → U_{i+1}. □

Lemma 6. For each u_i ∈ U_i, we have u_i → U_j for all j ≥ i.

Proof. u_i ∈ U_i means there is an edge (u_i, v_i) ∈ E_i, so u_i → U_i. By the last lemma, v_i → U_{i+1}, and repeating the argument gives v_i → U_j for all j > i. The same also holds for u_i, as u_i ⪯ v_i. □
We can prove the symmetric statements about V_i.
Lemma 7. For a fixed i, each vertex x in G_B either satisfies x → U_i or V_i → x, but not both.
Proof. Each vertex x in G_B is an endpoint of an edge in some B_j or E_j, so for some index j, both V_{j-1} → x and x → U_j hold (by the definition of B_j for block edges, and by the filtering step together with Lemma 5 for edges in E_j). By the preceding lemmas and their symmetric statements, we can similarly prove that the condition holds for all i: x → U_j implies x → U_i for every i ≥ j, and V_{j-1} → x implies V_i → x for every i < j. We have thus proved that for any vertex x and any i, one of x → U_i or V_i → x must be satisfied. No vertex can satisfy both; otherwise we would have u ∈ U_i, v ∈ V_i, and v ⪯ x ⪯ u, which would imply i > i by Lemma 4, a contradiction. □

Lemma 3 (Main Lemma for AND-Quant). An S-T path in G is good if and only if it is a subset of G_B.
Proof. We first prove that if a path is good, it is within G_B. An edge (u, v) on the path is either in some E_i, or between an edge in E_{i-1} and an edge in E_i. In the latter case, we know V_{i-1} → u and v → U_i, which directly imply (u, v) ∈ B_i.
To prove the other direction, we show that removing any E_i disconnects S from T in G_B. By the previous lemma, the vertices of G_B can be partitioned into two disjoint sets: S = {x : x → U_i} and T = {x : V_i → x}. Consider an edge (u, v) ∈ G_B that starts in S and ends in T. If (u, v) ∈ B_j, we know V_{j-1} → u → U_i (the first part from the definition of B_j and the second from u ∈ S; similarly for the rest of this proof), so j - 1 < i; at the same time, V_i → v → U_j, so i < j. But there is no integer i strictly between j - 1 and j, a contradiction. If (u, v) ∈ E_j, we know U_j → u → U_i, so j ≤ i (because V_{j-1} → u, using the fact that there is a good path including u), and V_i → v → V_j, so i ≤ j (because v ⪯ v_j → U_{j+1} for some v_j ∈ V_j, by Lemma 5), which implies i = j.
So we have (u, v) ∈ E_i. In other words, removing all edges in E_i disconnects S from T, so each S-T path in G_B uses an edge in E_i, for every i. This means the path is good. □
We can now finish the proof of the main theorem.

Theorem 2. Let G_B be the block graph, and let [l, r] = OR-Quant(G, G - G_B). Then AND-Quant(G, {E_k}) is obtained by complementing [l, r] against the total flow of F.

Proof. By Lemma 3, a path is bad if and only if it intersects G - G_B. Thus, we can convert an instance of AND-Quant into an instance of OR-Quant by constructing the block graph G_B and then complementing the results of OR-Quant with E_0 = G - G_B. The correctness of our OR-Quant algorithms is guaranteed by Theorem 1. □

2.4.3. Constructing G_B. In this section, we discuss how to construct G_B in linear time, using the notation of Section 2.4.2. We first run two breadth-first searches, from S and from T, to mark the vertices that are reachable from S and those that can reach T. Vertices failing either test can never appear in an S-T path and are removed first. Let s(v) denote the largest i such that there is a path from S to v that passes through an edge in each of E_1, E_2, ..., E_i in order. Generate a topological order of G, and set s(S) = 0. For every vertex v other than S, in topological order, the value of s(v) is determined as follows: s(v) is initialized as the largest s(u) over all predecessors u of v; then, for every (u, v) ∈ E_j with s(u) = j - 1, set s(v) = max(s(v), j).
We now show that the values s(v) are correctly derived. If v is indeed reachable from S via an edge in each of E_1, E_2, ..., E_i, then s(v) ≥ i: we visit all nodes on such a path in topological order, and after an edge in E_i we always have s(v) ≥ i. Conversely, if s(v) ≥ i, we show that v is reachable from S via an edge in each of E_1, E_2, ..., E_i. Let p(v) be the predecessor of v from which s(v) was calculated, either by passing the value or by passing through an edge in E_j.
By chaining p(v), we obtain a path from S to v in which each vertex is responsible for the s-value of its successor. Along this path, the value only increases by passing through an edge in E_j, and such a step can only bring the value from j - 1 to j, so the path must contain an edge from each of E_1, E_2, ..., E_i in order.
Similarly, we can obtain t(v), defined as the smallest i such that there is a path from v to T via an edge in each of E_i, E_{i+1}, ..., E_m, by setting t(T) = m + 1 and traversing the graph in reverse topological order. The filtering process then works as follows: for each (u, v) ∈ E_i, the edge is kept in E_i if and only if s(u) = i - 1 and t(v) = i + 1. We prove this procedure is correct. A good S-T path containing (u, v) must visit an edge in each of E_1, E_2, ..., E_{i-1} in order, then visit (u, v) ∈ E_i, then an edge in each of E_{i+1}, ..., E_m in order. The first part is possible if and only if s(u) ≥ i - 1, and the last part is possible if and only if t(v) ≤ i + 1. It remains to prove that s(u) ≤ i - 1 (the other part is symmetric). Otherwise, there would be an edge in E_i that precedes u; as (u, v) is also in E_i, there would exist a path containing two edges in E_i, which violates the well-ordering property.
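The forward pass for s(v), the backward pass for t(v), and the filtering rule can be sketched as follows (an illustrative simplification assuming each edge belongs to at most one E_i; names are ours, not the paper's code):

```python
def annotate_and_filter(edges, edge_sets, S, T, topo):
    """Compute s(v) and t(v) over a DAG given in topological order
    `topo`, then keep (u, v) in E_i iff s(u) = i - 1 and t(v) = i + 1.
    Edge sets are 1-indexed, as in the text."""
    m = len(edge_sets)
    idx = {e: i for i, E_i in enumerate(edge_sets, start=1) for e in E_i}
    pred = {v: [] for v in topo}
    succ = {v: [] for v in topo}
    for (u, v) in edges:
        succ[u].append((u, v))
        pred[v].append((u, v))
    # Forward pass: s(S) = 0; s(v) inherits the max over predecessors,
    # and crossing (u, v) in E_j with s(u) = j - 1 raises it to j.
    s = {S: 0}
    for v in topo:
        if v == S:
            continue
        s[v] = max((s[u] for (u, _) in pred[v]), default=0)
        for (u, _) in pred[v]:
            j = idx.get((u, v))
            if j is not None and s[u] == j - 1:
                s[v] = max(s[v], j)
    # Backward pass (symmetric): t(T) = m + 1, reverse topological order.
    t = {T: m + 1}
    for v in reversed(topo):
        if v == T:
            continue
        t[v] = min((t[w] for (_, w) in succ[v]), default=m + 1)
        for (_, w) in succ[v]:
            j = idx.get((v, w))
            if j is not None and t[w] == j + 1:
                t[v] = min(t[v], j)
    filtered = [{e for e in E_i if s[e[0]] == i - 1 and t[e[1]] == i + 1}
                for i, E_i in enumerate(edge_sets, start=1)]
    return s, t, filtered
```

Both passes touch each vertex and edge a constant number of times, matching the linear-time claim.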
To construct G_B, we need to construct the blocks B_i, which can be done using the values s(v) and t(v). An edge (u, v) not in any E_i is allocated to B_{t(v)} if s(u) + 1 = t(v). We prove this procedure is correct. If an edge satisfies s(u) + 1 = t(v) = k, there is an edge in E_{k-1} that leads to u, meaning V_{k-1} → u, and similarly t(v) = k means v → U_k, which is the definition of B_k. Conversely, if (u, v) ∈ B_k, we show s(u) + 1 = t(v) = k. Given V_{k-1} → u and that the edge sets are filtered, there exists v_{k-1} ∈ V_{k-1} with v_{k-1} ⪯ u, so s(u) ≥ s(v_{k-1}) = k - 1. If s(u) ≥ k, there would be an edge in E_k that leads to u, meaning V_k → u. Given u ⪯ v → U_k at the same time, we would have V_k → u ⪯ v → U_k, and there would be a path visiting an edge in E_k twice, violating the well-ordering property. We can similarly prove that t(v) = k is necessary and sufficient.
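Given s(v) and t(v), allocating the non-query edges to blocks is a single scan. A sketch under the same illustrative setup (names are ours):

```python
def allocate_blocks(edges, edge_sets, s, t):
    """Allocate each edge (u, v) outside every E_i with
    s(u) + 1 == t(v) == k to block B_k (cf. Definition 10).
    `s` and `t` are the annotation maps computed beforehand."""
    m = len(edge_sets)
    in_query = set().union(*edge_sets) if edge_sets else set()
    blocks = {k: set() for k in range(1, m + 2)}   # B_1 .. B_{m+1}
    for (u, v) in edges:
        if (u, v) in in_query or s[u] + 1 != t[v]:
            continue
        blocks[t[v]].add((u, v))
    return blocks
```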
Combining these results, we have the following lemma for preprocessing (note that topological sorting takes linear time):

Lemma 8. There is an O(n + m) algorithm that filters {E_i} such that every remaining edge in every E_i is included in at least one good S-T path, and constructs {B_i} as described in Definition 10, where n and m are the numbers of nodes and edges in G.

2.5. Structured analysis of differential expression
We have discussed nonidentifiability-aware transcript quantification under two assumptions. In this section, we model the quantification problem under a hybrid assumption: some fragments are generated from the reference transcriptome, while others are generated by combining known junctions (valid under the setup of graph quantification). This model is more realistic than either of the two extreme assumptions.
For each transcript T_i, let [l_i^0, u_i^0] denote its range of optimal expression calculated under the complete reference transcriptome assumption (as in Section 2.2), and [l_i^1, u_i^1] the range calculated under the incomplete reference transcriptome assumption (as in Section 2.4). We use a parameter k to indicate the assumed portion of fragments generated by combining known junctions. For 0 < k < 1, we combine the two ranges into a range [l_i^k, u_i^k] that interpolates between the two extreme assumptions. These ranges are useful for analyzing the effects of nonidentifiability under milder assumptions, as we now show for differential expression.
In differential expression analysis, for each transcript we receive two sets of abundance estimates {A_i}, {B_i} under two conditions, and the aim is to determine whether the transcript is expressed more in {A_i}. With fixed k, we can instead calculate the ranges {[l_{A,i}^k, u_{A,i}^k]} and {[l_{B,i}^k, u_{B,i}^k]} as described above, where {[l_{A,i}^k, u_{A,i}^k]} are the ranges for transcript T_i under the condition corresponding to the abundance estimates {A_i}. Suppose the transcript is detected to be DE by comparing {A_i} and {B_i}, and that it is overexpressed in {A_i}.
When considering nonidentifiability, if the mean of l_{A,i}^k falls below the mean of u_{B,i}^k by a threshold, we define this transcript to be a questionable call for differential expression: if a portion k of the expression is explained by unannotated transcripts, we cannot definitely determine whether the expression of A_i is higher on average than that of B_i. This filtering of differential expression calls is very conservative (expected to filter out few calls), as most differential expression callers require much higher standards for a differential expression call.
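Under one plausible reading of this criterion (the threshold semantics and names below are our assumptions, not the paper's code), the check reduces to comparing two means:

```python
def questionable_de_call(lower_A, upper_B, threshold=0.0):
    """Flag a DE call (overexpressed in condition A) as questionable
    when the mean of the lower bounds l^k_{A,i} falls below the mean of
    the upper bounds u^k_{B,i} by more than `threshold`."""
    mean_lA = sum(lower_A) / len(lower_A)
    mean_uB = sum(upper_B) / len(upper_B)
    return mean_lA < mean_uB - threshold
```

A call survives the filter only when A's optimal-expression ranges sit clearly above B's, consistent with the conservative intent described above.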

Implementation
In the following analyses, we use the GENCODE annotation (version 26) (Frankish et al., 2018) as the reference transcriptome for constructing the splice graphs and for expression quantification. Salmon (Patro et al., 2017) and its corresponding graph quantification probabilistic model (Ma et al., 2020) are used to obtain the edge abundances of the splice graphs under the complete and incomplete reference assumptions, respectively. We evaluate the expression estimation uncertainty due to nonidentifiability by determining whether the ranking of expression can be altered under different optimal abundances. Specifically, the ranking is computed within the same sample across isoforms in Section 3.2, and for the same isoform across samples in Section 3.3, where it is statistically defined by DE analysis.

3.2. Expression of 20%-47% of transcripts has inconclusive ranking compared with sibling isoforms
We applied our methods to the Human Body Map data set (the Illumina Body Map 2.0 data, 2011; SRA accession ERX011205), which consists of 16 RNA-seq samples from 16 tissues. We are interested in evaluating the expression estimation uncertainty due to nonidentifiability, and we focus on the transcripts for which this uncertainty is so large that the ranking of expression between the transcript and its sibling isoforms cannot be determined. We use the term "sibling isoforms" of a transcript to refer to the annotated alternative splicing isoforms that belong to the same gene. For each transcript, we enumerate its sibling isoforms and compare the ranges of optimal expression estimates for the pair of transcripts to determine whether the two ranges overlap. An overlap between the two ranges indicates indecisiveness in the ranking of expression between the two transcripts.
We observe that around 35%-50% of transcripts have uncertain expression estimates due to nonidentifiability; that is, the ranges of optimal abundances of these transcripts are not single points. For the majority of them (around 20%-47% of total transcripts across all 16 samples), the expression estimate is so uncertain that the ranking of expression between the transcript and at least one of its sibling isoforms is inconclusive (Fig. 4A). The range is computed under various k values (compositions of reference transcript expression) ranging from 10% to 100%. These transcripts are potential false positives if they are predicted in isoform switch detection.
When compared with the expression estimation uncertainty caused by finite sample size (or finite sequencing depth), we observe that the sets of transcripts with inconclusive expression ranking due to the two sources of error do not overlap much (Fig. 4B). In an arbitrarily chosen Human Body Map sample (a prostate sample, SRA accession ERR030877), we set the k value to 70% to make the numbers of transcripts with uncertain expression estimates similar in the two cases. Only half of the transcripts whose uncertainty is caused by finite sample size are common with the transcripts under the model nonidentifiability case. This observation suggests that the expression uncertainty due to model nonidentifiability cannot be captured by the degree of uncertainty due to finite sample size.
The uncertainty caused by finite sample size is evaluated by bootstrapping in the Salmon software (Patro et al., 2017). Terminus (Sarkar et al., 2020) identifies groups of transcripts that have smaller quantification variance when quantified as a group than when quantified individually, based on bootstrapping or Gibbs sampling of expression estimates. Theoretically, the difference between Terminus and our approach for identifying uncertain expression estimates lies in the source of error considered and in whether an incomplete reference is modeled. In Figure 5, we show the percentage of transcripts with expression estimation uncertainty commonly identified by the two methods in the Human Body Map data set.
The computed percentages of transcripts with inconclusive ranking of expression are upper bounds: because the range of optimal expression estimates is calculated separately for each transcript, arbitrarily selecting a pair of values from the two ranges of two transcripts may not yield a jointly optimal pair of expression values in the probabilistic model. For example, selecting the maximum values for both isoforms may lead to invalid estimates in which the sum of estimated expression (before normalization) exceeds the number of observed RNA-seq reads. However, we speculate that these upper bounds are close to the true percentages, because reversing the ranking requires increasing the expression of one transcript while decreasing that of the other, and this operation is less likely to generate infeasible paired estimates.
3.3. DE transcripts are generally reliable when assuming the reference transcripts contribute >40% to the expression

Applying our method to the MCF10 cell line samples with and without epidermal growth factor (EGF) treatment (accession SRX1057418) (Kiselev et al., 2015), we analyze the reliability of the detected DE transcripts. We use Salmon (Patro et al., 2017), tximport (Soneson et al., 2015), and DESeq2 (Love et al., 2014) for the differential expression detection pipeline. This pipeline predicts 257 DE transcripts under a false discovery rate (FDR) threshold of 0.01. We use the overlap between the mean ranges of optimal expression estimated in the samples with and without EGF treatment as an indicator of unreliable DE detection (see Section 2.5 for details). The overlap is defined as the size of the intersection divided by the size of the smaller of the two ranges of optima. We compute the overlap under various k values. The threshold to declare an unreliable DE detection is 25% overlap.
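The overlap indicator can be sketched as follows, assuming both ranges have positive width (names are ours, not the paper's code):

```python
def range_overlap_fraction(range_a, range_b):
    """Size of the intersection of two ranges divided by the size of
    the smaller range; a DE detection is flagged as unreliable when
    this fraction exceeds a cutoff (25% in the text)."""
    (l1, u1), (l2, u2) = range_a, range_b
    intersection = max(0.0, min(u1, u2) - max(l1, l2))
    smaller = min(u1 - l1, u2 - l2)
    return intersection / smaller if smaller > 0 else 0.0
```

For example, the ranges (0, 4) and (2, 6) overlap on (2, 4), giving 2 / 4 = 0.5, which would exceed the 25% cutoff.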
We identify examples of both reliable and unreliable DE predictions. There are five DE transcripts whose differential expression status may change even when the reference expression proportion (1 - k) is as high as 90%. Figure 6A-C shows the lower and upper bounds of the transcript expression for three of the five transcripts. Their expression estimates suffer from such great uncertainty that the ranges of optima between the two DE groups overlap.
The five genes corresponding to the five transcripts are involved in the following cellular processes or pathways: mRNA degradation, cell apoptosis, glucose transport, DNA repair, and the inhibition and transport of certain kinases (Pruitt et al., 2011). These genes contain between 6 and 22 isoforms. Further analyses based on the DE detection of these five transcripts require caution, since they may be falsely detected as DE due to nonidentifiability.
Other than these transcripts, the detected DE transcripts are generally reliable after considering the potential expression estimation uncertainty due to nonidentifiability. The ranges of optimal expression estimates between the two sample groups do not have large overlaps when the reference transcriptome is relatively complete and contributes >40% to the expression (Fig. 6E). This observation is supported by another data set, in which replicates of naive CD8 T cells and four replicates of effector memory CD8 T cells are compared for differential expression detection, as detailed in Figure 7. There are 3152 DE transcripts under an FDR threshold of 0.01, 19 of which are unreliable even when reference transcripts compose >90% of the expression. We observe the similar pattern that most DE predictions are reliable when the reference transcript expression is >40%.
A previous study (Everaert et al., 2017) showed that expression quantification software tends to make slightly more mistakes in deciding the relative expression of isoforms within one sample than in deciding the fold change of one isoform across multiple samples. Our results here and in the previous section agree with that observation, with amplified errors in deciding the relative expression of isoforms within one sample. Our model for bounding the range of uncertainty due to model nonidentifiability provides a theoretical explanation: short sequencing reads may not be sufficient to uniquely reveal the relative abundances of transcripts from complicated splice graphs.

CONCLUSION AND DISCUSSION
We develop algorithms to compute the range of optimal expression estimates due to nonidentifiability for each transcript. We consider both the complete and incomplete reference transcriptome assumptions (quantified with reference-transcript-based quantification and graph quantification, respectively) and further provide the range of optimal estimates under mixed assumptions: a certain proportion of expression is from reference transcripts and the rest (indicated by k) is from expression of splice graph paths. The code for computing the range of expression is available at https://github.com/Kingsford-Group/subgraphquant, and the code for the related analyses at https://github.com/Kingsford-Group/subgraphquantanalysis. Applying our methods to Human Body Map samples and two RNA-seq data sets for DE transcript detection, we observe the following patterns: the ranking of expression between a transcript and its sibling isoforms in a given sample cannot be determined for many (20%-47%) transcripts once the estimation uncertainty is considered, but when comparing the expression estimates of a transcript across multiple RNA-seq samples, the detected DE transcripts are mostly reliable.
The k parameter is unknown in our model, and we address this by investigating the ranges of optima under varying reference completeness values. However, determining the k value that best fits a data set, as an indicator of reference trustworthiness, is an interesting question in itself, and we believe transcript assembly and related methods might be useful for choosing the correct k value for each data set.
The nonidentifiability problem in expression quantification is partly due to the contrast between the complex splicing structure of the transcriptome and the short length of the fragments observed in RNA-seq. Recent developments in full-length transcript sequencing might close this complexity gap by providing longer-range phasing information. However, full-length transcript sequencing still suffers from problems such as low coverage and high error rates. It remains open whether full-length transcript sequencing is appropriate for quantification and how current expression quantification methods, including this work, should be adapted to work with such data.