Not All Requirements Prioritization Criteria Are Equal at All Times: A Quantitative Analysis

Requirement prioritization is recognized as an important decision-making activity in requirements engineering and software development. Requirement prioritization is applied to determine which requirements should be implemented and released. In order to prioritize requirements, there are several approaches/techniques/tools that use different requirements prioritization criteria, which are often identified by gut feeling instead of an in-depth analysis of which criteria are most important to use. Therefore, in this study we investigate which requirements prioritization criteria are most important to use in industry when determining which requirements are implemented and released, and whether the importance of the criteria changes depending on how far a requirement has reached in the development process. We conducted a quantitative study of one completed project from one software developing company by extracting 32,139 requirements prioritization decisions based on eight requirements prioritization criteria for 11,110 requirements. The results show that not all requirements prioritization criteria are equally important, and that their importance changes depending on how far a requirement has reached in the development process.


INTRODUCTION
Requirements Prioritization (RP) is an important decision-making task in software development [1] where the objective is to determine, from a set of candidate requirements, which requirements are the most valuable and thus should be included in the product [2], and in which order they should be implemented [3]. Prioritizing requirements (i.e., determining the most valuable ones) involves making decisions based on one or several criteria, e.g., budget [4], time constraints [4], technical constraints (e.g., development cost and risk) [3,5,6], business aspects (e.g., market competition and regulations) [6], customer satisfaction [5,6], or business value [3,7]. The increasing number of requirements, both from internal (e.g., developers) and external (e.g., customers) sources, and from the availability of vast amounts of data (big data) coming from digital networks connecting an increasing number of people, devices, services, and products [8], makes RP even more difficult.
Several RP techniques have been introduced in the literature [4,6,9,10] to make RP accurate, efficient, and reliable [4]. For example, RP techniques based on new technologies such as machine learning and repository mining [5,6,11] (following the trend of big data in requirements engineering), or RP techniques based on established RP concepts such as Analytical Hierarchy Process, Numerical Assignment, Planning Game, and Cumulative Voting [4].
Regardless of whether the RP techniques are based on new technologies or established concepts, all use one or several criteria when prioritizing requirements. However, all techniques have limitations, not only related to, e.g., scalability and requirements dependencies [5,9], but also due to assumptions about project context [3,4] and assumptions about which criteria should be used when prioritizing requirements, which is often decided based on gut feeling [3]. Having a predefined set of criteria to be used in RP may lead to using misleading criteria [3], and thus to making wrong/poor decisions. Hence, it is important to have flexible RP techniques where the criteria used are based on an in-depth analysis of which criteria are the most appropriate for a given context/project [2]. There are studies (e.g., [7,12,13]) that have investigated which criteria are most commonly used in industry and/or most important/valuable when prioritizing requirements. However, most (if not all) of these studies have investigated the RP criteria by asking industry practitioners for their subjective opinion concerning which criteria are most commonly used and/or most important when prioritizing requirements.
In this paper, we investigate which RP criteria are the most important ones in industry when determining which requirements are implemented and released to the customer and which ones are dropped. To this aim, we conducted a study of one completed project from one software developing company, investigating which criteria industry practitioners actually base their decisions on when prioritizing requirements, and whether the criteria change depending on how far a requirement has reached in the development process. Our hypothesis is that some criteria are more important than others when prioritizing requirements in industry over time.
In order to test our hypothesis, we performed a quantitative study considering 83,408 RP decisions based on eight RP criteria for 11,110 features 1 from one completed project with 14 software development teams and five cross-functional teams. The extracted data was analyzed by designing, comparing, validating, and diagnosing ordinal Bayesian regression models employing a Sequential likelihood. In addition, ordered categorical predictors were modeled as category-specific effects. Finally, to better understand how these effects vary over time, a conditional effects analysis was conducted.
The results of this quantitative study show that not all RP criteria are equally important, and that the importance of a criterion changes depending on how far a requirement has reached in the development process. For example, having a high business value has an actual impact on RP early in the development process, high customer value has an impact in the middle, while being a critical requirement only has an impact at the end of the development process. Moreover, one out of the eight RP criteria used, namely the number of key customers who believed the requirement is important, had no impact at all on RP. Although the criterion dependency to other requirements had a significant impact on requirements prioritization at one point in time, it did not matter whether the requirement had dependencies to other requirements or not. That is, requirements dependencies do not have an impact on requirements prioritization.
The remainder of this paper is organized as follows. Section 2 presents related work and an introduction to Bayesian Data Analysis. Section 3 describes the design of our quantitative study, while Section 4 presents the results. Section 5 discusses the findings and Section 6 discloses the threats to the validity of our study. Finally, Section 7 gives a summary of the main conclusions.

BACKGROUND AND RELATED WORK
This section presents related work on requirements prioritization. We conclude the section by providing a brief introduction to Bayesian data analysis.

Requirements Prioritization
Several Systematic Literature Reviews (SLR) and systematic mapping studies have studied the state of the art in Requirements Prioritization (RP) [1,4,6,9,10,14,15]. Herrmann and Daneva [1] investigated RP techniques based on benefit and cost information and concluded that empirical validations of RP techniques were needed. Kaur and Bawa [14] conducted an SLR and identified seven RP techniques that were compared and analyzed. The seven RP techniques were Analytic Hierarchy Process (AHP), value-oriented prioritization, cumulative voting, numerical assignment, binary search tree, planning game, and B tree prioritization. The authors concluded that more work in the RP area is needed in order to improve the effectiveness (in terms of complexity and time consumption) of RP techniques. Pergher and Rossi [6] performed a systematic mapping study focusing on empirical studies in RP. The authors identified that accuracy, time consumption, and ease of use were the most common criteria to use when evaluating RP techniques. Moreover, the results revealed that most studies in the RP area focus on RP techniques. Achimugu et al. [9] conducted an SLR with the focus on RP techniques and their prioritization scales. The SLR identified 49 RP techniques that, in general, faced challenges related to time consumption, requirements dependencies, and scalability.
Later on, Hujainah et al. [10] conducted an SLR to identify strengths and limitations of RP techniques. The results showed that RP is important for ensuring the quality of the developed system. In addition, 108 RP techniques were identified and analyzed based on, e.g., the RP criteria used and limitations. In total, 84 RP criteria were used among the 108 RP techniques, where the criterion importance was the most frequently used criterion. Moreover, the authors concluded that the existing RP techniques have limitations with regard to scalability, requirements dependencies, time consumption, and lack of quantification, which is in line with the reported limitations in [9]. Thakurta [15] performed a systematic mapping study focusing on understanding RP artifacts, which included the objective of RP and factors that influence the overall RP process. In a recent SLR, Bukhsh et al. [4] evaluated the existing empirical evidence in the RP area, which did not only include empirical evidence related to RP techniques. The results show that AHP is the most accurate and commonly used RP technique in industry. Most of the focus in the RP literature is on proposing, developing, and evaluating RP techniques, and comparing the performance of existing RP techniques [6]. The most common approach to evaluating RP techniques is to empirically evaluate two or more RP techniques, where AHP is commonly used as one of them [4].
All RP techniques use one or several criteria for RP, where most of them use a fixed, predefined set that is used during the RP process [3]. However, the predefined criteria may not be suitable for all contexts. Thus, it is important to identify which criteria to use, and which ones are the most important to use given the context. Riegel and Doerr [3] conducted an SLR to identify and categorize prioritization criteria. In total, about 280 prioritization criteria were extracted from the literature and categorized into six main categories: benefits, costs, risks, penalties and penalty avoidance, business context, and technical context and requirements characteristics. The most frequently mentioned RP criteria in the literature were: implementation effort, resource availability, implementation dependencies, business value, customer satisfaction, and development effort. Hujainah et al. [10] identified 84 RP criteria where importance was the most used criterion among the identified RP techniques, followed by cost, business value, value, and dependency. Thakurta [15] identified several factors that influence requirements prioritization, including requirements dependencies, software architecture, business value, and stakeholder roles.
Most of the identified RP criteria in the above literature come from proposed RP techniques, and thus are selected based on gut feeling [3] and not importance. There are studies, e.g., [7,12,13], that looked into which RP criteria are used/important in industry. Berntsson Svensson et al. [12] investigated how RP is conducted in industry and which criteria are used when prioritizing requirements. The results show that cost, value, customer input, and/or no criterion are the most commonly used criteria in industry for RP. In another study, Daneva et al. [7] found that the understanding of requirements dependencies is important for RP, and that the two most important RP criteria are business value and risk. Jarzebowicz and Sitko [13] investigated agile RP in industry. The results show that business value is the most commonly used RP criterion, but other criteria such as complexity, stability, and interdependence are also used. However, these studies are based on the practitioners' subjective opinions about which RP criteria are important, and not on an in-depth analysis based on actual RP decisions.
The above indicates that the focus has been on comparing RP techniques and not on what we should measure, i.e., the criteria. Ultimately, in any analysis, what you measure and how you measure it are more important than the actual analysis. To this end, we focus on an analysis technique that allows us to take prior knowledge into account, handles disparate types of data, uses generative models, and quantifies uncertainty through probability theory, in order to investigate what effect different measurements have on requirements prioritization.

Bayesian Data Analysis
In this paper we expect the reader to have knowledge regarding the design of statistical models. In our particular case we will conduct linear regression; however, our outcome (dependent variable) is of an ordered categorical nature (i.e., compared to a count, the differences in value are not always equal), as are some of our predictors (independent variables) [16]. 2 To this end, we will design, compare, validate, and diagnose ordinal Bayesian regression models with the purpose of propagating uncertainty and making probabilistic statements by using a posterior probability distribution as explained by, e.g., Bürkner and Vuorre [17], Furia et al. [18], Torkar et al. [19].

STUDY DESIGN
In this paper, we extracted data from three databases to empirically investigate which RP criteria actually have an impact on the decisions when prioritizing requirements in industry. The aim is to empirically evaluate the impact RP criteria have on decisions when prioritizing requirements, and whether the criteria change depending on how far a requirement has reached in the development process. Our conjecture is that some criteria are more important than others when prioritizing requirements in industry, e.g., the software engineers'/teams' subjective opinions, previous experiences, intuitions [20], value and customer input [12], or requirements dependencies, software architecture, business value, and stakeholder roles [15]. The following research questions (RQ) provided the focus for the empirical investigation:
• RQ1: Which requirements prioritization criteria are the most important when determining which requirements should be implemented and released?
• RQ1.1: Does the importance of requirements prioritization criteria change depending on how far a requirement has reached in the development process?
2. Throughout the paper we use the terms variate, predicted variable, dependent variable, and outcome interchangeably. The same applies to the terms covariate, independent variable, and predictor. Since the paper's focus is on features in requirements engineering, we refrain from using that term in connection to our statistical analysis.
We conducted our analysis on one completed software project. Table 1 provides a description of the studied software project, namely the total number of features (i.e., requirements), the total number of decisions for all requirements (note that one requirement could have one or several decisions), the number of RP criteria used in the decision making, the number of involved development teams, and the number of involved cross-functional teams. The analyzed software project belongs to one software developing company from our industry collaboration network. The software developing company has a large number of completed and ongoing projects. Thus, in order to select a project to be analyzed, four criteria were identified that needed to be satisfied:
• Criterion 1: Completed project. It was important for the studied project to be completed in order to analyze all requirements and decisions during the project's life cycle. Thus, we avoided projects with a short development time, e.g., projects that were only 50% completed, since these projects would have an incomplete number of requirements and decisions made.
• Criterion 2: More than one criterion. About 280 different requirements prioritization criteria have been identified in the literature [3], while other studies, e.g. [12,15,20], have identified different criteria that are considered important in requirements prioritization. Therefore, in order to analyze which criteria actually have an impact on the decisions in industry, it was important to analyze a project that used several different requirements prioritization criteria.
• Criterion 3: Complete information. We needed reliable data in order to produce a healthy dataset (the most important aspect in any statistical analysis is the data, not what approach one uses). To that end, all information about the requirements, e.g., which state the requirement is in, needed to be complete in the databases, and all decisions (from requirements prioritization) needed to be fully documented.
• Criterion 4: Large number of requirements and decisions. In order to fully understand which RP criteria are the most important in industry, our studied project could not be too simple an example with only a few requirements and decisions made. Therefore, it was important that the studied project had a large number of requirements and decisions made, which could be seen as representative of a project at a larger software company.
These four criteria allowed us to (i) identify a project that the company identified as a representative project of the company, i.e., purposive sampling [21] ensuring representativeness, (ii) discard projects having a short development time with few requirements and decisions, and (iii) discard projects with only one or a few requirements prioritization criteria. A "gate-keeper" at the software developing company identified a suitable project that fulfilled all four criteria.

Data Extraction
We extracted data from three databases of the studied project. Fig. 1 provides an overview of our data extraction steps, which are described below.
(D1) Extract all features. The first step in the data collection and extraction phase was to extract all features that were ever considered from the completed project. For each feature, a unique ID (FeatureID) was extracted, which was used to link the feature to all requirements prioritization decisions and state(s) the feature reached in the development process (see D2 below), and to all values it had for each requirements prioritization criterion (see D3 below). In total, 11,110 features were extracted.
(D2) Extract all states for each feature. When features are discovered, it is not certain whether the feature will be included in the product release. Available resources, scope, and lead time limit the realization of any feature into the product. Therefore, to keep track of all features through the software development process, a feature can have one of seven states, namely: elicited, prioritized, planned, implemented, tested, released, or dropped (illustrated in Fig. 2). The state of a feature shows how far the feature has reached in the development process. Before a feature is given a state (the first state is elicited), a decision (i.e., requirements prioritization decision) was made to include that feature in the project. All extracted features from D1 reached at least the state of elicited, and thus are considered to be included in the project. Then, before a feature changes its state, a new requirements prioritization decision based on eight criteria (see D3 below and Table 2) was made. A feature could move (backward or forward) from one state to another, meaning a feature could have been in one or several states. When extracting all states from the second database, the FeatureID was used to link each feature to all its states in the project. In total, 83,408 decisions were extracted. Fig. 2 shows the different states for a feature, which are described below.
State Elicited: Each feature that has been through a prefeasibility phase and has been prioritized (i.e., a decision is made to include the feature for the next step) reaches the state Elicited.
State Prioritized: At regular intervals, features with the state of Elicited are being reviewed in a feasibility review for possible inclusion into the product. After the feasibility review, a decision (i.e. requirements prioritization) is made. Features that are prioritized to be included get the state Prioritized.
State Planned: Features that have the state Prioritized are being reviewed in an analysis review. In the analysis review, each feature is analyzed based on, e.g., scope, adding details to the feature, estimations, and a more elaborate specification of the feature is created. After the analysis, the features are prioritized (i.e., a decision is made) to be included in the product or not. All features that are prioritized to be included in the product get the state Planned. These features are input for the design, coding, and iteration/sprint planning. Finally, these features are added to the product backlog.
State Implemented: From the product backlog, features are selected (i.e. prioritized) for development. When the features are developed, which includes technical design, coding, and unit tests, the features get the state Implemented.
State Tested: Although the implemented features include some testing, e.g., unit tests, a decision (requirements prioritization) is made about which features will be included for a more thorough testing process in order to ensure an adequate level of quality before an implemented feature is released. Features that are selected for, and pass, the testing get the state Tested.
State Released: When all activities have been completed for the features with the state Tested, a decision (i.e. requirements prioritization) is made about which features should be released. The features that are selected for being released get the state Released.
State Dropped: A feature can be rejected/dropped at any time in the process (until state Released). These features get the state Dropped. Dropped features are not deleted from the backlogs/repositories/storage to enable future analysis.
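To make the state descriptions above concrete, the following minimal Python sketch represents the seven states and a feature's state history. It is an illustration only: the names State, PIPELINE, furthest_state, and can_drop are hypothetical and do not come from the studied company's databases.

```python
from enum import Enum


class State(Enum):
    """The seven feature states described in the text."""
    ELICITED = "elicited"
    PRIORITIZED = "prioritized"
    PLANNED = "planned"
    IMPLEMENTED = "implemented"
    TESTED = "tested"
    RELEASED = "released"
    DROPPED = "dropped"


# Forward order of the development process (Dropped is not a step in it).
PIPELINE = [
    State.ELICITED, State.PRIORITIZED, State.PLANNED,
    State.IMPLEMENTED, State.TESTED, State.RELEASED,
]


def furthest_state(history):
    """Return the furthest pipeline state a feature ever reached.

    `history` is the chronological list of states; features may move
    backward as well as forward, so the last entry is not enough.
    """
    reached = [s for s in history if s is not State.DROPPED]
    return max(reached, key=PIPELINE.index)


def can_drop(history):
    """A feature can be dropped at any time until it has been released."""
    return history[-1] not in (State.RELEASED, State.DROPPED)
```

For example, a feature that was planned, moved back to prioritized, and then dropped has Planned as its furthest state.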
(D3) Extract all RP criteria and their values. The FeatureID and States were used to extract the RP criteria and their values for each feature and its state(s). We extracted data from all RP criteria that were used when prioritizing a feature in this project. In total, eight different RP criteria were used each time a feature was prioritized.
(D4) Merge data. After step 3 (D3) was completed, we merged all extracted data from all three databases (D1-D3) using FeatureID and State. This allowed us to remove incomplete information, e.g., features without a state and empty values for the RP criteria. Table 2 provides an overview of all extracted data from D1-D3.
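The merge in D4 can be sketched as a join on the pair (FeatureID, State) followed by the removal of incomplete records. The Python sketch below is a minimal illustration under assumed data shapes; the function name merge_records and the layout of the three inputs are hypothetical and do not reflect the actual database schemas.

```python
def merge_records(feature_ids, state_rows, criteria_by_key):
    """Join the three extracts on (FeatureID, State).

    feature_ids:     iterable of FeatureIDs (D1)
    state_rows:      list of (feature_id, state) tuples (D2)
    criteria_by_key: dict mapping (feature_id, state) to a dict of
                     criterion name -> value (D3)
    """
    known = set(feature_ids)
    merged = []
    for feature_id, state in state_rows:
        values = criteria_by_key.get((feature_id, state))
        # Skip rows for unknown features or rows without criteria.
        if feature_id not in known or values is None:
            continue
        # Skip rows where any criterion value is empty.
        if any(v is None for v in values.values()):
            continue
        merged.append({"FeatureID": feature_id, "State": state, **values})
    return merged
```

Features without any state never appear in state_rows and are therefore excluded from the merged result.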
Due to non-disclosure agreements, the empirical data, e.g., FeatureID, variable names, and values, are not allowed to be revealed. Hence, we generated a synthetic dataset with descriptive names and values. The modifications of the real data include changing the real FeatureID to a random ID drawn without replacement from 1 to 11,110. Moreover, all variable names (column Variable in Table 2) were changed to descriptive and generic names that describe the purpose of the variable, in line with current literature. In addition, the values (column Value(s) in Table 2) have been modified to descriptive values. For example, the values for the variable State are changed to names that describe the state of a feature. Section 3.2 provides descriptive statistics of the merged data.

Descriptive Statistics
Looking at the variables that eventually will be modeled as category-specific effects (i.e., those that are ordered categorical), one can see that for Architects' involvement ...
Concerning Number of stakeholders (Fig. 3b), the absolute majority of the features have only one, while for Number of key customers (Fig. 3c), most features have zero. Finally, concerning the variable Priority (Fig. 3d), most features have zero in priority, and are then, more or less, spread out to priority 1000, where there is another peak.
After gaining some insight concerning the variables, we next turn our attention to statistical model development where we design, compare, validate, and diagnose statistical models to conduct inferences. (Fig. 1 provides an overview of our statistical model development.) All steps in the analysis can be replicated by downloading the replication package, preferably with Docker installed. 3 The empirical data used in this manuscript is unfortunately not generally available due to an NDA. However, we have generated a synthetic dataset so anyone can follow the analysis step-by-step and reach very similar results.
3. https://github.com/torkar/feature-selection-RBS DOI: 10.5281/zenodo.4646845

Statistical Model Development
There are several ways to model ordered categorical (ordinal) data, but not until quite recently was it possible to use them easily in Bayesian data analysis. Software engineering, generally speaking, handles ordered categorical data by assuming that the conclusions do not depend on whether a regression or an ordinal model is used. The problem is, of course, that relying on an incorrect outcome distribution will lead to subpar predictive capabilities of the model [17]. This, in combination with the fact that effect size estimates will be biased when averaging multiple ordinal items, and that data can be non-normal, is something a researcher should want to handle [22].
Today, we have at least three principled ways to model ordinal data: Adjacent category [17], Sequential [23], and Cumulative models [24]. These models have been developed and refined in a Bayesian framework mostly because of needs from other disciplines, such as psychology [17].
First, Adjacent category models can be used when predicting the number of correct answers to several questions in one category (think of a math module for the SAT or the PISA tests). We could perceive that our underlying data-generation process could be modeled this way.
Sequential models, on the other hand, assume that the outcome results from a sequential process and that higher responses are only possible if they pass lower responses; which is very much the case for our outcome State.
Finally, Cumulative models assume that the outcome, e.g., observed Likert scale values, stems from a latent (not observable) continuous variable [17].
In the case of Sequential models, we can model ordinal predictors as category-specific effects, while in Cumulative models, predictors are modeled as monotonic effects, the latter in order to avoid negative probabilities [16]. The main reason for modeling predictors this way is to gain a more fine-grained view of the parameters (i.e., how much does each category, in a predictor, affect the outcome). As an example, suppose that our variable Business value has a stronger impact when moving from State 3 to 4 (Fig. 3a). This pattern would be invisible if not modeled appropriately.
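As an illustration of how a Sequential model with category-specific effects assigns probability to the outcome levels, the following Python sketch uses one common parameterization (the stopping-ratio form with a logit link), in which the probability of stopping at level k, given that level k was reached, is F(τ_k − η_k). The sign convention, the function names, and the example values are assumptions made for illustration and are not taken from the fitted model.

```python
import math


def inv_logit(x):
    """Logistic function, mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def sequential_probs(cutpoints, etas):
    """Outcome probabilities for a sequential (stopping-ratio) model.

    cutpoints: K-1 thresholds, one per transition between levels.
    etas:      K-1 linear predictors, one per transition; letting them
               differ per transition is what a category-specific
               effect amounts to.
    Returns a list of K probabilities that sum to one.
    """
    probs = []
    still_in = 1.0  # probability of having passed all earlier steps
    for tau, eta in zip(cutpoints, etas):
        stop = inv_logit(tau - eta)   # stop at this level, given arrival
        probs.append(still_in * stop)
        still_in *= 1.0 - stop        # continue to the next level
    probs.append(still_in)            # last level: passed every step
    return probs


# Six levels (five cutpoints); the predictor only acts on the third step.
baseline = sequential_probs([0.0] * 5, [0.0] * 5)
shifted = sequential_probs([0.0] * 5, [0.0, 0.0, 1.0, 0.0, 0.0])
```

Comparing baseline and shifted shows why a single common coefficient would hide such a pattern: only the probabilities from the third level onward change.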
(SMD1) Selection of likelihood. The first step concerning model design is often to decide which likelihood to use for inference: the Cumulative, Sequential, or Adjacent-category. This can be done by designing six statistical models and estimating their pointwise out-of-sample prediction accuracy (this, we would claim, is state of the art concerning model comparison, as introduced by Vehtari et al. [25]). In Table 3, the result of the model comparison is presented, and it is clear that the Sequential model with predictors modeled as category-specific effects (where possible) has, relatively speaking, better out-of-sample prediction accuracy.
In summary, at the 99% level, the first model is significantly better than the second (since the difference does not cross zero). The only difference between the first two models is that we model predictors, when possible, as category-specific effects. Next, it is also notable that we do not see the same effect using a Cumulative model and modeling predictors as monotonic (rows 4-5). Finally, the last line is our null model (M_0), which is a model that does not use any predictors and, thus, only models grand means, i.e., the cutpoints. By looking at ∆elpd, we see that adding predictors to our model (rows 1-5) has a clear effect compared to M_0. Hence, the conclusion concerning the model comparison is that the Sequential model, using category-specific effects, is our target model for now, i.e., M_s[cs] = M. Next, we need to set appropriate priors.
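The statement that a difference "does not cross zero" can be illustrated with a small Python sketch. It assumes that each model provides pointwise expected log predictive density (elpd) values for the same observations, and that the interval is formed with a normal approximation around the summed difference; the function name and the input numbers are hypothetical.

```python
import math
from statistics import NormalDist, variance


def elpd_difference(pointwise_a, pointwise_b, level=0.99):
    """Compare two models on their pointwise elpd contributions.

    Returns the summed difference (a minus b), its standard error,
    and a normal-approximation interval at the given level.
    """
    diffs = [a - b for a, b in zip(pointwise_a, pointwise_b)]
    n = len(diffs)
    total = sum(diffs)
    # Standard error of a sum of n pointwise differences.
    se = math.sqrt(n * variance(diffs))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return total, se, (total - z * se, total + z * se)


# Hypothetical pointwise values for two candidate models.
model_1 = [-1.0, -1.2, -0.9, -1.1]
model_2 = [-1.5, -1.6, -1.4, -1.7]
delta, se, (low, high) = elpd_difference(model_1, model_2)
crosses_zero = low <= 0.0 <= high
```

If crosses_zero is False, the first model would be preferred at the chosen level under this approximation.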
(SMD2) Prior and posterior predictive checks. For our candidate model, we have several parameters in need of appropriate priors. One way to decide on priors is to make sure that the combination of all priors should be nearly uniform on the outcome scale and that impossible values should not be allowed.
Using a Sequential(φ, κ) model we know that more probability mass could be set in the beginning (potentially all features could be dropped in State 1), and then we should assign less probability mass for each following level in our outcome; we have six levels in our outcome, i.e., State 1-6 (Fig. 3b). 4 The complete model design, with priors, spans six lines. On the first line we assume that State should be modeled using a Sequential likelihood. On Lines 2-4 we provide our linear model with all predictors and the parameters we want to estimate (β_1, ..., β_8). Ordered categorical predictors are modeled as category-specific effects, i.e., cs(). As is evident, we model φ with a logit link function (Line 2), in order to translate back to the log-odds scale from the probability scale (0, 1). Finally, on Lines 5-6, we set priors on our parameters. The intercept (cutpoint) priors for κ are wider since we can expect them to vary more. For our β parameters, N(0, 1) might seem very tight, but it still implies a prior variance of σ^2 = (1 · 8)^2 = 64 for the model.
4. Conventions for writing mathematical forms of Sequential models vary somewhat, but we will use Sequential(φ, κ), where φ is our linear part and κ the intercepts we want to estimate, i.e., the cutpoints between each step in our outcome.
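The six-line structure described above can be written out as in the following LaTeX sketch. It is a reconstruction from the prose only: the order of the predictors x_1, ..., x_8, the exact placement of cs(), and the scale σ_κ of the cutpoint prior are assumptions, since the text only states that three ordered categorical predictors are category-specific and that the κ prior is wider than N(0, 1).

```latex
\begin{align*}
\text{State}_i &\sim \text{Sequential}(\phi_i, \kappa) \\
\operatorname{logit}(\phi_i) &= \beta_1\,\operatorname{cs}(x_{1,i})
    + \beta_2\,\operatorname{cs}(x_{2,i})
    + \beta_3\,\operatorname{cs}(x_{3,i}) \\
  &\quad + \beta_4 x_{4,i} + \beta_5 x_{5,i} + \beta_6 x_{6,i} \\
  &\quad + \beta_7 x_{7,i} + \beta_8 x_{8,i} \\
\beta_1, \ldots, \beta_8 &\sim \mathcal{N}(0, 1) \\
\kappa &\sim \mathcal{N}(0, \sigma_\kappa), \qquad \sigma_\kappa > 1
\end{align*}
```

Here x_1, x_2, x_3 stand for the three ordered categorical predictors and x_4, ..., x_8 for the remaining five.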
A visual view can be given by sampling from the priors only, i.e., prior predictive checks, and with priors and data, i.e., posterior predictive checks (see Fig. 4 for a comparison).
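A prior predictive check of this kind can be sketched by simulating outcomes from the priors alone. The Python sketch below assumes standardized predictors, the stopping-ratio form of the Sequential model, N(0, 1) priors on the eight β parameters, and a hypothetical cutpoint prior scale of 2.0 (the text only says that this prior is wider); none of these simulation settings are taken from the replication package.

```python
import math
import random


def inv_logit(x):
    """Logistic function, mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def prior_predictive(n_draws, n_cutpoints=5, n_predictors=8,
                     cutpoint_sd=2.0, beta_sd=1.0, seed=1):
    """Simulate outcome levels using draws from the priors only.

    Returns the proportion of simulated outcomes per level.
    """
    rng = random.Random(seed)
    counts = [0] * (n_cutpoints + 1)
    for _ in range(n_draws):
        cutpoints = [rng.gauss(0.0, cutpoint_sd) for _ in range(n_cutpoints)]
        betas = [rng.gauss(0.0, beta_sd) for _ in range(n_predictors)]
        x = [rng.gauss(0.0, 1.0) for _ in range(n_predictors)]  # standardized
        eta = sum(b * xi for b, xi in zip(betas, x))
        level = 0
        # Move on to the next level until the process stops.
        while level < n_cutpoints and rng.random() >= inv_logit(cutpoints[level] - eta):
            level += 1
        counts[level] += 1
    return [c / n_draws for c in counts]


proportions = prior_predictive(2000)
```

Plotting proportions against the observed distribution of State gives the kind of comparison that a prior predictive check is meant to support; refitting with data and repeating the simulation from the posterior gives the posterior predictive counterpart.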
(SMD3) Diagnostics. When using dynamic Hamiltonian Monte Carlo we have a plethora of diagnostics, which we should utilize to ensure validity and efficiency of sampling.
Here follows a short summary of the most common diagnostics and the outcome of these diagnostics for our model.
There should be no divergences since they are an indication that the posterior is biased (non-stationary); they mainly arise when the posterior landscape is hard for the sampler to explore (a validity concern). No divergences were reported. Tree depth warnings are not a validity concern but rather an efficiency concern. Reaching the maximum tree depth indicates that the sampler is terminating prematurely to avoid long execution time [26]. No warnings were reported.
Having low energy values (E-BFMI) is an indication of a biased posterior (validity concern). No warnings were reported.
The R̂ convergence diagnostic indicates whether the independent chains converged, i.e., explored the posterior in approximately the same way (validity concern) [27]. It should converge to 1.0 as n → ∞. The R̂ diagnostic was consistently < 1.01.
The effective sample size (ESS) captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm, for each parameter (efficiency concern). The higher the better. When ESS ≈ 0.1 one should start to worry, and in absolute numbers we should be in the hundreds for the Central Limit Theorem to hold. The ESS diagnostic was consistently > 0.2.
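The idea behind this diagnostic can be sketched with the classic between-chain versus within-chain variance ratio. The Python sketch below implements that basic form only, which is an assumption made for illustration and not necessarily the variant used in the analysis (more recent variants split the chains and rank-normalize the draws); the function name r_hat is hypothetical.

```python
import math
import random
from statistics import mean, variance


def r_hat(chains):
    """Basic potential scale reduction factor for one parameter.

    chains: list of equally long lists of draws, one list per chain.
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)    # W: average within-chain variance
    between = n * variance(chain_means)           # B: between-chain variance
    pooled = (n - 1) / n * within + between / n   # pooled variance estimate
    return math.sqrt(pooled / within)


rng = random.Random(0)
# Four chains sampling the same distribution: value close to 1.
good = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
# One chain stuck in a different region: value clearly above 1.
bad = good[:3] + [[rng.gauss(3.0, 1.0) for _ in range(500)]]
```

Values near 1.0, as for good, suggest that the chains agree, while bad produces a clearly inflated value.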
Finally, the Monte Carlo Standard Error (MCSE) was checked for all models. The MCSE is yet another diagnostic that reflects effective accuracy of a Markov chain by dividing the standard deviation of the chain with the square root of its effective sample size (validity concern). Having reached some confidence that the target model is representing the data-generation process adequately, while assuring the validity and efficiency concerning the sampling, we next turn our attention to model inferences.
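The relation between the standard deviation of the draws, the effective sample size, and the MCSE can be sketched in a few lines of Python. The function names and the example numbers are hypothetical, and the effective sample size is taken as given rather than estimated from the draws.

```python
import math
from statistics import stdev


def mcse(draws, ess):
    """Monte Carlo standard error of the posterior mean estimate:
    the standard deviation of the draws divided by sqrt(ESS)."""
    return stdev(draws) / math.sqrt(ess)


def ess_ratio(ess, total_draws):
    """Effective sample size as a fraction of all draws."""
    return ess / total_draws


example_draws = [1.0, 2.0, 3.0, 4.0, 5.0]
example_mcse = mcse(example_draws, ess=4)
example_ratio = ess_ratio(800, 4000)
```

A larger effective sample size lowers the MCSE for the same spread of draws, which is why the two diagnostics are usually read together.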
(SMD4) Inferences. The next section will provide results from the model by listing all parameter estimates (in our case the cutpoints, κ, are not relevant, but we shall focus on β_1, ..., β_8) and plot them.
In particular, we will analyze the category-specific effects that were modeled. Does the fine-grained view, which the category-specific modeling of predictors provides us with, tell us a story about how each category, part of a predictor, affects the outcome?
Finally, we will present a number of conditional effects. The latter concept is an excellent way to better understand the effect a specific predictor has on the six outcomes. Not only the size of the effect will be visible, but also how it varies depending on a number of factors.

RESULTS
Before we explain the concept of conditional effects, we will investigate the model's results as-is. First, Table 4 consists of all effects we are interested in (i.e., the β's). All rows in bold indicate a significant effect, that is, the 95% credible interval of an effect's distribution (density), does not cover zero (l-95% and u-95% columns in the table). (Fig. 5 provides a visualization of the table that is, perhaps, more straightforward to assimilate.) Even though we are not very interested in an effect's point estimate per se, let us take Stakeholders, as an example, i.e., µ = −0.05 CI 95% = [−0.08, −0.02]. Recall from Sect. 3.2 that Stakeholders could vary (0, . . . , 10) and was used to indicate how many stakeholders a particular feature had. First, we transform the value using inverse logit, since the model used a logit() link function, i.e., exp(−0.05)/(exp(−0.05) + 1) = 0.49.
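The inverse-logit step for the Stakeholders estimate can be checked with a short sketch. Only the three numbers (−0.05, −0.08, −0.02) come from the text; the function name is an assumption, and applying the same transformation to the interval bounds is added here purely for illustration.

```python
import math

def inv_logit(x):
    """Inverse of the logit link: exp(x) / (exp(x) + 1)."""
    return 1.0 / (1.0 + math.exp(-x))

estimate = inv_logit(-0.05)   # point estimate reported for Stakeholders
lower = inv_logit(-0.08)      # lower bound of the 95% credible interval
upper = inv_logit(-0.02)      # upper bound of the 95% credible interval
```

Rounded to two decimals, `estimate` matches the 0.49 reported above; because the inverse logit is monotone, the transformed bounds keep their order.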
In order to improve sampling, all variables, where appropriate, were centered and scaled, i.e., for all values, we removed the variable's mean (in this case, µ_x = 1.05), and then scaled each value by dividing it with the variable's standard deviation (σ_x = 0.53). Hence, to receive the original scale we do the opposite, i.e., 0.49 · 0.53 + 1.05 = 1.31.
Another thing that is clear, by looking at the table and the figure, are a number of significant effects on three parameters that we modeled as category-specific effects.
[Fig. 5 (plot residue): rows for business value [2] to [5], customer value [1] to [5], and architects [1] to [5]; x-axis from −0.5 to 1.0.]
If we take customer value as an example, we can claim that customer value has a positive effect on the third and fourth cutpoints, i.e., the cutpoints between States 3/4 and between 4/5, are pushed up, leading to more probability mass being assigned to the lower levels, i.e., State ≤ 4 (Elicited, Prio, Planned, Implemented, Dropped, and below). This detail would have been impossible to notice without modeling the effect as category specific (see footnote 5). However, what is even more interesting is the fact that we have a combined posterior probability distribution for all effects, i.e., it is possible to see how each effect varies when fixing all other covariates to their mean or reference level.
Footnote 5. In the replication package one can see that not modeling category-specific effects would miss that architects' involvement actually has some interesting effects.

Conditional Effects
Conditional effects allow us to fix all predictors to their mean, or reference category for factors, except for the one we want to understand better. If we plot our significant effects for our continuous covariate, one can see how the effect varies depending on State (Fig. 6).
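The idea of a conditional effect can be illustrated with a small sketch that varies one standardized predictor while all others are held at their mean (zero after centering) and computes the resulting probability of each of the six states. This assumes a plain cumulative logit model; the cutpoints and the coefficient are hypothetical, and the model family and estimates used in the study may differ.

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def category_probabilities(cutpoints, eta):
    """P(State = k) under a cumulative logit model for linear predictor eta."""
    cutpoints = np.asarray(cutpoints, dtype=float)
    cumulative = inv_logit(cutpoints - eta)                  # P(State <= k), k = 1..K-1
    cumulative = np.concatenate(([0.0], cumulative, [1.0]))  # add P(<=0)=0 and P(<=K)=1
    return np.diff(cumulative)                               # P(State = k), k = 1..K

# Hypothetical values: five cutpoints give six states; beta is one predictor's effect.
cutpoints = [-2.0, -1.0, 0.0, 1.0, 2.0]
beta = 0.8
grid = np.linspace(-2.0, 2.0, 5)   # standardized predictor; all others fixed at 0
curves = np.array([category_probabilities(cutpoints, beta * x) for x in grid])
```

Each row of `curves` holds the six state probabilities for one predictor value, so plotting a column against `grid` gives one line of a conditional-effects plot. With a positive coefficient, the probability of the highest state rises and that of the lowest state falls as the predictor increases.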
Let us now go through these plots one by one and make notes about the particular characteristics of an effect. The top-left plot in Fig. 6 presents the effect Priority. What is evident, compared with the other plots, is that the uncertainty is low since the bands surrounding each line are tightly following the line. If we look at State 6 (a feature is released), we can see that there is a much higher probability (y-axis) for State 6, as we move to the right (the priority increases). Also, as expected, for States 1 and 2 a high Priority is uncommon.
Examining the next plot (clockwise), one can see that for State 6, there is a clear change when moving from 'No' to 'Yes' (albeit with a slight increase in uncertainty). Next, in Dependency, which is also a dichotomous variable, there is a difference between State 6 and States 1 and 2. In short, if we have a dependency, there is a greater probability that the feature will end up in State 6, while the opposite holds for States 1 and 2. Also worth noting is that State 3 has the highest probability of having a dependency (and then it does not matter if it moves from 'No' to 'Yes').
Finally, for Stakeholders, one can see that, for States 1 and 2, the more stakeholders a feature has, the higher the probability that it will end up in those states, which might sound counterintuitive; however, we also have an increase in uncertainty. The opposite holds for State 6, but once again with greater uncertainty when increasing the number of stakeholders.
We will refrain from plotting the last three significant category-specific effects, and simply conclude by saying that all three effects contain categories that affect the outcome positively or negatively.
To summarize this section, we have seen that analyzing estimates drawn from the posterior probability distribution provides us with indications of significant effects (by looking at the standard deviations and credible intervals). The conditional effects analysis, i.e., fixing all other variables except for one, provided insight into how an effect varies (an effect's size is, by itself, not always exciting, but rather how it varies depending on context).
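The criterion for a significant effect used in this section (the 95% credible interval does not cover zero) can be sketched as follows. The draws are simulated under assumed means and standard deviations, and the equal-tailed interval is one common choice; neither is taken from the replication package.

```python
import numpy as np

def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws."""
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(draws, [tail, 1.0 - tail])
    return float(lower), float(upper)

def excludes_zero(draws, level=0.95):
    """True if the whole interval lies on one side of zero."""
    lower, upper = credible_interval(draws, level)
    return lower > 0.0 or upper < 0.0

# Simulated posterior draws (hypothetical, for illustration only).
rng = np.random.default_rng(7)
clear_effect = rng.normal(loc=-0.05, scale=0.015, size=4000)
uncertain_effect = rng.normal(loc=0.01, scale=0.10, size=4000)
```

Here `excludes_zero(clear_effect)` holds because the whole interval is negative, whereas the interval for `uncertain_effect` spans zero, which mirrors how an effect with large uncertainty ends up not being marked as significant.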

DISCUSSION
In this section, the results are discussed and related to the literature. Section 5.1 discusses the first research question, while the second research question is discussed in Section 5.2. Finally, Section 5.3 discusses general findings.

Most important RP criteria (RQ1)
In analyzing RQ1, which requirements prioritization (RP) criteria are most important when determining which requirements should be implemented and released, we looked into which RP criteria the company deemed most important to use in their project (i.e., which ones are actually used), and which ones have an actual impact when deciding which features should be implemented and released. However, we did not investigate if the used RP criteria are the most appropriate ones to use. This decision was made by the company and is not part of this study.
Looking into which RP criteria the company deemed most important to use in practice, eight RP criteria were used when prioritizing requirements, namely: Team priority, Critical feature, Stakeholders, Key customers, Business value, Customer value, Architects' involvement, and Dependency. All eight RP criteria used by the case company are identified in the literature (e.g., in [3,7,10,12,15,28,29]). Business and customer value are native to agile [29], and used in industry when prioritizing requirements [7,12]. Although expert opinion (people's previous experiences, opinions, intuitions, various criteria, arguments, or a combination of one or several of these information sources) is not identified as an RP criterion in the literature, it is often used when prioritizing requirements [12,30,31]. However, there is a difference between how expert opinion is used in the analyzed project and what is reported in the literature.
The difference is that the team's expert opinion (called Team priority) is a specified criterion for RP where the teams decide on a value (between 0 and 1000) that represents their expert opinion, which is then used when prioritizing requirements.
One interesting finding is related to the importance of a requirement. Hujainah et al. [10] indicate that importance was the most frequently used RP criterion in the identified requirements prioritization techniques/tools. According to Hujainah et al., importance refers to how important a requirement is to the stakeholders. This definition is in line with [15], which defines importance as the subjective evaluation of a requirement by stakeholders. However, stakeholders include several different types of stakeholders, e.g., users, customers, the project team, marketing/business department, and competitors, and thus it is not clear which perspective is used. Riegel and Doerr [3] report on several different perspectives of importance identified as RP criteria, e.g., project importance with regards to overall project goal, importance to business goals, and importance to customers. In the analyzed project, three different perspectives of importance were used when prioritizing requirements as three separate criteria, namely: (i) from the project's perspective (Critical feature), (ii) from an internal stakeholder perspective (Stakeholders), and (iii) from a customer perspective (Key customers). In the literature, importance is often used in pair-wise comparisons, to produce an ordered list of requirements based on importance, or from a cost-value perspective. However, in the analyzed project, importance from the stakeholder and customer perspectives was simply measured by counting how many internal key stakeholders considered the feature/requirement to be important and counting how many key customers (customer perspective) considered a feature/requirement to be important.
One surprising finding, when comparing the used RP criteria in the analyzed project with the literature, is that implementation/development effort/cost is not used at all, despite being frequently mentioned in the literature (e.g., in [10,15]), and being the most frequently mentioned criterion in [3]. Moreover, there are several RP techniques/tools that are based on cost/effort [4,10], and it has been reported to be used in industry when prioritizing requirements [7,12]. Although various cost/effort estimations are performed at the company and for the analyzed project, cost/effort is not deemed an important criterion for requirements prioritization. One possible explanation may be that cost/effort estimations are considered by the team when setting their own priority (called Team priority), but not explicitly used when prioritizing requirements. However, it is not possible to confirm or reject this explanation based on the extracted data. We can only conclude, based on the extracted data, that implementation/development effort/cost is not considered an important RP criterion at the company when determining which requirements should be implemented and released, which is not in line with the literature.
The company uses eight RP criteria in the analyzed project, but just because they are used does not mean that they have an actual impact when determining (i.e., requirements prioritization) which requirements should be implemented and released. Therefore, we analyzed 83,408 decisions for 11,110 features to see which of the eight RP criteria have an actual impact, i.e., which ones are good predictors of which requirements are implemented and ultimately released. Based on the results in Sect. 4 (see Table 4 and Fig. 5), seven out of the eight RP criteria used in the analyzed project have an actual impact on requirements prioritization. The only criterion that did not have an actual impact is Key customers. One reason why Key customers does not have an actual impact is that the uncertainty is too large. This means that, from a requirements prioritization perspective, it does not matter whether key customers believe a feature/requirement is important or not; it has no impact when determining which requirements will be implemented and released.
Although the RP criterion Dependency is significant, meaning it has an actual impact on requirements prioritization, it is a weak predictor for requirements that are implemented and released, which is shown in Table 4 and Fig. 5. This result is not in line with previous studies [3,5,7,10,15,28]. In fact, our result shows the opposite, that dependencies do not have an impact on requirements prioritization. This is surprising since requirement dependencies are important when prioritizing requirements and deciding the order in which the requirements can be implemented [28]. Some requirements need to be satisfied according to conditions of other requirements, while others may have to be implemented together [32]. According to Shao et al. [5], requirement prioritization results that do not consider requirements dependency can rarely be used. In addition, requirement dependencies are used as an RP criterion in industry [7], are frequently mentioned in the literature [3,15], and are used in several requirements prioritization techniques/tools [10]. One possible explanation for the difference between this study and the literature is that we have not asked industry practitioners what they consider (i.e., their subjective opinion) to be important when prioritizing requirements, nor have we used our own opinion or previous studies to decide which criteria have an actual impact on requirements prioritization. Instead, we investigated the actual outcome of 83,408 requirements prioritization decisions for one completed project at one software developing company. To the best of our knowledge, no other study has analyzed the actual outcome of requirements prioritization decisions in industry to identify which RP criteria have an actual impact, and definitely not with such a large sample.

Importance of RP criteria depending on the state (RQ1.1)
As shown in Fig. 7, different RP criteria have different impact on requirements prioritization depending on the state of the requirement, i.e., depending on how far the requirement has reached in the development process. Meaning, some criteria have a high impact on requirements prioritization early in the development process, others in the middle, while some have a high impact at the end.
We now look more closely at which RP criteria are important depending on the state of the requirement. For State 1, the more internal stakeholders (Stakeholders) that consider a feature to be important, the more likely it is that the requirement is prioritized to be included in the project. This is natural since having stakeholders interested in a requirement would lead to the requirement being included in the first place. However, for all other states (State 2 to State 6), there is a lower probability for a requirement to be included (i.e., prioritized) with an increasing number of stakeholders who consider the requirement to be important, which is not in line with the literature [3,10,15].
When moving from State 1 to State 2, the higher the business value and the more architects are involved (Architects' involvement), the higher the probability that a requirement reaches State 2. For a requirement to reach State 3 (i.e., moving from State 2 to State 3), the higher the involvement from architects, the more likely it is that a requirement reaches State 3. When a requirement moves from State 3 to State 4, high customer and business value have an impact on the requirement prioritization, i.e., the higher customer and business value a requirement has, the more likely it is that it will reach State 4. In addition, a medium Team priority (i.e., not too low and not too high) had an impact on the requirements prioritization in State 4. High Customer value is important for a requirement to move from State 4 to State 5, while for a requirement to reach State 6 (moving from State 5 to State 6), being a Critical feature and considered important for the team (i.e., having high Team priority) increases the probability of the requirement being released.
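The state-dependent pattern described above can be written down as a small lookup structure, which is one way a prioritization tool could apply different criteria at different points in the process. The mapping restates only what the preceding paragraph reports; the structure itself and the helper names are hypothetical and are not part of the study.

```python
# Criteria reported in the text as having an impact on each state transition.
CRITERIA_BY_TRANSITION = {
    (1, 2): ["Business value", "Architects' involvement"],
    (2, 3): ["Architects' involvement"],
    (3, 4): ["Customer value", "Business value", "Team priority"],
    (4, 5): ["Customer value"],
    (5, 6): ["Critical feature", "Team priority"],
}

def criteria_for(from_state, to_state):
    """Criteria with a reported impact on the given transition (empty if none)."""
    return CRITERIA_BY_TRANSITION.get((from_state, to_state), [])

def transitions_using(criterion):
    """All transitions on which the given criterion has a reported impact."""
    return [t for t, names in CRITERIA_BY_TRANSITION.items() if criterion in names]
```

For example, `transitions_using("Team priority")` returns the 3-to-4 and 5-to-6 transitions, while `transitions_using("Key customers")` returns an empty list, reflecting that this criterion had no impact at any point.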
For a requirement to be considered in an iteration/sprint planning meeting (to reach State 3), three criteria have an actual impact. First, a high number of internal stakeholders need to consider the feature to be important, then the feature needs to have a high business value (i.e., to be considered valuable for the company), and involve the software architects. That software architects need to be involved makes sense since it is important to analyze if the included requirements have any negative impact on the current architecture, and/or if the technical debt would increase. However, just because a requirement may have a negative impact on the current architecture and/or the technical debt, it does not mean it will not be included in the project. It means that it is important to get this information/knowledge from the experts (i.e., the software architects) before making decisions about the requirement.
One interesting and surprising finding is that only internal value (Business value and Stakeholders) and not external value (Customer value and Key customers) has an actual impact on deciding which requirements reach State 3. That is, among all requirements that are prioritized to be included in the project until the iteration/sprint planning meeting, only internal value is considered, while the customer perspective is totally ignored. The criterion Customer value only starts having an actual impact when a requirement moves from State 3 to State 4, and from State 4 to State 5, while Key customers does not have any impact at all when prioritizing requirements. This means that requirements in the early phases of the development process with high customer value may not be prioritized to be included in the project if the business value is low. This is not in line with [7] where the focus is on combining value-creation for the vendor (i.e., Business value) with value-creation for the customer (i.e., Customer value).
The findings in this study show that the team's expert opinion/experiences/subjective opinion etc. only have an impact at the very end of the development process (State 6). This is not in line with the literature [30,31], which suggests that the decisions and selection of what to include are commonly based on previous experiences, opinions, intuitions, arguments, or a combination of one or several of these information sources. Instead, up until State 6, the decisions (i.e., requirements prioritization) are based on the internal stakeholders' view of the importance of the feature, business value, and finally customer value. However, in State 6, Team priority has a very large effect (probability mass close to 70%) on which requirements should be released.
Hujainah et al. [10] investigated which RP techniques/tools have been published in the literature and which criteria are used in these RP techniques/tools. The most commonly used criterion in the identified RP techniques/tools is importance (used in 51 out of 108 identified tools). However, the results in this study show that importance only has an impact on the requirements prioritization in the very beginning of the development process (in State 1, where more stakeholders considering a requirement important increases the probability of the requirement being included) and at the very end of the development process (in State 6, where Critical feature going from 'No' to 'Yes' significantly increases the probability of a requirement being released). In all other decisions between State 1 and State 6, importance has no significant impact on the decisions. This means that all RP techniques/tools that only use importance as the RP criterion, or use a combination of criteria where importance is one of them, would not be applicable/useful for the analyzed project.
When discussing RQ1, which RP criteria are important, we saw that Dependency has a significant impact on requirements prioritization; however, it was weak due to high uncertainty. When analyzing RQ1.1, whether the impact changes depending on which state a requirement is in, we see, in particular in State 3, that Dependency has an impact with a probability mass of close to 30%. However, not much changes when the dependency moves from 'No' to 'Yes', as shown in Fig. 6. We see a similar pattern, although with a lower probability mass, for all other states. This means that whether a requirement has dependencies to other requirements does not have an impact on requirements prioritization.

General Discussion of Results
This study identified interesting findings regarding aspects of requirements prioritization, which deviated from what the literature says. Overall, these findings revised parts of our understanding of requirements prioritization, and provide insight into how future requirements prioritization techniques/tools and decision support systems (e.g., AI-based and data-driven decision support systems) should be developed. These findings are:
• not all RP criteria are equally important, and their importance changes depending on how far a requirement has reached in the development process;
• the need for identifying which criteria are important, and when they are important, in order to develop flexible requirements prioritization techniques/tools and other decision support systems (e.g., AI-based and data-driven decision support systems); and
• unnecessary/obsolete decision criteria/information for decision making (e.g., requirements prioritization) may lead to poor or wrong decisions.
The results from RQ1.1 (see Sect. 5.2) show that not all RP criteria are equally important (i.e., having an impact on which requirements are prioritized to be included, implemented, and eventually released), and that the importance of a criterion changes depending on where in the software development process a requirement is (i.e., in which of the six states described in Table 2). For example, Business value has an impact on requirements prioritization in the early phase, Customer value in the later phases, while being a critical requirement (i.e., Critical feature is 'YES') only has an impact in the last phase (State 6). These findings are not in line with how requirements prioritization techniques/tools in the literature are developed [10,15]. Most, if not all, requirements prioritization techniques/tools select, based on expert opinion, which criteria should be used in the developed requirements prioritization technique/tool [3]. Hence, the requirements prioritization techniques/tools in the literature cannot be used in a flexible way with different criteria depending on the development phase, and thus may not be so useful. This may be one reason why gut-feeling, subjective opinion, and expert judgement are frequently reported to be used in requirements prioritization [12,30,31], and it may be explained by the representativeness heuristic, which is a mental shortcut to lessen the cognitive load [33]. In short, when the needed information is not available when making decisions, e.g., because it cannot be used in the current techniques/tools, practitioners use previous experience instead. The importance of having flexible requirements prioritization techniques/tools is supported by Berander and Andrews [2].
The findings in this study show the importance of customizing RP criteria, not only to a specific context/project [3], but also to specific development phases. Thus, when developing requirements prioritization techniques/tools, or other decision support systems (e.g., AI-based or data-driven decision support systems), it is important to identify which criteria/information is important to use, and when to use it. Not identifying which criteria/information is important may lead to other consequences. One consequence may be that all/too much/unnecessary criteria/information (i.e., information that is not important for the decision) is presented to the decision makers. That is, unnecessary/obsolete decision criteria/information is visible to the decision makers, which may lead to poor or wrong decisions. Having an extra, irrelevant option set, e.g., RP criteria that are not important for the decision, visible to the decision maker should not affect the choice, but in some contexts, it does [34]. The importance of presenting correct information to the decision maker is shown in [33], where the presence of obsolete requirements negatively affected the cost/effort estimations of the requirements. Thus, unnecessary/obsolete RP criteria may have a similar effect.

VALIDITY THREATS
Construct validity [35] is concerned with the relation between theories behind the research and the observations. A construct in this case is a latent concept we are trying to measure. Ultimately we want to determine whether the concept is real and whether the way we measure it indirectly is appropriate to better understand the concept.
Concerning this study, the variables (e.g., RP criteria) used in the statistical analysis are all constructs and, hence, try to measure an underlying latent concept; this served our purposes well. By investigating the literature and then contrasting it with our statistical analysis, we uncovered several cases where one could question if appropriate constructs are used in requirements prioritization. First, the effect varies between constructs, which might not be a problem by itself; however, the fact that the effects vary over time is a bit more worrying. This could be a sign of inappropriate constructs (Sect. 5.2) and we have in this study taken a first systematic step to analyze these constructs using a principled approach to statistical analysis (see, e.g., early work by Furia et al. [36]).
In summary, for our constructs, there might be face validity (does it make sense?); however, content validity (do the constructs include all dimensions?) is most likely lacking for some constructs. This indicates that predictive validity can be questioned.
Internal validity concerns whether causal conclusions of a study are warranted or if overlooked phenomena are involved in the causation. We assume that the eight used RP criteria in the analyzed project are considered to be the most important ones to use when prioritizing requirements. However, other RP criteria than the ones in the database may have been used and, thus, affected the results. Moreover, another factor that may have affected the results is incorrect/missing data/value for the different RP criteria. There were no NAs in the dataset; however, that does not necessarily mean that there are no NAs. Some of the coding can be a representation of NA, e.g., 'No Value'. In this case, we know that 'No value' and 'None' in the dataset actually are values and not a representation of NAs since we asked the "gate-keeper" from the analyzed project about the correctness of the data.
External validity is concerned with the ability to generalize the results, i.e., in this case the applicability of the findings beyond the studied project and company. One threat to external validity is that only one completed project from one company is evaluated in this study. Although we analyzed 83,408 decisions based on eight RP criteria for 11,110 requirements, the results from this study may not be generalizable to all projects and requirements prioritization criteria and decisions. However, the aim of this study is not to develop theories that apply to all projects and RP criteria and decisions, but to show that in some projects different RP criteria have different impact on the decisions, and that the impact changes depending on how far a requirement has reached in the development process. Nevertheless, in order to generalize the results, replications on other projects with different RP criteria are needed.

CONCLUSION
In conclusion, we conducted a quantitative empirical study to analyze which requirements prioritization (RP) criteria are the most important ones when determining which requirements are implemented and released, which RP criteria have an actual impact on requirements prioritization decisions, and if the importance of a criterion changes depending on how far a requirement has reached in the development process. To this aim, we extracted 83,408 requirements prioritization decisions based on eight RP criteria for 11,110 requirements (features) from one completed project at one software developing company. The extracted data was analyzed by designing, comparing, validating, and diagnosing ordinal Bayesian regression models. We showed how to model ordinal data in a principled way, how to use category-specific effects to get a more nuanced view, and how to report results using conditional effects. The results from this study highlight the following key findings:
1) Not all used RP criteria have an actual impact on requirements prioritization decisions, e.g., Key customers had no impact at all, while the remaining seven RP criteria had, at least, some impact.
2) Not all RP criteria are equally important, and this changes depending on how far a requirement has reached in the development process. For example, for requirements prioritization decisions before iteration/sprint planning, having high Business value had an impact on requirements prioritization decisions, but after iteration/sprint planning having high Business value had no impact. Moreover, high Team priority (i.e., the teams' subjective opinion) and being a critical feature (i.e., Critical feature is 'YES') only had an impact at the very end of the development process.
3) Internal value (Business value and Stakeholders) is more important (i.e., has an actual impact on decisions) than external value (Customer value and Key customers) when prioritizing requirements in the beginning of the project. That is, among all requirements that are prioritized to be included in the project until the iteration/sprint planning meeting, only internal value is considered, while the customer perspective is totally ignored.
4) Although Dependency was found to have a significant impact on requirements prioritization decisions, in particular in the middle of the development process, not much changes, in terms of actual impact on decisions, when the dependency moves from 'NO' to 'YES'. This means that whether a requirement has dependencies to other requirements has no impact on requirement prioritization decisions.
The findings in this paper confirm the need for analyzing and identifying which RP criteria are important in order to develop flexible requirements prioritization techniques/tools [2]. That is, they underline the importance of customizing requirement prioritization criteria, not only for specific contexts/projects, but also for specific development phases.
Finally, the findings in this study highlight the need for conducting more quantitative studies (preferably in combination with qualitative data) on different projects and contexts, and with different RP criteria in order to get a more complete understanding of which RP criteria are most important to use in requirements prioritization decisions, and when in the development process they should be used. Although we only studied requirements prioritization decisions and criteria, the result that different criteria (data/information) have different impact on the decisions depending on where in the development process the decisions are made may be applicable to other types of decisions within software development. Therefore, it would be interesting to study other types of decisions using other support systems, e.g., AI-based, machine learning, or data-driven decision support systems, to identify which criteria/data/information have an actual impact on the decisions, and when in the development process.