Speedup and efficiency of computational parallelization: A unifying approach and asymptotic analysis

In high performance computing environments, we observe an ongoing increase in the available numbers of cores. This development calls for re-emphasizing performance (scalability) analysis and speedup laws as suggested in the literature (e.g., Amdahl's law and Gustafson's law), with a focus on asymptotic performance. Understanding speedup and efficiency issues of algorithmic parallelism is useful for several purposes, including the optimization of system operations, temporal predictions on the execution of a program, and the analysis of asymptotic properties and the determination of speedup bounds. However, the literature is fragmented and shows a large diversity and heterogeneity of speedup models and laws. These phenomena make it challenging to obtain an overview of the models and their relationships, to identify the determinants of performance in a given algorithmic and computational context, and, finally, to determine the applicability of performance models and laws to a particular parallel computing setting. In this work, we provide a generic speedup (and thus also efficiency) model for homogeneous computing environments. Our approach generalizes many prominent models suggested in the literature and allows showing that they can be considered special cases of a unifying approach. The genericity of the unifying speedup model is achieved through parameterization. Considering combinations of parameter ranges, we identify six different asymptotic speedup cases and eight different asymptotic efficiency cases. Jointly applying these speedup and efficiency cases, we derive eleven scalability cases, from which we build a scalability typology. Researchers can draw upon our typology to classify their speedup model and to determine the asymptotic behavior when the number of parallel processing units increases. In addition, our results may be used to address various extensions of our setting.


Introduction
Parallel computing has become increasingly relevant to tackle hard computational problems in a variety of scientific disciplines and industrial fields. The large diversity and deployment of parallel computing across disciplines, including artificial intelligence, arts and humanities, computer science, digital agriculture, earth and environmental sciences, economics, engineering, health sciences, mathematics, and natural sciences, is mirrored in usage statistics published by supercomputer clusters (e.g., [25,20]). This ongoing progress in computational sciences through parallelization has been fostered by the end of exponential growth in single processor core performance [14] and the availability of high performance computing (HPC) infrastructures, tools, libraries, and services as commodity goods offered by computing centers of universities, public cloud providers, and open source communities. Beyond these developments, the number of cores available as parallel processing units has increased substantially over the past years. While the statistics of the TOP500 list (as of June 2022) show values of 35,339.2 (10th percentile), 67,328 (median), and 225,465.6 (90th percentile), the corresponding values of the lists as of June 2017 and June 2012 amount to (16,545.6; 36,000; 119,808) and (6,776; 13,104; 37,036.8), respectively [35]. In addition, in contrast to the lists of 2012 and 2017, which both include only one site with more than 1 million cores, the current list shows that nine clusters have more than 1 million cores. This enormous growth in the number of cores available for parallel processing calls for re-emphasizing asymptotic performance analysis (e.g., [8,3]) and speedup laws as suggested in the literature (e.g., Amdahl's law [4] and Gustafson's law [16]).
In general, studying the performance of algorithmic parallelism is useful for several purposes; these include optimizing system operations via design-time and run-time management (e.g., [17,41,37,40,9]), making temporal predictions on the execution of a program (e.g., [29,1]), and analyzing asymptotic speedup and efficiency properties as well as determining speedup and efficiency bounds (e.g., [32,8]). In this article, we focus on the two latter purposes, which have been addressed only rarely in the literature.
Analyzing the performance of parallel algorithms is challenging as it needs to account for diversity in several regards. For example, existing speedup models and laws make different assumptions with respect to the homogeneity/heterogeneity of parallel processing units, variations of workloads, and methodological characteristics and application fields of algorithms (e.g., optimization, data analytics, simulation). This heterogeneity has resulted in a landscape of many speedup models and laws, which, in turn, makes it difficult to obtain an overview of the models and their relationships, to identify the determinants of performance in a given algorithmic and computational context, and, finally, to determine the applicability of performance models and laws to a particular parallel computing setting.
Our focus lies on the development of a generic and unifying speedup and efficiency model for homogeneous parallel computing environments. We consider a range of determinants of speedup covered in the literature and prove that existing speedup laws can be derived from special cases of our model. Our model depends neither on specific system architectures, such as symmetric multiprocessing (SMP) systems or graphics processing units (GPUs), nor on software properties, such as critical regions; we rather perform a theoretical analysis. We further focus on the analysis of asymptotic properties of the suggested model to study speedup and efficiency limits and bounds in the light of a computing future with an increasing number of parallel processing units.
Our results contribute to research on the performance (in terms of scalability) of computational parallelization in homogeneous computing environments in several regards: (1) We suggest a generic speedup and efficiency model which accounts for a variety of conditions under which parallelization occurs so that it is broadly applicable. This wide scope allows conducting performance analysis in many of those cases which are not covered by existing models and laws with restrictive assumptions. (2) We generalize the fragmented landscape of speedup and efficiency models and results, and we provide a unifying speedup and efficiency model which allows overcoming the perspective of conflicting speedup models by showing that these models can be interpreted as special cases of a more universal model. (3) From our asymptotic analysis, we derive a typology of scalability (speedup and efficiency), which researchers may use to classify their speedup model and/or to determine the asymptotic behavior of their particular application. We also provide a theoretical basis for explaining sublinear, linear, and superlinear speedup and efficiency and for deriving speedup and efficiency bounds in the presence of an enormous growth of the number of available parallel processing units. To sum up, we consolidate prior research on performance in homogeneous parallel computing environments, and we provide a theoretical understanding of quantitative effects of various determinants of asymptotic performance in parallel computing.
The remainder of the article is structured as follows: In Section 2, we provide a brief overview of the foundations of speedup and efficiency analysis in parallel computing. We proceed in Section 3 with the suggestion of a generic speedup and efficiency model. In Section 4, we perform a mathematical analysis of our model in order to determine theoretical speedup and efficiency limits. We discuss the application of the suggested scalability typology in Section 5 before we provide conclusions of our research in Section 6.

Foundations of speedup and efficiency analysis
The main purpose of parallel computation is to take advantage of increased processing power to solve problems faster or to achieve better solutions. The former goal is referred to as scalability, and scalability measures fall into two main groups: speedup and efficiency. Speedup S(N) is defined as the ratio of the sequential computation time T(1) to the parallel computation time T(N) needed to process a task with a given workload when the parallel algorithm is executed on N parallel processing units (PUs) (e.g., cores in a multicore processor architecture); i.e.,

S(N) = T(1) / T(N). (1)
The sequential computation time T(1) can be measured differently, leading to different interpretations of speedup [5]: When T(1) refers to the fastest serial time achieved on any serial computer, speedup is denoted as absolute. Alternatively, it may also refer to the time required to solve a problem with the parallel program on one of the parallel PUs. This type of speedup is qualified as relative. In this work, we focus on relative speedup. As speedup relates the time required to process a given workload on a single PU to the time required to process the same workload on N PUs, we need to determine this workload. It is usually divided into two sub-workloads, the sequential workload and the parallelizable workload. While the former is inherently sequential and necessarily needs to be executed on a single PU, the latter can be executed in parallel on several PUs. Independent of the number of available parallel PUs N, the time required to solve a task is the sum of the time to handle the sequential workload and the time to handle the parallelizable workload of the given task. When only a single PU is available, the times for the sequential workload s and for the parallelizable workload p are usually normalized by setting s + p = 1; i.e., s and p represent the sequential and the parallelizable fractions of the overall execution time.
For some applications, it is useful to consider a fixed workload (e.g., when solving an instance of an optimization problem), which is independent of the number of parallel PUs (N) available, and then to analyze how computation of the fixed workload on a single PU can be speeded up by using multiple PUs. Speedup models of this type are referred to as fixed-size models, such as Amdahl's law [4]. For other applications (e.g., when analyzing data), it is more appropriate to use the availability of N PUs to solve tasks with workloads which increase depending on N. Then, scalability analysis deals with investigating how computation of the variable workload on one PU can be speeded up by using multiple PUs. Speedup models of that type are referred to as scaled-size models, such as Gustafson's law [16].
With a varying number of PUs N, both the sequential and the parallelizable workload may be considered scalable. It is common in the literature to introduce two workload scaling functions f(·), g(·) with f, g : N → R_{>0} for the sequential and the parallelizable workload, respectively; i.e., the (normalized) times to process the sequential and the parallelizable workload on a single PU are s · f(N) and p · g(N), respectively. Thus, the (normalized) time to process the overall workload on a single PU amounts to s · f(N) + p · g(N). Usually, it is assumed that f(1) = g(1) = 1 so that T(1) = s + p = 1 holds; however, our workload scaling functions do not require to meet this assumption.¹ An example of using a scaling function for the sequential workload can be found in the scaled speedup model suggested by Schmidt et al. [31, p. 31ff]. While scaling functions for sequential workloads can be found only rarely, scaling functions for parallelizable workloads are much more common; see, for example, the scale-sized speedup model of Gustafson [16], the memory-bound speedup model of Sun and Ni [33,34,32], the generalized scaled speedup model of Juurlink and Meenderinck [19], and the scaled speedup model of Schmidt et al. [31, p. 31ff]. A discussion of the relationship between problem size and the number of PUs N can be found in [36, p. 32f].
While the time required to process the sequential workload is independent of the number of PUs N, the time required to process the parallelizable workload depends on N, as this workload can be processed in parallel. Usually, the parallelizable workload is considered to be equally distributed on N PUs, resulting in the (normalized) time p · g(N)/N to handle the parallelizable workload. However, there are tasks for which the time required to handle the parallelizable workload is affected by its actual parallel execution; for example, when a mathematical optimization problem, such as a mixed-integer linear program (MILP), is solved with a parallelized branch-and-bound algorithm, good bounds may be found early so that the branch-and-bound tree does not grow as large as with the sequential execution of the branch-and-bound algorithm. This effect may result in a denominator function which is not identical to N and allows explaining superlinear speedup as it has been observed in the literature (e.g., [30,11,15]). We account for this effect with a scaling function h(·), with h : N → R_{>0}.
Finally, processing one single large task on several parallel PUs involves some sort of overhead, which is often rooted in initialization, communication, and synchronization efforts [38,18,13]. We account for the additional time required for these efforts with an overhead function z(·), z : N → R_{≥0}.
The abovementioned workloads and temporal effects are visualized in Figure 1 (Workloads and temporal effects of parallelization). The resulting general speedup equation is then given by

S(N) = (s · f(N) + p · g(N)) / (s · f(N) + p · g(N)/h(N) + z(N)). (2)

Note that the speedup equation given in (2) is a generalization of several well-known speedup models, including those used in Amdahl's law [4] (set f(N) = g(N) = 1, h(N) = N, z(N) = 0), Gustafson's law [16] (set f(N) = 1, g(N) = h(N) = N, z(N) = 0), and the generalized scaled speedup model [19] (set f(N) = 1, g(N) = √N, h(N) = N, z(N) = 0). Based upon speedup S(N), efficiency E(N) relates speedup to the number of parallel PUs used to achieve this speedup, and it is defined by

E(N) = S(N) / N. (3)
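To illustrate how the general speedup equation specializes to the classical laws, the following Python sketch (our illustration, not code from the cited works) evaluates eq. (2) for the parameterizations of Amdahl's and Gustafson's laws; the sequential fraction s = 0.1 and N = 64 are arbitrary assumptions.

```python
def speedup(N, s, f, g, h, z):
    """General speedup equation (2) with normalized workloads s + p = 1."""
    p = 1.0 - s
    return (s * f(N) + p * g(N)) / (s * f(N) + p * g(N) / h(N) + z(N))

s = 0.1  # illustrative sequential fraction (assumption)

# Amdahl's law: f = g = 1, h(N) = N, z = 0  =>  S(N) = 1 / (s + (1 - s)/N)
amdahl = speedup(64, s, lambda N: 1, lambda N: 1, lambda N: N, lambda N: 0)
assert abs(amdahl - 1 / (s + (1 - s) / 64)) < 1e-12

# Gustafson's law: f = 1, g = h = N, z = 0  =>  S(N) = s + (1 - s) * N
gustafson = speedup(64, s, lambda N: 1, lambda N: N, lambda N: N, lambda N: 0)
assert abs(gustafson - (s + (1 - s) * 64)) < 1e-12
```

Both assertions hold exactly, confirming that the two laws arise from eq. (2) by fixing the scaling and overhead functions.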

A generic speedup and efficiency model
Based upon the general speedup equation (2), we derive a generic speedup and efficiency model, which we use in the remainder of this article to analyze its asymptotic behavior. The generic speedup model uses power functions for f(·), g(·), h(·) and ignores any overhead induced through parallelization. The use of power functions is widely adopted in the literature, is included in many prominent speedup models [4,16,31,34,19], and is based on the assumption that many algorithms have a polynomial complexity in terms of computation and memory requirements [32, p. 184]. As we focus on the analysis of the asymptotic behavior, we always take the highest degree term. The motivation for neglecting any parallelization overhead (i.e., z(N) = 0 ∀N ∈ N), as is done in many, if not most, speedup and efficiency models in the literature, is manifold: First, the overhead is often unknown. Second, omitting an overhead term simplifies computations and provides a basis for developing laws which include an overhead function z(N) ≠ 0. Third, speedup and efficiency values determined without considering overhead represent upper bounds for practically achievable speedup and efficiency values when overhead occurs. We use the following power functions:

f(N) := c_f · N^α_f, g(N) := c_g · N^α_g, h(N) := c_h · N^α_h, with c_f, c_g, c_h > 0, (4)

and yield the following generic speedup equation (for N ≥ 1):

S(N) = (s · c_f · N^α_f + p · c_g · N^α_g) / (s · c_f · N^α_f + (p · c_g · N^α_g)/(c_h · N^α_h)) (5)

and the following efficiency equation (for N > 1):

E(N) = S(N)/N. (6)

The generic speedup equation given in (5) generalizes several well-known speedup equations and laws suggested in the literature (see Table 1).
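The generic power-function model of eqs. (5) and (6) can be sketched directly; the code below is our illustration (parameter names mirror c_f, α_f, etc.), not an implementation from the cited works.

```python
def generic_speedup(N, s, c_f, a_f, c_g, a_g, c_h, a_h):
    """Generic speedup, eq. (5): power functions f, g, h and no overhead (z = 0)."""
    p = 1.0 - s
    num = s * c_f * N**a_f + p * c_g * N**a_g
    den = s * c_f * N**a_f + p * c_g * N**a_g / (c_h * N**a_h)
    return num / den

def generic_efficiency(N, s, c_f, a_f, c_g, a_g, c_h, a_h):
    """Generic efficiency, eq. (6): E(N) = S(N) / N."""
    return generic_speedup(N, s, c_f, a_f, c_g, a_g, c_h, a_h) / N

# Amdahl's law as the special case c_f = c_g = c_h = 1, a_f = a_g = 0, a_h = 1:
assert abs(generic_speedup(64, 0.1, 1, 0, 1, 0, 1, 1) - 1 / (0.1 + 0.9 / 64)) < 1e-12
```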

Theoretical speedup and efficiency limits

Asymptotic speedup
As we are interested in asymptotic speedup, we determine limits for N → ∞. We rewrite the generic speedup equation (5) as the sum of two terms (I) and (II) (eq. (7)). (Table 1 lists the corresponding parameterizations of the scaled speedup model [31], under the assumption that the sequential and parallel workloads are given by power functions c_f · N^α_f and c_g · N^α_g, resp.; of Sun and Ni's law [33,34], under the assumption that the parallel workload is given by a power function N^α_g; and of the generalized scaled speedup model [19].) The limits of terms (I) and (II) are derived in equations (34)-(37) and (38)-(46) in Appendix A, respectively. Aggregating the limits for terms (I) and (II) yields the following limits for the speedup given in equations (5) and (7):

lim_{N→∞} S(N) =
(s·c_f + p·c_g)/(s·c_f), if α_f = α_g and α_h > 0, (17)
(s·c_f + p·c_g)/(s·c_f + p·c_g/c_h), if α_f = α_g and α_h = 0, (18)
1, if α_f > α_g, (19)
∞, with S(N) ∈ Θ(N^α_h), if α_g > α_f, α_h > 0, and α_g − α_h ≥ α_f, (20)
∞, with S(N) ∈ Θ(N^(α_g−α_f)), if α_g > α_f, α_h > 0, and α_g − α_h < α_f, (21)
c_h, if α_g > α_f and α_h = 0. (22)

We now briefly discuss each of the six equations and refer to these as speedup cases; a visual illustration of the speedup cases can be retrieved from Figure 2.
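The finite speedup limits can be checked numerically. The sketch below (our illustration) evaluates eq. (5) at a large N for one parameterization each of cases A_S, C_S, and F_S; the concrete parameter values are arbitrary assumptions.

```python
def S(N, s, c_f, a_f, c_g, a_g, c_h, a_h):
    # Generic speedup, eq. (5), with z = 0
    p = 1.0 - s
    num = s * c_f * N**a_f + p * c_g * N**a_g
    return num / (s * c_f * N**a_f + p * c_g * N**a_g / (c_h * N**a_h))

s, N = 0.2, 10**9
p = 1.0 - s
# Case A_S (a_f = a_g, a_h > 0): limit (s*c_f + p*c_g) / (s*c_f)
assert abs(S(N, s, 2, 1, 3, 1, 1, 0.5) - (s * 2 + p * 3) / (s * 2)) < 1e-2
# Case C_S (a_f > a_g): limit 1
assert abs(S(N, s, 1, 2, 1, 1, 1, 1) - 1) < 1e-2
# Case F_S (a_g > a_f, a_h = 0): limit c_h
assert abs(S(N, s, 1, 0, 1, 1, 4, 0) - 4) < 1e-2
```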
Case A_S: The speedup limit given in eq. (17) represents an upper bound for S(N) ∀N > 1 (see Appendix A) and refers to situations in which the number N of available PUs affects the time required to address the parallelizable workload (due to α_h > 0). While the speedup limit holds for any c_h, α_h > 0, it seems reasonable to assume that h(N) = c_h · N^α_h ≥ 1 holds, as increasing the number of PUs from N = 1 to N > 1 should not lead to an increase of the time required to execute the parallelizable workload. However, the speedup limit in this case does not depend on the values of c_h and α_h. Also, this case assumes that the scaling functions change the serial and parallelizable workloads using the same factor N^α_f. It should be noticed that case A_S results in Amdahl's law [4] when setting c_f = c_g = c_h = α_h = 1, α_f = α_g = 0; then, the limit on speedup amounts to 1/s. As the speedup model of Amdahl's law is a special case of the memory-bound model suggested in [33,34] (Sun and Ni's law) under the assumption that the parallelizable workload in the memory-bound model is given by a power function N^α_g, case A_S partly covers the abovementioned model. This case also (partly) covers the scaled workload model of Schmidt et al. [31] under the assumption that the sequential and parallelizable workloads are given by power functions c_f · N^α_f and c_g · N^α_g, resp., with α_f = α_g.
Case B_S: The speedup limit given in eq. (18) represents an upper bound for S(N) ∀N > 1 (see Appendix A). It refers to the same situation as described in case A_S with the modification that, here, N does not affect the time required to address the parallelizable workload (due to α_h = 0); that time is rather modified through a division by the scalar c_h; i.e., when executing the parallelizable workload in parallel, the corresponding time changes are determined by the constant factor c_h. It seems reasonable to assume that c_h ≥ 1 holds in this case (cf. the analogous discussion of case A_S), with a resulting speedup limit of lim_{N→∞} S(N) ≥ 1. While case B_S seems not useful under the premise that the parallelizable workload is infinitely parallelizable, it becomes useful when this assumption is replaced by the expectation that a given parallelizable workload can be executed in parallel only on a limited number N_max of PUs; then, c_h may represent this limitation. For a discussion of limited parallelization, see, for example, [10, p. 772ff] and [3, p. 141ff].
Case C_S: When the increase of the serial workload is asymptotically higher than that of the parallelizable workload (α_f > α_g), speedup converges against 1 (see eq. (19)) as a lower bound regardless of the values of c_h and α_h; i.e., in this case, any parallelization does not reduce the overall execution time asymptotically due to the "dominant" increase of the serial workload. Case C_S (partly) covers the scaled workload model of Schmidt et al. [31] under the assumption that the sequential and parallelizable workloads are given by power functions c_f · N^α_f and c_g · N^α_g, resp., with α_f > α_g.
Case D_S: This case covers situations in which speedup is not limited and increases asymptotically with Θ(N^α_h) for (i) α_g > α_f, (ii) α_h > 0, and (iii) α_g − α_h ≥ α_f (see eq. (20)); (i) ensures that the parallelizable workload increases faster than the sequential workload, (ii) ensures that parallelization asymptotically reduces the time required to execute the parallelizable workload, and (iii) ensures that the temporal effect induced through the joint growth of the parallelizable workload and its actual parallel execution (N^(α_g−α_h)) is not weaker than the temporal effect induced through the growth of the sequential workload (N^α_f). Depending on the value of α_h, speedup asymptotically grows sublinearly (0 < α_h < 1), linearly (α_h = 1), or superlinearly (α_h > 1). It should be noticed that case D_S results in Gustafson's law [16] when setting c_f = c_g = c_h = α_g = α_h = 1, α_f = 0; then, the speedup asymptotically grows linearly. Case D_S (partly) covers Sun and Ni's law when setting c_f = c_g = c_h = α_h = 1, α_f = 0, α_g ≥ 1 (under the assumption that the parallelizable workload in this model is given by a power function N^α_g). Finally, case D_S (partly) covers the scaled workload model of Schmidt et al. [31] under the assumption that the sequential and parallelizable workloads are given by power functions c_f · N^α_f and c_g · N^α_g, resp., with α_g − α_f ≥ 1.
Interestingly, case D_S may help explain superlinear speedup as it has been observed in research on mathematical optimization (at least for a limited range of N) [27,6,30,15], for example.
Case E_S: Similar to case D_S, case E_S refers to situations in which speedup is not limited and increases asymptotically, but now with Θ(N^(α_g−α_f)) for (i) α_g > α_f, (ii) α_h > 0, and (iii) α_g − α_h < α_f (see eq. (21)). The conditions of case E_S differ from those of case D_S only with regard to (iii); i.e., here, the temporal effect induced through the joint growth of the parallelizable workload and its actual parallel execution (N^(α_g−α_h)) is weaker than the temporal effect induced through the growth of the sequential workload (N^α_f). Now, the difference (α_g − α_f) determines the asymptotic growth of speedup: it asymptotically grows sublinearly (0 < α_g − α_f < 1), linearly (α_g − α_f = 1), or superlinearly (α_g − α_f > 1).
Similarly to case D_S, case E_S (partly) includes speedup models and laws suggested in the literature: Case E_S (partly) covers Sun and Ni's law when setting c_f = c_g = c_h = α_h = 1, α_f = 0, α_g < 1 (under the assumption that the parallelizable workload in this model is given by a power function N^α_g). With α_g = 1/2, Sun and Ni's model becomes the "generalized scaled speedup" model suggested in [19]; thus, case E_S also covers the generalized scaled speedup model. Finally, case E_S also (partly) includes the model of Schmidt et al. [31] under the assumption that the sequential and parallelizable workloads are given by power functions c_f · N^α_f and c_g · N^α_g, resp., with 0 < α_g − α_f < 1; then, speedup grows asymptotically sublinearly with Θ(N^(α_g−α_f)).
Like case D_S, case E_S may also help explain superlinear speedup.
Case F_S: This case covers situations in which speedup converges to c_h for (i) α_g > α_f and (ii) α_h = 0 (see eq. (22)). For c_h ≥ 1, the limit c_h is a lower bound; for c_h ≤ 1, the limit c_h is an upper bound. Condition (i) ensures that the parallelizable workload increases faster than the sequential workload, and with condition (ii) we assume that N does not affect the time required to address the parallelizable workload (due to α_h = 0). As with case B_S, case F_S is useful with the expectation that a given parallelizable workload can be executed in parallel only on a limited number N_max of PUs.
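The six speedup cases partition the exponent space, which can be summarized in a small classifier. The helper below is hypothetical (not from the paper) and simply encodes the conditions of eqs. (17)-(22).

```python
def speedup_case(a_f, a_g, a_h):
    """Map the exponents of the generic model to the asymptotic speedup case."""
    if a_f > a_g:
        return "C_S"                                 # limit 1
    if a_f == a_g:
        return "A_S" if a_h > 0 else "B_S"           # finite limits
    # a_g > a_f:
    if a_h == 0:
        return "F_S"                                 # limit c_h
    return "D_S" if a_g - a_h >= a_f else "E_S"      # unbounded speedup

assert speedup_case(0, 0, 1) == "A_S"    # Amdahl's law
assert speedup_case(0, 1, 1) == "D_S"    # Gustafson's law
assert speedup_case(0, 0.5, 1) == "E_S"  # generalized scaled speedup [19]
```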

Asymptotic efficiency
In order to determine theoretical efficiency limits, we proceed analogously to the determination of speedup limits. We rewrite the generic efficiency equation (eq. (6)) as the sum of two terms (I') and (II') (eq. (23)). The limit lim_{N→∞}(I') = 0 follows from equations (47)-(49) in Appendix B; the limits of term (II') are derived in equations (50)-(62) in Appendix B. With lim_{N→∞}(I') = 0, we yield the following limits for the efficiency given in equations (6) and (23):

lim_{N→∞} E(N) =
0, if α_f > α_g − 1, (25)
0, if α_f = α_g − 1 and α_h < 1, (26)
(p·c_g)/(s·c_f + p·c_g/c_h), if α_f = α_g − 1 and α_h = 1, (27)
(p·c_g)/(s·c_f), if α_f = α_g − 1 and α_h > 1, (28)
0, if α_f < α_g − 1 and α_h < 1, (29)
c_h, if α_f < α_g − 1 and α_h = 1, (30)
∞, with E(N) ∈ Θ(N^(α_g−α_f−1)), if α_f < α_g − 1, α_h > 1, and α_f > α_g − α_h, (31)
∞, with E(N) ∈ Θ(N^(α_h−1)), if α_f < α_g − 1, α_h > 1, and α_f ≤ α_g − α_h. (32)

We now briefly discuss each of the eight equations and refer to these as (efficiency) cases; a visual illustration of the efficiency cases can be retrieved from Figure 3, which, unsurprisingly, shows structural similarities with the visual representation of speedup limits due to the relationship between efficiency and speedup as given by E(N) = S(N)/N.

Case A_E: The efficiency limit given in eq. (25) equals zero regardless of the value of α_h and apparently represents a lower bound for E(N) ∀N > 1. This case refers to a situation in which the increase of the serial workload is asymptotically higher than that of the adjusted parallelizable workload (adjusted through the division by the number N of PUs, which decreases the exponent by 1) (α_f > α_g − 1).
Case B_E: The efficiency limit given in eq. (26) equals zero when the value of α_h is smaller than 1. Again, it apparently represents a lower bound for E(N) ∀N > 1. This case refers to a situation in which the ratio of the increases of the serial workload and the adjusted parallelizable workload converges against the constant c_f/c_g and in which the increase of the time reduction of executing the parallelizable workload is sublinear in N (α_h < 1).
Case C_E: The efficiency limit given in eq. (27) describes a situation that differs from that in case B_E only in that the increase of the time reduction of executing the parallelizable workload is now linear in N (α_h = 1). Then, the limit of efficiency is given by a constant larger than 0, assuming that the parallelizable workload is positive (p > 0).
Case D_E: The efficiency limit given in eq. (28) describes a situation that differs from that in case B_E only in that the increase of the time reduction of executing the parallelizable workload is now superlinear in N (α_h > 1). Despite this increase of the time reduction, and due to the relatively large increase of the serial workload compared to that of the parallelizable workload (α_f = α_g − 1), the limit of efficiency is still given by a constant (larger than 0), assuming that the parallelizable workload is positive (p > 0).
Case E_E: The efficiency limit given in eq. (29) describes a situation that is similar to that of case B_E. While α_h < 1 holds again, the (adjusted and the non-adjusted) parallelizable workload grows faster than the serial workload (α_f < α_g − 1). However, as the increase of the time reduction of executing the parallelizable workload is sublinear in N (α_h < 1), efficiency converges against 0.
Case F_E: In contrast to case E_E, the efficiency limit given in eq. (30) describes a situation in which the increase of the time reduction of executing the parallelizable workload is linear in N (α_h = 1). Then, efficiency asymptotically equals a constant larger than 0 (assuming p > 0).
Case G_E: One situation in which efficiency is unbounded is described in eq. (31), where the (adjusted and the non-adjusted) parallelizable workload grows faster than the serial workload (α_f < α_g − 1) and the increase of the time reduction of executing the parallelizable workload is superlinear in N (α_h > 1). When also α_f > α_g − α_h holds, efficiency grows asymptotically with Θ(N^(α_g−α_f−1)); i.e., the asymptotic growth does not depend on α_h. For α_g − α_f > 2, α_g − α_f = 2, and 1 < α_g − α_f < 2, efficiency grows superlinearly, linearly, and sublinearly, respectively.
Case H_E: A second situation in which efficiency is unbounded is described in eq. (32), where the (adjusted and the non-adjusted) parallelizable workload grows faster than the serial workload (α_f < α_g − 1) and the increase of the time reduction of executing the parallelizable workload is superlinear in N (α_h > 1). When also α_f ≤ α_g − α_h holds, efficiency grows asymptotically with Θ(N^(α_h−1)); i.e., the asymptotic growth now depends on α_h. For α_h > 2, α_h = 2, and 1 < α_h < 2, efficiency grows superlinearly, linearly, and sublinearly, respectively.
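Analogously to the speedup cases, the eight efficiency cases can be encoded as a classifier over the exponents. The helper below is again our hypothetical illustration of the conditions in eqs. (25)-(32), not code from the paper.

```python
def efficiency_case(a_f, a_g, a_h):
    """Map the exponents of the generic model to the asymptotic efficiency case."""
    if a_f > a_g - 1:
        return "A_E"                                 # limit 0
    if a_f == a_g - 1:
        if a_h < 1:
            return "B_E"                             # limit 0
        return "C_E" if a_h == 1 else "D_E"          # positive constants
    # a_f < a_g - 1:
    if a_h < 1:
        return "E_E"                                 # limit 0
    if a_h == 1:
        return "F_E"                                 # limit c_h
    return "G_E" if a_f > a_g - a_h else "H_E"       # unbounded efficiency

assert efficiency_case(0, 0, 1) == "A_E"  # Amdahl's law: efficiency -> 0
assert efficiency_case(0, 1, 1) == "C_E"  # Gustafson's law: constant efficiency
assert efficiency_case(0, 3, 2) == "H_E"  # superlinear time reduction
```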

Asymptotic scalability
In the previous subsections, we identified speedup cases and efficiency cases. Now, we consider speedup cases and efficiency cases jointly, which results in various speedup-efficiency cases. We refer to these cases as scalability cases, which are defined by both speedup and efficiency limits (see Table 2 and Figure 4). We assign to each scalability case a scalability type, which describes both speedup (as first parameter) and efficiency (as second parameter), using the following semantics:

• β_h, γ_h: fixed values which depend on the reduced parallel workload scaling function h
• β_s,f,g, γ_s,f,g: fixed values which depend on the sequential part s (note: p = 1 − s) and the workload scaling functions f and g
• β_s,f,g,h, γ_s,f,g,h: fixed values which depend on the sequential part s (note: p = 1 − s), the workload scaling functions f and g, and the reduced parallel workload scaling function h
• ∞_f,g: unbounded and monotonically increasing; the extent of increase depends on the workload scaling functions f and g
• ∞_h: unbounded and monotonically increasing; the extent of increase depends on the reduced parallel workload scaling function h

Each scalability type refers to exactly one scalability case and set of conditions (see Table 2).
For the discussion of the speedup, efficiency, and scalability results derived in the preceding section, we recall the meaning of the parameters α_f, α_g, and α_h since their values determine the (speedup, efficiency, and scalability) case of a particular parallel algorithm: The parameters α_f, α_g, and α_h affect the serial workload, the parallelizable workload, and the actual reduction of the parallelizable workload through parallelization, respectively, depending on the number of PUs N; they are given by f(N) := c_f · N^α_f, g(N) := c_g · N^α_g, and h(N) := c_h · N^α_h, respectively. We also recall the abovementioned speedup and efficiency equations:

S(N) = (s · c_f · N^α_f + p · c_g · N^α_g) / (s · c_f · N^α_f + (p · c_g · N^α_g)/(c_h · N^α_h)) (eq. (5))

E(N) = S(N)/N (eq. (6))

We now discuss each of the eleven scalability cases A_SC to K_SC. As the definition of scalability cases (and types) is based upon combinations of speedup cases and efficiency cases, the characteristics of scalability cases (and types) can be derived from the above descriptions of the characteristics of speedup and efficiency cases.
Case A_SC: With speedup case C_S, the increase of the serial workload is asymptotically higher than that of the parallelizable workload (α_f > α_g). Then, speedup converges against 1. Case C_S induces efficiency case A_E so that the resulting asymptotic efficiency equals 0. The scalability type is (1, 0). Overall, parallelization does not scale at all, and parallelization efforts do not make much sense.
Case B_SC: With speedup case A_S, the scaling functions f and g change the serial and parallelizable workloads using the same factor N^α_f with possibly different values c_f and c_g; furthermore, case A_S refers to situations in which the number N of available PUs affects the time required to address the parallelizable workload (due to α_h > 0). Then, speedup converges against a constant β_s,f,g = (s·c_f + p·c_g)/(s·c_f). Speedup case A_S leads to efficiency case A_E; i.e., efficiency converges against zero.
Scalability case B_SC, which refers to scalability type (β_s,f,g, 0), includes Amdahl's law [4] and partly Sun and Ni's law [33,34] (see the discussion of speedup case A_S).
Case C_SC: This scalability case is similar to scalability case B_SC and differs from it only in that α_h equals zero; i.e., N does not affect the time required to address the parallelizable workload. With speedup case B_S and resulting efficiency case A_E, the associated scalability type is (β_s,f,g,h, 0), with speedup limit β_s,f,g,h = (s·c_f + p·c_g)/(s·c_f + p·c_g/c_h). As discussed above, speedup case B_S, and thus scalability case C_SC, are not useful under the premise that the parallelizable workload is infinitely parallelizable, but they become useful when a given parallelizable workload can be executed in parallel only on a limited number of PUs.
Case D_SC: Scalability case D_SC, which corresponds to scalability type (β_h, 0), includes speedup case F_S, in which the parallelizable workload increases faster than the sequential workload (α_g > α_f) and N does not affect the time required to address the parallelizable workload (α_h = 0). Then, speedup converges to c_h. When speedup case F_S applies, either efficiency case E_E or B_E applies, with efficiency converging towards zero in both cases. As with scalability case C_SC, case D_SC is useful with the expectation that a given parallelizable workload can be executed in parallel only on a limited number of PUs.
Case E_SC: This scalability case refers to a situation in which (i) the parallelizable workload increases at least one magnitude faster than the sequential workload (α_g ≥ α_f + 1) and (ii) the number of available PUs affects the time required to address the parallelizable workload with unlimited and sublinear growth (0 < α_h < 1). This scalability case is linked to speedup case D_S and to one of the efficiency cases E_E or B_E, resulting in scalability type (∞_h, 0). Case E_SC involves speedup growth of Θ(N^{α_h}); i.e., speedup growth is determined by the reduced parallel workload scaling function h. Due to the condition α_h < 1, this growth is sublinear in N, and efficiency converges to zero.
Case F_SC: This case describes a situation in which the parallelizable workload increases at least one magnitude faster than the sequential workload (α_g ≥ α_f + 1) and the number of available PUs affects the time required to address the parallelizable workload with superlinear growth (1 < α_h ≤ α_g − α_f). Under such conditions, speedup case D_S and efficiency case H_E apply, resulting in the scalability type (∞_h, ∞_h), where both speedup and efficiency are unbounded: speedup grows superlinearly with Θ(N^{α_h}), and efficiency grows sublinearly (when 1 < α_h < 2), linearly (when α_h = 2), or superlinearly (when 2 < α_h).
Case G_SC: This scalability case describes a situation in which the parallelizable workload increases one magnitude faster than the sequential workload (α_g = α_f + 1) and the number of available PUs affects the time required to address the parallelizable workload with linear growth (α_h = 1). Under such conditions, speedup case D_S and efficiency case C_E apply, resulting in the scalability type (∞_h, γ_{s,f,g,h}), where speedup is unbounded and grows linearly, and where efficiency converges to a constant γ that depends upon the functions f, g, h and upon s: γ_{s,f,g,h} = (p·c_g)/(s·c_f + p·c_g/c_h). Scalability case G_SC covers Gustafson's law [16], where α_g = α_h = 1 and α_f = 0. It also (partly) covers the model of Schmidt et al. [31].
Case H_SC: Scalability case H_SC differs from scalability case G_SC only in that the parallelizable workload increases more than one magnitude faster than the sequential workload (α_g > α_f + 1). Similar to case G_SC, the scalability type is (∞_h, γ_h), but here γ_h equals c_h (efficiency case F_E).
Analogously to scalability case G_SC, case H_SC also (partly) covers the model of Schmidt et al. [31].
In addition, case H_SC also (partly) covers Sun and Ni's law when c_f = c_g = c_h = 1 holds and when the parallelizable workload in this model is given by a power function N^{α_g}.
Case I_SC: This scalability case applies when (i) the parallelizable workload increases more than one magnitude faster than the sequential workload (α_g > α_f + 1) and (ii) the number of available PUs affects the time required to address the parallelizable workload with unlimited and superlinear growth (1 < α_g − α_f < α_h). Under these conditions, speedup case E_S and efficiency case G_E apply, leading to scalability type (∞_{f,g}, ∞_{f,g}); i.e., both speedup and efficiency are unbounded and grow superlinearly.
Case J_SC: The conditions under which this case applies differ from those of case I_SC in that the parallelizable workload increases less than one magnitude faster than the sequential workload (0 < α_g − α_f < 1). Then, speedup case E_S and efficiency case A_E apply, leading to scalability type (∞_{f,g}, 0); i.e., while speedup is unbounded and grows sublinearly, efficiency converges to zero.
Case K_SC: When (i) the parallelizable workload increases one magnitude faster than the sequential workload (α_g = α_f + 1) and (ii) the number of available PUs affects the time required to address the parallelizable workload with unlimited and superlinear growth (1 < α_h), scalability case K_SC applies with speedup case E_S and efficiency case D_E; then, speedup is unbounded and grows linearly, efficiency converges to the constant γ_{s,f,g} = (p·c_g)/(s·c_f), and scalability type (∞_{f,g}, γ_{s,f,g}) applies.
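The asymptotic constants stated in the cases above can be checked numerically. The following minimal sketch assumes that the generic speedup model of Definition 4 takes the form S(N) = (s·f(N) + p·g(N)) / (s·f(N) + p·g(N)/h(N)) with power functions f, g, h; the parameter values are illustrative, not drawn from any particular application.

```python
def speedup(N, s, c_f, a_f, c_g, a_g, c_h, a_h):
    """Generic speedup model with power-law scaling functions
    f(N) = c_f * N**a_f, g(N) = c_g * N**a_g, h(N) = c_h * N**a_h."""
    p = 1 - s                                   # parallelizable fraction
    f, g, h = c_f * N**a_f, c_g * N**a_g, c_h * N**a_h
    return (s * f + p * g) / (s * f + p * g / h)

N, s = 10**6, 0.2

# Amdahl's law (f = g = 1, h(N) = N): speedup -> 1/s = 5 (case B_SC)
amdahl = speedup(N, s, 1, 0, 1, 0, 1, 1)

# Gustafson's law (a_f = 0, a_g = a_h = 1): efficiency -> p = 0.8 (case G_SC)
gustafson_eff = speedup(N, s, 1, 0, 1, 1, 1, 1) / N
```

At N = 10^6, `amdahl` is already within 10^-4 of the limit 1/s, while `gustafson_eff` sits just above its limit p, illustrating the scalability types (β_{s,f,g}, 0) and (∞_h, γ_{s,f,g,h}), respectively.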

Discussion
The scalability typology developed in the previous section allows researchers to determine the limits of speedup and efficiency of their applications and the extent to which computational parallelization scales for their needs. It also supports researchers in deciding how many parallel PUs to use in the presence of economic budget constraints. Our typology shown in Table 2 provides a more comprehensive picture of scalability in homogeneous computing environments than the speedup laws suggested in the literature (shown in Table 1), thereby widening the scope of applying scalability insights. At the same time, our typology includes all of the abovementioned speedup laws, as illustrated in Table 3.
A key issue for researchers is the assignment of their particular application to a scalability case, which requires determining the sequential workload s and the power functions f, g, h (see Definition 4). In order to determine s, a straightforward approach is to execute the application on a single PU and measure the execution times t_seq and t_par of the sequential and parallelizable workloads, respectively, leading to s = t_seq/(t_seq + t_par) and p = 1 − s. The power functions f and g, which correspond to the problem size in terms of the sequential and parallelizable workloads depending on the number N of PUs, can be specified by the researcher or are given by the particular application. For example, when the application is concerned with processing and analyzing data, the volume of data to be processed can be aligned to N. However, in an optimization context when, for example, an instance of a mixed-integer linear program needs to be solved to optimality with a branch-and-bound algorithm, the size of the problem instance is fixed. With regard to the function h, which allows adjusting the time required to handle the parallelizable workload when this workload is actually executed in parallel, in many applications researchers may find it appropriate to set h(N) = N, which assumes that the parallelizable workload is equally distributed over N PUs. However, researchers may deviate from this assumption when, for example, superlinear speedup is observed in computational experiments. Ideally, h can be derived from algorithmic analysis; in many cases, however, this approach may not work, and researchers are then advised to determine h with computational experiments.
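As a minimal illustration of this measurement-based approach (all timings and speedup values below are hypothetical): s and p follow directly from the single-PU measurements, and, when speedup is believed to grow as Θ(N^{α_h}), a rough estimate of α_h is the slope of measured speedups in log-log space.

```python
import math

# Hypothetical single-PU execution times of the two workload parts
t_seq, t_par = 2.0, 8.0
s = t_seq / (t_seq + t_par)   # sequential fraction
p = 1 - s                     # parallelizable fraction

# Hypothetical measured speedups at two PU counts; under speedup growth
# Theta(N**a_h), a_h is estimated as the log-log slope
Ns, Ss = [8, 64], [6.0, 21.0]
a_h_est = (math.log(Ss[1]) - math.log(Ss[0])) / (math.log(Ns[1]) - math.log(Ns[0]))
```

With these illustrative numbers, s = 0.2 and a_h_est ≈ 0.6, which would place the application in a sublinear-growth regime such as case E_SC.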

Conclusion
In this work, we provide a generic speedup (and thus also efficiency) model, which generalizes many prominent models suggested in the literature and allows showing that they can be considered special cases, under different assumptions, of a unifying approach. The genericity of the speedup model is achieved through parameterization. Considering combinations of parameter ranges, we identify six different asymptotic speedup cases and eight different asymptotic efficiency cases; these cases include sublinear, linear and superlinear speedup and efficiency. Based upon the identified speedup and efficiency cases, we derive eleven different scalability cases and types, to which instantiations of our generic speedup (and efficiency) model may lead. Researchers can draw upon our suggested typology to classify their speedup model and/or to determine the asymptotic scalability of their application when the number of parallel PUs increases.
Our theoretical analysis is based upon several assumptions which are common in the literature (e.g., [34]). First, we assume that the overall workload contains only two parts, a sequential part and a perfectly parallelizable part, which can be executed in parallel on all available parallel PUs. In practice, the latter condition may not always hold, but even then our speedup and efficiency results are useful as upper bounds of achievable speedup and efficiency. In the literature, alternative models which do not require the abovementioned dichotomous distinction have been suggested, including parallelism/span-and-work models, multiple-fraction models, and roofline models (e.g., [10, p. 772ff], [3, p. 141ff], [7]). Future theoretical analysis of speedup and efficiency limits may consider those types of models. Second, while our generic speedup model includes a function for parallelization overhead, we assume that this overhead is negligible and omit this function from our analysis. However, we admit that parallelization overhead may cause large scalability degradation [18] and may have considerable effects on speedup and efficiency. Overhead can be rooted in several phenomena, including the existence of critical regions (exclusive access for one process only), inter-process communication, and sequential-to-parallel synchronization due to data exchange [38]. While, in this case, speedup and efficiency results can again be used as upper bounds of achievable speedup and efficiency, future work should identify appropriate overhead functions and complement asymptotic analysis with the analysis of speedup and efficiency optima in order to determine the optimal number of parallel PUs to be used in particular problem settings. In the literature, different ways to integrate overhead functions into the determination of execution times have been suggested. One option is the provision of additive terms, as has been done for overhead of data preparation [26] or
communication required for mapping a workload onto multiple cores [21,3]. A theoretical analysis of an additive overhead function can be found in [13]. Another option involves adding a coefficient function to the execution time of the parallelized workload (e.g., [12,33]).
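As an illustrative sketch of the first (additive) option, an overhead term o(N) could enter the parallel execution time as follows; the Amdahl-type instantiation (f = g = 1, h(N) = N) and the linear communication cost o(N) = 0.001·N are hypothetical choices, not taken from any of the cited models.

```python
def speedup_overhead(N, s, o):
    """Amdahl-type speedup (f = g = 1, h(N) = N) with an additive
    overhead term o(N) in the parallel execution time."""
    p = 1 - s
    return (s + p) / (s + p / N + o(N))

# Without overhead vs. with a hypothetical linear communication cost
base      = speedup_overhead(64, 0.2, lambda N: 0.0)
with_comm = speedup_overhead(64, 0.2, lambda N: 0.001 * N)
```

Because o(N) grows with N while p/N shrinks, speedup with overhead peaks at a finite number of PUs and then degrades, which is exactly why the analysis of speedup optima mentioned above becomes relevant.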
Third, in our analysis we focus on homogeneous parallel computing environments. We acknowledge that, in modern parallel computing environments, parallel processing units are not necessarily equal in their computing capabilities and that a substantial body of literature on speedup in such heterogeneous computing environments exists; see, for example, the surveys on heterogeneous multicore environments of Al-Babtain et al. [2] and Al-hayanni et al. [3]. Many works suggest extensions of Amdahl's law, Gustafson's law and/or Sun and Ni's law for such environments (e.g., [17,19,39,28,42,23,22]). Studies on speedup and efficiency properties of architecture-dependent laws are particularly helpful for the design of multi-core environments. Future work may extend our generic speedup model with concepts for different types of parallel PUs as suggested in the literature and adapt our theoretical analysis to heterogeneous settings.
Finally, merging the two abovementioned research streams leads to the consideration of speedup and efficiency in heterogeneous parallel computing environments under the consideration of overhead (functions). Our model and theoretical analysis may be extended in both regards, drawing on prior work. For example, Huang et al. [18] suggest an extension of Amdahl's law and Gustafson's law in architecture-specific multi-core settings by considering communication overheads and area constraints; Pei et al. [26] extend Amdahl's law for heterogeneous multicore processors with the consideration of overhead through data preparation; and Morad et al. [24] analyze overheads resulting from synchronization, communication and coherence costs based upon Amdahl's model for asymmetric cluster chip multiprocessors.
With the suggestion of a generic and unifying speedup (and efficiency) model and its asymptotic analysis, we hope to provide a theoretical basis for, and a typology of, the scalability of parallel algorithms in homogeneous computing environments. Future research can draw upon and extend our results to address various extensions of our setting.

Figure 2: Overview of speedup limits

Figure 3: Overview of efficiency limits

Figure 4: Overview of scalability cases

Table 1: Instantiations of generic speedup model

Table 2: Scalability cases. *: Values for speedup cases A_S and B_S are upper bounds; the value for case C_S is a lower bound; the value for case F_S is a lower bound (c_h ≥ 1) or an upper bound (c_h ≤ 1). **: Values for efficiency cases A_E to G_E are lower bounds; the value for case H_E is an upper bound (for sufficiently large values of N).

Table 3: Scalability types of speedup models suggested in the literature