A double-decomposition based parallel exact algorithm for the feedback length minimization problem

Product development projects usually contain many interrelated activities with complex information dependencies, which induce activity rework, project delays and cost overruns. To reduce these negative impacts, scheduling interrelated activities in an appropriate sequence is an important issue for project managers. This study develops a double-decomposition based parallel branch-and-prune algorithm to determine the optimal activity sequence that minimizes the total feedback length (the feedback length minimization problem, FLMP). This algorithm decomposes FLMP from two perspectives, which enables the use of all available computing resources to solve subproblems concurrently. In addition, we propose a result-compression strategy and a hash-address strategy to enhance this algorithm. Experimental results indicate that our algorithm can find the optimal sequence for FLMP instances with up to 27 activities within 1 h, and outperforms state-of-the-art exact algorithms.


INTRODUCTION
Enterprises face increasingly fierce competition, which requires them to develop new products in a short time. However, product development projects often involve many interrelated activities with complex information dependencies (Lin et al., 2012; Bashir et al., 2022). Such activities usually follow uncertain processes and rework frequently, which makes it difficult for managers to control project durations, costs and risks (Mohammadi, Sajadi & Tavakoli, 2014; Lin et al., 2018). Therefore, how to sequence interrelated activities to reduce these negative impacts has drawn considerable attention (Attari-Shendi, Saidi-Mehrabad & Gheidar-Kheljani, 2019; Wen et al., 2021).
The design structure matrix (DSM) can clearly describe interrelated activities and their interdependence, and is considered an effective tool for scheduling development projects (Browning, 2015; Wen et al., 2021). Figure 1A presents a typical DSM of a balancing machine project (Abdelsalam & Bao, 2007), where activities are listed in the left column and the top row in the same order; d_{i,j} (0 ≤ d_{i,j} ≤ 1, i ≠ j) denotes the degree of information dependence of activity i on activity j (marked in red). Since activity i precedes j, d_{i,j} represents the backward information flow from downstream to upstream in the activity sequence, which lies above the diagonal and is called feedback; d_{j,i} is the information flow in the opposite direction, which lies below the diagonal and is called feedforward. In Fig. 1B, if the order of activities i and j is reversed, then d_{i,j} and d_{j,i} become a feedforward and a feedback, respectively. The information flows from other activities to i and j are also affected (marked in yellow), which means that adjusting the activity sequence can significantly affect the overall information flows in the DSM (Lin et al., 2015; Meier et al., 2016).
Figure 1 indicates that, due to the existence of feedbacks, upstream activities often execute in the absence of information. Once the downstream activities complete, feedbacks may cause upstream activities to rework. In fact, feedbacks usually involve suggestions, errors and modifications, which are the main reason for project delay and cost overrun (Haller et al., 2015; Lin et al., 2015; Wynn & Eckert, 2017). Therefore, some studies suggest minimizing the total feedback value of the activity sequence to reduce these negative effects (Qian et al., 2011; Nonsiri et al., 2014). However, these studies do not consider the influence of feedback length, i.e., long feedbacks spanning more activities may cause more upstream activities to rework than short ones. Hence, the objective of minimizing the total feedback length is proposed, which has been widely applied in DSM-based scheduling problems. For instance, Qian & Yang (2014) demonstrated the effectiveness of optimizing the feedback length to reduce overall rework through a case study of a pressure reducer project. Benkhider & Kherbachi (2020) used a composite objective that considers the feedback length to reduce the duration of the Huawei P30 Pro project; Gheidar-kheljani (2022) studied a two-objective scheduling model that considers the feedback length and the cost of decreasing dependence among activities.
The original problem can be formulated as the following 0-1 programming model:

\[ \min \; fl = \sum_{i=1}^{n}\sum_{j=1,\, j\neq i}^{n} d_{i,j}\, x_{i,j} \Big( \sum_{k=1,\, k\neq j}^{n} x_{k,j} - \sum_{k=1,\, k\neq i}^{n} x_{k,i} \Big) \quad (1) \]
\[ \text{s.t.}\quad x_{i,j} + x_{j,i} = 1, \qquad \forall\, i \neq j \quad (2) \]
\[ x_{i,j} + x_{j,k} - x_{i,k} \leq 1, \qquad \forall \text{ distinct } i, j, k \quad (3) \]
\[ x_{i,j} \in \{0,1\}, \qquad \forall\, i \neq j \quad (4) \]

where the 0-1 vector X = (x_{1,2}, ..., x_{i,j}, ..., x_{n,n-1}) denotes an activity sequence; objective function Eq. (1) minimizes the total feedback length: if x_{i,j} = 1, then feedback d_{i,j} and its length (\sum_{k=1,k\neq j}^{n} x_{k,j} - \sum_{k=1,k\neq i}^{n} x_{k,i}) are counted into the objective value; constraint Eq. (2) guarantees that there is exactly one execution order for activities i and j; constraint Eq. (3) ensures that the execution order is transitive; constraint Eq. (4) guarantees that the decision variables are binary.
Further, the original model can be simplified to a sequence-based model (Lancaster & Cheng, 2008; Shang et al., 2019). Let the integer vector S = (s_1, s_2, ..., s_h, ..., s_k, ..., s_n) be an activity sequence, where decision variable s_h is the activity at position h of the sequence; for example, s_3 = 5 means that activity 5 is assigned to position 3. Since position h comes before position k (h < k), d_{s_h,s_k} is the feedback from position k to h, and the sequence-based model can be formulated as follows:

\[ \min \; fl = \sum_{h=1}^{n-1}\sum_{k=h+1}^{n} d_{s_h,s_k}\,(k-h) \quad (5) \]
\[ \text{s.t.}\quad s_h, s_k \in I,\; s_h \neq s_k, \qquad \forall\, h \neq k \quad (6) \]

where objective function Eq. (5) minimizes the total feedback length and (k - h) is the length of feedback d_{s_h,s_k}; constraint Eq. (6) limits the values of the decision variables (s_h, s_k ∈ I) and prohibits an activity from appearing in multiple positions (s_h ≠ s_k). Owing to its concise expression, we mainly analyze the sequence-based model in the rest of this article.
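As a minimal sketch (not the authors' code), the sequence-based objective Eq. (5) can be evaluated directly; we assume that each feedback value d_{s_h,s_k} is weighted by its length (k - h), matching the product form of Eq. (1), and the tiny matrix d below is made up for illustration.

```python
def total_feedback_length(seq, d):
    """Total feedback length of a sequence (objective Eq. (5)):
    each feedback d[s_h][s_k] with h < k is weighted by its length (k - h).
    Positions are 0-based here, which leaves the lengths unchanged."""
    n = len(seq)
    return sum(d[seq[h]][seq[k]] * (k - h)
               for h in range(n) for k in range(h + 1, n))

# Activities 0..2; d[i][j] is the dependence degree of activity i on j.
d = [[0.0, 0.5, 0.0],
     [0.0, 0.0, 0.0],
     [0.4, 0.0, 0.0]]
assert total_feedback_length((0, 1, 2), d) == 0.5   # feedback 0<-1, length 1
assert total_feedback_length((1, 0, 2), d) == 0.0   # all flows are feedforward
```

Reordering activity 1 before activity 0 turns the only feedback into a feedforward, which is exactly the effect illustrated by Fig. 1B.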
Researchers have proved that FLMP is NP-hard and extremely difficult to solve (Meier, Yassine & Browning, 2007). Therefore, many studies have proposed heuristic approaches to obtain near-optimal activity sequences. These algorithms usually follow classic heuristic frameworks, such as genetic algorithms and local search, which can obtain a reasonable solution within a short time but cannot guarantee the global optimum. On the other hand, studies on exact approaches are quite limited, and the existing algorithms are not practical due to their weak computational capability. Nevertheless, the research on specialized exact algorithms has promoted the exploration of FLMP properties. Shang et al. (2019) found that FLMP has an optimal substructure, which allows the original problem to be decomposed into multiple subproblems. Based on this property, they developed a parallel exact algorithm, which can solve FLMP instances with 25 activities within 1 h and is the state-of-the-art exact approach in the current literature (see the detailed review in 'Literature review').
This study focuses on improving the computational capability of exact approaches for FLMP by fully utilizing its structural properties. We develop a double-decomposition based parallel branch-and-prune algorithm (DDPBP) to obtain the optimal activity sequence. The proposed algorithm first divides FLMP into forward and backward scheduling subproblems, then decomposes these subproblems into several scheduling tasks and solves them concurrently. The resulting optimal subsequences are connected to form the global optimum. Furthermore, we propose an effective result-compression strategy to reduce the communication costs of the parallel process, and a novel hash-address strategy to boost the efficiency of sequence comparisons. Computational experiments on 480 FLMP instances show that DDPBP significantly reduces the time required to obtain the optimal solution, and increases the problem scale that exact algorithms can solve within 1 h to 27 activities.
The rest of this article is organized as follows. 'Literature review' presents a literature review on exact and heuristic approaches for FLMP. In 'FLMP analysis', we recall the properties of FLMP, which are the foundation of the proposed algorithm. 'Double-decomposition based algorithm' introduces the main scheme and key phases of DDPBP, including the result-compression strategy. 'Hash strategy' presents the hash-address strategy applied in DDPBP. 'Computational experiments' reports the comparisons between DDPBP and state-of-the-art algorithms. 'Analysis' provides systematic analyses of the parameters and key strategies. 'Conclusions' draws conclusions.

LITERATURE REVIEW
This section presents a literature review of FLMP and the existing solution approaches; some closely related problems are also mentioned. Table 1 summarizes the optimization objectives and the proposed algorithms discussed in the literature.
The high uncertainty of the development process makes it difficult to estimate the project duration, costs and risks; therefore, many studies introduce alternative objectives. Altus, Kroo & Gage (1995), Todd (1997) and many other studies have pointed out that a feedback spanning more activities usually leads to more rework, which indicates that the length of a feedback may significantly affect the development progress. Therefore, a more reasonable objective of finding the activity sequence with the minimum total feedback length is proposed. Over the years, many practical applications have confirmed that feedback length minimization (FLMP) is an appropriate approximation of minimizing the project duration, costs and risks (see, e.g., Meier, Yassine & Browning, 2007; Lancaster & Cheng, 2008; Qian & Yang, 2014).
Due to the NP-hard nature of FLMP, it is extremely difficult to find the optimal activity sequence, even for small-scale problems. Thus, researchers have turned to heuristic approaches to obtain near-optimal solutions. In particular, Lin et al. (2018) proposed an effective hybrid algorithm by integrating an insertion-based heuristic with simulated annealing. Khanmirza, Haghbeigi & Yazdanjue (2021) introduced the imperialist competitive algorithm to solve large-scale FLMP, enhanced by adaptively applying operators and tuning parameters. Wen et al. (2021) introduced an insertion-based heuristic algorithm (IBH) to solve a closely related problem that minimizes the total rework time. This algorithm follows a sequential improvement strategy to select operators, and experiments showed that IBH is competitive in scheduling interrelated activities. Most recently, Peykani et al. (2023) successively optimized the feedback length and project duration by a genetic-algorithm based hybrid approach, in order to reschedule development projects in resource-constrained scenarios. These algorithms can obtain a reasonable solution in a short time, but cannot guarantee the global optimum.
As for exact approaches, only three studies focus on scheduling interrelated activities optimally. Qian & Lin (2013) reformulated FLMP as two equivalent linear programming models, then adopted the CPLEX MILP solver to solve them optimally. However, the largest FLMP instance that can be solved within 1 h is limited to 14 activities, and the performance of this approach is strongly affected by the density of the DSM. Gheidar-kheljani (2022) proposed a multi-objective model that minimizes the total feedback length and the cost of decreasing activity dependence; they applied CPLEX to solve small-scale problems and designed a genetic algorithm for large ones. Shang et al. (2019) proved that FLMP has optimal substructures, which allows the original problem to be divided into multiple subproblems. Based on this, they developed a hash-address based parallel branch-and-prune algorithm (HAPBP), which is the state-of-the-art specialized exact approach in the current literature. HAPBP divides FLMP into two subproblems, and concurrently schedules activities in the forward and backward directions. This algorithm also employs a hash strategy to improve the efficiency of sequence comparison, by mapping activity sequences into hash values. Experiments confirm that HAPBP can solve FLMP instances with up to 25 activities within 1 h. The shortcomings of this study are that the proposed parallel framework limits the algorithm to using only two CPU cores, and the hash strategy is extremely space-consuming, which prevents HAPBP from fully utilizing the available computing resources.
In summary, the studies on heuristic approaches did not fully explore the structural properties of FLMP, and the existing heuristic algorithms are usually designed within classic heuristic frameworks. On the other hand, studies on specialized exact approaches for FLMP are quite limited, and there is clearly an urgent need for dedicated exact algorithms capable of solving problem instances that cannot be solved by existing approaches. Decomposing FLMP into subproblems to reduce the problem complexity, and then solving them concurrently, is a highly appealing way to obtain the optimal activity sequence. However, the existing parallel framework and the applied strategies do not take full advantage of the FLMP properties and the available computing resources, which strongly limits their computational capability. To fill these research gaps, we propose in this work a novel parallel exact algorithm to solve FLMP. The main contributions are summarized as follows.
• We develop a double-decomposition based parallel branch-and-prune algorithm (DDPBP), which can employ all available computing resources to solve FLMP optimally. The proposed algorithm first divides FLMP into forward and backward scheduling subproblems, then decomposes these subproblems into several scheduling tasks, and applies multiple CPU cores to prune unpromising subsequences. The resulting optimal subsequences are connected to form the global optimum.
• We propose two strategies to further enhance the DDPBP algorithm. The result-compression strategy is designed to reduce the communication costs among parallel processes, by extracting and sending only the key information from numerous intermediate results. Furthermore, a novel hash-address strategy is developed to quickly compare and locate subsequences with lower space costs, which significantly accelerates the process of subsequence pruning.
• Computational experiments on 480 random FLMP instances confirm the competitiveness of the DDPBP algorithm compared to the state-of-the-art exact approaches. In particular, the proposed algorithm increases the problem scale that can be solved exactly within 1 h to 27 activities, and significantly reduces the solving time for problems with fewer than 27 activities. In addition, further analyses shed light on the significant contributions of the result-compression and hash-address strategies to the performance of DDPBP.

FLMP ANALYSIS
Decomposing the original problem into smaller subproblems is an effective way to solve complex problems (Chen & Li, 2005; Shobaki & Jamal, 2015; Mitchell, Frank & Holmes, 2022). In this section, we briefly introduce the properties of FLMP and the resulting prune criterion (Shang et al., 2019), which allow the algorithm to divide FLMP into two independent subproblems and discard unpromising sequences effectively. All properties are mathematically proved in the Appendix.

Problem properties
Assume that a development project consists of activities I = {1, 2, ..., i, ..., n}, the activity sequence is S = (s_1, s_2, ..., s_p, s_{p+1}, ..., s_n), and the total feedback length is fl. We set position p (1 < p < n) as a split point, and define region A_p = {s_1, s_2, ..., s_p}, which contains the activities from position 1 to p, and region B_p = {s_{p+1}, s_{p+2}, ..., s_n}, which contains the activities after position p. Then we have the feedback values fv_p^a and fv_p^b that are produced by the subsequences of regions A_p and B_p, respectively.
Property 1: The total feedback length fl = fv_p^a + fv_p^b. Property 1 shows the composition of the total feedback length fl when the original sequence is split into two regions. Further, if we move the split point to position p + 1 or p - 1, then fv_{p+1}^a and fv_{p-1}^b can be derived from the corresponding recursive equations (Eqs. (7)-(9)).
Property 2: Changing the subsequence of region A_p (B_p) does not affect the value of fv_p^b (fv_p^a). With the split point p, FLMP is divided into two subproblems, which minimize the feedback values fv_p^a and fv_p^b and are related to regions A_p and B_p, respectively. Property 2 indicates that although there exist feedbacks from region B_p to A_p, the two subproblems are totally independent of each other.
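Properties 1 and 2 can be checked numerically. The sketch below is our own illustration, not the paper's code: it realizes fv_p^a and fv_p^b as sums of weighted feedback "cuts" across the gaps that each region owns, which is an assumption made for demonstration (the exact Eqs. (7)-(9) are proved in the paper's Appendix), and the matrix d is made up.

```python
from itertools import permutations

def total_feedback_length(seq, d):
    # Objective Eq. (5): each feedback weighted by its length.
    n = len(seq)
    return sum(d[seq[h]][seq[k]] * (k - h)
               for h in range(n) for k in range(h + 1, n))

def cut(upstream, downstream, d):
    # Weighted feedbacks crossing one gap of the sequence.
    return sum(d[u][v] for u in upstream for v in downstream)

def fv_split(seq, d, p):
    """One realization of (fv_p^a, fv_p^b): fv_p^a sums the cuts of the
    gaps after positions 1..p, fv_p^b sums the cuts of the remaining gaps."""
    acts, placed = set(seq), set()
    fva = fvb = 0.0
    for g in range(1, len(seq)):       # gap g lies right after position g
        placed.add(seq[g - 1])
        c = cut(placed, acts - placed, d)
        if g <= p:
            fva += c
        else:
            fvb += c
    return fva, fvb

d = [[0.0, 0.3, 0.0, 0.7, 0.0],
     [0.2, 0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.4, 0.6],
     [0.1, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.8, 0.0, 0.2, 0.0]]
seq, p = (3, 0, 4, 1, 2), 2
fva, fvb = fv_split(seq, d, p)
# Property 1: fl = fv_p^a + fv_p^b
assert abs(fva + fvb - total_feedback_length(seq, d)) < 1e-9
# Property 2: reordering region B_p never changes fv_p^a
assert all(abs(fv_split(seq[:p] + t, d, p)[0] - fva) < 1e-9
           for t in permutations(seq[p:]))
```

The cut of a gap depends only on the *set* of activities before it, which is exactly why the arrangement of one region cannot influence the other region's value.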

Prune criterion
We define that any two sequences are "similar" if they consist of the same activities, such as sequences (1,2,3) and (3,1,2); otherwise, they are "dissimilar", such as sequences (1,2,3) and (1,2,5). Based on the preceding properties, a prune criterion is proposed as follows. Prune criterion: In region A_p (B_p), for a subsequence SA_p (SB_p), if its feedback value fv_p^a (fv_p^b) is not the lowest among its similar subsequences, then any sequence S starting (ending) with SA_p (SB_p) is not the global optimum, and SA_p (SB_p) should be pruned.
For a certain pair of regions A_p and B_p, assume that the optimal SB_p^* of B_p is found; then all high-quality sequences S should end with SB_p^*. Therefore, the quality of S depends on the quality of SA_p, and vice versa. In other words, the prune criterion holds. In addition, for each group of similar SA_p, only the one with the lowest fv_p^a is kept, and the remaining (p! - 1) subsequences are pruned. The same is true for SB_p. We present an example to illustrate how the prune criterion works.
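To make the counting concrete, the following sketch (ours, not the paper's code) enumerates all partial sequences of a small region A_p and keeps one survivor per group of similar subsequences. The region value fv_p^a is computed as a sum of weighted feedback "cuts" across the gaps of the prefix, an assumption consistent with Property 2; the deterministic matrix d is made up.

```python
from itertools import permutations
from math import comb, factorial

def cut(upstream, downstream, d):
    # Weighted feedbacks crossing one gap of the sequence.
    return sum(d[u][v] for u in upstream for v in downstream)

def fv_a(prefix, acts, d):
    # Region value of a forward partial sequence (cuts of gaps 1..p).
    v, placed = 0.0, set()
    for a in prefix:
        placed.add(a)
        v += cut(placed, acts - placed, d)
    return v

n, p = 6, 3
acts = set(range(n))
# Deterministic toy DSM; zero diagonal, values in [0, 0.9].
d = [[0.0 if i == j else ((i * 7 + j * 3) % 10) / 10 for j in range(n)]
     for i in range(n)]

survivors = {}
for prefix in permutations(acts, p):
    key, v = frozenset(prefix), fv_a(prefix, acts, d)
    if key not in survivors or v < survivors[key][1]:
        survivors[key] = (prefix, v)

# Each group of p! similar subsequences keeps exactly one survivor,
# so C(n, p) nodes remain and (p! - 1) per group are pruned.
assert len(survivors) == comb(n, p)
pruned = factorial(p) * comb(n, p) - len(survivors)
assert pruned == (factorial(p) - 1) * comb(n, p)
```

For n = 6 and p = 3, the 120 explored prefixes collapse to 20 survivors, one per activity set, which is the pruning ratio stated above.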

DOUBLE-DECOMPOSITION BASED ALGORITHM
This section presents the details of the proposed double-decomposition based parallel branch-and-prune (DDPBP) algorithm for solving FLMP, covering the general concept, the main scheme, and the key phases of task distribution and result combination.

General concept
Double-decomposition means that DDPBP decomposes the whole sequencing problem from two perspectives. Based on the properties of FLMP, the original problem is first divided into two independent subproblems that are related to regions A_p and B_p, respectively. By introducing the parallel framework, DDPBP can construct active sequences in the forward (from head to tail, A_p) and backward (from tail to head, B_p) directions concurrently. Figure 3 shows the search trees applied in DDPBP. In the forward tree, each node represents a subsequence from position 1 to p within a complete sequence; for example, node (7,6,5) is the first three activities of one complete sequence, and child node (7,6,5,4) is built by adding activity 4 to the end of node (7,6,5). The backward tree follows the same structure, but represents the opposite direction. DDPBP traverses the two trees in a breadth-first way, pruning unpromising nodes along the way (marked by red lines). When the exploration finishes, the remaining partial sequences (leaf nodes) are connected into complete sequences (marked by blue lines), from which we can find the optimal solution. These two exploring processes are totally independent without any information exchange, so they can be distributed to two CPU cores. However, this framework limits the full use of available computing resources. As multi-core computers are common nowadays, a more flexible framework that supports any number of cores is necessary. Figure 4 presents a further decomposition within the forward and backward processes. For an FLMP instance with seven activities, assume that six cores are available; then we can assign half of the cores to each process. For the forward process, in row 3, the nodes are divided into three groups and sent to three cores for node pruning. Since each core only handles part of the nodes, after all tasks are finished, DDPBP gathers the results and performs further node pruning. After all unpromising nodes are discarded, the remaining nodes are used to generate child nodes for the next row. The decomposition in the backward process follows the same way.
The second decomposition makes it possible to take full advantage of multiple cores to share the workload. Although the forward and backward processes do not communicate with each other, the multiple threads within the two processes still exchange data frequently. Research shows that the communication cost in parallel frameworks cannot be ignored (Tsai et al., 2021; Wang & Joshi, 2021). Therefore, how to reduce the impact of multi-thread communication is an important issue in this study.

Main scheme
Algorithm 1 presents the main scheme of DDPBP. The whole procedure consists of a task distribution phase ('Task distribution phase') and a result combination phase ('Result combination phase'). For an FLMP instance with n activities, assume that cn cores are available. Starting with a given parameter na, the algorithm sets the number of rows that the forward process needs to explore to na, and the number of rows explored by the backward process to (n - na). Then, the task distribution phase concurrently traverses the forward and backward trees row by row, and applies the forward and backward processes to discard unpromising nodes (Step 1). Since na and (n - na) may not be equal, if both processes are running, the algorithm distributes the cores equally between the two processes (cn/2 each); if one process ends earlier, the remaining process adaptively takes all the cores to make full use of the computing resources (Step 1.1). After the tree explorations finish, in the result combination phase, each partial sequence SA_na contained in SetA_na is connected to its corresponding sequence SB_na in SetB_na to construct a complete sequence. Finally, the complete sequence with the minimum feedback length is the global optimum (Steps 2-3).
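For intuition, the whole scheme can be condensed into a single-threaded sketch of our own: a forward pass over na rows, a backward pass over (n - na) rows, and a combination step over complementary activity sets. The region values are realized as sums of weighted feedback "cuts" (an assumption consistent with Properties 1-2); the core distribution, compression and hash machinery of the actual algorithm are omitted, and the matrix d is made up.

```python
from itertools import permutations

def fl(seq, d):
    # Total feedback length, objective Eq. (5).
    n = len(seq)
    return sum(d[seq[h]][seq[k]] * (k - h)
               for h in range(n) for k in range(h + 1, n))

def cut(upstream, downstream, d):
    return sum(d[u][v] for u in upstream for v in downstream)

def forward(acts, d, na):
    # Rows 1..na of the forward tree: best prefix per activity set.
    best = {frozenset([a]): ((a,), cut({a}, acts - {a}, d)) for a in acts}
    for _ in range(na - 1):
        nxt = {}
        for placed, (seq, v) in best.items():
            for a in acts - placed:
                child = placed | {a}
                w = v + cut(child, acts - child, d)
                if child not in nxt or w < nxt[child][1]:
                    nxt[child] = (seq + (a,), w)
        best = nxt
    return best

def backward(acts, d, nb):
    # Rows 1..nb of the backward tree: best suffix per activity set.
    best = {frozenset([a]): ((a,), 0.0) for a in acts}
    for _ in range(nb - 1):
        nxt = {}
        for placed, (seq, v) in best.items():
            for a in acts - placed:
                child = placed | {a}
                w = v + cut(acts - placed, placed, d)
                if child not in nxt or w < nxt[child][1]:
                    nxt[child] = ((a,) + seq, w)
        best = nxt
    return best

def ddpbp_sketch(acts, d, na):
    fwd, bwd = forward(acts, d, na), backward(acts, d, len(acts) - na)
    best_seq, best_v = None, float('inf')
    for a_set, (sa, va) in fwd.items():   # result combination phase
        sb, vb = bwd[frozenset(acts - a_set)]
        if va + vb < best_v:
            best_seq, best_v = sa + sb, va + vb
    return best_seq, best_v

n, na = 6, 3
acts = set(range(n))
d = [[0.0 if i == j else ((i * 7 + j * 3) % 10) / 10 for j in range(n)]
     for i in range(n)]
seq_star, v_star = ddpbp_sketch(acts, d, na)
assert abs(v_star - min(fl(s, d) for s in permutations(acts))) < 1e-9
assert abs(fl(seq_star, d) - v_star) < 1e-9
```

The brute-force check at the end confirms that connecting the surviving prefixes and suffixes over complementary activity sets recovers the global optimum on this toy instance.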

Task distribution phase
The task distribution phase realizes the double decomposition of the FLMP problem. The first decomposition is to concurrently schedule activities in the forward and backward directions. The second decomposition is to distribute the pruning tasks of each row to the given cores within the forward and backward processes. We use the forward process as an example to illustrate this idea; the backward process follows the same procedure except that it explores the backward tree.

As shown in Fig. 5, the forward process consists of four components: task distribution, node pruning, result compression and result restoration. These components work sequentially on each row until reaching row p = na.
Task distribution: Suppose that cna cores are available. The algorithm receives the n_{p-1} nodes stored in SetA_{p-1} from row p - 1 and is about to explore row p. These nodes are divided into cna equal parts, i.e., SetA_{p-1}{i} (1 ≤ i ≤ cna) with n_{p-1}/cna nodes each, which are sent to the cna cores, respectively. This procedure is single-threaded.
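The splitting step itself is straightforward; a hypothetical helper (the name `split_tasks` and the near-equal splitting rule are our assumptions, not the paper's) might look like:

```python
def split_tasks(nodes, cna):
    """Divide the surviving nodes of row p-1 into cna near-equal parts,
    one part per core; earlier parts absorb the remainder."""
    k, m = divmod(len(nodes), cna)
    parts, start = [], 0
    for i in range(cna):
        size = k + (1 if i < m else 0)
        parts.append(nodes[start:start + size])
        start += size
    return parts

parts = split_tasks(list(range(10)), 3)
assert parts == [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

When n_{p-1} is not divisible by cna, this rule keeps the per-core workloads within one node of each other.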

Node pruning:
As shown in Algorithm 2, in the forward process, assume that core i is exploring row p and receives the partial nodes stored in SetA_{p-1}{i}.
Step 1 adds a new activity to the end of node SA_{p-1} to build child node SA_p, and calculates fv_p^a using recursive Eq. (9) (if row p = 2, Eq. (7) is used instead); Steps 3.1-3.2 locate the similar node of SA_p at SetT{i}(ha) and only save the node with the lower fv_p^a at this position, where SetT{i} is a temporary result set for core i and ha is a unique hash address for each group of similar nodes (see 'Hash strategy'). These steps repeat until all child nodes SA_p are checked.
Result compression: The hash-address strategy is introduced to boost the efficiency of searching for similar nodes in SetT{i}. In 'Hash strategy', we propose hash functions that map each group of similar nodes into a unique hash address ha. In order to support all possible addresses, the size of SetT{i} is set to C_n^p, which equals the total number of similar-node groups in row p. However, since a core only handles a partial task, it does not need to use all the space of SetT{i}. In fact, the hash addresses appearing in a core are usually discrete and irregular, such as {1, 3, ..., 20, 26}, which causes the final SetT{i} to be sparse. Hence, in order to reduce the communication cost, after node pruning finishes, the algorithm extracts the remaining nodes, the corresponding fv_p^a values and the hash addresses from SetT{i}, and sends them to the next component, instead of transmitting the entire SetT{i} (see the detailed analysis in 'Effectiveness of result-compression strategy').

Result restoration: After receiving the nodes, fv_p^a values and hash addresses from multiple cores, the whole process switches from multiple threads to a single thread. For the results from core i, according to the hash addresses ha, the algorithm assigns the nodes and fv_p^a values to SetA_p(ha), where SetA_p, with a size of C_n^p, contains the optimal node of each group of similar nodes in row p. If SetA_p(ha) is not empty, the algorithm keeps the node with the lower fv_p^a at this position, to further prune unpromising nodes. This procedure repeats until all results from the different cores are checked. Then, the algorithm is ready to explore row p + 1.
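Result compression and restoration can be sketched as follows. Here each per-core result set is modeled as a dictionary keyed by hash address, whereas the paper uses a fixed-size array of size C_n^p, so treat this as an assumption-laden illustration with made-up names and values.

```python
def compress(set_t):
    """Result compression: ship only the occupied hash addresses with their
    (node, fv) payloads instead of the whole sparse result set."""
    return [(ha, node, fv) for ha, (node, fv) in set_t.items()]

def restore(chunks):
    """Result restoration: merge per-core results, keeping the node with
    the lower fv_p^a whenever similar nodes share a hash address."""
    merged = {}
    for chunk in chunks:
        for ha, node, fv in chunk:
            if ha not in merged or fv < merged[ha][1]:
                merged[ha] = (node, fv)
    return merged

core1 = compress({4: ((1, 2, 3), 3.0), 9: ((2, 4, 5), 1.5)})
core2 = compress({4: ((3, 1, 2), 2.0)})          # similar to (1, 2, 3)
merged = restore([core1, core2])
assert merged[4] == ((3, 1, 2), 2.0)             # the lower fv_p^a survives
assert merged[9] == ((2, 4, 5), 1.5)
```

The two nodes at address 4 are "similar" (same activity set), so only the one with the lower region value survives restoration, mirroring the pruning performed in the single-threaded merge.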

Result combination phase
After the task distribution phase finishes, the algorithm connects each node SA_na in SetA_na with the corresponding node SB_na in SetB_na to construct the complete activity sequences, and calculates the total feedback length by fl = fv_na^a + fv_na^b, from which the sequence with the minimum total feedback length is the global optimum.
In addition, since identifying and searching for the right SB_na in SetB_na for each SA_na is time-consuming, we introduce a hash strategy to improve the efficiency of the combination phase. In 'Hash strategy', we propose a function, Eq. (15), that derives the hash address of SB_na from the hash address of SA_na. Hence, when the algorithm retrieves a node SA_na with hash address ha_a from SetA_na, the corresponding node SB_na can be located directly at position SetB_na(ha_b).

Computational complexity
We first consider the task distribution phase. For an FLMP instance with n activities, we assign na and (n - na) rows to the forward and backward processes, respectively. Due to the parallel nature of DDPBP, the computational complexity of the process that explores more rows represents that of the whole algorithm. Without loss of generality, we set na > n - na and take the forward process as an example. For any row p (1 < p ≤ na), the number of nodes that needs to be processed is C_n^{p-1} · (n - p + 1), since each of the C_n^{p-1} surviving parent nodes is extended by the (n - p + 1) unscheduled activities; hence the complexity of this phase is O(∑_{p=2}^{na} C_n^{p-1} · (n - p + 1)). We now consider the result combination phase. At its beginning, there are C_n^{na} pairs of nodes to be connected. With the help of the hash addresses, DDPBP can locate and connect the right nodes directly; hence the complexity of this phase is O(C_n^{na}). In addition, due to the similar structure of the search trees, the overall complexity of DDPBP is quite close to that of HAPBP (Shang et al., 2019). However, the double-decomposition framework allows the proposed algorithm to apply more computational resources in the search process.

HASH STRATEGY
During the forward and backward processes, the algorithm needs to find the similar nodes in SetT for each node of row p; however, the time complexity of determining whether two nodes are similar is O(n^2), and that of locating the similar nodes in SetT is O(|SetT| · n^2) in the worst case. Hence, it is necessary to convert nodes into hash values, and perform hash-value comparison instead of node comparison to boost the efficiency. Shang et al. (2019) applied a hash-address strategy in the HAPBP algorithm, which transforms each group of similar nodes into a unique hash address in a result set Set by using the hash function ha = ∑_{i∈SA_p(SB_p)} 2^{i-1}, so that HAPBP can find the similar nodes of any node at Set(ha) directly. However, as shown in Fig. 6, this function allocates hash addresses for all groups of similar nodes in the search tree, no matter which rows these nodes belong to; in fact, the space complexity of this strategy is O(2^n) in the worst case. In this study, we instead propose hash functions that only allocate addresses for the similar-node groups of the current row, which can significantly reduce the space cost of SetT. For any node in row p, SA_p = (s_1, s_2, ..., s_i, ..., s_j, ..., s_p), we first reorder SA_p such that s_i < s_j (1 ≤ i < j ≤ p), then encode it as a hash address by the following hash function:

\[ ha = C_n^p - \sum_{i=1}^{p} C_{n-s_i}^{p-i+1} \quad (11) \]

where ha is unique for each group of similar nodes in row p. Therefore, when reaching any node SA_p in row p, we can find its similar nodes at SetT(ha) and compare them directly.
As shown in Fig. 7, each group of similar nodes in row 3 is mapped to a unique ha (1 ≤ ha ≤ 10), and SetT only contains the information of row 3, which is quite space-saving and easy to split among different cores. The similar nodes (marked in red) are first reordered as (2,4,5), then Eq. (11) converts SA_3 = (s_1, s_2, s_3) = (2,4,5) into ha = 9 as follows: ha = C_5^3 - (C_3^3 + C_1^2 + C_0^1) = 10 - 1 = 9. For SA_3 = (1,4,5) and SA_3 = (1,2,4), we obtain their ha in the same way: ha = 10 - (C_4^3 + C_1^2 + C_0^1) = 6 and ha = 10 - (C_4^3 + C_3^2 + C_1^1) = 2, respectively. Since Eq. (11) is performed frequently during the search process, we can calculate the combination numbers C_n^m before the search starts; the algorithm then simply selects the appropriate values from a predefined array according to (m, n), instead of recalculating them repeatedly. The hash strategy is also applied to accelerate the combination phase. For any node SA_na in SetA_na with hash address ha_a, the algorithm can use the following function to derive the hash address ha_b of the corresponding node SB_na in SetB_na:

\[ ha_b = C_n^{na} + 1 - ha_a \quad (15) \]

For instance, suppose that n = 6, na = 3, SA_3 = (4,2,1) and SB_3 = (5,3,6). After reordering the two nodes, we can apply Eq. (11) to obtain ha_{(4,2,1)} = 2 and ha_{(5,3,6)} = 19. Based on Eq. (15), we obtain ha_{(5,3,6)} = C_6^3 + 1 - ha_{(4,2,1)} = 20 + 1 - 2 = 19. In other words, Eq. (15) holds.
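The hash function and its complement can be sketched and checked against the worked values above. Eq. (11) is reconstructed here as a lexicographic subset rank and Eq. (15) as its complement; both reproduce the example values in the text, but the exact published formulas may be written differently, so treat this as an assumption.

```python
from math import comb

def hash_address(node, n):
    """Hash address of a group of similar nodes: the lexicographic rank of
    the sorted activity set among all p-subsets of {1, ..., n}
    (a reconstruction of Eq. (11))."""
    s = sorted(node)
    p = len(s)
    return comb(n, p) - sum(comb(n - s[i], p - i) for i in range(p))

def partner_address(ha, n, na):
    """Eq. (15)-style complement: the address of the backward node whose
    activity set is the complement of the forward node's set."""
    return comb(n, na) + 1 - ha

# Worked values from the text (n = 5, row 3):
assert hash_address((2, 4, 5), 5) == 9
assert hash_address((1, 4, 5), 5) == 6
assert hash_address((1, 2, 4), 5) == 2
# Combination-phase example (n = 6, na = 3):
ha_a = hash_address((4, 2, 1), 6)                                  # -> 2
assert hash_address((5, 3, 6), 6) == partner_address(ha_a, 6, 3)   # -> 19
```

Precomputing a table of `comb(m, n)` values, as the text suggests, removes the repeated binomial evaluations from the inner search loop.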

COMPUTATIONAL EXPERIMENTS
This section reports computational experiments to evaluate the effectiveness of the DDPBP algorithm. Specifically, we first describe the benchmark instances and the experimental protocol. Then, we compare the proposed algorithm with the state-of-the-art algorithms in the literature.

Benchmark instances and experimental protocol
We use random DSMs with various sizes and densities as benchmark instances. For each DSM, the degree of information dependence (d_{i,j}) follows a uniform distribution, and the density level is the ratio of non-zero elements. A DSM generator is used to produce the random instances (Qian & Lin, 2013), where the number of activities (n) is set as {..., 17, 19, 21, 23, 25, 26, 27}, and the density of the DSM (den) is set as {0.1, 0.2, 0.4, 0.6, 0.8, 1}. For each pair of n and den, 10 instances are generated, leading to a total of 480 instances used in the experiments.
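Since the exact generator of Qian & Lin (2013) is not reproduced here, the following sketch merely mirrors the stated properties (uniformly distributed dependence degrees, density as the ratio of non-zero off-diagonal elements); the function name and seeding scheme are our own.

```python
import random

def random_dsm(n, density, seed=0):
    """Random DSM sketch: each off-diagonal entry is non-zero with
    probability `density`, with degrees drawn uniformly from [0, 1]."""
    rng = random.Random(seed)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < density:
                d[i][j] = rng.uniform(0.0, 1.0)
    return d

d = random_dsm(17, 0.4, seed=0)      # one instance of the n = 17, den = 0.4 class
assert all(d[i][i] == 0.0 for i in range(17))
```

Fixing the seed makes each generated instance reproducible, which is convenient when re-running time-limited comparisons.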
The DDPBP algorithm is coded in MATLAB 2018 (MathWorks, Natick, MA, USA) with the Parallel Computing Toolbox and runs under the recommended setting of {cn = 8, na = 5} ('Parameter analysis'). The algorithms used for comparison include the HAPBP algorithm of Shang et al. (2019), and the branch-and-cut and branch-and-bound algorithms of the CPLEX and Gurobi solvers. All experiments are conducted on a Lenovo laptop with a 2.90 GHz AMD Ryzen 7 processor (8 cores) and 64 GB of RAM.
Two kinds of experiments are conducted. The first is the comparison between DDPBP and the state-of-the-art exact algorithms ('Comparisons of DDPBP with exact algorithms'), for which we report the average times of obtaining the optimal solutions; the second compares DDPBP with heuristic algorithms ('Comparisons of DDPBP with heuristic algorithms'). Since DDPBP and HAPBP apply the breadth-first strategy to traverse the search trees, they cannot provide feasible solutions until the search finishes; thus, in Table 3, DDPBP and HAPBP do not have a "b_gap" column, and their resulting solutions are the global optimum.
As shown in Table 3, HAPBP cannot obtain the optimal solution of FLMP with 27 activities within 1 h (marked by "-"). As for the general solvers, CPLEX obtains the optimal solutions for 17 out of the 48 kinds of instances (o_gap = 0), most of which have a low activity number and a low density. For example, for FLMP with 27 activities and a 0.1 density level, CPLEX achieves the global optimum within 1 h. However, when the density level increases to 0.2, the average gap between its feasible solutions and the global optimum is o_gap = 13.02%, and the average bound gap is b_gap = 58.40%, which is quite large. On the other hand, the quality of the feasible solutions obtained by Gurobi is much better; in fact, some of them are actually the global optimum (o_gap = 0, b_gap = 0, 10 kinds of instances), compared to the optimal results from DDPBP. However, it is difficult for Gurobi to prove global optimality within 1 h, since the corresponding bound gaps b_gap are still very high.

Comparisons of DDPBP with heuristic algorithms
Since DDPBP can provide the optimal solutions of FLMP with up to 27 activities, it is worthwhile to use DDPBP as a benchmark to evaluate the performance of heuristic approaches, especially to see whether heuristic algorithms can obtain the global optimum. In this section, we introduce two state-of-the-art algorithms to solve the instances from 'Benchmark instances and experimental protocol'. The first is the insertion-based heuristic algorithm (IBH) (Wen et al., 2021), which follows the local-search framework and applies multiple operators, including activity insertion and activity-block insertion.
The second algorithm is the multi-wave tabu search (MWTS) algorithm (Shang et al., 2023), which alternates between a tabu-search based intensification phase and a hybrid perturbation phase. The computational complexity of both algorithms is $O(n^2)$, which is much lower than the $O\left(\sum_{p=2}^{na} C_n^{p-1}(n-p+1)\right)$ of DDPBP. We implemented these algorithms on the MATLAB platform, and set the time limit to 6 min.
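To make the gap between the two growth rates concrete, both bounds can be evaluated numerically. The sketch below is in Python for illustration (the paper's implementation is MATLAB), and it assumes the DDPBP bound reads $\sum_{p=2}^{na} C_n^{p-1}(n-p+1)$:

```python
from math import comb

def ddpbp_nodes(n: int, na: int) -> int:
    """Node-count bound for DDPBP, assuming the complexity expression
    sum over p = 2..na of C(n, p-1) * (n - p + 1)."""
    return sum(comb(n, p - 1) * (n - p + 1) for p in range(2, na + 1))

n, na = 27, 5
print(n ** 2)              # heuristic complexity O(n^2): 729
print(ddpbp_nodes(n, na))  # DDPBP bound: 483327
```

For a 27-activity instance with na = 5, the exact algorithm's bound is several orders of magnitude larger than the heuristics' quadratic cost, which is why the 6-min limit suffices for IBH and MWTS.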
Table 4 reports the average gaps between the objective values obtained by these heuristic algorithms and the optimal values from DDPBP. We observe that IBH actually reaches the global optimum for 15 out of 48 kinds of instances (o_gap = 0), compared to the existing optimal objective values (opt). However, as the number of activities in FLMP increases, it becomes more difficult for IBH to achieve the optimal solutions. For example, for FLMP with 25 activities and a 0.2 density level, the average gap between feasible solutions and the global optimum is o_gap = 3.91%. MWTS, in contrast, performs significantly better, obtaining the optimal solutions of all instances, which confirms the strong intensification ability of tabu search and the necessity of applying a perturbation strategy for diversification in solving FLMP. This experiment inspires us to apply tabu search in the parallel exact algorithm to efficiently generate good bounds and cut search branches. On the other hand, decomposing a complex problem into subproblems and then applying tabu search to solve them concurrently may lead to an effective heuristic framework for solving large-scale complex problems.

ANALYSIS
This section provides systematic analyses of the parameters and strategies applied in the algorithm. We first conduct a sensitivity analysis to see whether there exist significant differences among different parameter settings. Then, to confirm the effectiveness of the double-decomposition strategy, the result-compression strategy and the hash-address strategy, DDPBP is compared with three variants whose related components are removed.

Parameter analysis
The proposed algorithm is controlled by the parameters cn and na. Parameter cn represents the number of cores utilized by the algorithm; its default value is 8, the maximum number of cores available on our computer. Parameter na is the number of rows explored by the forward process and is used to adjust the workloads of the two processes; the recommended setting is 5.
The instances are set with {18, 20, 22, 24} activities and {0.1, 0.5, 0.9} density levels. The ranges of cn and na are {2, 4, 6, 8} and {3, 5, 7, 9, 11, 13, 15}, respectively. For each parameter, we vary its value within the range while keeping the other parameter constant, and run DDPBP to solve one random instance for each FLMP setting. In addition, we use the Friedman test to determine whether there exist statistical differences among the parameter settings. Table 5 indicates that the setting cn = 8 leads to lower time consumption for all instances. For example, for FLMP with 24 activities and a 0.9 density level, the solving times under 2 and 8 cores are 151 s and 68.39 s, respectively. In addition, the Friedman test shows that changing the number of applied cores leads to significant differences in algorithm performance, with a p-value of 2.29e−07, which confirms the necessity of utilizing all the computing resources for solving FLMP. Meanwhile, from Table 6, we observe that DDPBP performs marginally better when na = 5, and the p-value of varying na is 0.92, which means that changing the workloads of the forward and backward processes does not much affect the performance of the algorithm.
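The Friedman test used here is a standard nonparametric test over matched samples. A minimal Python sketch (with mostly hypothetical timing data, not the actual values of Tables 5 and 6) shows how such a comparison across core counts can be run:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical solving times (s): one row per instance, one column per
# core-count setting cn in {2, 4, 6, 8}. The first row uses the two
# values quoted in the text (151 s and 68.39 s); the rest are invented.
times = np.array([
    [151.00, 103.20,  81.50,  68.39],
    [ 42.70,  30.10,  24.80,  21.90],
    [298.50, 201.40, 160.20, 133.70],
    [ 12.30,   9.10,   7.60,   6.80],
])
stat, p = friedmanchisquare(*times.T)  # one sample per cn setting
print(f"Friedman statistic = {stat:.2f}, p = {p:.5f}")
```

Because every instance ranks the settings identically here, the statistic attains its maximum for four treatments on four blocks, and the small p-value mirrors the paper's conclusion that the core count matters.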

Strategy analysis
To confirm the validity of the important strategies employed by the proposed algorithm, we produce three variants for comparison: DDPBP-DDS, which only uses the first decomposition; DDPBP-RCS, without result compression; and DDPBP-HAS, whose hash-address related components have been removed. The additional experiments follow the same experimental protocol as 'Benchmark instances and experimental protocol'.

Effectiveness of double-decomposition strategy
The double-decomposition strategy allows DDPBP to make full use of the available computing resources. To evaluate its effectiveness, we create a variant DDPBP-DDS, where the second decomposition has been disabled. Hence, this variant only deploys the forward and backward processes on two cores to explore the search tree. Table 7 presents the solving times obtained by the two algorithms. The results show that DDPBP performs significantly better than its variant for all instances ($12 > CV_{0.05}^{12} \approx 9$). Specifically, the average gap in solving time (avg((Variant − DDPBP)/Variant)) is 69.02%.
To conclude, this experiment confirms that the proposed DDPBP algorithm is enhanced by the double-decomposition strategy.
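The benefit of the second decomposition can be illustrated with a small concurrency sketch. This is Python with a thread pool standing in for the paper's MATLAB parallel pool, and the task function is a placeholder, not the actual branch exploration:

```python
from concurrent.futures import ThreadPoolExecutor

def explore_branch(task_id):
    # Placeholder for exploring one decomposed sorting task and
    # returning the best feedback length found in that branch.
    return task_id, 10.0 / task_id

# With only the first decomposition, just two workers (forward and
# backward) would run; the second decomposition produces many
# independent tasks, so all 8 cores stay busy.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(explore_branch, range(1, 17)))

best = min(results, key=lambda r: r[1])
print(best)  # the task with the smallest (dummy) bound: (16, 0.625)
```

The point of the sketch is structural: once the forward and backward subproblems are split again into independent tasks, idle cores disappear, which is consistent with the 69.02% time reduction reported above.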

Effectiveness of result-compression strategy
The result-compression strategy is designed to reduce the communication cost when each core finishes its tasks and transmits results. To assess the role of this strategy, we produce a variant DDPBP-RCS, where the cores send the resulting SetT{i} directly, instead of transmitting the extracted information. In Table 8, column ''Row 3-9'' shows the total amount of data transmitted when the two algorithms are about to finish the explorations of rows 3, 7 and 9 of the search trees, and column ''Time'' presents the corresponding solving time for each instance. It should be noted that the density level does not affect the size of SetT{i} or the amount of extracted information; hence the amount of data transmission remains the same for instances with the same number of activities.
From Table 8, we observe that for all instances, DDPBP obtains the optimal solution with less time and lower transmission costs ($12 > CV_{0.05}^{12} \approx 9$). For example, for the instance with 22 activities and a 0.5 density level, DDPBP spends 16.17 s to reach the optimum and transfers 70.99 MB of data when finishing the exploration of Row 7, while the corresponding results of the variant are 23.19 s and 239.46 MB. In general, the average gaps in data amount and solving time are 62.46% and 22.18%, respectively. One reason for this result is that the result-compression strategy only delivers the key information of a sparse SetT{i} within each core, which reduces the amount of data transmission and thus improves the efficiency of the parallel framework.
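The effect of shipping only the key information of a sparse SetT{i} can be illustrated with a small serialization experiment. This Python sketch makes its own assumptions about the data layout (the paper's MATLAB structures will differ):

```python
import pickle

# Dense table indexed by node address, with mostly unused slots.
dense = [0.0] * 100_000
# Sparse view: only the explored nodes and their best feedback lengths.
sparse = {}
for addr in range(0, 100_000, 500):   # only ~200 nodes actually explored
    dense[addr] = 1.5
    sparse[addr] = 1.5

full_msg = pickle.dumps(dense)   # DDPBP-RCS: transmit the whole table
key_msg = pickle.dumps(sparse)   # DDPBP: transmit the extracted entries
print(len(full_msg), len(key_msg))  # the compressed message is far smaller
```

When the result set is sparse, the extracted message is orders of magnitude smaller than the full table, which matches the 62.46% average reduction in transmitted data reported for the real strategy.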

Effectiveness of hash-address strategy
The hash-address strategy is introduced to accelerate the process of locating similar nodes in SetT{i} during the forward and backward processes. To evaluate the impact of this strategy, we create a variant DDPBP-HAS, which identifies whether two nodes are similar by directly comparing the activities within the nodes. Hence, in order to find similar nodes, the variant must check all the nodes stored in SetT{i}.
Table 9 shows that DDPBP significantly outperforms its variant, spending much less time on all instances ($12 > CV_{0.05}^{12} \approx 9$). For example, for the instance with 17 activities and a 0.5 density level, the solving times of DDPBP and its variant are 0.72 s and 476.33 s, respectively. In general, the average gap in solving time is 98.61%, and as the number of activities increases, the solving times of the variant increase rapidly. This experiment proves the necessity of the hash-address strategy for the proposed algorithm.
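A minimal Python sketch of the hash-address idea follows. The similarity signature used here, the same activity set together with the same final activity, is our assumption for illustration, not necessarily the paper's exact rule:

```python
# Map each partial sequence to a hash address; similar sequences collide
# on the same key, so a dominance check touches one bucket instead of
# scanning every node stored in SetT{i}.
store = {}

def address(seq):
    return (frozenset(seq), seq[-1])  # assumed similarity signature

def keep_best(seq, fl):
    key = address(seq)
    if key not in store or fl < store[key][1]:
        store[key] = (seq, fl)  # keep only the dominant similar node

keep_best((1, 3, 2), 4.2)
keep_best((3, 1, 2), 3.7)  # similar node, shorter feedback: replaces it
keep_best((1, 2, 3), 5.0)  # different last activity: separate bucket
print(len(store), store[(frozenset({1, 2, 3}), 2)])
```

Each lookup costs an expected O(1) instead of a scan of the whole set, which is consistent with the 98.61% reduction in solving time observed for the real strategy.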

CONCLUSIONS
Minimizing the total feedback length is an effective objective for optimizing development projects. In this study, we presented an efficient double-decomposition based parallel branch-and-prune algorithm to obtain the optimal activity sequence of FLMP. The proposed algorithm divides FLMP into several subproblems through an original double-decomposition strategy, then employs multiple CPU cores to solve them concurrently. In addition, we proposed a result-compression strategy to reduce communication costs in the parallel process, and a hash-address strategy to boost the efficiency of sequence comparisons. Computational experiments indicate that the proposed algorithm increases the scale of FLMP that exact algorithms can solve within 1 h to 27 activities, and clearly outperforms the best exact algorithms in the literature. Furthermore, additional experiments show the effects of the two parameters on algorithm performance, and confirm the advantage of the double-decomposition strategy and the importance of the result-compression and hash-address strategies.
Some strategies applied in this study are general and could be introduced to solve other sorting problems. For example, the double-decomposition strategy first divides a sorting problem into forward and backward subproblems, then further decomposes them into several sorting tasks, which can significantly reduce the complexity of sorting problems. Furthermore, the hash-address strategy maps similar sequences to a unique value, which can be used to compare and search sequences.

APPENDIX
Since type C feedback spans two regions, changing subsequences in $A_p$ or $B_p$ can affect $fl_p^c$, which means that this type of feedback is not independent of either region. Apparently, the total feedback length of FLMP consists of three types of feedback length, i.e., $fl = fl_p^a + fl_p^b + fl_p^c$. In Fig. A2, assume that the subsequences $SA_p$ and $SB_p$ are fixed. Without loss of generality, set activities $s_h = i$ and $s_k = j$; hence the feedback between activities i and j is of type C, and its length is $l = k - h$. Further, we divide l into $la = (p+1) - h$ and $lb = k - (p+1)$.
If we fix activity i at position h and move activity j to any position in region $B_p$, la remains unchanged, which means that la is not affected by the subsequence in $B_p$. The same is true for lb.
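The split of a type C feedback length can be checked numerically. The positions below are illustrative values satisfying $h \le p < p+1 \le k$:

```python
# Positions of activities i and j around the split point p (illustrative).
h, k, p = 3, 8, 5
l = k - h            # total length of the type C feedback between i and j
la = (p + 1) - h     # portion determined by the position of i in A_p
lb = k - (p + 1)     # portion determined by the position of j in B_p
assert l == la + lb  # each portion depends only on its own region
print(l, la, lb)     # 5 3 2
```

Because la depends only on h and lb only on k, the two portions can be optimized separately over their regions, which is exactly what enables the further decomposition of type C feedbacks.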

Figure 7 Numerical illustrations for the hash strategy.

Figure A1 Three feedback types.

Figure A2 Further decomposition of type C feedbacks.

Table 9 Solving time (seconds) obtained by DDPBP and DDPBP-HAS.
Then, we divide $fl_p^c$ into the feedback values $fv_p^{ca}$ and $fv_p^{cb}$, which are related only to $A_p$ and $B_p$, respectively.