New Construction of Family of MLCS Algorithms

The multiple longest common subsequence (MLCS) problem involves finding all the longest common subsequences of multiple character sequences. This problem is encountered in a variety of areas, including data mining, text processing, and bioinformatics, and is particularly important for biological sequence analysis. By taking the MLCS problem and algorithms for its solution as research domain, this study analyzes the domain of multiple longest common subsequence algorithms, extracts features that algorithms in the domain do and do not have in common, and creates a domain feature model for the MLCS problem by using generic programming, domain engineering, abstraction, and related technologies. A component library for the domain is designed based on the feature model for the MLCS problem, and the partition and recur (PAR) platform is used to ensure that highly reliable MLCS algorithms can be quickly assembled through component assembly. This study provides a valuable reference for obtaining rapid solutions to problems of biological sequence analysis and improves the reliability and assembly flexibility of assembling algorithms.


Introduction
e problem of finding the longest common subsequence between sequences is called the longest common subsequence (LCS) problem, and that of finding the longest common subsequence among more than two sequences is called the multiple longest common subsequence (MLCS) problem. e MLCS problem was shown to be an NP-hard problem in [1].
is problem is widely used in the fields of bioinformatics and computational genomics [2][3][4]. e rapid development of high-throughput sequencing technology [5] has promoted the implementation of many international genomic projects [6], and massive amounts of data have been generated in the field of bioinformatics.
is has led to the formation of large-scale bioinformatics data, where this scale continues to grow. e LCS of a set of biological sequences usually preserves informative segments (subsequences) that are conserved during the evolutionary process on the condition that no internal deletions or insertions occur within contiguous segments. Finding the LCS of biological sequences is the basic means and method of studying biological sequence data and forms the basis for further sequence analysis (such as multiple sequences alignment, gene prediction based on similarity, genome rearrangement, and so on).
For over 40 years, significant efforts have been made to find efficient algorithms for the MLCS problem. For the LCS problem, the most effective algorithm is the dynamic programming algorithm. e normal dynamic programming algorithm was first proposed by Wagner and Fischer in 1974 [7]. However, when searching the LCS of large-scale data, the normal dynamic programming algorithm has problems such as excessive memory space occupation and high time complexity. erefore, the dynamic programming algorithms that optimize space complexity were proposed, such as Hirschberg algorithms [8], SB algorithms [9], and Smith-Waterman algorithms [10]. e dynamic programming algorithms can effectively solve the LCS problem of double sequences, but in practical applications, it is often necessary to deal with the MLCS problem of multiple sequences. As the number of sequences increases, the time and space complexity of the dynamic programming algorithms increase exponentially, resulting in a poor performance of the dynamic programming algorithms when dealing with MLCS problems. In order to solve this problem, many other MLCS algorithms have been proposed, which can be divided into the heuristic MLCS algorithms and the dominant pointbased MLCS algorithms. e heuristic MLCS algorithms are approximate algorithms, and the dominant point-based MLCS algorithms are precise algorithms. e approximate algorithms for MLCS problem were first proposed in 1994 [11,12], and then the Expa algorithm was successively proposed [13], as well as best next for maximal available symbols (BNMAS) algorithm [14,15], ant colony optimization (ACO) algorithm [16], beam search (BS) algorithm [17], and so on. In the ACO algorithm, Shyu and Tsai transformed the MLCS search problem into a node state change problem in the ant colony algorithm and defined state transition rules and pheromone update rules. In the BS algorithm, Blum transforms the MLCS problem into a problem of tree search, first builds a search tree for all input sequences, and then uses the heuristic tree search algorithm, Beam search, to search for MLCS. e efficiency and solution quality of ACO algorithm and BS algorithm are better than other heuristic MLCS algorithms. In practical applications of MLCS problem, there are many cases where the exact solution needs to be calculated. erefore, the dominant point-based algorithms are most widely used in the MLCS problem. In 1977, Hunt and Szymanski found that, when using dynamic programming algorithms to solve LCS problems [18], the actual data used is only the value of the position of a few points with the same characters in the dynamic programming score matrix, and the data at other positions in the score matrix can be deleted. According to the above findings, the idea of dominant point-based algorithm is proposed: by calculating the dominant points between a set of molecular sequences, and generating all the dominant point data into a directed acyclic graph (DAG), each longest path in the DAG is a longest common subsequence of the input sequences. Compared with the dynamic programming algorithm, it greatly reduces the time and space complexity. In 2006, Chen proposed the Fast_LCS algorithm first using the successor table (a data structure) [19]. e successor table can quickly calculate the match points of the molecular sequence and then quickly delete the redundant match points through a defined series of pruning operations to generate dominant points, thereby greatly reducing the time to generate a DAG. In 2011, wang proposed Quick_DP algorithm and its parallel algorithm Quick_DPPAR by improving the pruning function of Fast_LCS [20]. In 2016, Li proposed Top_MLCS algorithm [21] by constructing Irredundant Common Subsequence Graph (ICSG). ICSG removed all redundant and duplicate matching points, which greatly improved the efficiency of searching MLCS. In addition, there are some algorithms that use heuristic functions as pruning functions to delete redundant matching points, and achieve good results, such as ProMLCS algorithm [22], MLCS-A * algorithm [23], MLCS-APP algorithm [24], and so on.
Current research on the MLCS problem has focused on optimizing specific steps of particular algorithms [25] or their parallel implementation [26]. Due to the diversity and structural complexity of MLCS algorithms, many users cannot select appropriate algorithms that conform to the characteristics of the given sequence, where this may lead to unnecessary errors in the application process. erefore, it is important to research MLCS algorithms at the level of domain abstraction. is paper uses generic programming [27], domain engineering, feature modeling [28][29][30], abstraction, and related technologies to study the MLCS problem. We consider MLCS algorithms as research domain and conduct a domain analysis. e featureoriented domain analysis (FODA) [28] method is first used to establish the model for the MLCS domain. Following this, the PAR platform is used to describe, design, and implement components of the domain that are involved in the MLCS domain model to generate a highly abstract MLCS algorithms domain component library. Finally, this component library is used to assemble and generate executable MLCS algorithms.

Partition and Recur Method. Partition and recur method
(PAR method) is a formal development method based on division and recursion [31][32][33][34][35]; it is a unified algorithm design method, covering a variety of known algorithm design techniques, including dynamic programming method, greedy method, divide and conquer method, exhaustive method, and so on. It contains an algorithm design language (Radl, recurrence-based algorithm design language), an abstract generic programming language (Apla), and a unified algorithm design and proof method, as well as a series of generation systems (the PAR platform). Among them, PAR provides formal support for the process of designing the Radl algorithm based on the problem; it provides automated support for the generation of the Radl algorithm from the Apla program to the executable language program. PAR has established two formal program development paths, and its platform architecture is shown in Figure 1. e first is that, for the quantization problem, the PAR method can convert the Structured Natural Language (SNL) demand model into a Radl protocol model, and then into a Radl algorithm model, and then further into an Apla abstract program model, and finally into a high-level language program that can be run directly. e second way is, for nonquantitative problems, you can directly design the Apla program through the SNL demand model, supplemented by the corresponding formal proof, and then convert the Apla program into an executable program.

Domain Modeling.
Software reuse is considered an effective means of solving the software crisis and implement the mode of industrialized production to the software industry [36]. Software reuse activity consists of two related phases: the production phase of reusable software assets, and the development phase of application systems based on reusable software assets [37]. Domain engineering corresponds to the stage of production of reusable software assets (system identification) and can develop and organize reusable software assets in the domain to provide the necessary material and technical foundation for the subsequent development of the application system. Domain engineering is the main technical means of the production of reusable software assets. It consists of three stages: domain analysis, domain design, and domain realization. Domain analysis is based on an analysis of the requirements of several typical systems in the domain. It first considers such factors as expected changes in demand, technological development, and objective constraints to determine the appropriate scope of the domain. It then identifies the commonality and variability in the domain to obtain a set of domain requirements with sufficient reusability. Finally, it abstracts the requirements to form a domain model. Domain design and implementation is based on the domain model to identify, develop, and organize reusable assets, such as frameworks and components in the domain. When developing a new algorithm in the same field, we need to determine the requirements of the algorithm based on the domain model, select the appropriate algorithmic framework, and use this as the basis for selecting components for assembly to finally form the new algorithm. In this way, new algorithms are no longer developed from scratch but are based on the massive reuse of algorithmic components in the stages of analysis, design, and implementation.
FODA method [28] was introduced into the research and practice of domain engineering in 1990. FODA uses the relationship between features (feature model) as an important part of the domain model. Features in domain engineering can be divided into two categories: mandatory and variable features. Mandatory features are those that exist in all parts of the domain, and variable features are those that exist only in parts of the domain. Mandatory features clarify the commonality of the domain and variable features clarify its variability. Compared with commonality, variability is more important to study. Variable features can be divided into the following types: (1) Optional feature (optional), which represents an optional feature of the instance in the domain (2) XOR feature, which represents that the instance in the domain has and can only be selected one feature in a set of XOR features, because the XOR features are mutually exclusive (3) OR feature, which means that the instance in the domain contains at least one feature in a set of OR features e mandatory feature and three variable features are represented in the domain model as shown in Figure 2. In this paper, the feature modeling method is used to establish the domain feature model for the domain of MLCS algorithms, and the abstract programming language Apla of the PAR platform is used to formally implement the component library of the domain.

Dominant Point.
We use point P � [p 1 , p 2 , . . . , p n ] to denote a unit in the dynamic programming scoring matrix L of n sequence, where p i (i ≤ n) represents the position of a character of sequence a i , and the value of position P in matrix L is L[P].
at is, the characters of each sequence are consistent at point P. Journal of Healthcare Engineering

Domain Analysis Process.
is section invokes research on the commonly used MLCS algorithm and analyzes the domain of such algorithms. e FODA feature modeling method is used to perform feature modeling on the service (S), function (F), and behavior characteristics (B) in the domain of MLCS algorithm. e longest common subsequence search service is the core service in the field of MLCS.
e main functions in this domain include dynamic programming (dp) algorithm operation, heuristic multiple longest common subsequence (HMLCS) algorithm operation, and the dominant point-based multiple longest common subsequence (DP-MLCS) operation. Among the core service, the dp algorithm operation and the HMLCS operation are optional functions, and the DP-MLCS algorithms operation is a mandatory function. e description of optional functions only focuses on its main components. For the mandatory function DP-MLCS, the main flow of its algorithm operation is described in detail. e dynamic programming mode selection (dp_mode) is a significant behavioral characteristic of the operation of the dynamic programming algorithm, including two main behavioral characteristics: standard dynamic programming algorithm (normal) and dynamic programming algorithm that optimizes memory consumption (space_opti). Heuristic mode selection (heur_mode) is a significant behavior characteristic of HMLCS. Here, only ACO algorithm and BS algorithm are taken as its main behavior characteristics. e operation process of the mandatory function DP-MLCS is as follows: (1) Check the validity of the sequence. e input of the algorithm is DNA sequence, and protein sequence. Before the algorithm is executed, the sequence information needs to be checked for validity to determine whether the input sequence is biologically significant.
(2) Establish a preprocessing matrix for the input sequence. e data structure (successor  e flow chart of DP-MLCS algorithms is shown in Figure 3.
rough accurate analysis of a series of DP-MLCS algorithms, the successor table (Successor Table,

Components Implementation.
According to the above domain feature model and component dependency relationships of MLCS algorithms, the PAR platform is used to formally implement the domain component library. e Radl language is first used to accurately describe the functional specifications of components of the MLCS field, which are then integrated according to the dependencies between them to form an ADT. Finally, the abstract programming language Apla is used to implement all ADTs. e specifications of the key components cal_match_point and prune_mode in the DP-MLCS algorithms are as follows:  e output flag is a Boolean value. If it is true, it indicates that the point P is the dominant point. Otherwise, the point P is a redundant point and should be deleted.
By using abstract data types and abstract processes that write programs directly, the Apla language can describe algorithmic problems abstractly. is makes it easy to verify its correctness and in turn ensures the reliability of the program. As the input language of the program generation system of the PAR platform, Apla can be quickly and easily converted into C++, Java, Python, and other programming languages and platforms.
According to the MLCS domain feature model established above, and by considering the dependencies between components as well as the relationships between data types and functions in the domain, we use Apla to define and implement MLCS algorithms components to form a domain component library. Owing to limitations of space, only the formal implementation of the DP-MLCS algorithm is given, and the specific implementation code is omitted.  temporarily uncertain, a type parameter elem is defined. e function read_seqs is used to obtain all the sequences to be searched in the input text file. e function seq_check is the realization of the function of the component seq_check and is used to check the legality of the input sequences. e subroutine Generate_successor_matrix is the realization of the function to the successor_matrix_mani component. e generic subroutine successor_matrix in Generate_Successor_matrix dynamically provides memory for the successor matrix according to the input data and initializes the successor matrix. Of the input parameters, char_num is the character type in the input sequences, and seq_num is the number of input sequences.
is generic subprogram supports the instantiation of different forms of successor tables required by different algorithms. e memory_successor_of_matrix subroutine is responsible for calculating and storing the successor matrix of the input sequence. e get_value and set_value subprograms, respectively, obtain and set the value in successor_matrix. procedure backtrack (sometype elemMatrix; func print_MLCS(): string); enddef e ADT DAG encapsulates three operational components related to directed acyclic graphs: the DAG, backtrack, and result_op. e function Generat_graph generates a DAG based on the set of the input dominant points, and the subprogram backtrack uses a depth-first search strategy to traverse the DAG. According to the matrix of dominant point, the subprogram will choose different traversal methods, and it also includes a subfunction print_MLCS, which is used to output the obtained MLCS, and record and output the number of MLCS.      Results. By using the Apla-C++ program generation system of the PAR platform, we convert the Apla program of the assembly algorithm into C++ code. e generic programs and functions defined in Apla were converted into separate class member functions in C++ to reduce the coupling between components. In particular, the calling functions were converted into indicator functions in C++, and the generic parameter was converted into the pointer parameter to implement the polymorphism of the Apla program. After converting all ADT components to C++, the Apla code of the MLCS search operation was converted into a main function, and finally an algorithm program that can be run was generated by manually assembling the components.

Applications
We tested our assembly algorithm on the rice gene sequence of the GenBank database and compared the performance of our assembly algorithm with that of Clustal-W algorithm [38], which is a popular algorithm for multiplesequence alignment and MLCS. First, we experimented with different numbers of DNA sequences when the sequence length is 50; the computation time of MLCS to the two algorithms is shown in Table 1. Second, we tested the two algorithms on five sequences sets with different lengths; the result is shown in Table 2, and one computation result of the assembly algorithm is shown in Figure 6. From Tables 1 and 2, we can see that our assembly algorithm is faster than Clustal-W for sets with different numbers of sequences and sequences sets with different lengths. Because Clustal-W is not an exclusive algorithm for MLCS problems, the accuracy rate of MLCS obtained by it is also lower than that of the DP-MLCS algorithm we assembled.
Because of the excellent verifiability of high-level language Apla, the MLCS algorithm formed by component assembly in the domain of MLCS not only improves the reliability, execution efficiency, and maintainability of the assembly algorithm program, but also can be manually assembled to form the specified algorithm according to customer needs, and also enhances the generality of algorithm components.

Conclusion
Sequence alignment is a key method and problem in research on bioinformatics. Finding the longest common subsequence among multiple biological sequences is an effective technique for sequence alignment. e MLCS algorithms are widely used in bioinformatics, data mining, information retrieval, and pattern recognition. is article is the first to consider algorithms to solve the MLCS problem as a special field, research them at a high level of abstraction, improve their reliability and developmental efficiency, and reduce the likelihood of occurrence of suboptimal solutions and errors. We used FODA method to analyze the domain of the MLCS algorithms, extract its general and variable features, and transform them into components. e formalized specification language Radl was used to accurately describe the functional specifications of components of MLCS algorithms, and the high-level language Apla was used to implement the function components. Automatic or semiautomatic methods are expected to be available to solve the problems of the assembly and generation of highly reliable components for specific problems. Finally, through a series of program conversion systems on the PAR platform, it is quickly converted into executable algorithms. e final experimental results show that our research has high practicability.
is paper combines the theories and techniques of domain engineering, generative programming, formal methods, etc., and the practical research carried out on the MLCS algorithm family can provide new ideas for algorithm research in other fields of bioinformatics. We have applied this research method to the study of pairwise sequence alignment problems and multiple-sequence alignment problems and have achieved substantial achievement. is shows that the algorithmic assembly generation method in this paper is extended to other types of problems and is expected to develop into a method to solve a series of similar problems. Next, we will design a graphical user interface (GUI) for the component library to enable users to obtain the required algorithms by selecting components on the visual interface.

Data Availability
e datasets generated for this study are available on request to the corresponding author.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.