Selection and Optimization of Software Development Life Cycles Using a Genetic Algorithm

In the software field, a large number of projects fail, and billions of dollars are spent on these failed projects. Many software projects are also produced with poor quality or they do not exactly meet customers’ expectations. Moreover, these projects may exceed project budget and/or time. The complexity of managing software development projects and the poor selection of software development life cycle (SDLC) models are among the top reasons for such failure. Various SDLC models are available, but no model is considered the best or worst. In this work, we propose a new methodology that solves the SDLC optimization problem using a genetic algorithm. The methodology selects the best SDLC and optimizes the completion time of the selected model. This study aims to help project managers in a software development organization select the proper SDLC model for their projects and optimize the selected model by minimizing the project completion time. The proposed SDLC model selection approach is based on a selection matrix that consists of a set of selection criteria and information related to the project’s nature given by the project managers. Our methodology optimizes the selected SDLC model by reducing the project completion time and assigning duration for each phase. Several experiments were conducted to obtain the optimal completion time for the selected SDLC models. These experiments showed that our algorithm can optimally minimize the completion time of a given project and assign a duration for each phase. Experimental results showed that our methodology can reduce the completion time of a given project and produce realistic and optimal completion times for different SDLC models.


Introduction
Software development life cycle (SDLC) is a process used in software engineering (SE) [1]. It consists of a set of finite phases and activities that will be conducted during software development [2]. SDLC models define a complete plan for producing high-quality and reliable software products within time and budget targets. They describe how to plan, analyze, design, implement, and test software projects [1,2]. Various SDLC models are available in the scientific literature, but no model is considered the best or worst. Each model has advantages and drawbacks. No specific model is suitable for all project types [3]. The most popular SDLC models in the literature are the waterfall model, iterative and incremental development (IID) model, spiral model, agile model, rapid application development (RAD) model, and software prototyping [2][3][4][5]. Many studies have discussed the advantages and drawbacks of SDLC models [1][2][3][4][5]. In software project management, selecting the best SDLC model is one of the critical issues. Managing software projects is a complex task [3]. The software project manager has a vital role in the project's success, and they should select an appropriate SDLC model. The project manager needs guidance in selecting the best SDLC model for the project. The project manager has to deal with challenges during software development to deliver the project within time and budget. This is due to the complexity of managing software projects, which demand complex management involving planning and scheduling [3]. In project management, the human resources and tasks should be efficiently controlled to achieve a specific objective, such as minimizing cost and time or maximizing profit and productivity. Reducing the software completion time becomes an essential requirement for the early introduction of the software to the market to gain the benefits of market share and profitability [6]. Minimizing the project completion time optimally is known as the "search problem" or "optimization problem." This type of problem can be solved using search-based SE (SBSE) techniques. The project manager needs additional help to solve this problem.
The concept of SBSE was presented by Herman and Jones [7]. The term "search" refers to metaheuristic search-based optimization techniques [7,8]. SBSE is related to the areas of SE and uses search-based optimization techniques to solve different SE problems as optimization problems. Search-based optimization techniques are widely applied to solve many optimization problems in SE, such as problems related to software requirement engineering, testing, design, refactoring, and management. SBSE aims to identify optimal solutions to a given problem in a search space of candidate solutions. To solve a given problem, the problem representation should clearly define an objective function. Then, the solutions are evaluated on the basis of their quality (worse and better) during the search process [8,9]. Many search-based techniques have been proposed to solve various SE optimization problems. The most widely used are genetic algorithms (GAs), simulated annealing (SA), particle swarm optimization, ant colony optimization, and hill climbing (HC) [7]. All of these algorithms are known as evolutionary algorithms (EAs).
In this study, we address two main issues: (i) To enable the project manager to select the best SDLC model, how can we define a mechanism for SDLC model selection, and which criteria can be used? (ii) To enable the project manager to optimize the SDLC model selected, how can we optimize this model to achieve the optimization objective (minimizing the project completion time and assigning duration for each phase), and which search-based technique can be used to solve this problem to provide an optimal solution? To solve these issues, we have to meet the following objectives: (i) Help the user (project manager) select the best SDLC model. The SDLC model selection is in accordance with the information related to project characteristics given by the user and based on a set of selection criteria and matrices. (ii) Optimize the SDLC model selected with the objective of minimizing the project completion time and assigning an optimal duration for each phase. This optimization problem can be solved by applying an SBSE technique to provide an optimal/near-optimal solution. In this study, we propose a methodology for SDLC model selection and optimization. The aims are to help project managers select the best SDLC model and optimize the model selected by reducing the completion time of the project and assigning duration for each phase.
The major contributions of this study are as follows: We propose a selection mechanism based on a selection matrix for the most frequently used SDLC models. We design a framework for SDLC selection-optimization. We formulate an SDLC optimization problem.
We apply a GA to solve our SDLC optimization problem and design a GA's crossover operator (crossover points) to apply to our problem.
The rest of this article is structured as follows. Section 2 provides the background and related works. Section 3 presents the proposed SDLC selection-optimization methodology. Section 4 describes the experimental settings. Section 5 presents the results and discussions. Section 6 provides the conclusion and suggestions for further research.

Background and Related Works
GAs are a type of EAs introduced in [10]. GAs have been successfully applied to solve complex optimization problems. The main reasons for the success of GAs are their ease of use, flexibility, accuracy, and broad applicability. Therefore, we apply a GA to solve our problem [11]. In general, a GA starts with generating an initial population. A population is a random set of initial candidate solutions or individuals that satisfy the constraints imposed on the problem [12][13][14]. Each individual (also known as a chromosome) represents a solution to the problem. A chromosome consists of a set of parameters called genes, which are usually represented in a binary bit string. Chromosomes evolve through continuous iterations (known as generations). They are evaluated, and their fitness values are computed using an objective function during each generation. New chromosomes (known as offspring or children) are formed by applying crossover and mutation operations to generate the next generation. After a sequence of generations, the fittest chromosome is selected to represent the optimal or near-optimal solution to the problem [12].
In the literature, several studies have addressed the optimization of the task and human resource allocation in software project-scheduling problems by minimizing the project time. To our knowledge, none of those studies have addressed optimizing SDLC models. Moreover, no study has combined SDLC model selection and optimization. In this study, we focus on the selection of SDLC models and optimization of the completion time for the entire particular model by assigning the duration of each phase instead of allocating the tasks.
GAs have been successfully applied to different search and optimization problems. As mentioned, the main reasons for the success of GAs are their ease of use, flexibility, accuracy, and broad applicability. Therefore, we apply a GA to solve our problem [11].

Proposed SDLC Selection and Optimization Methodology
In our study, we propose a new methodology that solves the SDLC optimization problem using a GA. The methodology selects the best SDLC and optimizes the completion time of the selected model. This study aims to help project managers select the proper SDLC model for their projects and optimize the selected model. The SDLC model selection approach is based on a selection matrix. It consists of a set of selection criteria and information related to the project's nature given by the project managers. Our methodology optimizes the selected SDLC model by reducing the project completion time and assigning duration for each phase. The methodological framework has two main parts, as shown in Fig. 1. The first part is the selection process. The overall objective is the selection of the proper SDLC model; it contains the selection matrix module that is responsible for the decision of choosing the best model. The second part is the optimization process for the selected model; it contains the GA module. The overall objective of this part is to reduce the completion time of the project and assign duration for each phase in an optimal way.
The pseudocode of the SDLC selection and optimization methodology is given in Fig. 2. The selection of an SDLC process (Step 1) starts when the project manager/decision maker enters the required information related to the given project, such as the total number of employees (project team) involved in the project development. The project manager also determines the project characteristics on the basis of a set of selection criteria. These criteria are used in the selection matrix to select an appropriate SDLC methodology among different SDLCs. Further details of the selection mechanism and matrix are provided in Subsection 3.1. The best SDLC model that suits the project is selected, and the number of selected SDLC phases and the number of repetitions for each phase are calculated. Then, the process of optimizing the selected SDLC via a GA is applied in Step 2. In this step, the selected SDLC is optimized by assigning durations for each SDLC phase with the objective of minimizing the project time. Before the optimization process is run, the initial solution (population) is constructed with an appropriate representation based on the selected SDLC. Once the initial solution is fully constructed, the optimization process is started. Additional details of the GA process are in Subsection 3.2. Eventually, the optimal completion time of the project and the optimal duration for each SDLC phase are obtained.

SDLC Selection Mechanism
We introduce a new mechanism for the selection of the SDLC model. We constructed a selection matrix based on a set of selection criteria. We compared the most popular SDLC models based on 11 criteria. Each criterion has a predefined set of values, such as clear/unclear and small/medium/large. For instance, the requirements should be clear at the beginning of the software development process in waterfall projects.  We evaluated the most common models, namely, waterfall model, IID model, spiral model, agile model, RAD model, and software prototyping. Then, we set their values. Before the selection of any model, all models were evaluated by our methodology according to their fitness with respect to the project information given by the project manager. Subsequently, the ranking values for each model were obtained by computing the 11 values for every model. Finally, the model with the highest ranking value was selected to be the best model for the project development. The proposed selection matrix is provided in Tab. 1.
The data provided by the selection process are the name of the best model for the project, number of phases, and number of iterations for each phase. These data were used as inputs in GA optimization.

Solution Representation
In addition to the selection of the best SDLC model for the project, we optimized the completion time of the project and assigned duration for each phase of the selected SDLC model.
Initially, we represented the problem solution. In GA, each individual of the population represents a possible solution to the problem. In the case of our problem, an individual is considered an instance of the SDLC model. The individual represented by string of genes is known as a chromosome. Each gene in the chromosome represents one phase of the SDLC model. The values of a gene represent three factors: (i) repetition factors, which include the number of iterations for each phase; (ii) the phase duration assigned by the algorithm; and (iii) the number of employees assigned for each phase in accordance with the total number of employees given by the project manager. The data provided by the SDLC selection process are used to construct the initial solution with appropriate representation. The chromosome length differs based on the number of phases of the selected SDLC model. Fig. 3 illustrates an example chromosome representation.
A GA searches for the optimal solution of a problem. It calculates the objective values and selects those individuals that obtain the best objective values (selection). The example in Fig. 4 exhibits how the crossover and mutation operations are performed in our algorithm. We designed multipoint crossover and chose crossover points in every gene in the chromosome. Those points are the phase duration portion in each gene. The crossover operator swaps the values of parents among themselves for every crossover point. The mutation operator randomly changes the value of the phase duration of a randomly selected gene of the offspring.

Problem Formulation
We seek to reach the solution that minimizes the completion time of a given project. Therefore, our optimization problem can be formally described as It is restricted by 1≤ y ≤E, where E is the total number of employees for a given project that the project manager must determine beforehand. The number of employees for each phase is assigned in a way that satisfies the constraint 1≤ y ≤E. The repetition factor (number of iterations for each phase) is constant for each phase in a given model. That is, the repetition factor in the individuals' representation remains constant. The notations indicated in Tab. 2 were used to define Eq. (1).
Thus, the problem solution consists of a combination of values that minimizes the completion time of the project. Those values represent the duration assigned by the algorithm to each phase of the selected model. The objective f (x) sums up the total duration values for all phases and results in the objective value. The individual with the lowest objective value will be selected by the algorithm to be the optimal solution.

Experimental Settings
We conducted a total of 150 experiments on the five most popular SDLC models (waterfall, RAD, Vmodel, spiral, and agile) using a GA. We divided the experiments among the five models, each representing an example of a particular project with a certain employee number. We repeated them 30 times. In this study, we analyzed how our algorithm can optimally minimize the completion time of a given project and assign duration for each phase. We implemented the GA in Java 8 using the NetBeans IDE 8.2 platform and performed 30 independent runs for each model on an Intel CORE i7, 8 th Gen, 2.0 GHz, 8 GB of RAM PC running Windows 10. A population of 64 individuals and multipoint crossover and mutation were applied. The crossover probability was set to 1.0, and the mutation probability was set to 0.07. The stopping criterion was set to reach the maximum generation, which was 500. We set the values of the GA parameters based on a survey of studies that have used GAs to solve similar optimization problems [10]. We believe that these studies have provided good solutions to optimization problems. Tab. 3 illustrates the parameter settings that we set for the GA runs. Before running the GA, our methodology constructed the initial solution with an appropriate representation based on the selected SDLC model. Our GA optimization was implemented with a range of durations that were randomly assigned for the SDLC model phases. Each range of durations was set to reflect the real completion time of a given model. Accordingly, the duration of the phases was assigned randomly to its appropriate position in the genes. The number of employees for each phase was also assigned with a random distribution to its appropriate position in the genes according to the total number of employees given by the project manager. We set a range of total number of employees for every model. These ranges reflect the real sizes of project teams required to accomplish the project. When we conducted independent experiments for each model, we considered various employee ranges.

Experimental Results and Discussion
Several experiments were conducted to obtain the optimal completion time for the selected SDLC models. Tab. 4 shows the results obtained in the experiments for the five models, where the first column denotes the experiment number, and the second to the sixth columns indicate the objective values (optimal completion times in days) for each model. The boxplot analysis of the results is depicted in Fig. 5, which exhibits the distribution of the experimental results. Our GA was originally implemented to handle different ranges of phase durations to reflect the real completion time. Accordingly, Tab. 4 presents the distribution of the results, that is, completion time of the project, which varies with the model type. The results may also be affected by several factors, such as the number of genes (number of phases of the SDLC model), the total number of employees and their distribution at each phase, and the iteration number.
Tab. 5 presents the minimum, maximum, and standard deviation of the obtained results for the five models, and Tabs. 6-10 show the minimum, maximum, and standard deviation of the obtained results in detail for each model. From the statistical results of the SDLC models presented in Tab. 5, the waterfall model had high StDev, which meant that significant differences existed among objective values. The agile and spiral models had medium StDev, and V-model and RAD model had low StDev. The reason for the high StDev in the waterfall model was the large range set to assign the durations of the phases of the waterfall model randomly. The StDev in V-model and RAD was low due to fewer ranges. Although the distribution of the objective values obtained in the experiments varied with the model type, these results were realistic and optimal completion times of the projects.       From the results of the waterfall model experiments presented in Tab. 6, the Min value was 723, Max value was 1019, and StDev was 66. Analysis of the waterfall boxplot results shown in Fig. 5a indicates that the distribution of the objective values was realistic and optimal as completion time of the waterfall model projects. The significant differences among objective values (StDev) obtained in the experiments were normal due to the random distribution of the significant duration range suitable for the completion time of the waterfall model projects. Indeed, objective values may also be affected by several factors, such as the number of genes (number of waterfall phases) and the employee distribution for each phase. With respect to the agile model shown in Tab. 7 and Fig. 5b, the Min, Max, and StDev were 360, 450, and 28, respectively. Significant differences existed among the objective values (StDev) obtained in the experiments. These results were expected and realistic for the agile model projects. This is because the significant duration range of the project completion time for the agile model affected the random distribution of objective values. In addition, the agile model was affected by several factors, such as the number of genes (number of the agile phases) and the number of model iterations. The results may further be affected by the employee distribution for each phase. The spiral model results presented in Tab. 8 and the visual results shown in Fig. 5c indicated that the distribution of objective values was realistic and optimal for the project completion time of the spiral model. The differences among objective values obtained in the experiments were normal and considerably high due to the random distribution of few duration ranges suitable for the project completion time of the spiral model. The results may also be affected by several factors, such as the number of genes (number of spiral phases), the number of iterations, and the employee distribution for each phase. The results of the V-model experiments presented in Tab. 9 and the boxplot results shown in Fig. 5d demonstrate that no significant differences existed among the objective values obtained in the experiments. The results seem stable, realistic, and optimal for V-model. These results are normal and expected due to the random distribution of the very few duration ranges suitable for the project completion time of the V-model. The RAD model experimental results in Tab. 10 and Fig. 5e indicate no significant differences among the objective values obtained in the experiments. The results are stable, realistic, and optimal for the project completion time of the RAD model. These results were expected due to the random distribution of very few duration ranges suitable for the project completion time of the RAD model. The experiments in this study were limited only to the five SDLC models, which are waterfall, agile, spiral, V-model, and RAD. Other models, such as IID and prototyping, were excluded because they rely on an iterative factor that needs an intensive computation.

Conclusion and Future Work
In this study, a new methodology for SDLC model selection and optimization is proposed. No approach or study exists in the literature that addresses the optimization of SDLC models. This gap motivated us to conduct this study. We proposed a methodology that would provide a good solution for project managers in software project organizations. The proposed methodology could help project managers in the selection of a proper SDLC model and optimize completion time of the project in the selected model. The selection of a proper SDLC model depends on many criteria related to the nature of the software project, project team, experience level of the developers, and risk and cost management. We identified a set of selection criteria on the basis of many studies and introduced a selection matrix into SDLC model selection. Moreover, we applied a GA to optimize the selected model with the objective of minimizing the completion time of the project. We implemented the GA in a way that represented our solution. We conducted a total of 150 experiments on the 5 most popular SDLC models (waterfall, RAD, V-model, spiral, and agile). These experiments demonstrated how our algorithm can optimally minimize the completion time of a given project and assign duration for each phase. The obtained results of the experiments showed that the objective values for the five models were realistic and optimal as completion times of projects.
In the future, other SDLC models can be included in the SDLC selection matrix. The application of other search-based techniques to the SDLC optimization problem needs further investigation. In addition, the performance of the results could be improved by combining GAs with other optimization techniques, such as HC and SA.