Identifying Software Complexity Topics with Latent Dirichlet Allocation on Design Patterns

The scientific literature has paid limited attention to studying software complexity from the design point of view. A significant number of papers study software complexity in relation to maintenance, refactoring, and source code changes, or establish metrics for measuring software complexity. This paper compares design patterns and software complexity in order to identify research trends in the software complexity area. For this purpose, we assess the strengths and weaknesses of scientific articles on software complexity through the lens of design patterns. We reviewed 1068 papers using the latent Dirichlet allocation (LDA) technique. We found that existing software complexity research places disproportionate emphasis on how software complexity could benefit from design patterns rather than on how contributions to design patterns can benefit from software complexity.


Introduction
Software complexity is a key property in software engineering and application development. The subject relates to refactoring, reusability, reducing software project management costs, and building large infrastructures at low cost. Although there are many software metrics, researchers still consider that there are aspects to improve through scientific studies. Because modern software engineering relies on object orientation when designing applications, and because software complexity covers a wide area, we decided to study software complexity by approaching the field of design patterns, hoping to identify the main topics to study and future trends in research.
Various topics on object-oriented design have been proposed over the years. Design patterns are a subject of interest because they offer solutions for the coupling and cohesion between different layers of an application. The problems caused by a design with high coupling are: changes in related classes force local changes; classes are harder to understand in isolation; and classes are harder to reuse because reuse requires the presence of other classes. The problems caused by a design with low cohesion are that it is hard to understand and hard to reuse or maintain. High cohesion means that a class has moderate responsibility in one functional area and collaborates with other classes to fulfill a task. Software complexity can be reduced by designing systems with the weakest possible coupling between modules [1]. Historically, complexity in programs arising from the number of conditional and iterative statements has been measured using the cyclomatic complexity metric [2].
Refactoring code with design patterns reduces complexity, although it increases the number of classes [3]. The authors show that design patterns do not always improve the quality of systems. Some patterns are reported to decrease some quality attributes and to not necessarily promote reusability, expandability, and understandability. They also bring further evidence that design patterns should be used with caution during development because they may actually impede maintenance and evolution. Their study also reveals that object-oriented principles may not be so "good", as they may not necessarily result in systems with good quality. However, we consider that the effect that design patterns might have on software complexity is not very well represented in the literature. The scope of this paper is to identify the main topics and trends, and to test whether there is a correlation between the two subjects.

Materials and methods
In this section, we present the research goals and the questions to be answered, and we describe the selection criteria for the studies chosen for analysis and the data collection. The purpose of this article is to get a broad and current overview of the two subjects considered in this paper: design patterns and software complexity. The analysis was carried out on academic journal articles. The search for papers was conducted in 2019 on Thomson Reuters' Web of Science, which has a large interdisciplinary database of academic texts, and was limited to peer-reviewed articles and reviews in English. We ran two searches in the title, abstract, and keywords of papers from ISI Clarivate:
• design patterns, which returned 2045 articles;
• "software complexity", which returned 302 articles.
Data selection is presented in Figure 1.

Fig. 1. Papers dataset
In order to identify the topics in the software complexity subject area, we applied LDA to the corpus of articles belonging to the design patterns subject. For the LDA analysis, we used the abstracts of the selected papers. Figure 2 presents the approach of our study.

Fig. 2. LDA research approach
DOI: 10.12948/issn14531305/23.4.2019.01
The abstracts are expected to give a sufficient indication of the subject of each paper and thus provide an overview of the topics discussed in the respective fields [4]. In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. LDA is an example of a topic model. As the topic modelling method employed in this paper, LDA offers greater flexibility because it takes the whole abstracts of the papers as input. LDA is a generative probabilistic topic model proposed by [4], which can be used for the unsupervised identification of underlying topics in a large corpus of data without any prior knowledge of the topics [5][6]. Although the documents, or abstracts, are known and observed, the topics are hidden or latent [7]. The total number of abstracts (papers) is denoted by N.
For each abstract d (d = 1, …, N), we extracted a vector of words Xd = [Xd1, Xd2, …, XdWd], where Wd is the number of words in abstract d. W is the number of unique words in the dataset, and V = [w1, w2, …, wW] is the vocabulary of words. Rather than representing a text T in its feature space as {Word_i: count(Word_i, T) for Word_i in V}, we can represent the text in its topic space as {Topic_i: weight(Topic_i, T) for Topic_i in Topics}. The LDA topic model algorithm requires a document-word matrix (DWM), also called a document-term matrix, as its main input, where DWM[i][j] is the number of occurrences of word_j in document_i. Topics are latent variables composed of word distributions. Producing an interpretable solution is the beginning, not the end, of an analysis. To draw adequate conclusions, the interpretation of the latent variables must be substantially validated [8]. Several authors have proposed guidance for evaluating and validating LDA models [9]. We studied the coherence and perplexity of the different resulting LDA models, chose the model with the best coherence value, filtered the articles written on software complexity starting from the topics discovered for design patterns, and analyzed the resulting papers by reviewing their subject and research method. The LDA was carried out in Python. We used LDAvis to visualize and optimize the number of topics [10].
The selection criteria for our sample of studies were based on the following considerations: 1) the scientific articles indicate the concerns of the research field; 2) the scientific articles published in the main journals from the ISI Web of Science database offer a broad overview of the subject. This work is based on the three goals with related motivations presented in Table 1.
• G1: to investigate the relation between the design patterns subject and software complexity by identifying the proportion between the articles written on design patterns and those on software complexity, and to test the correlation.
• G2: to investigate whether the two subjects are statistically different.
• G3: to investigate the topics on design patterns within the software complexity subject.
Research questions were derived from each goal, and testable hypotheses were formulated, as summarized in Table 1.

Table 1.
Goal | Research question | Motivation | Null Hypothesis H0
G1 | Q1: Is there a directional relationship between the topics on the subject of design patterns and the topics on the subject of software complexity? | The subject of design patterns and the subject of software complexity belong to the same field of study, namely software engineering | No linear relationship between the two subjects
G2 | Q2: Are the two subjects of study different? | While it has been identified that design patterns are important for software complexity, the topics on the subject vary and treat different aspects | The two subjects do not differ
G3 | Q3: What are the topics studied on the subject of design patterns from the software complexity "world"? | |

The hypotheses established in our study were:
H1: there is a linear relationship between the subject of design patterns and the articles written on the subject of software complexity;
H2: there is a difference in the topics treated by the articles belonging to these two subjects.
With respect to related work, our study is an attempt to evaluate the relationship between the subjects of design patterns and software complexity. It is also the first work that examines the topics of the software complexity research field with techniques from natural language processing. Additionally, we outlined the steps performed in the methodology: 1) identifying the articles written on the two subjects and identifying the topics in the field of design patterns by using LDA; 2) identifying the proportion of articles per topic for the design patterns and software complexity subjects; 3) testing the hypotheses. Establishing whether there is a relationship between the two subjects has several applications in software engineering, including: A1) predictions of topics on the design patterns subject; A2) predictions of topics on the design patterns subject across the software complexity subject.
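The document-word matrix defined earlier in this section can be illustrated with a minimal sketch. It assumes abstracts that are already tokenized; the function name `build_document_word_matrix` and the two toy abstracts are hypothetical and are not taken from the study.

```python
from collections import Counter


def build_document_word_matrix(tokenized_docs):
    """Return (vocabulary, dwm) where dwm[i][j] counts word_j in document_i."""
    # The vocabulary V is the sorted list of unique words in the corpus.
    vocabulary = sorted({word for doc in tokenized_docs for word in doc})
    column = {word: j for j, word in enumerate(vocabulary)}
    dwm = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc).items():
            row[column[word]] = count
        dwm.append(row)
    return vocabulary, dwm


# Hypothetical toy abstracts, already tokenized (not data from the study).
docs = [
    ["pattern", "design", "pattern", "class"],
    ["metric", "complexity", "class"],
]
vocabulary, dwm = build_document_word_matrix(docs)
```

Each row of `dwm` is the feature-space representation of one abstract. An LDA implementation would take such a matrix as input and return, for each abstract, its weights in topic space.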

Results
The distribution over Web of Science categories of the papers on the design patterns subject is presented in Table 2, and that of the papers on software complexity in Table 3.
The first step in the pre-processing was to remove stop-words. Stop-words are words such as "the", "a", "I", "him", etc. Next, we created bi-grams and tri-grams. These terms are new words formed from combinations of words that are commonly juxtaposed. Next, we lemmatized the words. This involved removing inflectional endings, thus returning each word to its base form. An example is changing the word "Working" to "Work". This helps with topic modeling and interpretation. We applied LDA to the abstracts of the design patterns (computer science) articles.
The topics, the first 30 terms, and their graphical visualization are presented in Fig. 3a and Fig. 3b for the first and the second topic, respectively. The topics are circles in the two-dimensional plane whose centers are determined by computing the distance between topics [11]. The overall topic prevalence is represented by the areas of the circles, where the topics are sorted in decreasing order of prevalence. In the right part of the figure, a pair of overlaid bars represents both the corpus-wide frequency of a given term and the topic-specific frequency of the term, as in [12]. The top 30 most salient terms from the entire dataset (when no topic is selected) are: pattern, design, software, language, code, ontology, object, class, aspect, method, application, implementation, service, programming, architecture, framework, process, type, system, development, program, problem, orient, performance, web, quality, mechanism, cloud, structure, component.
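The pre-processing steps described earlier in this section (stop-word removal followed by bi-gram creation) can be sketched in plain Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, a bi-gram is kept when the pair occurs at least `min_count` times, and lemmatization is left out because it normally relies on an external NLP library. None of the names below come from the source.

```python
from collections import Counter

# Tiny illustrative stop-word list (assumption); real pipelines use a full list.
STOP_WORDS = {"the", "a", "an", "i", "him", "of", "and", "is", "in", "to", "for"}


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def find_bigrams(docs, min_count=2):
    """Return adjacent word pairs seen at least min_count times in the corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams):
    """Replace each frequent adjacent pair with a single 'w1_w2' token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical toy abstracts (not data from the study).
raw = [
    "the design pattern is a solution",
    "a design pattern for the observer",
]
docs = [remove_stop_words(text.lower().split()) for text in raw]
bigrams = find_bigrams(docs, min_count=2)
processed = [merge_bigrams(doc, bigrams) for doc in docs]
```

Running `find_bigrams` and `merge_bigrams` a second time on `processed` would yield tri-grams in the same way, since a merged bi-gram can then pair with a neighbouring word.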
We optimized the model by identifying the number of topics that provides the best coherence. The optimum number of topics for the design patterns subject area appears to be 8, with a coherence measure of 0.7612. The resulting topics for the design patterns subject of research, together with their words and the probability of the words appearing in relation with each other, are presented in Table 4. We extracted the papers belonging to design patterns and to software complexity for each topic. The results are presented in Table 5.
The analysis revealed that a considerable number of articles written on software complexity benefit from the well-grounded subject of design patterns, whereas articles on design patterns that benefit from software complexity research are not as well represented. We found a direct and strong relationship between the design patterns subject and the software complexity subject (Pearson correlation = 0.72), so we reject the null hypothesis for Q1. The t-test value was 0.005 for a p-value threshold of 0.01, so we were able to reject the null hypothesis for Q2. Therefore, there is a significant difference in the way the two subjects are treated by the authors of scientific articles in mainstream research.
We continued our analysis by identifying the specific topics on design patterns within the software complexity subject. We eliminated the common keywords: design, system, complexity, software, application, model, program. We were therefore able to observe the topics that relate the two subjects. We computed the most representative article for each topic, both for the design patterns subject of research and for the software complexity subject of research. Table 6 presents the titles of the papers for each topic. In general, the articles written on software complexity address the main topic of "metric": LDA returned the word metric in each identified topic. Usually, these metrics are well established in theory and in practice.
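Returning to the model selection step at the start of this part of the results, the idea of choosing the number of topics by coherence can be sketched as follows. The source does not state which coherence measure produced the value 0.7612, so this sketch uses the UMass measure purely as a stand-in, and its scores are not comparable to the reported one. The toy documents and the candidate topic word lists are hypothetical.

```python
import math


def umass_coherence(top_words, tokenized_docs):
    """UMass coherence of one topic.

    top_words must be ordered from most to least probable, and every word
    is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score


def mean_coherence(topics, tokenized_docs):
    """Average coherence over all topics of one candidate model."""
    return sum(umass_coherence(t, tokenized_docs) for t in topics) / len(topics)


# Hypothetical toy corpus and candidate models (not data from the study).
docs = [
    ["pattern", "design", "class"],
    ["pattern", "design", "object"],
    ["metric", "complexity", "code"],
    ["metric", "complexity", "class"],
]
candidates = {
    2: [["pattern", "design"], ["metric", "complexity"]],
    3: [["pattern", "metric"], ["design", "code"], ["class", "object"]],
}
scores = {k: mean_coherence(topics, docs) for k, topics in candidates.items()}
best_k = max(scores, key=scores.get)
```

In practice the candidate topics would come from LDA models fitted with different numbers of topics; here they are written by hand so that the selection step can be read in isolation.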
We noticed one single paper, entitled "How changes affect software entropy: an empirical study" [13], in which the authors analyzed how changes affect software entropy through: the presence of refactoring activities, the number of developers working on a source code file, the participation of classes in design patterns, and the different kinds of changes occurring in the system, classified in terms of their topics extracted from commit notes. The research subject of design patterns is represented by topics like pattern development and pattern usage. The distribution of papers for each subject, namely design patterns and software complexity, for each topic from topic 1 (T1) to topic 8 (T8) is presented in Figure 5. In the Discussion section we chose to analyze the first three topics belonging to software complexity that have the biggest proportion over the design patterns subject: analyzing software complexity, source code change, and system software complexity.
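The correlation between the per-topic shares of papers for the two subjects can be illustrated with a short sketch. The shares below are hypothetical placeholders, not the values behind Figure 5 or Table 5. The t statistic shown is the standard significance statistic for a Pearson coefficient; the source does not specify which t-test was used for the second hypothesis, so that test is not reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_t_statistic(r, n):
    """t statistic for H0 'no linear relationship', with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical shares of papers per topic T1..T8 (NOT the values from the study).
dp_share = [0.22, 0.18, 0.15, 0.12, 0.10, 0.09, 0.08, 0.06]
sc_share = [0.25, 0.14, 0.17, 0.10, 0.12, 0.08, 0.09, 0.05]

r = pearson_r(dp_share, sc_share)
t = correlation_t_statistic(r, len(dp_share))
```

The resulting `t` would then be compared with the critical value of the t distribution for 6 degrees of freedom at the chosen significance level.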

Discussions
Complexity is an important property to analyze when developing software. Software measures are the way to quantify the structural complexity of software. The topic of analyzing software complexity contains papers that discuss developing tools and papers dedicated to developing metrics. The topic of code changes is analyzed intensively in the scientific literature. The interest is in sustainability [14], reusability, maintenance, and decreasing complexity. It is interesting to observe that a significant number of papers approach the topic by using machine learning techniques. The main subject is version-to-version code change. The topic of source code change is approached empirically, through case studies or analyses of open source projects. As software changes require some form of change management, some authors have proposed the use of repositories. They consider that a company must have a central knowledge repository with software specifications [15][16][17], designs, and code from previous system developments. The central knowledge base can be used through Case-Based Reasoning.
A considerable number of authors address the problem of maintenance. They hypothesize that source code complexity exerts a causal influence on the maintenance difficulty experienced during the system test phase of the product [18]. At least three components contribute to the complexity of the software maintenance effort: (1) the code and documentation being produced, (2) the process used to manage the maintenance, and (3) the maintenance and target computer system environments [19].
The authors who study the topic of system software complexity approach subjects like software metrics or refactoring, but in the context of software engineering or re-engineering and systems development. Also, the subject of the software architecture design process is approached in [20], where a supportable meta-architecture (SMA) and roundtrip engineering are proposed for large software projects.
There are authors who study requirements engineering (RE) in relation to software complexity [21]. They developed a quality-driven RE framework and tool that applies knowledge management techniques and quality ontologies to support RE activities. Software refactoring is an important subject to study when designing systems. Alkhalid et al. [22]
According to the Forrester Research report on AI's impact on software development [24], the bulk of the interest in applying AI to software development lies in automated testing and bug detection tools. The article "6 ways AI transforms how we develop software" [25] discusses rapid prototyping, intelligent programming assistants, automatic analytics and error handling, automatic code refactoring, precise estimates, and strategic decision-making. Our results confirm this idea.

Conclusions
In reviewing the literature, no data was found on the association between design patterns and software complexity, or on identifying software complexity topics. The current study found that although software complexity and design patterns belong to the same subject area, namely software engineering, the topics vary. The most interesting finding was that it is possible to identify topics on software complexity by identifying topics on design patterns. Another important finding was that measuring software complexity and evaluating its effects on the developed systems is very often approached with artificial intelligence techniques. This combination of findings provides some support for the conceptual premise that design patterns might be studied from the software complexity point of view.
In conclusion, this study investigated the relations between software complexity metrics and design patterns. It also emphasizes the importance of design patterns, the lack of standard metrics for design patterns, and the lack of standard ways of studying design patterns in relation to software complexity. The LDA technique proved reliable for studying topics in the field of software complexity.

Fig. 3a. Intertopic distance map via multidimensional scaling considering the marginal topic distribution, employing the first and second principal components (PC1 and PC2), Topic 1 (LDAvis); detailed further in Table 4

Fig. 3b. Intertopic distance map via multidimensional scaling considering the marginal topic distribution, employing the first and second principal components (PC1 and PC2), Topic 2 (LDAvis); detailed further in Table 4

Fig. 5. The number of papers as a proportion of the total number for Design patterns (DP) and Software Complexity (SC), respectively

Table 2. The number of articles from the design patterns subject across science categories (Top Ten)

Table 3. The software complexity across science categories (Top Ten)

Table 4. Topics on design patterns (computer science), LDA

Table 5. The statistical results

Table 6. The articles that contributed most to topic identification