Integration of design smells and role-stereotypes classification dataset

Design smells are recurring patterns of poorly designed (fragments of) software systems that may hinder maintainability. Role-stereotypes indicate the generic responsibilities that classes play in a system design. Although the concepts of role-stereotypes and design smells are widely divergent, both contribute significantly to the design and maintenance of software systems. To improve software design and maintainability, there is a need to understand the relationship between design smells and role-stereotypes. This paper presents a fine-grained dataset that systematically integrates design smell detection and role-stereotype classification data. The dataset was created from a collection of twelve (12) real-life open-source Java projects mined from GitHub. It consists of 18 design smell columns and 2,513 Java classes (rows), each classified into one of the six (6) role-stereotypes of the taxonomy. We also clustered the dataset into ten (10) clusters using an unsupervised learning algorithm. These clusters are useful for understanding which groups of design smells often co-occur in a particular role-stereotype category. The dataset is significant for understanding the non-innate relationship between design smells and role-stereotypes.


Specifications Table

Subject: Software Engineering
Specific subject area: Analysis of the association between design smells and role-stereotypes in source code.
Type of data: Raw and analysed
How data were acquired: Software projects were downloaded from GitHub using the git clone command-line tool.
Data format: CSV
Parameters for data collection: The data were collected from GitHub public code repositories. All the selected projects are written in Java and licensed for redistribution within the terms of their licenses.

Description of data collection
Our dataset is based on five (5) desktop and seven (7)

Value of the Data
• We provide a fine-grained dataset derived through a systematic combination of design smell detection and role-stereotype classification data. This data is essential for researchers who are interested in studying design smells through the "lens" of role-stereotypes in software system design.
• The dataset is important for software engineers because it enables them to identify classes that are vulnerable to certain types of design smells based on their role-stereotypes (class responsibilities). Identifying design smells at an early stage of software design could improve software maintenance and reliability.
• The dataset is useful for software analysts, who can use it to review existing quality assurance guidelines and add new ones that consider design smells and role-stereotypes.
• The dataset can give software tool builders insights for optimizing design smell detection tools by tailoring design smell metrics to a specific project and/or role-stereotype.
• To the best of our knowledge, this is the first publicly available dataset that combines design smell detection and role-stereotype classification data.

Data Description
We present a fine-grained dataset of systematically integrated design smell detection and role-stereotype classification data. The dataset was derived from twelve (12) real-life open-source software projects selected from GitHub public repositories. The raw dataset is publicly available as a Mendeley repository [4]. The data is described through the following tables and figures. Table 1 presents the projects used to build the dataset. Table 2 shows the fine-grained dataset, which integrates design smells and role-stereotypes. Fig. 1 shows sample content of the ".ini" files in raw format. Table 3 outlines the regular expressions used for extracting class names from the design smell files. Fig. 2 presents sample output of the role-stereotype preprocessing tasks generated using the srcML tool. Fig. 3 shows the relationship between role-stereotypes based on commonly co-occurring design smells. Table 4 presents the number of occurrences of each design smell in each role-stereotype category. Table 5 is derived from the clustering task and presents groups of design smells that occur in a given role-stereotype. Finally, Table 6 shows the association of design smells with role-stereotypes extracted using the association rule discovery technique.
Table 1 describes the selected projects, including their release version, their total Lines of Code (LoC), and the domain to which each project belongs (desktop or mobile). The total LoC was computed using a freely available, lightweight tool.

Table excerpt (design smell name followed by six count columns):
SpeculativeGenerality         12    4    0    1    0    0
BaseClassKnowsDerivedClass     -    -    -    -    -    -
MessageChains                  1    0    2    0    0    1
LongParameterList            402  154  149    7   29   12
SpaghettiCode                  2    0    2    0    1    0
BaseClassShouldBeAbstract     24   10    5    0    1    2
LongMethod                   830   92  412    0   15   20
ClassDataShouldBePrivate     103   81   57    0   13    3
TraditionBreaker               -    -    -

We present a sample output of the fine-grained dataset in Table 2.
The dataset consists of 23 columns (including the index column) and 2,513 rows, which represent the total number of Java classes obtained from the selected projects. The "FullClassPath" column was extracted from the raw design smell detection files and the role-stereotype classification data, respectively. The "SubClassPath" column was derived from the "FullClassPath" column and used to "inner join" the preprocessed design smell and role-stereotype data. The dataset contains 18 design smells detected using the Pattern Trace Identification, Detection, and Enhancement in Java (Ptidej) tool [2]. These design smells include LongMethod, ComplexClass, LongParameterList, BaseClassShouldBeAbstract, SpeculativeGenerality, ClassDataShouldBePrivate, ManyFieldAttributesButNotComplex, MessageChain, SpaghettiCode, RefusedParentBequest, SwissArmyKnife, Blob, AntiSingleton, LargeClass, and LazyClass. Each design smell column contains a value of 1 or 0, indicating the presence or absence, respectively, of that design smell in a given Java class. Each class is also classified into one of six role-stereotypes, i.e., Service Provider, Controller, Structurer, Interfacer, Coordinator, and Information Holder, as shown in the "label" column. The last column of the dataset gives the cluster to which each class belongs. The clusters were constructed using the Powered Outer Probabilistic Clustering (POPC) algorithm [1]. The clustering information is useful for determining the groups of design smells that often co-occur in a given role-stereotype.
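To make the column layout concrete, the following minimal sketch counts, for each role-stereotype label, how many classes carry each design smell. It assumes a CSV with binary smell columns plus "SubClassPath", "label" and "cluster" columns; the three sample rows and the exact header spellings are hypothetical and are not taken from the published file.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical three-row excerpt; the real file has 18 smell columns and 2,513 rows.
SAMPLE_CSV = """\
SubClassPath,LongMethod,LazyClass,Blob,label,cluster
a.Foo,1,0,0,Service Provider,3
a.Bar,1,1,0,Service Provider,3
b.Baz,0,1,0,Information Holder,7
"""

# Columns that are not design smells (assumed names).
NON_SMELL_COLUMNS = {"SubClassPath", "label", "cluster"}


def smell_counts_by_role(csv_text):
    """Count, per role-stereotype label, how many classes carry each smell."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column not in NON_SMELL_COLUMNS and value == "1":
                counts[row["label"]][column] += 1
    return counts


counts = smell_counts_by_role(SAMPLE_CSV)
print(dict(counts["Service Provider"]))
```

A per-role count of this kind is the sort of summary that a table of design smell occurrences per role-stereotype would contain.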

Experimental Design, Materials and Methods
The process of constructing the dataset was conducted as follows.

Preprocessing design smells data
For the preprocessing task, we passed the project class files as input to the Ptidej tool [2] for design smell detection. Ptidej is an open-source, Java-based reverse engineering tool suite that includes several identification algorithms for idioms, micro-patterns, design patterns, and design defects [2]. Using this tool, we detected eighteen (18) design smells across the selected projects. The detected design smells were stored in ".ini" files whose names are tagged with the specific design smell type. For example, in the K-9 Mail project, the "AntiSingleton" design smell is stored as "DetectionResults in K9 for AntiSingleton.ini". Fig. 1 shows sample content of the ".ini" files in raw format. Our goal was to extract class names and the corresponding design smell detected in each class.
We applied heuristics to determine the structure and pattern of class names in the design smell detection files. Regular expressions were then used to extract class names and associate them with their respective design smell types. The regular expressions applied to each project are listed in Table 3. A replication package for the design smell preprocessing task is available on Zenodo as a citable GitHub repository [6].
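The extraction step can be illustrated with a minimal sketch. The line layout of the sample below, the class names, and the regular expression are all assumptions made for illustration; they do not reproduce the expressions in Table 3, and real Ptidej output may be structured differently.

```python
import re

# Hypothetical excerpt; the layout of real Ptidej ".ini" output may differ.
SAMPLE_INI = """\
1.100.Name = AntiSingleton
1.100.AntiSingleton-0 = com.example.mail.Settings
2.100.Name = AntiSingleton
2.100.AntiSingleton-0 = com.example.mail.Account
2.100.AntiSingleton-0.FieldCount = 4
"""

# Assumed pattern: "<id>.<id>.<Smell>-<n> = <fully.qualified.ClassName>".
# Lines carrying other properties (Name, FieldCount) do not match.
CLASS_LINE = re.compile(
    r"^\d+\.\d+\.(?P<smell>\w+)-\d+[ \t]*=[ \t]*(?P<cls>[\w.$]+)[ \t]*$",
    re.MULTILINE,
)


def extract_smelly_classes(ini_text):
    """Return sorted, de-duplicated (class name, design smell) pairs."""
    return sorted(
        {(m.group("cls"), m.group("smell")) for m in CLASS_LINE.finditer(ini_text)}
    )


pairs = extract_smelly_classes(SAMPLE_INI)
for cls, smell in pairs:
    print(cls, smell)
```

Because each project may format class names differently, a pattern like this would likely need to be adjusted per project.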

Preprocessing role-stereotypes data
The processing of role-stereotype data was based on the replication package offered by Nurwidyantoro et al. [3]. First, the selected project source code was passed to srcML, a lightweight, highly scalable, robust, multi-language parsing tool that converts source code into an XML format. Fig. 2 shows sample output of the srcML tool. Next, we built unlabeled data consisting of 21 features for each project using the code provided in the replication package. In this study, the feature extraction task was carried out as follows:
1. Create a srcML representation of the source code. The output of the srcML tool is a list of source code classes in a standardized XML format.
2. Use multiple XPath queries to obtain the features of interest.
The detailed steps of the feature extraction are elaborated in the work of Nurwidyantoro et al. [3]. Finally, each class in the unlabeled data was classified into one of the role-stereotype categories, i.e., Service Provider, Information Holder, Interfacer, Controller, Coordinator, and Structurer. The classification was performed using the Random Forest classifier, which obtained the best classification result as described by Nurwidyantoro et al. [3]. A separate repository containing the source code and a step-by-step guide for feature extraction and classification is available on GitHub [6].
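The two feature extraction steps can be sketched as follows. The XML sample is a hand-written, simplified imitation of srcML output, and the two features (method count and field count) are hypothetical stand-ins; the 21 features actually used are defined in the replication package of Nurwidyantoro et al. [3]. The sketch uses the limited XPath support of Python's standard library rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET

# srcML elements live in this XML namespace.
NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written, simplified imitation of srcML output for one Java class.
SAMPLE_SRCML = """\
<unit xmlns="http://www.srcML.org/srcML/src" language="Java" filename="Foo.java">
<class><specifier>public</specifier> class <name>Foo</name> <block>{
  <decl_stmt><decl><type><name>int</name></type> <name>count</name></decl>;</decl_stmt>
  <function><type><name>int</name></type> <name>getCount</name><parameter_list>()</parameter_list> <block>{ }</block></function>
  <function><type><name>void</name></type> <name>reset</name><parameter_list>()</parameter_list> <block>{ }</block></function>
}</block></class>
</unit>
"""


def class_features(srcml_text):
    """Map each class name to two illustrative (hypothetical) features."""
    root = ET.fromstring(srcml_text)
    features = {}
    for cls in root.iterfind(".//src:class", NS):
        name = cls.findtext("src:name", default="", namespaces=NS)
        features[name] = {
            # Only direct members of the class body are counted.
            "num_methods": len(cls.findall("src:block/src:function", NS)),
            "num_fields": len(cls.findall("src:block/src:decl_stmt", NS)),
        }
    return features


features = class_features(SAMPLE_SRCML)
print(features)
```

Feature vectors of this shape, one per class, are what a classifier such as Random Forest would take as input.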

Integrating design smells and role-stereotypes data
The fine-grained dataset was obtained through systematic integration of the preprocessed design smell and role-stereotype data. We created a column called "subclasspath" in both the design smell data and the role-stereotype data. This column is derived from the full classpath, and every entry in it is unique. The new column was then used to "inner join" the design smell data with the role-stereotype data. At this point, all the role-stereotype classification features were removed so that only the classification labels remain. The resulting dataset, shown in Table 2, therefore contains the design smell columns together with the role-stereotype label of each class.
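The integration step can be sketched as a plain inner join on the derived key. The rule used here to derive "subclasspath" (keeping the last two segments of the path) is a hypothetical choice made for illustration, as are the sample rows and the "numMethods" feature name; the actual derivation is part of the replication package.

```python
def sub_class_path(full_class_path, keep=2):
    """Hypothetical key: the last `keep` dot-separated segments of the path."""
    dotted = full_class_path.replace("/", ".")
    if dotted.endswith(".java"):
        dotted = dotted[: -len(".java")]
    return ".".join(dotted.split(".")[-keep:])


def inner_join(smell_rows, role_rows):
    """Keep only classes present in both inputs; carry over the label only."""
    label_by_key = {sub_class_path(r["FullClassPath"]): r["label"] for r in role_rows}
    joined = []
    for row in smell_rows:
        key = sub_class_path(row["FullClassPath"])
        if key in label_by_key:
            merged = {"subclasspath": key}
            merged.update({k: v for k, v in row.items() if k != "FullClassPath"})
            merged["label"] = label_by_key[key]  # classification features are dropped
            joined.append(merged)
    return joined


# Hypothetical sample rows.
smells = [
    {"FullClassPath": "com.example.mail.Account", "LongMethod": 1, "LazyClass": 0},
    {"FullClassPath": "com.example.mail.Unused", "LongMethod": 0, "LazyClass": 1},
]
roles = [
    {"FullClassPath": "src/com/example/mail/Account.java", "label": "Information Holder", "numMethods": 12},
    {"FullClassPath": "src/com/example/ui/Main.java", "label": "Controller", "numMethods": 3},
]

joined = inner_join(smells, roles)
print(joined)
```

A class that appears in only one of the two inputs is dropped, which is the defining behaviour of an inner join.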

Integration using clustering
The fine-grained dataset was clustered using Powered Outer Probabilistic Clustering (POPC) [1] to analyze the relationship between design smells and role-stereotypes. POPC offers flexibility in cluster construction because the number of clusters does not have to be specified upfront. This is not possible with other clustering approaches such as the k-means algorithm, which require the number of clusters as input. POPC tries to mitigate this drawback using back-propagation techniques: it starts by building many clusters and ends with an optimal number of clusters. The algorithm has been observed to work quite well on binary datasets and converges to the expected (optimal) number of clusters on theoretical examples, as elaborated by Taraba [1].
In this study, 10 clusters were created, as shown in the "cluster" column of Table 2. To give more insight into the significance of the clustering task, the output is presented in the form of a dendrogram in Fig. 3.
The dendrogram in Fig. 3 represents the relationship between role-stereotypes based on the design smells that commonly occur in them. The following pairs of role-stereotypes are observed to be associated: (Coordinator, Structurer), (Service Provider, Information Holder), and (Controller, Interfacer). These associations indicate that the role-stereotypes in each pair are often affected by similar types of design smells. In Table 4, the study presents the groups of design smells identified in each role-stereotype.
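The idea that two role-stereotypes are related when they are affected by similar design smells can be illustrated with a much simpler stand-in than POPC: cosine similarity between per-role smell count profiles. This sketch is not the method used to produce Fig. 3, and the counts below are hypothetical values chosen for illustration, not figures from the dataset.

```python
import math
from itertools import combinations

# Hypothetical smell counts per role-stereotype (not taken from the dataset).
PROFILES = {
    "Coordinator": {"LongMethod": 4, "LongParameterList": 6, "LazyClass": 0},
    "Structurer": {"LongMethod": 5, "LongParameterList": 7, "LazyClass": 1},
    "Interfacer": {"LongMethod": 1, "LongParameterList": 0, "LazyClass": 9},
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar_pair(profiles):
    """Return the pair of roles whose smell profiles are most alike."""
    return max(
        combinations(sorted(profiles), 2),
        key=lambda pair: cosine(profiles[pair[0]], profiles[pair[1]]),
    )


best = most_similar_pair(PROFILES)
print(best)
```

Repeatedly merging the most similar pair is the basic step behind agglomerative clustering, which is one common way to obtain a dendrogram.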

Integration using association rule mining
To better understand the association of design smells with role-stereotypes, the study explored an alternative to the clustering approach. The study applied the well-known Apriori algorithm [5] to construct association rules. Association rule discovery is an unsupervised learning technique used to detect local patterns, which indicate attribute value conditions that occur together in a given dataset [5]. A replication package for the association rule mining task is also provided as a citable GitHub repository [6]. Table 6 shows the result of the association rule mining. The results show, with their respective degrees of confidence, the association of various design smells with role-stereotypes. For example, LazyClass, LongMethod, and LongParameterList have a strong association with the Service Provider role-stereotype.
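The support and confidence measures behind such rules can be sketched as follows. This is a simplified illustration restricted to rules with a single design smell as antecedent and a role-stereotype as consequent; it is not a full Apriori implementation and not the code from the replication package. The transactions, the "role=" prefix, and both thresholds are hypothetical.

```python
# Hypothetical transactions: each class is the set of its smells plus its role.
TRANSACTIONS = [
    {"LongMethod", "LazyClass", "role=Service Provider"},
    {"LongMethod", "role=Service Provider"},
    {"LongMethod", "role=Controller"},
    {"LazyClass", "role=Information Holder"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) divided by support(antecedent)."""
    base = support(antecedent, transactions)
    return support(antecedent | consequent, transactions) / base if base else 0.0


def rules_to_role(transactions, min_support=0.25, min_confidence=0.6):
    """List (smell, role, support, confidence) for rules passing both thresholds."""
    items = sorted({item for t in transactions for item in t})
    smells = [i for i in items if not i.startswith("role=")]
    roles = [i for i in items if i.startswith("role=")]
    rules = []
    for smell in smells:
        for role in roles:
            sup = support({smell, role}, transactions)
            conf = confidence({smell}, {role}, transactions)
            if sup >= min_support and conf >= min_confidence:
                rules.append((smell, role, sup, conf))
    return rules


rules = rules_to_role(TRANSACTIONS)
for smell, role, sup, conf in rules:
    print(f"{smell} -> {role}: support={sup:.2f}, confidence={conf:.2f}")
```

Apriori adds to this an efficient search over larger itemsets by pruning any candidate that has an infrequent subset.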

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.