Aspect Based Construction of Software-Specific Words Similarity Database

Different words are often used to express the same meaning, which makes it hard to quantify how closely two words match. To address this, earlier studies have tried to measure the similarity between pairs of words. Conventional approaches to computing word similarity rely on resources such as WordNet, a manually built lexical database that encodes semantic relations between words. However, WordNet is a general-purpose resource: many words are missing from it, and the sense it assigns to a word often differs from the sense the word carries in a specific textual context. A more refined approach is therefore needed, one that measures word similarity on the basis of co-occurrence. In this study, we propose an approach that computes the similarity of software-specific words using the textual content of posts on StackOverflow. The proposed method computes word similarity from weighted co-occurrence, based on Computing Term Co-occurrence (CTC) and SentiWordNet. The experimental results show that the set of related words produced by our system is of high quality. Moreover, the method outperforms a WordNet-based method, WordNetres.


Introduction
With the rapid spread of computing into every field, the volume of information and data keeps growing as information technologies advance. Microblogging sites, blogs, e-commerce sites and similar platforms add to this volume. It is estimated that 2.5 trillion bytes of data are produced continuously and that 90% of the world's data was created in the last two years. This rapid growth in data volume, known as 'Big Data', raises a major problem: how to locate the required data among trillions of records. The term 'Big Data Retrieval' was coined in response to this question. Research by Howard et al. (2013) extracts related verb pairs from comments and method signatures. However, many software-related words do not appear in code at all; they appear instead in associated textual artifacts such as forum posts, bug reports and commit logs. Moreover, only a few such words appear in code, mainly in comments that describe something or in methods tied to a few actions. Another study, by Wang et al. (2012), groups semantically similar tags in FreeCode. However, it can only compute the similarity between tags, not between the many other words that appear in FreeCode. In our work, we build a more refined word similarity database that can be used for a variety of software engineering tasks across many kinds of projects.
Two words may be considered similar if they appear in the same contexts. For instance, "tcp" and "client" often appear together in paragraphs, sentences or blogs that describe networking. To capture such relatedness, a new approach based on word co-occurrence is needed to compute the similarity of two different words. We represent each word by a co-occurrence vector built over three kinds of anchors (popular software tags, other software tags and other words) and relate each pair of words on the basis of their co-occurrence.
With our new similarity metric, WordSimSE, we aim to build a semantic database for a domain-specific text corpus that outperforms WordNet. We use data from StackOverflow, a popular question-answering site, and take its posts as input, since they contain a large number of software-related words. We also exploit tagging, because it is a common practice supported by many software information sites, including SourceForge, FreeCode and StackOverflow. Tags mark the key features of user-generated content and are frequent software-specific terms. We use posts from StackOverflow as the semantic context for computing the similarity of words.
We compare our method, which is based on WordSimSE, with an existing word similarity database computed by a WordNet-based method (Pedersen et al., 2004), referred to as WordNetres. We use SentiWordNet together with WordSimSE to obtain the top 10 related words. Ten people judged the effectiveness of each method by assigning scores to the output words. We collected many software-related words that are not available in WordNet. For the words that are available in both WordSimSE DB and SentiWordNet DB, the average score is 51% higher than the average score of the WordNet resource. Our main contributions in this research are:
1. We build a software-specific word similarity database using 10,000 posts from StackOverflow.
2. We propose a new similarity method that relies on tagging and on grouping words by co-occurrence. The similarity of two words is computed from the similarity of their consistent characteristics and contexts.
3. We evaluate the proposed method on software-specific words with the help of ten human judges. The evaluation shows that the method gives better results than WordNet: 55% of the words do not appear in WordNet, and for the remaining 45% that do, the WordNet senses do not match the meanings the words carry in a software-specific context.

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 discusses the preliminaries. Section 4 outlines the proposed framework, which comprises four major steps, including the dataset and pre-processing. Section 5 presents the experiments and results. Section 6 concludes the paper and is followed by the references.

Related work
Measuring the similarity between two words is one of the basic NLP tasks. Many papers present techniques for measuring this similarity. Most of the popular existing methods rely on a lexical database to compute word similarity. Pedersen et al. (2004) built a user interface that lets users compute the semantic distance between words. They compute the similarity of all pairs of words in WordNet and make the results freely available (Porter, 1980).
Like these studies, we also aim to compute the similarity between words. However, whereas WordNet is a general-purpose resource, we use the Normalised Google Distance (NGD), applied specifically to tasks in a software engineering context. Various methods have also been proposed to construct a dictionary automatically (Chen et al., 2005; Falleri et al., 2010). They build on the distributional hypothesis, which holds that words occurring in similar contexts tend to have similar meanings. The novelty of our approach is its focus on the software engineering community: rather than a general dataset, we use a dataset that is specific to software engineering.
Yang and Tan (2013) presented a method for computing semantic relatedness in software source code. They introduced a system that takes code as input together with a list of stopwords and produces the set of related words. Experimental results show that the technique, applied to C and Java code, identifies semantically related words with higher accuracy. Later, a similar system was presented by Howard et al. (2013), which computes semantic scores from user comments. They extracted 97 similar verb pairs from 150 methods sampled randomly from 36 Java projects across several domains. In this study, we also generate semantically related words. However, we analyse software-specific textual content rather than code. Traditional approaches rely mainly on machine learning and lexicon-based methods. Matveeva (2006) used the Vector Space Model (VSM) to compute the similarity between two vectors using cosine similarity.
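As an illustration of the cosine similarity used with the Vector Space Model, the following minimal sketch compares two bag-of-words vectors. The function name `cosine_similarity` and the two example sentences are illustrative assumptions and are not taken from the cited work.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)

# Hypothetical example documents, represented as term-count vectors.
doc1 = Counter("open the tcp socket on the client".split())
doc2 = Counter("the client closes the tcp socket".split())
print(round(cosine_similarity(doc1, doc2), 3))
```

A value close to 1 indicates that the two vectors point in nearly the same direction, that is, the documents share most of their terms in similar proportions.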

Research Questions
This study addresses the following research questions:
1. How accurate is the proposed method compared with the baseline method?
2. How general is the proposed method for computing word similarity?
3. How scalable is the proposed method?

From an information retrieval perspective, the first question concerns precision and the second concerns recall. Precision is a measure of soundness, while recall is a measure of completeness. The last question examines whether WordSimSE DB can be built from a set of software-related words and whether it can be extended by processing more reviews.
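The distinction between precision (soundness) and recall (completeness) can be sketched as follows. The helper `precision_recall` and the word lists are hypothetical and serve only to illustrate the two measures.

```python
def precision_recall(returned, relevant):
    """Precision and recall of a returned word list against a relevant word list."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Precision: share of returned words that are relevant (soundness).
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall: share of relevant words that were returned (completeness).
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical output of a similarity method for the query word "tcp".
returned = ["socket", "port", "banana", "client"]
relevant = ["socket", "port", "client", "packet", "server"]
print(precision_recall(returned, relevant))
```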

Preliminaries
We first describe StackOverflow, a very popular question-and-answer site. We then discuss common text pre-processing techniques, such as stop-word removal, tokenization and stemming.
1. StackOverflow: It is a popular site on which developers ask questions about their problems. It offers a platform for developers to support one another by asking and answering questions. StackOverflow has more than 1.8 million users and over 5,000,000 questions. Most topics on StackOverflow relate to software tasks. In our research, we take our dataset from StackOverflow to build a database of similarities between words.
2. Word Co-occurrence: Co-occurrence is defined in terms of "context", that is, the words near a given word (Höst et al., 2000). To delimit the context, we use a sliding window. The target word sits at the centre of the window. For example, a window of size 7 contains the target word itself, the three words to its left and the three words to its right. If the target word is the first word of the text, a window of size 5 contains only the target word and the 2 words to its right.
3. Normalised Google Distance: Words and phrases acquire meaning from the way they are used in everyday life and from their semantics relative to other words and phrases. For a computer, the counterpart of "society" is a "database", and the counterpart of "use" is a way of querying that database. For queries we use the Google search engine. This idea is then applied to build a technique that automatically retrieves the number of pages containing a particular word association, using Google page counts. The technique is applicable to clustering, classification and language translation.
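The sliding-window notion of context described in item 2 can be sketched as follows. This is a minimal illustration under the assumption that the window is simply truncated at the boundaries of the text; the function names and the example sentence are hypothetical.

```python
from collections import Counter

def context_window(tokens, index, size):
    """Return the words inside a centred window of odd `size` around tokens[index]."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so the target sits in the centre")
    half = size // 2
    start = max(0, index - half)                # truncate at the start of the text
    end = min(len(tokens), index + half + 1)    # truncate at the end of the text
    return tokens[start:end]

def cooccurrence_counts(tokens, size):
    """Count how often each (target, context word) pair shares a window."""
    counts = Counter()
    half = size // 2
    for i, target in enumerate(tokens):
        for j in range(max(0, i - half), min(len(tokens), i + half + 1)):
            if j != i:
                counts[(target, tokens[j])] += 1
    return counts

tokens = "the client opens a tcp socket to the remote server".split()
print(context_window(tokens, 4, 7))  # window of size 7 centred on "tcp"
print(context_window(tokens, 0, 5))  # window of size 5 at the start of the text
```

With a window of size 7 around "tcp", the three words on each side are included; at the first word, a window of size 5 holds only the target and the two words to its right, as in the example above.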

Proposed Model for Word Similarity Database
For the results to be meaningful, the proposed approach should perform at least as well as accepted solutions for constructing a software-specific word similarity database. Within software engineering, and word similarity in particular, the proposed approach is compared with existing state-of-the-art approaches that are widely accepted and credible.

Fig. 1. Proposed model for Word Similarity Database
The proposed approach for WordSimSE DB attempts to identify the most appropriate words through a computational process. It extracts similar words from reviews, blogs and user question/answer repositories and then adds them to the word similarity database, as shown in Fig. 1. The steps of the proposed model are discussed in the following sections.

Pre-processing
Pre-processing is a key step in dataset pruning. In this module, documents from the social web, such as StackOverflow and Facebook forums, are taken as input. Such raw documents may contain text, code and tags, as well as redundant or irrelevant data. Redundant or irrelevant text snippets are discarded according to the following rules:
1. Uniform Resource Locators (URLs) are removed, because URLs are not considered part of the task of obtaining viewpoints.
2. Every word that does not start with an English letter or a digit is removed.
3. Common tokens such as full stops, commas and other punctuation marks are eliminated, and the remaining words are stemmed with the standard Porter stemmer algorithm.
4. Words that start with the symbol "@" are removed, because this symbol marks usernames and we do not consider users and their relationships. Words that start with the symbol "#" are also removed.
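The filtering rules above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the URL pattern, the set of stripped punctuation characters and the function name `preprocess` are assumptions, and Porter stemming is left out because it requires an external library.

```python
import re

# Assumed URL pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def preprocess(text):
    """Apply the four filtering rules to raw text and return the kept tokens."""
    text = URL_PATTERN.sub(" ", text)            # rule 1: drop URLs
    kept = []
    for token in text.split():
        if token[0] in "@#":                     # rule 4: usernames and '#' words
            continue
        token = token.strip(".,;:!?\"'()[]{}")   # rule 3: surrounding punctuation
        if not token:
            continue
        if not re.match(r"[A-Za-z0-9]", token):  # rule 2: must start with a letter or digit
            continue
        kept.append(token)
    return kept

sample = "Thanks @bob, see https://example.com/x for #java tips: use TCP sockets!"
print(preprocess(sample))
```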

Computing Term Co-occurrence (CTC)
In this phase, each pre-processed document is treated as a bag of words. Computing Term Co-occurrence (CTC) is used to compute the semantic similarity between the target words and the words that occur with them in a document. CTC is defined in equation (1).
where $f(w_1)$ is the number of pages containing term $w_1$ and $f(w_1, w_2)$ is the number of pages containing both $w_1$ and $w_2$, as reported by Google. $N$ stands for the number of pages returned by Google and has to be chosen; reducing $N$ increases CTC. The main properties of CTC used in this experiment are as follows:
1. The value of CTC lies between 0 and ∞. It may occasionally be slightly negative when the Google counts are unreliable or contain too much noise, and in some cases the expression takes the indeterminate form $\mathrm{CTC}(w_1, w_2) = \infty/\infty$.
2. CTC is almost always non-negative, and $\mathrm{CTC}(w_1, w_1) = 0$ for every $w_1$. For every pair $(w_1, w_2)$ we have $\mathrm{CTC}(w_1, w_2) = \mathrm{CTC}(w_2, w_1)$, that is, the measure is symmetric. Let $x$ denote the set of web pages containing one or more occurrences of $w_1$, and $y$ the corresponding set for $w_2$. If $w_1 \neq w_2$ but $x = y$, then $f(w_1) = f(w_2) = f(w_1, w_2)$ and $\mathrm{CTC}(w_1, w_2) = 0$.

This association measure can be used to identify the most accurate co-occurrences of a particular term (see section 2.2 part 3). The main advantage of this approach is that it requires no background knowledge or specific analysis of the problem domain. Instead, it analyses all features automatically through Google searches over the World Wide Web. In this phase, a term matrix is created that pairs each term with its corresponding target term. Terms whose score is below the standard threshold of 0.5 are discarded.
Algorithm 1. Compute Word Similarity
Input: all pre-processed words (W)
Output: term matrix M
Initialize an empty M
for i = 1 to |W|
    for j = 1 to |W|
        M[i][j] = CTC(w_i, w_j)
    end for
end for
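The nested loop of Algorithm 1 can be sketched as follows. Since the properties listed for CTC match those of the Normalised Google Distance, the sketch assumes that CTC takes the NGD form $\mathrm{CTC}(w_1, w_2) = \frac{\max\{\log f(w_1), \log f(w_2)\} - \log f(w_1, w_2)}{\log N - \min\{\log f(w_1), \log f(w_2)\}}$. This form is an assumption, as are the function names and the page counts, which are invented for illustration and do not come from Google.

```python
import math

def ctc(f1, f2, f12, n):
    """NGD-style score from page counts (assumed form of CTC, see lead-in)."""
    if f1 == 0 or f2 == 0 or f12 == 0:
        # log(0) is undefined; such pairs are treated as infinitely distant here (assumption).
        return math.inf
    log1, log2, log12 = math.log(f1), math.log(f2), math.log(f12)
    return (max(log1, log2) - log12) / (math.log(n) - min(log1, log2))

def ctc_matrix(words, page_count, pair_count, n):
    """Fill the term matrix by scoring every ordered pair of words, as in Algorithm 1."""
    matrix = {}
    for w1 in words:
        for w2 in words:
            # A word always co-occurs with itself on every page that contains it.
            f12 = page_count[w1] if w1 == w2 else pair_count[frozenset((w1, w2))]
            matrix[(w1, w2)] = ctc(page_count[w1], page_count[w2], f12, n)
    return matrix

# Hypothetical page counts, chosen only to illustrate the computation.
page_count = {"tcp": 1000, "socket": 2000, "banana": 500}
pair_count = {
    frozenset(("tcp", "socket")): 800,
    frozenset(("tcp", "banana")): 2,
    frozenset(("socket", "banana")): 3,
}
matrix = ctc_matrix(list(page_count), page_count, pair_count, n=10**6)
for pair, score in sorted(matrix.items()):
    print(pair, round(score, 3))
```

Under this assumed form, the score of a word with itself is 0, the matrix is symmetric, and lowering `n` raises every finite non-zero score, which agrees with the properties stated above.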

Computation of Term Similarity
Generally, the context words that appear with each term in a document are taken as candidates for computing term similarity. To identify the context in a text, a sentiment dictionary or lexicon (SentiWordNet) is generated for each specific domain. The units of the resource are adjectives and verbs, in patterns such as <No> + <adjective + verb>. The similarity calculation is applied to all sentiment units. We refer to this similarity score as Point wise Occurrence.

Experiments
This section presents the experimental results. Several experiments were performed on different datasets to evaluate the performance of the proposed approach. Each experiment is described below.

Datasets
We use three different datasets for the experiments. Details of all three datasets are given below.
We build WordSimSE DB using question and answer posts from StackOverflow. We obtained this data from the MSR 2013 Mining Challenge (Demeyer et al., 2013). The collected dataset is around 12 GB and holds all posts produced from February 2014 to February 2018. We split the data into documents, where each document covers one question and all its answers. The title, description and tags of each question, together with the corresponding answers, are extracted and saved in the document. We collected 83,468 documents. From these we randomly sample 10,000 documents and use them to construct WordSimSE DB. All tests were run on an Intel Xeon X5460 3.26GHz server with 32.0GB RAM running Windows Server 2008 (32 bit). For the third step of our construction procedure, we use 460 word pairs obtained from earlier work to tune the weight parameters. We find that the best weights for α, β and γ are 2.9, 2.1 and 1.5 respectively. This shows that other software tags are less important than popular software tags, and other words are less important than software tags. As a baseline, we use the WordNet word pair similarity dataset. This dataset contains billions of word pairs and is approximately 100GB in size. We compute word similarity on the basis of the Resnik metric (Roldan-Vega et al., 2013).
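To illustrate how weights such as α, β and γ could enter a similarity computation, the following sketch scales each anchor's contribution to a cosine similarity by the weight of its anchor type. The exact weighting formula is not given in the text, so the weighted cosine, the mapping of α, β and γ to popular software tags, other software tags and other words, and all names and counts below are assumptions.

```python
import math

# Assumed mapping: alpha = popular software tags, beta = other software tags, gamma = other words.
WEIGHTS = {"popular_tag": 2.9, "other_tag": 2.1, "other_word": 1.5}

def weighted_cosine(vec_a, vec_b, anchor_type):
    """Cosine similarity in which each anchor is scaled by the weight of its type."""
    def w(anchor):
        return WEIGHTS[anchor_type[anchor]]
    dot = sum(w(k) * vec_a[k] * vec_b[k] for k in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(w(k) * v * v for k, v in vec_a.items()))
    norm_b = math.sqrt(sum(w(k) * v * v for k, v in vec_b.items()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical anchors and co-occurrence counts for two words.
anchor_type = {"java": "popular_tag", "netty": "other_tag", "fast": "other_word", "slow": "other_word"}
socket_vec = {"java": 3, "netty": 1, "fast": 1}
port_vec = {"java": 2, "netty": 1, "slow": 1}
print(round(weighted_cosine(socket_vec, port_vec, anchor_type), 3))
```

With this choice, two words that share a popular software tag score higher than two words that share only an ordinary word with the same counts, which reflects the ordering of the reported weights.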
Micro-blogging is becoming more popular every day. People share their ideas on social media sites. Forums play an important part on these sites, as they allow users to share ideas on any technique, method or issue. There are also many software-related forums, used mainly by developers and programmers. New ideas about bug fixes and software fixes are usually discussed in these forums. We collected ten million comments from web-related social media forums. These comments were extracted using the Graph API (Weaver and Tarjan, 2013) and cover the last 3 months.
Software repositories hold software-related packages. Software companies and organizations maintain these repositories on their servers. The repositories contain information about software runtime errors, bugs, fixes and version details, and are very useful for checking software reliability. We use one of the largest repositories, the 'tera-PROMISE' repository (Anwer et al., 2017). This repository holds software engineering data and contains millions of records about the software engineering domain.

Results
Evaluation criteria are essential for assessing the results of any technique. We report results under two criteria: discounted cumulative gain (DCG) and Likert score (LS) (Jarvelin and Kekalainen, 2002). WordNet is one of the largest lexicons and the basis of many newer lexicons. About 45% of software-related words are present in WordNet, that is, 450 words out of 1000. In other words, WordNet returns only 450 software-related words out of 1000. Our proposed approach returns 1000 software-related words out of 1000, which makes our approach far better than WordNet.
As mentioned earlier, WordNet covers 45% of the software-related words. WordNet achieves an average Likert score of 1.53, while our proposed approach achieves an average Likert score of 2.42. Our approach achieves an improvement of 60.18%, which indicates that it captures software-related words accurately by identifying their semantic meanings. The average Likert scores are shown in Table 1. Extracting words together with a ranking is one of the hardest tasks. Average Discounted Cumulative Gain is used to evaluate the ranking of the extracted software-related words (Wang et al., 2013); it rewards rankings in which the most relevant item appears first. A ranking helps users pick the right words according to their ranks.
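The ranking measure can be sketched as follows. The sketch assumes the widely used form of discounted cumulative gain, in which the score at rank i is divided by log2(i + 1); the exact variant used in the experiments is not stated, and the judge scores below are hypothetical.

```python
import math

def dcg(scores):
    """Discounted cumulative gain of relevance scores listed in rank order."""
    # Rank 1 is undiscounted (log2(2) = 1); lower ranks contribute less.
    return sum(s / math.log2(rank + 1) for rank, s in enumerate(scores, start=1))

def average_dcg(rankings):
    """Mean DCG over several ranked lists, one list per query word."""
    return sum(dcg(r) for r in rankings) / len(rankings)

# Hypothetical Likert-style scores given by judges to ranked output words.
good_order = [3, 2, 1]   # most relevant word ranked first
bad_order = [1, 2, 3]    # same scores, worst word ranked first
print(round(dcg(good_order), 3), round(dcg(bad_order), 3))
print(round(average_dcg([good_order, bad_order]), 3))
```

The same set of scores yields a higher DCG when the most relevant words appear at the top of the list, which is why the measure is suitable for comparing ranked outputs.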
Table 2 shows the results obtained from WordNet and from our proposed approach, and shows an improvement of 74.21% over the WordNet approach. One of the research questions concerns the scalability of the proposed technique and its stability as the number of reviews increases. To answer this question, we ran the proposed technique on 5,000, 10,000, 15,000, 20,000 and 25,000 reviews. Fig. 2 shows the results for the different dataset sizes.

Fig. 2. Word Pairs
We also examined how fast the proposed technique is compared with a state-of-the-art technique. For this, we plot runtime against the number of reviews. The runtime includes data pre-processing, co-occurrence computation using the normalized Google distance, and SentiWordNet. The experimental results show that the proposed technique runs 5 to 6 times faster than the other technique. Fig. 3 shows the comparison between the proposed technique and the state-of-the-art technique.

Fig. 3. Comparison between basic approach and optimized approach
Almost 30,000 comments were extracted from Facebook forums using the Graph API. The comparison between the basic approach and the optimized approach is shown in Fig. 4. The number of reviews taken from social media forums is plotted on the x-axis and the processing time on the y-axis. Results are reported for increasing numbers of reviews, such as 5,000, 10,000, 15,000 and 20,000. Fig. 4 shows that the optimized technique also works well for the largest review counts, where its results improve. The 'tera-PROMISE' repository is a well-known repository that is widely used in research papers. It contains millions of records, of which we pick fifty thousand sentences for our experiment. The experimental results are very promising: the runtime decreases as the number of reviews or sentences increases. The results are plotted in Fig. 5, with runtime on the y-axis and the number of reviews on the x-axis.

Conclusion
In this research, we proposed a method that automatically constructs software-specific term databases that store the common words of the software engineering domain. We created a similarity metric named WordSimSE, based on question and answer posts in StackOverflow, which computes the similarity of words from their weighted co-occurrence with three different kinds of anchors. We compared our technique with a WordNet-based approach named WordNetres. The results suggest that our technique outperforms WordNetres in terms of average discounted cumulative gain (DCG) and average Likert score by more than 67% and 51% respectively. Our enhanced method can evaluate the similarity of more than 35 million pairs of words in less than 17 minutes by examining a 60,000 document dataset. In future, we plan to build a larger WordSimSE DB by running the method on more question and answer posts from StackOverflow. We also plan to offer open access to the expanded WordSimSE DB as a web service.

Table 2. Average Discounted Cumulative Gain scores

Table 3 shows the comparison of our proposed technique with some state-of-the-art techniques. These techniques are taken from past studies, namely WordNet and Castellanos et al. (2017). We extract 10,000 word pairs by analysing 3,000 reviews. As shown in Table 3, the proposed technique extracts fewer word pairs than WordNet and almost twice as many word pairs as the Castellanos technique.

Table 3. Comparisons in number of pairs