Vocabulary Richness Metric for Extracting Author’s Semantic Mark in English Written Literary Works

This paper starts with a short introduction to the main aspects debated regarding the stylometric measures used for extracting the personal signature that a particular author adds to works written in English. These measures are used to identify an author from a set of limited cardinality, given either a set of documents or a defined set of indicator values that characterize the semantic way in which the author writes. The paper addresses the semantic level of a work as determined by the tokens the author uses, which are extracted in a preprocessing step of the analysis. The tokens are defined using a lexical ontology, WordNet in the case of English words, and are extracted automatically from the words found in the processed papers. The main vocabulary richness evaluation metrics are presented based on a review of the literature, and their main steps are carried into a newly proposed metric that combines vocabulary richness with the semantic layer of a paper. The concept of author mark is described. The objective of this research is the newly proposed metric, which does not depend on the main subject discussed in the analyzed paper. This leads to a general metric that combines documents on different subjects in order to describe the vocabulary richness of a specific author based on the works that author has written. Furthermore, the analysis is extended to the time evolution of this metric by extracting the trend of the author's vocabulary richness indicator. Results are presented for a set of 13 yearly values of this indicator for a specific author. Future work refers to including this metric in a general description of the author mark in works written in English.


Introduction
Intrinsic plagiarism detection implies recognizing those parts of text within a document that differ from the writing style of a certain author. Those parts are later used as input data for verification with external plagiarism detection tools. If a document is written by a single author, the passages in it are expected to be similar, in accordance with that author's unique writing style. By comparing the writing style of each part of text in papers written by multiple authors and adding unsupervised automatic classification techniques, the parts of text are grouped into clusters according to the author they belong to. Plagiarism detection based on this type of analysis involves extracting the unique writing style of each author, a method also called stylometry analysis. Given a set of characteristics that uniquely describes the writing style of an author, a metric is created that expresses the percentage membership of documents to authors.
The research in [3], [4], [5], [6], [7], [8] and [10] discusses the problems and methods of intrinsic plagiarism detection, together with stylometry, the writing style of a specific author over the history of his or her research or within a single document. Regardless of the type of plagiarism evaluation, intrinsic or external, it is very important to determine the set of characteristics that must be taken into account in order to obtain results that are as accurate as possible. Those characteristics depend on the set of analysed documents, the language in which the documents are written and the type of documents. The present research paper addresses literary documents written in English by native English or European authors. Multidimensional analysis is used to extract from the initial set of documents the semantic analysis that describes the stylometry.
Chapter 2 presents the relation between semantic analysis and the main vocabulary richness metrics, which are used to extract an indicator value from the words found in the analysed author's set of documents, transformed into tokens, and from the semantic distances between them. The notions of word, token and frequency of appearance are presented along with the main set of writing style features. The pre-processing phase, a step needed to convert words into WordNet tokens, is also presented. In chapter 3, the improved semantic vocabulary richness metric is presented and defined, along with an example of applying it to a given phrase. The time evolution analysis is done in chapter 4, where 13 yearly values are arranged into a time series. The trend of the indicator is evaluated using three methods: absolute mean change, average index and linear regression. After comparing the sum of squared errors of the three methods, linear regression is chosen for the forecast. The conclusions are drawn in chapter 5, along with the directions for future work.
DOI: 10.12948/issn14531305/20.3.2016.04

Vocabulary Richness Metrics In Stylometry Analysis
When analyzing an author's writing style, in the context of either external or intrinsic plagiarism analysis, vocabulary richness is defined as the characteristic that describes the degree to which the author draws on a wider or narrower vocabulary. Works such as [1] and [2] showed that this feature is closely tied to the author, so it can be included in the optimal set of writing style features.
Table 1 contains a list of metrics used to assess vocabulary richness within the set of writing style features, together with the variables used in the formulas of the metrics presented in [1], where:
• $N$ is the total number of words in the analyzed document;
• $V$ is the total number of concepts identified in the set of words;
• $V_i$ is the total number of concepts that appear exactly $i$ times in the document;
• $p_v$ is the relative frequency of the $v$-th most frequent concept in the document.
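The variables above can be computed directly from a list of concept tokens. The following is a minimal sketch, not the content of Table 1: the function name `richness_metrics` is hypothetical, and the type-token ratio, the hapax ratio and Yule's K are used here only as well-known examples of metrics built from $N$, $V$ and $V_i$.

```python
from collections import Counter


def richness_metrics(concepts):
    """Compute a few classic richness metrics from a list of concept tokens.

    Assumption: these three metrics are common textbook examples; they are
    not claimed to be the exact rows of Table 1.
    """
    if not concepts:
        raise ValueError("at least one token is required")
    n = len(concepts)                    # N: total number of words
    counts = Counter(concepts)           # occurrences of each concept
    v = len(counts)                      # V: number of distinct concepts
    spectrum = Counter(counts.values())  # V_i: concepts occurring exactly i times
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    return {
        "N": n,
        "V": v,
        "type_token_ratio": v / n,            # V / N
        "hapax_ratio": spectrum.get(1, 0) / v,  # V_1 / V
        "yule_k": yule_k,
        "p_max": max(counts.values()) / n,    # relative frequency of the top concept
    }


example = richness_metrics(["dog", "cat", "dog", "bird", "cat", "dog"])
print(example["type_token_ratio"])
```

None of these values uses the meaning of the concepts; they depend only on how often each concept occurs.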
The preprocessing phase for extracting the vocabulary richness of a text consists in separating the words of the analyzed text or text fragment and eliminating spaces and punctuation. As an optimization, connecting words, which are present in any text regardless of its author, are discarded. Denoting by $W = \{w_1, w_2, \dots, w_i, \dots, w_n\}$ the set of words resulting from the preprocessing phase, the WordNet lexical ontology is brought into the analysis: the set of unique concepts is generated by intersecting the initial set of words $W$ with the concepts in WordNet, reducing duplicates, and creating the vector of occurrences of each concept found:
$T = \{t_1, t_2, \dots, t_m\}$, $nap = \{nap_1, nap_2, \dots, nap_m\}$
where:
• $T$ represents the set of unique concepts identified in the text and found in the WordNet ontology;
• $nap$ represents the set made up of the number of occurrences in the analyzed document of each concept from the set $T$.
Most indicators that measure the richness of the vocabulary used by an author rely on the ratio between the number of unique words identified in an analyzed text and the total number of words in that text, so these metrics do not account for the semantic component carried by the specific words extracted. In addition, the metrics proposed in the literature do not assess the evolution over time of this feature, even though it is used in a very high percentage of assessments of a person's writing style. Starting from this issue, a metric is needed that evaluates the richness of the vocabulary used both in the document under review and in the author's previous documents, if they exist. The proposed metric uses the number of words found, the WordNet lexical concepts identified by reducing words to their roots, and functions for calculating the distance between any two WordNet concepts.
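The preprocessing steps described above can be sketched as follows. This is an illustration under stated assumptions: `STOP_WORDS` and `TOY_LEXICON` are small hypothetical stand-ins, and a real pipeline would look words up in WordNet rather than in a hand-written dictionary.

```python
import re
from collections import Counter

# Hypothetical stand-ins: a short list of connecting words and a toy
# word-to-concept lexicon that replaces the WordNet lookup.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "is"}
TOY_LEXICON = {
    "dog": "dog", "dogs": "dog",
    "cat": "cat", "cats": "cat",
    "run": "run", "runs": "run", "ran": "run",
    "garden": "garden",
}


def preprocess(text):
    """Split text into lowercase words, dropping punctuation and connecting words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def to_concepts(words, lexicon=TOY_LEXICON):
    """Return the unique concepts and the number of occurrences of each one."""
    counts = Counter(lexicon[w] for w in words if w in lexicon)
    concepts = sorted(counts)                  # unique concepts found in the lexicon
    occurrences = [counts[c] for c in concepts]  # aligned occurrence counts
    return concepts, occurrences


words = preprocess("The dog runs in the garden, and the dogs ran to the cat.")
concepts, occurrences = to_concepts(words)
print(concepts, occurrences)
```

Words that are missing from the lexicon are silently skipped here, which mirrors the intersection with the ontology; whether unmatched words should still count toward the total number of words is a design choice left open in this sketch.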

Improved Metric for Evaluating the Vocabulary Richness In The Presence Of Semantic Relations
The impact of this metric comes from the semantic side added to the set of words used in an analyzed text. Enriching the metric with a semantic analysis feature produces a more complex stylometry, both from the local point of view and in terms of its evolution over time. Thus ISRV, the Indicator of Semantics Richness of Vocabulary, is defined using the following quantities:
• $nap_i$ represents the number of instances of the word found at position $i$ of the set of unique terms extracted from the analyzed document;
• $m$ represents the cardinality of the set of unique terms extracted from the analyzed document;
• $d(t_i)$ is the maximum distance between the unique term $t_i$ and any other unique term extracted from the set of terms, $d(t_i) = \max_{j \neq i} d(t_i, t_j)$, where the distance is calculated using the semantic distances defined in the WordNet lexical ontology [9];
• $n$ is the cardinality of the set of words, unique or not, extracted from the analyzed document in the preprocessing phase of the text.
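The maximum-distance term can be illustrated with a short sketch. Everything below is an assumption made for illustration: the distance values are invented toy numbers rather than WordNet distances, and `illustrative_score` is one hypothetical way of weighting unique terms by their maximum distance, not the ISRV formula of this paper.

```python
def max_distances(terms, distance):
    """For each term, return the largest distance to any other term."""
    return {
        t: max((distance(t, u) for u in terms if u != t), default=0.0)
        for t in terms
    }


# Toy symmetric distances (invented values, not taken from WordNet).
TOY_DISTANCES = {
    frozenset(("dog", "cat")): 0.2,
    frozenset(("dog", "garden")): 0.8,
    frozenset(("cat", "garden")): 0.9,
}


def toy_distance(a, b):
    """Look up the toy distance between two different terms."""
    return TOY_DISTANCES[frozenset((a, b))]


terms = ["dog", "cat", "garden"]   # unique terms
occurrences = [2, 2, 1]            # occurrences of each unique term
n = sum(occurrences)               # total words, assuming every word was matched

d_max = max_distances(terms, toy_distance)
type_token_ratio = len(terms) / n
# Hypothetical aggregate: each unique term contributes its maximum distance
# instead of a full count of 1, so near-synonyms add less than distant terms.
illustrative_score = sum(d_max.values()) / n
print(d_max, type_token_ratio, illustrative_score)
```

In this toy example the semantically weighted value is lower than the plain type-token ratio, because no term is at the maximum possible distance from the others.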

Comparative analysis
The proposed metric, ISRV, weighs the result obtained by the Type-Token metric in terms of semantic similarity. Even though reducing the words of the text to unique WordNet concepts gives a ratio of 0.55 (55%), that ratio does not take the semantic component into account.
In the analyzed text, some WordNet concepts are distinct yet close in similarity, with a distance value that tends to 0. The ISRV metric therefore expresses more realistically the vocabulary richness found in a document or in an analyzed text fragment.
Extending the analysis of vocabulary richness and of the semantic distance between concepts with the time evolution of this author-oriented characteristic leads to defining the trend of this indicator.

Time Evolution Analysis of the Proposed Vocabulary Richness Metric
The context of the analysis is an initial set of documents written by a specific author, for each of which the value of the proposed ISRV metric is recorded. The set is sorted chronologically in order to generate a time series. Let $D = \{D_1, D_2, \dots\}$ denote the initial set of analyzed documents; the resulting series describes the richness of the vocabulary used by the author during the analyzed period. Where the author wrote several documents during the same year, the ISRV indicator for that year is calculated as the arithmetic mean of the ISRV values recorded for the documents of that year. To evaluate the trend of the indicator for the author, the time series is modeled using three methods of estimating the trend:
• the absolute mean change method assumes a linear dependence in the form of an arithmetic progression, in which each term of the series is obtained from the first term in time by adding the time offset multiplied by the mean absolute change; this method is suitable for first-degree linear dependencies;
• the average index method assumes an exponential dependence in the form of a geometric progression, in which each term of the series is obtained from the first term by multiplying it by the average index of dynamics raised to the power of the time offset; this method is suitable for an exponential dependence between the ISRV indicator and the series of time periods;
• the linear regression method is the only analytical method proposed in the present example and involves estimating a first-degree equation using the method of least squares; the form of the trend is given by $\hat{y}_i = a + b\,t_i$, where $\hat{y}_i$ is the estimated ISRV value for time period $t_i$.
Choosing the best method for approximating the trend of the ISRV indicator over time involves comparing the sum of squared errors produced by the three proposed estimation methods, $SSE = \sum_i (y_i - \hat{y}_i)^2$, where $y_i$ is the recorded ISRV value.
A preliminary analysis of the resulting chart shows an increasing trend of the ISRV indicator, which is interpreted as a development in the author's use of vocabulary through an increase in its level of semantic richness.
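The three trend estimation methods and their comparison by sum of squared errors can be sketched as follows. The series `y` is hypothetical illustration data, not the 13 values of this study, and the function names are assumptions introduced for the example.

```python
def absolute_mean_change(y):
    """Arithmetic progression from the first term using the mean absolute change."""
    delta = (y[-1] - y[0]) / (len(y) - 1)
    return [y[0] + i * delta for i in range(len(y))]


def average_index(y):
    """Geometric progression from the first term using the average index.

    Requires strictly positive first and last values.
    """
    index = (y[-1] / y[0]) ** (1 / (len(y) - 1))
    return [y[0] * index ** i for i in range(len(y))]


def linear_regression(y):
    """Least-squares line a + b*t fitted over t = 1..n."""
    n = len(y)
    t = list(range(1, n + 1))
    t_mean, y_mean = sum(t) / n, sum(y) / n
    b = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y)) / sum(
        (ti - t_mean) ** 2 for ti in t
    )
    a = y_mean - b * t_mean
    return [a + b * ti for ti in t]


def sse(y, y_hat):
    """Sum of squared errors between recorded and estimated values."""
    return sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))


y = [0.50, 0.52, 0.51, 0.54, 0.55]  # hypothetical yearly indicator values
errors = {
    "absolute mean change": sse(y, absolute_mean_change(y)),
    "average index": sse(y, average_index(y)),
    "linear regression": sse(y, linear_regression(y)),
}
best = min(errors, key=errors.get)
print(errors, best)
```

Because the absolute mean change estimate is itself a straight line, the least-squares line can never have a larger sum of squared errors than it; the comparison with the average index method depends on the data.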
To predict developments in the next period of research, the three trend extraction methods must be run. Table 3 contains the series estimated using the first method, the absolute mean change method.
Table 3. Trend estimate calculations using the absolute mean change method (columns: Year (i), ISRV)

Table 5 contains the series estimated using the third method, linear regression. Because linear regression yielded the lowest sum of squared errors, its equation was chosen for estimating the trend. Figure 2 contains the chart of the trend estimated using linear regression. The trend is increasing, by 0.011 percentage points from one time period to the next. This is interpreted as an expansion of the range of concepts used in the documents written by the author.
The advantage of this method is that the proposed metric for assessing vocabulary richness does not depend on the fields treated in the reviewed documents, but on the semantic distance between the unique concepts identified in those documents. Adding the time analysis component makes it possible to estimate the indicator for future works of authors whose previous works are known and dated.
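The yearly averaging and the one-step forecast discussed above can be sketched together. The records below are hypothetical (year, indicator) pairs, not the values from this study, and `yearly_means` and `linear_forecast` are names introduced only for this example.

```python
from collections import defaultdict


def yearly_means(records):
    """Average the per-document indicator values that fall in the same year."""
    by_year = defaultdict(list)
    for year, value in records:
        by_year[year].append(value)
    return {year: sum(vals) / len(vals) for year, vals in sorted(by_year.items())}


def linear_forecast(values, steps=1):
    """Fit a least-squares line over t = 1..n and extrapolate `steps` periods ahead."""
    n = len(values)
    t = list(range(1, n + 1))
    t_mean, v_mean = sum(t) / n, sum(values) / n
    slope = sum((ti - t_mean) * (vi - v_mean) for ti, vi in zip(t, values)) / sum(
        (ti - t_mean) ** 2 for ti in t
    )
    intercept = v_mean - slope * t_mean
    return intercept + slope * (n + steps)


# Hypothetical documents: two in 2001, one each in 2002 and 2003.
records = [(2001, 0.50), (2001, 0.54), (2002, 0.53), (2003, 0.54)]
series = yearly_means(records)
forecast = linear_forecast(list(series.values()))
print(series, forecast)
```

The slope of the fitted line plays the role of the per-period increase reported for the trend; the forecast is only as reliable as the assumption that the linear trend continues.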

Conclusions
Transforming the vocabulary richness indicator into a semantic one adds a new layer of analysis to the general intrinsic plagiarism detection methods. The first step in detecting plagiarism is defining the author's mark within his or her papers, which leads to a similarity analysis of parts of documents. By minimizing the set of phrases considered to be plagiarized, the whole plagiarism detection process is reduced: only those parts of documents that differ in terms of author mark analysis are used as input data for the next step, external plagiarism detection. The proposed vocabulary richness metric with its semantic layer does not depend on the main subjects of the documents written by a particular author, thereby removing the subject dependency. In particular, multiple authors tend to expand their research into different domains. Using this expansion of sub-

Figure 1. Evolution of the ISRV indicator over 13 years

Figure 2. Chart of ISRV indicator evolution using linear regression

Table. ISRV metric computed on a test set compared with the standard Type-Token metric (column: Analyzed fragment text)
Vocabulary richness metrics are analyzed in depth in order to propose a new metric for evaluating the richness of the vocabulary used by authors of different documents, adding the semantic layer as a further characterization.

Table 4. Trend estimate calculations using the average index method

Table 5. Trend estimate calculations using the linear regression method

Table 6. The sum of squared errors and the estimated trend equations for the three proposed estimation methods