Stylometry Metrics Selection for Creating a Model for Evaluating the Writing Style of Authors According to Their Cultural Orientation

The present paper starts with a short introduction to the major aspects debated regarding plagiarism and author identification, along with the principles underlying the property rights laws of the European community and the Anglo-American one. Regardless of the community involved, plagiarism is a form of using others' research, as it is or modified, and presenting it as a personal creation. The terms creativity and plagiarism are described through an antithesis analysis, leading to the concept of originality, defined as the property that a creative research paper has when the ideas presented within it differ from those already published by other authors. A metric is implemented in order to obtain a measurable value for the level of originality of a paper. The main ways of testing a paper for plagiarism, intrinsic and external analysis, are described in order to choose the proper methodology for determining the originality of scientific papers. The research leads to stylometric analysis, a field found at the crossroads of plagiarism, originality and author identification. Stylometric analysis is carried out within intrinsic plagiarism detection and is built on a number of metrics that uniquely describe the writing style of a specific author. The testing platform uses a set of research papers written by European authors, from which the values of eight writing style metrics are extracted. Clustering is applied and the best combination of metrics is obtained.


Introduction
Research on intellectual property rights faces the problem of determining the level of originality of a research paper, in contrast to the act of plagiarism, which is defined as the full or partial appropriation of ideas, expressions, methods or procedures and their presentation as a personal creation. In Anglo-American law, economic considerations and those that refer to public policy prevail in the elaboration and development of property rights laws, while in the European view moral and civil arguments form the basis of the same laws. The legislative framework does not resolve the identification of plagiarism or of the level of originality of a scientific work. The present paper aims to apply property rights legislation in the context of publishing scientific research papers.

In practice, there are different types of plagiarism, the most common being: copy-paste, paraphrase, plagiarism through translation into different languages, artistic plagiarism, plagiarism of ideas, source code plagiarism and failure to use proper citations. Article [1] analyzes plagiarism through paraphrase, arriving at a classification of the major known types, and tests plagiarism detection software in terms of the percentage of paraphrases correctly identified within a text document.

The present paper consists of five chapters, starting with a short introduction to the major aspects debated, along with the principles underlying the property rights laws of the European community and the Anglo-American one. Regardless of the community involved, plagiarism is a form of using others' research, as it is or modified, and presenting it as a personal creation. Chapter 2 describes the terms creativity and plagiarism through an antithesis analysis, leading to the concept of originality, defined as the property that a creative research paper has when the ideas presented within it differ from those already published by other authors. A metric is implemented in order to obtain a measurable value for the level of originality of a paper. The main ways of testing a paper for plagiarism, intrinsic and external analysis, are described in order to choose the proper methodology for determining the originality of scientific papers. In the third chapter, the research leads to stylometric analysis, a field found at the crossroads of plagiarism, originality and author identification. Stylometric analysis is carried out within intrinsic plagiarism detection and is built on a number of metrics that uniquely describe the writing style of a specific author. In the fourth chapter, eight stylometry metrics are extracted from a number of scientific research papers in order to obtain the combination that best describes the writing style of an author. For this purpose, the Weka tool is used together with an analysis based on the WordNet lexical ontology, obtaining a set of four metrics that can further describe the writing style of an author according to his or her cultural orientation. Conclusions are highlighted in the fifth chapter, along with directions for future research.

Creativity and Plagiarism Analysis
Creativity, seen as a form of originality, represents the characteristic of adding something new, original and appropriate to reality, and thus defines novelty and originality. Therefore, in order to analyze the level of originality of a scientific paper, an antithesis must be drawn between this component of creativity and that of plagiarism. Starting from the objects used within the present research, scientific research papers written by Romanian and other European authors, a semantic phase is defined as a compact component of a paper, formed of one or more adjacent phases, which is significantly different from the semantic phases preceding or following it. Saying that a work is original is equivalent to stating the result of evaluating the paper for plagiarism.
IEO, the Indicator for Originality Evaluation, is defined as the ratio between the number of original semantic phases and the total number of semantic phases found within the analysed paper (see [15] regarding originality):

$$IEO = \frac{N_{o}}{N_{t}}$$

where $N_{o}$ denotes the number of original semantic phases and $N_{t}$ the total number of semantic phases in the paper.

Copyright law emphasizes that "originality" fundamentally means that a work comes from the inspiration of the author and was not copied from another source. Hence, "original" is used in the sense of origin, in order to identify the source from which the work originates. The fewer phrases a work contains that overlap with previous research, the higher its degree of originality. This paper uses the concept of plagiarism not only in its narrow and well-known sense, that of text copied without regard for the legal and moral rights of the source, but also in the sense of ideas: the research topics that can influence an author's research, taking into account previous and similar studies by other authors. A work is original when it treats a concept, an art, or a new or existing situation in a unique manner compared to other studies.

Current approaches to identifying plagiarism evaluate a document by comparing two or more documents. The degree of similarity is used as a quantitative assessment of the similarity between two documents on the basis of a system of metrics. Paper [2] proposes a classification of the main metrics used in plagiarism detection. In the literature, there are two main strategies for identifying plagiarism [3]:
- intrinsic, which aims to identify plagiarized passages by examining only the analyzed document, concluding whether or not parts of the material were written by the same author; such models are presented in [3], [4];
- external, which involves comparing the document with other existing documents in a database of material and identifying pairs of similar documents; multiple studies analyze this problem, such as [5], [6], [7] and [8].

The intrinsic plagiarism identification technique uses the writing style of an author as the basis for comparison. A template is constructed, consisting of features such as text statistics, syntactic features, parts of speech, sets of commonly used words and structural features of the text. The feature set is attached to a criterion function that evaluates changes over the analyzed text. The disadvantages of this method become apparent in the case of works written by several authors. On the other hand, the external approach to plagiarism brings the benefit of comparing the document with other documents written by the same author as well as with other documents from the same field. Its disadvantage is the exponential complexity in relation to the size of the database used for comparison.
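The IEO ratio defined in this section can be illustrated with a minimal sketch. It assumes that each semantic phase has already been judged original or not; how that judgement is made is outside the sketch, and the function name and input format are hypothetical.

```python
def ieo(is_original):
    """Share of original semantic phases in a paper.

    is_original: list of booleans, one per semantic phase
    (hypothetical input format, not taken from the paper).
    """
    if not is_original:
        raise ValueError("the paper must contain at least one semantic phase")
    original_count = sum(1 for flag in is_original if flag)
    return original_count / len(is_original)


# Example: three of the four semantic phases were judged original.
score = ieo([True, True, False, True])
```

A value close to 1 indicates that most semantic phases are original, while a value close to 0 indicates heavy overlap with previous work.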

Stylometry Metrics in Intrinsic Plagiarism Analysis
Surveys such as [9], [10], [11], [12] and [13] treat the problems of, and approaches to, intrinsic plagiarism detection, referring also to stylometry: the writing style of an author over the history of his or her research or within a single document. For intrinsic plagiarism, in which the internal parts of a document suspected of plagiarism are considered, Table 1 summarizes the types of features analyzed, the software tools used and the resources involved. These characteristics depend on the set of documents analyzed, the language in which they are written and the type of documents. This work addresses scientific articles written in English. To extract, from the set of original documents, the semantic component describing an author's cultural affiliation, multidimensional data analysis is used to identify the writing style features that optimize the objective function for extracting cultural orientation.

Clustering Metrics for Creating a Model for Description of Author's Writing Style
To extract the set of writing style characteristics that best defines the lexical, semantic and cultural components of an author on the basis of his or her own scientific papers, an initial set of characteristics must be defined. Different combinations are then formed from this initial set, which is composed of the following elements:
- the average length of words;
- the average length of sentences, measured in number of words;
- the number of connection words relative to the total number of words in the processed documents;
- the usage frequency of special symbols such as {,;.!?@#$%&*(){}[]};
- the Type-Token richness of the vocabulary;
- the semantic richness of the vocabulary.

Besides the initial set of six writing style characteristics, two more are defined that better describe the semantic component. The first is the contextual meanings indicator, ISC, and the second is the weighted indicator of contextual meanings, IPSC; both are determined on the basis of the WordNet ontology. The following variables are used:
- $w_i$ is the $i$-th word from the set of words found in the processed document;
- $s(w_i)$ is the contextual meaning returned for the word $w_i$ using the Word Sense Disambiguation component available in the WordNet ontology;
- $p(s(w_i))$ is the occurrence weight assigned to the contextual meaning $s(w_i)$ of the word $w_i$, a weight determined on the basis of a training set and taken from the WordNet lexical ontology.

ISC is the indicator of the contextual meanings that the author uses on average in his or her scientific papers. IPSC is the weighted indicator of the contextual meanings that an author uses on average in his or her scientific papers, weighted by the occurrence probability of the meanings found in the WordNet ontology. The two indicators complete the initial set of characteristics by integrating an analysis of the usage of common or particular meanings of polysemantic words.

The ISC indicator is computed over the $n$ words of the analysed document, where $n$ is the size of the set that includes all the words extracted from the document; this set is not reduced by eliminating repeated words, because one word may be used with multiple meanings in the same document, depending on context. The IPSC indicator includes ISC in an improved form, by integrating the occurrence probability of each contextual meaning. IPSC is inversely proportional to the occurrence weight of the contextual meanings of polysemantic words: a value of zero means that common meanings have been used, whereas a value tending to infinity, $IPSC \to \infty$, means that the author frequently uses uncommon contextual meanings of polysemantic words.
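The purely lexical characteristics listed in this section can be approximated with a short sketch. It is only an illustration under stated assumptions: the tokenization rules, the tiny list of connection words and the choice to normalize symbol counts by the number of characters are all hypothetical, and the WordNet-based characteristics (semantic richness, ISC, IPSC) are left out because they depend on a word sense disambiguation step.

```python
import re

# Special symbols named in the text.
SPECIAL = set(",;.!?@#$%&*(){}[]")
# Hypothetical, deliberately small list of connection words.
CONNECTIVES = {"and", "but", "or", "because", "however", "therefore", "although"}


def style_features(text):
    """Compute five lexical writing-style features for an English text."""
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "connective_ratio": sum(w in CONNECTIVES for w in words) / len(words),
        # Assumption: frequency is relative to the number of characters.
        "special_symbol_freq": sum(c in SPECIAL for c in text) / len(text),
        "type_token_ratio": len(set(words)) / len(words),
    }


features = style_features("The cat sat. The dog ran, and the cat hid!")
```

Each document then becomes a numeric vector, which is the form needed for the combination and clustering steps.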
To choose an optimal set of characteristics that best describes the cultural affiliation of an author's scientific papers, the set of combinations of these eight characteristics, of size NC, is defined:

$$NC = \sum_{k=1}^{8} \binom{8}{k} = 2^{8} - 1 = 255$$

To choose the optimal combination, an objective function is defined which must comply with the restrictions of cluster formation in an unsupervised classification:
- minimizing the intra-cluster dispersion;
- maximizing the inter-cluster dispersion.
DOI: 10.12948/issn14531305/19.3.2015.10

The role of these two conditions is to ensure that the groups of objects obtained for each combination are as tightly packed as possible and at the same time clearly separated from one another. A group is the set of documents written by authors who have the same country of origin. In this regard, clustering is carried out up to the level that defines the country of origin.
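The search over combinations of characteristics and the compactness and separation requirement can be sketched as follows. This is an assumption-laden illustration, not the paper's objective function: the metric labels are placeholders, dispersion is measured here with squared Euclidean distances to centroids, and the ratio of the two dispersions is one possible way of combining the two conditions.

```python
from itertools import combinations

# Placeholder labels for the eight characteristics.
METRICS = ["word_len", "sent_len", "connectives", "symbols",
           "type_token", "semantic", "ISC", "IPSC"]


def all_combinations(metrics):
    """Every non-empty subset of the metrics."""
    return [c for k in range(1, len(metrics) + 1)
            for c in combinations(metrics, k)]


def centroid(points):
    """Component-wise mean of a list of equally sized vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dispersion_ratio(clusters):
    """Dispersion inside clusters divided by dispersion between clusters.

    Lower values mean tighter and better separated groups.
    """
    cents = [centroid(c) for c in clusters]
    overall = centroid([p for c in clusters for p in c])
    intra = sum(sq_dist(p, m) for c, m in zip(clusters, cents) for p in c)
    inter = sum(len(c) * sq_dist(m, overall) for c, m in zip(clusters, cents))
    return intra / inter if inter else float("inf")


subsets = all_combinations(METRICS)  # one candidate per non-empty subset
```

In such a setup, each candidate subset would be used to project the documents onto the selected characteristics, the documents would be grouped by country of origin, and the subset with the lowest ratio would be preferred.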
To extract the cultural component, the centroid, also called the average value, is selected for each cluster. In Figure 1, objects A, B, C, D, E, F and G represent scientific papers which, after an analysis of the similarity between them, gradually form clusters.

Fig. 1. Representation of object clustering using the hierarchical clustering algorithm [14]

The dotted line in Figure 1 is the level defined by authors having the same country of origin, at which the cultural component is extracted. For the example in Figure 1, the first group consists of objects A, B and C, the second group of objects D and E, and the third and final group covers the last two objects, F and G. If an analysis at a higher level, grouping countries, is desired, then the hierarchical clustering algorithm stops above the level defined by the dotted line, a level at which the groups are made up of multiple country groups previously analyzed.
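The idea of stopping the hierarchical clustering at a chosen level can be sketched with a small agglomerative procedure. The sketch assumes average linkage and Euclidean distance, neither of which is stated in the text, and the one-dimensional coordinates assigned to objects A to G are invented purely to reproduce the grouping described for Figure 1.

```python
from math import dist


def agglomerate(points, cut):
    """Merge the two closest clusters until the closest pair is farther than `cut`.

    points: dict mapping an object label to its feature vector.
    cut: distance level at which merging stops (the horizontal line in a dendrogram).
    Returns a sorted list of sorted label lists.
    """
    clusters = [[label] for label in points]

    def linkage(a, b):
        # Average linkage: mean pairwise distance between members.
        total = sum(dist(points[x], points[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > cut:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Invented coordinates, chosen only for illustration.
papers = {"A": (0,), "B": (1,), "C": (2,), "D": (10,),
          "E": (11,), "F": (20,), "G": (21,)}
country_level = agglomerate(papers, cut=3)
higher_level = agglomerate(papers, cut=12)
```

Raising the cut level corresponds to stopping the algorithm higher in the hierarchy, where previously separate groups are merged into larger ones.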
In order to identify correlations between the analyzed characteristics, an open source application is used to convert Excel files into the ARFF file type, the format needed for the data mining analysis.
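For orientation, the structure of an ARFF file with numeric attributes can be sketched as below. This is not the conversion application used in the paper: it assumes the spreadsheet values are already available in memory as rows of numbers, and the relation and attribute names are hypothetical. Attribute names containing spaces would need quoting, which the sketch does not handle.

```python
def to_arff(relation, attributes, rows):
    """Return ARFF text for purely numeric data.

    relation: name of the dataset.
    attributes: list of attribute names without spaces.
    rows: list of rows, each with one numeric value per attribute.
    """
    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE {name} NUMERIC" for name in attributes]
    lines += ["", "@DATA"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-attribute dataset with two documents.
arff_text = to_arff("style", ["avg_word_length", "type_token_ratio"],
                    [[4.8, 0.52], [5.1, 0.47]])
```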
To create an ARFF file using this Excel-to-ARFF conversion application, the path of the Excel file is chosen. An example of loading the Excel file into the application in order to be converted is depicted in the corresponding figure.

The attribute selection component in Weka handles principal component analysis, which is performed in order to minimize the duplication of information and the relationships between instances. For the analyzed example, the eight features are transformed into six new vectors used to explain the information. The transformation expresses the attributes and their values so as to minimize the loss of information while keeping the data size as small as possible. Figure 6 highlights the generated eigenvalues.
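The caption of Fig. 6 mentions a selection threshold of 80%. One common reading of such a threshold, assumed here rather than taken from the paper, is that components are kept in decreasing order of eigenvalue until their cumulative share of the total variance reaches the threshold. The eigenvalues in the example are invented.

```python
def components_for_threshold(eigenvalues, threshold=0.80):
    """Smallest number of leading components whose cumulative share
    of the total variance reaches `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive value")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value / total
        if cumulative >= threshold:
            return k
    return len(ordered)


# Invented eigenvalues: shares are 0.5, 0.25, 0.125 and 0.125.
kept = components_for_threshold([4, 2, 1, 1])
```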

Figure 4 highlights this information after inserting the path to the data file Digital Economy Ranking.

Fig. 6. Results of principal components analysis using a selection threshold of 80%

Fig. 7. Descriptive information upon the variables used in the model

Table 1. Text characteristics, software tools and resources involved in the case of intrinsic plagiarism

Table 3. Correlation matrix for the input variables