Source Code Plagiarism Detection Method Using Protégé Built Ontologies

Software plagiarism is a growing and serious problem that affects computer science faculties in particular and the quality of education in general. More and more students copy the software for their theses from older theses or from internet databases. Manually checking source code for similar or identical fragments is laborious and time-consuming, and may even be impossible given the size of existing digital repositories. An ontology is a way of describing the semantics of a document, so it can readily be applied to source code files as well. The OWL Web Ontology Language can be used to describe both the vocabulary and the taxonomy of source code written in a programming language. SPARQL is a query language with SQL-like syntax that extracts stored or inferred information from ontologies. Our paper proposes a source code plagiarism detection method, based on ontologies created with the Protégé editor, which can be applied to scanning the software source code of students' theses.


Introduction
Today we have a huge volume of digital information, which is very useful on the one hand but a disadvantage on the other. The useful part is that, by taking advantage of digital repositories, we can find any needed information more quickly than in the past (at the click of a button, as we usually say). The disadvantage is that finding similar or duplicated documents has become very difficult, especially when the job is done manually. That is why we look for alternative solutions in the field of plagiarism detection systems [1].
The term "ontology" is inherited from philosophy, where it refers to existence and the things that exist. In computer science those things are represented by data, and an ontology generally describes the semantics of the terms used in a specific domain (in our case programming), providing a vocabulary for that domain as well as a computerized specification of the meaning of the terms in that vocabulary. Ontologies range from taxonomies and classifications, through database schemas, to fully axiomatized theories. In recent years, ontologies have been adopted in many business and scientific communities as a way to share, reuse and process domain knowledge. Ontologies are now central to many applications such as scientific knowledge portals, information management and integration systems, electronic commerce, and semantic web services [2]. In our work we use ontologies to build the knowledge graph specific to each source code that we suspect of plagiarism.
The OWL Web Ontology Language is a specification of the World Wide Web Consortium (W3C) and serves as a fundamental component of the Semantic Web initiative. OWL is based upon the Extensible Markup Language (XML), XML Schema [3], the Resource Description Framework (RDF) and RDF Schema (RDF-S) [4]. It comprises three sublanguages, OWL-Lite, OWL-DL and OWL-Full; of these, OWL-DL is the one most often used because it provides maximum expressiveness while retaining computational completeness and decidability.
The Resource Description Framework (RDF) is a language for representing information about resources in the World Wide Web. It is particularly intended for representing metadata about web resources, such as the title, author, and modification date of a web page, copyright and licensing information about a web document, or the availability schedule for some shared resource [4]. However, by generalizing the concept of a web resource, RDF can also be used to represent information about things that can be identified on the web, even when they cannot be directly retrieved on the web. RDF is intended for situations in which this information needs to be processed by applications, rather than only being displayed to people. RDF provides a common framework for expressing this information so that it can be exchanged between applications without loss of meaning. Since it is a common framework, application designers can leverage the availability of common RDF parsers and processing tools. The ability to exchange information between different applications means that the information may be made available to applications other than those for which it was originally created.
We will use RDF and OWL in our method as the standards and formats for saving the ontologies created with the Protégé editor. We prefer this approach because they are W3C standards, and in this way we can provide interoperability between our work and future related works.
Protégé is a free, open source ontology editor and knowledge-base framework that provides a suite of tools to construct domain models and knowledge-based applications with ontologies. At its core, Protégé implements a rich set of knowledge-modeling structures and actions that support the creation, visualization, and manipulation of ontologies in various representation formats. Protégé can be customized to provide domain-friendly support for creating knowledge models and entering data [2].
SPARQL for RDF [5] is a query language that can be used to retrieve information across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports extensible value testing and constraining queries by source RDF graph. The results of SPARQL queries can be result sets or RDF graphs.
DOI: 10.12948/issn14531305/17.3.2013.07

Proposed Method
Our method is a step-by-step algorithm built with the help of the Protégé editor, version 4.3.0. The Protégé platform supports two main ways of modeling ontologies:
• the Protégé-Frames editor enables users to build and populate ontologies that are frame-based, in accordance with the Open Knowledge Base Connectivity protocol (OKBC). In this model, an ontology consists of a set of classes organized in a subsumption hierarchy to represent a domain's salient concepts, a set of slots associated to classes to describe their properties and relationships, and a set of instances of those classes, that is, individual exemplars of the concepts that hold specific values for their properties;
• the Protégé-OWL editor enables users to build ontologies for the Semantic Web, in particular in the W3C's Web Ontology Language (OWL). An OWL ontology may include descriptions of classes, properties and their instances. Given such an ontology, the OWL formal semantics specifies how to derive its logical consequences, i.e. facts not literally present in the ontology, but entailed by the semantics. These entailments may be based on a single document or on multiple distributed documents that have been combined using defined OWL mechanisms.
As we have already stated, we choose the second way of modeling ontologies provided by Protégé and create ontologies based on the W3C's OWL.
The first step in the development of the source code plagiarism detection system is building the needed OWL classes [6]. This approach is similar to the OOP paradigm [7]. We will implement classes such as Variable, Constant, DataType, ProgrammingStructure, Comment, SystemFunction and Operator. The classes created within the editor are visible in Figure 1.
We can also define specialized concepts that can in turn be used to build taxonomies. This is the case of RepetitiveStructure and ConditionalStructure from Figure 1. They are defined as special programming structures (subclasses of ProgrammingStructure). The corresponding OWL syntax for this is:
To define relations between the modeled concepts we use ObjectProperty. These relations can be marked as transitive, symmetric or functional. Two relations can be marked as inverse to each other. Furthermore, relations can be specialized by using subPropertyOf, in analogy to subClassOf for concepts. The following example is shown in Figure 2.
Other relations defined in our ontology are is_included_in (which is marked as transitive) and is_type_of. We could limit their domain and range as well, to ProgrammingStructure or Variable and DataType. The corresponding OWL syntax for them is:
We will do the same for the rest of the individuals found in the investigated source code. In this way we create an ontology for each source code that is suspected of plagiarism. We do this manually in Protégé for demonstration purposes only. This process can be automated by building a crawler that reads the source code and builds the corresponding OWL file [8]. The crawler receives the raw source code as input and returns as output the OWL file of the built ontology. In this way we will have an ontology file for each source code, regardless of the programming language in which it is written. We choose as an example the following source code written in C:
The next and final step in our proposed method is to find a way of comparing the two ontologies obtained from the presented process. One solution is to take advantage of the fact that OWL ontologies are based on RDF and to build different SPARQL queries for comparing the source codes. The queries will depend on the algorithms that we want to test. For example, we can choose some metrics that are measured using SPARQL and then compared to obtain the plagiarism degree.
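As a rough illustration of the class hierarchy and relations described above, the following Python sketch writes them out as OWL in RDF/XML using only the standard library. It is not the listing exported from Protégé: the namespace `http://example.org/source-code#`, the helper names and the chosen domain and range of each property are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
BASE = "http://example.org/source-code#"  # hypothetical ontology namespace

# Use readable prefixes instead of the default ns0, ns1, ... when serializing.
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("owl", OWL)):
    ET.register_namespace(prefix, uri)


def add_class(root, name, parent=None):
    """Append an owl:Class, optionally declared rdfs:subClassOf a parent."""
    cls = ET.SubElement(root, f"{{{OWL}}}Class", {f"{{{RDF}}}about": BASE + name})
    if parent is not None:
        ET.SubElement(cls, f"{{{RDFS}}}subClassOf",
                      {f"{{{RDF}}}resource": BASE + parent})
    return cls


def add_object_property(root, name, domain, range_, transitive=False):
    """Append an owl:ObjectProperty with a domain, a range and an optional
    owl:TransitiveProperty marker."""
    prop = ET.SubElement(root, f"{{{OWL}}}ObjectProperty",
                         {f"{{{RDF}}}about": BASE + name})
    if transitive:
        ET.SubElement(prop, f"{{{RDF}}}type",
                      {f"{{{RDF}}}resource": OWL + "TransitiveProperty"})
    ET.SubElement(prop, f"{{{RDFS}}}domain", {f"{{{RDF}}}resource": BASE + domain})
    ET.SubElement(prop, f"{{{RDFS}}}range", {f"{{{RDF}}}resource": BASE + range_})
    return prop


def build_ontology():
    root = ET.Element(f"{{{RDF}}}RDF")
    for name in ("Variable", "Constant", "DataType", "ProgrammingStructure",
                 "Comment", "SystemFunction", "Operator"):
        add_class(root, name)
    # Specialized concepts: subclasses of ProgrammingStructure.
    add_class(root, "RepetitiveStructure", parent="ProgrammingStructure")
    add_class(root, "ConditionalStructure", parent="ProgrammingStructure")
    # Domain and range below are assumptions made for this sketch.
    add_object_property(root, "is_included_in", "ProgrammingStructure",
                        "ProgrammingStructure", transitive=True)
    add_object_property(root, "is_type_of", "Variable", "DataType")
    return root


owl_xml = ET.tostring(build_ontology(), encoding="unicode")
print(owl_xml)
```

The resulting string can be saved with an .owl extension; individuals found in a source file would be added in the same way, as elements typed by one of these classes.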
SPARQL is the standard query language for accessing RDF data [10], where the basic access pattern is called the triple pattern. A triple pattern has the same form as an RDF triple, but may contain variables. As the counterpart of select-project-join queries in SQL, a SPARQL query supports both conjunctions and disjunctions of triple patterns. Furthermore, the predicates in a SPARQL query can also be variables, which allows "predicate-agnostic" queries. The Protégé editor provides a user interface where we can run SPARQL queries (as shown in Figure 4), but it limits our output to result sets.
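To make the notion of a triple pattern concrete, the sketch below implements a toy matcher in Python over an in-memory list of triples. It is not a SPARQL engine and supports only conjunctions of patterns; the individuals (`i`, `sum`, `for_1`) and the `?name` convention for variables are assumptions made for this example.

```python
def is_var(term):
    """Variables are written like SPARQL variables, with a leading '?'."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; None if impossible."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check it against an earlier binding.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def query(triples, patterns):
    """Evaluate a conjunction of triple patterns and return all bindings."""
    bindings = [{}]
    for pattern in patterns:
        extended_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    extended_bindings.append(extended)
        bindings = extended_bindings
    return bindings


# Hypothetical individuals, as they might be extracted from a small C file.
triples = [
    ("i", "rdf:type", "Variable"),
    ("i", "is_type_of", "int"),
    ("sum", "rdf:type", "Variable"),
    ("sum", "is_type_of", "int"),
    ("for_1", "rdf:type", "RepetitiveStructure"),
    ("i", "is_included_in", "for_1"),
]

# Conjunction of two patterns: every variable whose data type is int.
int_variables = query(triples, [("?v", "rdf:type", "Variable"),
                                ("?v", "is_type_of", "int")])

# Predicate-agnostic pattern: everything that is stated about `i`.
about_i = query(triples, [("i", "?p", "?o")])
```

Here `int_variables` binds `?v` to `i` and to `sum`, and `about_i` returns one binding for each of the three statements whose subject is `i`.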

Fig. 4. Protégé SPARQL Query Editor
We will define ten metrics [11] (presented in Table 1) that are measured separately for each ontology. Based on these metrics we compute a plagiarism degree.
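The sketch below shows one way such metrics could be measured and compared. Since the rows of Table 1 are not reproduced here, the metric set (one count of individuals per class), the `sc:` prefix, the min/max similarity ratio and the sample triples are all assumptions made for this example. The generated query text uses the SPARQL 1.1 COUNT aggregate; because the Python standard library has no SPARQL engine, the same count is evaluated directly over in-memory triples.

```python
RDF_TYPE = "rdf:type"

# Hypothetical metric set: the number of individuals of each class.
METRIC_CLASSES = ("Variable", "Constant", "DataType", "RepetitiveStructure",
                  "ConditionalStructure", "Comment", "SystemFunction", "Operator")


def count_query(class_name):
    """Build the text of a SPARQL query that counts individuals of a class."""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX sc: <http://example.org/source-code#>\n"
        f"SELECT (COUNT(?x) AS ?n) WHERE {{ ?x rdf:type sc:{class_name} . }}"
    )


def measure(triples):
    """Evaluate every metric over one ontology given as a list of triples."""
    counts = {name: 0 for name in METRIC_CLASSES}
    for _subject, predicate, obj in triples:
        if predicate == RDF_TYPE and obj in counts:
            counts[obj] += 1
    return counts


def similarity(count_a, count_b):
    """Ratio of the smaller count to the larger one, in [0, 1].
    Two zero counts are treated as a perfect match (an assumption)."""
    largest = max(count_a, count_b)
    return 1.0 if largest == 0 else min(count_a, count_b) / largest


# Hypothetical individuals for two ontologies under comparison.
first_ontology = [
    ("i", RDF_TYPE, "Variable"),
    ("sum", RDF_TYPE, "Variable"),
    ("for_1", RDF_TYPE, "RepetitiveStructure"),
    ("if_1", RDF_TYPE, "ConditionalStructure"),
]
second_ontology = [
    ("k", RDF_TYPE, "Variable"),
    ("for_1", RDF_TYPE, "RepetitiveStructure"),
    ("if_1", RDF_TYPE, "ConditionalStructure"),
]

first_counts = measure(first_ontology)
second_counts = measure(second_ontology)
per_metric = {name: similarity(first_counts[name], second_counts[name])
              for name in METRIC_CLASSES}
```

Counting individuals ignores their names, so renaming a variable does not change any metric; this also means that unrelated programs with similar counts will score high, which is one reason the choice of metrics matters.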

Total plagiarism degree (arithmetic mean of the metrics): 95.5 %
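The total above is the arithmetic mean of the per-metric results. A minimal sketch of this computation follows, generalized with optional per-metric weights as one possible way of letting a user control how much each metric influences the result. The metric names and similarity values are invented for illustration and are not the values of Table 1.

```python
def plagiarism_degree(similarities, weights=None):
    """Weighted mean of per-metric similarities (each in [0, 1]), in percent.
    With no weights given, every metric counts equally (arithmetic mean)."""
    if weights is None:
        weights = {name: 1.0 for name in similarities}
    total_weight = sum(weights[name] for name in similarities)
    if total_weight <= 0:
        raise ValueError("at least one metric needs a positive weight")
    weighted_sum = sum(similarities[name] * weights[name] for name in similarities)
    return 100.0 * weighted_sum / total_weight


# Hypothetical per-metric similarities between two ontologies.
similarities = {"variables": 1.0, "loops": 1.0, "comments": 0.5, "operators": 0.75}

equal_weights = plagiarism_degree(similarities)

# A user who distrusts comments and cares most about variables:
custom_weights = plagiarism_degree(
    similarities,
    weights={"variables": 2.0, "loops": 1.0, "comments": 0.0, "operators": 1.0},
)
```

With equal weights the example evaluates to 81.25 %, and with the custom weights to 93.75 %, which shows how strongly the outcome depends on the chosen metrics and their weighting.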
As we can see, this method is precise enough to determine a plagiarism degree, but the result depends very much on the chosen metrics. So, if we build a software application around it, a better approach is to let the end user choose the metrics of interest and how much each one influences the final result (in our case we give all of them equal weight). Because this method is not as accurate as we would wish, we have searched for alternative solutions in the field of ontologies to confirm the result obtained in this way. We found that another way of comparing the ontologies of two source codes is to use the graphical representation of semantic networks. A semantic network (in some cases called a concept network) is a graph in which the nodes represent concepts and the arcs represent the relations between the concepts [12]. Most semantic networks are cognitively based. Their arcs and nodes can be organized into a taxonomic hierarchy. Semantic networks contributed the ideas of spreading activation, inheritance, and nodes as proto-objects. They are intractable for large domains. Some properties are not easily expressed in a semantic network, e.g., negation, disjunction, and general non-taxonomic knowledge. Expressing these relationships requires workarounds, such as having complementary predicates and using specialized procedures to check for them, but this was not a problem in our method.
A particular case of a semantic network representation is the topic map.The Topic Maps family of standards is designed to facilitate the gathering of all the information about a subject at a single location.The information about a subject includes its relationships to other subjects; such relationships may also be treated as subjects (subject-centric) [13].
These visual representations of ontologies can help in our method. Topic models (which can be viewed as the Bayesian version of latent semantic analysis) are useful for extracting semantic content from any type of collection. After topic modeling, the topic representation is projected onto two dimensions to create the topic map visualization [14]. The first OntoGraph (shown in Figure 5) is the representation of the C ontology with its specific individuals. The second one (shown in Figure 6) is the JavaScript ontology. We can see that this one has a different set of individuals. To obtain a reliable detection system based on this method, all the steps of the presented architecture must be automated. To create such a system we will need a crawler that parses the code and extracts the OWL ontology, a set of defined metrics, each with its own dynamically generated SPARQL query, and a custom representation of all the involved topic maps. These components will be created in our future work.
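The automated pipeline outlined above could be prototyped along the following lines. This is only a sketch under strong assumptions: the crawler is a set of regular expressions that recognizes a few C constructs rather than a real parser, it emits in-memory triples instead of an OWL file, only three count-based metrics are used, and the topic map representation is left out. All function names and sample programs are hypothetical.

```python
import re

# Very rough recognizers for a few C constructs (an assumption, not a parser).
DECLARATION = re.compile(r"\b(int|float|double|char)\s+([A-Za-z_]\w*)\s*(?=[;=,])")
LOOP = re.compile(r"\b(for|while)\b")
CONDITIONAL = re.compile(r"\b(if|switch)\b")


def crawl(source):
    """Toy crawler: map C-like source text to (subject, predicate, object) triples."""
    triples = []
    for data_type, name in DECLARATION.findall(source):
        triples.append((name, "rdf:type", "Variable"))
        triples.append((name, "is_type_of", data_type))
    for index, match in enumerate(LOOP.finditer(source), start=1):
        triples.append((f"{match.group(1)}_{index}", "rdf:type", "RepetitiveStructure"))
    for index, match in enumerate(CONDITIONAL.finditer(source), start=1):
        triples.append((f"{match.group(1)}_{index}", "rdf:type", "ConditionalStructure"))
    return triples


def measure(triples):
    """Count the individuals of each class of interest."""
    counts = {"Variable": 0, "RepetitiveStructure": 0, "ConditionalStructure": 0}
    for _subject, predicate, obj in triples:
        if predicate == "rdf:type" and obj in counts:
            counts[obj] += 1
    return counts


def detect(source_a, source_b):
    """Crawl both sources, measure them and return a plagiarism degree in percent."""
    counts_a, counts_b = measure(crawl(source_a)), measure(crawl(source_b))
    ratios = []
    for name in counts_a:
        largest = max(counts_a[name], counts_b[name])
        ratios.append(1.0 if largest == 0 else min(counts_a[name], counts_b[name]) / largest)
    return 100.0 * sum(ratios) / len(ratios)


original = ("int main() { int i = 0; int sum = 0; "
            "for (i = 0; i < 10; i++) { if (i > 2) sum += i; } return sum; }")
# Same structure, identifiers renamed.
renamed = ("int main() { int k = 0; int total = 0; "
           "for (k = 0; k < 10; k++) { if (k > 2) total += k; } return total; }")
# Structurally different program.
different = ("int main() { int x = 1; while (x < 100) { x = x * 2; } "
             "while (x > 0) { x--; } return x; }")

degree_renamed = detect(original, renamed)
degree_different = detect(original, different)
```

Because the metrics are counts of individuals, the renamed copy scores 100 % against the original while the structurally different program scores much lower. A production crawler would need a proper parser for each supported language and a richer metric set.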

Conclusions
In this paper we have shown that ontologies can be used to detect source code plagiarism. By using the OWL Web Ontology Language, which is based on the Resource Description Framework (RDF), and the RDF-based query language SPARQL, we can extract the needed information from an ontology built on the vocabulary and taxonomy of a programming language source code. We saw that one way of constructing this kind of ontology is Protégé, a free, open source ontology editor, and that besides the metrics that can be measured using SPARQL we can inspect the graphical representation of the ontology by using a topic map. However, the real benefit of using ontologies in complex software plagiarism detection systems is that the whole detection process can be automated, and in this way we can improve the quality of students' theses in particular and the quality of education in general. The introduced approaches are a good starting point for future work on a fully automatic system for source code plagiarism detection.

Fig. 7. Architecture of the plagiarism detection method