Abstract
Plagiarism has become a major concern in academia. Universities are increasingly worried about it because the growth of the internet, and of social media in particular, gives students ever more opportunity to copy and paste electronic content. Students in today's digital era exchange and copy information to maintain their social standing within their circle, without considering the long-term negative impact, especially on their careers. Their attitude can be summed up as 'the more you exchange, the more social you are'. Hawk Eye, an innovative mobile plagiarism detection system, was an initiative to curb this kind of plagiarism, especially in university labs. Combining Hawk Eye with Cohort Intelligence (CI) pairs a hawk's keen, observant eyesight with CI's self-supervising nature, giving a clearer view of ordinary student behavior. This also helps in taking appropriate preventive measures to address plagiarism among students at its root. 'Hawk Eye for Cohort (HEC)', based on a comparative analysis of algorithms such as CI and Genetic Search (GA), can play an important role in formulating behavioral distribution patterns of students. The CI algorithm uses its self-supervising mechanism to improve an individual's behavior within a cohort; by observing these behavioral patterns, teachers can decide how to redesign evaluation systems to check and stop plagiarism among students. The final outcome of HEC would be an incrementally learning evaluation system that grows iteratively as cohort behavioral patterns evolve with every new batch of students. The search for these evolving behavioral patterns can be optimized using GA.
HEC would be a concrete evaluation system for analyzing the percentage of plagiarism among students, understanding the real-time reasons behind its growth, and devising suitable prevention measures. Studying cohort behavioral distribution patterns with algorithms such as GA and CI, using different cohort analysis tools, to detect plagiarism rooted in students' social thinking is a new idea, which this paper discusses in detail.
1 Introduction
Hawk Eye [1] is an innovative mobile plagiarism detection system. It uses a mobile scanner OCR (Optical Character Recognition) engine to convert captured snapshots into text. The OCR engine [2] preprocesses the captured image to remove noise and disturbance and extracts the relevant keywords from it. The system then applies plagiarism detection algorithms that discard unnecessary details such as comments and neutralize superficial changes such as renamed variables. Finally, the mobile version of a plagiarism detection application such as Viper or Plagiarism Checker can be used to detect plagiarism. Figure 1 shows the flowchart of the Hawk Eye system.
Once the Hawk Eye system has been deployed, GA and CI [3] come into play. The proposed Genetic Search and CI procedures can be deployed to formulate varied, optimized behavioral patterns of different students. Suitable remedial measures can then be taken, and an appropriate evaluation system designed, to prevent plagiarism among students. The flow for the analysis of cohorts (students) and the Cohort Intelligence system is shown in Fig. 2.
Hence, the complete system is a hybrid combination of the following procedures:
2 Literature Review
A number of studies have reported on the feasibility of detecting plagiarism with different tools in different contexts.
Overview and Comparison of Plagiarism Detection Tools [4]: Asim et al. examined various plagiarism detection tools with respect to their features and performance. The comparison shows that no tool can detect or prove plagiarism in a document with 100 % certainty, because each tool has its own advantages and limitations. Given these limitations, a set of parameters and rules can be suggested that should be considered in order to overcome plagiarism in academia.
Computer-Based Plagiarism Detection Methods and Tools [5]: Romans et al. introduce different ways to reduce plagiarism through both prevention and detection. Their analysis of well-known plagiarism detection tools such as Turnitin, Wordcheck, Moss, JPlag and many more shows that, although these tools provide excellent detection services, even such advanced tools cannot detect plagiarism the way a human can manually. The paper therefore concludes that the human brain is a universal plagiarism detection tool, which analyzes documents using various statistical and semantic methods and can thus operate on both textual and non-textual information. At present, such abilities are not available in advanced plagiarism detection software.
Software Plagiarism Detection using Model Driven Software Development in Eclipse Platform [6]: Pierre describes the concept, design and development of a software plagiarism detection application based on the Eclipse Platform. It is a generic front-end approach that converts source programs from different programming languages into generic models in order to detect source code plagiarism. The results highlight that the application, like any other plagiarism detection software, does not provide absolute results; rather, it signals that the source code submitted by a student requires further investigation for plagiarism.
Plagiarism Detection in Java Code [7]: Ahmad Gull and Aijaz focus on approaches that could assist teachers in detecting plagiarism in Java programs. They also highlight the properties of different approaches to detecting plagiarism in Java code files and accordingly recommend the best approaches for future work and study.
3 Our Approach and Contributions
The studies above make it clear that no plagiarism tool can detect plagiarism in a document or source code with 100 % accuracy. Every tool suffers from some limitation, which undermines confidence in any particular plagiarism detection tool.
Hawk Eye is an initiative in this regard that tries to overcome some of the limitations of existing tools using its keen and observant eyesight. It takes various parameters into consideration while detecting plagiarism, including database checking, structure checking, supported languages, the use of tokens rather than just strings, abstract syntax trees and various hybrid approaches. To strike a balance among the best features already available in detection tools such as Moss, JPlag and Turnitin, we studied the limitations of existing tools and how these limitations can be overcome.
Extending Hawk Eye with Genetic Search and Cohort Intelligence will make it a more concrete system, especially in terms of confidence in it as plagiarism detection software. With a continuously evolving evaluation system, which is the final objective of HEC, HEC will strengthen itself as an evaluation and prevention system for plagiarism as new behavioral patterns of a cohort emerge.
4 Methodology for Hawk Eye
A. Mobile Scanner OCR Engine Working
Smartphone scanning apps such as OCR Instantly and Cam-Scanner [8] can be used in the proposed plagiarism detection system to extract relevant text from a captured snapshot. The complete process can be summarized as:
1. Enhance Image (i.e. image preprocessing) for better image quality and as much noise reduction as possible.
2. Use OCR to extract the relevant text in editable form from the captured image.
3. Save As (default options available: PDF/JPEG format) in an appropriate format.
4. Share (as available in scanners like OCR Instantly) instantly by various means with just one click.
This concept of OCR [2] can be further extended to IWR [9] (Intelligent Word Recognition), which can be used to detect handwritten plagiarized code by students.
B. Plagiarism Detection Methods
B.1 Abstract Syntax Tree
An Abstract Syntax Tree or AST [10, 11] is a hierarchical representation of a program. Each node represents a programming language construct, and its children are the parameters of this construct. The nodes of an AST [11] can be mathematical operators, function calls or other programming structures; the leaves are variables or constants. Compilers perform optimizations on the AST before generating lower-level code; because the AST captures program structure in this way, it can also be used in plagiarism detection.
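As an illustration of why an AST is useful here, the following minimal sketch compares programs by the sequence of their AST node types, so that renamed identifiers do not affect the comparison. It uses Python's standard `ast` module on Python source purely as a stand-in; the function name `structure_signature` and the sample programs are assumptions made for this example and are not part of Hawk Eye.

```python
import ast


def structure_signature(source):
    """Return the AST node type names of a program, ignoring identifiers and constants."""
    tree = ast.parse(source)  # comments and formatting are already dropped by the parser
    return [type(node).__name__ for node in ast.walk(tree)]


original = """
def total(values):
    result = 0
    for v in values:
        result = result + v
    return result
"""

# Same program with every identifier renamed.
renamed = """
def sum_all(items):
    acc = 0
    for x in items:
        acc = acc + x
    return acc
"""

# A structurally different solution to the same task.
different = """
def total(values):
    return sum(values)
"""

same_structure = structure_signature(original) == structure_signature(renamed)
differs = structure_signature(original) != structure_signature(different)
print(same_structure, differs)
```

Comparing only node types is deliberately coarse: it flags the renamed copy as structurally identical, but a real detector would also need to tolerate reordered statements and equivalent expressions.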
B.2 Tokenizing String Based System
This approach treats the program as ordinary text. The pre-processing phase removes all comments, white space and punctuation, and renames variables to a common token. The resulting token sequences are then compared as strings using the Karp-Rabin algorithm [12–14], which relies on a hash function to compute hash values for the substrings being compared. The main advantage of tokens is that they discard all unnecessary information; token-based systems are therefore insensitive to “search and replace” changes.
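The token-based comparison can be sketched as follows. This is a simplified illustration under stated assumptions, not the Hawk Eye implementation: the keyword set, the regular-expression tokenizer, the window size `k`, the hash base and modulus, and the Jaccard similarity score are all choices made for this example. Unlike the description above, the sketch keeps operators and punctuation as tokens so that short programs still produce enough windows to compare.

```python
import re
import zlib

KEYWORDS = {"int", "for", "if", "else", "while", "return", "void", "float"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")
BASE = 257
MOD = (1 << 61) - 1


def normalize(source):
    """Strip comments and map identifiers and numbers to common tokens."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)               # line comments
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def fingerprints(tokens, k=4):
    """Karp-Rabin rolling hash over every window of k consecutive tokens."""
    codes = [zlib.crc32(t.encode()) for t in tokens]
    if len(codes) < k:
        return set()
    high = pow(BASE, k - 1, MOD)  # weight of the token leaving the window
    h = 0
    for c in codes[:k]:
        h = (h * BASE + c) % MOD
    prints = {h}
    for i in range(k, len(codes)):
        # Remove the oldest token, shift, and append the newest token.
        h = ((h - codes[i - k] * high) * BASE + codes[i]) % MOD
        prints.add(h)
    return prints


def similarity(a, b, k=4):
    """Jaccard overlap of the two fingerprint sets."""
    fa, fb = fingerprints(normalize(a), k), fingerprints(normalize(b), k)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


original = "int sum = 0; // running total\nfor (int i = 0; i < n; i++) sum = sum + a[i];"
disguised = "int acc = 0; /* renamed */\nfor (int j = 0; j < len; j++) acc = acc + data[j];"
unrelated = "while (x > 0) { x = x - 1; }"

print(similarity(original, disguised), similarity(original, unrelated))
```

Because both variable names and comments are normalized away before hashing, the disguised copy yields the same token sequence as the original, which is the "search and replace" insensitivity described above.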
5 Methodology for Genetic Search and Cohort Intelligence
CI [15] is a novel methodology inspired by the self-supervised learning behavior of candidates in a cohort. Cohort analysis makes it possible to identify relationships between the different characteristics and behaviors of a population.
Source code plagiarism can arise from different possible sources, ranging from low through medium to high level. A means is required to validate and optimize this range of plagiarism cases. GA can be used to test the validity and appropriateness of the selected cases. Thereafter, the proposed Cohort Intelligence procedure can be applied to the same cases and to an incrementally evolving dataset that grows iteratively with every new batch of students. This strengthens and confirms the assumed cases.
In this way, we can optimize our search for possible sources of plagiarism and then, by applying the Cohort Intelligence procedure, formulate cohort buckets that categorize the different behavioral characteristics of students.
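A cohort bucket table of this kind could look like the sketch below. The records, the batch labels and the low/medium/high thresholds are made-up values chosen only to illustrate the grouping; they are not results from this study.

```python
from collections import defaultdict

# Made-up records: (batch, student, similarity score in [0, 1]).
records = [
    ("2014", "s1", 0.10), ("2014", "s2", 0.55), ("2014", "s3", 0.90),
    ("2015", "s4", 0.20), ("2015", "s5", 0.25), ("2015", "s6", 0.70),
]


def bucket(score, low=0.3, high=0.6):
    """Map a similarity score to a plagiarism level (thresholds are assumptions)."""
    if score < low:
        return "low"
    if score < high:
        return "medium"
    return "high"


def cohort_table(rows):
    """Count students per plagiarism level for each batch (cohort)."""
    table = defaultdict(lambda: {"low": 0, "medium": 0, "high": 0})
    for batch, _student, score in rows:
        table[batch][bucket(score)] += 1
    return {batch: dict(counts) for batch, counts in table.items()}


table = cohort_table(records)
for batch, counts in sorted(table.items()):
    print(batch, counts)
```

Appending the records of each new batch and recomputing the table is one simple way to let the dataset grow incrementally, as described above.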
5.1 Possible Cases of Source Code Plagiarism
Low Level Code Plagiarism Sources
Case 1. Attribute-based code plagiarism: The number of variables, functions and classes, the size of the code and similar attributes could be one possible source of plagiarism. Calculating code metrics could be one means of preventing this type of plagiarism.
Case 2. Token-based code plagiarism: Renaming or changing methods, fields, classes or identifiers, replacing expressions with equivalent ones, or changing comments or indentation could be another possible cause. To counter this type of plagiarism, the code can be tokenized and hashed, creating fingerprints of the code and reducing the growing incidence of plagiarism.
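For Case 1, a code metrics comparison might be sketched as follows. The choice of metrics (functions, classes, distinct assigned variables, non-blank lines), the distance measure and the use of Python source as the submission language are assumptions for illustration only.

```python
import ast


def code_metrics(source):
    """Compute a small attribute vector for one submission."""
    nodes = list(ast.walk(ast.parse(source)))
    assigned = {
        n.id for n in nodes
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    }
    return {
        "functions": sum(isinstance(n, ast.FunctionDef) for n in nodes),
        "classes": sum(isinstance(n, ast.ClassDef) for n in nodes),
        "variables": len(assigned),
        "lines": len([ln for ln in source.splitlines() if ln.strip()]),
    }


def metric_distance(m1, m2):
    """Sum of absolute differences; 0 means identical attribute vectors."""
    return sum(abs(m1[key] - m2[key]) for key in m1)


submission_a = """
def total(values):
    result = 0
    for v in values:
        result = result + v
    return result
"""

# Same code with renamed identifiers.
submission_b = """
def sum_all(items):
    acc = 0
    for x in items:
        acc = acc + x
    return acc
"""

# An independently written solution with a different shape.
submission_c = """
def total(values):
    return sum(values)
"""

print(metric_distance(code_metrics(submission_a), code_metrics(submission_b)))
print(metric_distance(code_metrics(submission_a), code_metrics(submission_c)))
```

Identical metric vectors are only a hint, since unrelated programs can share the same counts; such a check is better treated as a cheap first filter before token-based fingerprinting.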
5.2 Procedure for Genetic Search Algorithm [16]
1. Start/Initialization: Generate a random initial population of the desired size.
2. Fitness Criteria Evaluation: Evaluate the fitness of each individual in the population with respect to the desired requirement, which may range from simple to complex.
3. Generation of New Population
   (a) Selection: Improve population fitness by selecting only the best individuals and discarding bad designs, following Darwin's theory of natural selection, i.e. “Survival of the fittest individual among others”.
   (b) Crossover: Create new individuals by combining aspects of two or more individuals. Combining traits from two or more individuals can generate even fitter offspring.
   (c) Mutation: Make small random changes to individuals so that new solutions are not restricted to combinations of traits from the initial population.
4. Loop: Use the new generation for a further run of the algorithm, i.e. repeat from Step 2, until the termination condition is reached.
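The steps above can be sketched as a small genetic search over attribute subsets. This is not WEKA's implementation: the relevance scores, the per-attribute cost, tournament selection, one-point crossover and all parameter values are assumptions chosen to keep the example short.

```python
import random

# Hypothetical relevance scores for six candidate attributes (made-up numbers).
RELEVANCE = [0.9, 0.1, 0.7, 0.2, 0.8, 0.05]
COST_PER_ATTRIBUTE = 0.5


def fitness(individual):
    """Reward relevant attributes and penalize every attribute that is selected."""
    return sum(r - COST_PER_ATTRIBUTE for r, bit in zip(RELEVANCE, individual) if bit)


def tournament(population, rng, size=3):
    """Selection: keep the fittest of a few randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)


def crossover(a, b, rng):
    """Crossover: combine the head of one parent with the tail of the other."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]


def mutate(individual, rng, rate=0.1):
    """Mutation: flip each bit with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def genetic_search(pop_size=20, generations=40, seed=0):
    rng = random.Random(seed)
    n = len(RELEVANCE)
    # Step 1: random initial population of bit strings (1 = attribute selected).
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=fitness)  # Step 2: evaluate fitness
    for _ in range(generations):         # Step 4: loop until the budget is spent
        # Step 3: build the next generation by selection, crossover and mutation.
        population = [
            mutate(crossover(tournament(population, rng),
                             tournament(population, rng), rng), rng)
            for _ in range(pop_size)
        ]
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best


best = genetic_search()
print(best, round(fitness(best), 3))
```

The returned bit string marks which attributes the search considers most predictive under the assumed scores; a fixed number of generations stands in for the termination condition.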
5.2.1 Implementation of GA on Different Cases of Source Code Plagiarism
The details of the GA implementation on the different cases of source code plagiarism, run in the WEKA run-time environment, are given in Tables 1, 2, 3, 4 and 5. For each case, GA yields the most promising and likely predictive attribute (i.e. the optimized solution) from the given set of attributes as a possible cause of the overall increase in the impact of plagiarism.
5.3 Procedure for Cohort Intelligence
The procedure for Cohort Intelligence [15] for computing a student behavioral pattern distribution is explained in the algorithmic steps below:
1. Initialize the cohort C (students) whose behavior is to be analyzed.
2. Initialize all other parameters, such as the convergence parameter ε, number of iterations n, sampling interval Si, sampling interval reduction factor r, and number of variations t.
3. Calculate the probability of every cohort c being selected, which is associated with the behavior being followed by every student in the cohort.
4. Apply the roulette wheel approach to decide which behavior, and which student's qualities, to follow from the C available choices.
5. Every student shrinks or grows the sampling interval of its qualities as follows:
   (a) Calculate the behavioral fitness value for each particle and initialize the best behavioral fitness value in the cohort as Cb.
   (b) Compare the current behavioral fitness value with the best value Cb. If the current value is better than Cb, update the best value with the current value; otherwise, continue with Cb.
   (c) Find the cohort in the neighborhood with the best fitness value so far and take this value as the global best value Gb.
6. Every cohort samples behavior from the updated interval, and the associated behavior is found.
7. If there is no change in the behavior of each cohort, the cohort can be considered saturated.
8. If the cohort converges to the same saturated behavior even after the maximum number of attempts, the current cohort behavior can be accepted as the final behavior.
9. Stop if the number of iterations reaches n or the cohort is saturated.
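A minimal sketch of these steps on a toy minimization problem is given below. It is a simplification under stated assumptions: the objective `behavior`, the single shared sampling interval `width`, and all parameter values are hypothetical, and the best-value bookkeeping of steps 5(a) to 5(c) is reduced to keeping the best of `variations` samples per student. The parameters `eps`, `max_iter`, `r` and `variations` play the roles of ε, n, r and t.

```python
import random


def behavior(qualities):
    """Toy objective to minimize (an assumption; lower means better behavior)."""
    return sum(v * v for v in qualities)


def roulette(probabilities, rng):
    """Roulette wheel: pick an index with chance proportional to its probability."""
    pick, cumulative = rng.random(), 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if pick <= cumulative:
            return index
    return len(probabilities) - 1


def cohort_intelligence(candidates=5, dims=2, bounds=(-5.0, 5.0), r=0.9,
                        variations=10, max_iter=200, eps=1e-6, seed=1):
    rng = random.Random(seed)
    lo, hi = bounds
    # Steps 1-2: initialize the cohort and its parameters.
    cohort = [[rng.uniform(lo, hi) for _ in range(dims)] for _ in range(candidates)]
    width = hi - lo
    for _ in range(max_iter):
        # Step 3: a better (lower) behavior gets a higher chance of being followed.
        inverse = [1.0 / (behavior(q) + 1e-12) for q in cohort]
        total = sum(inverse)
        probs = [v / total for v in inverse]
        width *= r  # Step 5: shrink the sampling interval
        updated = []
        for _ in range(candidates):
            followed = cohort[roulette(probs, rng)]  # Step 4
            # Step 6: sample qualities around the followed student, keep the best.
            samples = [
                [min(hi, max(lo, f + rng.uniform(-width / 2, width / 2)))
                 for f in followed]
                for _ in range(variations)
            ]
            updated.append(min(samples, key=behavior))
        cohort = updated
        scores = [behavior(q) for q in cohort]
        if max(scores) - min(scores) < eps:  # Steps 7-9: cohort is saturated
            break
    return min(cohort, key=behavior)


best = cohort_intelligence()
print(best, behavior(best))
```

In the plagiarism setting the qualities would be behavioral attributes of students rather than coordinates, and the objective would have to be defined from the evaluation data.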
5.3.1 Results Based on Cohort Analysis Tools
Based on the analysis of different cohort analysis tools such as RJMetrics, Excel and many more, as shown in Tables 6, 7, 8 and 9, and using a combination of appropriate CI algorithms, suitable remedial measures can be taken and an evaluation system devised to prevent plagiarism among students of different streams of study.
6 Why Genetic Search and Cohort Intelligence [15, 17]
Cohort Intelligence is a branch of behavioral analytics that plays an important role in big data analysis, for example of the student records of different streams of study in universities. It is also used in various data mining applications. CI uses its self-supervising mechanism to improve a student's behavior within a cohort. The Genetic Search [16] algorithm mimics the process of natural selection to generate useful, optimized solutions to complex search problems.
Algorithms such as Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO) and the Honey Bee Mating Algorithm are inspired by the natural behavior of living organisms, whereas GA and CI draw on natural selection and human tendencies to solve complex optimization problems. Because of CI's self-supervised nature and GA's optimized search procedure, together they prove to be a better evaluation strategy than the others. The approach is also reasonable in terms of computational cost, which gives it an edge over other contemporary approaches already in use.
7 Conclusion
‘HEC’ intelligently uses different cohort analysis tools such as RJMetrics and datapine to evaluate code cloning. Using the GA and CI procedures, the system evaluates the different types of code plagiarism committed by students. Cohort analysis, together with the Genetic Search procedure, acts as a trigger for the Hawk Eye system to generate the behavioral distribution patterns of different students. By modeling individual student behavior, teachers can design individual assignments for students.
The evaluation system design that results from HEC is specific to a particular student's plagiarism behavior and can be exchanged among teachers, which reflects the socially inspired behavior of the teaching community. Students, in order to maintain their social standing within their groups, would continue exchanging and cloning information, reflecting their own socially inspired behavior, since they are more receptive to e-media for learning than to traditional reference books.
As a system, HEC would discourage plagiarism among students of the social digital era. The evolving evaluation systems can act as a preventive measure against cloning and will continue to re-evolve as new cohort behavioral attributes emerge. As a concrete initiative, HEC can contribute significantly to the socio-economic development of the country and help universities and teachers understand the growing social influences on today's digital student generation.
References
1. Mulay, P., Puri, K.: Hawk Eye: a plagiarism detection system. In: Proceedings of the Second International Conference on Computer and Communication Technologies (IC3T), vol. 379, CMR Technical Campus, Hyderabad, Advances in Intelligent Systems and Computing: AISC Series of Springer, Ch. 20, 24–26 July 2015
2. Comparison of optical character recognition software. http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software. Accessed 28 Jan 2015
3. Cohort Analysis. http://cohortanalysis.com/. Accessed 26 July 2015
4. El Tahir, A.M., et al.: Overview and comparison of plagiarism detection tools, pp. 161–172. Department of Computer Science, VŠB-Technical University of Ostrava, 17. listopadu 15, Ostrava-Poruba, Czech Republic (2011). ISBN: 978-80-248-2391-1
5. Lukashenko, R., et al.: Computer-based plagiarism detection methods and tools: an overview. In: International Conference on Computer Systems and Technologies, CompSysTech (2007)
6. Pierre, Cornic: Software Plagiarism Detection using Model-Driven Software Development in Eclipse Platform. University of Manchester, School of Computer Science (2008)
7. Liaqat, A.G., Ahmad, A.: Plagiarism detection in Java code. Linnaeus University, School of Computer Science, Physics and Mathematics (2011)
8. Cam Scanner: Phone PDF creator. https://play.google.com/store/apps/details?id=com.intsig.camscanner&hl=en. Accessed 25 Jan 2015
9. Intelligent Character Recognition Software. http://www.cvisiontech.com/ocr/text-ocr/intelligent-character-recognition-software.html?lang=eng. Accessed 2 Feb 2015
10. Baxter, I.D., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using abstract syntax trees. In: Proceedings of the International Conference on Software Maintenance, vol. 98, pp. 368–377 (1998)
11. Poongodi, D., TholkkappiaArasu, G.: An automatic method or statement level plagiarism detection in source code using abstract syntax tree. Research Scholar, Manonmaniam Sundaranar University, Tirunelveli
12. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)
13. Rabin-Karp Algorithm. http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm. Accessed 20 Jan 2015
14. Karp-Rabin Algorithm. http://www-igm.univ-mlv.fr/~lecroq/string/node5.html. Accessed 20 Jan 2015
15. Kulkarni, A.J.: Cohort intelligence: a self supervised learning behavior. In: IEEE International Conference on Systems, Man, and Cybernetics, pp. 1396–1400 (2013)
16. Genetic Algorithm. https://en.wikipedia.org/wiki/Genetic_algorithm. Accessed 26 July 2015
17. Bhosale, M.S., Mane, R.V.: Study and analysis of cluster optimization algorithms: particle swarm optimization and cohort intelligence. Int. J. Mod. Trends Eng. Res. 2(3), 567–571 (2015)
© 2016 Springer International Publishing Switzerland
Mulay, P., Puri, K. (2016). HAWK EYE: Intelligent Analysis of Socio Inspired Cohorts for Plagiarism. In: Snášel, V., Abraham, A., Krömer, P., Pant, M., Muda, A. (eds) Innovations in Bio-Inspired Computing and Applications. Advances in Intelligent Systems and Computing, vol 424. Springer, Cham. https://doi.org/10.1007/978-3-319-28031-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28030-1
Online ISBN: 978-3-319-28031-8
eBook Packages: Engineering, Engineering (R0)