1 Introduction

Hawk Eye [1] is a mobile plagiarism detection system. It uses a Mobile Scanner OCR (Optical Character Recognition) engine to convert captured snapshots into text. The OCR engine [2] preprocesses the captured image to remove noise and then extracts the relevant keywords from it. The system then applies plagiarism detection algorithms that discard details irrelevant to the comparison, such as comments or changed variable names. Finally, the mobile version of a plagiarism detection application such as Viper or Plagiarism Checker can be used to detect plagiarism. Figure 1 shows the flowchart for the Hawk Eye system.

Fig. 1 Part-I: Flowchart for Hawk Eye system

After the Hawk Eye system has been deployed, the concepts of Genetic Algorithm (GA) and Cohort Intelligence (CI) [3] come into play. The proposed Genetic Search and CI procedures can be used to formulate varied and optimized behavioral patterns of different students. Suitable remedial measures can then be taken, and an appropriate evaluation system designed, to prevent plagiarism among students. The flow for the analysis of cohorts (students) and the Cohort Intelligence system is shown in Fig. 2.

Fig. 2 Part-II: Flowchart for Cohort Intelligence system

Hence the complete system is a hybrid combination of the following procedures:

$$ \begin{aligned} {\mathbf{Hawk}}\,{\mathbf{Eye}} + \left( {{\mathbf{Genetic}}\,{\mathbf{Search}} \times {\mathbf{Cohort}}\,{\mathbf{Intelligence}}} \right) & = \left( {{\mathbf{Efficient}} + {\mathbf{Flexible}} + {\mathbf{Preventive}}} \right) \\ & \quad \times {\mathbf{PLAGIARISM}}\,{\mathbf{DETECTION}}\,{\mathbf{SYSTEM}} \\ \end{aligned} $$

2 Literature Review

A number of studies have reported on the feasibility of detecting plagiarism using different tools in different contexts.

Overview and Comparison of Plagiarism Detection Tools [4]: Asim et al. examined various plagiarism detection tools with respect to their features and performance. The comparison shows that no tool can detect or prove plagiarism in a document with 100 % certainty, because each tool has its own advantages and limitations. Given these limitations, a set of parameters and rules can be suggested that should be considered in order to overcome plagiarism in academic settings.

Computer-Based Plagiarism Detection Methods and Tools [5]: Romans et al. introduce different ways to reduce plagiarism in terms of both prevention and detection. Their analysis of several known plagiarism detection tools, among them Turnitin, Wordcheck, Moss and JPlag, shows that although these tools provide excellent detection services, they cannot detect plagiarism the way a human can manually. The paper therefore concludes that the human brain is a universal plagiarism detection tool, which analyzes a document using various statistical and semantic methods and can thus operate on both textual and non-textual information. At present, such abilities are not available in advanced plagiarism detection software.

Software Plagiarism Detection using Model Driven Software Development in Eclipse Platform [6]: Pierre describes the concept, design and development of a software plagiarism detection application based on the Eclipse Platform. It is a generic front-end approach that converts source programs from different programming languages into generic models in order to detect source code plagiarism. The results highlight that, like any other plagiarism detection software, the application does not provide absolute results; instead, it signals that the source code submitted by a student requires further investigation with regard to plagiarism.

Plagiarism Detection in Java Code [7]: Ahmad Gull and Aijaz focus on programs that could assist teachers in detecting plagiarism in Java programming. They also highlight the properties of different approaches to detecting plagiarism in Java code files and accordingly recommend the best approaches for future work and study.

3 Our Approach and Contributions

The studies above make clear that no plagiarism tool can detect plagiarism in a document or source code with 100 % accuracy. Every tool suffers from some limitation, which reduces the confidence that can be placed in its results.

Hawk Eye is an initiative in this regard that tries to overcome some of the limitations of existing tools using its keen and observant eyesight. It takes various parameters into consideration while detecting plagiarism, among them Database Checking, Structure Checking, Supported Languages, the use of tokens rather than plain strings, Abstract Syntax Trees and various hybrid approaches. To strike a balance between the best features already available in detection tools such as Moss, JPlag and Turnitin, we have studied the limitations of existing tools and how these limitations can be overcome.

Extending Hawk Eye with the Genetic Search and Cohort Intelligence concepts is intended to make it a more concrete system, especially in terms of the confidence that can be placed in it as plagiarism detection software. A continuously evolving evaluation system is the final objective of HEC: as new behavioral patterns of a cohort emerge, HEC will strengthen itself as an evaluation and prevention system for plagiarism.

4 Methodology for Hawk Eye

A. Mobile Scanner OCR Engine Working

The scanning capability of smartphones, available through scanning apps such as OCR Instantly and Cam-Scanner [8], can be used in the proposed plagiarism detection system to extract relevant text from a captured snapshot. The complete process can be summarized as follows:

  1. Enhance the image (i.e. image preprocessing) to improve image quality and reduce noise as far as possible.

  2. Use OCR to extract the relevant text in editable form from the captured image.

  3. Save As (default options available: PDF/JPEG format) in an appropriate format.

  4. Share (as available with scanners like OCR Instantly) instantly by various means with just one click.
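As an illustration of the preprocessing step only, the following is a minimal sketch, assuming the snapshot is already available as a grayscale grid of pixel values. The functions `median_filter` and `binarize` and the threshold of 128 are hypothetical choices made for this example; they are not taken from any of the scanning apps named above, and the OCR step itself is not shown.

```python
from statistics import median


def median_filter(img):
    """Apply a 3x3 median filter to a grayscale image (list of rows, 0-255)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            # Collect the neighbourhood, clipped at the image border.
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = int(median(window))
    return out


def binarize(img, threshold=128):
    """Map every pixel to 0 (ink) or 255 (paper); the threshold is an assumption."""
    return [[0 if p < threshold else 255 for p in row] for row in img]


# A 5x5 white patch with one isolated dark speck (noise) in the centre.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
clean = binarize(median_filter(noisy))
```

The median filter removes the isolated speck because every 3x3 neighbourhood is dominated by white pixels; a real pipeline would likely also correct skew and lighting before passing the image to the OCR engine.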

This concept of OCR [2] can be further extended to IWR [9] (Intelligent Word Recognition), which can be used to detect plagiarized code handwritten by students.

B. Plagiarism Detection Methods

B.1 Abstract Syntax Tree

An Abstract Syntax Tree or AST [10, 11] is a hierarchical representation of a program. Each node represents a programming language construct, and its children are the parameters of this construct. The internal nodes of an AST [11] can be mathematical operators, function calls or other programming structures; the leaves are variables or constants. Compilers perform optimizations on the AST before generating lower-level code. Because the AST captures the structure of a program in this way, it can be used in plagiarism detection.
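The idea can be sketched with Python's standard `ast` module, used here purely for illustration; the helper `structure` and the three sample programs are assumptions of this example and not part of Hawk Eye. Two programs that differ only in identifier names yield the same sequence of node types, while a structurally different program does not.

```python
import ast


def structure(source):
    """Return the AST node type names in traversal order, ignoring names and constants."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
different = "def total(xs):\n    return sum(xs)\n"

same_shape = structure(original) == structure(renamed)
other_shape = structure(original) == structure(different)
```

Here `same_shape` is true and `other_shape` is false. Comparing exact node sequences is the simplest possible criterion; a practical detector would more plausibly compare subtrees so that reordered or partially copied code is still matched.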

B.2 Tokenizing String Based System

Consider the program as normal text. The pre-processing phase removes all comments, white space and punctuation, and renames variables to a common token. A comparison of the resulting string sequences is then performed using the Karp-Rabin algorithm [12–14]. This algorithm uses a hash function to compute a hash value for each substring, so that substrings can be compared by their hash values. The main advantage of tokens is that they discard all unnecessary information; token-based systems are therefore insensitive to “search and replace” changes.
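A minimal sketch of this pipeline follows, assuming C-like or Java-like input. The keyword subset, the window length `k = 4`, the constants `BASE` and `MOD`, and the Jaccard-style `similarity` score are illustrative assumptions, not values taken from Hawk Eye. Unlike the description above, the sketch keeps operators and drops only separators.

```python
import re
import zlib

KEYWORDS = {"int", "for", "return", "if", "else", "while"}  # illustrative subset
SEPARATORS = set(";,(){}[]")
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")
BASE, MOD = 257, 1_000_000_007


def tokenize(code):
    """Strip comments and separators; map identifiers to ID and numbers to NUM."""
    code = re.sub(r"//[^\n]*|/\*.*?\*/", " ", code, flags=re.S)
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if tok in SEPARATORS:
            continue
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)  # operators are kept as they are
    return tokens


def fingerprints(tokens, k=4):
    """Karp-Rabin rolling hashes of every window of k consecutive tokens."""
    codes = [zlib.crc32(t.encode()) % MOD for t in tokens]
    if len(codes) < k:
        return set()
    high = pow(BASE, k - 1, MOD)  # weight of the token leaving the window
    h = 0
    for c in codes[:k]:
        h = (h * BASE + c) % MOD
    result = {h}
    for i in range(k, len(codes)):
        # Drop the oldest token, shift the window, add the newest token.
        h = ((h - codes[i - k] * high) * BASE + codes[i]) % MOD
        result.add(h)
    return result


def similarity(a, b, k=4):
    """Share of fingerprints that the two programs have in common."""
    fa, fb = fingerprints(tokenize(a), k), fingerprints(tokenize(b), k)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


a = "int sum = 0; for (int i = 0; i < n; i++) { sum += i; } // add up"
b = "int total = 0; for (int j = 0; j < m; j++) { total += j; }"
score = similarity(a, b)
```

Because renaming `sum` to `total` and `i` to `j` leaves the token stream unchanged, the two snippets produce identical fingerprints and `score` is 1.0.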

5 Methodology for Genetic Search and Cohort Intelligence

CI [15] is a novel methodology inspired by the self-supervised learning behavior of candidates in a cohort. Cohort analysis makes it possible to identify relationships between the different characteristics and behaviors of a population.

Different sources can contribute to source code plagiarism, ranging from low through medium to high level. A means is required to validate and optimize this range of plagiarism cases. GA can be used to establish the validity and appropriateness of the selected cases. Thereafter, the proposed Cohort Intelligence procedure can be applied to the same cases and to an incrementally evolving dataset, which grows with every new batch of students. This strengthens confidence in the assumed cases.

In this way we can optimize the search for possible sources of plagiarism and then, by applying the Cohort Intelligence procedure, formulate cohort buckets that categorize the different behavioral characteristics of students.

5.1 Possible Cases of Source Code Plagiarism

Low Level Code Plagiarism Sources

  Case 1. Attribute based Code Plagiarism: Attributes such as the number of variables, functions and classes, the size of the code and others could be one possible source of plagiarism. Calculating code metrics could be one means of countering this type of plagiarism.

  Case 2. Token based Code Plagiarism: Renaming or changing methods, fields, classes and identifiers, replacing expressions by equivalent ones, or changing comments and indentation could be another possible source. To counter this type of plagiarism, the code can be tokenized and hashed, creating fingerprints of the code and reducing the growing incidence of plagiarism.
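For Case 1, a code metrics comparison might look like the sketch below. It uses Python's `ast` module as a stand-in parser; the chosen metrics, the function names `code_metrics` and `metric_distance`, and the sample programs are assumptions made for illustration only.

```python
import ast


def code_metrics(source):
    """Count a few simple attributes of a program (an illustrative selection)."""
    nodes = list(ast.walk(ast.parse(source)))
    assigned = {
        n.id for n in nodes
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    }
    return {
        "functions": sum(isinstance(n, ast.FunctionDef) for n in nodes),
        "classes": sum(isinstance(n, ast.ClassDef) for n in nodes),
        "variables": len(assigned),
        "lines": len([ln for ln in source.splitlines() if ln.strip()]),
    }


def metric_distance(m1, m2):
    """Sum of absolute differences; 0 means identical metric profiles."""
    return sum(abs(m1[key] - m2[key]) for key in m1)


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
unrelated = "class A:\n    def f(self):\n        return 1\n"

d_same = metric_distance(code_metrics(original), code_metrics(renamed))
d_other = metric_distance(code_metrics(original), code_metrics(unrelated))
```

A distance of zero, as for `d_same`, only flags a pair for closer inspection: unrelated programs can share a metric profile by coincidence, so metrics of this kind are best combined with the token-based fingerprints of Case 2.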

5.2 Procedure for Genetic Search Algorithm [16]

  1. Start/Initialization: Generate a random initial population of the desired size.

  2. Fitness Criteria Evaluation: Each individual of the population is evaluated for fitness with respect to the desired requirement, which may range from simple to complex.

  3. Generation of New Population

    (a) Selection: Improve population fitness by selecting only the best individuals and discarding bad designs from the population, i.e. Darwin’s theory of natural selection, “survival of the fittest”.

    (b) Crossover: Create new individuals by combining aspects of two or more individuals. Combining traits from two or more individuals may generate even fitter offspring.

    (c) Mutation: Make small random changes to individuals so that the solutions are not restricted to combinations drawn from the initial population only.

  4. Loop: Use the new generation of the population for a further run of the algorithm, i.e. repeat from Step 2, until the termination condition is reached.
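The four steps can be sketched as a small attribute-selection search. This is a generic illustration and not the WEKA implementation used in this work: the `RELEVANCE` scores, the per-attribute penalty of 0.2, the tournament selection and every parameter value are hypothetical.

```python
import random

RELEVANCE = [0.9, 0.1, 0.8, 0.05, 0.02]  # hypothetical per-attribute scores


def attribute_fitness(mask):
    """Reward relevant attributes and penalise each selected attribute by 0.2."""
    return sum(w for w, bit in zip(RELEVANCE, mask) if bit) - 0.2 * sum(mask)


def genetic_search(fitness, n_bits, pop_size=20, generations=30,
                   crossover_rate=0.6, mutation_rate=0.05, seed=1):
    rng = random.Random(seed)
    # Step 1: random initial population of bit masks (1 = attribute selected).
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Step 2: evaluate fitness and remember the best individual.
        ranked = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        new_pop = [ranked[0][:]]  # elitism: the best individual always survives
        while len(new_pop) < pop_size:
            # Step 3(a): selection by two-way tournament.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Step 3(b): single-point crossover.
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_bits)
                child = p1[:cut] + p2[cut:]
            # Step 3(c): mutation flips each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            new_pop.append(child)
        pop = new_pop  # Step 4: loop with the new generation.
    return max(pop, key=fitness), history


best_mask, best_per_generation = genetic_search(attribute_fitness, n_bits=5)
```

Keeping the best individual in every generation means the values in `best_per_generation` never decrease. With the hypothetical scores above, the highest-scoring mask selects the first and third attributes, although a stochastic search is not guaranteed to reach it in every run.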

5.2.1 Implementation of GA on Different Cases of Source Code Plagiarism

Details of the GA implementation on the different cases of source code plagiarism, run in the WEKA run time environment, are given in Tables 1, 2, 3, 4 and 5. For each case of source code plagiarism, GA gives the most promising and likely predictive attribute (i.e. the optimized solution) from a given set of attributes as a possible cause of the overall increase in the impact of plagiarism.

Table 1 WEKA GA run time information common to CASE 1 and 2 of source code plagiarism
Table 2 CASE 1: attribute based code plagiarism
Table 3 WEKA Run information for CASE 1
Table 4 CASE 2: token based code plagiarism
Table 5 WEKA Run information for CASE 2

5.3 Procedure for Cohort Intelligence

The procedure for Cohort Intelligence [15] for computing a student behavioral pattern distribution is explained in the algorithmic steps below:

  1. Initialize the cohort C (students) whose behavior is to be analyzed.

  2. Initialize all other parameters: convergence parameter ε, number of iterations n, sampling interval Si, sampling interval reduction factor r, and number of variations t.

  3. Calculate the probability of every candidate c in the cohort being selected, which is associated with the behavior followed by every student in the cohort.

  4. Apply the roulette wheel approach to decide which behavior, and which student's qualities, to follow from the C available choices.

  5. Every student shrinks/grows the sampling interval of its qualities as follows:

    (a) Calculate the behavioral fitness value for each candidate and initialize the best behavioral fitness value among the cohort as Cb.

    (b) Compare the current behavioral fitness value with the best value Cb. If the current value is better than Cb, update the best value with the current value. Otherwise continue with Cb.

    (c) Find the candidate in the neighborhood with the best fitness value so far and take this value as the global best value Gb.

  6. Every candidate samples behaviors from the updated interval, and the associated behavior is found.

  7. If there is no change in the behavior of any candidate, the cohort can be considered saturated.

  8. If the cohort converges to the same saturated behavior even after the maximum number of attempts, the current cohort behavior can be accepted as the final behavior.

  9. Stop if the number of iterations reaches n or the cohort is saturated.
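The steps above can be sketched for a one-dimensional toy problem. This is a simplified reading of the general CI scheme, not the procedure as applied to student data: the objective `toy_behavior_cost`, the inverse-cost selection weights, the single shared interval width and all parameter values are assumptions, and the objective is assumed to be non-negative.

```python
import random


def cohort_intelligence(f, lo, hi, candidates=5, n=100, r=0.9, t=10,
                        eps=1e-9, seed=0):
    """Minimise a non-negative objective f over [lo, hi] with a small cohort."""
    rng = random.Random(seed)
    # Steps 1-2: initialise the cohort and the sampling interval width.
    xs = [rng.uniform(lo, hi) for _ in range(candidates)]
    width = hi - lo
    best = min(xs, key=f)  # best behavior seen so far (Cb in the steps above)
    history = [f(best)]
    for _ in range(n):
        # Step 3: better (lower-cost) behaviors are more likely to be followed.
        weights = [1.0 / (f(x) + 1e-12) for x in xs]
        # Step 4: roulette wheel choice of whom each candidate follows.
        followed = rng.choices(xs, weights=weights, k=candidates)
        # Step 5: shrink the sampling interval by the reduction factor r.
        width *= r
        new_xs = []
        for target in followed:
            a = max(lo, target - width / 2)
            b = min(hi, target + width / 2)
            # Step 6: sample t variations around the followed behavior.
            samples = [rng.uniform(a, b) for _ in range(t)]
            new_xs.append(min(samples, key=f))
        xs = new_xs
        round_best = min(xs, key=f)
        if f(round_best) < f(best):
            best = round_best
        history.append(f(best))
        # Steps 7-9: stop early once all behaviors are effectively the same.
        values = [f(x) for x in xs]
        if max(values) - min(values) < eps:
            break
    return best, history


def toy_behavior_cost(x):
    """Hypothetical objective with its minimum at x = 3."""
    return (x - 3.0) ** 2


best_x, cost_history = cohort_intelligence(toy_behavior_cost, -10.0, 10.0)
```

The search stops when the iteration budget `n` is exhausted or when the candidates' costs differ by less than `eps`, which corresponds to the saturation condition. Because `best` is only replaced by a strictly better behavior, the values in `cost_history` never increase.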

5.3.1 Results Based on Cohort Analysis Tools

Based on the analysis from different cohort analysis tools such as RJMetrics and Excel, as shown in Tables 6, 7, 8 and 9, and using a combination of appropriate CI algorithms, suitable remedial measures can be taken and an evaluation system devised to prevent plagiarism among students of different streams of study.

Table 6 Cohort bucket for engineering stream of study
Table 7 Cohort analysis based on Cohort bucket 1
Table 8 Cohort bucket for commerce stream of study
Table 9 Cohort analysis based on Cohort bucket 2

6 Why Genetic Search and Cohort Intelligence [15, 17]

Cohort Intelligence is a branch of behavioral analytics that plays an important role in big data analysis, such as the analysis of student records from different streams of study in universities. It is also used in various data mining applications. CI uses its self-supervising mechanism to improve a student's behavior in a cohort. The Genetic Search algorithm [16] mimics the process of natural selection to generate useful and optimized solutions to complex search problems.

Algorithms such as Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO) and the Honey Bee Mating Algorithm are inspired by the natural behavior of living organisms, whereas GA and CI draw on natural selection and human tendency, respectively, to solve complex optimization problems. Because of CI's self-supervised nature and GA's optimized search procedure, they prove to be a better evaluation strategy than the others. This approach is also reasonable with respect to computation cost, which gives it an edge over other contemporary approaches already in use.

7 Conclusion

‘HEC’ uses different cohort analysis tools such as RJMetrics and datapine to evaluate code cloning. Using the GA and CI procedures, the system evaluates the different types of code plagiarism committed by students. Cohort analysis together with the Genetic Search procedure acts as a trigger/activator for the Hawk Eye system to generate the different behavioral distribution patterns of students. By modeling individual student behavior, teachers can design individual assignments for students.

The evaluation system design that results from HEC is specific to a particular student's plagiarism behavior, and it can be exchanged among teachers. This reflects the socially inspired behavior of the teaching community. Students, in order to maintain their social integrity within their groups, would continue exchanging and cloning information, thereby reflecting their own socially inspired behavior, as students are more receptive to e-media for learning than to traditional reference books.

HEC as a system would discourage plagiarism among students of the social digital era. The evolving evaluation systems can act as a preventive measure against cloning and will continue to re-evolve as new cohort behavioral attributes emerge. As a concrete initiative, HEC can contribute significantly to the socio-economic development of the country and help universities and teachers understand the growing social impact among today's digital student generation.