Elsevier

Pattern Recognition Letters

Volume 34, Issue 3, 1 February 2013, Pages 330-334

Source code author identification with unsupervised feature learning

https://doi.org/10.1016/j.patrec.2012.10.027

Abstract

Automatic identification of source code authors has applications in fields such as source code plagiarism detection and lawsuit prosecution. This paper presents a new source code author identification system based on an unsupervised feature learning technique. As a method of extracting features from high-dimensional data, unsupervised feature learning has been highly successful in fields such as character recognition and image classification. However, to our knowledge, it has not been applied to source code author identification. We therefore investigated an unsupervised feature learning technique called the sparse auto-encoder as a method of extracting features from source code files. Our system was evaluated on several datasets, and the results show that its performance is very close to that of state-of-the-art techniques in source code author identification.

Highlights

► We designed and built a new source code author identification system. ► It is based on an unsupervised feature learning technique. ► Our system is evaluated with several datasets. ► Results show that our system outperforms some existing systems.

Introduction

Automatic identification of source code authors is an important activity in many fields of computing. Source code plagiarism detection is one of the main fields that benefits directly from source code author identification. According to Zobel (2004), “students may plagiarize by copying code from friends, the Web or so called ‘private tutors’”. Therefore, detecting and preventing source code plagiarism is essential for programming assignments.

Chao et al. (2006) state that “A quality plagiarism detector has a strong impact to law suit prosecution”. Furthermore, Lange and Mancoridis (2007) point out that source code author identification is also useful in criminal justice and corporate litigation.

In this research we investigated a novel machine learning technique called unsupervised feature learning (Raina et al., 2007) in order to improve the performance of source code author identification systems. Unsupervised feature learning algorithms have recently attracted considerable attention in the machine learning community (Raina et al., 2007) and have been used successfully in many applications. However, to our knowledge, they have not been tried for source code author identification. We therefore investigated one such learning technique, the sparse auto-encoder (Coates et al., 2010).
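To make the idea of a sparse auto-encoder concrete, the sketch below computes a commonly used form of its training cost: reconstruction error, plus a KL-divergence penalty that pushes the average hidden activation toward a small target, plus weight decay. This is a minimal illustration under stated assumptions, not the formulation used in this paper; the function names and the default values of `rho`, `beta` and `lam` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, b):
    """One dense layer with sigmoid activation; W is a list of weight rows."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def kl(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def sparse_autoencoder_cost(X, W1, b1, W2, b2, rho=0.05, beta=3.0, lam=1e-4):
    """Cost = reconstruction error + beta * sparsity penalty + weight decay.

    rho, beta and lam are assumed hyper-parameter values for illustration.
    """
    m = len(X)
    hidden = [forward(x, W1, b1) for x in X]      # encoder activations
    recon = [forward(h, W2, b2) for h in hidden]  # decoder reconstructions
    mse = sum(sum((r - xi) ** 2 for r, xi in zip(rec, x))
              for rec, x in zip(recon, X)) / (2 * m)
    # Average activation of each hidden unit over the whole batch.
    rho_hat = [sum(h[j] for h in hidden) / m for j in range(len(b1))]
    sparsity = sum(kl(rho, r) for r in rho_hat)
    decay = sum(w * w for W in (W1, W2) for row in W for w in row)
    return mse + beta * sparsity + 0.5 * lam * decay


# Toy input: two 2-dimensional feature vectors, three hidden units, zero weights.
X = [[0.1, 0.9], [0.8, 0.2]]
W1, b1 = [[0.0, 0.0] for _ in range(3)], [0.0, 0.0, 0.0]
W2, b2 = [[0.0, 0.0, 0.0] for _ in range(2)], [0.0, 0.0]
cost = sparse_autoencoder_cost(X, W1, b1, W2, b2)
print(cost)
```

Minimizing such a cost with gradient descent yields hidden activations that can serve as learned features for a downstream classifier.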

This paper is organized as follows. In Section 2, we present previous work related to source code author identification. The architecture of our system is described in Section 3. In Section 4, we describe the training and testing of the system. Finally, we conclude the paper by presenting our conclusions and discussing further improvements.

Section snippets

Related work

A few research papers have been written on source code author identification using machine learning techniques. Lange and Mancoridis (2007) proposed a source code author identification method that uses source code metric histograms and a genetic algorithm. First, source code metrics were generated from software source code files. Then the normalized metrics were used as the input to a nearest neighbor classifier. The system is capable of identifying the true author of each source code file
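The histogram-plus-nearest-neighbor idea described above can be sketched as follows. This is an illustrative sketch only: the line-length metric, the Euclidean distance and all function names are assumptions made for the example, not the metrics or distance measure used by Lange and Mancoridis (2007), and the genetic algorithm step is omitted.

```python
from collections import Counter
import math


def line_length_histogram(source, max_len=80):
    """Normalized histogram of line lengths (a hypothetical stand-in metric)."""
    counts = Counter(min(len(line), max_len) for line in source.splitlines())
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts.get(i, 0) / total for i in range(max_len + 1)]


def distance(h1, h2):
    """Euclidean distance between two equally sized histograms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


def nearest_author(query_hist, labeled_hists):
    """Return the author whose training histogram is closest to the query."""
    return min(labeled_hists, key=lambda item: distance(query_hist, item[1]))[0]


# Toy training set: one file per (made-up) author.
training = [
    ("alice", line_length_histogram("int a = 1;\nint b = 2;\n")),
    ("bob", line_length_histogram(
        "def f(x):\n    return x + 1  # long trailing comment here\n")),
]
query = line_length_histogram("int c = 3;\nint d = 4;\n")
print(nearest_author(query, training))
```

Normalizing each histogram so that it sums to one keeps files of different lengths comparable before distances are computed.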

Source code author identification system

The following section describes the architecture of our source code author identification system. The system consists of four pipeline stages, as enumerated below.

  1. Source code files are fed as the input for the system.

  2. Code metrics are extracted from the input source code files. Lange and Mancoridis (2007) conducted extensive research on selecting an optimum set of source code metrics from a large collection of metrics and identified 8 such metrics. Since our main objective of this project

Experiments

To gauge the performance of our system, we conducted experiments on five datasets, which are described below.

Dataset I is the same dataset used by Lange and Mancoridis (2007) and Bandara and Wijayarathna (2011). It consists of source code files belonging to 10 authors, extracted from the SourceForge website. We created Dataset II by extracting Java source code files from free and open source project in the

Conclusion and future works

This paper investigated the applicability of the sparse auto-encoder as an unsupervised feature learning technique for source code author identification. We tested our system with five datasets. We also implemented the SCAP system and tested it with the same datasets. Our system’s performance was very close to SCAP’s performance on large datasets. However, SCAP outperformed our system on small datasets.

We used a sparse auto-encoder for feature extraction. However, it is interesting to investigate performance

References (20)

  • Upul Bandara et al., 2011. A machine learning based tool for source code plagiarism detection. Internat. J. Machine Learn. Comput.
  • Y. Bengio, 2009. Learning deep architectures for AI. Found. Trends® Machine Learn.
  • Yoshua Bengio et al. On the expressive power of deep architectures.
  • James Bergstra et al., 2012. Random search for hyper-parameter optimization. J. Machine Learn. Res.
  • Christopher M. Bishop, 2007. Pattern Recognition and Machine Learning.
  • Burrows, Steven, Tahaghoghi, Seyed M.M., 2007. Source code authorship attribution using N-grams. In: Spink, Amanda,...
  • Liu Chao et al. GPLAG: Detection of software plagiarism by program dependence graph analysis.
  • Coates, A. et al., 2011. Text detection and character recognition in scene images with unsupervised feature learning....
  • Coates, Adam, Lee, Honglak, Ng, Andrew, 2010. An analysis of single-layer networks in unsupervised feature learning....
  • Bruce S. Elenbogen et al., 2008. Detecting outsourced student programming assignments. J. Comput. Sci. Coll.
There are more references available in the full text version of this article.
