Source code author identification with unsupervised feature learning

doi:10.1016/j.patrec.2012.10.027

Pattern Recognition Letters

Volume 34, Issue 3, 1 February 2013, Pages 330-334

https://doi.org/10.1016/j.patrec.2012.10.027

Abstract

Automatic identification of source code authors has applications in fields such as source code plagiarism detection and lawsuit prosecution. This paper presents a new source code author identification system based on an unsupervised feature learning technique. As a method of extracting features from high-dimensional data, unsupervised feature learning has achieved great success in fields such as character recognition and image classification. However, to our knowledge, it has not been applied to source code author identification. Therefore, we investigated an unsupervised feature learning technique called the sparse auto-encoder as a method of extracting features from source code files. Our system was evaluated on several datasets, and the results show that its performance is very close to that of state-of-the-art techniques in the source code author identification field.

Highlights

► We designed and built a new source code author identification system.
► It is based on an unsupervised feature learning technique.
► Our system is evaluated with several datasets.
► Results show that our system outperforms some existing systems.

Introduction

Automatic identification of source code authors is an important activity in many fields of computing. Source code plagiarism detection is one of the main fields that can readily benefit from it. According to Zobel (2004), “students may plagiarize by copying code from friends, the Web or so called ‘private tutors’”. Therefore, detection and prevention of source code plagiarism is essential for programming assignments.

Chao et al. (2006) state that “A quality plagiarism detector has a strong impact to law suit prosecution”. Furthermore, Lange and Mancoridis (2007) point out that source code author identification is also useful in criminal justice and corporate litigation.

In this research, we investigated a machine learning technique called unsupervised feature learning (Raina et al., 2007) in order to improve the performance of source code author identification systems. Unsupervised feature learning algorithms have recently attracted considerable attention in the machine learning community (Raina et al., 2007) and have been used successfully in many applications. However, to our knowledge, they have not been tried for source code author identification. Therefore, we investigated one such learning technique, the sparse auto-encoder (Coates et al., 2010).
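As a rough illustration of what a sparse auto-encoder optimizes, the sketch below evaluates a commonly used cost: reconstruction error, plus weight decay, plus a KL-divergence penalty that pushes the average activation of each hidden unit toward a small target. The excerpt does not show the formulation used in this paper, so the sigmoid decoder, the function names, and the values of `rho`, `beta` and `lam` are all assumptions made for illustration.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def layer(x, W, b):
    """Sigmoid activations of one fully connected layer for input vector x."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def kl_sparsity(rho, rho_hat):
    """Sum over hidden units of KL(rho || rho_hat_j)."""
    return sum(rho * log(rho / r) + (1 - rho) * log((1 - rho) / (1 - r))
               for r in rho_hat)

def sparse_cost(X, W1, b1, W2, b2, rho=0.05, beta=3.0, lam=1e-4):
    """Cost of a one-hidden-layer sparse auto-encoder (assumed formulation)."""
    m = len(X)
    hidden = [layer(x, W1, b1) for x in X]   # encoder output, the learned features
    recon = [layer(h, W2, b2) for h in hidden]  # decoder output
    # Average squared reconstruction error.
    error = sum((r - xi) ** 2
                for rx, x in zip(recon, X)
                for r, xi in zip(rx, x)) / (2 * m)
    # L2 weight decay on both weight matrices.
    decay = (lam / 2) * sum(w * w for W in (W1, W2) for row in W for w in row)
    # Mean activation of each hidden unit over the dataset.
    rho_hat = [sum(h[j] for h in hidden) / m for j in range(len(b1))]
    return error + decay + beta * kl_sparsity(rho, rho_hat)

# Toy data: two 3-dimensional inputs, two hidden units, all-zero parameters.
X = [[0.2, 0.8, 0.5], [0.9, 0.1, 0.4]]
W1, b1 = [[0.0] * 3 for _ in range(2)], [0.0] * 2
W2, b2 = [[0.0] * 2 for _ in range(3)], [0.0] * 3
print(sparse_cost(X, W1, b1, W2, b2))
```

In a feature learning setting, the weights would be trained to minimize this cost on unlabeled inputs, and the hidden activations would then serve as the features passed to a classifier.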

This paper is organized as follows. Section 2 presents previous work related to source code author identification. Section 3 describes the architecture of our system. Section 4 describes the training and testing of the system. Finally, we present our conclusions and discuss further improvements.

Section snippets

Related work

A few research papers have been written on source code author identification using machine learning techniques. Lange and Mancoridis (2007) proposed a source code author identification method that uses source code metric histograms and a genetic algorithm. First, source code metrics were generated from software source code files. Then the normalized metrics were used as the input to a nearest neighbor classifier. The system is capable of identifying the true author of each source code file
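The classification step described above (normalized metric histograms fed to a nearest neighbor classifier) can be sketched as follows. This is only a minimal illustration: the Euclidean distance, the toy histograms and the author names are assumptions, and the genetic algorithm that Lange and Mancoridis (2007) use is not shown.

```python
from math import sqrt

def normalize(hist):
    """Scale raw counts so the bins sum to 1 (an all-zero histogram is left unchanged)."""
    total = sum(hist)
    return [c / total for c in hist] if total else list(hist)

def distance(a, b):
    """Euclidean distance between two equal-length histograms."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_author(query_hist, training):
    """Return the author whose normalized histogram is closest to the query.

    training is a list of (author, raw_histogram) pairs.
    """
    q = normalize(query_hist)
    best_author, _ = min(
        ((author, distance(q, normalize(h))) for author, h in training),
        key=lambda pair: pair[1],
    )
    return best_author

# Hypothetical three-bin histograms for two authors.
training = [
    ("alice", [8, 1, 1]),
    ("bob", [1, 1, 8]),
]
print(nearest_author([14, 3, 3], training))
```

Normalizing before comparison means that files of different sizes are compared by the shape of their histograms and not by their raw counts.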

Source code author identification system

The following section describes the architecture of our source code author identification system. The system consists of four pipeline stages, as enumerated below.

1. Source code files are fed as the input to the system.
2. Code metrics are extracted from the input source code files. Lange and Mancoridis (2007) have conducted extensive research on selecting an optimum set of source code metrics from a huge collection of metrics and have identified 8 such metrics. Since our main objective of this project
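The metric extraction stage can be pictured with a small sketch that turns one source file into layout histograms. The two metrics used here (leading spaces and line length) and the bin layout are assumptions chosen for illustration; they are not claimed to be the 8 metrics identified by Lange and Mancoridis (2007), and `extract_metrics` is a hypothetical helper.

```python
from collections import Counter

def histogram(values, n_bins):
    """Count values into bins 0..n_bins-1; larger values fall into the last bin."""
    counts = Counter(min(v, n_bins - 1) for v in values)
    return [counts.get(i, 0) for i in range(n_bins)]

def extract_metrics(source, n_bins=8):
    """Return illustrative layout histograms for the text of one source file."""
    lines = [ln for ln in source.splitlines() if ln.strip()]  # skip blank lines
    indent = [len(ln) - len(ln.lstrip(" ")) for ln in lines]
    length = [len(ln.rstrip()) // 10 for ln in lines]  # bucket by tens of characters
    return {
        "leading_spaces": histogram(indent, n_bins),
        "line_length": histogram(length, n_bins),
    }

sample = "class A {\n    int x;\n    void f() {\n        x++;\n    }\n}\n"
print(extract_metrics(sample))
```

Histograms of this kind would then be normalized and passed on to the later stages of the pipeline.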

Experiments

To gauge the performance of our system, we conducted experiments on five datasets, which are described below.

Dataset I is the same dataset used by Lange and Mancoridis (2007) and Bandara and Wijayarathna (2011). It consists of source code files belonging to 10 authors, extracted from the Sourceforge¹ website. We created Dataset II by extracting Java source code files from free and open source project in the

Conclusion and future works

This paper investigated the applicability of the sparse auto-encoder as an unsupervised feature learning technique for source code author identification. We tested our system on five datasets. We also implemented the SCAP system and tested it on the same datasets. Our system’s performance was very close to SCAP’s on large datasets. However, SCAP outperformed our system on small datasets.

We used a sparse auto-encoder for feature extraction. However, it is interesting to investigate performance

References (20)

Upul Bandara et al., 2011. A machine learning based tool for source code plagiarism detection. Internat. J. Machine Learn. Comput.
Y. Bengio, 2009. Learning deep architectures for AI. Found. Trends® Machine Learn.
Yoshua Bengio et al. On the expressive power of deep architectures.
James Bergstra et al., 2012. Random search for hyper-parameter optimization. J. Machine Learn. Res.
Christopher M. Bishop, 2007. Pattern Recognition and Machine Learning.
Burrows, Steven, Tahaghoghi, Seyed M.M., 2007. Source code authorship attribution using N-grams. In: Spink, Amanda,...
Liu Chao et al. GPLAG: Detection of software plagiarism by program dependence graph analysis.
Coates, A. et al., 2011. Text detection and character recognition in scene images with unsupervised feature learning....
Coates, Adam, Lee, Honglak, Ng, Andrew, 2010. An analysis of single-layer networks in unsupervised feature learning....
Bruce S. Elenbogen et al., 2008. Detecting outsourced student programming assignments. J. Comput. Sci. Coll.

There are more references available in the full text version of this article.
