Heterogeneous tree structure classification to label Java programmers according to their expertise level

https://doi.org/10.1016/j.future.2019.12.016

Highlights

  • New feature learning approach to classify great amounts of tree structures.

  • Syntactic classification of the programming expertise level of Java developers.

  • The system is able to label expert and novice programs with 99.6% accuracy.

  • Six different kinds of heterogeneous code fragments can be classified.

  • Identifies Java syntax patterns commonly used by expert and novice programmers.

Abstract

Open-source code repositories are a valuable asset for creating different kinds of tools and services that utilize machine learning and probabilistic reasoning. Syntactic models process Abstract Syntax Trees (ASTs) of source code to build systems capable of predicting different software properties. The main difficulty of building such models comes from the heterogeneous and compound structure of ASTs, and from the fact that traditional machine learning algorithms require instances to be represented as n-dimensional vectors rather than trees. In this article, we propose a new approach to classify ASTs using traditional supervised-learning algorithms, where a feature learning process selects the most representative syntax patterns for the child subtrees of different syntax constructs. Those syntax patterns are used to enrich the context information of each AST, allowing the classification of compound heterogeneous tree structures. The proposed approach is applied to the problem of labeling the expertise level of Java programmers. The system is able to label expert and novice programs with an average accuracy of 99.6%. Moreover, other code fragments such as types, fields, methods, statements and expressions can also be classified, with average accuracies of 99.5%, 91.4%, 95.2%, 88.3% and 78.1%, respectively.

Introduction

Big data is aimed at extracting value from large datasets, creating predictive models and reports, visualizing and describing data, and finding relationships between variables. Big data is being used in many different fields such as medicine, finance, healthcare, education, social networks and genomics. Considering programs as data, the existing open-source code repositories (GitHub, SourceForge, BitBucket and CodePlex) provide massive codebases to be used in the creation of programming tools and services to improve software development, making use of machine learning and probabilistic reasoning [1], [2]. This research area has been termed “big code”, due to its similarity with big data and the use of source code [3].

In the big code area, existing source-code corpora have already been used to create different systems such as deobfuscators [1], statistical machine translation [4], security vulnerability detection [5] and decompilation [6], [7]. Probabilistic models are built with machine learning and natural language processing techniques to exploit the abundance of patterns in source code. Three categories of models have been identified, based on the way they represent the structure of programs [8]: token-level models, which represent code as a sequence of tokens (terminal symbols in the language); syntactic models, which represent code as trees (abstract syntax trees or ASTs); and semantic models, which use additional graph structures (e.g., control-flow graphs and data-dependency graphs).
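To make the difference between the first two categories concrete, the sketch below contrasts both representations of the same fragment. It assumes the third-party javalang parser for Python, chosen here only for illustration (the paper does not prescribe any parser); any Java front end exposing tokens and an AST would serve equally well.

    import javalang

    code = "class A { int f(int n) { return n * 2 + 1; } }"

    # Token-level view: the code as a flat sequence of terminal symbols.
    print([token.value for token in javalang.tokenizer.tokenize(code)])
    # ['class', 'A', '{', 'int', 'f', '(', 'int', 'n', ')', '{', 'return', ...]

    # Syntactic view: the same code as a tree of heterogeneous node types.
    tree = javalang.parse.parse(code)
    for path, node in tree.filter(javalang.ast.Node):
        print(type(node).__name__)
    # CompilationUnit, ClassDeclaration, MethodDeclaration, ReturnStatement, ...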

One of the challenges of big code is to classify and score the programming expertise level of developers by analyzing the source code they write [9]. New tools and IDEs to teach programming can then be developed. Such tools would provide different hints to programmers depending on their level of expertise. A novice Java programmer could be instructed to use inheritance and polymorphism; for average developers, functional idioms using lambda expressions could be introduced [10]; and advanced patterns to avoid performance bottlenecks or security vulnerabilities could be suggested to expert programmers [11].

A system capable of classifying programmers by their expertise level can also be used to analyze the recurrent idioms written by expert programmers. Such idioms could be published and used to improve the skills of average programmers. Likewise, programming lecturers can identify the recurrent programming patterns used by beginners, explaining how they could be improved with better alternatives.

A model that scores the expertise level of programmers can be used to check the improvement of students' programming skills during a programming course. The model would identify those students who do not reach the expected level of programming expertise, so lecturers could help them as early as possible. It would also identify those who have better programming skills, so they could be motivated with additional activities.

The scoring model could also be used by an Intelligent Tutoring System (ITS) that considers how the student evolves. If the student's score increases, more advanced programming constructs will be taught. If the score stays the same, the ITS will offer new activities to reinforce the language construct just taught. Finally, if the score drops, the system will revisit some language constructs formerly explained. In this case, the language construct to be revisited would depend on the idioms coded by the student.
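As a mere illustration of this adaptation policy (the paper does not prescribe an implementation), the three branches above could be sketched as follows; the function name and its inputs are hypothetical.

    def next_activity(previous_score: float, current_score: float,
                      last_construct: str) -> str:
        """Hypothetical ITS policy driven by the expertise-scoring model."""
        if current_score > previous_score:
            # Progress: move on to more advanced programming constructs.
            return f"teach a construct more advanced than {last_construct!r}"
        if current_score == previous_score:
            # Stagnation: reinforce the construct just taught.
            return f"offer new activities on {last_construct!r}"
        # Regression: revisit formerly explained constructs, chosen from
        # the idioms the student actually writes.
        return f"revisit constructs explained before {last_construct!r}"

    print(next_activity(0.41, 0.55, "lambda expressions"))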

In this work, we face the challenge of building syntactic models to classify and score the programming expertise level of Java developers. What follows are the main requirements we must fulfill.

When classifying programmers, different fragments of their code could be analyzed. Therefore, a classifier must consider different levels of syntax constructs, such as expressions, statements, methods, fields, types (classes, interfaces and enumerations) and whole programs (Fig. 1). A whole program gives the classifier more information to label the developer, but a useful tool should give hints to the programmer even when a single statement, method or expression is typed. Therefore, a programmer classifier should be constructed from different models that classify the different levels of syntax constructs (expressions, statements, methods, fields, types and whole programs).

Fig. 1 shows that the syntax of the different language constructs is heterogeneous. For example, the syntax of methods is different from that of statements and expressions. Moreover, many program constructs are composed of other program constructs. For example, an assignment statement comprises the two expressions on the left- and right-hand sides of the assignment operator (the right-hand side may itself be subdivided into further subexpressions). Likewise, object-oriented programs are composed of a set of types, types may comprise methods and fields, methods contain statements, and statements and fields are commonly built using expressions. Therefore, a Java syntax classifier should be able to label those heterogeneous compound AST structures.
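The sketch below, again assuming the javalang parser (its node and attribute names, not the paper's), shows how an illustrative assignment statement such as x = a + b * 2 decomposes into nested heterogeneous nodes:

    import javalang

    code = "class A { void m(int a, int b) { int x; x = a + b * 2; } }"
    tree = javalang.parse.parse(code)

    # Each assignment comprises a left-hand side expression and a right-hand
    # side expression, which is itself subdivided into subexpressions.
    for path, node in tree.filter(javalang.tree.Assignment):
        print(type(node).__name__)                       # Assignment
        print("lhs:", type(node.expressionl).__name__)   # MemberReference (x)
        print("rhs:", type(node.value).__name__)         # BinaryOperation (+)
        print("rhs of rhs:", type(node.value.operandr).__name__)  # BinaryOperation (*)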

As mentioned, the syntax patterns used to classify expert and novice programmers are valuable information. We have described how they could be used to assist lecturers in a programming course, and to create Intelligent Tutoring Systems. Additionally, we propose using the extracted patterns in a feature learning process (Section 3) to build classifiers with different kinds of syntax constructs.

Model construction must be scalable, since we follow the big code philosophy of using massive datasets. It must allow the construction of classifiers from millions of instances. For example, just the dataset we used to build the expressions classifier holds 13,498,005 instances (see Section 4.1).

An important challenge of syntax pattern classification is building predictive models from trees, since most supervised classification algorithms require instances (individuals or rows) to be represented as fixed-size n-dimensional vectors [12]. While there are standard techniques to compute such vectors for documents, images and sound, there are no similarly standard representations for programs [5]. Alternative structured prediction methods exist, such as Graph Neural Networks (GNNs) and Conditional Random Fields (CRFs), discussed in Section 2, but their computation and space costs unfortunately seem too high for them to be used with massive codebases [13].

In this work, we use decision trees (DTs) as the supervised learning algorithm, because DTs create interpretable white-box models, and perform well with large datasets [14]. They are also able to handle both numerical and categorical data.

In order to build DTs, we tabularize the ASTs of the input programs. We represent as features the main syntactic characteristics of each kind of node (expression, statement, method, field, type and program), including its category (e.g., arithmetic operation, method invocation or field access) and various pieces of contextual information (data about its parent and child nodes, its role in the enclosing node, its depth and height, etc.).
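A minimal sketch of this tabularization, assuming the javalang parser and a deliberately simplified feature set (the paper's features are richer), could look like this:

    import javalang
    from javalang.ast import Node

    def child_nodes(node):
        """Direct AST children, flattening the lists javalang stores them in."""
        for child in node.children:
            if isinstance(child, Node):
                yield child
            elif isinstance(child, list):
                yield from (c for c in child if isinstance(c, Node))

    def height(node):
        kids = list(child_nodes(node))
        return 0 if not kids else 1 + max(height(k) for k in kids)

    def rows(tree):
        """One fixed-size feature row per AST node: the tabular form DTs need."""
        for path, node in tree.filter(Node):
            parents = [p for p in path if isinstance(p, Node)]
            yield {
                "category": type(node).__name__,
                "parent": type(parents[-1]).__name__ if parents else "None",
                "n_children": len(list(child_nodes(node))),
                "depth": len(parents),
                "height": height(node),
            }

    tree = javalang.parse.parse("class A { int f(int n) { return n * 2 + 1; } }")
    for row in rows(tree):
        print(row)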

We create different datasets for each kind of node. Then, we build different homogeneous DT models that classify each kind of syntax construct (e.g., expressions, statements, methods, etc.). Finally, we take the patterns used by the homogeneous models to build new classifiers of compound heterogeneous syntax constructs (e.g., a method classifier that also considers the syntax patterns of the statements and expressions written within the method).
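The following sketch, assuming scikit-learn and toy data standing in for the real datasets, illustrates the two-stage idea: a homogeneous decision tree per construct kind, whose interpretable root-to-leaf paths (the syntax patterns) then enrich the feature set of an enclosing construct's classifier. All feature names here are illustrative, not the paper's.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Stage 1: homogeneous model for one construct kind (expressions),
    # trained on tabularized AST rows labeled expert (1) / novice (0).
    expr_rows = pd.DataFrame({
        "category": ["LambdaExpression", "BinaryOperation",
                     "MethodInvocation", "Literal"],
        "parent": ["MethodInvocation", "Assignment", "Statement", "Assignment"],
        "depth": [5, 3, 2, 3],
        "height": [2, 1, 1, 0],
    })
    expr_labels = [1, 0, 1, 0]
    X_expr = pd.get_dummies(expr_rows, columns=["category", "parent"])
    expr_dt = DecisionTreeClassifier(max_depth=3).fit(X_expr, expr_labels)

    # The root-to-leaf decision paths are the extracted syntax patterns.
    print(export_text(expr_dt, feature_names=list(X_expr.columns)))

    # Stage 2: compound heterogeneous classifier for methods, enriched with
    # a feature derived from the expression model: the share of a method's
    # expressions that the homogeneous model labels as expert.
    exprs_per_method = [[0, 1], [2, 3]]   # expression rows of each method
    method_rows = pd.DataFrame({
        "n_statements": [4, 9],
        "expert_expr_ratio": [expr_dt.predict(X_expr.iloc[idx]).mean()
                              for idx in exprs_per_method],
    })
    method_dt = DecisionTreeClassifier().fit(method_rows, [1, 0])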

The main contributions of this paper are:

  1. A new feature learning approach to classify large amounts of trees made up of compound heterogeneous structures.

  2. A system to classify the programming expertise level of Java developers by analyzing the syntax constructs of their code. The system can also be used to measure the probability of a code fragment being written by an expert or a beginner.

  3. The identification of Java syntax patterns used by both expert and novice programmers.

The rest of this paper is structured as follows. The next section discusses related work, and Section 3 details the proposed system. Section 4 evaluates our system with different experiments, and conclusions and future work are presented in Section 5.

Section snippets

Related work

We discuss the work related to source code classification with syntactic models, according to its objective and the method used.

System architecture

Fig. 2 shows the architecture of our system, and Algorithm 1 details how it works. We first provide a brief high-level description of the modules in the architecture. Forthcoming subsections detail the behavior of each module.

The input of the system is a database of labeled Java programs (expert or beginner); the output is a collection of heterogeneous decision tree models to classify programmers, plus the syntax patterns used by the classifier. Such patterns describe common idioms used by

Evaluation

In this section, we evaluate the performance of the proposed system to label Java programmers according to their expertise level. We first describe the experimental data (Section 4.1) and environment (Section 4.2). Then, we describe and show the results of the following experiments within the framework of the proposed system:

  1. Syntax pattern selection (Section 4.3). This experiment applies the method for pattern selection described in Section 3.3 to reduce the number of syntax patterns taken from

Conclusions

The proposed feature learning approach to classify heterogeneous compound tree structures provides higher accuracy than the existing methods to classify the programming expertise level of Java developers. The compound heterogeneous classifiers are enriched with the most determinant syntax patterns extracted from the homogeneous models, significantly improving the performance of the classifiers. Our system allows classifying different syntax constructs found in Java programs. Decision trees

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work has been partially funded by the Spanish Department of Science, Innovation and Universities: project RTI2018-099235-B-I00. The authors have also received funds from the University of Oviedo through its support to official research groups (GR-2011-0040).


References (46)

  • Lee, S., et al., Mining biometric data to predict programmer expertise and task difficulty, Cluster Comput. (2017)

  • Roy, C.K., et al., Comparison and evaluation of code clone detection techniques and tools: a qualitative approach, Sci. Comput. Program. (2009)

  • Raychev, V., Vechev, M., Krause, A., Predicting program properties from “big code”, in: Proceedings of the 42nd Annual ACM...

  • Ortin, F., et al., Big Code: new opportunities for improving software construction, J. Softw. (2016)

  • Defense Advanced Research Projects Agency, MUSE envisions mining “big code” to improve software reliability and construction (2014)

  • Karaivanov, S., et al., Phrase-based statistical translation of programming languages

  • Yamaguchi, F., et al., Generalized vulnerability extrapolation using abstract syntax trees

  • Levy, D., et al., Learning to align the source code to the compiled object code

  • Escalada, J., et al., An adaptable infrastructure to generate training datasets for decompilation issues

  • Allamanis, M., et al., A survey of machine learning for big code and naturalness, ACM Comput. Surv. (2018)

  • Abu-Naser, S., Predicting learners performance using artificial neural networks in linear programming intelligent tutoring system, Int. J. Artif. Intell. Appl. (2012)

  • Mazinanian, D., et al., Understanding the use of lambda expressions in Java, Proc. ACM Program. Lang. (2017)

  • Java Coding Guidelines (2019)

  • Seidel, E.L., et al., Learning to blame: localizing novice type errors with data-driven diagnosis

  • Cai, H., et al., A comprehensive survey of graph embedding: problems, techniques, and applications, IEEE Trans. Knowl. Data Eng. (2018)

  • Rokach, L., et al., Top-down induction of decision trees classifiers: a survey, IEEE Trans. Syst. Man Cybern. C (2005)

  • Naser, S.A., et al., Human computer interaction design of the LP-ITS: linear programming intelligent tutoring systems, Int. J. Artif. Intell. Appl. (2011)

  • Evans, W.S., Fraser, C.W., Ma, F., Clone detection via structural abstraction, in: 14th Working Conference on Reverse...

  • Mayrand, J., Leblanc, C., Merlo, E., Experiment on the automatic detection of function clones in a software system using...

  • Baxter, I., Yahin, A., Moura, L., Sant’Anna, M., Bier, L., Clone detection using abstract syntax trees, in: Proceedings of...

  • Axivion, Project Bauhaus (2019)

  • Wahler, V., Seipel, D., Gudenberg, J.W.v., Fischer, G., Clone detection in source code by frequent itemset techniques, in:...

  • Koschke, R., Falke, R., Frenzel, P., Clone detection using abstract syntax suffix trees, in: Proceedings of the 13th...

    Francisco Ortin is a Full Professor of the Computer Science Department at the University of Oviedo, Spain. He also works as an Adjunct Lecturer for the Cork Institute of Technology (Ireland). Francisco is the head of the Computational Reflection research group (http://www.reflection.uniovi.es). He received his B.Sc. in Computer Science in 1994, and his M.Sc. in Computer Engineering in 1996. In 2002 he was awarded his Ph.D. entitled A Flexible Programming Computational System developed over a Non-Restrictive Reflective Abstract Machine. He has been the principal investigator of different research projects funded by Microsoft Research, the Spanish Department of Science and Innovation, European Union, and different companies. His main research interests include big code, software development and programming languages. Contact him at http://www.reflection.uniovi.es/ortin.

Oscar Rodriguez-Prieto is a full-time Ph.D. student at the Computer Science Department of the University of Oviedo, Spain. He received his B.Sc. degree in Software Engineering in 2014. In 2015 he was awarded an M.Sc. in Programming Languages and Systems from the Spanish National Distance Education University (UNED). His research interests are focused on using big code to improve software reliability and construction, using machine learning and probabilistic reasoning.

Nicolas Pascual is a student in the M.Sc. degree in Artificial Intelligence at the Polytechnic University of Catalonia (Barcelona). In 2018 he was awarded a B.Sc. in Software Engineering from the University of Oviedo. His research interests include natural language processing, artificial neural networks, big code, computer vision and probabilistic reasoning.

    Miguel Garcia is an Assistant Professor of the Computer Science Department at the University of Oviedo. He received his B.Sc. degree in computer science in 2005. In 2008 he was awarded an M.Sc. in Web Engineering, and an M.Sc. in software engineering research in 2010. In 2013 he presented his Ph.D. dissertation entitled Improving the Runtime Performance and Robustness of Hybrid Static and Dynamic Typing Languages. His research interests include big data, machine learning and programming languages. Contact him at http://www.reflection.uniovi.es/miguel.
