TOWARDS PROGRAMMER KNOWLEDGE PROFILE GENERATION

This article deals with static analysis of Java source code and is intended for readers interested in techniques for evaluating the programming abilities of students or potential job candidates. The main objective of the static analysis is to collect the most relevant and significant data about programmers. If such data is properly visualized, it can result in a knowledge profile which indicates the programmer's actual programming abilities as well as his habits. This can be useful mainly for an impartial observer who does not know the code author. We present our first attempts to create and visualize knowledge profiles through static analysis and statistics on the frequency of language elements. As future work, the conclusion outlines advanced techniques for creating more precise profiles.


INTRODUCTION
In many disciplines, the level of knowledge or skills is a stumbling-block, creating a competitive environment. A similar issue is visible in programming, though the options for evaluating one's skills are quite limited. In this article, we introduce a prototype of a knowledge profile generator based (so far) on static source code analysis. The main interest lies in source code exploration with the aim of objectively evaluating one's actual knowledge and programming abilities, individual progress compared to the past, or possible weaknesses to be addressed.
Both beginners and experienced programmers can benefit from such a knowledge profile. Moreover, profiles can be helpful for lecturers in overall student assessment or in identifying potential shortcomings of a course. Another area that can benefit is the labor market, where profiles offer an adequate evaluation of job candidates. That is, we devote this article to researchers focusing on source code analysis and on the code author(s).
There are a number of tools, both automated and semi-automated, dealing with source code analysis. Mostly, the main objective is to evaluate software security, quality or design, and the main result is a report containing various metrics or graphs. Since such tools deal with code as the final product, they do not focus on its author (programmer).
In the area of software security, some studies detect bugs, defects and other vulnerabilities, e.g. [1] and [2], both performing static analysis of C/C++ source code. Other studies explore static code and identify various bugs as well as bad programming practice [3].
Modern compilers include static analysis tools, usually referring to methods of automated determination of program behavior at compile time. Since traditional tools identify only simple errors, some studies are dedicated to detecting deadlocks [4], while others deal with violations of mutual exclusion in concurrent applications [5].
A technique of program assembling, comparison, and combining, known as abstract interpretation, has been successfully used to derive run-time properties of a program, which are used in source code optimization. Other goals of static analysis mostly include code transformation [6,7], concept location [8,9] or reverse engineering [10].
A method presented in [11] introduced location of computational units via execution profiles, typical for a set of related features. The authors of this study performed concept analysis resulting in detection of the most feature-specific computational units. Combination of these units with static analysis resulted in detection of additional units along with the dependency graph. Moreover, static code analysis has also been the subject of several surveys, e.g. [12] or [13].
Exploration and examination of software repositories formed the research area of mining software repositories (MSR). In the past, MSR examination was focused on industrial systems [14]. However, the popularity of open-source software led to the challenge of better understanding tool development, methods, processes and software evolution [15].
The analysis of metadata differs depending on the particular exploration objectives and software repositories. The main issues include [16]: detection of change patterns, prediction of changes, detection of bugs, analysis of bug-fixing changes, source code exploration, or identification of software developers.
All the mentioned issues have one common objective: to enhance traditional techniques of software engineering towards guiding decision processes in modern software projects [17]. While MSR researchers deal with programming targets (the programming result, i.e. software), our research is dedicated to the source (the software author). Our aims include assessment of the code author, so source code exploration and developer identification, approached in [18], are the most related issues.
In this article we describe the creation of knowledge profiles from programmers' source code, where every profile can be compared with other profiles. In our experiment, we compare the current profile with older profiles, indicating the programmer's improvement. Moreover, we compare a group of different programmers in order to highlight differences in their skills. In our vision, it should also be possible to compare profiles to specific levels of knowledge, e.g. those necessary to perform a specific task.

Fig. 1 Main idea of knowledge profile generator (diagram elements: source code, metrics and exploration, subject profile, comparison report)
We believe that comparison of source codes in the form of knowledge profiles is the main scientific contribution. We perceive knowledge as an option to perform a better analysis and filter any irrelevant data. According to the literature overview and to the best of our knowledge, such a profile-creating tool has not yet been developed.
In the following sections, we define the concept of knowledge profile (sec. 2) and introduce a prototype of the profile generator (sec. 3). The generator is based on static analysis and so far addresses the presented task only partially. It analyses the use of Java language constructs and creates profiles (including visualization) through various metrics and statistics. In sec. 4, we discuss results achieved by the prototype within an experiment performed on student assignments. The concluding remarks in sec. 5 deal mainly with the future version of the profile generator.

PROGRAMMER KNOWLEDGE PROFILE
A knowledge profile delineates skills, abilities and bindings among their elements which are required to perform some task. In general, we admit various degrees of abstraction in a profile definition (knowledge/skills). E.g. (outside the area of programming) to know how to saw, to know how to saw with a chainsaw, to know how to saw with a chainsaw if the wood is of a thinner diameter.
We suppose that it is hard to differentiate knowledge of similar concepts. If so, we rely on understanding the issue in the most usual cases. E.g. (in the area of programming) if a programmer has proved he knows how to work with conditional expressions within if, we can assume that he knows how to work with conditional expressions within while or for.
In our perception, we are able to formally define a knowledge profile and create it implicitly, provided there is sufficient input information. In the area of programming, the input information is represented by source code, and a profile is formally defined over a particular programming language (Java in our case). We differentiate two profiles:
• Subject profile - Expresses what the author of the code (the subject) understands and the range of tasks he is able to solve. In our case, the programmer should know, e.g. how to declare or call a method, and even how to use an annotation [19]. One knowledge (subject) profile is expected to be generated after processing (analyzing) a finite number of source code files.
An object profile can be constructed either manually or automatically, based on completely or partially solved tasks.
Fig. 1 illustrates the main idea. The object profile is optional; however, the language definition and source code are mandatory. If both subject and object profiles are created, we can generate a comparison report.
By creating a subject profile, we can determine whether the programmer has enough knowledge to handle some task, and we can identify any missing knowledge. Since each programming task or its solution is structured, the profile is required to be structured as well.
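The check for missing knowledge can be illustrated by a minimal sketch. It assumes, purely for illustration, that a subject profile is a map from a language construct to its usage count and that an object profile is a set of required constructs; the class and method names are hypothetical and not part of the prototype. A construct that was never used is treated here only as "not demonstrated", which is weaker than "not understood".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: compares a subject profile with the requirements of an object profile. */
final class ProfileComparison {

    /** Returns the required constructs that the subject has not used at all (sorted by name). */
    static Set<String> missingKnowledge(Map<String, Integer> subject, Set<String> required) {
        Set<String> missing = new TreeSet<>();
        for (String construct : required) {
            if (subject.getOrDefault(construct, 0) == 0) {
                missing.add(construct);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative counts only, not data from the experiment.
        Map<String, Integer> subject = new LinkedHashMap<>();
        subject.put("if", 12);
        subject.put("while", 3);
        subject.put("switch", 0);
        Set<String> required = Set.of("if", "while", "switch", "try");
        System.out.println(missingKnowledge(subject, required)); // prints [switch, try]
    }
}
```

An empty result would mean that every required construct occurs at least once in the analyzed source code.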
Currently, results of the profile generator are visualized within a table of data.In later stages of the research we plan its transformation to a graph or a tree containing annotated edges or nodes [19].
In order to analyze the source code, it is possible to use a language parser. Since rules of the language grammar define concepts, we assume that if a programmer (subject) uses particular rules, then he understands the constructs which describe and define the programming language. Source code exploration (Fig. 1) does not require the complete language syntax, only the rules necessary to create a profile. However, the rules should be in a human-interpretable form, i.e. Eq. 1 is better than Eq. 2, since it expresses that in order to understand while, one should understand both expressions and statements.
While → "while" "(" Expression ")" Statement    (1)
A → "while" "(" B ")" C    (2)
Moreover, when generating a complex profile, we cannot rely on the assumption that the programmer understands something after one occurrence. That is, we need to define metrics including both facts and empirical observation. E.g. if one class contains 20 methods, then the understanding may be derived as 10 + 4 × number_of_used_methods, i.e. if the subject has used every method (out of 20) at least once, then 10 + 4 × 20 = 90, the maximum score attainable for this class, so we consider that he fully understands it. We may also assume multiplicity (the more something is used, the more the programmer understands it) or complexity (the longer the code or documentation, the more the programmer understands it). Regarding profile generation, the definition of such metrics is a separate part of our research.
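The example metric above can be sketched in Java as follows. This is only an illustration under our reading of the text: the class name UnderstandingMetric, its method names and the sample method set are hypothetical, and the constants 10 and 4 are the ones used in the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a linear understanding metric: base + weight * usedMethods. */
final class UnderstandingMetric {

    /** Counts how many methods of the class were called at least once. */
    static int distinctUsed(List<String> observedCalls, Set<String> classMethods) {
        Set<String> used = new HashSet<>(observedCalls);
        used.retainAll(classMethods); // ignore calls to methods of other classes
        return used.size();
    }

    /** Linear score; repeated calls of the same method do not increase it. */
    static int score(int base, int weight, int usedMethods) {
        if (usedMethods < 0) {
            throw new IllegalArgumentException("usedMethods must not be negative");
        }
        return base + weight * usedMethods;
    }

    public static void main(String[] args) {
        Set<String> classMethods = Set.of("add", "remove", "size", "clear");
        List<String> calls = List.of("add", "add", "size", "println");
        int used = distinctUsed(calls, classMethods); // add and size, i.e. 2
        System.out.println(score(10, 4, used)); // prints 18
    }
}
```

A multiplicity-based variant would pass the total number of calls instead of the number of distinct methods.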

PROFILE GENERATOR
The prototype of the proposed profile generator processes Java code, counts particular language constructs and generates a profile as a table with summary data. Then, the data is visualized in various forms (currently four), so it is possible to further examine and compare the data:
• Detailed table - For every source code file, it contains the usage frequency of language constructs, divided into logical groups in separate tables, e.g. arithmetic operators.
• Summary table - Contains summary data for all source code files regarding the distribution of language constructs, e.g. arithmetic mean, mode, median, or standard deviation.
• Heat map - For every language construct, this matrix consists of cells colored by occurrence (the darker the color, the higher the frequency). A tooltip window shows additional statistical data.
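The statistics named for the summary table can be computed as in the following sketch. It assumes that the input is the array of occurrence counts of one language construct across all source code files; the class name is hypothetical, the population form of the standard deviation is our assumption, and ties for the mode are resolved in favor of the smallest value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: descriptive statistics over per-file counts of one language construct. */
final class ConstructStatistics {

    static double mean(int[] counts) {
        return Arrays.stream(counts).average().orElse(Double.NaN);
    }

    static double median(int[] counts) {
        if (counts.length == 0) {
            return Double.NaN;
        }
        int[] sorted = counts.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Most frequent value; if several values are equally frequent, the smallest one. */
    static int mode(int[] counts) {
        if (counts.length == 0) {
            throw new IllegalArgumentException("counts must not be empty");
        }
        Map<Integer, Integer> frequency = new HashMap<>();
        int best = counts[0];
        int bestFrequency = 0;
        for (int c : counts) {
            int f = frequency.merge(c, 1, Integer::sum);
            if (f > bestFrequency || (f == bestFrequency && c < best)) {
                best = c;
                bestFrequency = f;
            }
        }
        return best;
    }

    /** Population standard deviation (divides by n, not n - 1). */
    static double standardDeviation(int[] counts) {
        double m = mean(counts);
        double sumOfSquares = 0.0;
        for (int c : counts) {
            sumOfSquares += (c - m) * (c - m);
        }
        return Math.sqrt(sumOfSquares / counts.length);
    }

    public static void main(String[] args) {
        int[] ifCounts = {2, 4, 4, 4, 5, 5, 7, 9}; // illustrative values only
        System.out.println(mean(ifCounts) + " " + median(ifCounts) + " "
                + mode(ifCounts) + " " + standardDeviation(ifCounts));
    }
}
```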
In order to process Java source code, we use ANTLR (a parser generator, [21]), creating tables serialized in JSON format. Results are visualized through a web interface based on the AngularJS framework and the Highcharts library.
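The counting step can be approximated without a parser, as in the sketch below. Note that this is a deliberately simplified stand-in and not the ANTLR-based implementation: it matches whole lower-case words with a regular expression, so keywords inside comments or string literals are counted as well, which a real parser avoids. The class name, the tracked keyword set and the JSON layout are assumptions made for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch: regex-based keyword frequency counter (not a real Java parser). */
final class KeywordCounter {

    // A small, illustrative subset of Java keywords.
    private static final Set<String> TRACKED =
            Set.of("if", "while", "for", "switch", "break", "try", "static", "final");

    private static final Pattern WORD = Pattern.compile("\\b[a-z]+\\b");

    /** Counts occurrences of tracked keywords; the result is sorted by keyword. */
    static Map<String, Integer> count(String source) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = WORD.matcher(source);
        while (matcher.find()) {
            String word = matcher.group();
            if (TRACKED.contains(word)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Serializes the counts as a flat JSON object; keywords need no escaping. */
    static String toJson(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\": " + e.getValue())
                .collect(Collectors.joining(", ", "{", "}"));
    }

    public static void main(String[] args) {
        String source = "static final int N = 3; for (int i = 0; i < N; i++) { if (i == 1) break; }";
        System.out.println(toJson(count(source)));
    }
}
```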
Currently, the profile generator creates simple profiles containing data assembled for a group of source code files created by one programmer (subject profile) or data of a single project (object profile). When comparing various data or programmers, heat maps have proved to be the most useful, since they display a lot of data in one place.

RESULTS
In order to verify the proposed method of knowledge profile generation, we measured the subject's abilities based on his profile and determined whether he is suitable to solve a specific task. A correct determination of what the subject does or does not know may be influenced by the following:
• The subject evaluates himself (through a questionnaire),
• The subject is issued a task and his experience is assessed (question, program fragment, program synthesis),
• The subject profile is generated.
In our case, we decided to verify the proposed approach within the educational process by tracking changes in student profiles. Student assignments are programming projects of similar size within the same domain. We collected assignments of the OOP course, which introduces the Java language as well as the object-oriented paradigm. Besides the fact that students were supposed to work on the same problem, within the experiment we assumed that students (subjects) had a similar level of knowledge. To be more precise, student profiles were compared with lecturer profiles as well.
Fig. 2 displays a comparison of several profiles (part of the overall results). Rows represent different language elements, e.g. break or try. Columns correspond to different programming projects (assignments). Students are labeled by numbers while the teacher is labeled as master. For every source code, the table contains the language element occurrence complemented by statistics (displayed as a tooltip window after pointing to a table cell).
Although the profiles of students and the master are rather similar, a careful reader may notice some notable differences. E.g. student 3 used the highest number of various language elements (also those not used by the master), while student 7 may have encountered difficulties with understanding the principles of the object-oriented paradigm, as the static modifier was used much more often than in other profiles. Some students did not use language elements frequent in other profiles. E.g. student 5 missed switch while student 1 missed float and long. These students either did not understand these types or constructs, or they simply preferred a different solution. That is, in some cases, further exploration of the source code is necessary. On the other hand, some data can indicate likely weaknesses, e.g. student 6 did not use the final modifier, which suggests that he does not appreciate the importance of immutability in programs.
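Observations of this kind (an element frequent in other profiles but missing in one) could be flagged automatically, as in the following sketch. It assumes that each profile is a map from a language element to its count and that "frequent" means used in more than half of the other profiles; both the threshold and the class name are our assumptions, and the sample counts are invented for illustration, not taken from Fig. 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: finds elements one subject never used although most other subjects did. */
final class ProfileOutliers {

    static Set<String> unusedButCommon(Map<String, Map<String, Integer>> profiles, String subject) {
        Map<String, Integer> own = profiles.get(subject);
        if (own == null) {
            throw new IllegalArgumentException("unknown subject: " + subject);
        }
        // Collect every element that occurs in any profile.
        Set<String> allElements = new TreeSet<>();
        for (Map<String, Integer> profile : profiles.values()) {
            allElements.addAll(profile.keySet());
        }
        int others = profiles.size() - 1;
        Set<String> flagged = new TreeSet<>();
        for (String element : allElements) {
            if (own.getOrDefault(element, 0) > 0) {
                continue; // the subject used this element
            }
            int users = 0;
            for (Map.Entry<String, Map<String, Integer>> entry : profiles.entrySet()) {
                if (!entry.getKey().equals(subject) && entry.getValue().getOrDefault(element, 0) > 0) {
                    users++;
                }
            }
            if (users * 2 > others) { // used by more than half of the other profiles
                flagged.add(element);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();
        profiles.put("master", Map.of("switch", 4, "final", 10));
        profiles.put("student 1", Map.of("switch", 2, "final", 3));
        profiles.put("student 5", Map.of("final", 6));
        profiles.put("student 6", Map.of("switch", 1));
        System.out.println(unusedButCommon(profiles, "student 5")); // prints [switch]
        System.out.println(unusedButCommon(profiles, "student 6")); // prints [final]
    }
}
```

A flagged element is only a hint for manual inspection, since the subject may simply have chosen a different solution.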

CONCLUSION REMARKS
We introduced an approach towards the creation of programmer knowledge profiles through a profile generator exploring Java source code. The aim is to create such a profile implicitly. Although this topic is relatively extensive, our analysis revealed that it is little explored. There exists a large variety of potential methods, yet we focused on static analysis, language element frequency and descriptive statistics [22]. The authors of [13] claim that tools based on static analysis create a lot of data. This is why there are three relevant research topics: methods of profile generation, usability of profiles and visualization of profiles.
We also described the profile generator tool and experimentally explored student assignments. The results showed that the tool counts the frequency of particular language elements using descriptive statistics [22] and visualizes the assembled data in various tables, heat maps and whisker plots (available also in JSON meta-form). The statistics can show that the subject is familiar with particular elements, yet this does not necessarily mean he/she is using them correctly. The same applies to unused elements: the fact that the subject did not use a particular element does not mean he/she is not familiar with it. Thus, more appropriate metrics should be proposed, otherwise manual code exploration will always be necessary.
Nevertheless, the presented approach can be applied in the following areas: course or book profile (object profile based on subject profile, selection of the most useful course or book), candidate profile (subject profile based on object profile, what is required to be solved), skills profile and student assessment [23] (object profile based on subject profile, consisting of abilities necessary to solve some task), statistics (subject profile, frequency evaluation of a language construct indicating its difficulty), or complexity profile (complexity evaluation of a language construct or a library). In addition, comparing student profiles with each other may help reveal plagiarism.
Even so, the profile generator has proved to be an interesting tool for assessing programmers, and it is ready to be extended to handle more comprehensive programming projects (e.g. model-driven software development [24]) and to apply advanced metrics. Further research iterations will enhance the tool so that it will be able to identify various patterns in programmer behavior [25], to detect and assess advanced language usage, e.g. programming idioms [26] or nested loops, or to deal with security issues [27]. In addition to language elements, we also plan to track the usage of library classes and methods. Comparison of student profiles will involve multiple master profiles. Future work will also include model-based assessment similar to [5] or reference processing in a programming language [28].

Fig. 2 Heat map: Comparison of student assignments

• Object profile - Expresses what is required to solve a task or tasks. We can define an object profile for an educational course or a programming book, consisting of prerequisites, i.e. what one should already understand before attending the course or reading the book. We can even create a distinguished object profile expressing what one should understand after attending the course or reading the book. In other words, the object profile represents an expected knowledge profile, while the reality may be different. However, if the subject profile is supposed to be general, the object profile becomes optional.