DOI: 10.1145/2884781.2884803

Using (bio)metrics to predict code quality online

Published: 14 May 2016

ABSTRACT

Finding and fixing code quality concerns, such as defects or poor understandability of code, decreases software development and evolution costs. Code reviews are a common industrial practice for identifying code quality concerns early on. While code reviews help to identify problems early, they also impose costs on development and only take place after a code change is already completed. The goal of our research is to automatically identify code quality concerns while a developer is making a change to the code. By using biometrics, such as heart rate variability, we aim to determine the difficulty a developer experiences while working on a part of the code, as well as to identify, and help fix, code quality concerns before they are even committed to the repository.
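
The paper itself contains no code; as a rough, hypothetical illustration of the kind of biometric feature the abstract refers to, the sketch below computes two standard heart rate variability measures, RMSSD and SDNN, from a series of inter-beat (RR) intervals. The function name and the sample data are invented for illustration and are not taken from the study.

```python
import numpy as np

def hrv_features(rr_intervals_ms):
    """Compute two standard heart rate variability (HRV) measures from
    consecutive inter-beat (RR) intervals given in milliseconds."""
    rr = np.asarray(rr_intervals_ms, dtype=float)
    diffs = np.diff(rr)  # differences between successive intervals
    return {
        "RMSSD": np.sqrt(np.mean(diffs ** 2)),  # root mean square of successive differences
        "SDNN": np.std(rr, ddof=1),             # sample standard deviation of all intervals
    }

# Hypothetical beat sequence; higher values indicate higher variability.
print(hrv_features([812, 790, 845, 805, 830, 798]))
```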

In a field study with ten professional developers over a two-week period, we investigated the use of biometrics to determine code quality concerns. Our results show that biometrics are indeed able to predict quality concerns in the parts of the code a developer is working on, improving upon a naive classifier by more than 26% and outperforming classifiers based on more traditional metrics. In a second study with five professional developers from a different country and company, we found evidence that some of the findings from our initial study can be replicated. Overall, the results of the presented studies suggest that biometrics have the potential to predict code quality concerns online and thus lower development and evolution costs.
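
To make the baseline comparison concrete: a naive classifier ignores all features and, for example, always predicts the majority class. The study's actual pipeline is not reproduced here; the following is a minimal sketch of such a comparison, assuming synthetic data, invented feature names, and a scikit-learn random forest, none of which are taken from the paper.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: one row per piece of code a developer worked on,
# with made-up biometric features; label 1 = a quality concern was found.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))  # e.g., RMSSD, SDNN, electrodermal activity
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

naive = DummyClassifier(strategy="most_frequent")  # always predicts majority class
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, clf in [("naive baseline", naive), ("random forest", forest)]:
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.2f}")
```

The synthetic numbers only illustrate the shape of the comparison; the improvement of more than 26% reported above was measured on the study's real field data.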

Published in

ICSE '16: Proceedings of the 38th International Conference on Software Engineering
May 2016, 1235 pages
ISBN: 9781450339001
DOI: 10.1145/2884781
Copyright © 2016 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 276 of 1,856 submissions (15%)
