Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system

Yu, Zhongxing; Martinez, Matias; Danglot, Benjamin; Durieux, Thomas; Monperrus, Martin

doi:10.1007/s10664-018-9619-4

Published: 12 May 2018

Volume 24, pages 33–67, (2019)

Empirical Software Engineering

Zhongxing Yu ORCID: orcid.org/0000-0003-4173-3641¹,
Matias Martinez²,
Benjamin Danglot¹,
Thomas Durieux¹ &
…
Martin Monperrus³

Abstract

Among the many different kinds of program repair techniques, one widely studied family of techniques is called test suite based repair. However, test suites are in essence input-output specifications and are thus typically inadequate for completely specifying the expected behavior of the program under repair. Consequently, the patches generated by test suite based repair techniques can simply overfit to the test suite used, and fail to generalize to other tests. We deeply analyze the overfitting problem in program repair and give a classification of this problem. This classification will help the community to better understand and design techniques to defeat the overfitting problem. We further propose and evaluate an approach called UnsatGuided, which aims to alleviate the overfitting problem for synthesis-based repair techniques with automatic test case generation. The approach uses additional automatically generated tests to strengthen the repair constraint used by synthesis-based repair techniques. We analyze the effectiveness of UnsatGuided: 1) analytically with respect to alleviating two different kinds of overfitting issues; 2) empirically based on an experiment over the 224 bugs of the Defects4J repository. The main result is that automatic test generation is effective in alleviating one kind of overfitting issue, regression introduction, but, due to the oracle problem, has minimal positive impact on alleviating the other kind of overfitting issue, incomplete fixing.
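
The abstract does not spell out how generated tests strengthen the repair constraint, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that generated tests are added one at a time and that a test is discarded when it makes the constraint unsatisfiable (no patch can be synthesized), on the reasoning that such a test probably encodes the buggy behavior. The toy `synthesize` function, the threshold-condition patch space, and all names are hypothetical; a real synthesis-based tool such as Nopol would call a constraint solver instead.

```python
from typing import List, Optional, Sequence, Tuple

# A test is (input, expected truth value of the repaired condition).
Test = Tuple[int, bool]


def synthesize(tests: List[Test], candidates: Sequence[int]) -> Optional[int]:
    """Toy synthesizer (hypothetical): find a threshold t such that
    the condition `x > t` agrees with every test, or None if unsatisfiable."""
    for t in candidates:
        if all((x > t) == expected for x, expected in tests):
            return t
    return None


def unsat_guided(
    manual: List[Test], generated: List[Test], candidates: Sequence[int]
) -> Tuple[Optional[int], List[Test]]:
    """Add generated tests one by one; drop any test that makes the
    repair constraint unsatisfiable (assumed filtering rule)."""
    kept = list(manual)
    if synthesize(kept, candidates) is None:
        return None, kept  # no patch even with the manual tests alone
    for test in generated:
        if synthesize(kept + [test], candidates) is not None:
            kept.append(test)  # consistent with some patch: keep it
        # otherwise the test conflicts with the manual tests: discard it
    return synthesize(kept, candidates), kept


# Assumed scenario: the correct condition is `x > 0`, the buggy one is `x > 5`.
manual_tests = [(10, True), (-3, False), (3, True)]
# Tests generated by observing the buggy program; (4, False) records the bug.
generated_tests = [(-1, False), (0, False), (4, False), (7, True)]

patch, kept_tests = unsat_guided(manual_tests, generated_tests, range(-5, 6))
```

In this toy setting the manual tests alone admit the overfitting threshold `-3`, while the filtered, augmented suite narrows the result to `0`. The generated test `(4, False)` is dropped because it contradicts the manual test `(3, True)`.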

Notes

In this paper, we use “fault” and “bug” interchangeably.
We do not use the techniques that generate assertions from runs of different program versions (Taneja and Xie 2008; Evans and Savoia 2007).
https://github.com/Spirals-Team/test4repair-experiments

References

Almasi MM, Hemmati H, Fraser G, Arcuri A, Benefelds J (2017) An industrial evaluation of unit test generation: Finding real faults in a financial application. In: Proceedings of the 39th International Conference on Software Engineering: Software Engineering in Practice Track, IEEE Press, Piscataway, ICSE-SEIP ’17. https://doi.org/10.1109/ICSE-SEIP.2017.27, pp 263–272
Arcuri A, Briand L (2011) A practical guide for using statistical tests to assess randomized algorithms in software engineering. In: Proceedings of the 33rd International Conference on Software Engineering. ACM, New York, ICSE ’11. https://doi.org/10.1145/1985793.1985795, pp 1–10
B Le TD, Lo D, Le Goues C, Grunske L (2016) A learning-to-rank based fault localization approach using likely invariants. In: Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, New York, ISSTA 2016. https://doi.org/10.1145/2931037.2931049, pp 177–188
Baresi L, Lanzi PL, Miraz M (2010) Testful: an evolutionary test approach for java. In: 2010 Third International Conference on Software Testing, Verification and Validation (ICST). IEEE, pp 185–194
Brumley D, Chiueh T, Johnson R, Lin H, Song D (2007) Rich: Automatically protecting against integer-based vulnerabilities. In: Symposium on Network and Distributed Systems Security
Cadar C, Dunbar D, Engler DR et al (2008) Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs. In: OSDI, vol 8, pp 209–224
Csallner C, Smaragdakis Y (2004) Jcrasher: an automatic robustness tester for java. Softw: Pract Exper 34(11):1025–1050
Durieux T, Cornu B, Seinturier L, Monperrus M (2017) Dynamic patch generation for null pointer exceptions using metaprogramming. In: 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER). https://doi.org/10.1109/SANER.2017.7884635, pp 349–358
Evans RB, Savoia A (2007) Differential testing: A new approach to change detection. In: The 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering: Companion Papers. ACM, New York, ESEC-FSE companion ’07. https://doi.org/10.1145/1295014.1295038, pp 549–552
Fraser G, Arcuri A (2011) Evosuite: automatic test suite generation for object-oriented software. In: Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, New York, NY, USA, ESEC/FSE ’11. https://doi.org/10.1145/2025113.2025179, pp 416–419
Gao Q, Xiong Y, Mi Y, Zhang L, Yang W, Zhou Z, Xie B, Mei H (2015) Safe memory-leak fixing for c programs. In: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. https://doi.org/10.1109/ICSE.2015.64, vol 1, pp 459–470
Godefroid P, Klarlund N, Sen K (2005) Dart: directed automated random testing. In: ACM Sigplan notices, vol 40. ACM, pp 213–223
Goues CL, Nguyen T, Forrest S, Weimer W (2012) Genprog: a generic method for automatic software repair. IEEE Trans Softw Eng 38(1):54–72
Gu Z, Barr ET, Hamilton DJ, Su Z (2010) Has the bug really been fixed?. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. ACM, ICSE ’10. https://doi.org/10.1145/1806799.1806812, pp 55–64
Islam M, Csallner C (2010) Dsc+mock: A test case + mock class generator in support of coding against interfaces. In: Proceedings of the Eighth International Workshop on Dynamic Analysis. ACM, New York, WODA ’10. https://doi.org/10.1145/1868321.1868326, pp 26–31
Jha S, Gulwani S, Seshia SA, Tiwari A (2010) Oracle-guided component-based program synthesis. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. ACM, New York, NY, USA, ICSE ’10. https://doi.org/10.1145/1806799.1806833, pp 215–224
Jones JA, Harrold MJ (2005) Empirical evaluation of the tarantula automatic fault-localization technique. In: Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering. ACM, New York, ASE ’05. https://doi.org/10.1145/1101908.1101949, pp 273–282
Just R, Jalali D, Ernst MD (2014a) Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In: Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), San Jose, pp 437–440
Just R, Jalali D, Inozemtseva L, Ernst MD, Holmes R, Fraser G (2014b) Are mutants a valid substitute for real faults in software testing?. In: FSE 2014, Proceedings of the ACM SIGSOFT 22nd Symposium on the Foundations of Software Engineering, Hong Kong, pp 654–665
Kim D, Nam J, Song J, Kim S (2013) Automatic patch generation learned from human-written patches. In: Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, pp 802–811
Laghari G, Murgia A, Demeyer S (2016) Fine-tuning spectrum based fault localisation with frequent method item sets. In: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, New York, ASE 2016. https://doi.org/10.1145/2970276.2970308, pp 274–285
Le XBD, Chu DH, Lo D, Le Goues C, Visser W (2017a) S3: Syntax- and semantic-guided repair synthesis via programming by examples. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, New York, ESEC/FSE 2017. https://doi.org/10.1145/3106237.3106309, pp 593–604
Le XBD, Thung F, Lo D, Le Goues C (2017b) Overfitting in semantics-based automated program repair
Liu C, Fei L, Yan X, Han J, Midkiff SP (2006) Statistical debugging: A hypothesis testing-based approach. IEEE Trans Softw Eng 32(10):831–848. https://doi.org/10.1109/TSE.2006.105
Liu X, Zeng M, Xiong Y, Zhang L, Huang G (2017) Identifying patch correctness in test-based automatic program repair. arXiv:1706.09120
Long F, Rinard M (2015) Staged program repair with condition synthesis. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, New York, ESEC/FSE 2015, pp 166–178. https://doi.org/10.1145/2786805.2786811
Long F, Rinard M (2016) Automatic patch generation by learning correct code. In: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM, New York, POPL ’16, pp 298–312. https://doi.org/10.1145/2837614.2837617
Long F, Amidon P, Rinard M (2017) Automatic inference of code transforms for patch generation. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, pp 727–739
Martinez M, Monperrus M (2016a) Astor: A program repair library for java (demo). In: Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, New York, ISSTA 2016, pp 441–444. https://doi.org/10.1145/2931037.2948705
Martinez M, Durieux T, Sommerard R, Xuan J, Monperrus M (2017) Automatic repair of real bugs in java: a large-scale experiment on the defects4j dataset. Empir Softw Eng 22(4):1936–1964. https://doi.org/10.1007/s10664-016-9470-4
Mechtaev S, Yi J, Roychoudhury A (2015) Directfix: Looking for simple program repairs. In: Proceedings of the 37th International Conference on Software Engineering. Vol 1, IEEE Press, pp 448–458
Mechtaev S, Yi J, Roychoudhury A (2016) Angelix: scalable multiline program patch synthesis via symbolic analysis. In: Proceedings of the 38th International Conference on Software Engineering. ACM, New York, ICSE ’16, pp 691–701. https://doi.org/10.1145/2884781.2884807
Monperrus M (2017) Automatic Software Repair: a Bibliography. ACM Computing Surveys https://hal.archives-ouvertes.fr/hal-01206501/file/survey-automatic-repair.pdf
Nguyen HDT, Qi D, Roychoudhury A, Chandra S (2013) Semfix: program repair via semantic analysis. In: Proceedings of the 2013 International Conference on Software Engineering, IEEE Press, Piscataway, ICSE ’13, pp 772–781. http://dl.acm.org/citation.cfm?id=2486788.2486890
Pacheco C, Ernst MD (2007) Randoop: feedback-directed random testing for java. In: Companion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion. ACM, pp 815–816
Park S, Hossain B, Hussain I, Csallner C, Grechanik M, Taneja K, Fu C, Xie Q (2012) Carfast: achieving higher statement coverage faster. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering. ACM, pp 35
Pearson S, Campos J, Just R, Fraser G, Abreu R, Ernst MD, Pang D, Keller B (2017) Evaluating and improving fault localization. In: Proceedings of the 39th International Conference on Software Engineering. IEEE Press, Piscataway, ICSE ’17, pp 609–620. https://doi.org/10.1109/ICSE.2017.62
Pei Y, Furia CA, Nordio M, Wei Y, Meyer B, Zeller A (2014) Automated fixing of programs with contracts. IEEE Trans Softw Eng, vol 40. https://doi.org/10.1109/TSE.2014.2312918
Perkins JH, Kim S, Larsen S, Amarasinghe S, Bachrach J, Carbin M, Pacheco C, Sherwood F, Sidiroglou S, Sullivan G, Wong WF, Zibin Y, Ernst MD, Rinard M (2009) Automatically patching errors in deployed software, pp 87–102. https://doi.org/10.1145/1629575.1629585
Prasetya ISWB (2014) T3, a Combinator-Based Random Testing Tool for Java: Benchmarking, Springer International Publishing. Cham, pp 101–110. https://doi.org/10.1007/978-3-319-07785-7_7
Păsăreanu CS, Rungta N (2010) Symbolic pathfinder: symbolic execution of java bytecode. In: Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. ACM, New York, ASE ’10, pp 179–180. https://doi.org/10.1145/1858996.1859035
Qi Y, Mao X, Lei Y, Dai Z, Wang C (2014) The strength of random search on automated program repair. In: Proceedings of the 36th International Conference on Software Engineering. ACM, New York, ICSE 2014, pp 254–265. https://doi.org/10.1145/2568225.2568254
Qi Z, Long F, Achour S, Rinard M (2015) An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In: Proceedings of ISSTA. ACM
Sen K, Marinov D, Agha G (2005) Cute: a concolic unit testing engine for c. In: ACM SIGSOFT Software engineering notes, vol 30. ACM, pp 263–272
Shamshiri S, Just R, Rojas JM, Fraser G, McMinn P, Arcuri A (2015) Do automatically generated unit tests find real faults? An empirical study of effectiveness and challenges (T). In: 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp 201–211. https://doi.org/10.1109/ASE.2015.86
Shaw A, Doggett D, Hafiz M (2014) Automatically fixing c buffer overflows using program transformations. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp 124–135. https://doi.org/10.1109/DSN.2014.25
Smith EK, Barr ET, Le Goues C, Brun Y (2015) Is the cure worse than the disease? overfitting in automated program repair. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, pp 532–543
Taneja K, Xie T (2008) Diffgen: Automated regression unit-test generation. In: 2008 23rd IEEE/ACM International Conference on Automated Software Engineering, pp 407–410
Tian Y, Ray B (2017) Automatically diagnosing and repairing error handling bugs in c. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, New York, ESEC/FSE 2017, pp 752–762. https://doi.org/10.1145/3106237.3106300
Tillmann N, De Halleux J (2008) Pex: White box test generation for .net. In: Proceedings of the 2nd International Conference on Tests and Proofs. Springer-Verlag, Berlin, TAP’08, pp 134–153. http://dl.acm.org/citation.cfm?id=1792786.1792798
Tonella P (2004) Evolutionary testing of classes. SIGSOFT Softw Eng Notes 29(4):119–128. https://doi.org/10.1145/1013886.1007528
Wei Y, Pei Y, Furia CA, Silva LS, Buchholz S, Meyer B, Zeller A (2010) Automated fixing of programs with contracts. In: Proceedings of the 19th International Symposium on Software Testing and Analysis. ACM, New York, ISSTA ’10, pp 61–72. https://doi.org/10.1145/1831708.1831716
Weimer W, Fry ZP, Forrest S (2013) Leveraging program equivalence for adaptive program repair: Models and first results. In: 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp 356–366. https://doi.org/10.1109/ASE.2013.6693094
Xie T (2006) Augmenting automatically generated unit-test suites with regression oracle checking. In: Proceedings of the 20th European Conference on Object-Oriented Programming, Springer-Verlag, Berlin, ECOOP’06. https://doi.org/10.1007/11785477_23
Xin Q, Reiss SP (2017) Identifying test-suite-overfitted patches through test case generation. In: Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, New York, ISSTA 2017, pp 226–236. https://doi.org/10.1145/3092703.3092718
Xiong Y, Wang J, Yan R, Zhang J, Han S, Huang G, Zhang L (2017) Precise condition synthesis for program repair. In: Proceedings of the 39th International Conference on Software Engineering. IEEE Press, Piscataway, ICSE ’17, pp 416–426. https://doi.org/10.1109/ICSE.2017.45
Xuan J, Martinez M, Demarco F, Clément M, Lamelas S, Durieux T, Le Berre D, Monperrus M (2016) Nopol: Automatic repair of conditional statement bugs in java programs. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2016.2560811, https://hal.archives-ouvertes.fr/hal-01285008/document
Yang J, Zhikhartsev A, Liu Y, Tan L (2017) Better test cases for better automated program repair. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, pp 831–841
Yi J, Tan SH, Mechtaev S, Böhme M, Roychoudhury A (2017) A correlation study between automated program repair and test-suite metrics. Empir Softw Eng, pp 1–32. https://doi.org/10.1007/s10664-017-9552-y
Yu Z, Hu H, Bai C, Cai KY, Wong WE (2011) Gui software fault localization using n-gram analysis. In: 2011 IEEE 13th International Symposium on High-Assurance Systems Engineering, pp 325–332. https://doi.org/10.1109/HASE.2011.29
Yu Z, Bai C, Cai KY (2013) Mutation-oriented test data augmentation for gui software fault localization. Inf Softw Technol 55(12):2076–2098. https://doi.org/10.1016/j.infsof.2013.07.004
Yu Z, Bai C, Cai KY (2015) Does the failing test execute a single or multiple faults?: An approach to classifying failing tests. In: Proceedings of the 37th International Conference on Software Engineering - Volume 1, IEEE Press, Piscataway, ICSE ’15, pp 924–935. http://dl.acm.org/citation.cfm?id=2818754.2818866
Yu Z, Martinez M, Danglot B, Durieux T, Monperrus M (2017) Test Case Generation for Program Repair: A Study of Feasibility and Effectiveness. Technical Report 1703.00198v1, ArXiv:1703.00198
Zhang X, Gupta N, Gupta R (2006) Locating faults through automated predicate switching. In: Proceedings of the 28th International Conference on Software Engineering. ACM, New York, ICSE ’06, pp 272–281. https://doi.org/10.1145/1134285.1134324

Author information

Authors and Affiliations

Inria Lille - Nord Europe, Avenue du Halley, 59650, Villeneuve-d’Ascq, France
Zhongxing Yu, Benjamin Danglot & Thomas Durieux
University of Valenciennes, Malvache Building, Campus Mont Houy, 59313, Valenciennes Cedex 9, France
Matias Martinez
School of Computer Science and Communication, KTH Royal Institute of Technology, Stockholm, Sweden
Martin Monperrus

Corresponding author

Correspondence to Zhongxing Yu.

Additional information

Communicated by: David Lo

About this article

Cite this article

Yu, Z., Martinez, M., Danglot, B. et al. Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system. Empir Software Eng 24, 33–67 (2019). https://doi.org/10.1007/s10664-018-9619-4

Issue Date: 15 February 2019
