ABSTRACT
Context: Test-driven development (TDD) is an agile practice claimed to improve both the quality of a software product and the productivity of its developers. A previous study (i.e., the baseline experiment) at the University of Oulu (Finland) compared TDD to a test-last development (TLD) approach through a randomized controlled trial. The results failed to support the claims. Goal: We want to validate the original study's results by replicating it at the University of Basilicata (Italy), using a different design. Method: We replicated the baseline experiment, using a crossover design, with 21 graduate students. We kept the settings and context as close as possible to the baseline experiment. To limit researcher bias, we involved two other sites (UPM, Spain, and Brunel, UK) to conduct a blind analysis of the data. Results: Kruskal-Wallis tests did not show any significant difference between TDD and TLD in terms of testing effort (p-value = .27), external code quality (p-value = .82), or developers' productivity (p-value = .83). Nevertheless, our data revealed a difference based on the order in which TDD and TLD were applied, though no carryover effect. Conclusions: We verify the baseline study's results, yet our results raise concerns regarding the selection of experimental objects, particularly with respect to their interaction with the order in which treatments are applied.
We recommend that future studies survey the tasks used in experiments evaluating TDD. Finally, to lower the cost of replication studies and reduce researcher bias, we encourage other research groups to adopt a multi-site blind analysis approach similar to the one described in this paper.
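The comparison reported in the abstract rests on the Kruskal-Wallis test, a rank-based alternative to one-way ANOVA. As a minimal sketch of how such a test works for two independent groups (where the statistic follows a chi-square distribution with 1 degree of freedom), the pure-Python implementation below ranks the pooled observations and computes the H statistic. The sample scores at the bottom are hypothetical placeholders, not data from the study.

```python
import math

def midranks(values):
    """Rank values 1..N, averaging ranks over ties (mid-ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_two_groups(a, b):
    """Kruskal-Wallis H and p-value for two samples.

    With two groups, df = 1, so the chi-square survival function
    reduces to erfc(sqrt(H/2)). No tie correction is applied.
    """
    n, m = len(a), len(b)
    total = n + m
    ranks = midranks(list(a) + list(b))
    r_a, r_b = sum(ranks[:n]), sum(ranks[n:])
    h = 12 / (total * (total + 1)) * (r_a**2 / n + r_b**2 / m) - 3 * (total + 1)
    h = max(h, 0.0)  # guard against tiny negatives from ties
    p = math.erfc(math.sqrt(h / 2))
    return h, p

# Hypothetical productivity scores for the two treatments.
tdd = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.53]
tld = [0.58, 0.60, 0.49, 0.70, 0.52, 0.64, 0.57]
h, p = kruskal_two_groups(tdd, tld)
print(f"H = {h:.3f}, p = {p:.3f}")
```

A p-value above the chosen alpha level (commonly .05), as in the study's results, means the test fails to detect a significant difference between the treatments.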