Automatic generation of valid and invalid test data for string validation routines using web searches and regular expressions

https://doi.org/10.1016/j.scico.2014.04.008
Under an Elsevier user license
open archive

Highlights

  • An approach for finding valid values for string data types on the Internet.

  • A mutation algorithm for regular expressions to produce invalid values for string data types.

  • A testing procedure to identify program errors using the valid and invalid values.

  • An empirical study of the approach on 24 open source case studies.

  • An analysis of the approach against two contemporary test data generation tools.

Abstract

Classic approaches to automatic input data generation are usually driven by the goal of obtaining program coverage, which requires solving path constraints. Because inputs are generated with respect to the structure of the code, they can be ineffective, difficult for humans to read, and unsuitable for exposing missing implementation. Furthermore, these approaches have known limitations when handling constraints that involve operations on string data types.

This paper presents a novel approach for generating string test data for string validation routines, by harnessing the Internet. The technique uses program identifiers to construct web search queries for regular expressions that validate the format of a string type (such as an email address). It then performs further web searches for strings that match the regular expressions, producing examples of test cases that are both valid and realistic. Following this, our technique mutates the regular expressions to drive the search for invalid strings, and the production of test inputs that should be rejected by the validation routine.
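The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example email pattern, and the single "drop a required literal" mutation are all assumptions chosen for illustration, and the web search step is replaced by a hard-coded list of candidate strings. The abstract does not specify which mutation operators the paper uses.

```python
import re


def identifier_to_query(identifier: str) -> str:
    """Split a camelCase or snake_case identifier into search terms (hypothetical helper)."""
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return " ".join(w.lower() for w in words) + " regular expression"


def drop_literal(pattern: str, literal: str) -> str:
    """Toy mutation operator (an assumption): remove the first occurrence of a literal."""
    return pattern.replace(literal, "", 1)


def invalid_candidates(original: str, mutant: str, strings: list[str]) -> list[str]:
    """Keep strings accepted by the mutant but rejected by the original pattern."""
    orig_re, mut_re = re.compile(original), re.compile(mutant)
    return [s for s in strings if mut_re.fullmatch(s) and not orig_re.fullmatch(s)]


# Step 1: derive a search query from a program identifier.
query = identifier_to_query("validateEmailAddress")

# Step 2: assume the search returned this (simplified) email pattern.
email_pattern = r"[a-z0-9.]+@[a-z0-9]+\.[a-z]{2,}"

# Step 3: stand-in for strings harvested from further web searches.
found = ["alice@example.com", "aliceexample.com", "no spaces here"]
valid = [s for s in found if re.fullmatch(email_pattern, s)]

# Step 4: mutate the pattern and collect strings that only the mutant accepts.
mutant_pattern = drop_literal(email_pattern, "@")
invalid = invalid_candidates(email_pattern, mutant_pattern, found)

print(query)    # search query built from the identifier
print(valid)    # inputs the validation routine should accept
print(invalid)  # inputs the validation routine should reject
```

In this sketch, a string counts as an invalid test input only if it matches the mutated pattern and fails the original one, so the invalid values stay close to the valid format rather than being arbitrary noise.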

The paper presents the results of an empirical study evaluating our approach. The study was conducted on 24 string input validation routines collected from 10 open source projects. While dynamic symbolic execution and search-based testing approaches were only able to generate a very low number of values successfully, our approach generated values with an accuracy of 34% on average for the case of valid strings, and 99% on average for the case of invalid strings. Furthermore, whereas dynamic symbolic execution and search-based testing approaches were only capable of detecting faults in 8 routines, our approach detected faults in 17 out of the 19 validation routines known to contain implementation errors.

Keywords

Test data generation
Web searches
Regular expressions
