There are several tools for generating sample data that matches a given regex. However, while such tools may be sufficient to seed a dataset, they don't help much with testing code that depends on the regex itself, such as validation logic.
Assume you have a code generator that produces a model with a property, and the user specifies a regex to validate that property. Now assume the code generator also tries to generate tests to ensure that validation succeeds and fails appropriately. It seems reasonable for the tool to focus on the boundary cases implied by the regex and avoid generating unnecessary data.
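To make that concrete, the generated artifact might look something like the sketch below (Python is used only for illustration; the class name, property name, and pattern are all hypothetical):

```python
import re


class GeneratedModel:
    """Hypothetical output of the code generator: one regex-validated property."""

    # The pattern the user supplied in the model definition (hypothetical example).
    _CODE_PATTERN = re.compile(r"^[a-z]+$")

    def __init__(self, code: str) -> None:
        self.code = code  # routed through the validating setter below

    @property
    def code(self) -> str:
        return self._code

    @code.setter
    def code(self, value: str) -> None:
        # Generated validation: reject values that do not match the user's regex.
        if not self._CODE_PATTERN.match(value):
            raise ValueError(f"code {value!r} does not match {self._CODE_PATTERN.pattern}")
        self._code = value
```

The open question is which input strings the generator should emit to exercise that validation.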
For example, consider the regex `^([a-z]{3,6})$`. Its boundary cases include:
- any string consisting only of [a-z] with a length of 2 (failure)
- any string consisting only of [a-z] with a length of 3 (success)
- any string consisting only of [a-z] with a length of 4 (success)
- any string consisting only of [a-z] with a length of 5 (success)
- any string consisting only of [a-z] with a length of 6 (success)
- any string consisting only of [a-z] with a length of 7 (failure)
- any string containing no [a-z] characters at all (failure)
- any string not starting with [a-z] but ending with [a-z] (failure)
- any string starting with [a-z] but not ending with [a-z] (failure)
The reason for focusing on boundary cases is that any string consisting only of [a-z] with a length greater than 6 exercises the same upper bound on length defined in the regex, so testing strings of length 7, 8, and 9 is really just testing the same (boundary) condition.
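Written out as tests, the boundary cases above might look like the following sketch (pytest-style parametrization; purely illustrative):

```python
import re

import pytest

PATTERN = re.compile(r"^([a-z]{3,6})$")

# (candidate, expected_to_match) pairs covering the boundary cases listed above.
BOUNDARY_CASES = [
    ("ab", False),       # length 2: just below the lower bound
    ("abc", True),       # length 3: lower bound
    ("abcd", True),      # length 4
    ("abcde", True),     # length 5
    ("abcdef", True),    # length 6: upper bound
    ("abcdefg", False),  # length 7: just above the upper bound
    ("1234", False),     # no [a-z] characters at all
    ("1abc", False),     # does not start with [a-z]
    ("abc1", False),     # does not end with [a-z]
]


@pytest.mark.parametrize("candidate, should_match", BOUNDARY_CASES)
def test_boundary_case(candidate, should_match):
    assert bool(PATTERN.match(candidate)) == should_match
```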
This regex was chosen arbitrarily for its simplicity, but any reasonable regex could act as input.
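As a rough illustration of what such a tool would have to do mechanically, the sketch below extracts the `{m,n}` bounds from a quantified character class and emits strings at the boundary lengths. It is deliberately naive and only understands one hard-coded pattern shape; all names are made up:

```python
import re

# Naive sketch: only understands patterns of the exact shape ^([a-z]{m,n})$.
# A real tool would need a proper regex parser rather than this meta-regex.
_QUANTIFIED_CLASS = re.compile(r"^\^\(\[(?P<cls>[^\]]+)\]\{(?P<lo>\d+),(?P<hi>\d+)\}\)\$$")


def boundary_strings(pattern: str):
    """Yield (candidate, expected_to_match) pairs for the pattern's length boundaries."""
    m = _QUANTIFIED_CLASS.match(pattern)
    if not m:
        raise ValueError(f"unsupported pattern shape: {pattern!r}")
    lo, hi = int(m.group("lo")), int(m.group("hi"))
    # Assumes 'a' is a member of the character class; a fuller tool would
    # derive filler characters from the captured class itself.
    filler = "a"
    for length in range(lo - 1, hi + 2):
        yield filler * length, lo <= length <= hi


for candidate, expected in boundary_strings(r"^([a-z]{3,6})$"):
    print(repr(candidate), expected)
```

A real tool would of course need to handle full regex syntax (alternation, nested groups, anchors, escapes) rather than this single hard-coded shape.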
Does a framework or tool exist that the code generator can use to generate such input strings for test cases across the different layers of the generated system? The test cases really come into their own once the system is no longer regenerated but is modified by hand later in the development cycle.