There are several tools for generating sample data that matches a given regex. However, while such tools may be sufficient to seed a dataset, they don't help much with testing code that depends on the regex itself, such as validation logic.
Assume you have a code generator that produces a model with a property, and the user specifies a regex to validate that property. Now assume the code generator also tries to generate tests to ensure that validation succeeds and fails appropriately. It seems reasonable for the tool to focus on the boundary cases implied by the regex and avoid generating unnecessary data.
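To make that concrete, the generated artifact might look something like the sketch below (Python is used only for illustration; the class name, property name, and pattern are all hypothetical):

```python
import re


class GeneratedModel:
    """Hypothetical output of the code generator: one regex-validated property."""

    # The pattern the user supplied in the model definition (hypothetical example).
    _CODE_PATTERN = re.compile(r"^[a-z]+$")

    def __init__(self, code: str) -> None:
        self.code = code  # routed through the validating setter below

    @property
    def code(self) -> str:
        return self._code

    @code.setter
    def code(self, value: str) -> None:
        # Generated validation: reject values that do not match the user's regex.
        if not self._CODE_PATTERN.match(value):
            raise ValueError(f"code {value!r} does not match {self._CODE_PATTERN.pattern}")
        self._code = value
```

The open question is which input strings the generator should emit to exercise that validation.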
For example, consider the regex `^([a-z]{3,6})$`. Its boundary cases include:
- any string consisting only of [a-z] with a length of 2 (failure)
- any string consisting only of [a-z] with a length of 3 (success)
- any string consisting only of [a-z] with a length of 4 (success)
- any string consisting only of [a-z] with a length of 5 (success)
- any string consisting only of [a-z] with a length of 6 (success)
- any string consisting only of [a-z] with a length of 7 (failure)
- any string containing no [a-z] characters at all (failure)
- any string not starting with [a-z] but ending with [a-z] (failure)
- any string starting with [a-z] but not ending with [a-z] (failure)
The reason for focusing on boundary cases is that any string consisting only of [a-z] with a length greater than 6 exercises the same upper bound on length defined in the regex, so testing strings of length 7, 8, and 9 is really just testing the same (boundary) condition.
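Written out as tests, the boundary cases above might look like the following sketch (pytest-style parametrization; purely illustrative):

```python
import re

import pytest

PATTERN = re.compile(r"^([a-z]{3,6})$")

# (candidate, expected_to_match) pairs covering the boundary cases listed above.
BOUNDARY_CASES = [
    ("ab", False),       # length 2: just below the lower bound
    ("abc", True),       # length 3: lower bound
    ("abcd", True),      # length 4
    ("abcde", True),     # length 5
    ("abcdef", True),    # length 6: upper bound
    ("abcdefg", False),  # length 7: just above the upper bound
    ("1234", False),     # no [a-z] characters at all
    ("1abc", False),     # does not start with [a-z]
    ("abc1", False),     # does not end with [a-z]
]


@pytest.mark.parametrize("candidate, should_match", BOUNDARY_CASES)
def test_boundary_case(candidate, should_match):
    assert bool(PATTERN.match(candidate)) == should_match
```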
This regex was chosen arbitrarily for its simplicity, but any reasonable regex could act as input.
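As a rough illustration of what such a tool would have to do mechanically, the sketch below extracts the `{m,n}` bounds from a quantified character class and emits strings at the boundary lengths. It is deliberately naive and only understands one hard-coded pattern shape; all names are made up:

```python
import re

# Naive sketch: only understands patterns of the exact shape ^([a-z]{m,n})$.
# A real tool would need a proper regex parser rather than this meta-regex.
_QUANTIFIED_CLASS = re.compile(r"^\^\(\[(?P<cls>[^\]]+)\]\{(?P<lo>\d+),(?P<hi>\d+)\}\)\$$")


def boundary_strings(pattern: str):
    """Yield (candidate, expected_to_match) pairs for the pattern's length boundaries."""
    m = _QUANTIFIED_CLASS.match(pattern)
    if not m:
        raise ValueError(f"unsupported pattern shape: {pattern!r}")
    lo, hi = int(m.group("lo")), int(m.group("hi"))
    # Assumes 'a' is a member of the character class; a fuller tool would
    # derive filler characters from the captured class itself.
    filler = "a"
    for length in range(lo - 1, hi + 2):
        yield filler * length, lo <= length <= hi


for candidate, expected in boundary_strings(r"^([a-z]{3,6})$"):
    print(repr(candidate), expected)
```

A real tool would of course need to handle full regex syntax (alternation, nested groups, anchors, escapes) rather than this single hard-coded shape.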
Does a framework or tool exist that the code generator can use to generate such input strings for test cases across the different layers of the generated system? The test cases really come into their own once the system is no longer regenerated but is modified by hand later in the development cycle.