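I'm developing a Java program that searches a dictionary for words made up of a specific set of letters. I'd like to know whether it's possible to write a regular expression that lets you use each character only as often as it appears in a given string. For example, with the letters SHARE, the words hear, hare, sea and so on are all valid. But see or sarah are not, because you only have one e and one a, respectively.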
5 Answers
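Regular expressions are about pattern matching. Finding a simple pattern for this may not be possible.
If you really, really want a regex, these functions will generate one: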
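public static String permutation(String str) {
    // The helper prefixes every alternative with "|"; turn the first "|" into "(".
    return "^" + permutation("", str).replaceFirst("\\|", "(") + ")$";
}

private static String permutation(String prefix, String str) {
    int n = str.length();
    // Base case: every letter has been placed, so emit one finished alternative.
    if (n == 0) return "|" + prefix;
    StringBuilder s = new StringBuilder();
    for (int i = 0; i < n; i++) {
        // Append letter i as optional ("x?") and recurse on the remaining letters.
        s.append(permutation(prefix + str.charAt(i) + "?",
                str.substring(0, i) + str.substring(i + 1, n)));
    }
    return s.toString();
}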
For "share", it returns:
^(s?h?a?r?e?|s?h?a?e?r?|s?h?r?a?e?|s?h?r?e?a?|s?h?e?a?r?|s?h?e?r?a?|s?a?h?r?e?|s?a?h?e?r?|s?a?r?h?e?|s?a?r?e?h?|s?a?e?h?r?|s?a?e?r?h?|s?r?h?a?e?|s?r?h?e?a?|s?r?a?h?e?|s?r?a?e?h?|s?r?e?h?a?|s?r?e?a?h?|s?e?h?a?r?|s?e?h?r?a?|s?e?a?h?r?|s?e?a?r?h?|s?e?r?h?a?|s?e?r?a?h?|h?s?a?r?e?|h?s?a?e?r?|h?s?r?a?e?|h?s?r?e?a?|h?s?e?a?r?|h?s?e?r?a?|h?a?s?r?e?|h?a?s?e?r?|h?a?r?s?e?|h?a?r?e?s?|h?a?e?s?r?|h?a?e?r?s?|h?r?s?a?e?|h?r?s?e?a?|h?r?a?s?e?|h?r?a?e?s?|h?r?e?s?a?|h?r?e?a?s?|h?e?s?a?r?|h?e?s?r?a?|h?e?a?s?r?|h?e?a?r?s?|h?e?r?s?a?|h?e?r?a?s?|a?s?h?r?e?|a?s?h?e?r?|a?s?r?h?e?|a?s?r?e?h?|a?s?e?h?r?|a?s?e?r?h?|a?h?s?r?e?|a?h?s?e?r?|a?h?r?s?e?|a?h?r?e?s?|a?h?e?s?r?|a?h?e?r?s?|a?r?s?h?e?|a?r?s?e?h?|a?r?h?s?e?|a?r?h?e?s?|a?r?e?s?h?|a?r?e?h?s?|a?e?s?h?r?|a?e?s?r?h?|a?e?h?s?r?|a?e?h?r?s?|a?e?r?s?h?|a?e?r?h?s?|r?s?h?a?e?|r?s?h?e?a?|r?s?a?h?e?|r?s?a?e?h?|r?s?e?h?a?|r?s?e?a?h?|r?h?s?a?e?|r?h?s?e?a?|r?h?a?s?e?|r?h?a?e?s?|r?h?e?s?a?|r?h?e?a?s?|r?a?s?h?e?|r?a?s?e?h?|r?a?h?s?e?|r?a?h?e?s?|r?a?e?s?h?|r?a?e?h?s?|r?e?s?h?a?|r?e?s?a?h?|r?e?h?s?a?|r?e?h?a?s?|r?e?a?s?h?|r?e?a?h?s?|e?s?h?a?r?|e?s?h?r?a?|e?s?a?h?r?|e?s?a?r?h?|e?s?r?h?a?|e?s?r?a?h?|e?h?s?a?r?|e?h?s?r?a?|e?h?a?s?r?|e?h?a?r?s?|e?h?r?s?a?|e?h?r?a?s?|e?a?s?h?r?|e?a?s?r?h?|e?a?h?s?r?|e?a?h?r?s?|e?a?r?s?h?|e?a?r?h?s?|e?r?s?h?a?|e?r?s?a?h?|e?r?h?s?a?|e?r?h?a?s?|e?r?a?s?h?|e?r?a?h?s?)$
Obviously this could be simplified and optimized a lot, but it's still not a good idea.
Edit: a function that produces shorter output:
public static String permutation(String str) {
    return "^(" + permutation("", str) + ")$";
}

private static String permutation(String prefix, String str) {
    String s = "";
    int n = str.length();
    if (n == 0) return prefix;
    for (int i = 0; i < n; i++) {
        String rest = str.substring(0, i) + str.substring(i + 1, n);
        if (i != n - 1) {
            // Every letter but the last gets its continuations wrapped in a group.
            s += prefix + str.charAt(i) + "?(" + permutation("", rest) + ")|";
        } else {
            // The last letter's continuations are appended without a group,
            // so any "|" inside them is not scoped to this letter.
            s += prefix + str.charAt(i) + "?" + permutation("", rest);
        }
    }
    return s;
}
It prints:
^(s?(h?(a?(r?(e?)|e?r?)|r?(a?(e?)|e?a?)|e?a?(r?)|r?a?)|a?(h?(r?(e?)|e?r?)|r?(h?(e?)|e?h?)|e?h?(r?)|r?h?)|r?(h?(a?(e?)|e?a?)|a?(h?(e?)|e?h?)|e?h?(a?)|a?h?)|e?h?(a?(r?)|r?a?)|a?(h?(r?)|r?h?)|r?h?(a?)|a?h?)|h?(s?(a?(r?(e?)|e?r?)|r?(a?(e?)|e?a?)|e?a?(r?)|r?a?)|a?(s?(r?(e?)|e?r?)|r?(s?(e?)|e?s?)|e?s?(r?)|r?s?)|r?(s?(a?(e?)|e?a?)|a?(s?(e?)|e?s?)|e?s?(a?)|a?s?)|e?s?(a?(r?)|r?a?)|a?(s?(r?)|r?s?)|r?s?(a?)|a?s?)|a?(s?(h?(r?(e?)|e?r?)|r?(h?(e?)|e?h?)|e?h?(r?)|r?h?)|h?(s?(r?(e?)|e?r?)|r?(s?(e?)|e?s?)|e?s?(r?)|r?s?)|r?(s?(h?(e?)|e?h?)|h?(s?(e?)|e?s?)|e?s?(h?)|h?s?)|e?s?(h?(r?)|r?h?)|h?(s?(r?)|r?s?)|r?s?(h?)|h?s?)|r?(s?(h?(a?(e?)|e?a?)|a?(h?(e?)|e?h?)|e?h?(a?)|a?h?)|h?(s?(a?(e?)|e?a?)|a?(s?(e?)|e?s?)|e?s?(a?)|a?s?)|a?(s?(h?(e?)|e?h?)|h?(s?(e?)|e?s?)|e?s?(h?)|h?s?)|e?s?(h?(a?)|a?h?)|h?(s?(a?)|a?s?)|a?s?(h?)|h?s?)|e?s?(h?(a?(r?)|r?a?)|a?(h?(r?)|r?h?)|r?h?(a?)|a?h?)|h?(s?(a?(r?)|r?a?)|a?(s?(r?)|r?s?)|r?s?(a?)|a?s?)|a?(s?(h?(r?)|r?h?)|h?(s?(r?)|r?s?)|r?s?(h?)|h?s?)|r?s?(h?(a?)|a?h?)|h?(s?(a?)|a?s?)|a?s?(h?)|h?s?)$
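Here's one approach:
- Iterate over your array of strings to build a Multimap<String, String> (if you use the Guava library) or a HashMap<String, List<String>> (with plain java.util), where the key is a word with its letters sorted and the values are the legal words that share that sorted string. This is your preprocessing step, so you only do it once. Because the hash map already exists afterwards, subsequent searches are relatively fast, whereas looping over your whole dictionary to match some regex on every search is much slower.
- Sort your search string and find all subsets of its letters, that is, all subsequences of the sorted string.
- Iterate over the sorted subsets and look each one up in the HashMap or Multimap to get the values for that sorted string. Collect all the values and you have your answer.
I think the problem here is that a regex doesn't fit what you're describing, because you would still have to go through the entire dictionary (already stored as an array) for every search. If you build the hash map instead (that step is relatively expensive), each search only loops over the list of sorted subsets (which is cheap).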
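The three steps above could be sketched roughly like this with plain java.util. The class name AnagramIndex and its methods are made up for illustration, and the subset enumeration uses an int bitmask, so it assumes the search string has fewer than 31 letters:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class: maps "letters sorted" -> dictionary words.
class AnagramIndex {
    private final Map<String, List<String>> index = new HashMap<>();

    // Preprocessing step, done once: group dictionary words by their sorted letters.
    AnagramIndex(List<String> dictionary) {
        for (String word : dictionary) {
            index.computeIfAbsent(sortLetters(word), k -> new ArrayList<>()).add(word);
        }
    }

    static String sortLetters(String s) {
        char[] chars = s.toLowerCase().toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Looks up every non-empty subset of the given letters in the index.
    Set<String> search(String letters) {
        String sorted = sortLetters(letters);
        int n = sorted.length(); // assumption: n < 31 so the bitmask fits in an int
        Set<String> result = new TreeSet<>();
        for (int mask = 1; mask < (1 << n); mask++) {
            // Picking characters in index order keeps each subset key sorted.
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) key.append(sorted.charAt(i));
            }
            List<String> words = index.get(key.toString());
            if (words != null) result.addAll(words);
        }
        return result;
    }

    public static void main(String[] args) {
        AnagramIndex idx = new AnagramIndex(
                Arrays.asList("hear", "hare", "sea", "see", "sarah", "share"));
        System.out.println(idx.search("share")); // prints [hare, hear, sea, share]
    }
}
```

The number of subsets doubles with every extra letter, so this only stays cheap for short search strings; the result set also absorbs the duplicate keys that repeated letters produce.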
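If no letter occurs twice in the word, as is the case for share, you can use
^(?!.*([share]).*\\1)[share]+$
This matches any word made up of some or all of the letters in share.
If a letter occurs more than once, the negative lookahead (?!...), which contains a backreference \\1 to whatever the parenthesized group matched, prevents the match. The leading .* inside the lookahead lets it test every letter of the word, not just the first one.
You can extend this principle to handle words that contain letters occurring more than once.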
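To try the idea from Java, a small test harness might look like the following. The class name UniqueLetterRegex is made up, and the pattern is written with a leading .* inside the lookahead so that the duplicate check scans the whole word:

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the backreference lookahead.
class UniqueLetterRegex {
    // The lookahead rejects any letter from the set that occurs twice anywhere in the word;
    // [share]+ then requires that only letters from the set are used.
    static final Pattern SHARE = Pattern.compile("^(?!.*([share]).*\\1)[share]+$");

    static boolean matches(String word) {
        return SHARE.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"hear", "hare", "sea", "see", "sarah"}) {
            System.out.println(w + " -> " + matches(w));
        }
    }
}
```

With this pattern hear, hare and sea are accepted, while see and sarah are rejected because of the repeated e and a.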
OK, here's an example of how to do this. However, you should read up on catastrophic backtracking:
^(?!.*s.*s)(?!.*h.*h)(?!.*a.*a)(?!.*r.*r)(?!.*e.*e)(?!.*[^share]).*$
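If you want to allow two of the letter "s", as in shares, so that a word like sash is accepted, you can do this:
^(?!.*s.*s.*s)(?!.*h.*h)(?!.*a.*a)(?!.*r.*r)(?!.*e.*e)(?!.*[^share]).*$
The idea is that fewer than three "s" in the word is OK...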
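Writing these lookaheads by hand gets tedious, so here is a rough sketch of generating them from the letter set. LookaheadRegexBuilder and build are made-up names, and the code assumes the letters are plain a-z so nothing needs regex escaping. For each letter that is available k times it emits a lookahead forbidding k+1 occurrences, then one more lookahead that rejects any character outside the set:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical builder for the "one negative lookahead per letter" pattern.
class LookaheadRegexBuilder {
    // Assumption: letters contains only a-z, so no regex escaping is needed.
    static Pattern build(String letters) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : letters.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder("^");
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            // (?!(?:.*x){k+1}) fails the match if x occurs more often than it is available.
            sb.append("(?!(?:.*").append(e.getKey()).append("){")
              .append(e.getValue() + 1).append("})");
        }
        // Reject any character that is not in the letter set at all.
        sb.append("(?!.*[^").append(letters).append("]).*$");
        return Pattern.compile(sb.toString());
    }

    public static void main(String[] args) {
        Pattern p = build("shares");
        for (String w : new String[] {"sash", "hears", "sashes", "see"}) {
            System.out.println(w + " -> " + p.matcher(w).matches());
        }
    }
}
```

For "shares" this accepts sash and hears but rejects sashes (three s) and see (two e). Like the hand-written version, it also matches the empty string, and the backtracking caveat still applies.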
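An approach that doesn't use pattern matching but gets to the root of the problem is to create an array holding the count of each character in the target word: "deaf" would be the array (1,0,0,1,1,1,0,0,...).
Then, as you iterate over the dictionary, build the same kind of array for each word and subtract it from the target word's array. If the difference array contains any negative value, the word cannot be formed from the letters of the target word.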
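A minimal sketch of this counting idea, assuming lowercase a-z words (the names LetterCounts, count and canBuild are invented for the example). Instead of building a second array and subtracting afterwards, it decrements a copy of the target counts letter by letter and stops as soon as a value would go negative, which amounts to the same check:

```java
import java.util.Locale;

// Hypothetical helper for the letter-count comparison.
class LetterCounts {
    // Counts a-z occurrences; any other character is ignored (an assumption of this sketch).
    static int[] count(String letters) {
        int[] counts = new int[26];
        for (char c : letters.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') counts[c - 'a']++;
        }
        return counts;
    }

    // True if word can be spelled using each available letter at most as often as it is counted.
    static boolean canBuild(String word, int[] available) {
        int[] remaining = available.clone();
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            // A character outside a-z, or a count dropping below zero, rules the word out.
            if (c < 'a' || c > 'z' || --remaining[c - 'a'] < 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int[] share = count("share");
        for (String w : new String[] {"hear", "hare", "sea", "see", "sarah"}) {
            System.out.println(w + " -> " + canBuild(w, share));
        }
    }
}
```

The target array is computed once, and each dictionary word costs one pass over its letters.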