regex - Matching two strings where some of the text is optional?

Question

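I'm trying to write a simple Java function that takes a list of language inputs and checks whether the result of a database query matches it. All the strings in my database are normalized to make searching easy. Here is an example.

Research Lab A wants participants with any of the following language inputs (separated by the pipe character, |):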

{English | English, Spanish | Spanish}

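In other words, this lab accepts participants who are monolingual English, monolingual Spanish, or bilingual English and Spanish. This part is straightforward: if the database result for a participant is "English", "English, Spanish", or "Spanish", my function finds a match.

However, my database also flags whether a participant receives only minimal input in a language (using the ~ character).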

"English, ~Spanish" = participant hears English and a little Spanish
"English, ~Spanish, Russian" = participant hears English, Russian, and a little Spanish

This is where I'm running into trouble. I want something like "English, ~Spanish" to match both "English" and "English, Spanish".

I was thinking of removing/hiding the languages marked with ~, but then if a research lab wants only {English, Spanish}, "English, ~Spanish" would not match even though it should.
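To make the problem concrete, here is a minimal sketch of the "just drop the ~ languages" idea. The helper name dropMinimal and the lab lists are hypothetical, made up only for illustration; entries are assumed to be comma-separated as in the examples above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DropMinimalDemo {
    // Hypothetical helper: removes every language marked with '~' from a normalized entry.
    static String dropMinimal(String entry) {
        return Arrays.stream(entry.split(",\\s*"))
                .filter(lang -> !lang.startsWith("~"))
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        List<String> labA = Arrays.asList("English", "English, Spanish", "Spanish");
        List<String> labB = Arrays.asList("English, Spanish"); // a lab that wants bilinguals only

        String stripped = dropMinimal("English, ~Spanish");    // becomes "English"
        System.out.println(labA.contains(stripped)); // true
        System.out.println(labB.contains(stripped)); // false, although it should match
    }
}
```

Dropping the marked language works for Lab A but loses the information the second lab needs, which is why stripping alone is not enough.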

I also can't figure out how to do this with a regular expression. Any help would be much appreciated!

score 1 · Accepted Answer

Try this:

\b(English[, ~]+Spanish|Spanish|English)\b

Code

try {
    if (subjectString.matches("(?im)\\b(English[, ~]+Spanish|Spanish|English)\\b")) {
        // String matched entirely
    } else {
        // Match attempt failed
    } 
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
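As a quick check of how this behaves, here is a small self-contained sketch that runs the same pattern over a few sample inputs. The class name, the accepts helper, and the sample list are illustrative assumptions, not part of the original code.

```java
import java.util.regex.Pattern;

class LanguageMatchDemo {
    // Same pattern as above, compiled once; CASE_INSENSITIVE stands in for the (?i) flag.
    static final Pattern LAB_A = Pattern.compile(
            "\\b(English[, ~]+Spanish|Spanish|English)\\b", Pattern.CASE_INSENSITIVE);

    // True only when the whole input is one of the accepted combinations,
    // because Matcher.matches() must consume the entire string.
    static boolean accepts(String input) {
        return LAB_A.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "English", "Spanish", "English, Spanish",
            "English, ~Spanish", "English, ~Spanish, Russian", "Russian"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

The first four samples are accepted. "English, ~Spanish, Russian" is rejected because matches() requires the entire string to match, so the extra language makes the attempt fail; "Russian" is rejected as well.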

Explanation

"\\b" +               // Assert position at a word boundary
"(" +                // Match the regular expression below and capture its match into backreference number 1
                        // Match either the regular expression below (attempting the next alternative only if this one fails)
      "English" +          // Match the characters “English” literally
      "[, ~]" +            // Match a single character present in the list “, ~”
         "+" +                // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      "Spanish" +          // Match the characters “Spanish” literally
   "|" +                // Or match regular expression number 2 below (attempting the next alternative only if this one fails)
      "Spanish" +          // Match the characters “Spanish” literally
   "|" +                // Or match regular expression number 3 below (the entire group fails if this one fails to match)
      "English" +          // Match the characters “English” literally
")" +
"\\b"                 // Assert position at a word boundary

Update

A more general form would be:

(?-i)\b([A-Z][a-z]+[, ~]+[A-Z][a-z]+|[A-Z][a-z]+)\b

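By the way, this can trip you up, because the pattern will match any capitalized word. A better option may be to use this syntax when generating the regex pattern: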

(A[, ~]+B|A|B)

where A and B are the names of the languages. I think that would be a better approach.
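A minimal sketch of generating that pattern in Java might look like the following. The class and method names are hypothetical, and Pattern.quote is added here as a precaution in case a language name ever contains a regex metacharacter.

```java
import java.util.regex.Pattern;

class LanguagePairPattern {
    // Builds (A[, ~]+B|A|B) for two language names.
    // Pattern.quote makes each name match literally.
    static Pattern forPair(String a, String b) {
        String qa = Pattern.quote(a);
        String qb = Pattern.quote(b);
        return Pattern.compile("(" + qa + "[, ~]+" + qb + "|" + qa + "|" + qb + ")");
    }

    public static void main(String[] args) {
        Pattern p = forPair("English", "Spanish");
        System.out.println(p.matcher("English, ~Spanish").matches()); // true
        System.out.println(p.matcher("English").matches());           // true
        System.out.println(p.matcher("French").matches());            // false
    }
}
```

Like the pattern it generates, this sketch is order-sensitive: it assumes A is listed before B in the database string, so "Spanish, English" would not match a pattern built with forPair("English", "Spanish").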
