regex - Regex returns the words between phrases as another quoted phrase

Question

Here is my regex...

(?<=")[^"]+(?=")|[-+@]?([\w]+(-*\w*)*)

Here is my test string...

"@One one" @two three four "fi-ve five" six se-ven "e-ight" "nine n-ine nine"

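I don't want the double quotes returned in the results, but excluding them seems to make the regex return the text between two quoted phrases as a quoted phrase in its own right. This is the current result (not including the single quotes)...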

'@One one'
' @two three four '
'fi-ve five'
' six se-ven '
'e-ight'
' '
'nine n-ine nine'

What I really want is for it to return these as separate results (not including the single quotes)...

'@One one'
'@two'
'three'
'four'
'fi-ve five'
'six'
'se-ven'
'e-ight'
'nine n-ine nine'

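Any ideas on how to make the double quotes apply only to the phrases themselves, and not to the words between the quoted phrases? Thanks.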

score 1 · Accepted Answer

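The problem you're running into is that regular expressions have no "memory": they can't remember whether the last quote mark was an opening or a closing one (this is the same reason regex isn't suited to parsing HTML/XML). However, if you can assume that the quotes follow the standard convention, where there is no space between a quote mark and the text it quotes (and, conversely, a word separated from an adjacent quote mark by a space is not part of a quotation), then you can use the negative lookarounds (?!\s) and (?<!\s) to make sure there is no whitespace in those places: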

(?<=")(?!\s)[^"]+(?<!\s)(?=")|[-+@]?([\w]+(-*\w*)*)

To clarify what the assumption is (using underscores to mark the spaces in question):

"This is a quote"_this text is not a quote_"another quote"
^               ^ ^                      ^ ^             ^
  no space here   |                      |    none here
  between word    ⌞  but there is here   ⌟
  and mark
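
As a rough illustration of that assumption, here is a minimal Ruby sketch. The pattern is the one given above, except that its groups are made non-capturing (an adjustment made only for this sketch) so that `String#scan` returns whole matches instead of group contents. The constant and method names are hypothetical.

```ruby
# Pattern from the answer, with non-capturing groups so that
# String#scan returns the full match for each hit.
QUOTE_OR_WORD = /(?<=")(?!\s)[^"]+(?<!\s)(?=")|[-+@]?(?:[\w]+(?:-*\w*)*)/

# Hypothetical helper: list quoted phrases and bare words in order.
def quotes_and_words(text)
  text.scan(QUOTE_OR_WORD)
end

# Spaces sit outside the quote marks, so the assumption holds.
TIGHT = '"This is a quote" this text is not a quote "another quote"'
p quotes_and_words(TIGHT)
# => ["This is a quote", "this", "text", "is", "not", "a", "quote", "another quote"]

# Spaces sit inside the quote marks, so the phrase is not treated as quoted.
PADDED = 'say " padded " now'
p quotes_and_words(PADDED)
# => ["say", "padded", "now"]
```

The second call shows the cost of the assumption: a phrase padded with spaces inside its quote marks falls back to being matched word by word.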

Edit: Also, you can simplify the regex by removing the groups and using a character class:

(?<=")(?!\s)[^"]+(?<!\s)(?=")|[-+@]?[\w]+[-\w]*

This makes it (for me, anyway) easier to get at the results:

>> str = "\"@One one\" @two three four \"fi-ve five\" six se-ven \"e-ight\" \"nine n-ine nine\""
=> "\"@One one\" @two three four \"fi-ve five\" six se-ven \"e-ight\" \"nine n-ine nine\""
>> rex = /(?<=")(?!\s)[^"]+(?<!\s)(?=")|[-+@]?[\w]+[-\w]*/
=> /(?<=")(?!\s)[^"]+(?<!\s)(?=")|[-+@]?[\w]+[-\w]*/
>> str.scan rex
=> ["@One one", "@two", "three", "four", "fi-ve five", 
    "six", "se-ven", "e-ight", "nine n-ine nine"]

score 0 · Accepted Answer

Description

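This isn't perfect, because capture group 0 does contain matches that include the leading/trailing whitespace and the quotes, but capture group 1 gets the text inside the quotes and capture group 2 gets the individual words. It works regardless of the whitespace around the individual quotes.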

(?!\Z)(?:\s*"([^"]*)"|\s*(\S*))


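Example

Live example: http://www.rubular.com/r/HrHJIlMieb

Sample Text

Note the potentially difficult edge case between five and six, where no space follows the closing quote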

"@One one" @two three four "fi-ve five"six se-ven "e-ight" "nine n-ine nine"

Capture Groups

[0] => Array
    (
        [0] => "@One one"
        [1] =>  @two
        [2] =>  three
        [3] =>  four
        [4] =>  "fi-ve five"
        [5] =>  six
        [6] =>  se-ven
        [7] =>  "e-ight"
        [8] =>  "nine n-ine nine"
    )

[1] => Array
    (
        [0] => @One one
        [1] => 
        [2] => 
        [3] => 
        [4] => fi-ve five
        [5] => 
        [6] => 
        [7] => e-ight
        [8] => nine n-ine nine
    )

[2] => Array
    (
        [0] => 
        [1] => @two
        [2] => three
        [3] => four
        [4] => 
        [5] => six
        [6] => se-ven
        [7] => 
        [8] => 
    )
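
Since the live example is on Rubular, here is a minimal Ruby sketch of how groups 1 and 2 might be merged into one flat list. The regex is the one from this answer; the helper and constant names are hypothetical. It relies on `String#scan` returning one `[group1, group2]` pair per match when the pattern has capture groups, with `nil` for the group that did not take part.

```ruby
# Pattern from this answer: group 1 is the text inside quotes,
# group 2 is a bare word.
PHRASE_OR_WORD = /(?!\Z)(?:\s*"([^"]*)"|\s*(\S*))/

# Hypothetical helper: keep whichever group matched for each hit.
def phrases_and_words(text)
  text.scan(PHRASE_OR_WORD).map { |quoted, word| quoted || word }
end

# Sample text from above, including the "fi-ve five"six edge case.
SAMPLE = '"@One one" @two three four "fi-ve five"six se-ven "e-ight" "nine n-ine nine"'
p phrases_and_words(SAMPLE)
# => ["@One one", "@two", "three", "four", "fi-ve five", "six", "se-ven", "e-ight", "nine n-ine nine"]
```

Note that group 2 uses `\S*`, so input with trailing whitespace could yield an empty string as the last item; filtering empties out afterwards may be worthwhile.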

score 0 · Accepted Answer

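Your code works when you search for one thing at a time. I'm not sure what context this is being used in, but you could turn off any global flag so that it only matches the first occurrence. Then cut the front of the string off and run it again, and so on.

Edit: Does it matter what order you get them in? How about two separate regexes?

First: "([^"]*)"

This will match all the quoted strings you want to keep. By doing a regex replace with a capture, you can capture them all and replace them with an empty string.

Second: just match every word that is left over afterwards.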
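
A minimal Ruby sketch of the two-pass idea, under a few assumptions: the helper name is hypothetical, the word pattern is borrowed from the simplified regex in the first answer, and each quoted phrase is replaced with a single space rather than an empty string (a small deviation, so that words on either side of a phrase cannot fuse together). The results come back as two separate lists, so the original ordering is lost.

```ruby
# Hypothetical helper: returns [quoted_phrases, remaining_words].
def split_phrases_and_words(text)
  # First pass: capture the text inside every pair of double quotes.
  phrases = text.scan(/"([^"]*)"/).flatten

  # Remove the quoted phrases; a space is left in place of each one
  # so that neighbouring words stay separated.
  remainder = text.gsub(/"[^"]*"/, " ")

  # Second pass: match every word that is left over.
  words = remainder.scan(/[-+@]?\w+[-\w]*/)

  [phrases, words]
end

INPUT = '"@One one" @two three four "fi-ve five" six se-ven "e-ight" "nine n-ine nine"'
phrases, words = split_phrases_and_words(INPUT)
p phrases # => ["@One one", "fi-ve five", "e-ight", "nine n-ine nine"]
p words   # => ["@two", "three", "four", "six", "se-ven"]
```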
