python-3.x - Python regex - Recognizing words with a preceding symbol

Question

I am trying to use re.split() with the following regular expression to split a target sentence into its component pieces for later use:

(@?\w+)(\W+)

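Ideally, this would split the sentence into words and non-word characters, keeping both as separate list items, with one exception: an "@" symbol directly in front of a word. If a word is preceded by an @ symbol, I want to keep the two together as a single item in the result. My example is below.

My test sentence is:

this is a test of proper nouns @Ryan

So the line of code is:

re.split(r'(@?\w+)(\W+)', "this is a test of proper nouns @Ryan")

The list I want would contain "@Ryan" as a single item, but instead it looks like this: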

['', 'this', ' ', '', 'is', ' ', '', 'a', ' ', '', 'test', ' ', '', 'of', ' ', '', 'proper', ' ', '', 'nouns', ' @', 'Ryan']

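Since the first group contains the @ symbol, I expected it to be evaluated first, but apparently that is not the case. I have tried using lookaheads and excluding @ from the \W+ group, to no avail.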

https://regex101.com/r/LeezvP/1

score 4 · Accepted Answer

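With your shown sample, could you please try the following (written and tested in Python 3.8.5). Note that you will need to remove the empty items from the resulting list. This gives output in which @ stays together with its word.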

import re

# First split the text and save the result to a list named li.
li = re.split(r'(@?\w+)(?:\s+)', "this is a test of proper nouns @Ryan")
li
['', 'this', '', 'is', '', 'a', '', 'test', '', 'of', '', 'proper', '', 'nouns', '@Ryan']

# Use filter to remove the empty strings from li.
list(filter(None, li))
['this', 'is', 'a', 'test', 'of', 'proper', 'nouns', '@Ryan']

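Brief explanation: re.split is called with one capturing group, which matches an optional @ followed by word characters, and one non-capturing group, which matches one or more whitespace characters. Because the whitespace group is non-capturing, the spaces are dropped, but re.split still leaves empty strings between adjacent matches, so filter is used to remove them.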
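The split-then-filter steps above can be wrapped in a small helper. This is only a sketch: the function name split_keep_mentions is hypothetical, and it assumes, as in the sample sentence, that words are separated by whitespace only (a word followed directly by punctuation would not be split off by this pattern).

```python
import re

def split_keep_mentions(text):
    """Split on the whitespace after each word, keeping '@word' as one item.

    Hypothetical helper name; assumes whitespace-separated words.
    """
    # The capturing group keeps each word (with an optional leading @);
    # the non-capturing group consumes the whitespace that follows it.
    parts = re.split(r'(@?\w+)(?:\s+)', text)
    # re.split leaves '' between adjacent matches; filter(None, ...) drops them.
    return list(filter(None, parts))

print(split_keep_mentions("this is a test of proper nouns @Ryan"))
# ['this', 'is', 'a', 'test', 'of', 'proper', 'nouns', '@Ryan']
```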

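Note: per OP's comment, the whitespace items may need to be kept in the result. In that case the following code can be used instead; it worked for OP: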

li=re.split(r'(@?\w+)(\s+|\W+)', "this is a test of proper nouns @Ryan")
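For reference, here is a sketch of what that call returns for the sample sentence once the empty strings are removed. It appears to work because the \s+ alternative is tried first, so the separator group takes only the space and leaves the @ for the next word. The variable names text and tokens are illustrative.

```python
import re

text = "this is a test of proper nouns @Ryan"

# Both groups are capturing here, so the separators are kept in the result.
li = re.split(r'(@?\w+)(\s+|\W+)', text)

# Drop only the empty strings; the single-space separators are kept.
tokens = [item for item in li if item]
print(tokens)
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'of', ' ', 'proper', ' ', 'nouns', ' ', '@Ryan']
```

Since nothing but empty strings is discarded, joining tokens gives back the original sentence.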

score 2 · Accepted Answer

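You could also match the parts instead of splitting: use re.findall with an alternation (|) whose branches match each kind of piece you want.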

(?:[^@\w\s]+|@(?!\w))+|\s+|@?\w+

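Explanation

(?: Non-capturing group
- [^@\w\s]+ Match 1+ occurrences of any character except @, a word character, or a whitespace character
- | Or
- @(?!\w) Match @ when it is not directly followed by a word character
)+ Close the group and repeat it 1+ times
| Or
\s+ Match 1+ whitespace characters, so that they are kept as separate matches in the result
| Or
@?\w+ Match an optional @ directly followed by 1+ word characters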
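The breakdown above can also be written directly into the pattern using re.VERBOSE, which lets whitespace and comments sit inside the regex. This is a sketch of the same pattern, not a different technique; the names TOKEN and tokenize are hypothetical.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
TOKEN = re.compile(r"""
    (?:
        [^@\w\s]+     # symbols other than @, word chars, and whitespace
      | @(?!\w)       # an @ that is not directly followed by a word char
    )+                # repeat the group 1+ times
  | \s+               # whitespace, kept as its own match
  | @?\w+             # a word with an optional leading @
""", re.VERBOSE)

def tokenize(text):
    """Return every piece of text as a list of matches (hypothetical helper)."""
    return TOKEN.findall(text)

print(tokenize("this is a test of proper nouns @Ryan"))
```

Every character falls into one of the three branches, so joining the matches should reproduce the input string unchanged.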

Regex demo

Example

import re

pattern = r"(?:[^@\w\s]+|@(?!\w))+|\s+|@?\w+"

print(re.findall(pattern, "this is a test of proper nouns @Ryan"))

# Output
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'of', ' ', 'proper', ' ', 'nouns', ' ', '@Ryan']

print(re.findall(pattern, "this @Ryan #$@test@123@4343@@$%$test@1#$#$@@@1@@@@"))

# Output
# ['this', ' ', '@Ryan', ' ', '#$', '@test', '@123', '@4343', '@@$%$', 'test', '@1', '#$#$@@', '@1', '@@@@']

score 1 · Accepted Answer

The regex @?\w+|\b(?!$) should meet your requirements.

Explanation from regex101:

1st Alternative @?\w+
    @ matches the character @ literally (case sensitive)
    ? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
    \w matches any word character (equivalent to [a-zA-Z0-9_])
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative \b(?!$)
    \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
    Negative Lookahead (?!$)
        Assert that the Regex below does not match
        $ asserts position at the end of a line
