python - Python regex for repeated punctuation and symbols

Question

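I need a regex that matches repeated (more than one) punctuation and symbols. Basically all repeated non-alphanumeric and non-whitespace characters such as ..., ???, !!!, ###, @@@, +++ and so on. It must be the same character repeated, so not a sequence like "!?@".

I tried [^\s\w]+ and while it covers all of the !!!, ???, $$$ cases, it gives me more than I want, since it also matches "!?@".

Can someone enlighten me, please? Thanks.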

score 2 · Accepted Answer

I think you're looking for something like this:

[run for run, leadchar in re.findall(r'(([^\w\s])\2+)', yourstring)]

Example:

In : teststr = "4spaces    then(*(@^#$&&&&(2((((99999****"

In : [run for run, leadchar in re.findall(r'(([^\w\s])\2+)',teststr)]
Out: ['&&&&', '((((', '****']

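This gives you a list of the runs, excluding the 4 spaces in that string as well as mixed sequences like '*(@^'.

If that's not exactly what you want, you could edit your question with an example string and the exact output you want to see.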
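For what it's worth, here is a small sketch of why the list comprehension above unpacks two values: with more than one group in the pattern, re.findall() returns a tuple per match. The name RUN_PATTERN is just an illustrative choice, and the finditer() variant is an alternative way to get the same runs, not part of the original answer.

```python
import re

# Same pattern as in the answer: group 2 captures one character that is
# neither a word character nor whitespace, \2+ requires at least one more
# copy of it, and group 1 wraps the whole run.
RUN_PATTERN = re.compile(r'(([^\w\s])\2+)')

teststr = "4spaces    then(*(@^#$&&&&(2((((99999****"

# With two groups, findall() returns (run, leadchar) tuples.
pairs = RUN_PATTERN.findall(teststr)
print(pairs)   # [('&&&&', '&'), ('((((', '('), ('****', '*')]

# finditer() yields match objects, so group(0) is the run itself.
runs = [m.group(0) for m in RUN_PATTERN.finditer(teststr)]
print(runs)    # ['&&&&', '((((', '****']
```

Note that the underscore counts as a word character for \w, so a run like "___" is not matched by this pattern.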

score 2 · Accepted Answer

Try this pattern:

([.\?#@+,<>%~`!$^&\(\):;])\1+

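\1 refers to the first captured group, that is, the contents of the parentheses.

You will need to extend the list of punctuation and symbols as needed.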
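Rather than extending the list by hand, one option is to build the character class from string.punctuation. This is only a sketch, and it assumes the ASCII punctuation set is what you care about; the names punct_class, repeated_punct and sample are made up for the illustration.

```python
import re
import string

# Assumption: string.punctuation (ASCII only) is the symbol set of interest.
# re.escape() backslash-escapes the characters that are special in a class.
punct_class = "[" + re.escape(string.punctuation) + "]"

# One punctuation character captured as group 1, then one or more copies of it.
repeated_punct = re.compile("(" + punct_class + r")\1+")

sample = "wait... what??? no!? ok ___ done!!"
found = [m.group(0) for m in repeated_punct.finditer(sample)]
print(found)   # ['...', '???', '___', '!!']
```

Because string.punctuation includes the underscore, "___" is reported here, while "!?" is skipped since the two characters differ.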

score 1 · Accepted Answer

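EDIT: @Firoze Lafeer posted an answer that does everything with a single regular expression. I'll leave this up in case anyone is interested in combining a regular expression with a filtering function, but for this problem it would be simpler and faster to use Firoze Lafeer's answer.

The answer I wrote before I saw Firoze Lafeer's answer is below, unchanged.

A simple regular expression can't do this. The classic pithy summary is "regular expressions can't count". Discussion here:

How to check that a string is a palindrome using regular expressions?

For a Python solution, I would recommend combining a regular expression with a bit of Python code. The regular expression discards everything that is not a run of punctuation characters, and then the Python code checks each match to throw out the false ones (runs of punctuation that are not all the same character).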

import re
import string

# Character class to match punctuation.  Some characters, such as the
# dash ('-'), are special in character classes; re.escape() puts a
# backslash in front of them so each one is treated as a literal.
_char_class_punct = "[" + re.escape(string.punctuation) + "]"

# Pattern: a punctuation character followed by one or more punctuation characters.
# Thus, a run of two or more punctuation characters.
_pat_punct_run = re.compile(_char_class_punct + _char_class_punct + '+')

def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

def find_all_punct_runs(text):
    return [s for s in _pat_punct_run.findall(text) if all_same(s, False)]


# alternate version of find_all_punct_runs() using re.finditer()
def find_all_punct_runs(text):
    return (s for s in (m.group(0) for m in _pat_punct_run.finditer(text)) if all_same(s, False))

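I wrote all_same() the way I did so that it works just as well on an iterator as on a string. The Python built-in all() returns True for an empty sequence, which is not what we want for this particular use of all_same(), so I added an argument for the desired basis case and made it default to True to match the behavior of all().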
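To make the basis-case behavior concrete, here is a quick check of all_same() on a few inputs. The function body is copied from the code above so the snippet runs on its own; the sample inputs are just illustrations.

```python
def all_same(seq, basis_case=True):
    # Copied from the answer above so this snippet is self-contained.
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

print(all([]))                 # True: built-in all() on an empty sequence
print(all_same(""))            # True: empty input, default basis case
print(all_same("", False))     # False: empty input, caller asked for False
print(all_same("!!!", False))  # True: every character is identical
print(all_same("!?@", False))  # False: mixed characters
print(all_same(iter("###")))   # True: a plain iterator works too
```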

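This does as much of the work as possible inside Python's internals (the regular expression engine and all()), so it should be pretty fast. For large input texts, you might want to rewrite find_all_punct_runs() to use re.finditer() instead of re.findall(). I gave an example of that above. That example also returns a generator expression rather than a list. You can always force it to make a list: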

lst = list(find_all_punct_runs(text))
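One thing worth checking is how the filter approach compares with a backreference pattern on the same input. The sketch below is only a comparison under stated assumptions: punct_run is a compact equivalent of _pat_punct_run, same_char_run is a backreference pattern in the style of the other answers, and both names are invented for this example.

```python
import re
import string

def all_same(seq, basis_case=True):
    # Copied from the answer above so this snippet is self-contained.
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

# Runs of two or more punctuation characters (any mix).
punct_run = re.compile("[" + re.escape(string.punctuation) + "]{2,}")
# Runs of two or more copies of the same non-word, non-space character.
same_char_run = re.compile(r'([^\w\s])\1+')

teststr = "4spaces    then(*(@^#$&&&&(2((((99999****"

filtered = [s for s in punct_run.findall(teststr) if all_same(s, False)]
backref = [m.group(0) for m in same_char_run.finditer(teststr)]

print(filtered)  # ['((((', '****']
print(backref)   # ['&&&&', '((((', '****']
```

On this input the filter approach drops '&&&&', because it sits inside the longer mixed run '(*(@^#$&&&&(' and the whole run fails the all_same() check, while the backreference pattern still finds it. So the two approaches are not interchangeable when same-character runs are embedded in other punctuation.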

score 0 · Accepted Answer

I would do it this way:

>>> import re
>>> st = 'non-whitespace characters such as ..., ???, !!!, ###, @@@, +++ and'
>>> reg = r'(([.?#@+])\2{2,})'
>>> print([m.group(0) for m in re.finditer(reg, st)])

Or:

>>> print([g for g, l in re.findall(reg, st)])

Either one prints:

['...', '???', '###', '@@@', '+++']
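Note that '!!!' is missing from that output because '!' is not in the character class [.?#@+]. Also, \2{2,} requires at least three identical characters in total, whereas the \1+ form used in the other answers accepts two. A small sketch of the difference, with a made-up sample string:

```python
import re

text = "a.. b... c!! d????"

# One captured character plus at least one copy: runs of 2 or more.
two_or_more = re.compile(r'([^\w\s])\1+')
# One captured character plus at least two copies: runs of 3 or more.
three_or_more = re.compile(r'([^\w\s])\1{2,}')

runs2 = [m.group(0) for m in two_or_more.finditer(text)]
runs3 = [m.group(0) for m in three_or_more.finditer(text)]

print(runs2)  # ['..', '...', '!!', '????']
print(runs3)  # ['...', '????']
```

Which quantifier fits depends on whether a doubled character such as ".." should count as a repeat for your data.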
