python - How to use a list as a variable in a regular expression in Python

Question

How can I use a list variable in a regular expression? Here is the problem line:

re.search(re.compile(''.format('|'.join(map(re.escape, kand))), corpus.raw(fileid)))

The error is:

TypeError: unsupported operand type(s) for &: 'str' and 'int'

A simple re.search works fine, but I need to use a list as the first argument to re.search:

for fileid in corpus.fileids():
    if re.search(r'[Чч]естны[й|м|ого].труд(а|ом)', corpus.raw(fileid)):
        dict_features[fileid]['samoprezentacia'] = 1
    else:
        dict_features[fileid]['samoprezentacia'] = 0

if re.search(re.compile('\b(?:%s)\b'.format('|'.join(map(re.escape, kand))), corpus.raw(fileid))):
    dict_features[fileid]['up'] = 1
else:
    dict_features[fileid]['up'] = 0

return dict_features

By the way, kand is a list:

kand = [line.strip() for line in open('kand.txt', encoding="utf8")]

When printed, kand is ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']

Edit: I'm using Python 3.3.2 with WinPython on Windows 7. Full error stack:

Traceback (most recent call last):
  File "F:/Python/NLTK packages/agit_classify.py", line 59, in <module>
    print (regexp_features(agit_corpus))
  File "F:/Python/NLTK packages/agit_classify.py", line 53, in regexp_features
    if re.search(re.compile(r'\b(?:{0})\b'.format('|'.join(map(re.escape, kandidats_all))), corpus.raw(fileid))):
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\re.py", line 214, in compile
    return _compile(pattern, flags)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\re.py", line 281, in _compile
    p = sre_compile.compile(pattern, flags)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_compile.py", line 494, in compile
    p = sre_parse.parse(p, flags)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 748, in parse
    p = _parse_sub(source, pattern, 0)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 360, in _parse_sub
    itemsappend(_parse(source, state))
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 453, in _parse
    if state.flags & SRE_FLAG_VERBOSE:
TypeError: unsupported operand type(s) for &: 'str' and 'int'

score 2 · Accepted Answer

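The reason you're getting this exception is that your parentheses are in the wrong places. Let's break the expression down to make that clearer: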

re.search(
    re.compile(
        ''.format('|'.join(map(re.escape, kand))), 
        corpus.raw(fileid)))

In other words, you're passing the string corpus.raw(fileid) as the second argument to re.compile, rather than as the second argument to re.search.

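That means you're trying to use it as the flags value, which is supposed to be an integer. When re.compile tries to use the & operator on the string to test each flag bit, it raises a TypeError.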

If you got past that error, re.search itself would then raise a TypeError, because you're passing it only one argument instead of two.
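Both failures can be reproduced in isolation. This is only a minimal sketch: `text` is a made-up stand-in for `corpus.raw(fileid)`, and the exact wording of each error message varies between Python versions.

```python
import re

text = "some document text"  # hypothetical stand-in for corpus.raw(fileid)

# A string in the flags position: re.compile cannot use it as an integer bitmask.
try:
    re.compile("apple|banana", text)
    flags_error = None
except TypeError as exc:
    flags_error = exc

# re.search called with a pattern but no string to search in.
try:
    re.search("apple|banana")
    arity_error = None
except TypeError as exc:
    arity_error = exc

print(type(flags_error).__name__, type(arity_error).__name__)
```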

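This is exactly why you shouldn't write overly complicated expressions. They're very painful to debug. If you had written this code as separate steps, the problem would be obvious: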

escaped_kand = map(re.escape, kand)
alternation = '|'.join(escaped_kand)
whatever_this_was_supposed_to_do = ''.format(alternation)
regexpr = re.compile(whatever_this_was_supposed_to_do, corpus.raw(fileid))
re.search(regexpr)

It would also be obvious that half of the work you're doing isn't needed in the first place.

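First, re.search takes a pattern, so there is no need to compile one just to pass it in. The top-level functions do accept an already compiled regex (as long as you pass no flags), but compiling first gains you nothing here. So that whole part of the expression is unnecessary. Just pass the pattern itself.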

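Alternatively, if you have a good reason to compile the regex, then as the re.compile documentation explains, the resulting regular expression object "can be used for matching using its match() and search() methods". So use the compiled object's search method, not the top-level re.search function.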
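As a rough sketch of that second option, the pattern can be compiled once outside the loop and then searched through its own method. The `docs` dictionary and the `features` name are invented here to stand in for the corpus and `dict_features`; the pattern template is the one from the traceback.

```python
import re

kand = ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']

# Build one alternation from the escaped list items and compile it once.
kand_re = re.compile(r'\b(?:{0})\b'.format('|'.join(map(re.escape, kand))))

# Hypothetical documents standing in for corpus.raw(fileid).
docs = {'a.txt': 'I ate a banana today.', 'b.txt': 'Nothing to see here.'}

# Use the compiled object's search method instead of re.search.
features = {name: {'up': 1 if kand_re.search(text) else 0}
            for name, text in docs.items()}
print(features)
```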

Second, I don't know what you expected ''.format(anything) to do, but it can't possibly return anything other than ''.
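A quick check makes the point concrete. The `alternation` value below is just an illustrative two-word example:

```python
alternation = 'apple|banana'

# An empty template has no replacement fields, so the argument is ignored.
empty = ''.format(alternation)

# A template with a {0} field actually receives the alternation.
filled = r'\b(?:{0})\b'.format(alternation)

print(repr(empty), repr(filled))
```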

score 1 · Accepted Answer

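You're mixing old-style and new-style string formatting. Also, you need to use a raw string for the regular expression, or \b will mean a backspace character rather than a word boundary.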

'\b(?:%s)\b'.format('|'.join(map(re.escape, kand)))

should be

r'\b(?:{0})\b'.format('|'.join(map(re.escape, kand)))
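To see the difference between the two lines, here is a small sketch using a shortened, made-up alternation. In the first template, str.format leaves %s untouched and '\b' has already become a single backspace character:

```python
import re

alternation = 'apple|banana'

# Non-raw string plus %s: .format() has no {} field to fill,
# and each '\b' is one backspace character (code point 8).
mixed = '\b(?:%s)\b'.format(alternation)

# Raw string plus {0}: \b reaches the regex engine as a word boundary.
fixed = r'\b(?:{0})\b'.format(alternation)

text = 'an apple a day'
print(repr(mixed), re.search(mixed, text))
print(repr(fixed), re.search(fixed, text))
```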

Also, note that \b only works if your "words" start and end with an alphanumeric character (or _).
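The sketch below shows that limitation with an invented word list containing 'C++'. The lookaround version is one possible workaround, not something from the original answer, and it assumes "not adjacent to a word character" is the boundary you want:

```python
import re

words = ['C++', 'kiwi']  # hypothetical list with a non-alphanumeric ending
alternation = '|'.join(map(re.escape, words))

# \b needs a word character on one side, so it fails after the final '+'.
with_b = re.compile(r'\b(?:{0})\b'.format(alternation))

# Lookarounds only require that no word character touches the match.
with_lookarounds = re.compile(r'(?<!\w)(?:{0})(?!\w)'.format(alternation))

text = 'I like C++ a lot'
print(with_b.search(text))
print(with_lookarounds.search(text))
```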
