python - Different behavior when using re.finditer and re.match

Question

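I'm working on a regex to collect some values from a page through a script. I'm using re.match in a condition, but it returns false, whereas if I use finditer it returns true and the body of the condition is executed. I tested the regex in a tester I built myself and it works there, but not in the script. Here is a sample script.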

import re

result = []
RE_Add0 = re.compile("\d{5}(?:(?:-| |)\d{4})?", re.IGNORECASE)
each = 'Expiration Date:\n05/31/1996\nBusiness Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n'
if RE_Add0.match(each):
    result0 = RE_Add0.match(each).group(0)
    print(result0)
    if len(result0) < 100:
        result.append(result0)
    else:
        print('Address ignore')
else:
    pass

score 3 · Accepted Answer

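re.finditer() returns an iterator object even when there are no matches (so an `if RE_Add0.finditer(each)` will always evaluate to True). You have to actually iterate over the object to see whether there are any real matches.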
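
A minimal sketch of this point, assuming Python 3 and the sample string from the question. The helper name `has_match` is hypothetical and introduced only for illustration.

```python
import re

RE_ADD0 = re.compile(r"\d{5}(?:(?:-| |)\d{4})?")

def has_match(pattern, text):
    """Return True if pattern matches anywhere in text."""
    # next() pulls the first match from the iterator, or None if it is empty.
    return next(pattern.finditer(text), None) is not None

each = 'Expiration Date:\n05/31/1996\nBusiness Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n'

# The iterator object itself is always truthy, even when it yields nothing.
empty_iter = RE_ADD0.finditer("no digits here")
print(bool(empty_iter))                      # True
print(has_match(RE_ADD0, "no digits here"))  # False
print(has_match(RE_ADD0, each))              # True
```

Testing the iterator's first element, rather than the iterator itself, is what makes the condition meaningful.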

Second, re.match() only matches at the start of the string, not anywhere in the string as re.search() or re.finditer() do.
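
The difference can be sketched on the question's own input (Python 3 assumed; the variable names here are illustrative, not from the original script):

```python
import re

pattern = re.compile(r"\d{5}(?:(?:-| |)\d{4})?")
each = 'Expiration Date:\n05/31/1996\nBusiness Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n'

# match() is anchored at position 0, and the string starts with "Expiration".
at_start = pattern.match(each)
print(at_start)                # None

# search() scans forward and returns the first match it finds.
anywhere = pattern.search(each)
print(anywhere.group(0))       # 23901

# finditer() yields every non-overlapping match in order.
all_hits = [m.group(0) for m in pattern.finditer(each)]
print(all_hits)                # ['23901', '91302']
```

Note that the first hit is the street number, not the postal code, so picking the right match still needs extra logic.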

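Third, the regex can be written more simply as r"\d{5}(?:[ -]?\d{4})?" (the trailing ? keeps the four-digit suffix optional, as in the original pattern).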
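
As a quick check, the sketch below compares the question's pattern with a character-class form that keeps the trailing `?`, so the four-digit suffix stays optional. The sample inputs and the helper `first_match` are hypothetical, chosen only for illustration.

```python
import re

# The pattern from the question, and a simplified form with the trailing `?`
# kept so that the four-digit suffix remains optional.
original = re.compile(r"\d{5}(?:(?:-| |)\d{4})?")
simplified = re.compile(r"\d{5}(?:[ -]?\d{4})?")

# Hypothetical sample inputs, not taken from the original post.
samples = ["91302", "91302-1234", "91302 1234", "913021234", "91302-12", "CA 91302\n"]

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

for s in samples:
    # Both columns should agree for every sample.
    print(repr(s), first_match(original, s), first_match(simplified, s))
```

Agreement on a handful of samples is not a proof of equivalence, but `[ -]?` and `(?:-| |)` both mean "an optional hyphen or space", so the two forms should behave the same.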

Fourth, always use raw strings with regexes.
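
A small illustration of why raw strings matter, using `\b` as the example (the sample text is made up). In a normal string literal `\b` is a backspace character, while in a raw string it reaches the regex engine as a word boundary.

```python
import re

text = "zip 91302 here"

plain = "\b91302\b"   # Python turns each \b into a backspace character (\x08)
raw = r"\b91302\b"    # the backslashes survive, so the regex sees word boundaries

print(len(plain), len(raw))           # 7 9
print(re.search(plain, text))         # None
print(re.search(raw, text).group(0))  # 91302
```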

score 1 · Accepted Answer

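re.match matches only once, at the beginning of the string. re.finditer is similar to re.search in this regard, i.e. it scans through the string for matches. Compare: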

>>> re.match('a', 'abc')
<_sre.SRE_Match object at 0x01057AA0>
>>> re.match('b', 'abc')
>>> re.finditer('a', 'abc')
<callable_iterator object at 0x0106AD30>
>>> re.finditer('b', 'abc')
<callable_iterator object at 0x0106EA10>

ETA: Since you mention a page, I can only surmise that you are talking about HTML parsing. If that is the case, use BeautifulSoup or a similar HTML parser. Don't use regex.

score 0 · Accepted Answer

Try this:

import re

postalCode = re.compile(r'((\d{5})([ -])?(\d{4})?(\s*))$')
primaryGroup = lambda x: x[1]

sampleStr = """
    Expiration Date:
    05/31/1996
    Business Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302  
"""
result = []

# findall returns one tuple of groups per match; index 1 is the 5-digit code
matches = postalCode.findall(sampleStr)
if matches:
    for match in matches:
        pc = primaryGroup(match)
        print(pc)
        result.append(pc)
else:
    print("No postal code found in this string")

This will return "12345" for any of the following:

12345\n
12345  \n
12345 6789\n
12345 6789    \n
12345 \n
12345     \n
12345-6789\n
12345-6789    \n
12345-\n
12345-    \n
123456789\n
123456789    \n
12345\n
12345    \n

I'm only matching it at the end of a line because otherwise it would also match "23901" (from the street address) in your example.
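
One caveat worth sketching: in Python, `$` without the `re.MULTILINE` flag matches only at the very end of the string (or just before a final newline), so a pattern like the one above finds a code only on the last line. If codes at the end of earlier lines matter too, `re.MULTILINE` is one option. The patterns, variable names and sample text below are assumptions for illustration, and `[ \t]*` is used instead of `\s*` so that trailing whitespace cannot span a line break.

```python
import re

# Hypothetical two-line input, not from the original post.
text = "Office: 12 MAIN ST, CA 91302  \nMail: PO BOX 7, NY 10001-1234\n"

# Same idea as above: a 5-digit code, an optional 4-digit suffix, then end.
end_of_string = re.compile(r"(\d{5})(?:[ -]?\d{4})?[ \t]*$")
end_of_line = re.compile(r"(\d{5})(?:[ -]?\d{4})?[ \t]*$", re.MULTILINE)

# Without MULTILINE, only the code on the final line is found.
print(end_of_string.findall(text))  # ['10001']

# With MULTILINE, $ also matches before each newline.
print(end_of_line.findall(text))    # ['91302', '10001']
```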
