python - Make sure two regular expressions do not find the same result

Question

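I am trying to parse all dates out of a string; they may be written in different forms. The problem is that a date may be written in the form d/m -y, for example 22/11 -12, but it may also be written as d/m with no year specified. If I find a date in the longer form in the string, I do not want to find it again in the shorter form. This is where my code fails: it finds the first date twice (once with the year and once without).

I really have two questions: (1) What is the "right" way to do this? It seems I am approaching the problem from the wrong angle. (2) If I stick with this approach, why doesn't the line datestring.replace(match.group(0), '') remove the date so that I cannot find it again?

Here is my code: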

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

dformats = (
    '(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})',
    '(?P<day>\d{1,2})/(?P<month>\d{1,2})',
    '(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
            )


def get_dates(datestring):
    """Try to extract all dates from certain strings.

    Arguments:
    - `datestring`: A string containing dates.
    """
    global dformats

    found_dates = []

    for regex in dformats:
        matches = re.finditer(regex, datestring)
        for match in matches:
            # Is supposed to make sure the same date is not found twice
            datestring.replace(match.group(0), '')

            found_dates.append(match)
    return found_dates

if __name__ == '__main__':
    dates = get_dates('1/2 -13, 5/3 & 2012-11-22')
    for date in dates:
        print(date.groups())

score 2 · Accepted Answer

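There are two ways to do this:

1. Use a single regex and join all your cases with the | operator. Alternatives are tried from left to right, so list the longer d/m -y form before the shorter d/m form:

expr = re.compile(r"expr1|expr2|expr3")

2. Find only a single instance at a time, then pass a start position to the next search. Note that this complicates things, because you always want to continue from the earliest match regardless of which format produced it. That is, search with all three patterns, work out which match comes first, do the replacement, and then repeat with the start position moved forward. All in all, this makes option 1 the easier one.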
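
Option 2 could be sketched roughly as follows. This is only an illustration: the helper name `iter_earliest_matches` and the `PATTERNS` list are made up for this example. It relies on compiled patterns accepting a start position in `search(string, pos)`, and it breaks ties at the same start by pattern order, which is why the longer form is listed first.

```python
import re

# Hypothetical pattern list for this sketch; the longer d/m -y form comes
# first so that it wins when two patterns match at the same position.
PATTERNS = [
    re.compile(r'(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})'),
    re.compile(r'(?P<day>\d{1,2})/(?P<month>\d{1,2})'),
    re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'),
]


def iter_earliest_matches(text, patterns=PATTERNS):
    """Yield non-overlapping matches in text order, whichever pattern hits first."""
    pos = 0
    while True:
        # Search every pattern from the current position.
        candidates = [m for m in (p.search(text, pos) for p in patterns) if m]
        if not candidates:
            return
        # min() keeps the first of several equal starts, i.e. the earlier pattern.
        best = min(candidates, key=lambda m: m.start())
        yield best
        # Resume after the match so its text cannot be found again.
        pos = best.end()


found = [m.group(0) for m in iter_earliest_matches('1/2 -13, 5/3 & 2012-11-22')]
print(found)  # ['1/2 -13', '5/3', '2012-11-22']
```

Every pattern is searched again on each round, which is the extra bookkeeping that makes the single combined regex the simpler choice.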

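A few additional points:

Make sure you use raw strings: put an r in front of each pattern string. Otherwise Python may process the backslash sequences itself before they reach the regex engine (for example, '\b' becomes a backspace character instead of a word boundary).
Consider using sub with a callback function as the repl argument to do the replacement, instead of finditer. In that case repl is called with a match object and should return the replacement string.
In the single combined regex, the groups belonging to an alternative that was not chosen have the value None, which makes it easy to detect which alternative matched.
You should not declare a variable global unless you intend to assign to it.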
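
As for question (2): Python strings are immutable, so `str.replace` returns a new string and leaves the original untouched. In the question's code the returned string is simply thrown away. Below is a minimal sketch of the difference, and of how the original loop might be patched if you stay with that approach. The names `DFORMATS` and `get_dates_patched` are made up for this example.

```python
import re

s = '1/2 -13, 5/3'
s.replace('1/2 -13', '')      # returns a new string, which is discarded here
unchanged = s                 # still '1/2 -13, 5/3'
s = s.replace('1/2 -13', '')  # rebinding the name keeps the result

DFORMATS = (
    r'(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})',
    r'(?P<day>\d{1,2})/(?P<month>\d{1,2})',
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
)


def get_dates_patched(datestring):
    found_dates = []
    for regex in DFORMATS:
        # finditer keeps scanning the string object it was given, so
        # rebinding datestring below only affects the next pattern.
        for match in re.finditer(regex, datestring):
            datestring = datestring.replace(match.group(0), '')
            found_dates.append(match)
    return found_dates


matched = [m.group(0) for m in get_dates_patched('1/2 -13, 5/3 & 2012-11-22')]
print(matched)  # ['1/2 -13', '5/3', '2012-11-22']
```

Keep in mind that `str.replace` removes every occurrence of the matched text, and that deleting text can join the characters on either side, so this is a patch rather than a recommendation.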

Here is some complete working code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

expr = re.compile(
    r'(?P<day1>\d{1,2})/(?P<month1>\d{1,2}) -(?P<year>\d{2})|(?P<day2>\d{1,2})/(?P<month2>\d{1,2})|(?P<year3>\d{4})-(?P<month3>\d{2})-(?P<day3>\d{2})')


def get_dates(datestring):
    """Try to extract all dates from certain strings.

    Arguments:
    - `datestring`: A string containing dates.
    """

    found_dates = []
    matches = expr.finditer(datestring)
    for match in matches:
        if match.group('day1'):
            # First alternative (d/m -y): keep the two-digit year as well.
            found_dates.append({'day': match.group('day1'),
                                'month': match.group('month1'),
                                'year': match.group('year')})
        elif match.group('day2'):
            found_dates.append({'day': match.group('day2'),
                                'month': match.group('month2')})
        elif match.group('day3'):
            found_dates.append({'day': match.group('day3'),
                                'month': match.group('month3'),
                                'year': match.group('year3')})
        else:
            raise Exception("wtf?")
    return found_dates

if __name__ == '__main__':
    dates = get_dates('1/2 -13, 5/3 & 2012-11-22')
    for date in dates:
        print(date)
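
The branch-per-alternative logic can also be written more compactly by dropping the groups whose value is None. This is a sketch only, assuming every group name ends in the number of its alternative (`day1`, `month2`, and so on). The names `DATE_RE` and `fields` are made up for this example.

```python
import re

# Same three alternatives, longer form first; names carry a numeric suffix.
DATE_RE = re.compile(
    r'(?P<day1>\d{1,2})/(?P<month1>\d{1,2}) -(?P<year1>\d{2})'
    r'|(?P<day2>\d{1,2})/(?P<month2>\d{1,2})'
    r'|(?P<year3>\d{4})-(?P<month3>\d{2})-(?P<day3>\d{2})'
)


def fields(match):
    """Keep only the groups of the alternative that matched, minus the suffix."""
    return {name.rstrip('123'): value
            for name, value in match.groupdict().items()
            if value is not None}


parsed = [fields(m) for m in DATE_RE.finditer('1/2 -13, 5/3 & 2012-11-22')]
for entry in parsed:
    print(entry)
```

Adding a fourth format then only needs a new alternative in the pattern, not a new branch in the loop.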

score 2 · Accepted Answer

You can use a negative lookahead in your second regex so that it matches only those dates that are not followed by a -year suffix:

dformats = (
    r'(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})',
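    # (?!\d) stops the month from backtracking to a single digit, which would
    # otherwise let '22/11 -12' match as '22/1' and slip past the year check.
    r'(?P<day>\d{1,2})/(?P<month>\d{1,2})(?!\d)(?!\s+-(?P<year>\d{2}))',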
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
)

As a result, dates matched by the first regex will not be matched by the second.
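
To check the effect, the long and short patterns can be run over a sample string. In this sketch the names `WITH_YEAR` and `NO_YEAR` are made up, and the sample adds a two-digit month to the question's data. The `(?!\d)` guard after the month keeps it from backtracking to a single digit; without it, '22/11 -12' could still produce '22/1' from the short pattern.

```python
import re

WITH_YEAR = re.compile(r'(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})')
# Short form: reject the match if another digit follows the month, or if a
# ' -yy' suffix follows it.
NO_YEAR = re.compile(r'(?P<day>\d{1,2})/(?P<month>\d{1,2})(?!\d)(?!\s+-\d{2})')

sample = '1/2 -13, 5/3 & 22/11 -12'
long_hits = [m.group(0) for m in WITH_YEAR.finditer(sample)]
short_hits = [m.group(0) for m in NO_YEAR.finditer(sample)]
print(long_hits)   # ['1/2 -13', '22/11 -12']
print(short_hits)  # ['5/3']
```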

score 1 · Accepted Answer

You can use sub instead of finditer:

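import re


def find_dates(s):
    dformats = (
        r'(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})',
        r'(?P<day>\d{1,2})/(?P<month>\d{1,2})',
        r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    )

    dates = []

    def collect(m):
        # Record the match and return '' so its text is removed from s;
        # later, shorter patterns then cannot find the same date again.
        dates.append(m.groupdict())
        return ''

    for f in dformats:
        s = re.sub(f, collect, s)
    return dates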
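
One thing to be aware of with this approach: the dates come back grouped by pattern, not in the order they appear in the text, and deleting a match can join the characters on either side of it. A possible variation, offered as a sketch only, overwrites each match with spaces of the same length so that offsets stay valid and the results can be sorted by `m.start()`. The names `DFORMATS`, `find_dates_in_order` and `blank_out` are made up for this example.

```python
import re

DFORMATS = (
    r'(?P<day>\d{1,2})/(?P<month>\d{1,2}) -(?P<year>\d{2})',
    r'(?P<day>\d{1,2})/(?P<month>\d{1,2})',
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
)


def find_dates_in_order(s):
    hits = []

    def blank_out(m):
        # Remember where the match started, then mask it with spaces so
        # later patterns skip it and all offsets stay unchanged.
        hits.append((m.start(), m.groupdict()))
        return ' ' * len(m.group(0))

    for f in DFORMATS:
        s = re.sub(f, blank_out, s)
    # Restore text order, since matches were collected pattern by pattern.
    hits.sort(key=lambda hit: hit[0])
    return [fields for _, fields in hits]


ordered = find_dates_in_order('2012-11-22, 5/3 & 1/2 -13')
print(ordered)
```

Masking with spaces is an assumption that suits these three patterns; a pattern that itself matches runs of whitespace would need a different filler character.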
