python - Regex does not remove text from a quoted reply

Question

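I am trying to parse the text of an email reply and remove the quoted text (and anything after it, including the signature).

This code is returning: message tests On Tue, Jun 25, 2013 at 10:01 PM, Catie Brand <

I would like it to return only message tests

Which regex am I missing?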

import re


def format_mail_plain(value, from_address):
    res = [re.compile(r'From:\s*' + re.escape(from_address), re.IGNORECASE),
           re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
           re.compile(r'\s+wrote:', re.IGNORECASE | re.MULTILINE),
           re.compile(r'On.*?wrote:.*?', re.IGNORECASE | re.MULTILINE | re.DOTALL),
           re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE),
           re.compile(r'from:\s*$', re.IGNORECASE),
           re.compile(r'^>.*$', re.IGNORECASE | re.MULTILINE)]

    whitespace_re = re.compile(r'\s+')

    lines = [line.rstrip() for line in value.split('\n')]

    result = ''
    for line in lines:
        # Stop at the first line that looks like the start of quoted text.
        for reg_ex in res:
            if reg_ex.search(line):
                return result

        if not whitespace_re.match(line):
            # Compare by value; "is" tests object identity, not equality.
            if result == '':
                result += line
            else:
                result += '\n' + line

    return result




************************ Sample Text *****************************
message tests 
On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX < 
conversations+yB1oupeCJzMOBj@xxxx.com> wrote: 
> ** 
>    [image: Krow] <http://www.krow.com/>


************************ Result **********************************
message tests
On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX <

I would prefer the result to be:

************************ Result **********************************
message tests

score 1 · Accepted Answer

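In your sample input, the On ... wrote: header spans two lines. Your code splits the text into lines before matching, so no regex stops the loop at the line that starts with On; that line has already been added to the result by the time the following line (which contains wrote:) triggers the return.

I changed your code to replace ^On.*?wrote:\s* with an empty string in the whole text before it is split into lines.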

import re


def format_mail_plain(value, from_address):
    # Remove the "On ... wrote:" header from the whole text first, so it is
    # matched even when it wraps onto a second line.
    value = re.compile(r'^On.*?wrote:\s*',
                       re.IGNORECASE | re.MULTILINE | re.DOTALL).sub('', value)
    res = [re.compile(r'From:\s*' + re.escape(from_address), re.IGNORECASE),
           re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
           re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE),
           re.compile(r'^from:', re.IGNORECASE),
           re.compile(r'^>')]

    # Strip trailing whitespace and drop lines that end up empty.
    lines = [line.rstrip() for line in value.split('\n') if line.strip()]

    # Keep only the lines that match none of the quote patterns.
    result = []
    for line in lines:
        if not any(reg_ex.search(line) for reg_ex in res):
            result.append(line)

    return '\n'.join(result)
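
To see the difference between matching per line and matching the whole text, here is a minimal sketch using the sample text from the question. The variable names are illustrative and not part of the original code; the pattern is the one used in the answer above.

```python
import re

sample = (
    "message tests \n"
    "On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX < \n"
    "conversations+yB1oupeCJzMOBj@xxxx.com> wrote: \n"
    "> ** \n"
    ">    [image: Krow] <http://www.krow.com/>\n"
)

on_wrote = re.compile(r'^On.*?wrote:\s*',
                      re.IGNORECASE | re.MULTILINE | re.DOTALL)

# Line by line: no single line both starts with "On" and contains "wrote:".
per_line_hits = [line for line in sample.split('\n') if on_wrote.search(line)]

# Whole text: DOTALL lets ".*?" cross the line break inside the header.
whole_text_match = on_wrote.search(sample)

print(per_line_hits)
print(repr(whole_text_match.group(0)))
```

One caveat with this pattern: because of DOTALL, any line that begins with "On" will match up to the next "wrote:" anywhere later in the message, so ordinary body text can be removed by mistake.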

score 0 · Accepted Answer

The regex that you are expecting to catch 'On Tue, Jun 25 ...' is

re.compile(r'On.*?wrote:.*?', re.IGNORECASE | re.MULTILINE | re.DOTALL)

That won't match because, by the time the regex sees the string, the text has already been split into lines and 'wrote:' is on a different line from 'On'. Since you want to stop processing the message once you have seen that header, replace it, before you split the string, with something that will make your processing loop exit anyway. I would suggest the leading quote character '>'. falsetru caught this first; I incorporated the replacement idea into my answer.

Your regular expressions seem to be written to avoid alternation entirely. Was that an attempt at improving performance?

I would reduce the number of regular expressions, drop whitespace-only lines at the list generation stage, and use substring tests in place of one- and two-character regular expressions. Try this:

import re


def format_mail_plain(value, from_address):
    # Replace the (possibly multi-line) "On ... wrote:" header with '>' so
    # the quote check in the loop below stops processing at that point.
    on_wrote_regex = re.compile(
        r'^On.*?wrote:\s*', re.IGNORECASE | re.MULTILINE | re.DOTALL)
    value = on_wrote_regex.sub('>', value)
    # "From: <address>" and a bare "From:" are combined with alternation.
    res = [re.compile(r'from:\s*(?:' + re.escape(from_address) + r'|$)', re.IGNORECASE),
           re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
           re.compile(r'\s+wrote:', re.IGNORECASE),
           re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE)]

    result = ''
    # Whitespace-only lines are skipped here, so line[0] below is safe.
    for line in (text_line.rstrip()
                 for text_line in value.split('\n')
                 if text_line.strip()):
        if line[0] == '>':
            return result

        for reg_ex in res:
            if reg_ex.search(line):
                return result

        if result == '':
            result += line
        else:
            result += '\n' + line

    return result
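
As a rough sketch of where the alternation idea could lead, the markers can be joined into one pattern and searched once over the whole text, cutting everything from the first match onward. This is a different technique from the loop above, not the code from either answer; `strip_quoted_reply` is a hypothetical name introduced only for illustration.

```python
import re


def strip_quoted_reply(value, from_address):
    """Return the text before the first quoted-reply marker (hypothetical helper)."""
    addr = re.escape(from_address)
    alternatives = [
        r'^On\b.*?wrote:',               # "On <date>, <name> wrote:", may wrap lines
        r'^>',                           # a quoted line
        r'^from:\s*(?:' + addr + r'|$)', # "From: <address>" or a bare "From:"
        '<' + addr + '>',                # the address in angle brackets
        r'^-+original\s+message-+\s*$',  # "-----Original Message-----"
    ]
    marker = re.compile('|'.join(alternatives),
                        re.IGNORECASE | re.MULTILINE | re.DOTALL)

    # Keep only what precedes the earliest marker, if there is one.
    match = marker.search(value)
    kept = value[:match.start()] if match else value

    # Strip trailing whitespace and drop whitespace-only lines.
    return '\n'.join(line.rstrip() for line in kept.split('\n') if line.strip())
```

Because the search runs before the text is split, a header that wraps across lines is handled without a separate substitution step. The same caveat applies as with the answers: a body line that starts with "On" and is followed anywhere later by "wrote:" would also be treated as a marker.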
