python - Parsing Whois data with a regular expression - ignoring repeated fields

Question

I am trying to parse the output of a whois query. I am interested in retrieving the route, descr and origin fields from records like these:

route:          129.45.67.8/91
descr:          FOO-BAR
descr:          Information 2
origin:         AS5462
notify:         foo@bar.net
mnt-by:         AS5462-MNT
remarks:        For abuse notifications please file an online case @ http://www.foo.com/bar
changed:        foo@bar.net 20000101
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.foo.net/bar
remarks:        ****************************

route:          123.45.67.8/91
descr:          FOO-BAR
origin:         AS3269
mnt-by:         BAR-BAZ
changed:        foo@bar.net 20000101
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

To do this, I use the following code and regular expression:

import re

FILE = "whois"  # placeholder path to the saved whois output
search = "FOO-BAR"

with open(FILE, "r") as f:
    content = f.read()
    r = re.compile(r'route:\s+(.*)\ndescr:\s+(.*' + search + r'.*).*\norigin:\s+(.*)', re.IGNORECASE)
    res = r.findall(content)
    print(res)

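This works as expected for records that contain a single descr field, but it skips records that contain more than one descr field.

For the data above, I get the following result: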

[('123.45.67.8/91', 'FOO-BAR', 'AS3269')]

The expected result contains the route field, the first descr field when there are several descr lines, and the origin field:

[('129.45.67.8/91', 'FOO-BAR', 'AS5462'), ('123.45.67.8/91', 'FOO-BAR', 'AS3269')]

What is the correct regular expression to parse records that contain either one or several descr lines?

score 2 · Accepted Answer

I got fairly close to what you are asking for:

import re

search = "FOO-BAR"
term = re.escape(search)  # escape so the search term is matched literally

with open('whois', "r") as f:
    content = f.read()

r = re.compile(
    r'route:\s+(.*)\n'                      # route line
    r'(descr:\s+(?!' + term + r').*\n)*'    # 0-n descr lines that do not start with the search term
    r'descr:\s+(' + term + r')\n'           # the one descr line that equals the search term
    r'(descr:\s+(?!' + term + r').*\n)*'    # 0-n further descr lines that do not start with the search term
    r'origin:\s+(.*)',                      # origin line
    re.IGNORECASE)
res = r.findall(content)
print(res)

Result:

>>> [('129.45.67.8/91', '', 'FOO-BAR', 'descr:          Information 2\n', 'AS5462'),
     ('123.45.67.8/91', '', 'FOO-BAR', '', 'AS3269')]

With a little cleanup, you can get the result you want.
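One possible cleanup, sketched here under a few assumptions: the optional descr groups are turned into non-capturing groups, `(?:...)`, so that `findall` returns only the route, the matching descr, and the origin. The sample text is abbreviated from the question and embedded as a string instead of being read from a file, and the names `term` and `pattern` are illustrative.

import re

# Sample data abbreviated from the question (illustrative; normally read from a file).
content = """\
route:          129.45.67.8/91
descr:          FOO-BAR
descr:          Information 2
origin:         AS5462
notify:         foo@bar.net

route:          123.45.67.8/91
descr:          FOO-BAR
origin:         AS3269
mnt-by:         BAR-BAZ
"""

search = "FOO-BAR"
term = re.escape(search)  # match the search term literally

pattern = re.compile(
    r'route:\s+(.*)\n'                       # group 1: route
    r'(?:descr:\s+(?!' + term + r').*\n)*'   # skip descr lines before the match (not captured)
    r'descr:\s+(' + term + r')\n'            # group 2: the matching descr line
    r'(?:descr:\s+(?!' + term + r').*\n)*'   # skip descr lines after the match (not captured)
    r'origin:\s+(.*)',                       # group 3: origin
    re.IGNORECASE)

res = pattern.findall(content)
print(res)

For the sample above this prints [('129.45.67.8/91', 'FOO-BAR', 'AS5462'), ('123.45.67.8/91', 'FOO-BAR', 'AS3269')], which is the output the question asks for. Note that, as in the answer's pattern, the descr line must equal the search term exactly; a descr line that merely contains it would not match.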
