python - How to make re.findall return repeated matches

Question

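I have a list of IP:PORT entries in HTML. When I use findall to search for all the IPs, I get a list of every IP, since the IPs are unique. Some of the ports are identical, though, and for my sample list I get 100 IPs but only 87 ports. How can I find all the ports, including the duplicates?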

import re

# html holds the page source
proxies = re.findall(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}", html)

ports = re.findall(r"[0-9]{1,3},[0-9]{1,3},[0-9]{1,3},[0-9]{1,3}", html)
# ports are encoded to look like this: 47,46,47,46

print(len(proxies))
print(len(ports))

score 2 · Accepted Answer

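Without seeing the source file, I can only offer some general points.

Port numbers are not limited to 3 digits, so you are excluding any port above 999.
Do the port numbers only ever appear as a list of 4 numbers? You say the format is a list of IP:PORT, but that is not what you are checking for.

Edit:

Looking more closely at the page source, some entries do not have 4 port numbers.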

<tr>
    <td class="t_ip">151.9.233.6</td>
    <td class="t_port">50,42</td>
    <td class="t_country"><img src="/images/flags/it.png" alt="it" />Italy</td>
    <td class="t_anonymity">

            High

    </td>
    <td class="t_https">-</td>
    <td class="t_checked">00:02:16</td>
    <td class="t_check">
        <a href="" class="a_check" >check</a>
    </td>
</tr>
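To illustrate why the port count comes up short, here is a minimal sketch. The `sample` string and the variable names are made up for the demonstration and are not taken from the real page. It suggests that `re.findall` does keep repeated matches, and that the missing entries are the ones that do not fit the four-group pattern.

```python
import re

# Hypothetical sample: three port cells, one of them with only two groups.
sample = "47,46,47,46 50,42 47,46,47,46"

# The pattern from the question: exactly four groups of 1 to 3 digits.
four_groups = r"[0-9]{1,3},[0-9]{1,3},[0-9]{1,3},[0-9]{1,3}"
# A looser pattern: one or more comma-separated groups of digits.
any_groups = r"\d+(?:,\d+)*"

strict = re.findall(four_groups, sample)
loose = re.findall(any_groups, sample)

print(strict)  # the duplicate value appears twice, "50,42" is skipped
print(loose)   # all three cells are found
```

The looser pattern would also pick up any other digit run in real HTML, so on an actual page it needs some surrounding context to anchor it.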

It seems it would be much easier to look for the class="t_ip" and class="t_port" elements and grab their contents.

<td class="t_ip">(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>
<td class="t_port">((\d,?)+)</td>

Note: the IP address expression will also match invalid IP addresses, such as 999.999.999.999.
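As a rough sketch of how these class-anchored expressions could be used with `re.findall`, the snippet below runs them over a trimmed-down, made-up table (the second and third rows are invented for illustration). One assumption differs from the expressions above: the port pattern is written with a non-capturing inner group, because with two capturing groups `re.findall` returns tuples rather than plain strings.

```python
import re

# Hypothetical, shortened rows modelled on the page source shown above.
html = """
<tr>
    <td class="t_ip">151.9.233.6</td>
    <td class="t_port">50,42</td>
</tr>
<tr>
    <td class="t_ip">10.0.0.1</td>
    <td class="t_port">47,46,47,46</td>
</tr>
<tr>
    <td class="t_ip">10.0.0.2</td>
    <td class="t_port">47,46,47,46</td>
</tr>
"""

ip_pattern = r'<td class="t_ip">(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>'
# Single capturing group, so findall yields strings instead of tuples.
port_pattern = r'<td class="t_port">(\d+(?:,\d+)*)</td>'

ips = re.findall(ip_pattern, html)
ports = re.findall(port_pattern, html)

# Pair each IP with its encoded port; this assumes every row has both cells.
pairs = list(zip(ips, ports))
print(len(ips), len(ports))
print(pairs)
```

For anything beyond a quick scrape, an HTML parser is usually a safer choice than regular expressions, since attribute order and whitespace in the markup can change.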

score 0 · Accepted Answer

Not sure how much this will help you, but here is another option:

txt = """
<tr>
    <td class="t_ip">151.9.233.6</td>
    <td class="t_port">50,42</td>
    <td class="t_country"><img src="/images/flags/it.png" alt="it" />Italy</td>
    <td class="t_anonymity">

            High

    </td>
    <td class="t_https">-</td>
    <td class="t_checked">00:02:16</td>
    <td class="t_check">
        <a href="" class="a_check" >check</a>
    </td>
</tr>    
"""

txt = [line.strip() for line in txt.split('\n')]

# len('</td>') == 5, the length of the closing tag stripped from each line
getVals = lambda startTxt: [line[len(startTxt):len(line)-5] for line in txt if line.startswith(startTxt)]

print(getVals('<td class="t_ip">'))
print(getVals('<td class="t_port">'))
