python - Python 3 HTML parser

Question

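I'm sure everyone will groan and tell me to look at the documentation (I have), but I just don't understand how to achieve the equivalent of the following: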

curl -s http://www.maxmind.com/app/locate_my_ip | awk '/align="center">/{getline;print}'

So far, all I have in Python 3 is:

import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')

for lines in f.readlines():
    print(lines)

f.close()

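Seriously, any suggestions are welcome (please don't tell me to read http://docs.python.org/release/3.0.1/library/html.parser.html, as I have been learning Python for one day and get confused easily). A simple example would be great!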

score 4 · Accepted Answer

This builds on larsmans's answer.

import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for line in f:
    if b'align="center">' in line:
        # print the line that follows the matching one, as the awk script does
        print(next(f).decode().rstrip())
f.close()

Explanation:

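for line in f iterates over the lines in the file-like object f. Python lets you iterate over the lines of a file the same way you iterate over the items of a list.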

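if b'align="center">' in line looks for the string 'align="center">' in the current line. The b prefix marks the literal as a bytes object rather than a string. urllib.request.urlopen returns the response as binary data, not as a unicode string, and an unadorned 'align="center">' would be interpreted as a unicode string. (That is where the TypeError came from.)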
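To make the bytes versus str point concrete, here is a small sketch that runs without any network access. The sample line is made up for illustration; it is not taken from the real page.

```python
# Made-up sample line, as bytes, standing in for one line of the response.
line = b'<td align="center">\n'

# A bytes pattern tested against bytes data works.
found = b'align="center">' in line

# A str pattern tested against bytes data raises TypeError.
try:
    'align="center">' in line
    raised = False
except TypeError:
    raised = True

# Decoding the line first lets a plain str pattern work too.
found_after_decode = 'align="center">' in line.decode()

print(found, raised, found_after_decode)
```

Either approach is fine: keep everything as bytes and use the b prefix, or decode each line and use ordinary strings.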

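next(f) fetches the next line of the file, because your original awk script printed the line after 'align="center">', not the matching line itself. The decode method (objects in Python, such as strings and bytes, have methods) takes the binary data and converts it to a printable unicode string. The rstrip() method strips any trailing whitespace (that is, the newline at the end of each line).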
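The same loop can be tried offline by swapping the HTTP response for an in-memory file. This is only a sketch: the sample HTML, the address in it, and the helper name lines_after are all invented for illustration, and io.BytesIO merely stands in for the object that urlopen returns.

```python
import io

# Hypothetical sample data standing in for the HTTP response body.
sample = io.BytesIO(
    b'<table>\n'
    b'<td align="center">\n'
    b'203.0.113.7\n'
    b'</td>\n'
    b'</table>\n'
)

def lines_after(f, marker):
    """Return the decoded, right-stripped line that follows each line containing marker."""
    results = []
    for line in f:
        if marker in line:
            # next(f, b'') avoids StopIteration if the marker is on the last line
            results.append(next(f, b'').decode().rstrip())
    return results

matches = lines_after(sample, b'align="center">')
print(matches)
```

Because the for loop and next(f) share the same iterator, the line consumed by next(f) is not visited again by the loop, which is what gives the awk getline behaviour.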

score 3 · Accepted Answer

# no need for .readlines here
# under Python 3, f yields bytes, so the pattern needs the b prefix
for ln in f:
    if b'align="center">' in ln:
        print(ln)

But do read the Python tutorial.

score 0 · Accepted Answer

I would probably use a regular expression to get the IP itself:

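import re
import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
# read() returns bytes; decode so the str pattern below can match
html_text = f.read().decode()
f.close()
print(re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', html_text)[0])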

This prints the first string of the form: 1-3 digits, period, 1-3 digits, ...

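I take it you are looking for the whole line; you can simply extend the pattern string in the findall() call to handle that (see the Python documentation for details). By the way, the r in front of the pattern makes it a raw string, so you do not need to escape Python escape characters inside it (but you still need to escape regex special characters).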
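Here is a small offline sketch of the same pattern, run against a made-up string instead of the real page. The sample text and both addresses in it are invented for illustration.

```python
import re

# Made-up text standing in for the decoded page contents.
html_text = 'Your IP address is 203.0.113.7 (seen via 198.51.100.2)'

# Raw string: the backslashes reach the regex engine unchanged.
pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

all_matches = re.findall(pattern, html_text)
# Guard against an empty result instead of indexing [0] blindly.
first = all_matches[0] if all_matches else None

# A raw string keeps the backslash; without r it must be doubled.
same_literal = r'\d' == '\\d'

print(first, all_matches, same_literal)
```

Note that this pattern only checks the shape of the text, so it would also match something like 999.999.999.999. That is good enough for picking an address out of a page, but it is not a validity check.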

Hope that helps.
