python-2.7 - Reading multilingual strings from HTML with Python 2.7

Question

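I am new to Python 2.7 and I am trying to extract some information from HTML files. More specifically, I want to read text that contains multilingual content. My script is below to make things clearer.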

import urllib2
import BeautifulSoup

url = 'http://www.bbc.co.uk/zhongwen/simp/'

page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})

print data[0]['content'].encode("utf-8")

The result I am getting is

BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Νbbcchinese.com, email news, newsletter, subscription, full text

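The problem is in the first string. Is there any way to print exactly what I am reading? Also, is there any way to find the exact encoding used for each language?

PS: I would like to mention that the site was chosen completely at random, because it is representative of the problem I am running into.

Thanks in advance!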

score 1 · Accepted Answer

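The problem is with the terminal you are printing to. The script itself works fine: if you write the data to a file instead, you will get the correct text.

Example: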

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.bbc.co.uk/zhongwen/simp/'

page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})

with open("test.txt", "w") as myfile:
    myfile.write(data[0]['content'].encode("utf-8"))

test.txt:

BBC中文网，主页，bbcchinese.com, email news, newsletter, subscription, full text
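
The garbled characters in the question look like UTF-8 bytes being displayed in a legacy console code page. Here is a minimal sketch of that effect. It assumes the console used code page 737, which is a guess based on the Greek letters and box-drawing characters in the output (the question does not say which terminal was used). The helper `show_as` is a hypothetical name introduced only for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def show_as(text, wrong_encoding):
    """Encode text as UTF-8, then misread those bytes in another code page."""
    return text.encode("utf-8").decode(wrong_encoding)


original = u"中文"

# Assumption: cp737 stands in for the asker's console code page.
garbled = show_as(original, "cp737")

# Reversing the wrong decoding recovers the text, so the bytes were never damaged.
restored = garbled.encode("cp737").decode("utf-8")
print(restored == original)  # → True
```

Because the round trip is lossless, the bytes the script produces are fine. Only the way the terminal displays them is wrong, which is why writing them to a file gives the expected text.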

Which operating system and terminal are you using?
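
One way to answer that from inside Python is to ask the interpreter which encoding it thinks the terminal uses, and whether that encoding can represent the text at all. The sketch below is only an illustration: `can_encode` is a hypothetical helper, and the sample string is taken from the expected output above.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # LookupError covers encoding names Python does not know.
        return False


# sys.stdout.encoding may be None when output is piped, so fall back to ASCII.
terminal_encoding = getattr(sys.stdout, "encoding", None) or "ascii"
print("terminal encoding:", terminal_encoding)

keywords = u"BBC中文网"
if not can_encode(keywords, terminal_encoding):
    print("this terminal cannot display the text; write it to a file instead")
```

If the reported encoding cannot represent Chinese characters, no amount of re-encoding in the script will make `print` show them correctly. In that case, either switch the terminal to a UTF-8 capable setup or write to a file as shown above.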
