python - Python UTF-8 problem

Question

This is my script:

# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2

res = urllib2.urlopen('http://tazeh.net')
html = res.read()

soup = BeautifulSoup(''.join(html))

title = soup.findAll('title')
print title

When I run this script in a terminal, I get garbled output like this:

$ python test.py

[<title>ŮžŘ§Ű&OElig;ÚŻŘ§Ů&Dagger; ŘŽŘ¨ŘąŰ&OElig; ŘŞŘŮ&bdquo;Ű&OElig;Ů&bdquo;Ű&OElig; ŘŞŘ§Ř˛Ů&Dagger;</title>]

The title is in Persian and the page is encoded as UTF-8.

I'm new to Python. What's going wrong?

score 3 · Accepted Answer

If I add (as one of the comments suggested, though at a less useful place):

html = html[:10000].decode("utf-8")

(the slice is there because decoding fails at an offset further into the page)

before:

soup = BeautifulSoup(html)

it prints:

[<title>پایگاه خبری تحلیلی تازه</title>]
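The slice works around a decode failure further into the page. As an alternative, here is a minimal sketch that decodes the whole byte string and substitutes U+FFFD for any invalid bytes instead of truncating. It is written for Python 3 rather than the Python 2 of the question, and the helper name `decode_lenient` and the sample bytes are made up for illustration; they are not the real page content.

```python
def decode_lenient(raw, encoding="utf-8"):
    """Decode bytes, replacing undecodable bytes with U+FFFD
    instead of raising UnicodeDecodeError."""
    return raw.decode(encoding, errors="replace")


# Hypothetical stand-in for the downloaded page: a valid UTF-8 title
# followed by two bytes that can never appear in valid UTF-8.
title = "پایگاه خبری تحلیلی تازه"
raw = ("<title>" + title + "</title>").encode("utf-8") + b"\xff\xfe tail"

text = decode_lenient(raw)

# The title survives intact; only the invalid bytes were replaced.
print(title in text)
print("\ufffd" in text)
```

Passing the resulting unicode string to BeautifulSoup means the parser no longer has to guess the encoding. Whether losing the invalid bytes is acceptable depends on what else you need from the page.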

score 1 · Accepted Answer

''.join(html) is unnecessary. The variable html is already a string.

However, the page does not appear to be correctly encoded as UTF-8.
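To illustrate why the output looks garbled, here is a small Python 3 sketch of what happens when UTF-8 bytes are decoded with the wrong codec. The choice of ISO-8859-2 as the wrong codec is an assumption made for the demonstration; the first characters of the output in the question are consistent with it, but the codec the parser actually guessed may differ.

```python
title = "پایگاه خبری تحلیلی تازه"

# The bytes as they would travel over the network.
raw = title.encode("utf-8")

# Decoding with the wrong codec (assumed here: ISO-8859-2) does not fail,
# it silently turns each byte into an unrelated character.
mojibake = raw.decode("iso-8859-2")

# Each Persian letter takes two bytes in UTF-8, so the garbled
# string is longer than the original.
print(len(title), len(mojibake))

# Because ISO-8859-2 maps every byte value to a character, the damage
# is reversible as long as the garbled text has not been altered further.
repaired = mojibake.encode("iso-8859-2").decode("utf-8")
print(repaired == title)
```

If the page really does contain invalid UTF-8 sequences, as this answer suggests, a strict decode will raise an error no matter which variant of the script is used.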
