python - Python: encoding error - web page content

Question

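I am trying to fetch the content of a web page, parse it, and then save it in a MySQL database.

This already works for web pages encoded in UTF-8.

But when I try it with a web page encoded in ISO-8859-9, I get an error.

My code for getting the page content: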

import urllib2
import chardet

def getcontent(url):
    opener = urllib2.build_opener()
    # Set both headers in one list; a second assignment would overwrite the first
    opener.addheaders = [('User-agent', 'Magic Browser'),
                         ('Accept-Charset', 'utf-8')]
    response = opener.open(url).read()
    #print chardet.detect(response).get('encoding')
    opener.close()
    return response



url = "http://www.meb.gov.tr/duyurular/index.asp?ID=4"
contentofpage = getcontent(url)
print contentofpage
print chardet.detect(contentofpage)
print contentofpage.encode("utf-8")

Output of the page content: ... E�itim Teknolojileri Genel M�d�rl� ...

{'confidence': 0.7789909202570836, 'encoding': 'ISO-8859-2'}


Traceback (most recent call last):
  File "meb.py", line 18, in <module>
    print contentofpage.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

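The page is actually a Turkish page, and its encoding is ISO-8859-9.

When I use the default encoding, I see �� in place of some characters. How can I convert the page content to UTF-8 or Turkish (ISO-8859-9)?

Also, when I use unicode(contentofpage)

it gives

Traceback (most recent call last):
  File "meb.py", line 20, in <module>
    print unicode(contentofpage)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

Any help?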

score 4 · Accepted Answer

I think you want to decode rather than encode, since the content is already encoded.

print contentofpage.decode("iso-8859-9")

which produces a sample such as:

Eğitim Teknolojileri Genel Müdürlüğü
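
To spell out the decode-then-encode round trip, here is a minimal sketch. It assumes the page bytes really are ISO-8859-9; the sample bytes and the to_utf8 helper are made up for illustration and are not part of the original code. In Python 2, calling .encode() on a byte string first runs an implicit ASCII decode, which is where the UnicodeDecodeError comes from; the sketch reproduces that step explicitly so it behaves the same way on Python 3.

```python
# Hypothetical sample bytes: the phrase from the page, encoded as ISO-8859-9.
raw = b"E\xf0itim Teknolojileri Genel M\xfcd\xfcrl\xfc\xf0\xfc"


def to_utf8(data, source_encoding="iso-8859-9"):
    """Decode bytes from source_encoding, then re-encode them as UTF-8."""
    text = data.decode(source_encoding)  # bytes -> unicode text
    return text.encode("utf-8")          # unicode text -> UTF-8 bytes


# Python 2 performs an implicit ASCII decode before str.encode(); doing the
# same decode explicitly fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

text = raw.decode("iso-8859-9")  # unicode text with the Turkish letters intact
utf8_bytes = to_utf8(raw)        # the same text as UTF-8 bytes
```

If the goal is to store the content as UTF-8 in MySQL, something like to_utf8(contentofpage) would give UTF-8 bytes, provided the source encoding guess is right. Note that chardet reported ISO-8859-2 with only about 0.78 confidence here, so passing the known encoding explicitly is probably safer than relying on detection.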
