python - Detecting and changing a website's encoding in Python

Question

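I have a problem with website encoding. I wrote a program to scrape a website, but I have not managed to change the encoding of the content I read. My code is: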

import sys,os,glob,re,datetime,optparse
import urllib2

from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup

#from utility import *

sTargetEncoding = "utf-8"

page_to_process = "http://www.xxxx.com" 
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding

ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content

document = BSXPathEvaluator(ucontent)

print "ORIGINAL ENCODING: " + document.originalEncoding

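I am using an external library (BSXPath is an extension of BeautifulSoup), and document.originalEncoding prints the website's own encoding rather than the utf-8 encoding I tried to convert the content to. Does anyone have any suggestions?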

Thanks

score 0 · Accepted Answer

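Well, there is no guarantee that the encoding announced in the HTTP header is the same as the one specified in the HTML itself. A mismatch can come from a misconfigured server or from a wrong charset declaration in the HTML. There is no fully reliable way to detect the correct encoding automatically. I suggest inspecting the HTML by hand to find the correct encoding (iso-8859-1 versus utf-8, for example, is easy to tell apart) and then hard-coding that encoding in your application.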
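The check described above can be sketched as follows. This is a minimal illustration, not the asker's code: it assumes Python 3 rather than the Python 2 `urllib2` used in the question, and the helper names `charset_from_header`, `charset_from_meta` and `to_target_encoding` are hypothetical. It reads the charset from the `Content-Type` header value and from a `<meta>` tag so the two can be compared, and it lets a hard-coded `override` win once the page has been inspected by hand.

```python
import re
from email.message import Message

# Rough pattern for a charset declared in a <meta> tag (assumption: good
# enough for a sketch, not a full HTML parser).
META_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased; None when absent


def charset_from_meta(raw_html):
    """Return the charset declared in a <meta> tag near the top, or None."""
    match = META_RE.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def to_target_encoding(raw_html, content_type, override=None, target="utf-8"):
    """Decode raw bytes and re-encode them as `target`.

    Order of preference (an assumption of this sketch): the hard-coded
    override, then the HTTP header, then the <meta> tag, then utf-8.
    Returns the re-encoded bytes and the source encoding that was used.
    """
    source = (
        override
        or charset_from_header(content_type)
        or charset_from_meta(raw_html)
        or "utf-8"
    )
    text = raw_html.decode(source, errors="replace")
    return text.encode(target), source


if __name__ == "__main__":
    # Sample page whose header and <meta> tag disagree.
    page = (
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-1"></head>'
        "<body>caf\u00e9</body></html>"
    ).encode("iso-8859-1")
    header = "text/html; charset=UTF-8"
    print(charset_from_header(header), charset_from_meta(page))
    converted, used = to_target_encoding(page, header, override="iso-8859-1")
    print(used, converted.decode("utf-8"))
```

Printing both declared values makes a mismatch visible. Which one is right still has to be decided by looking at the page, which is why the override takes priority over both.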
