python - I just want to download this URL... but it gives me an error! ...unicode.. (Python)

Question

theurl = 'http://bit.ly/6IcCtf/'
urlReq = urllib2.Request(theurl)
urlReq.add_header('User-Agent',random.choice(agents))
urlResponse = urllib2.urlopen(urlReq)
htmlSource = urlResponse.read()
if unicode == 1:
    #print urlResponse.headers['content-type']
    #encoding=urlResponse.headers['content-type'].split('charset=')[-1]
    #htmlSource = unicode(htmlSource, encoding)
    htmlSource =  htmlSource.encode('utf8')
return htmlSource

Please look at the unicode part. I have tried both of those options, but neither works.

htmlSource =  htmlSource.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 370747: ordinal not in range(128)

The same thing happens when I try the longer way of encoding (the commented-out lines above):

_mysql_exceptions.Warning: Incorrect string value: '\xE7\xB9\x81\xE9\xAB\x94...' for column 'html' at row 1

score 5 · Accepted Answer

Your HTML data is a byte string that comes from the internet and is already encoded in some encoding. Before you can encode it as utf-8, you must decode it first.

Python is implicitly trying to decode it with the default ascii codec before encoding (that is why you get a UnicodeDecodeError, not a UnicodeEncodeError).
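To make the implicit step visible, here is a minimal sketch that spells out what Python 2 does behind the scenes when `.encode('utf8')` is called on a byte string. The helper name `implicit_encode` is hypothetical, and the sample bytes are the ones quoted in the MySQL warning from the question. It uses an explicit byte literal so it behaves the same way on Python 3.

```python
# Bytes taken from the MySQL warning in the question (UTF-8 encoded text).
raw = b'\xe7\xb9\x81\xe9\xab\x94'


def implicit_encode(byte_string):
    """Hypothetical helper: mimic Python 2's str.encode('utf8').

    Python 2 first decodes the byte string with the default 'ascii'
    codec, and only then encodes the resulting text as UTF-8.
    """
    return byte_string.decode('ascii').encode('utf8')


try:
    implicit_encode(raw)
    error_name = None
    failing_codec = None
except UnicodeDecodeError as exc:
    # The failure happens in the hidden decode step, not in the encode step.
    error_name = type(exc).__name__
    failing_codec = exc.encoding
```

The exception comes from the hidden decode, which is why the traceback in the question names the 'ascii' codec even though the code only asked for utf8.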

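You can solve the problem by explicitly decoding the byte string, using the appropriate encoding, before trying to re-encode it as utf-8.

Example: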

utf8encoded = htmlSource.decode('some_encoding').encode('utf-8')

Instead of 'some_encoding', use the encoding the page was originally encoded with.

You have to know which encoding the string uses before you can decode it.
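One common place to find the page's encoding is the charset parameter of the Content-Type response header, which is what the commented-out lines in the question were reaching for. The sketch below is only an illustration: the function names `charset_from_content_type` and `to_utf8` are hypothetical, and falling back to utf-8 when no charset is declared is an assumption, not something the server guarantees.

```python
def charset_from_content_type(content_type, default='utf-8'):
    """Hypothetical helper: pull the charset parameter out of a header value.

    Returns `default` (an assumption) when the header declares no charset.
    """
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.strip().lower() == 'charset' and value:
            return value.strip().strip('"\'').lower()
    return default


def to_utf8(raw_bytes, content_type):
    """Decode with the declared charset, then re-encode as UTF-8 bytes."""
    encoding = charset_from_content_type(content_type)
    text = raw_bytes.decode(encoding)   # bytes -> text
    return text.encode('utf-8')         # text -> UTF-8 bytes


# A Latin-1 encoded body ('caf' followed by e-acute) converted to UTF-8.
converted = to_utf8(b'caf\xe9', 'text/html; charset=ISO-8859-1')
```

With the question's code, the header value would come from `urlResponse.headers['content-type']`. Pages sometimes declare the wrong charset or none at all, so the decode step can still fail and may need its own error handling.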

score 3 · Accepted Answer

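Shouldn't you decode instead?

htmlSource = htmlSource.decode('utf8')

decode means "decode htmlSource from the utf8 encoding".

encode means "encode htmlSource into the utf8 encoding".

Since you are extracting existing data (scraping a website), you need to decode it. When you insert it into mysql, you may need to encode it as utf8, depending on your mysql db/table/field collation.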
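The two directions described in this answer can be illustrated with a short round trip. This is a sketch that assumes the page really is UTF-8 encoded; the sample bytes are the ones from the MySQL warning in the question, and the variable names are only for illustration.

```python
# Bytes as they would arrive from the network (assumed to be UTF-8 here).
raw = b'\xe7\xb9\x81\xe9\xab\x94'

# decode: bytes in a known encoding -> text (code points)
text = raw.decode('utf8')

# encode: text -> bytes in the target encoding, e.g. before a database insert
for_database = text.encode('utf8')

# Six bytes on the wire correspond to two characters of text.
byte_count = len(raw)
char_count = len(text)
```

Decoding turns six bytes into two characters, and encoding the text again yields the original six bytes, so no data is lost along the way.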

score 1 · Accepted Answer

You probably want to decode from UTF-8 rather than encode to it:

htmlSource =  htmlSource.decode('utf8')

Answered 2009-11-27T12:59:51.673
