
I'm on Ubuntu 13.04 with Python 2.7.4 and tried to run a script that includes the following lines:

html = unicode(html, 'cp932').encode('utf-8')
html1, html2 = html.split(some_text) # this line spits out the error

However, when I ran the above script on Ubuntu 13.04, it raised UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 106: ordinal not in range(128). Yet exactly the same script always runs successfully on OS X 10.8 with Python 2.7.3. So I wonder why the error occurs on only one of the two platforms...

My first thought, especially after reading this post (UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1), was that the difference arose because the LANG environment differs: I use jp_JP.UTF-8 on OS X but en_US.UTF-8 on Ubuntu. So I also tried adding one more line, os.environ['LANG'] = 'jp_JP.UTF-8', to the aforementioned script, but still got the same error.

One more strange thing: when I run the script from within the IPython shell on Ubuntu, enter debug mode immediately after the error happens, and then re-run the line that originally triggered the error, I don't get the error any more...

So what's happening here? And what am I missing?

Thanks in advance.


1 Answer


You haven't given us enough information to be sure, but there's a good chance this is your problem:

If some_text is a unicode object, then this line:

html1, html2 = html.split(some_text) # this line spits out the error

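…is calling split on a str and passing it a unicode argument. Whenever you mix str and unicode in the same call, Python 2.x handles it by automatically calling unicode on the str. So, that's equivalent to: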

html1, html2 = unicode(html).split(some_text) # this line spits out the error

…which is equivalent to:

html1, html2 = html.decode(sys.getdefaultencoding()).split(some_text) # this line spits out the error

…which will fail if there are any non-ASCII characters in html, as you're seeing.
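To make the failure concrete, here is a minimal sketch that performs that implicit step by hand. The sample markup and the variable `error` are my own placeholders, not taken from your script, and I'm assuming the default encoding is `ascii`, which is the usual case on Python 2:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample data: UTF-8 bytes containing one Japanese character.
# u'\u3042' encodes to the three bytes 0xe3 0x81 0x82.
html = u'<p>\u3042</p>'.encode('utf-8')

try:
    # Spelled out by hand: the decode that Python 2 attempts implicitly
    # when a str and a unicode object meet in the same call.
    html.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The failure is reported at the first non-ASCII byte (index 3, after '<p>').
print(repr(error))
```

The byte 0xe3 in your traceback fits this picture: it is a typical lead byte of UTF-8 encoded Japanese text.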


The easy workaround is to explicitly encode some_text to UTF-8:

html1, html2 = html.split(some_text.encode('utf-8'))

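But personally, I wouldn't even try to work with str objects from 3 different charsets in the same program. Why not just decode/encode at the very edges, and deal only with unicode objects in between?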
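A rough sketch of that decode-at-the-edges layout, under some assumptions: the sample bytes, the `<hr>` separator, and the names `raw`, `text`, and `output` are hypothetical stand-ins for whatever your script actually reads and writes.

```python
# -*- coding: utf-8 -*-
# Hypothetical input: cp932-encoded bytes, standing in for the downloaded page.
raw = u'\u524d\u534a<hr>\u5f8c\u534a'.encode('cp932')

# Edge 1: decode once, as soon as the bytes enter the program.
text = raw.decode('cp932')

# Middle: everything is unicode, so no implicit ASCII decode can be triggered.
some_text = u'<hr>'  # assumed separator, also unicode
html1, html2 = text.split(some_text)

# Edge 2: encode once, just before the data leaves the program.
output = (html1 + u'\n' + html2).encode('utf-8')
```

With this arrangement the split never sees a byte string, so it no longer matters whether some_text happens to be unicode.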

Answered 2013-06-28T00:48:14.833