1428

I'm having problems dealing with unicode characters in text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below:

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All of the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?


34 Answers

1473

You need to read the Python Unicode HOWTO. This error is the very first example.

Basically, stop using str to convert from unicode to encoded text / bytes.

Instead, properly use .encode() to encode the string:

p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

Or work entirely in unicode.
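As a sketch of the "work entirely in unicode" option, shown in Python 3 terms (where every str is already Unicode; the variable values below are hypothetical stand-ins for the OP's scraped data): keep everything as text and encode only at the output boundary.

```python
# Hypothetical stand-ins for the scraped values; note the U+00A0 NO-BREAK SPACE
# that triggered the OP's traceback.
agent_contact = "John Smith"
agent_telno = "\xa0020 7946 0000"

# Join and strip while still text -- no implicit ASCII conversion happens.
agent_info = " ".join((agent_contact, agent_telno)).strip()

# Encode to bytes only where bytes are actually required (files, sockets, ...).
payload = agent_info.encode("utf-8")
```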

Answered 2012-03-30T12:21:31.160
470

This is a classic python unicode pain point! Consider the following:

a = u'bats\u00E0'
print a
 => batsà

All good so far, but let's see what happens if we call str(a):

str(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

Oh dip, that's not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:

a.encode('utf-8')
 => 'bats\xc3\xa0'
print a.encode('utf-8')
 => batsà

Voil\u00E0!

The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.
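To make the codec choice concrete, here is a small comparison sketch (Python 3 syntax; the same error handlers also exist on Python 2's unicode.encode) of what the standard error handlers do with the example string:

```python
s = "bats\u00e0"

assert s.encode("utf-8") == b"bats\xc3\xa0"                   # lossless
assert s.encode("ascii", "ignore") == b"bats"                 # silently drops the character
assert s.encode("ascii", "replace") == b"bats?"               # substitutes '?'
assert s.encode("ascii", "backslashreplace") == b"bats\\xe0"  # keeps an escape
```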

For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html

Answered 2012-03-30T12:25:08.317
237

I found an elegant workaround that removes the offending symbols while keeping the string a string, as follows:

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

It is important to notice that using the ignore option is dangerous, because it silently drops any unicode (and internationalization) support from the code that uses it, as seen here (convert unicode):

>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
Answered 2014-08-20T10:13:15.407
182

Well, I tried everything, but it did not help. After googling around, I figured out the following, and it helped. Python 2.7 is in use.

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Answered 2016-09-02T13:10:37.427
101

A subtle problem that can even cause print to fail is having your environment variables set wrong, e.g. here LC_ALL set to "C". In Debian they discourage setting it: Debian wiki on Locale

$ echo $LANG
en_US.utf8
$ echo $LC_ALL 
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
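A quick diagnostic sketch (not part of the original answer) to see what the interpreter picked up from the environment:

```python
import sys
import locale

# Encoding print() will use for stdout (driven by the locale / PYTHONIOENCODING).
print(getattr(sys.stdout, "encoding", None))

# Default encoding open() uses for text-mode files.
print(locale.getpreferredencoding())
```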
Answered 2013-12-02T17:58:15.743
28

Actually, I found that in most of my cases, just stripping those characters out is much simpler:

s = mystring.decode('ascii', 'ignore')
Answered 2013-11-01T13:44:36.433
28

For me, what worked was:

BeautifulSoup(html_text, from_encoding="utf-8")

Hope this helps someone.

Answered 2015-01-26T14:53:34.127
19

Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

def safeStr(obj):
    try: return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    except: return ""

Testing it:

if __name__ == '__main__': 
    print safeStr( 1 ) 
    print safeStr( "test" ) 
    print u'98\xb0'
    print safeStr( u'98\xb0' )

Results:

1
test
98°
98

UPDATE: My original answer was written for Python 2. For Python 3:

def safeStr(obj):
    try: return str(obj).encode('ascii', 'ignore').decode('ascii')
    except: return ""

Note: if you'd prefer to leave a ? indicator where the "unsafe" unicode characters are, specify replace instead of ignore in the call to encode for the error handler.

Suggestion: you might want to name this function toAscii instead? That's a matter of preference...

Finally, here's a more robust PY2/3 version using six, where I opted to use replace, and peppered in some character swaps to replace fancy unicode quotes and apostrophes which curl left or right with the simple vertical ones that are part of the ascii set. You might expand on such swaps yourself:

from six import PY2, iteritems 

CHAR_SWAP = { u'\u201c': u'"'
            , u'\u201D': u'"' 
            , u'\u2018': u"'" 
            , u'\u2019': u"'" 
}

def toAscii( text ) :    
    try:
        for k,v in iteritems( CHAR_SWAP ): 
            text = text.replace(k,v)
    except: pass     
    try: return str( text ) if PY2 else bytes( text, 'ascii', 'replace' ).decode('ascii')
    except UnicodeEncodeError:
        return text.encode('ascii', 'replace').decode('ascii')
    except: return ""

if __name__ == '__main__':     
    print( toAscii( u'testin\u2019' ) )
Answered 2017-09-26T19:23:10.130
16

Add the line below at the beginning of your script (or as the second line):

# -*- coding: utf-8 -*-

That's the declaration of the python source code encoding. More info in PEP 263.

Answered 2016-08-08T10:17:31.143
10

I always put the code below in the first two lines of the python files:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
Answered 2018-02-27T14:40:22.633
8

Alas, this works in Python 3 at least...

Python 3

Sometimes the error is in the environment variables and the encoding, so

import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
... 
print(myText.encode('utf-8', errors='ignore'))

where errors are ignored during encoding.

Answered 2019-04-15T21:49:34.243
7

Simple helper functions found here.

def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')
Answered 2015-12-31T07:57:50.447
7

Just call encode('utf-8') on the variable:

agent_contact.encode('utf-8')
Answered 2018-01-31T05:50:36.710
7

It works for me:

export LC_CTYPE="en_US.UTF-8"
Answered 2021-07-07T10:12:19.040
6

Open a terminal and run the command below:

export LC_ALL="en_US.UTF-8"
Answered 2018-12-25T01:37:23.473
5

I just used the following:

import unicodedata
message = unicodedata.normalize("NFKD", message)

Check what documentation says about it:

unicodedata.normalize(form, unistr) Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various way. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.

Solves it for me. Simple and easy.
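For intuition on why NFKD in particular helps with the OP's u'\xa0', here is a short sketch: compatibility decomposition folds characters like NO-BREAK SPACE into their plain equivalents, and any leftover combining marks can then be dropped.

```python
import unicodedata

# NFD: canonical decomposition only. U+00C7 becomes 'C' + COMBINING CEDILLA.
assert unicodedata.normalize("NFD", "\u00c7") == "C\u0327"

# NFKD additionally applies compatibility decomposition:
# NO-BREAK SPACE (the OP's u'\xa0') folds into an ordinary space.
assert unicodedata.normalize("NFKD", "\xa0") == " "

# A common recipe: normalize, then drop the remaining combining marks.
ascii_text = unicodedata.normalize("NFKD", "Malm\u00f6").encode("ascii", "ignore").decode("ascii")
assert ascii_text == "Malmo"
```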

Answered 2016-11-27T21:59:51.920
4

The solution below worked for me. I just added

u"String"

(representing the string as unicode) before my string.

result_html = result.to_html(col_space=1, index=False, justify={'right'})

text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report.  Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)
Answered 2017-11-01T21:58:30.773
4

Late answer, but this error is related to your terminal's encoding not supporting certain characters.
I fixed it on python3 using:

import sys
import io

sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
print("é, à, ...")
Answered 2021-03-09T16:18:40.077
3

I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:

# 'value' contains the problematic data
unic = u''
unic += value
value = unic

I had this idea after reading Ned's presentation.

I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.

Answered 2016-03-12T03:14:47.003
3

We struck this error when running manage.py migrate in Django with localized fixtures.

Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.

The issue was simply that the Django container (we use docker) was missing the LANG env var.

Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.

Answered 2018-04-11T07:26:07.393
3

Update for python 3.0 and later. Try the following in a terminal:

locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8

This sets the system's default locale encoding to the UTF-8 format.

More can be read here at PEP 538 -- Coercing the legacy C locale to a UTF-8 based locale.

Answered 2018-12-25T12:36:33.220
3

The recommended solution did not work for me, and I could live with dumping all non-ASCII characters, so

s = s.encode('ascii',errors='ignore')

which left me with something stripped that doesn't throw errors.

Answered 2020-03-26T19:34:18.723
3

In the general case of writing a string with this unsupported encoding (let's say data_that_causes_this_error) to some file (e.g. results.txt), this works:

f = open("results.txt", "w")
f.write(data_that_causes_this_error.encode('utf-8'))
f.close()
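Note the snippet above is Python 2 style (writing manually encoded bytes to a text-mode file). In Python 3, a sketch of the equivalent is to open the file in text mode with an explicit encoding and let the file object do the encoding:

```python
# Python 3: pass the encoding to open() and write str objects directly.
with open("results.txt", "w", encoding="utf-8") as f:
    f.write("voil\u00e0\n")  # no manual .encode() needed
```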
Answered 2020-06-07T11:43:50.527
3

In case it's an issue with a print statement, a lot of times it's just an issue with the terminal printing. This helped me: export PYTHONIOENCODING=UTF-8

Answered 2021-08-23T07:00:21.863
2

Many answers here (@agf and @Andbdrew for example) have already addressed the most immediate aspects of the OP question.

However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who like me ended up here while trying to make sense of encodings in Python: Python 2 vs Python 3 management of character representation is wildly different. I feel like a big chunk of confusion out there has to do with people reading about encodings in Python without being version aware.

I suggest anyone interested in understanding the root cause of OP problem to begin by reading Spolsky's introduction to character representations and Unicode and then move to Batchelder on Unicode in Python 2 and Python 3.

Answered 2019-01-03T09:03:49.877
2

Try to avoid converting a variable with str(variable). Sometimes it may cause the issue.

A simple tip to avoid it:

try: 
    data = str(data)
except:
    data = data  # don't convert to string

The above will also avoid the encode error.

Answered 2019-07-05T06:11:36.090
0

If you have something like packet_data = "This is data", then do this on the next lines, right after initializing packet_data:

unic = u''
unic += packet_data
packet_data = unic
Answered 2018-03-15T17:43:40.150
0

I had this issue trying to output Unicode characters to stdout, but with sys.stdout.write, rather than print (so that I could support output to a different file as well).

From BeautifulSoup's own documentation, I solved this with the codecs library:

import sys
import codecs

def main(fIn, fOut):
    soup = BeautifulSoup(fIn)
    # Do processing, with data including non-ASCII characters
    fOut.write(unicode(soup))

if __name__ == '__main__':
    with (sys.stdin) as fIn: # Don't think we need codecs.getreader here
        with codecs.getwriter('utf-8')(sys.stdout) as fOut:
            main(fIn, fOut)
Answered 2019-01-27T01:27:47.240
0

This problem often happens when a django project is deployed using Apache, because Apache sets the environment variable LANG=C in /etc/sysconfig/httpd. Just open the file and comment out (or change to your flavor) this setting. Alternatively, use the lang option of the WSGIDaemonProcess command, in which case you will be able to set a different LANG environment variable for different virtualhosts.

Answered 2019-10-14T08:08:53.063
0

This will work:

>>> print(unicodedata.normalize('NFD', re.sub("[\(\[].*?[\)\]]", "", "bats\xc3\xa0")).encode('ascii', 'ignore'))

Output:

bats
Answered 2020-05-14T08:35:52.707
0

You can set the character encoding to UTF-8 before running your script:

export LC_CTYPE="en_US.UTF-8"

This should generally resolve the issue.

Answered 2021-05-07T14:27:06.883
0

You can use unicodedata to avoid a UnicodeEncodeError. Here is an example (note that normalize must be applied to the extracted string, after the None check):

import unicodedata

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
agent_telno = unicodedata.normalize("NFKD", agent_telno)  # folds unwanted characters such as u'\xa0' into their plain equivalents
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
Answered 2022-01-07T15:24:33.097