python - Converting non-standard characters in Python

Question

I have a list extracted from a web page that contains some non-standard characters.

Example of the list:

[<td class="td-number-nowidth"> 10Â 115 </td>, <td class="td-number-nowidth"> 4Â 635 (46%) </td>, <td class="td-number-nowidth"> 5Â 276 (52%) </td>, ...]

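The A with the hat should be a comma. Can anyone suggest how to convert or replace these so that I get the value 10115 from the first item in the list?

Source code: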

from urllib import urlopen
from bs4 import BeautifulSoup
import re, string
content = urlopen('http://www.worldoftanks.com/community/accounts/1000395103-FrankenTank').read()
soup = BeautifulSoup(content)

BattleStats = soup.find_all('td', 'td-number-nowidth')
print BattleStats

Thanks, Frank

score 3 · Accepted Answer

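Does the site state its encoding in its headers (the charset normally appears in Content-Type; Content-Encoding describes compression)? You have to get it and use the .decode method to decode the strings in the list. It goes like encoded_string.decode("encoding"), where encoding can be any codec name, utf-8 being one of them.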
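A minimal sketch of that idea, written in Python 3 syntax (the thread itself uses Python 2). The header value and the raw bytes are made-up sample data, and charset_from_content_type is a hypothetical helper, not part of any library. It also shows how decoding UTF-8 bytes with the wrong codec would produce the stray Â:

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Hypothetical helper: pull the charset out of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None
    return msg.get_content_charset() or default


# Assumed sample data: a no-break space (U+00A0) encoded as UTF-8
raw = b" 10\xc2\xa0115 "

encoding = charset_from_content_type("text/html; charset=UTF-8")
text = raw.decode(encoding)        # correct decoding
mojibake = raw.decode("latin-1")   # wrong codec: shows up as 'Â' + no-break space

print(repr(text))
print(repr(mojibake))
```

If the header carries no charset, the sketch falls back to utf-8, which is only a guess and may not match the real page.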

score 2 · Accepted Answer

You can use the .decode method with the errors='ignore' argument.

>>> s = '[ 10Â 115 , 4Â 635 (46%) , 5Â 276 (52%) , ...]'
>>> s.decode('ascii', errors='ignore')
u'[ 10 115 , 4 635 (46%) , 5 276 (52%) , ...]'

Here is help(''.decode):

decode(...)
    S.decode([encoding[,errors]]) -> object

    Decodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
    as well as any other name registered with codecs.register_error that is
    able to handle UnicodeDecodeErrors.
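To compare the error handlers described in that help text, here is a small sketch in Python 3 syntax, where decode is a method of bytes rather than str. The byte string is assumed sample data (a UTF-8 encoded no-break space between the digits), not the actual page content:

```python
# Assumed sample: "10", a UTF-8 encoded no-break space, then "115"
raw = b"10\xc2\xa0115"

# 'ignore' silently drops every byte the codec cannot decode
ignored = raw.decode("ascii", errors="ignore")

# 'replace' substitutes U+FFFD for each undecodable byte
replaced = raw.decode("ascii", errors="replace")

# the default 'strict' handler raises instead
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# decoding with the right codec keeps the character intact
decoded = raw.decode("utf-8")

print(repr(ignored), repr(replaced), strict_failed, repr(decoded))
```

Note that 'ignore' discards data, so it hides the problem rather than fixing the decoding.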

score 0 · Accepted Answer

Have you tried this?

This might work.

# -*- coding: utf-8 -*-
a = ['10Â 115', '4Â 635 (46%)', '5Â 276 (52%)']
for b in a:
    # str.replace returns a new string, so print the result
    print b.replace("\xc3\x82 ", '')

Output:

10115
4635 (46%)
5276 (52%)

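Depending on how constant it is (if it is always just that accented A), there may be a better way, such as replacing everything from the \ up to and including the space with an empty string.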
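One possible form of such a more general replacement, sketched in Python 3 syntax with a regular expression. The helper name strip_digit_separators and the choice of characters to remove (Â, no-break space, plain space) are assumptions for illustration:

```python
import re

# Runs of 'Â', no-break space or plain space that sit between two digits
SEPARATOR = re.compile(r"(?<=\d)[\u00c2\u00a0 ]+(?=\d)")


def strip_digit_separators(s):
    """Hypothetical helper: join digit groups split by stray separators."""
    return SEPARATOR.sub("", s)


samples = ["10\u00c2 115", "4\u00c2 635 (46%)", "5\u00a0276 (52%)"]
cleaned = [strip_digit_separators(s) for s in samples]
print(cleaned)
```

Because the pattern requires a digit on both sides, the space before the percentage in parentheses is left alone.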

score 0 · Accepted Answer

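BeautifulSoup handles character encodings automatically. The problem is printing to your console, which does not seem to support some Unicode characters. In this case it is 'NO-BREAK SPACE' (U+00A0):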

>>> L = soup.find_all('td', 'td-number-nowidth')
>>> L[0]
<td class="td-number-nowidth"> 10 123 </td>
>>> L[0].get_text()
u' 10\xa0123 '

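Note that the text is Unicode. Check whether print u'<\u00a0>' works in your case.

You can control the output encoding by setting the PYTHONIOENCODING environment variable before running your script. That way you can redirect output to a file using the utf-8 encoding, and use the ascii:backslashreplace value for debugging runs in the console, without changing your script. Example in bash: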

$ python -c 'print u"<\u00a0>"' # use default encoding
< >
$ PYTHONIOENCODING=ascii:backslashreplace python -c 'print u"<\u00a0>"'
<\xa0>
$ PYTHONIOENCODING=utf-8 python -c 'print u"<\u00a0>"' > output.txt
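The same backslashreplace behaviour can be reproduced inside a script, which may help when the environment variable cannot be set. This is a sketch in Python 3 syntax; the variable names are illustrative only:

```python
import sys

text = "<\u00a0>"

# Encode to ASCII, escaping anything that does not fit, then turn the
# bytes back into a str so it can be printed on any console
escaped = text.encode("ascii", errors="backslashreplace").decode("ascii")

print(escaped)
# The encoding the interpreter picked for stdout (may be None if unknown)
print(sys.stdout.encoding)
```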

To print the corresponding numbers, you can split on the no-break space and process the items later:

>>> [td.get_text().split(u'\u00a0')
...  for td in soup.find_all('td', 'td-number-nowidth')]
[[u' 10', u'115 '], [u' 4', u'635 (46%) '], [u' 5', u'276 (52%) ']]

Or you could replace it with a comma:

>>> [td.get_text().replace(u'\u00a0', ', ').encode('ascii').strip()
...  for td in soup.find_all('td', 'td-number-nowidth')]
['10, 115', '4, 635 (46%)', '5, 276 (52%)']
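Since the question asks for the value 10115 itself, a further step could be to drop the no-break space and convert the leading field to an integer. A minimal sketch in Python 3 syntax; parse_count is a hypothetical helper and the cell strings are assumed to look like the get_text() results shown earlier:

```python
def parse_count(cell_text):
    """Hypothetical helper: return the leading number of a cell as an int."""
    # Remove the no-break space first: in Python 3, str.split() with no
    # arguments treats U+00A0 as whitespace and would split the number
    joined = cell_text.replace("\u00a0", "")
    return int(joined.split()[0])


cells = [" 10\u00a0115 ", " 4\u00a0635 (46%) ", " 5\u00a0276 (52%) "]
counts = [parse_count(c) for c in cells]
print(counts)
```

The percentage in parentheses is ignored here; it would need separate handling if it is wanted.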
