python - Help with replacing a non-ASCII character in Python

Question

I have a bunch of HTML files that I downloaded using the httplib2 package in Python. '&nbsp;' is showing up as 'Â '.

<font color="#ff0000">02/12/2004Â </font> is showing while <font color="#ff0000">02/12/2004&nbsp;</font> is the desired format.

How can I replace 'Â ' with '&nbsp;' in Python? Thanks a lot!

score 1 · Accepted Answer

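You have an encoding problem. Rather than trying to strip these characters, find out the page's encoding, and then read the file with the codecs module instead of open(), passing the correct character encoding.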
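A minimal sketch of that approach, assuming the page is UTF-8 encoded (the real value should come from the Content-Type header or the page's meta tag). The file name page.html and the sample bytes are made up for the demonstration.

```python
import codecs
import os
import tempfile

# Hypothetical sample: the bytes a UTF-8 page holds for "02/12/2004" + U+00A0.
raw = b"02/12/2004\xc2\xa0"

# Write the sample to a temporary file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "wb") as f:
    f.write(raw)

# Reading with the wrong encoding reproduces the symptom: an extra 'Â'.
with codecs.open(path, "r", encoding="latin-1") as f:
    wrong = f.read()

# Reading with the assumed-correct encoding yields one non-breaking space.
with codecs.open(path, "r", encoding="utf-8") as f:
    right = f.read()

print(repr(wrong))
print(repr(right))
```

Comparing the two printed values shows that the 'Â' is not in the file itself; it only appears when the UTF-8 bytes are interpreted as Latin-1.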

score 0 · Accepted Answer

import string
filtered_content = ''.join(c for c in content if c in string.printable)  # keeps printable ASCII only

This solved my problem. Thanks!

score -1 · Accepted Answer

s = s.replace('Â ', '&nbsp;')

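However, although I haven't used httplib2, I'm fairly sure something is wrong if the source of the HTML files changes when you download them. It is probably a decoding issue. Which version of Python are you using? If it's Python 3, the content will be a sequence of bytes rather than a string, so you have to specify the correct code page to decode the bytes with.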
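To illustrate the decoding point in Python 3, here is a sketch that assumes the server sent UTF-8 and that the downloaded body is available as bytes. The value of body is a stand-in, not real httplib2 output.

```python
# Stand-in for the bytes of a downloaded response body (assumed UTF-8).
body = b'<font color="#ff0000">02/12/2004\xc2\xa0</font>'

# Decode with the encoding the server declared; UTF-8 is an assumption here.
text = body.decode("utf-8")

# If the entity form is wanted in the output, turn the decoded
# non-breaking space (U+00A0) back into "&nbsp;".
html = text.replace("\u00a0", "&nbsp;")
print(html)
```

Decoding first and replacing afterwards targets the actual character rather than its mis-decoded two-character form, so no stray 'Â' has to be matched.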

Edit: If you aren't limited to httplib2, maybe you could try the urllib, urllib2, or httplib modules from the Python 2.6 standard library?
