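I'm trying to fetch a page's HTML with urllib2 and parse it with BeautifulSoup, but I'm having trouble with the HTML: Â and &amp symbols/letters show up everywhere. For example, here is a snippet: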

<p>Total&amp;2 £100.00.<br/>Total&amp;2 £100.00<br/>Total&amp;2 £100.00</p>

I can't remove the Â using strip or replace...

The code that fetches the HTML is:

html = urllib2.urlopen("http://www.websitehere.com", timeout=10).read().decode('UTF-8')
soup = BeautifulSoup(html)

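Can anyone help?

Edit

I've tried various decodings, and I've also tried everything in "How to make the python interpreter correctly handle non-ASCII characters in string operations?", but still nothing :/

Thanks - Hyflex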

1 Answer

I suspect this is related to the parser that BS uses to read the HTML. They document it here, but if you're like me (on OSX) you might be stuck with something that requires a bit of work:

You'll notice that in the BS4 documentation page above, they point out that by default BS4 will use the Python built-in HTML parser. Assuming you are on OSX, the Apple-bundled version of Python is 2.7.2, which is not lenient about character formatting. I hit this same problem, so I upgraded my version of Python to work around it. Doing this in a virtualenv will minimize disruption to other projects.

If doing that sounds like a pain, you can switch over to the LXML parser:

pip install lxml

And then try:

soup = BeautifulSoup(html, "lxml")

Depending on your scenario, that might be good enough. I found this annoying enough to warrant upgrading my version of Python. Using virtualenv, you can migrate your packages fairly easily.
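
If you want the same code to work whether or not lxml is installed, one option is to choose the parser name at runtime. This is only a rough sketch: `pick_parser` and `make_soup` are hypothetical helper names of mine, and it assumes the `beautifulsoup4` package is installed.

```python
def pick_parser():
    """Return "lxml" if it can be imported, otherwise the built-in "html.parser"."""
    try:
        import lxml  # only checking that the package is importable
        return "lxml"
    except ImportError:
        return "html.parser"


def make_soup(markup):
    """Parse markup with an explicitly named parser (hypothetical helper)."""
    # Imported here so that pick_parser() can be used without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, pick_parser())
```

With the `html` variable from the question, you would call `soup = make_soup(html)`. Naming the parser explicitly also means BS4 won't silently pick a different one on another machine.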

Answered 2013-09-29T03:22:04.793