
I'm currently using Beautiful Soup to parse web pages. I've heard it's very well known and good, but it doesn't seem to work properly.

Here is what I did:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page)
print soup.prettify()

I thought this would be simple: I open the web page and pass it to BeautifulSoup. But this is what I got:

Warning (from warnings module):

File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149

"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))

...

HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94

I would think the CNN site should be well designed, so I'm not quite sure what's going on. Does anyone have any ideas about this?


4 Answers


From the docs:

If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.

Your code works as-is (on Python 2.7 and Python 3.3) if, on Python 2.7, you install a more robust parser (such as lxml or html5lib):

try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen # py3k

from bs4 import BeautifulSoup # $ pip install beautifulsoup4

url = "http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1"
soup = BeautifulSoup(urlopen(url))
print(soup.prettify())
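
If you want to pin a specific parser instead of letting Beautiful Soup auto-detect one, you can name it in the constructor. A minimal sketch (assuming lxml and html5lib have both been installed via pip):

from bs4 import BeautifulSoup

html = "<p>unclosed paragraph"

# Naming the parser makes the result predictable when several are installed.
soup_lxml = BeautifulSoup(html, "lxml")        # fast and lenient
soup_html5 = BeautifulSoup(html, "html5lib")   # slowest, but parses like a browser
print(soup_lxml.prettify())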

The Python bug report "HTMLParser.py - more robust SCRIPT tag parsing" might be related.

Answered 2012-10-14T21:54:34.797

You can't use BeautifulSoup, or any HTML parser, to read arbitrary web pages. You can never guarantee that a web page is a well-formed document. Let me explain what's going on in this particular case.

On that page there is this inline JavaScript:

var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'; type='text/javascript'></"+"script>";

You can see that it's creating a string that will put a script tag onto the page. Now, if you're an HTML parser, this is a very tricky thing to deal with. You go along reading your tokens when suddenly you hit a <script> tag. Now, unfortunately, if you did this:

<script>
alert('hello');
<script>
alert('goodby');

Most parsers would say: OK, I found an open script tag. Oh, I found another open script tag! They must have forgotten to close the first one! And the parser would think both are valid scripts.

So, in this case, BeautifulSoup sees a <script> tag, and even though it's inside a JavaScript string, it looks like it could be a valid starting tag, and BeautifulSoup has a seizure, as well it should.
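
To see the difference a browser-grade parser makes, here is a minimal sketch (assuming html5lib is installed) that feeds the problematic pattern to BeautifulSoup. html5lib follows the browser rules: inside a script element everything is raw text until a literal </script>, so the embedded tag never becomes a second element:

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4 html5lib

# The pattern from the CNN page: a <script> tag hidden inside a JS string.
html = """<script>
var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'></"+"script>";
</script>"""

soup = BeautifulSoup(html, "html5lib")
print(len(soup.find_all("script")))  # 1 -- the inner tag stays plain script text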

If you look at the string again, you can see they do this interesting piece of work:

... "</" + "script>";

This seems odd, right? Wouldn't it be better to just do str = " ... </script>" without doing an extra string concatenation? This is actually a common trick (by silly people who write script tags as strings, a bad practice) to keep the parser from breaking. Because if you do this:

var a = '</script>';

in an inline script, the parser will come along and really just see </script>, think the whole script tag has ended, and throw up the rest of the contents of that script tag onto the page as plain text. This is because you can technically put a closing script tag anywhere, even if your JS syntax is invalid. From a parser's point of view, it's better to get out of the script tag early rather than try to render your HTML code as JavaScript.
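
You can watch that leak happen with a short sketch (again using html5lib; this is the HTML spec's behavior, not a BeautifulSoup quirk):

from bs4 import BeautifulSoup

# An unsplit close tag inside a JS string ends the script element early.
html = "<script>var a = '</script>'; alert('leaked');</script>"
soup = BeautifulSoup(html, "html5lib")

print(soup.script.string)    # var a = '           -- the script was cut short
print(soup.body.get_text())  # '; alert('leaked'); -- spilled out as page text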

So, you can't use a regular HTML parser to parse web pages. It's a very, very dangerous game. There is no guarantee you'll get well-formed HTML. Depending on what you're trying to do, you could read the content of the page with a regex, or try getting fully rendered page content with a headless browser.
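
If a regex really is enough for your use case, here is a minimal Python 2 sketch; the <title> pattern is only an illustration, and anything more structured than this gets fragile fast:

import re
import urllib2

html = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1").read()

# Pull one well-delimited piece of content without parsing the whole tree.
match = re.search(r"<title>(.*?)</title>", html, re.DOTALL | re.IGNORECASE)
if match:
    print match.group(1)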

Answered 2012-10-14T21:41:32.867

You need to use the html5lib parser with BeautifulSoup.

To install the required parser, use pip:

pip install html5lib

Then use that parser this way:

import mechanize
from bs4 import BeautifulSoup

# Fetch the page with mechanize, then hand the HTML to the html5lib parser.
br = mechanize.Browser()
html = br.open("http://google.com/", timeout=100).read()
soup = BeautifulSoup(html, 'html5lib')
for a in soup.find_all('a', href=True):
    print a['href']

Answered 2014-09-04T08:50:36.670

One of the simplest things you can do is specify the parser as "lxml". The parser name goes to the BeautifulSoup() constructor as its second argument (not to urlopen(), whose second argument is POST data, not a parser):

soup = BeautifulSoup(page, "lxml")

Then your code will be as follows:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page, "lxml")
print soup.prettify()

So far I haven't had any trouble with this approach :)

Answered 2015-07-24T10:55:26.557