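You can't rely on BeautifulSoup, or any HTML parser, to read web pages. You can never guarantee that a web page is a well-formed document. Let me explain what is happening in this particular case.
That page contains this inline JavaScript: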
var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'; type='text/javascript'></"+"script>";
You can see that it builds a string that will put a script tag onto the page. For an HTML parser, this is a very tricky thing to deal with. It goes along reading tokens and suddenly hits a <script> tag. Now, unfortunately, if you did this:
<script>
alert('hello');
<script>
alert('goodbye');
Most parsers would say: OK, I found an open script tag. Oh, I found another open script tag! They must have forgotten to close the first one! And the parser would treat both as valid scripts.
So, in this case, BeautifulSoup sees a <script> tag, and even though it's inside a JavaScript string, it looks like it could be a valid start tag, and BeautifulSoup chokes on it, as well it should.
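To make the ambiguity concrete, here is a minimal sketch that counts opening and closing script tags the way a deliberately naive scanner would. The surrounding inline <script> wrapper is an assumption (the original page is not shown here), and this is not BeautifulSoup's actual logic; it only illustrates why the string looks like markup to anything that matches tags textually.

```python
import re

# Hypothetical page fragment: an inline script whose body is the line quoted above.
html = (
    "<script type='text/javascript'>\n"
    "var str=\"<script src='http://widgets.outbrain.com/outbrainWidget.js'; "
    "type='text/javascript'></\"+\"script>\";\n"
    "</script>"
)

# A naive scanner treats every "<script" as an opening tag, even inside a JS string.
opens = re.findall(r"<script\b", html, flags=re.IGNORECASE)
# The split "</" + "script>" is not seen as a closing tag, so only the real one counts.
closes = re.findall(r"</script\s*>", html, flags=re.IGNORECASE)

print(len(opens), len(closes))
```

The counts do not balance, which is the kind of input that leaves a lenient parser guessing where the first script was supposed to end.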
If you look at the string again, you can see they do this interesting piece of work:
... "</" + "script>";
This seems odd, right? Wouldn't it be better to just write str = " ... </script>" without an extra string concatenation? This is actually a common trick (used by people who write script tags as strings, which is a bad practice) to keep the parser from breaking. Because if you do this:
var a = '</script>';
in an inline script, the parser will come along, see </script>, and think the whole script tag has ended. It will then dump the rest of that script's contents onto the page as plain text. This is because a closing script tag is recognized anywhere, even in the middle of a JavaScript string, where ending the script leaves invalid JS behind. From the parser's point of view, it's better to get out of the script tag early than to risk treating your HTML as JavaScript.
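The effect can be reproduced with Python's standard-library html.parser, used here as a stand-in for "a parser" (an assumption; other parsers may differ in details). ScriptSplitter and split_script are hypothetical helper names introduced only for this sketch. It separates text found inside a script element from text that ends up outside it.

```python
from html.parser import HTMLParser


class ScriptSplitter(HTMLParser):
    """Hypothetical helper: separates text inside <script> from text outside it."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_text = ""
        self.page_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so accumulate rather than overwrite.
        if self.in_script:
            self.script_text += data
        else:
            self.page_text += data


def split_script(html):
    parser = ScriptSplitter()
    parser.feed(html)
    parser.close()
    return parser.script_text, parser.page_text


# Literal closing tag inside the JS string: the script is cut short.
literal = split_script("<script>var a = '</script>'; alert(1);</script>")
# Concatenation trick: the parser never sees "</script>" until the real end.
split = split_script("<script>var a = '</' + 'script>'; alert(1);</script>")

print(literal)
print(split)
```

In the first case the script body stops at the opening quote and the rest of the statement leaks out as page text. In the second case the whole statement stays inside the script.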
So, you can't use a regular HTML parser to parse web pages. It's a very, very dangerous game. There is no guarantee you'll get well-formed HTML. Depending on what you're trying to do, you could read the content of the page with a regex, or try getting the fully rendered page content with a headless browser.
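As a rough sketch of the regex route, assuming all you need is one narrowly defined piece of the page (here, the title), something like the following can work even when the rest of the markup is broken. The page fragment and the extract_title helper are hypothetical, and a pattern this simple only suits targeted extraction, not general parsing. The headless browser route is not shown because it depends on third-party tooling.

```python
import re

# Hypothetical, deliberately malformed fragment: unclosed <p>, script tag built in a string.
page = (
    "<html><head><title>Example page</title></head>"
    "<body><p>unclosed paragraph"
    "<script>var s = \"<script src='x.js'></\" + \"script>\";</script>"
    "</body>"
)

# Match the first <title>...</title> pair, ignoring case and allowing newlines inside.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


print(extract_title(page))
```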