python - BeautifulSoup doesn't read documents correctly

Question

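I'm trying to scrape NBA player statistics in order to do some machine learning on them, and I found these "printable player files" that contain a bunch of nice, neat stats. Unfortunately, when I try to parse the HTML with BeautifulSoup, it doesn't work at all. For example: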

from bs4 import BeautifulSoup
import codecs
import urllib2

url = 'http://www.nba.com/playerfile/ray_allen/printable_player_files.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

with open('ray_allen.txt', 'w') as f:
    f.write(soup.prettify())
    f.close()

gives me a file that looks like this:

<html>
 <head>
  <!--no description was found-->
  <!--no title was found-->
  <!--no keywords found-->
  <!--not article-->
  <script>
   var site = "nba";
var page = "player";
  </script>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <script language="Javascript">
   &lt;!--
var flashinstalled = 0;
var flashversion = 0;
MSDetect = "false";
if (navigator.plugins &amp;&amp; navigator.plugins.length) {
    x = navigator.plugins["Shockwave Flash"];
    if (x) {
        flashinstalle   d       =       2   ;   

           i   f       (   x   .   d   e   s   c   r   i   p   t   i   o   n   )       {   

               y       =       x   .   d   e   s   c   r   i   p   t   i   o   n   ;   

               f   l   a   s   h   v   e   r   s   i   o   n       =       y   .   c   h   a   r   A   t   (   y   .   i   n   d   e   x   O   f   (   '   .   '   )   -   1   )   ;   

           }   

       }       e   l   s   e   

           f   l   a   s   h   i   n   s   t   a   l   l   e   d       =       1   ;   

       i   f       (   n   a   v   i   g   a   t   o   r   .   p   l   u   g   i   n   s   [   "   S   h   o   c   k   w   a   v   e       F   l   a   s   h       2   .   0   "   ]   )       {   

           f   l   a   s   h   i   n   s   t   a   l   l   e   d       =       2   ;   

           f   l   a   s   h   v   e   r   s   i   o   n       =       2   ;   

       }   
[...]

It then continues like this for another 3000+ lines before finishing with (the [...] is my addition):

[...]
   &lt;   /   b   o   d   y   &gt;   

   &lt;   /   h   t   m   l   &gt;
  </script>
 </head>
</html>

I also tried "http://www.basketball-reference.com/players/a/allenra02.html", but that one gave me this error:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    f.write(soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 6167: ordinal not in range(128)

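Maybe I should be using something else to parse the HTML? Or is one of these problems easy to fix? What I've read here seems to suggest that BeautifulSoup should make things easier for me, not harder!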

EDIT: The line:

print soup.prettify()

works in the terminal for the second page, so something goes wrong when it tries to write to the file; this isn't a BeautifulSoup problem.
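
A minimal sketch of why the file write fails where print succeeds, assuming Python 2 as in the code above: prettify() returns a Unicode string, and a file opened with plain open() encodes Unicode text with the ASCII codec on write, which cannot represent u'\xb7'. Opening the file with an explicit encoding avoids that implicit step. The sample string and the output path below are stand-ins chosen for illustration, not taken from the real page.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Stand-in for soup.prettify(): a Unicode string containing U+00B7,
# the character named in the traceback. Real text would come from BeautifulSoup.
text = u'<p>Points \xb7 Rebounds</p>\n'

# Hypothetical output path, used only for this sketch.
path = os.path.join(tempfile.gettempdir(), 'allen_ray.txt')

# io.open encodes to UTF-8 on write, so no implicit ASCII conversion happens.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the file back to confirm the character survived the round trip.
with io.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()
```

In the original script the same idea would mean replacing open('ray_allen.txt', 'w') with io.open('ray_allen.txt', 'w', encoding='utf-8'); codecs.open, which the script already imports codecs for, is another option.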

score 4 · Accepted Answer

This shows the same symptoms as bug 972466, which was fixed in 4.0.3. I recommend upgrading to the latest version of Beautiful Soup 4.
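
To check whether an installed copy already includes that fix, one could compare bs4.__version__ against 4.0.3. The sketch below is illustrative only: version_tuple and has_fix are hypothetical helper names, not part of Beautiful Soup, and the parsing simply ignores any pre-release suffix such as "b10".

```python
import re


def version_tuple(version):
    """Parse the leading dotted-integer part of a version string, e.g. '4.0.3'."""
    match = re.match(r'\d+(?:\.\d+)*', version)
    if not match:
        return ()
    return tuple(int(part) for part in match.group(0).split('.'))


def has_fix(version, fixed_in=(4, 0, 3)):
    """Return True if the version is at least the release that fixed the bug."""
    return version_tuple(version) >= fixed_in


try:
    import bs4
    print('bs4 %s, includes fix: %s' % (bs4.__version__, has_fix(bs4.__version__)))
except ImportError:
    print('bs4 is not installed')
```

If the check reports an older release, upgrading through whatever package manager installed it (for example pip) should pick up the fix.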

score 3 · Accepted Answer

This looks like a bug in BeautifulSoup 4.

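I tried your code with BeautifulSoup 3 (as packaged in Ubuntu) by changing from bs4 import BeautifulSoup to from BeautifulSoup import BeautifulSoup, and it worked as expected. When I used v4 (running your code unchanged), I reproduced your problem. The bug appears to be in the parser rather than in prettify, because printing the soup object shows the same problem.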

Please file it as a bug at https://bugs.launchpad.net/beautifulsoup/. In the meantime, use version 3.
