python - A strange problem when parsing HTML with BeautifulSoup

Question

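I am trying to write some Python code to collect music chart data from official websites, but I ran into trouble when collecting the Billboard data. I chose BeautifulSoup to handle the HTML.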

My environment: python-2.7, beautifulsoup-3.2.0

First I parse the HTML:

>>> import BeautifulSoup, urllib2, re
>>> html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
>>> soup = BeautifulSoup.BeautifulSoup(html)

Then I try to extract the data I want, for example the artist name.

HTML:

<div class="listing chart_listing">

<article id="node-1491420" class="song_review no_category chart_albumTrack_detail no_divider">
  <header>
    <span class="chart_position position-down">11</span>
            <h1>Ho Hey</h1>
        <p class="chart_info">
      <a href="/artist/418560/lumineers">The Lumineers</a>            <br>
      The Lumineers          </p>

The artist name is The Lumineers.

>>> print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')})\
... .find("p", {"class":"chart_info"}).a.string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'find'

NoneType! It seems it cannot find the data I want. Maybe my rule is wrong, so I tried to find some basic tags.

>>> print str(soup.find("div"))
None
>>> print str(soup.find("a"))
None
>>> print str(soup.find("title"))
<title>The Hot 100 : Page 2  | Billboard</title>
>>> print str(soup)
......entire HTML.....

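I am confused: why can't it find basic tags such as div and a? They are definitely there. What is wrong with my code? When I use the same approach to parse other charts, there is no problem.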

score 1 · Accepted Answer

This appears to be a BeautifulSoup 3 problem. If you prettify() the output:

from BeautifulSoup import BeautifulSoup as soup3
import urllib2, re

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
soup = soup3(html)
print soup.prettify()

you can see at the end of the output:

        <script type="text/javascript" src="//assets.pinterest.com/js/pinit.js"></script>
</body>
</html>
  </script>
 </head>
</html>

There are two closing html tags, so it looks like BeautifulSoup 3 gets confused by the JavaScript content in this page.
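
One way to check this hypothesis before parsing is to count the closing html tags in the raw page source. The sketch below is only an illustration: count_closing_html is a hypothetical helper name, and the sample string is made up to mimic a page whose script builds markup inside a string. It is not the actual Billboard source.

```python
import re

def count_closing_html(html):
    """Return how many times a closing </html> tag appears in the raw source."""
    # Case-insensitive, and tolerant of whitespace before the final '>'.
    return len(re.findall(r'</html\s*>', html, re.IGNORECASE))

# Made-up sample (an assumption, not the real page): a script in <head>
# that embeds markup inside a JavaScript string.
sample = (
    '<html><head><script>'
    'var popup = "<html><body>hi</body></html>";'
    '</script></head>'
    '<body><div>content</div></body></html>'
)

print(count_closing_html(sample))
```

A count above one in the raw source would suggest that markup embedded in a script is a likely culprit, although it does not prove it.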

If you use:

from bs4 import BeautifulSoup as soup4
import urllib2, re

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
soup = soup4(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)

you get 'The Lumineers' as output.

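If you cannot switch to bs4, I suggest writing the html variable to a file out.txt, changing the script to read in.txt, copying the output to the input, and then cutting away chunks until parsing works: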

from BeautifulSoup import BeautifulSoup as soup3
import re

html = open('in.txt').read()
soup = soup3(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
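
The step that dumps the page to disk is not shown above. A minimal sketch, assuming the file names out.txt and in.txt from the text; save_page is a hypothetical helper, and a short made-up string stands in for the downloaded page:

```python
import shutil

def save_page(html, path='out.txt'):
    """Write the page source to a file so it can be trimmed by hand."""
    with open(path, 'w') as f:
        f.write(html)

# Stand-in (an assumption) for the string returned by urlopen(...).read().
html = '<html><head><title>t</title></head><body><div>x</div></body></html>'

save_page(html, 'out.txt')
# Work on a copy, so the untouched download stays available for comparison.
shutil.copyfile('out.txt', 'in.txt')
```

Keeping out.txt unmodified makes it easy to start over if too much gets cut from in.txt.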

My first guess was to remove <head> ... </head>, and that worked fine.

After that, you can work around the problem programmatically:

from BeautifulSoup import BeautifulSoup as soup3
import urllib2, re

htmlorg = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read()
head_start = htmlorg.index('<head')
head_end = htmlorg.rindex('</head>')
head_end = htmlorg.index('>', head_end)
html = htmlorg[:head_start] + htmlorg[head_end+1:]
soup = soup3(html)
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
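
Note that index and rindex raise ValueError when the page has no head element. A more defensive variant might look like the sketch below; strip_head is a hypothetical helper, not part of BeautifulSoup, and like the code above it cuts from the first '<head' to the last '</head>'.

```python
def strip_head(html):
    """Return html without the span from the first '<head' to the last '</head>'.

    If either marker is missing, the input is returned unchanged.
    """
    # Caveat: '<head' also matches '<header'. As in the code above, this
    # assumes the real <head> element appears before any <header> tag.
    start = html.find('<head')
    end = html.rfind('</head>')
    if start == -1 or end == -1 or end < start:
        return html
    return html[:start] + html[end + len('</head>'):]

# Made-up sample (an assumption): a script string that contains '</head>'.
sample = ('<html><head><script>var s = "</head>";</script></head>'
          '<body><p>ok</p></body></html>')
print(strip_head(sample))
```

The result can then be passed to the BeautifulSoup 3 constructor in place of the html variable.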
