python - Web scraping in Python

Question

I want to use Python to scrape all ~62,000 names from this petition. I am trying to use the beautifulsoup4 library.

However, it just isn't working.

Here is my code so far:

import urllib2, re
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())

divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]

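What am I doing wrong? Also, I would like to somehow access the next page so I can add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.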

score 1 · Accepted Answer

You could try something like this:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')

# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")

    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
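    except (AttributeError, IndexError, UnicodeError):
        # A missing, empty, or unencodable name ends this page early
        pass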

    # Find the next page to scrape
    prev = soup.find("prev_signature")

    # Check if another page of results exists - if not, break from the loop
    if prev is None:
        break

    # Get the previous URL
    url = prev.contents[0]

    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")   
print("= Printing Results =")
print("====================\n")
print(results)

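Please note that there is a lot of data to go through, and I have no idea whether this is against the website's terms of service, so you will need to check that.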
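The loop above is written for Python 2 (urllib2). As a minimal sketch of the same parsing step in Python 3, using only the standard library, something like the following might work. The element names (signature, firstname, lastname, prev_signature) come from the code above; the sample document, its root element, and the example.com URL are made up for illustration, and the real feed may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: only the element names are taken from the answer,
# the overall layout and the URL are assumptions.
SAMPLE_XML = """<signatures>
  <prev_signature>http://example.com/signatures/00/00/00/04.xml</prev_signature>
  <signature><firstname>Ada</firstname><lastname>Lovelace</lastname></signature>
  <signature><firstname>Alan</firstname><lastname>Turing</lastname></signature>
  <signature><firstname>Grace</firstname></signature>
</signatures>"""


def parse_signature_page(xml_text):
    """Return (names, prev_url) for one page of signature XML."""
    root = ET.fromstring(xml_text)
    names = []
    # iter() finds <signature> elements at any depth below the root
    for sig in root.iter("signature"):
        first = sig.findtext("firstname")
        last = sig.findtext("lastname")
        # Skip only the incomplete entry, not the rest of the page
        if first is None or last is None:
            continue
        names.append(first.strip() + " " + last.strip())
    # The link to the previous page, or None when there is no such element
    prev = root.findtext(".//prev_signature")
    prev_url = prev.strip() if prev else None
    return names, prev_url


names, prev_url = parse_signature_page(SAMPLE_XML)
print(names)
print(prev_url)
```

To walk all pages you would call this in a loop, fetching prev_url each time until it comes back as None.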

score 0 · Accepted Answer

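Simply scraping a site is, in most cases, extremely inconsiderate. You put a fairly large load on the site in a short amount of time, slowing down legitimate users' requests. Not to mention stealing all of their data.

Consider an alternative approach, such as (politely) asking for a dump of the data (as mentioned above).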

Or, if you absolutely do need to scrape:

Space your requests using a timer
Scrape smartly
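As a rough sketch of the first point, spacing requests could look something like this in Python 3. The function name fetch_politely, its parameters, and the two-second default are my own choices for illustration, not anything the site asks for. The fetch callable is passed in so the pacing can be tried without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns its body.
    `sleep` is injectable so the delay can be replaced in tests.
    """
    bodies = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first
            sleep(delay_seconds)
        bodies.append(fetch(url))
    return bodies


# Demo with a stand-in fetch function; the identifiers are placeholders, not real URLs
pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: "body of " + url,
    delay_seconds=0.1,
)
print(pages)
```

With real requests you could pass something like lambda url: urllib.request.urlopen(url).read() as fetch, and pick a delay the site can comfortably handle.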

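I took a quick look at that page, and it appears to me that they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server, because you request only the data you need. It will also be easier to actually process the data, because it will be in a nice format.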

Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
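If you want to check this from code, Python 3's standard urllib.robotparser can do it. The sketch below parses a two-line stand-in for the file rather than downloading it. Only the /xml/ rule comes from what I saw; the rest of their robots.txt is not reproduced here, so treat the rules string as an assumption.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: only the /xml/ disallow is taken from the answer;
# the site's real robots.txt may contain more entries.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /xml/
"""

parser = RobotFileParser()
parser.parse(ASSUMED_ROBOTS_TXT.splitlines())

xml_url = "http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml"
page_url = "http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html"

# can_fetch() reports whether the given user agent may request the URL
xml_allowed = parser.can_fetch("*", xml_url)
page_allowed = parser.can_fetch("*", page_url)
print(xml_allowed, page_allowed)
```

Against the live site you would call set_url() with the robots.txt address and then read() instead of parse(), so the check reflects whatever the file says at that moment.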

score 0 · Accepted Answer

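What do you mean by "not working"? An empty list, or an error?

If you are getting an empty list, it is because the class "name_location" does not exist in the document. Also check out the bs4 documentation on findAll.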
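One way to confirm this is to list the class names that actually appear in the HTML you downloaded before you search it. Here is a small sketch using only Python 3's standard html.parser. ClassCollector and classes_in are names I made up, and the sample markup is a stand-in for the real page source.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in the markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated names
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html_text):
    """Return the set of class names found in html_text."""
    collector = ClassCollector()
    collector.feed(html_text)
    collector.close()
    return collector.classes


# Hypothetical markup standing in for the downloaded page source
SAMPLE_HTML = '<div class="petition header"><p class="summary">Ban pesticides</p></div>'

found = classes_in(SAMPLE_HTML)
print("name_location" in found)
print(sorted(found))
```

If "name_location" is missing from the set built from the real response, the signatures are probably added to the page after it loads, and no parser will find them in that response.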
