python - Scraping multiple pages with BeautifulSoup and Python

Question

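My code successfully scrapes the tr align=center tags from [ http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY ] and writes the td elements to a text file.

However, I would like to be able to scrape multiple pages on the site above.

For example, with the url above, when I click the link to "page 2", the overall url does not change. I looked at the page source and saw JavaScript code that advances to the next page.

How can I change my code to scrape data from all the available listed pages?

My current code, which works for page 1 only: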

import bs4
import requests

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')

# Name the parser explicitly so the result does not depend on what is installed.
soup = bs4.BeautifulSoup(response.text, 'html.parser')

acct = open("/Users/it/Desktop/accounting.txt", "w")

for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.find_all('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())

    acct.write(", ".join(stack) + '\n')

acct.close()

score 53 · Accepted Answer

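The trick here is to check the requests that go in and out of the page-change action when you click a link to view another page. The way to check this is with Chrome's inspection tool (press F12) or by installing the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer, with the Network tab open so that requests are listed as they happen.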


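Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a very brief moment, only one request will appear, and it is a POST. All the other elements will quickly follow and fill the page. That single POST is what we are looking for.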


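Click on that POST request. It should bring up a sub-window with tabs. Click on the Headers tab. This tab lists the request headers, which are pretty much the identification details that the other side (the site, for example) needs from you in order to connect (someone else can explain this much better than I can).

Whenever a URL has variables like page numbers, location markers, or categories, more often than not the site is using query strings. Long story short, a query string is similar to an SQL query (sometimes it actually is an SQL query) that allows the site to pull the information you need. If that is the case, you can check the request headers for the query string parameters. Scroll down a bit and you should find them.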


As you can see, the query string parameters match the variables in our URL. A little further down, you can see Form Data with pageNum: 2 beneath it. This is the key.
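The same match can be checked without the browser by pulling the URL apart in Python. This is a minimal sketch using only the standard library; the variable names are mine, not something the site defines.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY"

# parse_qs maps each parameter name to a list of values,
# because a name may legally appear more than once in a query string.
params = parse_qs(urlsplit(url).query)
print(params)
# prints {'campId': ['1'], 'termId': ['201501'], 'subjId': ['ACCY']}
```

Note that pageNum is absent here: it travels in the form data of the POST, not in the URL.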

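POST requests are more commonly known as form requests because these are the kind of requests made when you submit forms, log in to websites, and so on. Basically, pretty much anything where you have to submit information. What most people don't see is that POST requests have a URL that they follow. A good example is when you log in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or something similar.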

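What the above paragraph basically means is that you can (but not always) append the form data to your URL and get the same result as the POST request. Strictly speaking, this sends a GET request with the field in the query string, so it only works when the server reads the parameter from either place, as this one does. To know the exact string you have to append, click on view source next to Form Data.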
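To make the two options concrete, here is a sketch that builds both requests with the standard library and sends nothing over the network. The names base_url, form_data, post_request and get_url are mine. With the requests library, the rough equivalents would be requests.post(base_url, data=form_data) and requests.get(get_url). Whether a given server accepts both is something to test per site.

```python
from urllib.parse import urlencode
from urllib.request import Request

base_url = "http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY"
form_data = {"pageNum": 2}  # the Form Data field seen in the inspector

# Option 1: a real POST. Supplying `data` makes urllib choose the POST method,
# and the form field travels in the request body.
post_request = Request(base_url, data=urlencode(form_data).encode("ascii"))

# Option 2: the shortcut described above, a GET with the field in the URL.
get_url = "{}&{}".format(base_url, urlencode(form_data))

print(post_request.get_method(), post_request.data)
print(get_url)
```

If the shortcut ever stops working for a site, falling back to a real POST with the same form data is the first thing I would try.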


Test whether it works by adding it to the URL (here, &pageNum=2).


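Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things left to do are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.

Modified code is below: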

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)

# Name the parser explicitly so the result does not depend on what is installed.
soup = bsoup(r.text, 'html.parser')
# Use regex to isolate only the links of the page numbers, the ones you click on.
page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
try:  # Make sure there is more than one page; otherwise, set to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because range() excludes its upper bound.
url_list = ["{}&pageNum={}".format(base_url, page) for page in range(1, num_pages + 1)]

# Open the text file. Use with to save self from grief.
with open("results.txt", "w") as acct:
    for url_ in url_list:
        print("Processing {}...".format(url_))
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text, 'html.parser')
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.find_all('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')

We use a regular expression to get the proper links. Then, using a list comprehension, we build a list of URL strings. Finally, we iterate over them.

Results:
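The page-count and URL-building steps can also be tried offline. The sketch below runs on a hypothetical snippet of pager markup; the exact form of the goToPage links is my assumption and the real page may differ. It uses plain re instead of BeautifulSoup so that it runs without extra packages, and the helper names count_pages and build_page_urls are mine.

```python
import re

# Hypothetical pager markup, assumed for illustration only.
sample_html = """
<a href="javascript:goToPage(1)">1</a>
<a href="javascript:goToPage(2)">2</a>
<a href="javascript:goToPage(3)">3</a>
"""

def count_pages(html):
    """Return the highest page number linked via goToPage, or 1 if none."""
    numbers = [int(n) for n in re.findall(r"javascript:goToPage\((\d+)\)", html)]
    return max(numbers, default=1)

def build_page_urls(base_url, num_pages):
    """Return one URL per page, numbered from 1 to num_pages inclusive."""
    return ["{}&pageNum={}".format(base_url, page) for page in range(1, num_pages + 1)]

base_url = "http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY"
urls = build_page_urls(base_url, count_pages(sample_html))
for url in urls:
    print(url)
```

Taking the maximum of the numbers, rather than the text of the last link, is a small design choice: it keeps working if the last pager link turns out to be something like "Next" instead of a number.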

Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]


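Hope that helps.

EDIT:

Out of sheer boredom, I think I just created a scraper for the entire class directory. Also, I updated both the code above and the code below so that they do not error out when only a single page is available.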

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text, 'html.parser')
# Collect the link to every subject listed for the term.
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses\.cfm\?campId=1&termId=201501&subjId=.*"))]
print(classes_url_list)

# Open the text file once. Use with to save self from grief.
with open("results.txt", "w") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)

        soup = bsoup(r.text, 'html.parser')
        # Use regex to isolate only the links of the page numbers, the ones you click on.
        page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
        try:  # No page links means a single page.
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1

        # Add 1 because range() excludes its upper bound.
        url_list = ["{}&pageNum={}".format(base_url, page) for page in range(1, num_pages + 1)]

        # Scrape every page of this subject into the already-open file.
        for url_ in url_list:
            print("Processing {}...".format(url_))
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text, 'html.parser')
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.find_all('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')
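One fragile spot in the code above is gluing each href onto a hard-coded prefix, which assumes every link is relative to /mod/pws/. A possible alternative, sketched here with the standard library, is to let urljoin resolve the href against the page it was found on. The href value is my assumption about what the subjects page contains.

```python
from urllib.parse import urljoin

subjects_url = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
href = "courses.cfm?campId=1&termId=201501&subjId=ACCY"  # assumed form of a scraped href

# urljoin resolves the href the way a browser would: relative hrefs replace
# the last path segment, absolute ones replace the whole path or URL.
course_url = urljoin(subjects_url, href)
print(course_url)
# prints http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY
```

In the scraper, this would stand in for the "http://my.gwu.edu/mod/pws/{}".format(class_url) line.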
