php - How to scrape a website with a static URL in Python

Question

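I want to scrape a PHP-based website. It has a search box where you enter a number; when you click the submit button or press Enter, it renders the results for that number, but the URL does not change. It shows foo.com/res_17.php for every result. To crawl something like a thousand records, though, each record should be reachable by a unique ID, for example foo.com/res_17.php?id=1001, foo.com/res_17.php?id=1002, through foo.com/res_17.php?id=3450, so that I can visit them in a while loop. How can I do this? Any solution would help.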

score 0 · Accepted Answer

Here is my script:

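from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://en.wikipedia.org/wiki/Andrew_Ng")
# Name the parser explicitly instead of letting bs4 pick one
bsObj = BeautifulSoup(html, "html.parser")

# Match article links only: the href starts with /wiki/ and contains no colon
# (colons mark namespace pages such as File: or Category:)
article_link = re.compile("^(/wiki/)((?!:).)*$")
for link in bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=article_link):
    if 'href' in link.attrs:
        print(link.attrs['href'])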

The output lists all the Wikipedia article links found on the Andrew Ng page.
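For the search form described in the question, the URL probably stays the same because the number is sent in the body of a POST request rather than in the query string. The following is a minimal sketch under that assumption, not a confirmed description of the site: the field name `id` is a guess (the real one is whatever the form's `<input name="...">` says), and `build_search_request`, `fetch_result`, and `crawl` are hypothetical helper names. If the form actually uses GET, the same field would go in the query string instead.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SEARCH_URL = "http://foo.com/res_17.php"  # placeholder URL from the question
FIELD_NAME = "id"  # assumption: check the form's <input name="..."> for the real name


def build_search_request(record_id, url=SEARCH_URL, field=FIELD_NAME):
    """Build the POST request a browser would send when the form is submitted."""
    body = urlencode({field: record_id}).encode("ascii")
    return Request(url, data=body, method="POST")


def fetch_result(record_id):
    """Submit one ID and return the raw HTML of the result page."""
    with urlopen(build_search_request(record_id)) as response:
        return response.read()


def crawl(first_id, last_id, fetch=fetch_result):
    """Yield (record_id, html) for every ID in the inclusive range."""
    record_id = first_id
    while record_id <= last_id:
        yield record_id, fetch(record_id)
        record_id += 1
```

Each page returned by `crawl(1001, 3450)` can then be passed to `BeautifulSoup(html, "html.parser")` as in the script above. The `fetch` parameter exists so the loop can be tried with a stand-in function before it is pointed at the real site.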
