python - Can this scraper be made to work on additional pages when a web page has them?

Question

from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen
import urllib2

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        try:
            urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print e
        site = urlopen(url)   
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text

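My code only opens one page for each URL in the file. Sometimes there are more pages, and in that case the pattern for the following pages is &page=x

These are the pages I'm talking about: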

http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track
http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track&page=7

score 1 · Accepted Answer

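You can read the href attribute from the next_page link and append it to your urls list (and yes, you should change the generator expression to a list so that it can be appended to while the loop is running). It could look something like this: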

from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen
import urllib2
import urlparse

with open('urls.txt') as inf:
    urls = [line.strip() for line in inf]
    for url in urls:
        try:
            urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print e
        site = urlopen(url)   
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text

        next_page = soup.find_all('a', {'class': 'nextlink'})
        if next_page:
            next_page = next_page[0]
            urls.append(urlparse.urljoin(url, next_page['href']))
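
The same idea can be sketched independently of any particular fetching or parsing library. What follows is only a minimal sketch, assuming Python 3 rather than the Python 2 used above; crawl, fetch, and find_next are hypothetical names introduced for illustration and are not part of the original answer. It also keeps a set of visited URLs, so a page whose "next" link points back to an earlier page cannot cause an endless loop.

```python
from urllib.parse import urljoin


def crawl(start_urls, fetch, find_next):
    """Visit each start URL and keep following "next page" links.

    fetch(url)      -> page content (hypothetical callable supplied by the caller)
    find_next(page) -> href of the next-page link, or None if there is none
    Returns the list of URLs in the order they were visited.
    """
    queue = list(start_urls)  # a list, so newly found pages can be appended
    seen = set()
    visited = []
    for url in queue:  # items appended during iteration are visited too
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        visited.append(url)
        href = find_next(page)
        if href:
            # Resolve relative links such as "?tag=x&page=2" against the current URL.
            queue.append(urljoin(url, href))
    return visited


# Stand-in for real HTTP requests: each fake "page" is just its next-link href.
fake_site = {
    "http://example.com/tags?tag=x": "?tag=x&page=2",
    "http://example.com/tags?tag=x&page=2": "?tag=x&page=3",
    "http://example.com/tags?tag=x&page=3": None,
}

pages = crawl(["http://example.com/tags?tag=x"], fake_site.get, lambda page: page)
```

In a real scraper, fetch would download the page and find_next would look up the nextlink anchor in the parsed HTML, as in the answer's code.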

score 0 · Accepted Answer

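You could build something that collects all the links on a page and follows them, which is something scrapy does for free.

You can create a spider that follows every link on a page. Assuming the page has pagination links to the other pages, your scraper will follow them automatically.

You could do the same thing by parsing all the links on the page with beautifulsoup, but why do that when scrapy already does it for free?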
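
For illustration, the "collect every link on the page" step can be sketched with the standard library alone. This is a minimal sketch, assuming Python 3 and using html.parser as a stand-in for beautifulsoup or scrapy; the LinkCollector class, the example.com URL, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative hrefs into absolute URLs so they can be fetched later.
            self.links.append(urljoin(self.base_url, href))


# Made-up markup that loosely mimics a results table plus a pagination link.
sample_html = """
<table>
  <tr><td class="subjectCell"><a href="/music/Some+Artist">Some Artist</a></td></tr>
</table>
<a class="nextlink" href="?tag=long+track&amp;page=2">Next</a>
"""

collector = LinkCollector("http://example.com/library/tags?tag=long+track")
collector.feed(sample_html)
links = collector.links
```

A crawler built this way would still need to filter the collected links (for example, to the same site and to pagination URLs) before following them, which is part of what a framework such as scrapy handles for you.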

score -1 · Accepted Answer

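I'm not sure I understand your question, but you might consider writing a regular expression (http://www.tutorialspoint.com/python/python_reg_expressions.htm) that matches your "next" pattern and searching for it in the URLs found on the page. I often use this approach when the links within a site are highly consistent.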
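
A minimal sketch of that approach, assuming Python 3 and assuming that pagination links carry a page=<number> query parameter as in the question's example URLs; PAGE_RE and pagination_links are hypothetical names, and the third href is made up to show a non-matching link.

```python
import re

# Assumed pattern: a "page=<number>" query parameter preceded by "?" or "&".
PAGE_RE = re.compile(r"[?&]page=(\d+)")


def pagination_links(hrefs):
    """Return {page_number: href} for every href that matches the page pattern."""
    found = {}
    for href in hrefs:
        match = PAGE_RE.search(href)
        if match:
            found[int(match.group(1))] = href
    return found


hrefs = [
    "/user/TheBladeRunner_/library/tags?tag=long+track&page=2",
    "/user/TheBladeRunner_/library/tags?tag=long+track&page=7",
    "/music/Some+Artist",
]
pages = pagination_links(hrefs)
last_page = max(pages)  # highest page number seen among the links
```

If the page lists a link to the last page, the highest number found this way can be used to generate every &page=x URL up front instead of following "next" links one at a time.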
