
With the help of a tutorial I made a web crawler that fetches all links from a given URL, and you can pass it a number corresponding to the step/depth of the links. Right now, when you call scraperOut = scraper(url, 3) with the number currently set to 3, the crawler goes 3 steps deep and appends all the links to one and the same list. My question is how, and what in the code, to modify so that I can choose to print each list separately instead of having everything appended into one list — or, for example, print only the second-step list? The whole code looks like this:

import urllib
import re
import time
from threading import Thread
import MySQLdb
import mechanize
import readability
from bs4 import BeautifulSoup
from readability.readability import Document
import urlparse

url = "http://www.adbnews.com/area51/"

def scraper(root,steps):
    urls = [root]
    visited = [root]
    counter = 0
    while counter < steps:
        step_url = scrapeStep(urls)
        urls = []
        for u in step_url:
            if u not in visited:
                urls.append(u)
                visited.append(u)
        counter +=1

    return visited

def scrapeStep(root):
    result_urls = []
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Firefox')]

    for url in root:
        try:
            br.open(url)
            for link in br.links():
                newurl = urlparse.urljoin(link.base_url, link.url)
                result_urls.append(newurl)
        except Exception:
            print "error opening", url
    return result_urls

d = {}
threadlist = []

def getReadableArticle(url):
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Firefox')]

    html = br.open(url).read()

    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()

    soup = BeautifulSoup(readable_article)

    final_article = soup.text

    links = soup.findAll('img', src=True)

    return readable_title
    return final_article

def dungalo(urls):
    article_text = getReadableArticle(urls)[0]
    d[urls] = article_text

def getMultiHtml(urlsList):
    for urlsl in urlsList:
        try:
            # the Thread arg must match the loop variable (urlsl, not urls1)
            t = Thread(target=dungalo, args=(urlsl,))
            threadlist.append(t)
            t.start()
        except Exception:
            nnn = True

    for g in threadlist:
        g.join()

    return d


scraperOut = scraper(url,3)

for s in scraperOut:
    print s

#print scraperOut
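One way to get at the per-step lists is to keep a separate list for every depth instead of a single flat visited list. The sketch below is an assumption about how scraper() could be restructured, not a drop-in replacement: link fetching is factored out into a fetch_links callable (standing in for the mechanize-based scrapeStep) so the example can run without network access:

```python
# Sketch of a per-step variant of scraper(): instead of one flat
# `visited` list, it keeps a separate list of the URLs found at each
# depth, so e.g. the second-step links can be printed on their own.
# `fetch_links` is a stand-in (an assumption for this sketch) for the
# mechanize-based scrapeStep(), so the code is self-contained.

def scraper_by_step(root, steps, fetch_links):
    levels = [[root]]        # levels[0] is just the root URL
    visited = set([root])
    for _ in range(steps):
        next_level = []
        for u in levels[-1]:
            for new in fetch_links(u):
                if new not in visited:
                    visited.add(new)
                    next_level.append(new)
        levels.append(next_level)
    return levels            # levels[i] holds the URLs first seen at depth i

# Example with a tiny fake link graph instead of real HTTP requests:
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": [],
}
levels = scraper_by_step("a", 2, lambda u: graph.get(u, []))
# levels[1] is the first-step list, levels[2] the second-step list
```

With this shape, levels[2] is exactly the second-step list asked about, and sum(levels, []) reproduces the original flat visited list if the old behaviour is still wanted.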

1 Answer


I think that if you change this part of your code:

    return readable_title
    return final_article

to read:

    print readable_title
    return final_article

you will get more of what you wanted, and your code will have a better chance of working! Note that with the original code you will never return final_article, because readable_title is returned first.
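To see why, here is a minimal standalone illustration (with a hypothetical function name) of Python's behaviour when two return statements follow each other:

```python
# A function exits at the first `return` it reaches, so the second
# `return` below is dead code -- just like the pair in getReadableArticle().
def two_returns():
    return "title"    # the function exits here
    return "article"  # unreachable

result = two_returns()  # result is "title"; "article" is never returned
```

If both values are actually needed — and dungalo() indexing the result with [0] suggests a sequence was intended — returning a tuple, e.g. `return readable_title, final_article`, and unpacking it at the call site would be another option.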

Answered 2013-08-02T11:10:52.973