
I'm using a recursive function to scrape all the URLs on my domain, but it outputs nothing and raises no errors.

#!/usr/bin/python

from bs4 import BeautifulSoup
import requests
import tldextract


def scrape(url):

    for links in url:
        main_domain = tldextract.extract(links)
        r = requests.get(links)
        data = r.text
        soup = BeautifulSoup(data)
    
        for href in soup.find_all('a'):
            href = href.get('href')
            if not href:
                continue
            link_domain = tldextract.extract(href)
        
            if link_domain.domain == main_domain.domain :
                problem.append(href)
    
            elif not href == '#' and link_domain.tld == '':
                new = 'http://www.'+ main_domain.domain + '.' + main_domain.tld + '/' + href
                problem.append(new)

        return len(problem)
        return scrape(problem)
        

problem = ["http://xyzdomain.com"]  
print(scrape(problem))

It works when I create a new list, but I don't want to create a new list for every iteration of the loop.


2 Answers


You need to structure your code so that it follows the recursive pattern, which your current code doesn't. You also shouldn't rebind a name to a different kind of object, as in `href = href.get('href')`, since that tends to cause breakage once the original object is shadowed by the variable. Finally, your current code will only ever return `len(problem)`, because that `return` is reached unconditionally before `return scrape(problem)`. The recursive pattern looks like:

def Recursive(Factorable_problem):
    if Factorable_problem is Simplest_Case:
        return AnswerToSimplestCase
    else:
        return Rule_For_Generating_From_Simpler_Case(Recursive(Simpler_Case))

For example:

def Factorial(n):
    """ Recursively Generate Factorials """
    if n < 2:
        return 1
    else:
        return n * Factorial(n-1)
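
On the unreachable-return point: the question's `scrape` ends with two consecutive `return` statements, and only the first can ever execute. A minimal demo (the stub name and strings are made up for illustration):

```python
def scrape_stub():
    return "len(problem)"      # the function always exits here
    return "scrape(problem)"   # unreachable, just like the recursive call in the question

print(scrape_stub())  # prints: len(problem)
```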

answered 2013-07-20T08:33:10.643

Hi, I've made a non-recursive version that seems to get all of the links on the same domain.

I tested the code below using the `problem` list included in the code. Once I had fixed the problems in the recursive version, the next problem was hitting the recursion depth limit, so I rewrote it to run iteratively. The code and results are below:

from bs4 import BeautifulSoup
import requests
import tldextract


def print_domain_info(d):
    print "Main Domain:{0} \nSub Domain:{1} \nSuffix:{2}".format(d.domain,d.subdomain,d.suffix)

SEARCHED_URLS = []
problem = [ "http://Noelkd.neocities.org/", "http://youpi.neocities.org/"]
while problem:
    # Get a link from the stack of links
    link = problem.pop()
    # Check we haven't been to this address before
    if link in SEARCHED_URLS:
        continue
    # We don't want to come back here again after this point
    SEARCHED_URLS.append(link)
    # Try and get the website
    try:
        req = requests.get(link)
    except:
        # If its not working i don't care for it
        print "borked website found: {0}".format(link)
        continue
    # Now we get to this point worth printing something
    print "Trying to parse:{0}".format(link)
    print "Status Code:{0}  That's: {1}".format(req.status_code, "A-OK" if req.status_code == 200 else "SOMETHING'S UP" )
    # Get the domain info
    dInfo = tldextract.extract(link)
    print_domain_info(dInfo)
    # I like utf-8
    data = req.text.encode("utf-8")
    print "Length Of Data Retrieved:{0}".format(len(data))  # More info
    soup = BeautifulSoup(data)  # This was here before so i left it.
    print "Found {0} link{1}".format(len(soup.find_all('a')),"s" if len(soup.find_all('a')) > 1 else "")
    FOUND_THIS_ITERATION = []  # Getting the same links over and over was boring
    found_links = [x for x in soup.find_all('a') if x.get('href') not in SEARCHED_URLS]  # Find me all the links i don't got
    for href in found_links: 
        href = href.get('href') # You wrote this seems to work well
        if not href:
            continue
        link_domain = tldextract.extract(href) 
        if link_domain.domain == dInfo.domain: # JUST FINDING STUFF ON SAME DOMAIN RIGHT?!
            if href not in FOUND_THIS_ITERATION: # I'ma check you out next time 
                print "Check out this link: {0}".format(href)
                print_domain_info(link_domain)
                FOUND_THIS_ITERATION.append(href)
                problem.append(href)
            else: # I got you already
                print "DUPE LINK!"
        else: 
            print "Not on same domain moving on" 

    # Count down
    print "We have {0} more sites to search".format(len(problem))
    if problem:
        continue
    else:
        print "Its been fun"
        print "Lets see the URLS we've visited:"
        for url in SEARCHED_URLS:
            print url

which, after logging lots of other neocities sites, printed the list of visited URLs!

What happens is that the script pops a value off the list of sites that haven't been visited yet, then gets all the links on that page that are on the same domain. If those links point to pages we haven't visited, we add them to the list of links to visit. After that, we pop the next page and do the same thing again, until there are no pages left to visit.
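
That pop-and-visit loop can be sketched in miniature without the network, using an in-memory map from page to links in place of `requests` (the page names are made up for illustration); the traversal logic is the same:

```python
# In-memory stand-in for a site: page -> links found on that page.
site = {
    "/": ["/about", "/contact"],
    "/about": ["/", "/team"],
    "/contact": ["/"],
    "/team": ["/about"],
}

visited = []   # plays the role of SEARCHED_URLS
stack = ["/"]  # plays the role of problem
while stack:
    page = stack.pop()            # get a link from the stack of links
    if page in visited:           # check we haven't been here before
        continue
    visited.append(page)
    for link in site.get(page, []):
        if link not in visited:   # queue only unvisited pages
            stack.append(link)

print(sorted(visited))  # every reachable page, each exactly once
```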

I think this is what you were looking for; if it doesn't work the way you want, or if anyone can improve it, please let us know in the comments.
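
As a side note (my own suggestion, not part of the original answer): rather than building absolute URLs by string concatenation as in the question, the standard library's `urljoin` resolves relative hrefs against the page they were found on:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://xyzdomain.com/blog/index.html"
print(urljoin(base, "post1.html"))          # relative href -> same directory as base
print(urljoin(base, "/about"))              # root-relative href -> site root
print(urljoin(base, "http://other.com/x"))  # absolute hrefs pass through unchanged
```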

answered 2013-07-20T09:29:47.723