I am trying to parse an HTML result, grab a few URLs from it, and then parse the output of visiting those URLs.

I am using Django 1.5 / Python 2.7:

views.py

    # mechanize/BeautifulSoup config options here.
    beautifulSoupObj = BeautifulSoup(mechanizeBrowser.response().read())  # read the raw response
    getFirstPageLinks = beautifulSoupObj.find_all('cite')  # get first page of urls

    url_data = UrlData(NumberOfUrlsFound, getDomainLinksFromGoogle)
    #url_data = UrlData(5, 'myapp.com')
    #return HttpResponse(MaxUrlsToGather)

    print url_data.url_list()

    return render(request, 'myapp/scan/process_scan.html', {
        'url_data':url_data,'EnteredDomain':EnteredDomain,'getDomainLinksFromGoogle':getDomainLinksFromGoogle,
        'NumberOfUrlsFound':NumberOfUrlsFound,
        'getFirstPageLinks' : getFirstPageLinks,
    })

urldata.py

    class UrlData(object):

        def __init__(self, num_of_urls, url_pattern):
            self.num_of_urls = num_of_urls
            self.url_pattern = url_pattern

        def url_list(self):
            # Returns a list of strings that represent the urls you want based on num_of_urls
            # e.g. asite.com/?search?start=10
            urls = []
            for i in xrange(self.num_of_urls):
                # Note: every entry ends with a trailing comma
                urls.append(self.url_pattern + '&start=' + str((i + 1) * 10) + ',')
            return urls

Template:

    {{ getFirstPageLinks }}
    {% if url_data.num_of_urls > 0 %}
        {% for url in url_data.url_list %}
            {{ url }}
        {% endfor %}
    {% endif %}

This outputs:

[<cite>www.google.com/webmasters/</cite>, <cite>www.domain.com</cite>, <cite>www.domain.comblog/</cite>, <cite>www.domain.comblog/projects/</cite>, <cite>www.domain.comblog/category/internet/</cite>, <cite>www.domain.comblog/category/goals/</cite>, <cite>www.domain.comblog/category/uncategorized/</cite>, <cite>www.domain.comblog/twit/2013/01/</cite>, <cite>www.domain.comblog/category/dog-2/</cite>, <cite>www.domain.comblog/category/goals/personal/</cite>, <cite>www.domain.comblog/category/internet/tech/</cite>] 

That output is generated by getFirstPageLinks.

https://www.google.com/search?q=site%3Adomain.com&start=10, https://www.google.com/search?q=site%3Adomain.com&start=20,

That output is generated by the url_data template variable.

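The current problem is: I need to loop through each URL in url_data, fetch it, and get the same kind of output that getFirstPageLinks gives for the first page.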

How can I achieve this?
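
To make the goal concrete, here is a rough sketch of the loop I have in mind. It is not working view code: `CiteCollector`, `collect_cites`, `fetch_html` and `fake_fetch` are hypothetical names made up for illustration, and the standard-library HTMLParser stands in for BeautifulSoup's find_all('cite') so the snippet runs without any third-party package. In the real view, fetch_html would presumably be something like `lambda url: mechanizeBrowser.open(url).read()`.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class CiteCollector(HTMLParser):
    """Collects the text inside every <cite> element (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.cites = []
        self._buffer = None  # None while we are outside a <cite> element

    def handle_starttag(self, tag, attrs):
        if tag == 'cite':
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'cite' and self._buffer is not None:
            self.cites.append(''.join(self._buffer))
            self._buffer = None


def collect_cites(urls, fetch_html):
    """Fetch each URL via fetch_html(url) and pair it with its <cite> texts."""
    results = []
    for url in urls:
        clean_url = url.rstrip(',')  # url_list() appends a trailing comma
        parser = CiteCollector()
        parser.feed(fetch_html(clean_url))
        parser.close()
        results.append((clean_url, parser.cites))
    return results


# Stand-in fetcher so the sketch runs offline (assumption: the real one
# would go through the mechanize browser).
def fake_fetch(url):
    return ('<p><cite>www.domain.com/<b>blog</b>/</cite></p>'
            '<cite>www.domain.com</cite>')


pages = collect_cites(['https://example.com/search?q=x&start=10,'], fake_fetch)
```

The idea would be to pass something like `pages` into the template context and loop over it there, the same way the template already loops over url_data.url_list.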

Thank you.
