
I am trying to collect data from a webpage that has a bunch of select lists I need to fetch data from. Here is the page: http://www.asusparts.eu/partfinder/Asus/All In One/E Series/

This is what I have so far:

import glob, string
from bs4 import BeautifulSoup
import urllib2, csv

for file in glob.glob("http://www.asusparts.eu/partfinder/*"):

##-page to show all selections for the E-series-##
selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/'

##-
page = urllib2.urlopen(selected_list)
soup = BeautifulSoup(page)

##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'


##-identify the id of select list which contains the E-series-##  
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')

##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]

for option in option_tags:
    open(url + option['value'])


html = urllib2.urlopen("http://www.asusparts.eu/partfinder/")

soup = BeautifulSoup(html)

all = soup.find('div', id="accordion")

I am unsure whether I am going about this the right way? As all the select menus are confusing. Basically I need to grab all the data from the selected results, such as images, prices, descriptions, etc. They are all contained within one div tag which holds all the results, named 'accordion', so would this still gather all the data? Or would I need to dig deeper into the tags inside this div? Also I would prefer to search by id rather than class, as I could fetch all the data in one go. How would I do this from what I have above? Thanks. Also I am unsure whether the glob function is being used correctly?

EDIT

Here is my edited code; no errors are returned, but I am not sure whether it returns all of the models for the E series?

import string, urllib2, urllib, csv, urlparse
from bs4 import BeautifulSoup


##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'


base_url = 'http://www.asusparts.eu/' + url

print base_url

##-page to show all selections for the E-series-##
selected_list = urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
print urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')

#selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'

##-
page = urllib2.urlopen('http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series')
soup = BeautifulSoup(page)

print soup

##-identify the id of select list which contains the E-series-##  
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')

print option_tags 

##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]

print option_tags


for option in option_tags:
    url + option['redirectvalue']

print " " + url + option['redirectvalue']

1 Answer


First off, I would like to point out a couple of problems with the code you have posted. First, the glob module is not typically used for making HTTP requests. It is useful for iterating through a subset of files on a specified path; you can read more about it in its docs.
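To make that concrete, here is a hedged sketch (the file names are made up) showing what glob is actually for: it expands wildcard patterns against the local filesystem, never against URLs.

```python
import glob
import os
import tempfile

# Create a few throwaway local files to match against.
tmp = tempfile.mkdtemp()
for name in ('a.html', 'b.html', 'notes.txt'):
    open(os.path.join(tmp, name), 'w').close()

# glob matches a wildcard pattern against local paths only;
# it never issues an HTTP request.
pages = sorted(glob.glob(os.path.join(tmp, '*.html')))
print([os.path.basename(p) for p in pages])  # ['a.html', 'b.html']
```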

The second problem is:

for file in glob.glob("http://www.asusparts.eu/partfinder/*"):

You have an indentation error here, because no indented code follows. This will raise an error and prevent the rest of the code from executing.

Another problem is that you are using some of Python's "reserved" names for your variables. You should never use words like all or file as variable names.
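A quick illustration of why this matters: assigning to a built-in name like all shadows the built-in function, so any later attempt to call it breaks.

```python
# 'all' starts out as the built-in function.
print(all([True, True]))  # True

all = ['some', 'results']  # shadows the built-in all()

try:
    all([True, True])  # no longer the function, just a list
except TypeError as e:
    print('all is no longer callable:', e)
```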

Finally, when you are looping through option_tags:

for option in option_tags:
    open(url + option['value'])

the open statement will try to open a local file whose path is url + option['value']. This will likely raise an error, as I doubt you will have a file at that location. Besides, you should be aware that you are not doing anything with this opened file.
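A small sketch to demonstrate the difference: open() interprets its argument as a local filesystem path, so passing it a URL fails rather than fetching anything.

```python
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/'

# open() looks for a *local* file at this path; it does not issue
# an HTTP request, so this raises IOError / FileNotFoundError.
try:
    open(url + 'ET10B')
    fetched = True
except IOError as e:
    fetched = False
    print('open() failed:', e)

# To fetch the page over HTTP, you would call urllib2.urlopen(...)
# (urllib.request.urlopen in Python 3) instead of open().
```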

Okay, enough criticizing. I took a look at the Asus page and I think I know what you are trying to accomplish. From what I understand, you want to scrape the list of parts (images, text, price, etc.) for each computer model on the Asus page. Each model has its parts list located at a unique URL (for example: http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220). This means you need to be able to create this unique URL for each model. To make matters more complicated, each parts category is loaded dynamically; for example, the parts in the "Cooling" section are not loaded until you click the link for "Cooling". This means we have a two-part problem: 1) get all of the valid (brand, type, family, model) combinations, and 2) figure out how to load all the parts for a given model.

I was kind of bored and decided to write up a simple program that takes care of most of the heavy lifting. It is not the most elegant thing out there, but it will get the job done. Step 1) is accomplished in get_model_information(). Step 2) is taken care of in parse_models(), but is a little less obvious. Looking at the Asus website, whenever you click a parts subsection, the JavaScript function getProductsBasedOnCategoryID() is run, which makes an ajax call to a formatted PRODUCTS_URL (see below). The response is some JSON information that is used to populate the section you clicked.

import urllib2
import json
import urlparse
from bs4 import BeautifulSoup

BASE_URL = 'http://www.asusparts.eu/partfinder/'
PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'
ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
               'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']


def get_options(url, select_id):
    """
    Gets all the options from a select element.
    """
    r = urllib2.urlopen(url)
    soup = BeautifulSoup(r)
    select = soup.find('select', id=select_id)
    try:
        options = [option for option in select.strings]
    except AttributeError:
        print url, select_id, select
        raise
    return options[1:]  # The first option is the menu text


def get_model_information():
    """
    Finds all the models for each family, all the families and models for each
    type, and all the types, families, and models for each brand.

    These are all added as tuples (brand, type, family, model) to the list
    models.
    """
    model_info = []

    print "Getting brands"
    brand_options = get_options(BASE_URL, 'mySelectList')

    for brand in brand_options:
        print "Getting types for {0}".format(brand)
        # brand = brand.replace(' ', '%20')  # URL encode spaces
        brand_url = urlparse.urljoin(BASE_URL, brand.replace(' ', '%20'))
        types = get_options(brand_url, 'mySelectListType')

        for _type in types:
            print "Getting families for {0}->{1}".format(brand, _type)
            bt = '{0}/{1}'.format(brand, _type)
            type_url = urlparse.urljoin(BASE_URL, bt.replace(' ', '%20'))
            families = get_options(type_url, 'myselectListFamily')

            for family in families:
                print "Getting models for {0}->{1}->{2}".format(brand,
                                                                _type, family)
                btf = '{0}/{1}'.format(bt, family)
                fam_url = urlparse.urljoin(BASE_URL, btf.replace(' ', '%20'))
                models = get_options(fam_url, 'myselectListModel')

                model_info.extend((brand, _type, family, m) for m in models)

    return model_info


def parse_models(model_information):
    """
    Get all the information for each accessory type for every
    (brand, type, family, model). accessory_info will be the python formatted
    json results. You can parse, filter, and save this information or use
    it however suits your needs.
    """

    for brand, _type, family, model in model_information:
        for accessory in ACCESSORIES:
            r = urllib2.urlopen(PRODUCTS_URL.format(model=model, family=family,
                                                 accessory=accessory,
                                                 brand=brand,))
            accessory_info = json.load(r)
            # Do something with accessory_info
            # ...


def main():
    models = get_model_information()
    parse_models(models)


if __name__ == '__main__':
    main()
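To make the ajax call concrete, here is how PRODUCTS_URL expands for the BM2220 example model mentioned above (the field values are an illustrative guess at one valid combination; in a real request the spaces in the family name would also need percent-encoding):

```python
PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'

# Fill the template for one (brand, family, model, accessory) combination.
url = PRODUCTS_URL.format(model='BM2220', family='B Series',
                          accessory='Cooling', brand='Asus')
print(url)
# http://json.zandparts.com/api/category/GetCategories/44/EUR/BM2220/B Series/Cooling/Asus/null/
```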

Finally, a side note. I have moved away from urllib2 in favor of the requests library. I personally think it provides much more functionality and has better semantics, but you can use whatever you would like.
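For instance, the fetch inside parse_models() could be written with requests roughly like this (a sketch; get_accessory_info is a hypothetical helper name, and it is not invoked here since it hits the live API):

```python
import requests

PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'


def get_accessory_info(brand, family, model, accessory):
    """Fetch the JSON parts listing for one model/accessory pair."""
    url = PRODUCTS_URL.format(model=model, family=family,
                              accessory=accessory, brand=brand)
    r = requests.get(url)
    r.raise_for_status()  # fail loudly on HTTP errors
    return r.json()       # requests decodes the JSON body for us
```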

answered 2013-04-15T17:55:13.183