1

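I have a sitemap that lists about 21 URLs, and each of those URLs points to a file containing about 2,000 more URLs. I'm trying to write something that parses each of the original 21 URLs, grabs the 2,000 URLs it contains, and appends them to a list.

I've been banging my head against a wall for a few days trying to get this to work, but it keeps returning a list of None. I've only been using Python for about 3 weeks, so I may be missing something really obvious. Any help would be great!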

storage = []
storage1 = []

for x in range(21):
    url = 'first part of the url' + str(x) + '.xml'
    storage.append(url)

def parser(any):
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)

storage2 = [parser(x) for x in storage]

I also tried using a while loop with a counter, but it always stopped after the first 2,000 URLs.

4 Answers

1

parser() never returns anything, so it returns None by default, which is why storage2 ends up as a list of Nones. Perhaps you want to look at what's in storage1?
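
To illustrate the point, here is a minimal sketch with made-up names (`collect` and `collect_and_return` are not from the question). A function without a `return` statement evaluates to `None`, even though its side effects still happen:

```python
results = []

def collect(value):
    # Appends to the module-level list but has no return statement,
    # so a call to it evaluates to None.
    results.append(value * 2)

def collect_and_return(value):
    # Same computation, but the result is handed back to the caller.
    return value * 2

without_return = [collect(v) for v in range(3)]
with_return = [collect_and_return(v) for v in range(3)]

print(without_return)  # [None, None, None]
print(results)         # [0, 2, 4]
print(with_return)     # [0, 2, 4]
```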

answered 2013-06-17T18:03:01.653
1

If a Python function has no return statement, it automatically returns None. Inside parser you're adding elements to storage1, but you aren't returning anything. I would give this a shot instead.

storage = []

for x in range(21):
    url = 'first part of the url' + str(x) + '.xml'
    storage.append(url)

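def parser(any):
    storage1 = []  # local list, so each call returns only its own URLs
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    # loop over every entry under the root; range(len(storage)) would
    # stop after 21 entries, the number of sitemap files
    for child in root:
        storage1.append(child[0].text)
    return storage1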

storage2 = [parser(x) for x in storage]

EDIT: As Amber said, you should also see that all your elements were actually being stored in storage1.
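
Note that with this approach `storage2` is a list of 21 lists, one per sitemap file, rather than one flat list of URLs. If a single list is wanted, the results can be flattened afterwards. A minimal sketch, assuming small in-memory XML strings in place of the real downloads (`SAMPLE_FILES` and `parse_text` are made up for illustration):

```python
import xml.etree.ElementTree as ET
from itertools import chain

# Hypothetical stand-ins for two downloaded sitemap files.
SAMPLE_FILES = [
    "<urlset><url><loc>http://example.com/a</loc></url>"
    "<url><loc>http://example.com/b</loc></url></urlset>",
    "<urlset><url><loc>http://example.com/c</loc></url></urlset>",
]

def parse_text(xml_text):
    # Same idea as parser(), but reading from a string instead of urlopen().
    root = ET.fromstring(xml_text)
    # Visit every child of the root and take the text of its first sub-element.
    return [entry[0].text for entry in root]

nested = [parse_text(text) for text in SAMPLE_FILES]  # one list per file
flat = list(chain.from_iterable(nested))              # one list overall

print(nested)
print(flat)
```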

answered 2013-06-17T18:04:22.893
1

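If I understand your problem correctly, your program has two stages:

  1. You generate the initial list of 21 URLs.
  2. You fetch the page at each URL and extract the additional URLs from it.

Your first step might look like this: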

initial_urls = [('http://...%s...' % x) for x in range(21)]

Then, to populate one big list of URLs from the pages, you could do something like this:

big_list = []

def extract_urls(source):
    # fetch and parse the document at the given URL
    tree = ET.parse(urlopen(source))
    for link in get_links(tree):
        big_list.append(link.attrib['href'])

def get_links(tree):
    ...  # define the logic for link extraction here

for url in initial_urls:
    extract_urls(url)

print(big_list)

Note that you have to write the procedure for extracting links from the document yourself.
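
As a rough idea of what that extraction could look like, here is a sketch that assumes the files follow the sitemaps.org format, where each URL is the text of a namespaced `loc` element. That format is an assumption, since the question does not show the XML, and `SAMPLE` is invented data. In that format the URL would come from `link.text` rather than `link.attrib['href']`:

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol; assumed to apply here.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Invented sample document standing in for one downloaded file.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://example.com/page1</loc></url>"
    "<url><loc>http://example.com/page2</loc></url>"
    "</urlset>"
)

def get_links(tree):
    # iter() walks the whole tree and yields every element whose
    # namespace-qualified tag matches.
    return tree.iter("{%s}loc" % SITEMAP_NS)

tree = ET.ElementTree(ET.fromstring(SAMPLE))
urls = [link.text for link in get_links(tree)]
print(urls)
```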

Hope this helps!

answered 2013-06-17T18:11:53.487
0

You have to return storage1 from the parser function:

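def parser(any):
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    # loop over every entry under the root; range(len(storage)) would
    # stop after 21 entries, the number of sitemap files
    for child in root:
        storage1.append(child[0].text)
    # storage1 is the module-level list, so every call returns that same list
    return storage1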

I think this is what you want.
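
One caveat when the returned list is a module-level one: every call hands back the same list object, so a comprehension such as `[parser(x) for x in storage]` would hold 21 references to one ever-growing list. A small sketch of that effect, using made-up names (`shared`, `add_and_return`):

```python
shared = []

def add_and_return(value):
    # Appends to the module-level list and returns that same list object.
    shared.append(value)
    return shared

results = [add_and_return(v) for v in range(3)]

print(results)                            # [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
print(all(r is shared for r in results))  # True
```

Reading `storage1` directly after the loop, or creating the list inside the function, avoids the duplication.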

answered 2013-06-17T18:08:38.033