python - Web scraper returns a list of elements

Question

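I'm trying to build a scraper that uses mechanize and lxml to pull information out of tables on multiple web pages. The code below returns a list of elements, and I'm trying to find a way to get the text out of those elements (adding .text doesn't work on a list object).

Here is the code: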

import mechanize
import lxml.html as lh
import csv

br = mechanize.Browser()
response = br.open("http://localhost/allproducts")

output = csv.writer(file(r'output.csv','wb'), dialect='excel')

for link in br.links(url_regex="product"):
    follow = br.follow_link(link)
    url = br.response().read()
    find = lh.document_fromstring(url)
    find = find.findall('.//td')
    print find
    output.writerows([find])

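If I add the following to the end of the code above, the text from the tds shows up in the csv file, but the text from each td lands on a separate row. I would like the same layout as the code above, but with text instead of a list of elements (all of the information for each page on one row).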

for find in find:
    print find.text
    output.writerows([find.text])

I pieced this code together from a bunch of other examples, so any general advice is also greatly appreciated.

score 0 · Accepted Answer

You're very close! Your code has 2 problems:

1) find is a list of objects, not a list of strings. Here's some Python to verify this:

>>> type(find)
<type 'list'>
>>> find
[<Element td at 0x101401e30>, <Element td at 0x101401e90>, <Element td at 0x101401ef0>, <Element td at 0x101401f50>, <Element td at 0x101401fb0>, <Element td at 0x101404050>, <Element td at 0x1014040b0>, <Element td at 0x101404110>, <Element td at 0x101404170>, <Element td at 0x1014041d0>, <Element td at 0x101404230>, <Element td at 0x101404290>, <Element td at 0x1014042f0>, <Element td at 0x101404350>, <Element td at 0x1014043b0>, <Element td at 0x101404410>]
>>> type(find[0])
<class 'lxml.html.HtmlElement'>

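So the find variable points to a list of <class 'lxml.html.HtmlElement'> objects. A structure like this should not be passed directly to output.writerows. That function expects a list of rows, where each row is itself a list of text items; to write a single row of text items, use output.writerow instead.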
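To make the writerow / writerows difference concrete, here is a minimal sketch. It is written for Python 3 and writes to an in-memory buffer instead of output.csv; the cell values are made up for illustration.

```python
import csv
import io

# Hypothetical cell text from one product page (not from the original post).
cells = ["Widget", "9.99", "Blue"]

# writerow: one call writes one row, with one field per list item.
buf_row = io.StringIO()
csv.writer(buf_row).writerow(cells)
row_output = buf_row.getvalue()

# writerows: expects a list of rows. Given a flat list of strings, each
# string is treated as its own row and is split into single characters.
buf_rows = io.StringIO()
csv.writer(buf_rows).writerows(cells)
rows_output = buf_rows.getvalue()

print(row_output)
print(rows_output)
```

The first buffer holds a single line with three fields, which is the "everything for one page on one row" layout the question asks for. The second holds three lines with one character per field.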

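2) When iterating over find, you are reassigning the variable name find. Never give the loop variable the same name as the collection you are iterating over!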

for item in find:
    print item.text
    output.writerow([item.text])  # still one row per td, but no longer shadows find
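Here is a small sketch of what the name reuse actually does, using a plain list of strings in place of the td elements (Python 3, names are illustrative only). The loop itself still visits every item, because the iterator is created before the name is rebound, but afterwards the name no longer refers to the list.

```python
# A plain list stands in for the list of td elements.
cells = ["a", "b", "c"]

seen = []
for cells in cells:  # deliberately reuses the name, as in the question
    seen.append(cells)

# The list is gone from this name; it now holds the last item visited.
after_loop = cells
print(seen, after_loop)
```

Nothing crashes here, which is exactly why this kind of bug is easy to miss: any later code that expects the list gets a single item instead.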

Putting it all together, you should end up with something like this:

for link in br.links(url_regex="product"):
    follow = br.follow_link(link)
    url = br.response().read()
    find = lh.document_fromstring(url)
    find = find.findall('.//td')
    print find
    results = []  # Create a place to store the text names
    for item in find:
        results.append(item.text)  # Store the text name of the item in the results list.
    output.writerow(results)  # Now, write this page's results out as a single row.
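The loop above needs mechanize and a live site, so here is an offline sketch of the same flow that can be run as-is under Python 3. It makes a few assumptions that are not in the original: the two HTML strings are invented, an in-memory buffer replaces output.csv, and the standard library's xml.etree.ElementTree stands in for lxml.html (its elements offer the same findall and .text used above, but it only parses well-formed markup).

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical product pages; in the real scraper each one would come
# from br.response().read() after following a link.
pages = [
    "<html><body><table><tr><td>Widget</td><td>9.99</td></tr></table></body></html>",
    "<html><body><table><tr><td>Gadget</td><td>19.50</td></tr></table></body></html>",
]

buffer = io.StringIO()  # stands in for the output.csv file
output = csv.writer(buffer, dialect="excel")

for page in pages:
    root = ET.fromstring(page)
    cells = root.findall(".//td")
    results = []  # text of every td on this page
    for cell in cells:
        results.append(cell.text)
    output.writerow(results)  # one CSV row per page

csv_text = buffer.getvalue()
print(csv_text)
```

Each page ends up as exactly one line in the CSV text, with one field per td.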

Pro tip

You can even use a list comprehension to build the results in a single line, like so:

results = [item.text for item in find]
output.writerow(results)

This replaces 3 lines of Python with one.
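One caveat with item.text that may matter for real tables: it is None for an empty cell, and it only covers the text before the first child tag. The sketch below shows a possible variation of the comprehension that joins all nested text instead. It is Python 3, the sample row is invented, and ElementTree again stands in for lxml (lxml elements also provide itertext(), and lxml.html elements additionally have text_content()).

```python
import xml.etree.ElementTree as ET

# Hypothetical row: plain text, text wrapped in a child tag, and an empty cell.
row = ET.fromstring(
    "<tr><td>Widget</td><td><b>9.99</b> USD</td><td></td></tr>"
)
cells = row.findall(".//td")

# .text alone misses nested text and yields None for empty cells.
text_only = [cell.text for cell in cells]

# Joining itertext() collects all text inside each cell.
full_text = ["".join(cell.itertext()).strip() for cell in cells]

print(text_only)
print(full_text)
```

Whether this is needed depends on how the product tables are marked up; if every td holds plain text, the simpler comprehension is enough.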
