
I am using Scrapy to extract parts of an address and I need some help with it. Here is the code (apologies if this is badly formatted, I am not sure how to paste it into the question properly).

<div class="result">
<h3>
<a href="/provider/service/xxxxx/">service name</a>
</h3>
<p>
"blah blah"
</p>
<strong>Physical Address</strong>
    "123 address street, someplace,  somewhere"
<br/>
<strong>Postcode</strong>
    "xxx"
<br/>
<strong>District/town</strong>
    "someplace"
<br/>
<strong>Region</strong>
    "someplace bigger"
<br/>
<strong>Phone</strong>
    "xx xxx xxxx"
<br/><strong>Fax Number</strong>
    "xx xxx xxxx"
<br/>
<!--strong>Email</strong-->
    <a href="#" onclick="window.location=('mail'+'to:'+'xxxxx'+''+'@'+'xxxx.xx.xx'+''); return false;">
"xxxxx"
<strong></strong>
"xxxxx.xx.xx"
</a>
<a rel="nofollow" class="printlist-add" href="/provider/print-list/add/xxxx/">Add to print list</a>        
</div>
<hr/>

Here is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from test.items import WebhealthItem

class NewSpider(BaseSpider):
    name = "my_spider"

    download_delay = 2

    allowed_domains = ["website.com"]
    start_urls = [
        "http://website.com/site1"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # each search result sits in its own div under #search-results
        sites = hxs.select('//*[@id="search-results"]/div')
        items = []
        for site in sites:
            item = WebhealthItem()
            item['practice'] = site.select('h3/a/text()').extract()
            item['url'] = site.select('h3/a/@href').extract()
            # text node directly after the "Physical Address" label
            item['address1'] = site.select('strong[text() = "Physical Address"]/following-sibling::text()[1]')
            items.append(item)
        return items

The line item['address1'] = site.select('strong[text()="Physical Address"]/following-sibling::text()[1]') returns [<HtmlXPathSelector xpath='strong[text()="Physical Address"]/following-sibling::text()[1]' data=u'\n\t\t\t 123 address street, someplace, some'>], with the last few characters cut off.

When I add .extract(), the values show up in cmd as [u'\n\t\t\t 123 address street, someplace, somewhere'], but they do not appear in the output table.

I have been searching for a solution and have tried .select('text()').extract(), but that is not right either.
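To make the difference concrete, here is a minimal sketch (illustration only; raw is just a throwaway name, the value is the anonymised sample from above, and the print statements run inside the parse loop of my spider):

raw = site.select('strong[text() = "Physical Address"]/following-sibling::text()[1]')
print raw            # [<HtmlXPathSelector xpath='...' data=u'\n\t\t\t 123 address street, someplace, some'>]
print raw.extract()  # [u'\n\t\t\t 123 address street, someplace, somewhere']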

As always, any help is much appreciated.

P.S. Advice on how to paste page source into a question on this forum would also be appreciated. Thanks.


2 Answers


Using your example URL, I would suggest something like this, selecting the divs that have the "result" class:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('id("search-results")/div[@class="result"]')
    items = []
    for result in results:
        item = WebhealthItem()
        item['practice'] = result.select('h3/a/text()').extract()[0]
        item['url'] = result.select('h3/a/@href').extract()[0]
        item['address1'] = map(
                unicode.strip,
                result.select('strong[text() = "Physical Address"]/following-sibling::text()[1]').extract()
            )[0]
        items.append(item)
    return items
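As a quick illustration of what the map(unicode.strip, ...) wrapper does here (Python 2, using the sample value from the question rather than real data):

raw = [u'\n\t\t\t 123 address street, someplace, somewhere']  # what .extract() returns
cleaned = map(unicode.strip, raw)[0]                          # strip the leading whitespace, take the first match
# cleaned == u'123 address street, someplace, somewhere'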
Answered 2013-10-09T08:24:38.553
def caiqinghua_array_string_strip(array_string):
    # Expects the list returned by .extract(); returns '' if it is empty.
    if array_string == []:
        return ''
    else:
        # print 'item::: ', array_string[0].strip()
        string = array_string[0].replace('\r\n', '')
        return string.strip()

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//*[@id="search-results"]/div')
    items = []
    for site in sites:
        item = WebhealthItem()
        item['practice'] = site.select('h3/a/text()').extract()
        item['url'] = site.select('h3/a/@href').extract()
        # extract() first, so the helper receives a list of unicode strings
        address = site.select('strong[text() = "Physical Address"]/following-sibling::text()[1]').extract()
        item['address1'] = caiqinghua_array_string_strip(address)
        items.append(item)
    return items
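For example (illustration only, using the sample value from the question), the helper turns the extracted list into a clean string and falls back to an empty string when nothing matches:

print caiqinghua_array_string_strip([u'\n\t\t\t 123 address street, someplace, somewhere'])
# -> 123 address street, someplace, somewhere
print caiqinghua_array_string_strip([])
# -> '' (empty string)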

Hope it helps. By the way, I would suggest changing items = [] to items_list = [] or something similar, since items is a Scrapy keyword and may cause conflicts later.

Answered 2013-10-09T23:56:08.243