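I want to scrape information from http://www.stfrancismedical.org/asp/job-summary.asp?cat=4, but I don't know how, because I only know how to crawl recursively. Is there a way to use a loop to scrape all the information for each job?

Any other ideas would be great too.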

1 Answer

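The structure of that page is a little odd: it is a single table whose rows all sit at the same depth. That makes it harder to extract all the data for one job with a single XPath expression. My approach is to use the modulo operator and fill in one item object for every group of seven rows.

In any case, the page has no links to follow, so a basic spider is all that is needed.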
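The modulo idea can be sketched without Scrapy at all. The snippet below is only an illustration: `group_rows` and the `flat` sample list are hypothetical names, the `desc`/`req` strings are placeholders, and it assumes every record contributes exactly seven values in a fixed order.

```python
FIELDS = ('department', 'employment', 'shift', 'weekends_holidays',
          'biweekly_hours', 'description', 'requirements')


def group_rows(values, fields=FIELDS):
    """Group a flat sequence of cell values into one dict per record."""
    records = []
    record = None
    for i, value in enumerate(values):
        idx = i % len(fields)  # position of this value inside its record
        if idx == 0:
            # A new record starts here; store the previous one first.
            if record is not None:
                records.append(record)
            record = {}
        record[fields[idx]] = value
    # The loop never stores the record it was still filling.
    if record is not None:
        records.append(record)
    return records


# Hypothetical flat values for two jobs (descriptions are placeholders).
flat = ['Cardiac Care', 'Pool', 'Day - Evening - Night', 'No', 'Varied',
        'desc A', 'req A',
        'Critical Care Unit', 'Full-Time', 'Evening', 'Yes', '72',
        'desc B', 'req B']

jobs = group_rows(flat)
print(len(jobs))
print(jobs[1]['department'])
```

If the page ever adds or drops a row per job, the fixed group size stops matching and the fields shift, so this only holds while the layout stays regular.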

Step one, create the project:

scrapy startproject stfrancismedical
cd stfrancismedical

Step two, create the spider:

scrapy genspider -t basic stfrancismedical_spider 'stfrancismedical.org'

Step three, define an item that holds all the fields of a job:

vim stfrancismedical/items.py

with new content such as:

from scrapy.item import Item, Field

class StfrancismedicalItem(Item):
    department = Field()
    employment = Field()
    shift = Field()
    weekends_holidays = Field()
    biweekly_hours = Field()
    description = Field()
    requirements = Field()

Step four, edit the spider:

vim stfrancismedical/spiders/stfrancismedical_spider.py

with this content:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from stfrancismedical.items import StfrancismedicalItem

rn = ('department', 'employment', 'shift', 'weekends_holidays',
        'biweekly_hours', 'description', 'requirements')

class StfrancismedicalSpiderSpider(BaseSpider):
    name = "stfrancismedical_spider"
    allowed_domains = ["stfrancismedical.org"]
    start_urls = ( 
        'http://www.stfrancismedical.org/asp/job-summary.asp?cat=4',
    )   


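    def parse(self, response):
        items = []
        item = None
        hxs = HtmlXPathSelector(response)
        # Each job spans seven consecutive two-cell rows, one per field in rn.
        rows = hxs.select('/html/body/div/table//tr[count(./td)=2]')
        for i, tr in enumerate(rows):
            idx = i % 7
            if idx == 0:
                # A new job starts here; store the previous one first.
                if item is not None:
                    items.append(item)
                item = StfrancismedicalItem()
            # The value is the first text node of the row's second cell.
            item[rn[idx]] = tr.select('./td[2]//text()').extract()[0]
        # Store the last job, which the loop itself never appends.
        if item is not None:
            items.append(item)
        return items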
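To see what the row filter `tr[count(./td)=2]` and the `./td[2]` lookup select, here is a rough standard-library stand-in. It is a sketch only: the `SAMPLE` markup is a hypothetical, well-formed guess at the table layout (including the one-cell title row), `second_cells` is an illustrative helper, and unlike the spider it joins all text in the cell instead of taking only the first text node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page's table.
SAMPLE = """
<table>
  <tr><td colspan="2">Registered Nurse</td></tr>
  <tr><td>Department</td><td>Cardiac Care</td></tr>
  <tr><td>Employment</td><td><b>Pool</b></td></tr>
  <tr><td>Shift</td><td>Day - Evening - Night</td></tr>
</table>
"""


def second_cells(table_markup):
    """Return the text of the second cell of every row with exactly two cells."""
    table = ET.fromstring(table_markup)
    values = []
    for tr in table.iter('tr'):
        cells = tr.findall('td')  # direct <td> children only
        if len(cells) == 2:       # same condition as count(./td)=2
            values.append(''.join(cells[1].itertext()).strip())
    return values


values = second_cells(SAMPLE)
print(values)
```

Real pages are rarely well-formed XML, which is why the spider relies on Scrapy's HTML selectors; this version only makes the selection logic visible.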

And run it like this:

scrapy crawl stfrancismedical_spider -o stfrancismedical.json -t json

This creates a new file, stfrancismedical.json, containing the data:

[{"requirements": "Skilled in Cath Lab nursing, 2 years experience and patient recovery experience. A Current valid NJ RN license with a current ACLS certification.", "description": "Responsible for the delivery of individualized patient care to assigned patients utilizing the nursing process of assessment, planning, implementation and evaluation.", "shift": "Day - Evening - Night", "biweekly_hours": "Varied", "weekends_holidays": "No", "department": "Cardiac Care", "employment": "Pool"},
{"requirements": "Requirements: A Current valid NJ RN license with a current ACLS & BLS certification.", "description": "Responsible for the delivery of individualized patient care to assigned critical care patients utilizing the nursing process of assessment, planning, implementation and evaluation. ", "shift": "Evening", "biweekly_hours": "72", "weekends_holidays": "Yes", "department": "Critical Care Unit", "employment": "Full-Time"},
{"requirements": "ACLS, NJ License required.\u00a0 Balloon pump certification preferred.", "description": "Provide comprehensive Nursing care to critically ill patients.\u00a0 ", "shift": "Day", "biweekly_hours": "72 - 11am - 11pm", "weekends_holidays": "Yes", "department": "Critical Care Unit", "employment": "Full-Time"},
{"requirements": "ACLS, NJ License required.\u00a0 Balloon pump certification preferred.", "description": "Provide comprehensive Nursing care to critically ill patients. ", "shift": "Evening - Night", "biweekly_hours": "72 - 7pm - 7am", "weekends_holidays": "No", "department": "Critical Care Unit", "employment": "Full-Time"},
{"requirements": "Associates Degree in Nursing, Healthcare, or equivalent experience: BSN preferred.", "description": "Must be detail oriented and able to follow detailed procedures to ensure accuracy.\u00a0 Must demonstrate excellent follow up skills.\u00a0 Ability to coordinate and priortize multiple duties.\u00a0 Understands interactions amongst clinical areas and their roles within hospital.\u00a0 Advanced knowledge in computer skills, including knowledge of Microsoft Word, Excel and PowerPoint.\u00a0", "shift": "Day", "biweekly_hours": "80", "weekends_holidays": "No", "department": "Nursing Education", "employment": "Full-Time"},
...
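Once the file exists, the records are plain dictionaries and can be filtered in an ordinary loop. The following is a minimal sketch: `filter_jobs` is a hypothetical helper, and `SAMPLE` holds two records abbreviated from the output above so the snippet runs without the file. With the real file, the list would come from `json.load` on an open handle to stfrancismedical.json.

```python
import json

# Two records abbreviated from the output above (long text fields omitted).
SAMPLE = '''[
 {"shift": "Day - Evening - Night", "biweekly_hours": "Varied",
  "weekends_holidays": "No", "department": "Cardiac Care",
  "employment": "Pool"},
 {"shift": "Evening", "biweekly_hours": "72",
  "weekends_holidays": "Yes", "department": "Critical Care Unit",
  "employment": "Full-Time"}
]'''


def filter_jobs(jobs, **criteria):
    """Keep the jobs whose fields equal every given keyword argument."""
    return [job for job in jobs
            if all(job.get(key) == value for key, value in criteria.items())]


jobs = json.loads(SAMPLE)
full_time = filter_jobs(jobs, employment='Full-Time')
print([job['department'] for job in full_time])
```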
Answered 2013-10-20T18:27:37.860