8

I want to scrape data from a website that has text fields, buttons, etc. My requirement is to fill in the text fields, submit the form to get the results, and then scrape the data points from the results page.

Does Scrapy have this feature, or can anyone recommend a Python library to accomplish this task?

EDIT:

I want to scrape the data from the following website:
http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType

My requirement is to select values from the combo boxes, hit the search button, and scrape the data points from the results page.

P.S. I'm using the Selenium Firefox driver to scrape data from another website, but that solution isn't good because the Selenium Firefox driver depends on the Firefox executable, i.e. Firefox must be installed before running the scraper.

The Selenium Firefox driver consumes around 100 MB of memory per instance, and my requirement is to run many instances at a time to speed up the scraping, so there is a memory limitation as well.

Firefox sometimes crashes during execution of the scraper; I don't know why. I also need windowless (headless) scraping, which is not possible with the Selenium Firefox driver.

My ultimate goal is to run the scrapers on Heroku, which is a Linux environment, so the Selenium Firefox driver won't work there. Thanks.

4 Answers

18

Basically, you have plenty of tools to choose from: Scrapy, Selenium, mechanize, requests, BeautifulSoup, and others.

These tools serve different purposes, but they can be mixed together depending on the task.

Scrapy is a powerful and very smart tool for crawling websites and extracting data. But when it comes to manipulating a page (clicking buttons, filling out forms) it gets more complicated:

  • sometimes it is easy to simulate filling in and submitting a form by performing the underlying form action directly in Scrapy (see the FormRequest sketch below)
  • sometimes you have to use other tools to help with the scraping, like mechanize or Selenium
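
For the first option, here is a minimal sketch of what the direct form submission could look like, assuming a page whose search form is present in the HTML; the URL and the query field name are hypothetical:

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class SearchFormSpider(BaseSpider):
    name = "search_form"
    start_urls = ["http://example.com/search"]  # hypothetical page with a form

    def parse(self, response):
        # from_response() pre-fills the form's hidden fields from the page
        # and merges in the values we want to submit
        return FormRequest.from_response(
            response,
            formdata={"query": "scrapy"},  # hypothetical field name
            callback=self.parse_results)

    def parse_results(self, response):
        # scrape the data points from the results page here
        pass

FormRequest.from_response() picks up the form's hidden fields automatically, which is often all you need for a simple search form.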

If you make your question more specific, it will help to figure out what kind of tool you should use or choose.

Take a look at an interesting example of a Scrapy and Selenium mix. Here, the Selenium task is to click a button and supply data for the Scrapy items:

import time

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from selenium import webdriver


class ElyseAvenueItem(Item):
    name = Field()


class ElyseAvenueSpider(BaseSpider):
    name = "elyse"
    allowed_domains = ["ehealthinsurance.com"]
    start_urls = [
        'http://www.ehealthinsurance.com/individual-family-health-insurance?action=changeCensus&census.zipCode=48341&census.primary.gender=MALE&census.requestEffectiveDate=06/01/2013&census.primary.month=12&census.primary.day=01&census.primary.year=1971']

    def __init__(self):
        super(ElyseAvenueSpider, self).__init__()
        # Selenium drives a real Firefox instance for the page interaction
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        # click the "go" button that triggers the plan search
        el = self.driver.find_element_by_xpath("//input[contains(@class,'btn go-btn')]")
        if el:
            el.click()

        # wait for the results to load (a fixed sleep; an explicit wait would be more robust)
        time.sleep(10)

        # hand the rendered data over to Scrapy items
        plans = self.driver.find_elements_by_class_name("plan-info")
        for plan in plans:
            item = ElyseAvenueItem()
            item['name'] = plan.find_element_by_class_name('primary').text
            yield item

        self.driver.close()

UPDATE:

Here is an example of how to use Scrapy in your case:

from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class AcrisItem(Item):
    borough = Field()
    block = Field()
    doc_type_name = Field()


class AcrisSpider(BaseSpider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # every document class offered by the drop-down
        document_classes = hxs.select('//select[@name="combox_doc_doctype"]/option')

        # hidden anti-forgery token that must accompany each POST
        form_token = hxs.select('//input[@name="__RequestVerificationToken"]/@value').extract()[0]

        for document_class in document_classes:
            doc_type = document_class.select('.//@value').extract()[0]
            doc_type_name = document_class.select('.//text()').extract()[0]

            # replicate the hidden fields the search form submits
            formdata = {'__RequestVerificationToken': form_token,
                        'hid_selectdate': '7',
                        'hid_doctype': doc_type,
                        'hid_doctype_name': doc_type_name,
                        'hid_max_rows': '10',
                        'hid_ISIntranet': 'N',
                        'hid_SearchType': 'DOCTYPE',
                        'hid_page': '1',
                        'hid_borough': '0',
                        'hid_borough_name': 'ALL BOROUGHS',
                        'hid_ReqID': '',
                        'hid_sort': '',
                        'hid_datefromm': '',
                        'hid_datefromd': '',
                        'hid_datefromy': '',
                        'hid_datetom': '',
                        'hid_datetod': '',
                        'hid_datetoy': ''}

            yield FormRequest(url="http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult",
                              method="POST",
                              formdata=formdata,
                              callback=self.parse_page,
                              meta={'doc_type_name': doc_type_name})

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)

        # each result row in the DATA form's nested table
        rows = hxs.select('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
        for row in rows:
            item = AcrisItem()
            borough = row.select('.//td[2]/div/font/text()').extract()
            block = row.select('.//td[3]/div/font/text()').extract()

            if borough and block:
                item['borough'] = borough[0]
                item['block'] = block[0]
                item['doc_type_name'] = response.meta['doc_type_name']

                yield item

Save it in spider.py and run it via scrapy runspider spider.py -o output.json. In output.json you will see:

{"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}
...

Hope that helps.

answered 2013-05-28T08:06:26.310
3

If you simply want to submit the form and extract data from the resulting page, I'd go for:

  • requests for sending the POST request
  • BeautifulSoup for extracting the chosen data from the results page (see the combined sketch below)
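
For example, here is a minimal combined sketch for this site. It reuses the hidden form fields from the Scrapy answer above; the document-type code and name are assumptions for illustration, and whether the endpoint accepts a plain POST like this is untested:

import requests
from bs4 import BeautifulSoup

# grab the anti-forgery token the form expects, as the Scrapy spider does
search_page = requests.get("http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType")
token = BeautifulSoup(search_page.text, "html.parser").find(
    "input", {"name": "__RequestVerificationToken"})["value"]

# the same hidden fields the Scrapy spider posts; the document-type code
# and name below are assumed for illustration
formdata = {
    "__RequestVerificationToken": token,
    "hid_selectdate": "7",
    "hid_doctype": "AGMT",            # assumed document-type code
    "hid_doctype_name": "AGREEMENT",  # assumed display name
    "hid_max_rows": "10",
    "hid_ISIntranet": "N",
    "hid_SearchType": "DOCTYPE",
    "hid_page": "1",
    "hid_borough": "0",
    "hid_borough_name": "ALL BOROUGHS",
}

result = requests.post(
    "http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult",
    data=formdata)
soup = BeautifulSoup(result.text, "html.parser")

# pull the same borough/block cells the Scrapy spider targets
for row in soup.select('form[name="DATA"] table tr'):
    cells = row.find_all("td")
    if len(cells) >= 3:
        print(cells[1].get_text(strip=True), cells[2].get_text(strip=True))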

The added value of Scrapy really lies in its ability to follow links and crawl a website; if you know exactly what you are searching for, I don't think it's the right tool for the job.

answered 2013-05-28T07:13:01.380
2

Personally, I would use mechanize, as I don't have any experience with Scrapy. However, a library named Scrapy that is purpose-built for screen scraping should be up to the task. I would just have a go with both of them and see which does the job best/easiest.
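
For instance, a minimal mechanize sketch against the ACRIS page might look like this; the drop-down name comes from the Scrapy answer above, the chosen value is an assumption, and a plain form submit may not be enough if the page depends on JavaScript:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt for this sketch
br.open("http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType")

# select the first form on the page and set the document-type drop-down;
# list controls in mechanize take a list of values ("DEED" is assumed)
br.select_form(nr=0)
br["combox_doc_doctype"] = ["DEED"]

response = br.submit()
html = response.read()  # results page, ready for parsing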

answered 2013-05-28T07:05:21.863
0

I have used Scrapy, Selenium, and BeautifulSoup. Scrapy will get the job done for you. For me, BeautifulSoup was more useful because I could use functions like prettify() and find_all(), which helped a lot in understanding the HTML content.
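
For instance, a small sketch of those two helpers; the HTML snippet is made up, loosely modeled on the plan markup from the first answer:

from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="plan-info"><span class="primary">Plan A</span></div>
  <div class="plan-info"><span class="primary">Plan B</span></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# prettify() re-indents the document, which makes unfamiliar markup readable
print(soup.prettify())

# find_all() collects every matching tag
for div in soup.find_all("div", class_="plan-info"):
    print(div.get_text(strip=True))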

I don't recommend Selenium because it slows down your process. It first loads the browser and its content and then proceeds with the scraping, which makes it take much longer than the other packages.

answered 2019-10-07T18:18:39.140