
I am trying to load a second pipeline that writes items into a MySQL database. In the log I can see that it gets loaded, but after that nothing happens. Not even a log entry. This is my pipeline:

# Mysql
import sys
import hashlib
from datetime import datetime

import MySQLdb
from scrapy.exceptions import DropItem
from scrapy.http import Request


class MySQLStorePipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(host="localhost", user="***", passwd="***",
                                    db="***", charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

def process_item(self, item, spider):
    CurrentDateTime = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    Md5Hash = hashlib.md5(item['link']).hexdigest()
    try:
        self.cursor.execute(
            """INSERT INTO apple (article_add_date, article_date, article_title,
                                  article_link, article_link_md5, article_summary,
                                  article_image_url, article_source)
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""",
            (CurrentDateTime, item['date'], item['title'], item['link'], Md5Hash,
             item['summary'], item['image'], item['sourcesite']))
        self.conn.commit()
    except MySQLdb.Error, e:
        print "Error %d: %s" % (e.args[0], e.args[1])

    return item
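
To rule out a problem on the database side, this kind of standalone script can be used to exercise the same INSERT outside Scrapy. It is only an illustrative sketch; the credentials are the same placeholders and the sample values are made up:

# Standalone sanity check for the same INSERT, run outside Scrapy.
# Credentials and sample values are placeholders.
import hashlib
from datetime import datetime

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="***", passwd="***", db="***",
                       charset="utf8", use_unicode=True)
cursor = conn.cursor()
link = "http://www.macnn.com/articles/test/"
cursor.execute(
    """INSERT INTO apple (article_add_date, article_date, article_title,
                          article_link, article_link_md5, article_summary,
                          article_image_url, article_source)
       VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""",
    (datetime.now().strftime("%Y-%m-%d %H:%M:%S"), "2013-06-20 08:00:00",
     "test title", link, hashlib.md5(link).hexdigest(), "test summary",
     "http://www.example.com/image.jpg", "macnn_com"))
conn.commit()
conn.close()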

This is my log:

 scrapy crawl macnn_com
2013-06-20 08:15:53+0200 [scrapy] INFO: Scrapy 0.16.4 started (bot: HungryFeed)
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Enabled item pipelines: MySQLStorePipeline, CleanDateField
2013-06-20 08:15:54+0200 [macnn_com] INFO: Spider opened
2013-06-20 08:15:54+0200 [macnn_com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-06-20 08:15:54+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-06-20 08:15:55+0200 [macnn_com] DEBUG: Crawled (200) <GET http://www.macnn.com> (referer: None)
2013-06-20 08:15:55+0200 [macnn_com] DEBUG: Crawled (200) <GET http://www.macnn.com/articles/13/06/19/compatibility.described.as.experimental/> (referer: http://www.macnn.com)
2013-06-20 08:15:55+0200 [macnn_com] DEBUG: Scraped from <200 http://www.macnn.com/articles/13/06/19/compatibility.described.as.experimental/>
*** lot of scraping data ***
*** lot of scraping data ***
*** lot of scraping data ***
2013-06-20 08:15:56+0200 [macnn_com] INFO: Closing spider (finished)
2013-06-20 08:15:56+0200 [macnn_com] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 5711,
         'downloader/request_count': 17,
         'downloader/request_method_count/GET': 17,
         'downloader/response_bytes': 281140,
         'downloader/response_count': 17,
         'downloader/response_status_count/200': 17,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 6, 20, 6, 15, 56, 685286),
         'item_scraped_count': 16,
         'log_count/DEBUG': 39,
         'log_count/INFO': 4,
         'request_depth_max': 1,
         'response_received_count': 17,
         'scheduler/dequeued': 17,
         'scheduler/dequeued/memory': 17,
         'scheduler/enqueued': 17,
         'scheduler/enqueued/memory': 17,
         'start_time': datetime.datetime(2013, 6, 20, 6, 15, 54, 755766)}
2013-06-20 08:15:56+0200 [macnn_com] INFO: Spider closed (finished)

Needless to say, I do load the pipelines in settings.py, like this:

ITEM_PIPELINES = [
        'HungryFeed.pipelines.CleanDateField',
        'HungryFeed.pipelines.MySQLStorePipeline'
]

Am I missing something here?
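
One way to narrow it down (just a debugging sketch, not part of my code) would be to log from inside process_item with Scrapy's own logger, so the crawl output shows whether the method is ever invoked:

# Debugging sketch: log each item the pipeline receives, so the crawl
# output shows whether process_item gets called at all.
from scrapy import log

class MySQLStorePipeline(object):
    # __init__ as above

    def process_item(self, item, spider):
        log.msg("MySQLStorePipeline got item: %s" % item['link'], level=log.DEBUG)
        # ... INSERT as above ...
        return item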

This is my first pipeline:

class CleanDateField(object):

    def process_item(self, item, spider):
        from dateutil import parser
        rawdate = item['date']

        # text replacement per spider so the parser can recognize the datetime better
        if spider.name == "macnn_com":
            rawdate = rawdate.replace("updated", "").strip()

        dt = parser.parse(rawdate)
        articledate = dt.strftime("%Y-%m-%d %H:%M:%S")
        item['date'] = articledate

        return item
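
Just to illustrate what this first pipeline does to a macnn_com date (the raw string below is a made-up example, not taken from the site):

# Illustration of the date cleaning; the raw string is a hypothetical example.
from dateutil import parser

rawdate = "updated June 19, 2013 10:30 AM".replace("updated", "").strip()
print parser.parse(rawdate).strftime("%Y-%m-%d %H:%M:%S")  # 2013-06-19 10:30:00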
