
To elaborate on the question in the title: I am scraping info from a movie website. I currently have a MySQL database populated with movie titles, movie urls, etc. I'm now going to take those urls from the database and set them as my start_urls in a new spider. Each url is the link to [insert movie]'s webpage, which conveys more information. The info I'm interested in is:

  • Distributor (i.e. Fox)
  • Rating (i.e. PG-13)
  • Director
  • Genre (i.e. comedy)
  • Actors
  • Producer(s)

Of these, the distributor, rating, director, and genre will each have one "thing" associated with them on each movie webpage (one rating, one director, etc.). There will of course be multiple actors and, depending on the movie, multiple producers (big-name films/most films). This is where I'm running into an issue. I want to build a pipeline which puts each piece of info in an appropriate table within my MySQL database. So, a table for director, a table for rating, etc. Each table will also have the movie title (a sketch of this schema follows the problem statement below). I can state the problem itself as such:

I'm having trouble reconciling how to build an appropriate pipeline with an appropriate spider. I'm not sure whether I can return multiple things from one spider and send them to different pipelines (creating different items to deal with "single" attributes and a different item to deal with "multiple" attributes), or whether to use the same pipeline and somehow specify what goes where (I'm not sure if I can only return one thing after scraping). I will show my code and hopefully the issue will become clearer. *Note: it is not yet complete - I'm just trying to fill in the blanks with how to do this.
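First, for concreteness, here is a sketch of the tables I have in mind (column names and types are placeholders, not final):

    import MySQLdb

    conn = MySQLdb.connect(db='testdb', user='testuser', passwd='test')
    cursor = conn.cursor()

    # One table per "multiple" attribute, keyed back to the movie by title,
    # plus one table for the singular attributes.
    cursor.execute("""CREATE TABLE IF NOT EXISTS Actors
                          (title VARCHAR(255), actor VARCHAR(255))""")
    cursor.execute("""CREATE TABLE IF NOT EXISTS Producers
                          (title VARCHAR(255), producer VARCHAR(255))""")
    cursor.execute("""CREATE TABLE IF NOT EXISTS Other_Info
                          (title VARCHAR(255), distributor VARCHAR(255),
                           rating VARCHAR(16), director VARCHAR(255),
                           genre VARCHAR(255))""")
    conn.commit()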

The spider:

  import sys; sys.path.append("/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages")
  import MySQLdb

  from scrapy.spider import BaseSpider
  from scrapy.selector import HtmlXPathSelector

  # SingleItem and MultiItem are this project's Item subclasses, defined in items.py


  class ActorSpider(BaseSpider):
      name = 'ActorSpider'
      allowed_domains = ['movie website']
      #start_urls = #HAVE NOT FILLED THIS IN YET- WILL BE A SELECT STATEMENT, GATHERING ALL URLS

      db = MySQLdb.connect(db='testdb', user='testuser', passwd='test')
      dbc = db.cursor()

      def parse(self, response):
          hxs = HtmlXPathSelector(response)

          # Expect only singular items (i.e. one title, one rating, etc.)
          single_info = SingleItem()
          title = hxs.select('[title tags here]').extract()
          distributor = hxs.select('[distributor tags here]').extract()
          rating = hxs.select('[rating tags here]').extract()
          director = hxs.select('[director tags here]').extract()
          genre = hxs.select('[genre tags here]').extract()

          single_info['title'] = title
          single_info['distributor'] = distributor
          single_info['rating'] = rating
          single_info['director'] = director
          single_info['genre'] = genre

          single_items = [single_info] #Note: not sure if I want to return this or just single_info

          #return single_items

          # Multiple items in a field
          actors = hxs.select('[actor tags here]').extract()
          producers = hxs.select('[producer tags here]').extract()

          actor_items = []
          for actor in actors:
              multi_info = MultiItem() # a fresh item per actor; reusing one item would keep overwriting it
              multi_info['title'] = title
              multi_info['actor'] = actor
              actor_items.append(multi_info)

          #return actor_items - can I have multiple returns in my code to specify which pipeline is used, or which table this should be inserted into?

          producer_items = []
          for producer in producers:
              multi_info = MultiItem() # likewise, a fresh item per producer
              multi_info['title'] = title
              multi_info['producer'] = producer
              producer_items.append(multi_info)
          #return producer_items - same issue - are multiple returns allowed? Should I try to put both the 'single items' and 'multiple items' in one big 'items' list? Can scrapy figure that out, or how would I go about specifying?
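For reference, the start_urls gap above could be filled from the same database, something like this sketch (the movies table and url column names are placeholders for my actual schema):

      dbc.execute("SELECT url FROM movies")
      start_urls = [row[0] for row in dbc.fetchall()]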

I've added comments at the spots that may be unclear - I'm not sure how to direct everything so that it ends up in the appropriate table. This may be clearer on reading the pipeline, which is:

  import MySQLdb


  class IndMoviePipeline(object):

      def __init__(self):
          'initiate the database connection'
          self.conn = MySQLdb.connect(user='testuser', passwd='test', db='testdb', host='localhost', charset='utf8', use_unicode=True)
          self.cursor = self.conn.cursor()

      def process_item(self, item, spider):
          try:
              if 'producer' in item:
                  self.cursor.execute("""INSERT INTO Producers (title, producer) VALUES (%s, %s)""", (item['title'], item['producer']))
              elif 'actor' in item:
                  self.cursor.execute("""INSERT INTO Actors (title, actor) VALUES (%s, %s)""", (item['title'], item['actor']))
              else:
                  self.cursor.execute("""INSERT INTO Other_Info (title, distributor, rating, director, genre) VALUES (%s, %s, %s, %s, %s)""", (item['title'], item['distributor'], item['rating'], item['director'], item['genre'])) #NOTE: I will likely change 'Other_Info' to just populating the original table from which the URLs were pulled
              self.conn.commit()
          except MySQLdb.Error, e:
              print "Error %d: %s" % (e.args[0], e.args[1])

          return item

I think this will help direct each item to the appropriate table in the database. Based on this, I think it would work to have one big list of items and append everything to it, so:

  items = []
  items.append(single_info)

  for producer in producers:
      multi_info = MultiItem() # again, a fresh item each time
      multi_info['title'] = title
      multi_info['producer'] = producer
      items.append(multi_info)

  for actor in actors:
      multi_info = MultiItem()
      multi_info['title'] = title
      multi_info['actor'] = actor
      items.append(multi_info)

and just let the pipeline sort it all out with those if statements. I'm not sure this is the best way to do it, though, and would really appreciate suggestions.


1 Answer


Conceptually, scrapy Items generally refer to a single "thing" being scraped (in your case, a movie) and have fields that represent the data that makes up this "thing". So consider having:

import scrapy.item
from scrapy.item import Field

class MovieItem(scrapy.item.Item):
  title = Field()
  director = Field()
  actors = Field()

Then, when you scrape the item:

item = MovieItem()

title = hxs.select('//some/long/xpath').extract()
item['title'] = title

actors = hxs.select('//some/long/xpath').extract()
item['actors'] = actors

return item
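One caveat: extract() always returns a list of strings, so for single-valued fields like title you may want just the first match; a sketch:

  titles = hxs.select('//some/long/xpath').extract()
  item['title'] = titles[0] if titles else None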

A spider's parse methods should always return (or yield) either scrapy.item.Item objects or scrapy.http.Request objects.
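For example, a parse method can yield any mix of the two; a minimal sketch (the xpaths and the link-following logic are placeholders):

  from scrapy.http import Request

  def parse(self, response):
      hxs = HtmlXPathSelector(response)

      # One Item carrying the scraped fields...
      item = MovieItem()
      item['title'] = hxs.select('//some/long/xpath').extract()
      yield item

      # ...plus follow-up Requests, which Scrapy schedules for crawling.
      for url in hxs.select('//some/links/@href').extract():
          yield Request(url, callback=self.parse)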

From there, what you do with the MovieItems is up to you. You could have a pipeline for each attribute of the MovieItem, but that's not recommended. Instead, I'd recommend having a single MySQLPersistancePipeline object with a method for persisting each field of the MovieItem. So something like:

class MySQLPersistancePipeline(object):
  ...
  def persist_producer(self, item):
    self.cursor.execute('insert into producers ...', item['producer'])

  def persist_actors(self, item):
    for actor in item['actors']:
      self.cursor.execute('insert into actors ...', actor)

  def process_item(self, item, spider):
    # method calls need the self prefix
    self.persist_producer(item)
    self.persist_actors(item)
    return item
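To turn the pipeline on, register it in your project's settings.py (the myproject.pipelines module path is a placeholder for wherever the class actually lives; older Scrapy versions take a plain list of class paths instead of a dict):

  ITEM_PIPELINES = {
      'myproject.pipelines.MySQLPersistancePipeline': 300,
  }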
answered 2013-08-27 at 23:22