To elaborate on the title of this question: I am scraping information from a movie website. I currently have a MySQL database populated with `movie titles`, `movie urls`, etc. I now want to take those `urls` from the database and set them as my `start_urls` in a new `spider`. Each `url` is a link to the webpage of [insert movie], which conveys much more information. The information I am interested in is:
- Distributor (i.e. Fox)
- Rating (i.e. PG-13)
- Director
- Genre (i.e. Comedy)
- Actors
- Producer(s)
Of these, the distributor, rating, director, and genre will each have one "thing" associated with them on each movie webpage (one rating, one director, etc.). There will of course be multiple actors and, depending on the movie (big-name movies/most movies), multiple producers. This is where I am running into trouble. I want to set up a `pipeline` which puts each piece of information into an appropriate `table` within my `MySQL` database. So, a table for director, a table for rating, etc. Each table will also have the `movie title`. I can state the problem itself as follows:
I am having trouble reconciling how to build an appropriate `pipeline` with an appropriate `spider`. I am not sure whether I can return multiple things from one spider and send them to different `pipelines` (creating different items to deal with the "single" attributes and a different item to deal with the "multiple" attributes), or whether to use the same pipeline and somehow specify what goes where (and I am not sure whether I can only return one thing after scraping). I will show my code, and hopefully the problem will become clearer. *Note: it is not complete yet; I am just trying to fill in the blanks on how to do this.
The spider:
    import sys; sys.path.append("/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages")
    import MySQLdb

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector


    class ActorSpider(BaseSpider):
        db = MySQLdb.connect(db='testdb', user='testuser', passwd='test')
        dbc = db.cursor()
        name = 'ActorSpider'
        allowed_domains = ['movie website']
        #start_urls = #HAVE NOT FILLED THIS IN YET- WILL BE A SELECT STATEMENT, GATHERING ALL URLS

        def parse(self, response):
            hxs = HtmlXPathSelector(response)

            # Expect only singular items (i.e. one title, one rating, etc.)
            single_info = SingleItem()
            title = hxs.select('[title tags here]').extract()
            distributor = hxs.select('[distributor tags here]').extract()
            rating = hxs.select('[rating tags here]').extract()
            director = hxs.select('[director tags here]').extract()
            genre = hxs.select('[genre tags here]').extract()

            single_items = []
            single_info['title'] = title
            single_info['distributor'] = distributor
            single_info['rating'] = rating
            single_info['director'] = director
            single_info['genre'] = genre
            single_items.append(single_info)  # Note: not sure if I want to return this or the single_info
            #return single_items

            # Multiple items in a field
            actors = hxs.select('[actor tags here]').extract()
            producers = hxs.select('[producer tags here]').extract()

            actor_items = []
            for actor in actors:
                multi_info = MultiItem()  # a fresh item each time; reusing one object would alias every list entry
                multi_info['title'] = title
                multi_info['actor'] = actor
                actor_items.append(multi_info)
            #return actor_items - can I have multiple returns in my code to specify which pipeline is used, or which table this should be inserted into?

            producer_items = []
            for producer in producers:
                multi_info = MultiItem()
                multi_info['title'] = title
                multi_info['producer'] = producer
                producer_items.append(multi_info)
            #return producer_items - same issue - are multiple returns allowed? Should I try to put both the 'single' items and 'multiple' items in one big 'items' list? Can scrapy figure that out, or how would I go about specifying?
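As I understand it, a Scrapy callback can return (or yield) a single iterable mixing several kinds of item in one stream, so only one `return` is needed. A pure-Python sketch of that pattern, with plain dicts standing in for `SingleItem`/`MultiItem` and all names hypothetical:

```python
def parse_sketch(title, actors, producers):
    # One callback emits several "shapes" of record in a single stream;
    # Scrapy feeds each yielded item through the pipeline individually.
    yield {'title': title, 'rating': 'PG-13'}         # the singular info
    for actor in actors:
        yield {'title': title, 'actor': actor}        # one record per actor
    for producer in producers:
        yield {'title': title, 'producer': producer}  # one record per producer

records = list(parse_sketch('Example Movie', ['A', 'B'], ['P']))
```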
I have put comments on the places that may be unclear; I am just not sure how to direct everything so that it ends up in the appropriate table. This may become clearer on reading the pipeline, which is:
    import MySQLdb


    class IndMoviePipeline(object):

        def __init__(self):
            # Initiate the database connection
            self.conn = MySQLdb.connect(user='testuser', passwd='test', db='testdb',
                                        host='localhost', charset='utf8', use_unicode=True)
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            try:
                if 'producer' in item:
                    self.cursor.execute("""INSERT INTO Producers (title, producer) VALUES (%s, %s)""",
                                        (item['title'], item['producer']))
                elif 'actor' in item:
                    self.cursor.execute("""INSERT INTO Actors (title, actor) VALUES (%s, %s)""",
                                        (item['title'], item['actor']))
                else:
                    self.cursor.execute("""INSERT INTO Other_Info (title, distributor, rating, director, genre) VALUES (%s, %s, %s, %s, %s)""",
                                        (item['title'], item['distributor'], item['rating'], item['director'], item['genre']))
                    # NOTE: I will likely change 'Other_Info' to just populating the original table from which the URLs are pulled
                self.conn.commit()
            except MySQLdb.Error as e:
                print "Error %d: %s" % (e.args[0], e.args[1])
            return item
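To check how that `if`/`elif` chain routes records, here is a dry run with the SQL replaced by a stub that just returns the target table name (a sketch mirroring the dispatch above, not the real pipeline):

```python
def route(item):
    # Mirrors the if/elif chain in process_item, but returns the
    # destination table instead of executing an INSERT.
    if 'producer' in item:
        return 'Producers'
    elif 'actor' in item:
        return 'Actors'
    return 'Other_Info'

tables = [route(i) for i in (
    {'title': 'X', 'producer': 'P'},
    {'title': 'X', 'actor': 'A'},
    {'title': 'X', 'distributor': 'Fox', 'rating': 'PG-13',
     'director': 'D', 'genre': 'Comedy'},
)]
```

Note that the keys are tested in order, so an item carrying both `producer` and `actor` fields would only ever reach the `Producers` branch.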
I think that will work to direct each `item` to the appropriate `table` within the database. Based on that, I think it would work to have one big list of `items` and append everything to it, so:
    items = []
    items.append(single_info)

    for producer in producers:
        multi_info = MultiItem()
        multi_info['title'] = title
        multi_info['producer'] = producer
        items.append(multi_info)

    for actor in actors:
        multi_info = MultiItem()
        multi_info['title'] = title
        multi_info['actor'] = actor
        items.append(multi_info)
and just let the `pipeline` sort it all out with those `if` statements. I am not sure, though, whether this is the best way to do it, and I would really appreciate any suggestions.
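One detail worth being careful about in loops like these: a fresh `MultiItem` must be created on each iteration. Appending one shared object means every list entry aliases the same mapping, so later assignments overwrite earlier entries. A quick demonstration with plain dicts standing in for items:

```python
# Buggy pattern: one shared mapping, mutated in place.
shared = {}
aliased = []
for actor in ['A', 'B']:
    shared['actor'] = actor
    aliased.append(shared)      # every entry is the *same* object

# Fixed pattern: a new mapping per iteration.
fresh = []
for actor in ['A', 'B']:
    fresh.append({'actor': actor})
```

After the first loop, both entries of `aliased` hold the last actor, while `fresh` keeps one distinct record per actor.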