I am using the crawler framework "scrapy" in Python, and I use the pipelines.py file to store my scraped items in a file in JSON format. The code that does this is given below:

    import json

    class AYpiPipeline(object):
        def __init__(self):
            self.file = open("a11ypi_dict.json","ab+")

        # this method is called to process an item after it has been scraped.
        def process_item(self, item, spider):
            d = {}
            i = 0
            # Here we are iterating over the scraped items and creating a dictionary of dictionaries.
            try:
                while i<len(item["foruri"]):
                    d.setdefault(item["foruri"][i],{}).setdefault(item["rec"][i],{})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                    i+=1
            except IndexError:
                print "Index out of range"
            # Writing it to a file
            json.dump(d,self.file)
            return item
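The problem is that when I run my crawler twice (say), I get duplicate scraped items in my file. I tried to prevent this by first reading from the file and then comparing that data with the new data to be written. The data read from the file is in JSON format, so I decoded it with the json.loads() function, but it doesn't work: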
    import json

    class AYpiPipeline(object):
        def __init__(self):
            self.file = open("a11ypi_dict.json","ab+")
            self.temp = json.loads(file.read())

        # this method is called to process an item after it has been scraped.
        def process_item(self, item, spider):
            d = {}
            i = 0
            # Here we are iterating over the scraped items and creating a dictionary of dictionaries.
            try:
                while i<len(item["foruri"]):
                    d.setdefault(item["foruri"][i],{}).setdefault(item["rec"][i],{})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                    i+=1
            except IndexError:
                print "Index out of range"
            # Writing it to a file
            if d!=self.temp: #check whether the newly generated data doesn't match the one already in the file
                json.dump(d,self.file)
            return item
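
One likely reason the decode step fails can be shown in isolation. This is a minimal sketch, not a diagnosis of the exact traceback: it assumes the file holds the output of several `json.dump` calls appended one after another, which is what append mode produces here. The helper `try_decode` is a hypothetical name used only for this illustration, and the snippet is written in Python 3 syntax. Separately, note that `file.read()` in `__init__` refers to the builtin `file` type in Python 2 rather than to `self.file`.

```python
import io
import json

# Simulate two crawler runs appending to the same file with json.dump.
buf = io.StringIO()
json.dump({"a": 1}, buf)
json.dump({"a": 1}, buf)
content = buf.getvalue()  # two objects back to back, with no separator


def try_decode(text):
    """Return the decoded value, or None if text is not a single JSON document."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


decoded_all = try_decode(content)  # concatenated objects are not one valid document
decoded_empty = try_decode("")     # an empty (brand-new) file is not valid JSON either
```

Both calls return `None` in this sketch, which suggests that a single `json.loads()` over the whole file cannot work once more than one object has been appended, or before anything has been written.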
Please suggest a way to do this.
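Note: I have to open the file in "append" mode, since I may crawl a different set of links, but running the crawler twice with the same start_url should not write the same data to the file twice.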
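
For illustration, here is one possible way the requirement above could be met. It is a sketch under stated assumptions, not a definitive implementation. It assumes it is acceptable to store one JSON object per line instead of concatenated objects, so that earlier records can be read back and compared. The class name `DedupJsonLinesPipeline` and the file name `a11ypi_dict.jsonl` are hypothetical, and the code uses Python 3 syntax. The item fields are the ones from the code above.

```python
import json
import os


class DedupJsonLinesPipeline(object):
    """Hypothetical pipeline: one JSON object per line, duplicates skipped."""

    def __init__(self, path="a11ypi_dict.jsonl"):
        self.path = path
        self.seen = set()
        # Load the records written by earlier runs, if the file exists.
        if os.path.exists(path):
            with open(path, "r") as existing:
                for line in existing:
                    line = line.strip()
                    if line:
                        self.seen.add(line)
        # Append mode keeps data from runs that crawled other links.
        self.file = open(path, "a")

    def process_item(self, item, spider):
        d = {}
        # zip stops at the shortest list, so no IndexError handling is needed.
        for foruri, rec, foruri_id, thisid in zip(
            item["foruri"], item["rec"], item["foruri_id"], item["thisid"]
        ):
            d.setdefault(foruri, {}).setdefault(rec, {})[foruri_id] = (
                item["thisurl"] + ":" + thisid
            )
        # sort_keys makes the serialized text stable across runs.
        record = json.dumps(d, sort_keys=True)
        if record not in self.seen:
            self.seen.add(record)
            self.file.write(record + "\n")
            self.file.flush()
        return item

    def close_spider(self, spider):
        # Scrapy calls this hook on pipelines when the spider finishes.
        self.file.close()
```

The design choice here is to compare serialized records rather than decoded dictionaries: a set of strings gives a constant-time membership check, and the file stays append-only. The trade-off is that all previously written lines are held in memory, which may not suit very large files.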