I'm trying to avoid scraping the same information more than once. I run this spider every morning to scrape jobs from a job board, then I copy them into Excel and use Remove Duplicates on the URL column. I would like to do this in Scrapy instead (I can change the txt file to CSV), and I would be happy to implement middleware if that is the better way to do it.
This is the pipeline that I am trying to use:
from scrapy.exceptions import DropItem

class CraigslistSamplePipeline(object):

    def process_item(self, item, spider):
        # open my txt file with urls from previous scrapes
        with open('URLlog.txt', 'r') as f:
            urlx = [url.strip() for url in f]  # extract each stored url
        if item["website_url"] in urlx:  # is the url being scraped already in the list?
            raise DropItem('Item already in db')  # skip the record if it is
        return item  # otherwise keep the item
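For reference, this is roughly what I imagine the full version would need to do: load the old URLs once when the spider opens, drop anything already seen, and write the updated list back when the spider closes so tomorrow's run also skips today's jobs. This is only a sketch based on my setup (field name website_url, file URLlog.txt), so please correct anything that's off:

import os
from scrapy.exceptions import DropItem

class CraigslistSamplePipeline(object):

    def open_spider(self, spider):
        # load urls from previous runs into a set for fast lookups
        self.seen = set()
        if os.path.exists('URLlog.txt'):
            with open('URLlog.txt', 'r') as f:
                self.seen = {line.strip() for line in f if line.strip()}

    def process_item(self, item, spider):
        url = item["website_url"]
        if url in self.seen:
            # already scraped on a previous run (or earlier in this run)
            raise DropItem('Item already in db')
        self.seen.add(url)
        return item

    def close_spider(self, spider):
        # write the updated list back so the next run skips these urls
        with open('URLlog.txt', 'w') as f:
            for url in sorted(self.seen):
                f.write(url + '\n')

If I understand the docs right, the pipeline also has to be enabled in settings.py with something like ITEM_PIPELINES = {'craigslist_sample.pipelines.CraigslistSamplePipeline': 300} (the craigslist_sample module name is just a guess based on my project layout).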
I'm not confident this code is right, so can someone please suggest how I should do this? I'm very new to this, so explaining each line would help me a lot. I hope my question makes sense and someone can help me.
I've looked at these posts for help, but was not able to solve my problem:
How to Filter from CSV file using Python Script