
I'm trying to avoid scraping the same information more than once. I run this spider every morning to scrape jobs from a job board, then I copy them into Excel and use "Remove Duplicates" on the list, keyed on the URL. I would like to do this in Scrapy instead (I can change the txt file to a CSV), and I would be happy to implement middleware to do it.

This is the pipeline that I am trying to use:

class CraigslistSamplePipeline(object):



    def find_row_by_id(item):
        with open('URLlog.txt', 'r') as f:                # open my txt file with urls from previous scrapes
            urlx = [url.strip() for url in f.readlines()] # extract each url
            if urlx == item ["website_url"]:              # compare old url to URL being scraped
            raise DropItem('Item already in db')      # skip record if in url list
        return

I'm sure this code is wrong. Can someone please suggest how I can do this? I'm very new to this, so explaining each line would help me a lot. I hope my question makes sense and someone can help me.

I've looked at these posts for help, but was not able to solve my problem:

How to Filter from CSV file using Python Script

Scrapy - Spider crawls duplicate urls

how to filter duplicate requests based on url in scrapy


1 Answer


Use the in keyword, like this:

    if item['website_url'] in urlx:
        raise DropItem('Item already in db')

urlx is what you load from the file in which every line is a URL; it is now a list. The in keyword checks whether the website URL is in the list urlx, and returns True if it is. Keep in mind that in my example the comparison is case-sensitive. You may want to call .lower() on both the website URL and the URLs loaded from the file.
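
For example, a case-insensitive version of the check could look like the following. This is just an illustrative variant of the snippet above, lower-casing the URLs both when they are read from the file and when the item is compared:

    # load the previously scraped URLs, lower-cased
    with open('URLlog.txt', 'r') as f:
        urlx = [url.strip().lower() for url in f.readlines()]

    # case-insensitive check against the URL of the item being scraped
    if item['website_url'].lower() in urlx:
        raise DropItem('Item already in db')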

There are more efficient ways of doing this, but I assume you just want something that works.
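
To put the pieces together, here is a minimal sketch of how the whole pipeline could look. The website_url field and the URLlog.txt file name come from the question; loading the log into a set once per run (in open_spider) and appending each new URL back to the log are my own assumptions about what you want:

    from scrapy.exceptions import DropItem

    class CraigslistSamplePipeline(object):

        def open_spider(self, spider):
            # Load the previously scraped URLs once, instead of re-reading the file for every item.
            try:
                with open('URLlog.txt', 'r') as f:
                    self.seen_urls = set(url.strip().lower() for url in f)
            except IOError:
                # First run: no log file yet, so start with an empty set.
                self.seen_urls = set()

        def process_item(self, item, spider):
            url = item['website_url'].strip().lower()
            if url in self.seen_urls:
                # This URL was scraped on a previous run, so skip the item.
                raise DropItem('Item already in db')
            # New URL: remember it and append it to the log so future runs skip it too.
            self.seen_urls.add(url)
            with open('URLlog.txt', 'a') as f:
                f.write(url + '\n')
            return item

Remember that the pipeline only runs if it is enabled under ITEM_PIPELINES in settings.py.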

answered 2013-08-01 04:03:23