
I want to scrape some specific webpages on a regular basis (e.g. every hour), using Python. The scraped results should be inserted into an SQLite table. New information will be scraped, but 'old' information will also be scraped again, since the Python script runs every hour.

To be more precise, I want to scrape a sports results page where more and more match results are published on the same page as the tournament proceeds. So with each new scrape, only the new results need to be entered into the SQLite table, since the older ones were already scraped (and inserted into the table) an hour before (or even earlier).

I also don't want to insert the same result twice when it gets scraped a second time, so there should be some mechanism to check whether a result has already been scraped. Can this be done at the SQL level? That is, I scrape the whole page and issue an INSERT statement for each result, but only those INSERT statements succeed whose rows were not already present in the database. I'm thinking of something like a UNIQUE keyword or so.
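For reference, here is a minimal sketch of that idea with Python's built-in `sqlite3` module; the table and column names are made up for illustration. A UNIQUE constraint combined with `INSERT OR IGNORE` makes SQLite silently skip rows it has already seen:

```python
import sqlite3

# Hypothetical schema; a UNIQUE constraint over the columns that
# identify one match lets the database reject duplicates itself.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        tournament TEXT,
        player1    TEXT,
        player2    TEXT,
        score      TEXT,
        UNIQUE (tournament, player1, player2)
    )
""")

def insert_result(row):
    # INSERT OR IGNORE drops any row that would violate the UNIQUE
    # constraint, so re-scraped results are simply not inserted again.
    conn.execute(
        "INSERT OR IGNORE INTO results (tournament, player1, player2, score) "
        "VALUES (?, ?, ?, ?)",
        row,
    )
    conn.commit()

insert_result(("Open 2013", "A", "B", "6-3 6-4"))
insert_result(("Open 2013", "A", "B", "6-3 6-4"))  # duplicate, ignored
count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(count)  # 1
```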

Or am I worrying too much about performance, and should I just do a DROP TABLE each time before scraping and scrape everything from scratch again? It isn't much data: about 100 records (= matches) per tournament and about 50 tournaments a year.

Basically I would just be interested in some kind of best-practice approach.


2 Answers


What you want to do is an upsert (update, or insert if the row doesn't exist). See here for how to do this in SQLite: SQLite UPSERT - ON DUPLICATE KEY UPDATE
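As a sketch, an upsert in SQLite can be written with the `ON CONFLICT ... DO UPDATE` clause (this needs SQLite >= 3.24; the linked question covers older workarounds). The table and columns here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE matches (
        match_id INTEGER PRIMARY KEY,
        score    TEXT
    )
""")

def upsert(match_id, score):
    # On a key collision, update the existing row instead of failing;
    # `excluded` refers to the row that would have been inserted.
    conn.execute(
        "INSERT INTO matches (match_id, score) VALUES (?, ?) "
        "ON CONFLICT (match_id) DO UPDATE SET score = excluded.score",
        (match_id, score),
    )
    conn.commit()

upsert(1, "in progress")
upsert(1, "6-4 7-5")  # same key: the row is updated, not duplicated
row = conn.execute("SELECT score FROM matches WHERE match_id = 1").fetchone()
print(row[0])  # 6-4 7-5
```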

Answered 2013-04-18T00:49:19.670

It seems you want to insert the data only if it doesn't already exist? Perhaps something like this:

  1. Check whether the entry exists
  2. Insert the data if it doesn't
  3. Update the entry if it does (do you want to update?)

You can issue two separate SQL statements: a SELECT, then an INSERT/UPDATE.
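The steps above could be sketched like this; the table and columns are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (match_id INTEGER PRIMARY KEY, score TEXT)")

def save_result(match_id, score):
    # 1. check whether the entry exists
    exists = conn.execute(
        "SELECT 1 FROM results WHERE match_id = ?", (match_id,)
    ).fetchone()
    if exists is None:
        # 2. insert the data if it doesn't
        conn.execute("INSERT INTO results VALUES (?, ?)", (match_id, score))
    else:
        # 3. update the entry if it does
        conn.execute(
            "UPDATE results SET score = ? WHERE match_id = ?", (score, match_id)
        )
    conn.commit()

save_result(7, "in progress")
save_result(7, "6-3 6-4")  # second call updates instead of duplicating
final = conn.execute("SELECT score FROM results WHERE match_id = 7").fetchone()
print(final[0])  # 6-3 6-4
```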

Or you can set a UNIQUE constraint; I believe sqlite will then raise IntegrityError:

import sqlite3

try:
  # your insert here
  pass
except sqlite3.IntegrityError:
  # the row is a duplicate; skip it
  pass
Answered 2013-04-17T14:59:16.613