I want to scrape some specific webpages on a regular basis (e.g. every hour), using Python. The scraped results should be inserted into an SQLite table. New info will be scraped, but 'old' information will also get scraped again, since the script runs every hour.
To be more precise, I want to scrape a sports results page, where more and more match results get published on the same page as the tournament proceeds. So with each new scrape I only need the new results entered into the SQLite table, since the older ones were already scraped (and inserted into the table) an hour before (or even earlier).
I also don't want to insert the same result twice when it gets scraped a second time, so there should be some mechanism to check whether a result has already been scraped. Can this be done at the SQL level? That is, I scrape the whole page and issue an `INSERT` statement for each result, but only those `INSERT` statements whose result is not already present in the database execute successfully. I'm thinking of something like a `UNIQUE` constraint.
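
Something like this rough sketch is what I have in mind (the table layout and column names such as `tournament`, `player1`, `player2` and `score` are just made-up placeholders; the point is the `UNIQUE` constraint combined with `INSERT OR IGNORE`):

```python
import sqlite3

conn = sqlite3.connect("results.db")

# A UNIQUE constraint over the columns that identify a match, so that
# re-inserting an already-scraped result violates the constraint.
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        tournament TEXT,
        player1    TEXT,
        player2    TEXT,
        score      TEXT,
        UNIQUE (tournament, player1, player2)
    )
""")

# Placeholder for whatever the scraper actually returns.
scraped = [("Some Open", "Player A", "Player B", "6-4 6-3")]

# INSERT OR IGNORE silently skips rows that would violate the UNIQUE
# constraint, so already-stored results are simply not inserted again.
conn.executemany(
    "INSERT OR IGNORE INTO results (tournament, player1, player2, score) "
    "VALUES (?, ?, ?, ?)",
    scraped,
)
conn.commit()
conn.close()
```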
Or am I overthinking the performance aspect, and should I just do a `DROP TABLE` each time before I start scraping and scrape everything from scratch again? We're not talking about much data here: roughly 100 records (= matches) per tournament and about 50 tournaments a year.
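
For comparison, the drop-and-rebuild alternative I have in mind would look roughly like this (again with made-up column names):

```python
import sqlite3

conn = sqlite3.connect("results.db")

# Throw everything away and rebuild the table from the freshly
# scraped page; no duplicate checking needed.
conn.execute("DROP TABLE IF EXISTS results")
conn.execute("""
    CREATE TABLE results (
        tournament TEXT,
        player1    TEXT,
        player2    TEXT,
        score      TEXT
    )
""")

scraped = [("Some Open", "Player A", "Player B", "6-4 6-3")]  # placeholder

conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    scraped,
)
conn.commit()
conn.close()
```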
Basically, I'm just interested in some kind of best-practice approach.