
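I am building a Twitter scraper application in Python using the Tweepy and MySQLdb modules.

It will fetch millions of tweets, so performance is a concern. Before adding a tweet, I want to check whether its tweet_id already exists in the table, and I want to do both in the same query.

The table schema is: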

  *id* |   tweet_id             |     text
  _____|________________________|______________________________
    1  |   259327533444925056   |     sample tweet1
  _____|________________________|______________________________
    2  |   259327566714923333   |     this is a sample tweet2 

This is the code I tried, but it runs two queries:

# Check that the tweet doesn't exist first
q = "SELECT COUNT(*) FROM tweets WHERE tweet_id = %s"
cur.execute(q, (tweet.id,))
result = cur.fetchone()
found = result[0]
if found == 0:
    # Parameterized insert; the driver escapes the values
    q = "INSERT INTO lexicon_nwindow (tweet_id, text) VALUES (%s, %s)"
    cur.execute(q, (tweet.id, tweet.text))

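If I make tweet_id unique and simply insert every tweet, will that raise exceptions and be inefficient?

What is the best-performing way to achieve this with a single query?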


4 Answers


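If you make tweet_id the primary key (and drop the id field), you can use INSERT IGNORE or REPLACE INTO. One statement solves both problems.

If you want to keep the id field, set it as an index/unique and make it auto-increment. Knowing that tweet_id can serve as the primary key, I would avoid this approach.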
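As a minimal sketch of the INSERT IGNORE route, assuming a hypothetical table named `tweets` whose primary key (or a unique index) is `tweet_id`, and a DB-API cursor such as the one MySQLdb provides. The function names are made up for illustration.

```python
def insert_tweet_if_new(cur, tweet_id, text):
    """Insert one tweet; return True if a row was added, False if it was a duplicate.

    Assumes a hypothetical table `tweets` with a primary key or unique
    index on `tweet_id`, and a DB-API cursor using the %s paramstyle.
    """
    # INSERT IGNORE downgrades the duplicate-key error to a warning and skips the row.
    sql = "INSERT IGNORE INTO tweets (tweet_id, text) VALUES (%s, %s)"
    cur.execute(sql, (tweet_id, text))
    # rowcount is 1 when the row was inserted and 0 when it was skipped.
    return cur.rowcount == 1


def insert_tweets_bulk(cur, tweets):
    """Insert many (tweet_id, text) pairs in one call, skipping duplicates."""
    sql = "INSERT IGNORE INTO tweets (tweet_id, text) VALUES (%s, %s)"
    cur.executemany(sql, list(tweets))
```

Note that REPLACE INTO deletes the existing row and inserts a new one when the key already exists, so for tweets whose text does not change, INSERT IGNORE is likely the cheaper of the two. Batching with `executemany` may also help at this volume, though that is worth measuring rather than assuming.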

Hope this helps.

Harry

answered 2012-10-30T17:55:02.027

The answer is: profile, don't speculate.

I don't mean to be dismissive. We don't know what will be fastest:

  • SELECT + (in code) conditional INSERT
  • REPLACE INTO
  • INSERT IGNORE
  • INSERT ... SELECT ... WHERE NOT EXISTS (...)
  • INSERT and (in code) ignore error

We don't know the rate of data, the frequency of duplicates, the server configuration, whether there are multiple writers simultaneously, etc.

Profile, don't speculate.
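One way to act on this is a small timing harness. The sketch below is only an illustration: `time_strategy` and `compare_strategies` are hypothetical helpers, and each strategy is assumed to be a callable that takes a list of (tweet_id, text) pairs and performs the inserts one particular way.

```python
import time


def time_strategy(insert_fn, rows, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs.

    `insert_fn` is a hypothetical callable that takes a list of
    (tweet_id, text) pairs and performs the inserts with one strategy.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        insert_fn(rows)
        best = min(best, time.perf_counter() - start)
    return best


def compare_strategies(strategies, rows, repeats=3):
    """Time every strategy on the same rows; return (name, seconds) pairs, fastest first."""
    timings = {name: time_strategy(fn, rows, repeats) for name, fn in strategies.items()}
    return sorted(timings.items(), key=lambda item: item[1])
```

For the numbers to be comparable, each run should start from the same table state (for example, a freshly truncated test table) and use rows with a duplicate rate close to what the real scraper sees.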

answered 2012-10-31T01:19:00.863
# Check that the tweet doesn't exist first
q = "SELECT COUNT(*) FROM tweets WHERE tweet_id = %s"
cur.execute(q, (tweet.id,))
result = cur.fetchone()
found = result[0]
if found == 0:
    q = "REPLACE INTO lexicon_nwindow (tweet_id, text) VALUES (%s, %s)"
    cur.execute(q, (tweet.id, tweet.text))
answered 2012-10-30T17:59:23.813

Use INSERT ... SELECT instead of INSERT ... VALUES, and add a WHERE clause to the SELECT that checks whether your tweet.id is already in the table:

q = """INSERT INTO lexicon_nwindow (tweet_id, text)
SELECT %s, %s FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM tweets WHERE tweet_id = %s)"""
cur.execute(q, (tweet.id, tweet.text, tweet.id))
answered 2012-10-30T19:28:26.587