python - Tracking keywords in a live stream of tweets

Question

I have installed and tried out tweepy, and I am currently using the following function:

API.public_timeline()

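Returns the 20 most recent statuses from non-protected users who have set a custom user icon. The public timeline is cached for 60 seconds, so requesting it more often than that is a waste of resources.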

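However, I would like to extract all tweets from the complete live stream that match a certain regular expression. I could put public_timeline() inside a while True loop, but that would probably run into rate limiting. Either way, I don't think it can cover all current tweets.

How can that be done? If not all tweets, then I would like to extract as many tweets as possible that match a certain keyword.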

score 2 · Accepted Answer

The streaming API is what you want. I use a library called tweetstream. Here is my basic listening function:

import time

import tweetstream

# username, password and db_initpop() are assumed to be defined elsewhere.

def retrieve_tweets(numtweets=10, *args):
    """
    This function optionally takes one or more arguments as keywords to filter tweets.
    It iterates through tweets from the stream that meet the given criteria and sends them
    to the database population function on a per-instance basis, so as to avoid disaster
    if the stream is disconnected.

    Both SampleStream and FilterStream methods access Twitter's stream of status elements.
    For status element documentation, (including proper arguments for tweet['arg'] as seen
    below) see https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
    """
    # Every extra positional argument becomes a keyword to track.
    filters = [str(key) for key in args]
    if not filters:
        stream = tweetstream.SampleStream(username, password)
    else:
        stream = tweetstream.FilterStream(username, password, track=filters)
    try:
        count = 0
        while count < numtweets:
            # Read from the stream until one usable tweet has been stored, then count it.
            for tweet in stream:
                # a check is needed on text as some "tweets" are actually just API operations
                # the language selection doesn't really work but it's better than nothing(?)
                if tweet.get('text') and tweet['user']['lang'] == 'en':
                    if tweet['retweet_count'] == 0:
                        # bundle up the features I want and send them to the db population function
                        bundle = (tweet['id'], tweet['user']['screen_name'],
                                  tweet['retweet_count'], tweet['text'])
                        db_initpop(bundle)
                        break
                    else:
                        # a RT has a different structure.  This bundles the original tweet.  Getting the
                        # retweets comes later, after the stream is de-accessed.
                        bundle = (tweet['retweeted_status']['id'],
                                  tweet['retweeted_status']['user']['screen_name'],
                                  tweet['retweet_count'], tweet['retweeted_status']['text'])
                        db_initpop(bundle)
                        break
            count += 1
    except tweetstream.ConnectionError as e:
        print('Disconnected from Twitter at ' + time.strftime("%d %b %Y %H:%M:%S", time.localtime())
              + '.  Reason: ' + str(e.reason))

It's been a while since I looked, but I'm pretty sure this library only accesses the sample stream (as opposed to the firehose). HTH.

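Edited to add: you say you want the "complete live stream", aka the firehose. That is expensive both financially and technically, and only very large companies are allowed to have it. Look at the docs and you'll see that the sample is basically representative.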

score 1 · Accepted Answer

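Have a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets matching those words are returned.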
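Since the question asks for regular-expression matching and the stream itself filters on words, one option is to track a broad keyword and then apply the regex locally. Here is a minimal sketch of that second step. It assumes each stream message has already been parsed into a dict with an optional 'text' key; matching_tweets, the sample data and the pattern are all made up for illustration.

```python
import re

def matching_tweets(tweets, pattern):
    """Yield the tweets whose 'text' field matches the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    for tweet in tweets:
        text = tweet.get('text')
        # Some stream messages are not tweets and carry no 'text' field.
        if text and regex.search(text):
            yield tweet

# Hypothetical sample data standing in for messages read from a stream.
sample = [
    {'text': 'Learning Python 3 today'},
    {'delete': {'status': {'id': 1}}},
    {'text': 'Nothing relevant here'},
    {'text': 'python2 is still around'},
]
matched = [t['text'] for t in matching_tweets(sample, r'\bpython\s?\d\b')]
```

Because it is a generator, it can be wrapped directly around a live stream iterator without buffering anything.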

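Rate limiting works differently for the streaming API: you get 1 connection per IP and a maximum number of events per second. If more events occur than that, you only receive the maximum anyway, along with a notification of how many events you missed because of rate limiting.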
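To keep an eye on how much you are missing, you can separate those notifications from the actual tweets as they arrive. The sketch below assumes the notice arrives as a message shaped like {"limit": {"track": N}}, with N being a running total of undelivered tweets since the connection was opened. That shape is how I remember it from the streaming docs, so verify it against the current documentation. summarize and the sample messages are hypothetical.

```python
def summarize(messages):
    """Separate tweets from limit notices; return (tweets, missed_count)."""
    tweets, missed = [], 0
    for msg in messages:
        if 'limit' in msg:
            # Assumption: 'track' is a running total, so keep the largest value seen.
            missed = max(missed, msg['limit'].get('track', 0))
        elif 'text' in msg:
            tweets.append(msg)
    return tweets, missed

# Hypothetical parsed messages, in the order they came off the stream.
sample = [
    {'text': 'first'},
    {'limit': {'track': 5}},
    {'text': 'second'},
    {'limit': {'track': 12}},
]
tweets, missed = summarize(sample)
```

If the missed count grows quickly, the tracked keywords are probably too broad for the rate you are allowed.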

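My understanding is that the streaming API is best suited to a server that redistributes content to your users as needed, rather than being accessed directly by your users. Standing connections are expensive, and Twitter starts blacklisting IPs after too many failed connections and reconnections, and possibly your API key after that.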
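Given the blacklisting risk, it is worth spacing out reconnection attempts instead of retrying in a tight loop. Below is a rough sketch of exponential backoff. The base delay, the cap and the number of attempts are illustrative values rather than Twitter's official numbers, and connect stands for whatever function opens your stream.

```python
import time

def backoff_delays(base=5.0, cap=320.0, attempts=8):
    """Return wait times in seconds for successive reconnect attempts."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, cap)  # double each time, up to the cap
    return delays

def connect_with_retry(connect, delays, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between failed attempts."""
    for delay in delays:
        try:
            return connect()
        except OSError:  # assumption: connection failures surface as OSError
            sleep(delay)
    return connect()  # final attempt; let any error propagate
```

Passing sleep in as a parameter makes the retry logic easy to test without actually waiting.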
