I have a script that uses pymongo to store incoming tweets in my local MongoDB:

import json
import pymongo
import tweepy

consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)


class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        # Name this class in super() so the base initializer runs and stores api.
        super(CustomStreamListener, self).__init__(api)

        self.db = pymongo.MongoClient().test

    def on_data(self, tweet):
        self.db.tweets.insert(json.loads(tweet))

    def on_error(self, status_code):
        return True # Don't kill the stream

    def on_timeout(self):
        return True # Don't kill the stream


sapi = tweepy.streaming.Stream(auth, CustomStreamListener(api))
sapi.filter(locations=[-74, 40, -73, 41])

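Currently I store the complete tweet, which is far more information than I actually need. How can I change the existing script so that it stores only the following information:

i) Hashtag ii) UserID iii) PlaceID iv) Timestamp?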

2 Answers

In on_data, parse the JSON, pull out the data you are interested in, and save it:

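def on_data(self, tweet):
    tweet_parsed = json.loads(tweet)
    # The stream also carries non-tweet messages (e.g. delete notices); skip them.
    if 'created_at' not in tweet_parsed:
        return True
    # Collect the text of every hashtag, not just the last one.
    hashtags = [h['text'] for h in tweet_parsed['entities']['hashtags']]
    # 'coordinates' is null unless the tweet carries an exact point,
    # stored as [longitude, latitude].
    coordinates = tweet_parsed.get('coordinates')
    longitude, latitude = coordinates['coordinates'] if coordinates else (None, None)
    # Save only the fields of interest.
    self.db.tweets.insert({
        'hashtags': hashtags,
        'user_id': tweet_parsed['user']['id'],
        'longitude': longitude,
        'latitude': latitude,
        'timestamp': tweet_parsed['created_at'],
    })
    return True  # Don't kill the stream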
answered 2013-09-12T14:22:04.990

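In on_data, don't pass the raw tweet object to .insert. Instead, create a new local object that contains only the fields you want, and copy the values over from the tweet object.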
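
A minimal sketch of that idea, in case it helps. The helper name extract_fields is hypothetical (it is not part of tweepy or pymongo), and the place_id lookup assumes the tweet JSON carries a nullable place object with an id key; the other keys are the ones used in the other answer. Treat it as a starting point rather than a definitive implementation.

```python
import json


def extract_fields(raw_tweet):
    """Return a dict with only the wanted fields, or None for non-tweets.

    Hypothetical helper for illustration; not part of tweepy or pymongo.
    """
    tweet = json.loads(raw_tweet)
    # Non-tweet stream messages (e.g. delete notices) have no 'created_at'.
    if 'created_at' not in tweet:
        return None
    entities = tweet.get('entities') or {}
    user = tweet.get('user') or {}
    # Assumption: 'place' is either null or an object with an 'id' key.
    place = tweet.get('place') or {}
    return {
        'hashtags': [h['text'] for h in entities.get('hashtags', [])],
        'user_id': user.get('id'),
        'place_id': place.get('id'),
        'timestamp': tweet['created_at'],
    }


# Inside the listener, on_data could then look like this (assumed wiring):
#
#     def on_data(self, tweet):
#         doc = extract_fields(tweet)
#         if doc is not None:
#             self.db.tweets.insert(doc)
#         return True  # Don't kill the stream
```

Keeping the extraction in a plain function means you can try it on a saved tweet without opening a stream or a database connection.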

answered 2013-09-12T13:21:37.873