python - Storing JSON dictionaries from the Twitter Streaming API with PyMongo

Question

I currently have a large number of tweets that I plan to store on my lab's server. However, I'm having some trouble deciding how to go about it.

For example, a tweet has the following format:

{
    "contributors": null,
    "coordinates": null,
    "created_at": "Tue Jul 10 17:09:12 +0000 2012",
    "entities": {
        "hashtags": [{
            "indices": [62, 78],
            "text": "thestrongnation"
        }],
        "urls": [],
        "user_mentions": [{
            "id": 376483630,
            "id_str": "376483630",
            "indices": [0, 8],
            "name": "SherryHonig",
            "screen_name": "sahonig"
        }]
    },
    "favorited": false,
    "geo": null,
    "id": 222739261219282945,
    "id_str": "222739261219282945",
    "in_reply_to_screen_name": "sahonig",
    "in_reply_to_status_id": 222695060528037889,
    "in_reply_to_status_id_str": "222695060528037889",
    "in_reply_to_user_id": 376483630,
    "in_reply_to_user_id_str": "376483630",
    "place": {
        "attributes": {},
        "bounding_box": {
            "coordinates": [
                [
                    [-106.645646, 25.837164000000001],
                    [-93.508038999999997, 25.837164000000001],
                    [-93.508038999999997, 36.500703999999999],
                    [-106.645646, 36.500703999999999]
                ]
            ],
            "type": "Polygon"
        },
        "country": "United States",
        "country_code": "US",
        "full_name": "Texas, US",
        "id": "e0060cda70f5f341",
        "name": "Texas",
        "place_type": "admin",
        "url": "http://api.twitter.com/1/geo/id/e0060cda70f5f341.json"
    },
    "retweet_count": 0,
    "retweeted": false,
    "source": "web",
    "text": "@sahonig BOOM !!!! I feel a 1 coming on!!! Awesome! #thestrongnation",
    "truncated": false,
    "user": {
        "contributors_enabled": false,
        "created_at": "Wed Feb 15 14:40:48 +0000 2012",
        "default_profile": false,
        "default_profile_image": false,
        "description": "Living life on 30A & doing it my way. My mind is Stronger than physical challenge. Runner, Crosfit, Fitness Challenges. Proud member of #thestrongnation. ",
        "favourites_count": 17,
        "follow_request_sent": null,
        "followers_count": 215,
        "following": null,
        "friends_count": 184,
        "geo_enabled": true,
        "id": 493181025,
        "id_str": "493181025",
        "is_translator": false,
        "lang": "en",
        "listed_count": 4,
        "location": "Seagrove Beach, FL",
        "name": "30A My Way \u2600",
        "notifications": null,
        "profile_background_color": "c0deed",
        "profile_background_image_url": "http://a0.twimg.com/profile_background_images/590670431/aj7p0c6j2oevdj240jz2.jpeg",
        "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/590670431/aj7p0c6j2oevdj240jz2.jpeg",
        "profile_background_tile": true,
        "profile_image_url": "http://a0.twimg.com/profile_images/2381704869/b7bizspexjgmyspqesg0_normal.jpeg",
        "profile_image_url_https": "https://si0.twimg.com/profile_images/2381704869/b7bizspexjgmyspqesg0_normal.jpeg",
        "profile_link_color": "0084B4",
        "profile_sidebar_border_color": "C0DEED",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_text_color": "333333",
        "profile_use_background_image": true,
        "protected": false,
        "screen_name": "30A_MyWay",
        "show_all_inline_media": false,
        "statuses_count": 1731,
        "time_zone": "Central Time (US & Canada)",
        "url": null,
        "utc_offset": -21600,
        "verified": false
    }
}

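This is, of course, a dictionary in Python that happens to follow the JSON format. MongoDB conveniently accepts documents in JSON form, but the problem is that I don't want to store all of this information. The Streaming API gives me 20 fields, whereas for now I only want to work with the user ID, the text, and the location. I originally planned to parse the data and extract only the parts I wanted, but I couldn't find a reliable parser, and I feel that writing one would just be a waste of time given the conditions under which this is being developed.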

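Another solution I'm considering: since these are being read into MongoDB anyway, maybe I could store only the parts of the dictionary I want and discard the rest. The one problem is that the file I receive from Twitter puts all of the dictionaries on the same line, so I feel I would have to do some kind of extraction either way.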

Does anyone have any suggestions?

score 1 · Accepted Answer

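If you have to, you can use json.loads to take the result and put it into a Python structure (if it isn't one already) so that you can manipulate it. It returns a dict for a single tweet in the format above, or a list of dicts if the input is a JSON array. (Normally, though, you would use a Python Twitter library that does this transparently.)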
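If the file really does run several tweet objects together on one line, as the question describes, a plain json.loads call on the whole line will fail. Here is a minimal sketch of one way around that, using json.JSONDecoder.raw_decode from the standard library to peel objects off one at a time. The helper name iter_json_objects and the sample string are made up for illustration.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value found in a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: two tiny tweet-like objects on the same line.
sample = '{"id": 1, "text": "first"}{"id": 2, "text": "second"} '
tweets = list(iter_json_objects(sample))
```

If the file instead has one tweet per line, calling json.loads on each non-empty line is simpler and should be enough.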

Then just build a new dict containing the data you want and insert it into MongoDB, for example:

Assuming ret = the tweet response above:

mydata = {
    'name': ret['user']['screen_name'],
    'text': ret['text']
}

print(mydata['name'], 'wrote', mydata['text'])  # or something

# insert mydata into appropriate MongoDB DB/collection here
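As a rough sketch of that last step, the following keeps only the user ID, text, and location the question asks about and hands the trimmed dicts to a PyMongo collection. Several things here are assumptions rather than anything from the question: the function names, the choice of place.full_name as "location", the database and collection names, and a MongoDB server running locally on the default port.

```python
def project_tweet(tweet):
    """Return a small dict holding only the fields we want to store."""
    # "place" and "user" may be missing or null, so fall back to empty dicts.
    place = tweet.get("place") or {}
    user = tweet.get("user") or {}
    return {
        "user_id": user.get("id"),
        "text": tweet.get("text"),
        # Assumption: "location" means the tweet's place, not the profile location.
        "location": place.get("full_name"),
    }


def store_tweets(collection, tweets):
    """Insert the trimmed tweets into a collection; return how many were sent."""
    docs = [project_tweet(t) for t in tweets]
    if docs:  # PyMongo's insert_many rejects an empty list
        collection.insert_many(docs)
    return len(docs)


def get_collection():
    """Connect to a local MongoDB (assumed) and return a placeholder collection."""
    from pymongo import MongoClient  # imported lazily; requires pymongo
    client = MongoClient("mongodb://localhost:27017")
    return client["twitter"]["tweets"]  # hypothetical database/collection names
```

With a server running, usage would look like store_tweets(get_collection(), [ret]). Passing the collection in as an argument keeps the trimming logic testable without a database.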
