I am trying to follow the Intro to Data Sci Coursera course, but I ran into a problem while trying to parse the JSON response from Twitter.

I am trying to retrieve the text from JSON in the following format.

{u'delete': {u'status': {u'user_id_str': u'702327198', u'user_id': 702327198, u'id': 332772178690981889L, u'id_str': u'332772178690981889'}}}, {u'delete': {u'status': {u'user_id_str': u'864736118', u'user_id': 864736118, u'id': 332770710667792384L, u'id_str': u'332770710667792384'}}}, {u'contributors': None, u'truncated': False, **u'text'**: u'RT @afgansyah_reza: Lagi ngantri. Ada ibu2 & temennya. "Ih dia mukanya mirip banget sama Afgan.", trus ngedeketin gw, "Tuh kan.. Mirip bang\u2026', u'in_reply_to_status_id': None, u'id': 332772350640668672L, u'favorite_count': 0, ....... ]

This is the code I am using:

def hw():
    data = []
    count=0
    with open('output.txt') as f:
        for line in f:
            encoded_string = line.strip().encode('utf-8')
            data.append(json.loads(encoded_string))

    print data  # generates the input to the next block
    for listval in data:  # individual block
        if "text" in listval:
            print listval["text"]
        else:
            continue

However, when I run it, I get the following output and error:

   RT @afgansyah_reza: Lagi ngantri. Ada ibu2 & temennya. "Ih dia mukanya mirip banget sama Afgan.", trus ngedeketin gw, "Tuh kan.. Mirip bang…
RT @Dimaz_CSIX: Kolor pakek pita #laguharlemshake
Traceback (most recent call last):
  File "F:\ProgrammingPoint\workspace-new\PyTest\tweet_sentiment.py", line 41, in <module>
    main()
  File "F:\ProgrammingPoint\workspace-new\PyTest\tweet_sentiment.py", line 36, in main
    hw()
  File "F:\ProgrammingPoint\workspace-new\PyTest\tweet_sentiment.py", line 23, in hw
    print listval["text"]
  File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 13-63: character maps to <undefined>

I am new to Python, and any help would be appreciated.

3 Answers

9
Answered 2013-05-10T22:48:10.883
5

If you are using the PyDev Eclipse plugin, try going to Window -> Preferences -> General -> Workspace and, under "Text file encoding" in the lower-left corner, choose Other: UTF-8.

It might work.

Answered 2013-05-15T02:10:33.687
0

Your json.loads call is converting the UTF-8 encoded JSON back into a Python Unicode string. When you print it, Python attempts to convert the text into your environment's default encoding, which the cp1252.py reference in the traceback makes clear is Windows code page 1252. Any character that has no cp1252 equivalent then raises UnicodeEncodeError. You'll have to decide which output encoding you want and encode to it before printing. If you want cp1252, give it an error handler other than the default of 'strict'.

http://docs.python.org/2/howto/unicode.html has the full docs, including the various error handler possibilities.
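
As a rough illustration of the point about error handlers, here is a minimal sketch. The sample line and the helper name encode_for_output are made up for this example and are not from the question. It only shows how the 'replace', 'ignore', and 'backslashreplace' handlers behave when text cannot be represented in cp1252.

```python
# -*- coding: utf-8 -*-
import json

# Hypothetical sample line, shaped like one tweet from output.txt.
# The escaped characters are Japanese and have no cp1252 equivalent.
line = '{"text": "RT \\u3053\\u3093\\u306b\\u3061\\u306f #test"}'

tweet = json.loads(line)
text = tweet["text"]  # a Unicode string, not encoded bytes


def encode_for_output(text, encoding="cp1252", errors="replace"):
    """Encode Unicode text to bytes; `errors` picks the handler used
    for characters the target encoding cannot represent."""
    return text.encode(encoding, errors)


replaced = encode_for_output(text)                      # '?' per unencodable character
ignored = encode_for_output(text, errors="ignore")      # unencodable characters dropped
escaped = encode_for_output(text, errors="backslashreplace")  # kept as \uXXXX escapes
```

In the Python 2.7 code from the question, the equivalent change would be along the lines of print listval["text"].encode("cp1252", "replace"). Which handler to choose depends on whether losing the original characters is acceptable for your output.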

Answered 2013-05-10T22:48:47.827