python - How to pass a Unicode string as an argument to urllib.urlencode()

Question

I'm using Microsoft's free translation service to translate some Hindi characters into English. They don't provide an API for Python, but I borrowed code from: tinyurl.com/dxh6thr

I'm trying to use the 'Detect' method described here: tinyurl.com/bxkt3we

The 'hindi.txt' file is saved in the unicode charset.

>>> hindi_string = open('hindi.txt').read()
>>> data = { 'text' : hindi_string }
>>> token = msmt.get_access_token(MY_USERID, MY_TOKEN)
>>> request = urllib2.Request('http://api.microsofttranslator.com/v2/Http.svc/Detect?'+urllib.urlencode(data))
>>> request.add_header('Authorization', 'Bearer '+token)
>>> response = urllib2.urlopen(request)
>>> print response.read()
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">en</string>
>>>

The response shows that the translator detected 'en' instead of 'hi' (for Hindi). When I check the type of the value, it shows up as a plain 'str':

>>> type(hindi_string)
<type 'str'>

For reference, here is the content of 'hindi.txt':

हाय, कैसे आप आज कर रहे हैं। मैं अच्छी तरह से, आपको धन्यवाद कर रहा हूँ।

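I'm not sure whether string.encode or string.decode applies here. If it does, what do I need to encode/decode from and to? What is the best way to pass a Unicode string as a urllib.urlencode argument? How can I make sure the actual Hindi characters are passed as the argument?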

Thank you.

** Additional information **

I tried using codecs.open() as suggested, but I get the following error:

>>> hindi_new = codecs.open('hindi.txt', encoding='utf-8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\codecs.py", line 671, in read
    return self.reader.read(size)
  File "C:\Python27\lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Here is the repr(hindi_string) output:

>>> repr(hindi_string)
"'\\xff\\xfe9\\t>\\t/\\t,\\x00 \\x00\\x15\\tH\\t8\\tG\\t \\x00\\x06\\t*\\t \\x00
\\x06\\t\\x1c\\t \\x00\\x15\\t0\\t \\x000\\t9\\tG\\t \\x009\\tH\\t\\x02\\td\\t \
\x00.\\tH\\t\\x02\\t \\x00\\x05\\t'"

score 2 · Accepted Answer

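Your file is utf-16 (the \xff\xfe at the start of the repr output is the UTF-16 little-endian byte order mark), so you need to decode the content before sending it: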

hindi_string = open('hindi.txt').read().decode('utf-16')
data = { 'text' : hindi_string.encode('utf-8') }
...
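
As a rough sketch of the full round trip, the snippet below decodes UTF-16 bytes to text, re-encodes the text as UTF-8, and only then passes it to urlencode. The sample bytes are the first few bytes of the repr output from the question. build_query is a hypothetical helper name, and the try/except import is an assumption made so the same code runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3


def build_query(raw_bytes, source_encoding='utf-16'):
    """Hypothetical helper: turn raw file bytes into a URL query string."""
    # Decode the on-disk bytes to text; 'utf-16' reads the BOM to pick the byte order.
    text = raw_bytes.decode(source_encoding)
    # urlencode percent-escapes bytes, so hand it UTF-8 bytes explicitly.
    return urlencode({'text': text.encode('utf-8')})


# First bytes of the file as shown by repr(hindi_string) in the question.
sample = b'\xff\xfe9\t>\t/\t,\x00 \x00'
query = build_query(sample)
print(query)
```

The resulting string can be appended to the Detect URL in place of urllib.urlencode(data) from the question.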

score 0 · Answer

You could try opening the file with codecs.open and decoding it as utf-8:

import codecs

with codecs.open('hindi.txt', encoding='utf-8') as f:
    hindi_text = f.read()
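
The traceback in the question shows the utf-8 codec failing on byte 0xff at position 0, which is the first byte of a UTF-16 little-endian byte order mark. One possible way to avoid guessing is to look at the BOM before choosing a codec. The sketch below assumes the file starts with a BOM; guess_encoding is a hypothetical helper, not part of the codecs module, and it falls back to utf-8 when no BOM is found.

```python
import codecs


def guess_encoding(raw, default='utf-8'):
    """Hypothetical helper: pick a codec name from a leading byte order mark."""
    # UTF-32 is checked first because its little-endian BOM (FF FE 00 00)
    # begins with the UTF-16 little-endian BOM (FF FE).
    candidates = (
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    )
    for bom, name in candidates:
        if raw.startswith(bom):
            return name
    # No BOM found: this fallback is only an assumption about the file.
    return default


# First bytes of 'hindi.txt' as shown by repr(hindi_string) in the question.
raw = b'\xff\xfe9\t>\t/\t'
encoding = guess_encoding(raw)
hindi_text = raw.decode(encoding)
```

With a real file, raw would come from open('hindi.txt', 'rb').read(), and the detected name could be passed as the encoding argument to codecs.open.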
