python - How to use pycurl when the URL contains non-English characters?

Question

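This is the example from pycurl's SourceForge page. If the URL contains something like Chinese characters, what process should we follow, given that pycurl does not support unicode?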

import pycurl
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.python.org/")
c.setopt(pycurl.HTTPHEADER, ["Accept:"])

import StringIO
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
print b.getvalue()

score 1 · Accepted Answer

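Here is a script that demonstrates three different issues:

non-ASCII characters in the Python source code
non-ASCII characters in the URL
non-ASCII characters in the HTML content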

# -*- coding: utf-8 -*-
import urllib
from StringIO import StringIO
import pycurl

title = u"UNIX时间" # 1
url = "https://zh.wikipedia.org/wiki/" + urllib.quote(title.encode('utf-8')) # 2

c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPHEADER, ["Accept:"])

b = StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()

data = b.getvalue() # bytes
print len(data), repr(data[:200])

html_page_charset = "utf-8" # 3
html_text = data.decode(html_page_charset)
print html_text[:200] # 4

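Note: the occurrences of utf-8 in the code are completely independent of each other.

Unicode literals (# 1) use whatever character encoding you declared at the top of the file. Make sure your text editor respects that setting.
The path in the URL (# 2) should be encoded as utf-8 before it is percent-encoded (urlencoded).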

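There are several ways to find out an HTML page's charset (# 3). See Character encodings in HTML. Some libraries, such as requests mentioned by @Oz123, do it automatically: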

# -*- coding: utf-8 -*-
import requests

r = requests.get(u"https://zh.wikipedia.org/wiki/UNIX时间")
print len(r.content), repr(r.content[:200]) # bytes
print r.encoding
print r.text[:200] # Unicode

To print Unicode to the console (# 4), you can use the PYTHONIOENCODING environment variable to set a character encoding that your terminal understands.
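When the body is fetched as raw bytes, as in the pycurl script above, one place to look for the charset is the Content-Type response header. The following is a minimal sketch, assuming Python 3 (the scripts in this answer are Python 2) and assuming the header value has already been captured as a string. The helper names charset_from_content_type and decode_body are hypothetical, and falling back to utf-8 is an assumption, not something HTML guarantees.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` (an assumption) when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the charset name it finds.
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode response bytes using the charset named in Content-Type."""
    charset = charset_from_content_type(content_type)
    # errors="replace" keeps going on bytes that do not fit the charset.
    return body.decode(charset, errors="replace")


# Stand-in for bytes returned by a server that declares GB2312.
sample_body = u"UNIX时间".encode("gb2312")
text = decode_body(sample_body, "text/html; charset=GB2312")
print(text)
```

This covers only the header. A page may also declare its charset in a meta tag, and an unknown charset name would raise LookupError in decode, so a real fetcher would likely need more fallbacks.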

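See also The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and the Python-specific Pragmatic Unicode.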

score 0 · Accepted Answer

Try urllib.quote; it replaces non-ASCII characters with escape sequences:

import urllib

# Python 2: encode the unicode object to UTF-8 bytes before quoting
url_to_fetch = urllib.quote(unicode_url.encode('utf-8'))

Edit: only the path should be quoted. You have to split the full URL with urlparse, quote the path, and then use urlunparse to get the final URL to fetch.
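The split, quote, and reassemble steps could look roughly like this. It is a sketch assuming Python 3, where urlparse, quote, and urlunparse live in urllib.parse (in Python 2 they are spread across the urlparse and urllib modules). The helper name quote_url_path is made up for illustration, and the sketch assumes the host name is already ASCII.

```python
from urllib.parse import quote, urlparse, urlunparse


def quote_url_path(url):
    """Percent-encode only the path component of `url`.

    Scheme, host, query and fragment are passed through unchanged.
    """
    parts = urlparse(url)
    # In Python 3, quote() encodes str input as UTF-8 by default and
    # leaves "/" unescaped, so path separators survive.
    quoted_path = quote(parts.path)
    return urlunparse(parts._replace(path=quoted_path))


url_to_fetch = quote_url_path(u"https://zh.wikipedia.org/wiki/UNIX时间")
print(url_to_fetch)
```

Quoting the whole URL instead would also escape the colon after the scheme, which is why only the path goes through quote here. Note that a path that is already percent-encoded would be encoded a second time by this helper.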

score 0 · Accepted Answer

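Just encode your URL as "utf-8" and everything will be fine. From the documentation:

Under Python 3, the bytes type holds arbitrary encoded byte strings. PycURL will accept bytes values for all options where libcurl specifies a "string" argument: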

>>> import pycurl
>>> c = pycurl.Curl()
>>> c.setopt(c.USERAGENT, b'Foo\xa9')
# ok

The str type holds Unicode data. PycURL will accept str values containing ASCII code points only:

>>> c.setopt(c.USERAGENT, 'Foo')
# ok

>>> c.setopt(c.USERAGENT, 'Foo\xa9')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 3: 
ordinal not in range(128)

>>> c.setopt(c.USERAGENT, 'Foo\xa9'.encode('iso-8859-1'))
# ok

[1] http://pycurl.io/docs/latest/unicode.html
