python - How to convert a UTF-8 string to a URL-compliant string in Python?

Question

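I'm sure I'm not the first person to run into this problem. But after hours of debugging, Googling, and searching StackOverflow without finding an answer, I decided to post this question. Apologies in advance if I missed something, but right now I'm stuck.

I am using BeautifulSoup to parse a UTF-8 website. I use text from the site to build URLs for further scraping, and I am running into problems with some non-English characters.

For example: the site contains the string Originální formule, which I want to use to build the URL http://blahblah.com/Originální-formule or http://blahblah.com/origin%C3%A1ln%C3%AD-formule. The problem is that I get http://blahblah.com/Origin\xe1ln\xed-formule, which produces an error. I have tried encoding, decoding, and so on, but I still cannot get a correct URL.

By the way, when I run print u'Origin\xe1ln\xed-formule', the string prints fine. It just does not encode correctly.

What am I doing wrong?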

score 1 · Accepted Answer

To get the result you expect, do the following:

1. Decompose the URL
2. Take the path part and encode it to UTF-8
3. Quote the path
4. Join the parts back together to get a quoted URL

In Python 2, you can perform these steps with a combination of the following functions:

urlparse.urlparse
urllib.quote
urlparse.urlunparse

The code will end up like this:

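# -*- coding: utf-8 -*-
# Python 2
from urlparse import urlparse, urlunparse
from urllib import quote

x = u'http://blahblah.com/Originální-formule'
# Encode to UTF-8 bytes first, then split the URL into its six components
parsed_url = list(urlparse(x.encode('utf-8')))
# Index 2 is the path; percent-encode its non-ASCII bytes
parsed_url[2] = quote(parsed_url[2])
# Reassemble the components into a single URL string
print urlunparse(parsed_url)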

Result: http://blahblah.com/Origin%C3%A1ln%C3%AD-formule
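The code above targets Python 2. For readers on Python 3, here is a minimal sketch of the same decompose, quote, and rejoin steps, where these functions live in urllib.parse and quote encodes str input as UTF-8 by default. The helper name quote_url_path is hypothetical and not part of the original answer. The sketch assumes the input path is not already percent-encoded, since an existing % would be quoted again as %25.

```python
from urllib.parse import urlparse, urlunparse, quote


def quote_url_path(url):
    """Percent-encode only the path component of a URL (hypothetical helper)."""
    # Split the URL into scheme, netloc, path, params, query, fragment
    parts = urlparse(url)
    # quote() encodes the str as UTF-8 by default and leaves '/' untouched
    quoted_path = quote(parts.path)
    # Put the quoted path back and reassemble the URL
    return urlunparse(parts._replace(path=quoted_path))


print(quote_url_path('http://blahblah.com/Originální-formule'))
# → http://blahblah.com/Origin%C3%A1ln%C3%AD-formule
```

Only the path is quoted here, so the scheme, host, and any query string pass through unchanged.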
