0

So I have these URLs that keep changing:

http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/

But I want to strip off the changing first part, so that only this is left:

http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/

What regular expression would I use to remove everything before it?

I can't use "startswith()", because the "usg" value in the URL changes.

4 Answers

3

Use the right tool for the job; parse the query string with the urlparse module:

import urlparse

qs = urlparse.urlsplit(inputurl).query
url = urlparse.parse_qs(qs).get('url', [None])[0]

This sets url to None if there is no url= element in the query string, and to the URL value otherwise.

Demo:

>>> import urlparse
>>> inputurl = 'http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
>>> qs = urlparse.urlsplit(inputurl).query
>>> urlparse.parse_qs(qs).get('url', [None])[0]
'http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
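
The snippets above are written for Python 2. In Python 3 the urlparse module was merged into urllib.parse, so a roughly equivalent sketch could look like the following. The helper name extract_target is hypothetical and only used here for illustration:

```python
from urllib.parse import urlsplit, parse_qs


def extract_target(inputurl):
    """Return the value of the 'url' query parameter, or None if it is absent."""
    qs = urlsplit(inputurl).query
    # parse_qs maps each key to a list of values; fall back to [None] if missing.
    return parse_qs(qs).get('url', [None])[0]


inputurl = ('http://news.google.com/news/url?sa=t&fd=R'
            '&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg'
            '&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/'
            'dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/')
print(extract_target(inputurl))
```

Because the lookup goes through .get() with a default, a URL without a url= parameter yields None instead of raising an exception.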
answered 2013-11-10T01:58:00.103
1

Why not just

print data.split("&url=", 1)[1].split("&", 1)[0]

Sample run

data = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
print data.split("&url=", 1)[1].split("&", 1)[0]

Output

http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/
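
One caveat with the bare one-liner is that indexing with [1] raises IndexError when "&url=" does not occur in the string. A slightly more defensive sketch, written for Python 3 and using a hypothetical helper name extract_by_split, might be:

```python
def extract_by_split(data):
    """Return the text after '&url=' up to the next '&', or None if '&url=' is missing."""
    # partition returns (head, separator, tail); the separator is '' when not found.
    _, sep, rest = data.partition("&url=")
    if not sep:
        return None
    return rest.partition("&")[0]


data = ("http://news.google.com/news/url?sa=t&fd=R"
        "&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg"
        "&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/"
        "dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/")
print(extract_by_split(data))
```

This still assumes that url is not the first query parameter (it would then be preceded by "?" rather than "&") and it does not decode percent-escapes.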
answered 2013-11-10T01:51:09.943
1

This will work fine:

url = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"

In [148]: url.split('&url=')[1]
Out[148]: 'http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'

That said, I would use urlparse.parse_qs(url), as @MartijnPieters mentioned in the comments.

answered 2013-11-10T01:54:33.007
1

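Note that what is to the right of "&url=" is not a URL. It is a URL-encoded URL. So, for example, if the original URL contained "&", this will contain "%26". Using it without decoding works for many URLs, but is not guaranteed to work in general.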
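
To illustrate the difference, here is a small Python 3 sketch. The target URL is a made-up example containing an ampersand, and the redirect URL is built with urlencode purely for demonstration; it is an assumption about how such a link would be encoded, not something taken from a real Google News link:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical target URL that itself contains '?' and '&'.
target = 'http://example.com/search?q=a&lang=en'

# Build a redirect-style URL; urlencode percent-encodes the embedded URL.
redirect = 'http://news.google.com/news/url?' + urlencode({'sa': 't', 'url': target})

# String splitting returns the value still percent-encoded (contains '%26').
raw = redirect.split('&url=', 1)[1].split('&', 1)[0]

# parse_qs decodes the percent-escapes and recovers the original target.
decoded = parse_qs(urlsplit(redirect).query)['url'][0]

print(raw)
print(decoded)
```

The split-based value would need an extra decoding step before it could be used as a URL, whereas parse_qs handles the decoding as part of parsing.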

As Martijn suggested, this will always work correctly:

import urlparse
data = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
o = urlparse.urlparse(data)
q = urlparse.parse_qs(o.query)
print q['url'][0]  # parse_qs maps each key to a list of values; take the first
answered 2013-11-10T02:00:02.320