python - How do I set up a regular expression to remove the timestamp at the end of a URL in Python?

Question

I have text files containing a large number of URLs, but each URL has a timestamp at the end, which is redundant for my purposes.

    http://techcrunch.com/2012/02/10/vevo-ceo-tries-to-explain-their-hypocritical-act-of-piracy-at-sundance/)16:55:40
    http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45

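I was thinking that a regular expression in Python would help me get rid of it, but I could also use Python's split and replace string operations to remove the timestamp at the end, which gives output like the URL below: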

    >>> url.split(")")[0]
    http://techcrunch.com/2012/04/30/edmodo-hits-7m

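My question is: in terms of space and time, which performs better, the regular-expression approach or the Python string methods? Or is there some other, better way?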
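One way to answer the timing part empirically is to measure both approaches with the standard-library timeit module. The sketch below is only illustrative: the helper names strip_with_split and strip_with_regex and the pattern are assumptions, not part of the question, and the numbers will vary by machine and Python version.

```python
import re
import timeit

# Sample line taken from the question.
LINE = "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45"

# Assumed pattern: a closing parenthesis followed by HH:MM:SS at the end of the line.
TIMESTAMP = re.compile(r"\)\d{2}:\d{2}:\d{2}$")


def strip_with_split(line):
    # Keep everything before the last ")".
    return line.rsplit(")", 1)[0]


def strip_with_regex(line):
    # Remove a trailing ")HH:MM:SS" if present.
    return TIMESTAMP.sub("", line)


if __name__ == "__main__":
    for func in (strip_with_split, strip_with_regex):
        seconds = timeit.timeit(lambda: func(LINE), number=100_000)
        print(func.__name__, round(seconds, 4))
```

Both helpers return the same string for the sample line, so the comparison is like for like. For a large file, reading the data is likely to dominate either approach.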

score 1 · Accepted Answer

I wouldn't use a regex for a task like this; it is too simple to need one.

for line in lines:
    print(line.split(')')[0])

Or, if the URL itself contains a ):

for line in lines:
    # Drop only the piece after the last ')', keeping any ')' inside the URL.
    print(')'.join(line.split(')')[:-1]))
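The split-and-rejoin step can also be written with str.rsplit, which splits only once, starting from the right. A minimal sketch, assuming each line ends with ")HH:MM:SS"; the second sample URL is made up to show a parenthesis inside the URL:

```python
# Sample lines; the second URL is hypothetical and contains ")" itself.
lines = [
    "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45",
    "http://example.com/wiki/Python_(language)/)16:55:40",
]

# maxsplit=1 cuts only at the last ")", so earlier ones stay in the URL.
urls = [line.rsplit(")", 1)[0] for line in lines]

for url in urls:
    print(url)
```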

score 0 · Accepted Answer

Another possibility:

for line in lines:
    url = line.rsplit('/', 1)[0]

answered 2013-07-24T21:11:07.283

score 0 · Accepted Answer

If the part you want to remove has a fixed length, why not just

L[:-9]

?

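In Python, L[a:b] denotes the part of L (a list, string, or tuple) from index a up to, but not including, index b.

If a is omitted, the slice starts at the beginning; if b is negative, it is counted from the end.

So L[:-9] means "all of L except the last nine elements".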
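Applied to one of the sample lines, the slice looks like this. The sketch assumes the line was read from a file and still carries its trailing newline, which has to be removed first or the count of nine would be off by one:

```python
# Sample line from the question, with the newline a file read would leave on it.
line = "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45\n"

# ")HH:MM:SS" is 9 characters: one parenthesis plus eight for the time.
url = line.rstrip("\n")[:-9]
print(url)
```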

score 0 · Accepted Answer

This should be faster than iterating over each line:

import re

my_str = "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45"
re.findall(r'([\w./:\d-]+)/\)\d\d:\d\d:\d\d', my_str)
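Presumably the idea is to run the pattern once over the whole file contents instead of line by line. A minimal sketch under that assumption, using the two sample lines from the question joined into one string, as reading the whole file at once would return them:

```python
import re

# Both sample lines from the question in a single newline-separated string.
text = (
    "http://techcrunch.com/2012/02/10/vevo-ceo-tries-to-explain-their-"
    "hypocritical-act-of-piracy-at-sundance/)16:55:40\n"
    "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45\n"
)

# The character class does not match newlines, so each URL is captured separately.
# The literal "/" before "\)" means the trailing slash is left out of the capture.
urls = re.findall(r'([\w./:\d-]+)/\)\d\d:\d\d:\d\d', text)

for url in urls:
    print(url)
```

This pattern only matches lines where a "/" comes immediately before the ")", as it does in both samples.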

score 0 · Accepted Answer

import re

# If you want to remove the extra / at the end of the url, use this regex instead:
# r"^(?P<url>.*[^/])/?\)(?P<timestamp>\d{2}:\d{2}:\d{2})$"
url_timestamp_pattern = re.compile(r"^(?P<url>.*)\)(?P<timestamp>\d{2}:\d{2}:\d{2})$")

# Iterate over the file directly; the with block closes it when done.
with open('urls.txt') as f:
    for line in f:
        match = url_timestamp_pattern.match(line)
        if match:
            print(match.group('url'))
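To illustrate the alternative pattern from the comment, here is a small sketch that applies it to one sample line and reads both named groups. The variable names are illustrative only:

```python
import re

# The variant from the comment above: it also drops the "/" before the ")".
pattern = re.compile(r"^(?P<url>.*[^/])/?\)(?P<timestamp>\d{2}:\d{2}:\d{2})$")

# Sample line from the question, including the newline left by a file read.
line = "http://techcrunch.com/2012/04/30/edmodo-hits-7m/)15:18:45\n"

match = pattern.match(line)
if match:
    url = match.group("url")
    timestamp = match.group("timestamp")
    print(url, timestamp)
```

The trailing newline does not prevent the match, because $ also matches just before a newline at the end of the string.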
