python - How can I count how many times a unique URL has been opened in Python?

Question

I am running Python code that reads a list of URLs and opens each one individually with urlopen. Some URLs are repeated in the list. An example of the list looks like this:

www.example.com/page1
www.example.com/page1
www.example.com/page2
www.example.com/page2
www.example.com/page2
www.example.com/page3
www.example.com/page4
www.example.com/page4
[...]

I would like to know whether there is a way to implement a counter that tells me how many times the code has already opened each URL. For each URL in the list, the counter should return the number shown after the colon below:

www.example.com/page1: 0
www.example.com/page1: 1
www.example.com/page2: 0
www.example.com/page2: 1
www.example.com/page2: 2
www.example.com/page3: 0
www.example.com/page4: 0
www.example.com/page4: 1

Thanks!

score 0 · Accepted Answer

For simplicity, use io.StringIO in place of a real file:

import io
fin = io.StringIO("""www.example.com/page1
www.example.com/page1
www.example.com/page2
www.example.com/page2
www.example.com/page2
www.example.com/page3
www.example.com/page4
www.example.com/page4""")

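We use collections.Counter:

from collections import Counter

fout = io.StringIO()  # in-memory output "file"
data = [line.strip() for line in fin]
counts = Counter(data)  # total occurrences of each URL
new_data = []
# Walk backwards: after the decrement, counts[line] is the number of earlier occurrences
for line in data[::-1]:
    counts[line] -= 1
    new_data.append((line, counts[line]))
# Restore the original order when writing out
for line in new_data[::-1]:
    fout.write('{} {:d}\n'.format(*line))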

This is the result:

fout.seek(0)
print(fout.read())

www.example.com/page1 0
www.example.com/page1 1
www.example.com/page2 0
www.example.com/page2 1
www.example.com/page2 2
www.example.com/page3 0
www.example.com/page4 0
www.example.com/page4 1

Edit

A shorter version that also works for large files, because it only needs one line at a time:

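from collections import defaultdict
counts = defaultdict(int)  # URLs not seen yet start at 0

# fin and fout are fresh file-like objects, as in the setup above
for raw_line in fin:
    line = raw_line.strip()
    # write the count before incrementing, so the first occurrence gets 0
    fout.write('{} {:d}\n'.format(line, counts[line]))
    counts[line] += 1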
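
As a quick sanity check, the two versions above can be compared on the sample data. This is only a sketch: the function names two_pass and one_pass are made up for illustration, and the URLs are passed as a list rather than read from a file.

```python
from collections import Counter, defaultdict

def two_pass(urls):
    """Count everything first, then walk backwards and decrement."""
    counts = Counter(urls)
    out = []
    for url in reversed(urls):
        counts[url] -= 1
        out.append((url, counts[url]))
    out.reverse()  # back to the original order
    return out

def one_pass(urls):
    """Record how often each URL was seen before the current position."""
    counts = defaultdict(int)
    out = []
    for url in urls:
        out.append((url, counts[url]))
        counts[url] += 1
    return out

sample = [
    "www.example.com/page1", "www.example.com/page1",
    "www.example.com/page2", "www.example.com/page2", "www.example.com/page2",
    "www.example.com/page3",
    "www.example.com/page4", "www.example.com/page4",
]
same = two_pass(sample) == one_pass(sample)
print(same)
```

The one-pass version never holds the whole list in memory, which is why it is the better fit for large files.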

score 0 · Accepted Answer

Use a collections.defaultdict() object:

from collections import defaultdict

urls = defaultdict(int)

for url in url_source:  # url_source: your iterable of URLs
    print('{}: {}'.format(url, urls[url]))

    # process (e.g. open the URL)

    urls[url] += 1
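
To show where the actual opening would fit, here is a minimal sketch that wraps the loop in a function. The name open_with_counts and its opener parameter are hypothetical; in the asker's code the opener would presumably be urlopen, but a stand-in is used here so the example runs without network access.

```python
from collections import defaultdict

def open_with_counts(url_source, opener):
    """Call opener(url) for every URL; return (url, prior_opens) pairs."""
    seen = defaultdict(int)  # URL -> number of times opened so far
    log = []
    for url in url_source:
        log.append((url, seen[url]))  # count before this open
        opener(url)                   # stand-in for the real open call
        seen[url] += 1
    return log

opened = []  # the stand-in opener just records each URL it is given
log = open_with_counts(
    ["www.example.com/page1", "www.example.com/page1", "www.example.com/page2"],
    opened.append,
)
for url, prior in log:
    print("{}: {}".format(url, prior))
```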

score -2 · Accepted Answer

I don't think you can do that. Remove the duplicates from the list.

Answered 2013-06-08T01:13:22.200
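
If the aim really were to open each URL only once, as this answer suggests, one possible way to drop duplicates while keeping the original order is dict.fromkeys. This is a sketch that assumes Python 3.7 or later, where dicts preserve insertion order. Note that it throws away the per-URL counts the question asks for.

```python
urls = [
    "www.example.com/page1", "www.example.com/page1",
    "www.example.com/page2", "www.example.com/page2",
    "www.example.com/page3",
]
# dict keys are unique and keep first-seen order, so this removes repeats
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)
```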
