python - Convert a 2-column, counter-like csv file to a Python collections.Counter?

Question

I have a comma-separated (,), tab-delimited (\t) file.

68,"phrase"\t
485,"another phrase"\t
43, "phrase 3"\t

Is there a simple way to get this into a Python Counter?

score 1 · Accepted Answer

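I couldn't let this go, and I stumbled on what I think is the winner.

In testing, it became clear that looping over the rows of csv.DictReader was the slowest part. It accounted for about 30 of the 40 seconds.

I swapped it for a plain csv.reader to see what I would get. That produced each row as a list. I wrapped the reader in a dict to see whether it would convert directly. It did!

Then I could loop over a native dict instead of csv.DictReader.

The result... 4 million rows done in 3 seconds!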

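import csv
from collections import Counter


def convert_counter_like_csv_to_counter(file_to_convert):
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.reader(f, delimiter="\t")
        # dict() turns each [count, phrase] row into a count -> phrase entry.
        # Caveat: rows that share the same count overwrite one another here.
        d = dict(csv_reader)
        the_counter = Counter({phrase: int(float(count)) for count, phrase in d.items()})

    return the_counter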
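
One caveat with the snippet above: dict(csv_reader) keys the intermediate dictionary on the first column, the count, so two phrases that happen to share a count collapse into a single entry. A minimal sketch that avoids this, assuming every row has exactly two tab-separated fields, unpacks the rows straight from csv.reader. The name counter_from_rows and the sample data are made up for illustration, and this variant has not been timed against the 4-million-row file.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def counter_from_rows(file_to_convert):
    # Hypothetical variant: unpack each row directly instead of going
    # through dict(), so phrases that share a count are all kept.
    with file_to_convert.open(encoding="utf-8", newline="") as f:
        csv_reader = csv.reader(f, delimiter="\t")
        return Counter({phrase: int(float(count)) for count, phrase in csv_reader})


# Made-up sample: two different phrases share the count 68.
sample = '68\t"phrase"\n485\t"another phrase"\n68\t"phrase 3"\n'
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "counts.tsv"
    path.write_text(sample, encoding="utf-8")
    result = counter_from_rows(path)

print(result)
```

As in the answers here, file_to_convert is assumed to be a pathlib.Path (or anything else with a compatible .open() method).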

score 1 · Accepted Answer

You can use a dict comprehension, which is considered more pythonic and can be slightly faster:

import csv
from collections import Counter


def convert_counter_like_csv_to_counter(file_to_convert):
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        the_counter = Counter({row["title"]: int(float(row["count"])) for row in csv_reader})
    return the_counter
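
Whether the comprehension is actually faster is easy to check on your own data. The sketch below is one rough way to time csv.DictReader against a plain csv.reader. It assumes synthetic in-memory data with distinct phrases, and the helper names with_dict_reader and with_plain_reader are made up for illustration. Timings will differ by machine and file, so none are shown.

```python
import csv
import io
import time
from collections import Counter


def with_dict_reader(stream):
    # Same approach as the answer above, but reading from any text stream.
    reader = csv.DictReader(stream, delimiter="\t", fieldnames=["count", "title"])
    return Counter({row["title"]: int(float(row["count"])) for row in reader})


def with_plain_reader(stream):
    # csv.reader yields plain lists, which avoids building a dict per row.
    reader = csv.reader(stream, delimiter="\t")
    return Counter({title: int(float(count)) for count, title in reader})


# Synthetic data (an assumption): 100,000 distinct phrases, tab-separated.
data = "".join(f'{i}\t"phrase {i}"\n' for i in range(1, 100_001))

timings = {}
results = {}
for func in (with_dict_reader, with_plain_reader):
    start = time.perf_counter()
    results[func.__name__] = func(io.StringIO(data))
    timings[func.__name__] = time.perf_counter() - start

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```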

score 0 · Accepted Answer

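This is my best attempt. It works, but it is not the fastest.
~~It takes about 1.5 minutes to run on a 4-million-line input file.~~
With Daniel Mesejo's suggestion, it now takes about 40 seconds on a 4-million-line input file.

Note: the count values in the csv can be in scientific notation and need converting, hence the int(float(...)) cast.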

import csv
from collections import Counter

def convert_counter_like_csv_to_counter(file_to_convert):

    the_counter = Counter()
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        for row in csv_reader:
            the_counter[row["title"]] = int(float(row["count"]))

    return the_counter
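
The int(float(...)) cast used above can be checked in isolation. In this small sketch the values are made up and parse_count is a hypothetical helper: int() on its own rejects a string such as "1e3", while float() parses it first. Because the value passes through a float, counts above 2**53 may lose precision.

```python
def parse_count(text):
    # Hypothetical helper: go through float() so that "1e3"-style strings
    # are accepted, then truncate to an integer.
    return int(float(text))


values = ["68", "4.85e2", "1e3", " 43 "]
parsed = [parse_count(v) for v in values]
print(parsed)  # [68, 485, 1000, 43]

# int() alone does not accept scientific notation in a string.
try:
    int("1e3")
    int_accepts_sci = True
except ValueError:
    int_accepts_sci = False
print(int_accepts_sci)  # False
```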
