python - How to recalculate the number of students in each class from a CSV that has been read

Question

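I have a CSV in which column 6 holds the count of students in that class. I also have a separate piece of code that removes some students from a class if they appear in a different script. How would I recalculate the number of students in each class afterwards? Please see the sample data below: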

Jan-20,Data,Class xpv,4,11yo+,4,more data....
Jan-20,Data,Class xpv,4,11yo+,4,more data....
Jan-20,Data,Class xpv,4,11yo+,4,more data....
Jan-20,Data,Class xpv,4,11yo+,4,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....

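The column that identifies which rows get deleted comes at the end, in "more data". After any rows have been deleted, how do I write code that counts the students left in that class, essentially counting the rows per class (column 2) and replacing the value in column 6? (The class names are all unique.)

I hope this makes sense. Any help would be greatly appreciated! Kind regards, AEA

EDIT: The data above is saved as AEAtest.csv.

I tried running the following code: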

import csv
import itertools
from operator import itemgetter
import random

def some_condition(line):
    return random.random() < 0.5 # delete lines randomly with 50% probability

def filter_data(data):
    for classname, group in itertools.groupby(data, itemgetter(2)):
        filtered_group = [line for line in group if some_condition(line)]
        new_sum = len(filtered_group)
        for line in filtered_group:
            line[5] = new_sum
            yield line

with open('C:\AEAtest.csv') as f_in, open('C:\AEAtest_MOD.csv', 'w') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerows(filter_data(reader))

The output is as follows:

Jan-20,Data,Class xpv,4,11yo+,2,more data....

Jan-20,Data,Class xpv,4,11yo+,2,more data....

Jan-30,Data,Class tn2,4,10yo+,7,more data....

Jan-30,Data,Class tn2,4,10yo+,7,more data....

Jan-30,Data,Class tn2,4,10yo+,7,more data....

Jan-30,Data,Class tn2,4,10yo+,7,more data....

Jan-30,Data,Class tn2,4,10yo+,7,more data....

Jan-30,Data,Class tn2,4,10yo+,7,more data....

Jan-30,Data,Class tn2,4,10yo+,7,more data....

Jan-50,Data,Class 22zn,2,10yo+,3,more data....

Jan-50,Data,Class 22zn,2,10yo+,3,more data....

Jan-50,Data,Class 22zn,2,10yo+,3,more data....

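I would like to know why the extra lines now appear. Interestingly, the last line of text above is on line 23, and it is followed by two more blank lines.

Any help fixing this error? Kind regards, AEA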

score 5 · Accepted Answer

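I think you can use itertools.groupby on your CSV data to group the rows by class name. Then, as you iterate over each group, you can correct the count if any rows were removed.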

from itertools import groupby
from operator import itemgetter

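def filter_data(data):
    # groupby is imported by name above, so call it directly
    for classname, group in groupby(data, itemgetter(2)):
        # keep only the rows that pass the (user-supplied) condition
        filtered_group = [line for line in group if some_condition(line)]
        new_count = len(filtered_group)
        for line in filtered_group:
            line[5] = new_count  # overwrite the stale class size
            yield line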
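
One caveat worth spelling out: itertools.groupby only groups consecutive items that share a key, so the function above relies on each class's rows being adjacent, as they are in the sample data. The following is a minimal sketch with made-up rows (not taken from the CSV above) showing what happens when they are not adjacent, and how sorting by the class column first avoids it:

```python
from itertools import groupby
from operator import itemgetter

# Made-up rows for illustration; only the class column (index 2) matters here.
rows = [
    ["Jan-20", "Data", "Class xpv"],
    ["Jan-30", "Data", "Class tn2"],
    ["Jan-20", "Data", "Class xpv"],  # same class, but not adjacent to the first row
]

# Unsorted input: "Class xpv" ends up in two separate groups.
unsorted_sizes = [(name, len(list(group)))
                  for name, group in groupby(rows, itemgetter(2))]

# Sorting by class name first makes each class form exactly one group.
sorted_rows = sorted(rows, key=itemgetter(2))
sorted_sizes = [(name, len(list(group)))
                for name, group in groupby(sorted_rows, itemgetter(2))]

print(unsorted_sizes)
print(sorted_sizes)
```

If the real file is not already ordered by class, sorting it like this before calling filter_data should give correct counts, at the cost of changing the row order.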

Given a some_condition function, you can use filter_data to print the filtered data:

import csv
import random

def some_condition(line):
    return random.random() < 0.5 # delete lines randomly with 50% probability

data = """Jan-20,Data,Class xpv,4,11yo+,4,more data....
Jan-20,Data,Class xpv,4,11yo+,4,more data....
Jan-20,Data,Class xpv,4,11yo+,4,more data....
Jan-20,Data,Class xpv,4,11yo+,4,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-30,Data,Class tn2,4,10yo+,12,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....
Jan-50,Data,Class 22zn,2,10yo+,6,more data....""".splitlines()

for line in filter_data(csv.reader(data)):
    print(line)
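
The random condition above is only a stand-in. As a sketch of a deterministic one, assume (hypothetically, since the question does not show the real marker) that the last field of each row contains the text "remove" for students who should be dropped. The names keep_row and the sample values below are illustrative only:

```python
import csv
from itertools import groupby
from operator import itemgetter

def keep_row(line):
    # Hypothetical rule: the last field is "remove" for rows to be dropped.
    return line[-1].strip().lower() != "remove"

def filter_data(data, condition):
    # Same grouping logic as above, with the condition passed in as an argument.
    for classname, group in groupby(data, itemgetter(2)):
        kept = [line for line in group if condition(line)]
        for line in kept:
            line[5] = len(kept)  # replace the old count with the new class size
            yield line

# Made-up sample in the same layout as the question's data.
sample = """Jan-20,Data,Class xpv,4,11yo+,4,keep
Jan-20,Data,Class xpv,4,11yo+,4,remove
Jan-20,Data,Class xpv,4,11yo+,4,keep
Jan-50,Data,Class 22zn,2,10yo+,6,remove
Jan-50,Data,Class 22zn,2,10yo+,6,keep""".splitlines()

result = list(filter_data(csv.reader(sample), keep_row))
for line in result:
    print(line)
```

Passing the condition as a parameter keeps the grouping code unchanged when the real deletion rule is swapped in.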

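You probably want to read from and write to actual files rather than parse a string and print the modified result. Here is some (untested) code showing how you might do that: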

with open('myfile.csv', 'rb') as f_in, open('myfile_filtered.csv', 'wb') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerows(filter_data(reader))

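Note that in Python 3 the files should be opened in text mode rather than binary mode, and you also need to pass the extra argument newline="" so that the csv module handles line endings itself.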
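
That missing newline="" is the likely cause of the blank lines in the question's output: csv.writer ends each row with "\r\n" by default, and a text-mode file on Windows then turns the "\n" into "\r\n" again, producing "\r\r\n". Below is a minimal Python 3 sketch of the file version. The file names, the temporary directory, and the "remove" marker are assumptions made for the demo, not part of the original data:

```python
import csv
import os
import tempfile
from itertools import groupby
from operator import itemgetter

def some_condition(line):
    # Placeholder rule for this sketch only: drop rows whose last field is "remove".
    return line[-1] != "remove"

def filter_data(data):
    for classname, group in groupby(data, itemgetter(2)):
        filtered_group = [line for line in group if some_condition(line)]
        for line in filtered_group:
            line[5] = len(filtered_group)
            yield line

def filter_file(src, dst):
    # newline="" on both files lets the csv module control line endings itself.
    with open(src, newline="") as f_in, open(dst, "w", newline="") as f_out:
        csv.writer(f_out).writerows(filter_data(csv.reader(f_in)))

# Demo with a small made-up file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "AEAtest.csv")
dst = os.path.join(tmp_dir, "AEAtest_MOD.csv")
with open(src, "w", newline="") as f:
    f.write("Jan-20,Data,Class xpv,4,11yo+,4,keep\n"
            "Jan-20,Data,Class xpv,4,11yo+,4,remove\n"
            "Jan-20,Data,Class xpv,4,11yo+,4,keep\n")

filter_file(src, dst)
with open(dst, newline="") as f:
    output = f.read()
print(repr(output))
```

Each output row ends in a single "\r\n", so no empty lines appear between records when the file is opened in a text editor or spreadsheet.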
