I'm trying to optimize some code so that it runs faster. Given this text-file matrix:
TAG, DESC, ID1, ID2, ID3, ID4,
1, "details", 0, 1, NA, 1, 2
2, "details", 2, 1, NA, 0, 1
3, "details", 1, NA, NA, 0, 2
...
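This is a large file with about 10,000 columns and about 2M rows. What I want to do is compute the sum of all ID values in each row (4 for TAG=1) and the frequency relative to the maximum possible total, where each ID can be at most 2 (so 4/8 = 0.5), and then append these values as new columns. NA marks missing data and is effectively zero. This code works, but it is slow: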
import csv

tab_dict = csv.DictReader(open(path), delimiter=",")
tab_reader = [row for row in tab_dict]
for t in tab_reader:
    idlist = [i for i in t.keys()]
    idlist.remove('TAG')  # exclude columns that do not contain numbers for summing
    idlist.remove('DESC')
    rowsum = 0
    for i in idlist:
        try:
            rowsum += int(t[i])  # try/except to handle "NA"s
        except (ValueError, TypeError):
            pass  # "NA" (or a missing field) is not a number; count it as zero
    t["ROWSUM"] = rowsum  # create the new columns
    t["ROWFREQ"] = float(rowsum) / float(2 * len(idlist))
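To make the intended calculation concrete, here is a minimal sketch of the per-row arithmetic. The helper name `row_stats`, the `max_per_id` parameter, and the four-value example row are assumptions made for illustration; they are not part of the script above or taken from the real file.

```python
def row_stats(values, max_per_id=2):
    """Return (row sum, row frequency) for one row's ID values.

    `values` holds the raw strings from the ID columns; "NA" counts as zero.
    `row_stats` and `max_per_id` are illustrative names only.
    """
    # Skip "NA" entries; int() tolerates surrounding whitespace.
    total = sum(int(v) for v in values if v.strip() != "NA")
    # The maximum possible total is max_per_id for every ID column.
    freq = total / float(max_per_id * len(values))
    return total, freq


# Hypothetical row with four ID columns (not taken from the real file).
example = ["1", "1", "NA", "2"]
print(row_stats(example))  # (4, 0.5)
```

Note that the NA entry still counts toward the denominator, which matches the 4/8 = 0.5 example in the description.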
Any suggestions on how to speed this up? Thanks.