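Python gurus,
In the past I have used Perl to process very large text files for data mining. I recently decided to switch because I believe Python makes it easier to read through my code and figure out what is going on. The unfortunate (or maybe fortunate?) thing about Python compared to Perl is that I find storing and organizing data extremely difficult, because I can't create a hash of hashes through autovivification. I also can't sum the elements of a dictionary of dictionaries.
Perhaps there is an elegant solution to my problem.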
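For comparison, a recursively defined `collections.defaultdict` is one common way to approximate Perl-style autovivification in Python. This is only a minimal sketch; the helper name `tree` and the sample keys are illustrative assumptions, not part of the code in this post.

```python
from collections import defaultdict


def tree():
    """Return a dict whose missing keys are created on demand as nested trees."""
    return defaultdict(tree)


data = tree()
# No intermediate level has to be created by hand before assigning.
data['1415PA']['0']['BEC'] = 262
data['1415PA']['0']['GEC'] = 663

print(data['1415PA']['0']['BEC'])  # 262
```

Note that merely reading a missing key also creates it, which is the same trade-off Perl's autovivification has.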
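I have hundreds of files, each containing hundreds of lines of data (everything fits in memory). The goal is to combine these files according to the following criteria:
For each level (only one level is shown below), I need to create one row for every defect class found across all files. Not all files contain the same defects.
For each level and defect class, sum all the GEC and BEC values found across all files.
The final output should look like this (sample output updated to fix a typo):
Level, defectClass, BECtotals, GECtotals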
1415PA, 0, 643, 1991
1415PA, 1, 1994, 6470
...etc...
File one:
level, defectClass, BEC, GEC
1415PA, 0, 262, 663
1415PA, 1, 1138, 4104
1415PA, 107, 2, 0
1415PA, 14, 3, 4
1415PA, 15, 1, 0
1415PA, 2, 446, 382
1415PA, 21, 5, 0
1415PA, 23, 10, 5
1415PA, 4, 3, 16
1415PA, 6, 52, 105
File two:
level, defectClass, BEC, GEC
1415PA, 0, 381, 1328
1415PA, 1, 856, 2366
1415PA, 107, 7, 11
1415PA, 14, 4, 1
1415PA, 2, 315, 202
1415PA, 23, 4, 7
1415PA, 4, 0, 2
1415PA, 6, 46, 42
1415PA, 7, 1, 7
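The combination described above can be sketched with a plain dictionary keyed by `(level, defectClass)`. This is a minimal sketch under stated assumptions: the two sample files are held in memory as strings, the function name `sum_defects` is hypothetical, and defect classes are assumed to be integers for sorting purposes.

```python
import csv
import io
from collections import defaultdict

# The two sample files shown above, held in memory for illustration.
FILE_ONE = """level, defectClass, BEC, GEC
1415PA, 0, 262, 663
1415PA, 1, 1138, 4104
1415PA, 107, 2, 0
1415PA, 14, 3, 4
1415PA, 15, 1, 0
1415PA, 2, 446, 382
1415PA, 21, 5, 0
1415PA, 23, 10, 5
1415PA, 4, 3, 16
1415PA, 6, 52, 105
"""

FILE_TWO = """level, defectClass, BEC, GEC
1415PA, 0, 381, 1328
1415PA, 1, 856, 2366
1415PA, 107, 7, 11
1415PA, 14, 4, 1
1415PA, 2, 315, 202
1415PA, 23, 4, 7
1415PA, 4, 0, 2
1415PA, 6, 46, 42
1415PA, 7, 1, 7
"""


def sum_defects(sources):
    """Sum BEC and GEC per (level, defectClass) across all open sources."""
    totals = defaultdict(lambda: [0, 0])  # (level, defectClass) -> [BEC, GEC]
    for source in sources:
        reader = csv.reader(source, skipinitialspace=True)
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) != 4:
                continue  # ignore blank or malformed lines
            level, defect_class, bec, gec = (field.strip() for field in row)
            totals[level, defect_class][0] += int(bec)
            totals[level, defect_class][1] += int(gec)
    return totals


def sort_key(item):
    """Order by level, then by defect class as a number (assumed numeric)."""
    (level, defect_class), _ = item
    return level, int(defect_class)


totals = sum_defects([io.StringIO(FILE_ONE), io.StringIO(FILE_TWO)])
for (level, defect_class), (bec, gec) in sorted(totals.items(), key=sort_key):
    print('%s, %s, %d, %d' % (level, defect_class, bec, gec))
# The first two lines printed are:
# 1415PA, 0, 643, 1991
# 1415PA, 1, 1994, 6470
```

Defect classes that appear in only one file (15, 21 and 7 here) still get a row, because a key is created the first time it is seen.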
My biggest problem is being able to sum over the dictionary. Here is the code I have so far (not working):
import os
import sys

class AutoVivification(dict):
    """Implementation of perl's autovivification feature. Has features from both dicts and lists,
    dynamically generates new subitems as needed, and allows for working (somewhat) as a basic type.
    """
    def __getitem__(self, item):
        if isinstance(item, slice):
            d = AutoVivification()
            items = sorted(self.iteritems(), reverse=True)
            k,v = items.pop(0)
            while 1:
                if (item.start < k < item.stop):
                    d[k] = v
                elif k > item.stop:
                    break
                if item.step:
                    for x in range(item.step):
                        k,v = items.pop(0)
                else:
                    k,v = items.pop(0)
            return d
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

    def __add__(self, other):
        """If attempting addition, use our length as the 'value'."""
        return len(self) + other

    def __radd__(self, other):
        """If the other type does not support addition with us, this addition method will be tried."""
        return len(self) + other

    def append(self, item):
        """Add the item to the dict, giving it a higher integer key than any currently in use."""
        largestKey = sorted(self.keys())[-1]
        if isinstance(largestKey, str):
            self.__setitem__(0, item)
        elif isinstance(largestKey, int):
            self.__setitem__(largestKey+1, item)

    def count(self, item):
        """Count the number of keys with the specified item."""
        return sum([1 for x in self.items() if x == item])

    def __eq__(self, other):
        """od.__eq__(y) <==> od==y. Comparison to another AV is order-sensitive
        while comparison to a regular mapping is order-insensitive. """
        if isinstance(other, AutoVivification):
            return len(self)==len(other) and self.items() == other.items()
        return dict.__eq__(self, other)

    def __ne__(self, other):
        """od.__ne__(y) <==> od!=y"""
        return not self == other

for filename in os.listdir('/Users/aleksarias/Desktop/DefectMatchingDatabase/'):
    if filename[0] == '.' or filename == 'YieldToDefectDatabaseJan2014Continued.csv':
        continue
    path = '/Users/aleksarias/Desktop/DefectMatchingDatabase/' + filename
    for filename2 in os.listdir(path):
        if filename2[0] == '.':
            continue
        path2 = path + "/" + filename2
        techData = AutoVivification()
        for file in os.listdir(path2):
            if file[0:13] == 'SummaryRearr_':
                dataFile = path2 + '/' + file
                print('Location of file to read: ', dataFile, '\n')
                fh = open(dataFile, 'r')
                for lines in fh:
                    if lines[0:5] == 'level':
                        continue
                    lines = lines.strip()
                    elements = lines.split(',')
                    if techData[elements[0]][elements[1]]['BEC']:
                        techData[elements[0]][elements[1]]['BEC'].append(elements[2])
                    else:
                        techData[elements[0]][elements[1]]['BEC'] = elements[2]
                    if techData[elements[0]][elements[1]]['GEC']:
                        techData[elements[0]][elements[1]]['GEC'].append(elements[3])
                    else:
                        techData[elements[0]][elements[1]]['GEC'] = elements[3]
                    print(elements[0], elements[1], techData[elements[0]][elements[1]]['BEC'], techData[elements[0]][elements[1]]['GEC'])
        techSumPath = path + '/Summary_' + filename + '.csv'
        fh2 = open(techSumPath, 'w')
        for key1 in sorted(techData):
            for key2 in sorted(techData[key1]):
                BECtotal = sum(map(int, techData[key1][key2]['BEC']))
                GECtotal = sum(map(int, techData[key1][key2]['GEC']))
                fh2.write('%s,%s,%s,%s\n' % (key1, key2, BECtotal, GECtotal))
        print('Created file at:', techSumPath)
        input('Go check the file!!!!')
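One likely reason the accumulation step in the code above fails: the first value seen for a key is stored as a plain string, so the next `.append` call for that key is made on a `str`, which has no such method. The sketch below keeps the same collect-then-sum structure but starts every key with empty lists. The sample `rows` and the name `tech_data` are illustrative assumptions, standing in for the stripped fields from `lines.split(',')`.

```python
from collections import defaultdict

# Hypothetical rows, as they would look after splitting and stripping a line.
rows = [
    ['1415PA', '0', '262', '663'],
    ['1415PA', '0', '381', '1328'],
    ['1415PA', '7', '1', '7'],
]

# Every (level, defectClass) key starts with two empty lists, so .append
# is always called on a list and never on a str.
tech_data = defaultdict(lambda: {'BEC': [], 'GEC': []})
for level, defect_class, bec, gec in rows:
    tech_data[level, defect_class]['BEC'].append(bec)
    tech_data[level, defect_class]['GEC'].append(gec)

# The raw strings are converted to int only at summation time.
bec_total = sum(map(int, tech_data['1415PA', '0']['BEC']))
gec_total = sum(map(int, tech_data['1415PA', '0']['GEC']))
print(bec_total, gec_total)  # 643 1991
```

Keeping the raw values in lists costs more memory than adding to a running total, but it preserves the per-file numbers in case they are needed later.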
Thanks for looking at this!!!!!
Alex