
Python gurus,

In the past I've always used Perl to process very large text files for data mining. I recently decided to switch because I believe Python makes it easier for me to read through my code and figure out what's going on. The unfortunate (or perhaps fortunate?) thing about Python compared to Perl is that storing and organizing data is much harder, because I can't create a hash of hashes through autovivification. I also can't sum up the elements of a dict of dicts.

Perhaps there's an elegant solution to my problem.
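The closest built-in approach I know of is a recursive collections.defaultdict, which gives Perl-style autovivification; a minimal sketch with made-up keys:

from collections import defaultdict

# Recursive defaultdict: any missing key springs into existence
# as a new nested dictionary.
def tree():
    return defaultdict(tree)

data = tree()
data['1415PA']['0']['BEC'] = 262  # intermediate levels are created automatically

It still doesn't give me an easy way to sum across files, though.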

I have hundreds of files, each containing hundreds of rows of data (all of it fits in memory). The goal is to combine the files (two samples are shown below), subject to a few criteria:

  1. For each level (only one level is shown below), I need to create a row for every defect class found across all the files. Not all files contain the same defect classes.

  2. For each level and defect class, sum up all the GEC and BEC values found across all the files.

  3. The final output should look like this (sample output updated to fix a typo):

Level, defectClass, BECtotals, GECtotals
1415PA, 0, 643, 1991
1415PA, 1, 1994, 6470
...and so on...

File one:

Level,  defectClass,    BEC,    GEC
1415PA,      0,         262,    663
1415PA,      1,         1138,   4104
1415PA,    107,         2,      0
1415PA,     14,         3,      4
1415PA,     15,         1,      0
1415PA,      2,         446,    382
1415PA,     21,         5,      0
1415PA,     23,         10,     5
1415PA,      4,         3,      16
1415PA,      6,         52,     105

File two:

level,  defectClass,    BEC,    GEC
1415PA,      0,         381,    1328
1415PA,      1,         856,    2366
1415PA,    107,         7,      11
1415PA,     14,         4,      1
1415PA,      2,         315,    202
1415PA,     23,         4,      7
1415PA,      4,         0,      2
1415PA,      6,         46,     42
1415PA,      7,         1,      7

My biggest problem is being able to do the summing across the dictionaries. Here's the code I have so far (not working):

import os
import sys


class AutoVivification(dict):
    """Implementation of Perl's autovivification feature. Has features from
    both dicts and lists, dynamically generates new subitems as needed, and
    allows for working (somewhat) as a basic type.
    """
    def __getitem__(self, item):
        if isinstance(item, slice):
            # Collect the (key, value) pairs whose keys fall within the
            # slice bounds; assumes a fully-bounded slice (start and stop given).
            d = AutoVivification()
            items = sorted(self.items(), reverse=True)
            if items:
                k, v = items.pop(0)
                while True:
                    if item.start < k < item.stop:
                        d[k] = v
                    elif k > item.stop:
                        break
                    if not items:
                        break
                    if item.step:
                        for _ in range(item.step):
                            if not items:
                                break
                            k, v = items.pop(0)
                    else:
                        k, v = items.pop(0)
            return d
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            # Autovivify: a missing key springs into existence as a new,
            # empty AutoVivification.
            value = self[item] = type(self)()
            return value

    def __add__(self, other):
        """If attempting addition, use our length as the 'value'."""
        return len(self) + other

    def __radd__(self, other):
        """If the other type does not support addition with us, this addition method will be tried."""
        return len(self) + other

    def append(self, item):
        """Add the item to the dict, giving it a higher integer key than any currently in use."""
        largestKey = sorted(self.keys())[-1]
        if isinstance(largestKey, str):
            self[0] = item
        elif isinstance(largestKey, int):
            self[largestKey + 1] = item

    def count(self, item):
        """Count the number of values equal to the specified item."""
        return sum(1 for v in self.values() if v == item)

    def __eq__(self, other):
        """od.__eq__(y) <==> od==y. Comparison to another AutoVivification is
        order-sensitive, while comparison to a regular mapping is order-insensitive."""
        if isinstance(other, AutoVivification):
            return len(self) == len(other) and list(self.items()) == list(other.items())
        return dict.__eq__(self, other)

    def __ne__(self, other):
        """od.__ne__(y) <==> od!=y"""
        return not self == other

basePath = '/Users/aleksarias/Desktop/DefectMatchingDatabase/'

for filename in os.listdir(basePath):
    if filename[0] == '.' or filename == 'YieldToDefectDatabaseJan2014Continued.csv':
        continue
    path = basePath + filename
    # Accumulate totals for this technology across all of its subdirectories,
    # since the summary file below is written once per technology.
    techData = AutoVivification()

    for filename2 in os.listdir(path):
        if filename2[0] == '.':
            continue
        path2 = path + '/' + filename2

        for file in os.listdir(path2):
            if file[0:13] == 'SummaryRearr_':
                dataFile = path2 + '/' + file
                print('Location of file to read:', dataFile, '\n')

                with open(dataFile, 'r') as fh:
                    for line in fh:
                        if line[0:5].lower() == 'level':  # header row ('Level' or 'level')
                            continue
                        # Strip the padding around each comma-separated field.
                        elements = [e.strip() for e in line.strip().split(',')]

                        # Collect the BEC and GEC values in lists so they can
                        # be summed per (level, defectClass) afterwards.
                        if techData[elements[0]][elements[1]]['BEC']:
                            techData[elements[0]][elements[1]]['BEC'].append(elements[2])
                        else:
                            techData[elements[0]][elements[1]]['BEC'] = [elements[2]]

                        if techData[elements[0]][elements[1]]['GEC']:
                            techData[elements[0]][elements[1]]['GEC'].append(elements[3])
                        else:
                            techData[elements[0]][elements[1]]['GEC'] = [elements[3]]

                        print(elements[0], elements[1],
                              techData[elements[0]][elements[1]]['BEC'],
                              techData[elements[0]][elements[1]]['GEC'])

    techSumPath = path + '/Summary_' + filename + '.csv'
    with open(techSumPath, 'w') as fh2:
        for key1 in sorted(techData):
            for key2 in sorted(techData[key1]):
                BECtotal = sum(map(int, techData[key1][key2]['BEC']))
                GECtotal = sum(map(int, techData[key1][key2]['GEC']))
                fh2.write('%s,%s,%s,%s\n' % (key1, key2, BECtotal, GECtotal))
    print('Created file at:', techSumPath)
    input('Go check the file!!!!')
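
For what it's worth, the totals themselves can be kept in a plain dictionary of running sums; a minimal sketch, assuming two CSV files laid out like the samples above (the filenames are made up):

import csv
from collections import defaultdict

# Running [BEC, GEC] totals keyed by (level, defectClass).
totals = defaultdict(lambda: [0, 0])

for name in ('fileone.txt', 'filetwo.txt'):  # hypothetical filenames
    with open(name, newline='') as fh:
        reader = csv.reader(fh, skipinitialspace=True)
        next(reader)  # skip the header row
        for level, defect, bec, gec in reader:
            totals[(level, defect)][0] += int(bec)
            totals[(level, defect)][1] += int(gec)

for (level, defect), (bec, gec) in sorted(totals.items()):
    print('%s,%s,%s,%s' % (level, defect, bec, gec))

But that still feels like I'm hand-rolling something Perl gave me for free.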

Thanks for taking a look at this!!!!!
Alex


1 Answer


I'm going to suggest a different approach: if you're working with tabular data, you should look at the pandas library. Your code becomes

import pandas as pd

filenames = "fileone.txt", "filetwo.txt"  # or whatever

dfs = []
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)
    df = df.rename(columns={"level": "Level"})
    dfs.append(df)

df_comb = pd.concat(dfs)
df_totals = df_comb.groupby(["Level", "defectClass"], as_index=False).sum()
df_totals.to_csv("combined.csv", index=False)

which produces

dsm@notebook:~/coding/pand$ cat combined.csv 
Level,defectClass,BEC,GEC
1415PA,0,643,1991
1415PA,1,1994,6470
1415PA,2,761,584
1415PA,4,3,18
1415PA,6,98,147
1415PA,7,1,7
1415PA,14,7,5
1415PA,15,1,0
1415PA,21,5,0
1415PA,23,14,12
1415PA,107,9,11

Here I read every file into memory at once and combine them into one big DataFrame (like an Excel sheet), but we could just as easily do the groupby file by file, so that we'd only ever need to hold one file in memory at a time if we wanted to.
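
A minimal sketch of that file-at-a-time variant, reusing the filenames tuple from above:

import pandas as pd

filenames = "fileone.txt", "filetwo.txt"  # or whatever

# Reduce each file to its own (Level, defectClass) subtotals first,
# so only one file's rows are in memory at a time...
partials = []
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)
    df = df.rename(columns={"level": "Level"})
    partials.append(df.groupby(["Level", "defectClass"], as_index=False).sum())

# ...then stack the small subtotal frames and sum them once more.
df_totals = pd.concat(partials).groupby(["Level", "defectClass"], as_index=False).sum()
df_totals.to_csv("combined.csv", index=False)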

answered Feb 10, 2014 at 05:34