numpy - Efficient way to calculate average, standard deviation from txt files

Question

This is a copy of one of many txt files.

Class 1:
Subject A:
posX posY posZ  x(%)  y(%)
  0   2    0    81    72
  0   2   180   63    38
 -1  -2    0    79    84
 -1  -2   180   85    95
  .   .    .    .     .
Subject B:
posX posY posZ  x(%)   y(%)
  0   2     0    71     73
 -1  -2     0    69     88   
  .   .     .    .      .
Subject C:
posX  posY posZ x(%)   y(%)
  0    2    0    86     71
 -1   -2    0    81     55
  .    .    .     .     .
Class 2:
Subject A:
posX posY posZ  x(%)  y(%)
  0   2    0    81    72
 -1  -2    0    79    84
  .   .    .    .     .

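The number of classes, subjects, and row entries all vary.
Class 1 - Subject A always has posZ entries that alternate between 0 and 180.
Calculate the average of x(%), y(%) by class and by subject.
Calculate the standard deviation of x(%), y(%) by class and by subject.
Ignore the rows where posZ is 180 when calculating the averages and standard deviations.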

I have developed a clumsy solution in Excel (using macros and VBA), but I would rather find a more optimal solution in Python.

numpy is very useful, but the .mean() and .std() functions only work on arrays. I am still researching it, along with pandas' groupby function.

I would like the final output to look like the following (1. by class, 2. by subject):

 1. By Class                 
             X     Y                      
 Average                        
 std_dev     

 2. By Subject  
             X     Y
 Average
 std_dev

score 1 · Accepted Answer

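I think working with dictionaries (and lists of dictionaries) is a good way to get familiar with handling data in Python. To get your data into that shape, you need to read the text file and set variables line by line.

To start: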

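outputList = []   # collects one dict per data row
# infile is the already-opened text file
for line in infile:
    line = line.strip()   # drop the trailing newline so '1:\n' becomes '1:'
    if line.startswith("Class"):
        temp,class_var = line.split(' ')
        class_var = class_var.replace(':','')
    elif line.startswith("Subject"):
        temp,subject = line.split(' ')
        subject = subject.replace(':','')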

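This creates variables holding the current class and the current subject. Next, you want to read in your numeric values. A good way to read them is with a try statement that attempts to convert them to integers; lines that cannot be converted (such as the column headers) are skipped.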

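    else:
        # split() with no argument collapses runs of spaces and drops the newline
        line = line.split()
        try:
            keys = ['posX','posY','posZ','x_perc','y_perc']
            values = [int(item) for item in line]
            if len(values) != len(keys):
                raise ValueError("unexpected column count")
            entry = dict(zip(keys,values))
            entry['class'] = class_var
            entry['subject'] = subject
            outputList.append(entry)
        except ValueError:
            # header rows, '.' filler rows and blank lines end up here
            pass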
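
Putting the two fragments above together, a minimal self-contained sketch might look like the following. The function name `parse_records` and the `sample` text are illustrative and not part of the original answer; the sketch assumes whitespace-separated integer columns in the order shown in the question.

```python
KEYS = ['posX', 'posY', 'posZ', 'x_perc', 'y_perc']

def parse_records(lines):
    """Turn an iterable of text lines into a list of dicts (hypothetical helper)."""
    records = []
    class_var = subject = None
    for line in lines:
        line = line.strip()
        if line.startswith("Class"):
            class_var = line.split()[1].rstrip(':')
        elif line.startswith("Subject"):
            subject = line.split()[1].rstrip(':')
        else:
            try:
                values = [int(item) for item in line.split()]
            except ValueError:
                continue  # header rows and '.' filler rows are skipped
            if len(values) != len(KEYS):
                continue  # blank or malformed rows are skipped
            entry = dict(zip(KEYS, values))
            entry['class'] = class_var
            entry['subject'] = subject
            records.append(entry)
    return records

# A few rows copied from the question's sample file
sample = """Class 1:
Subject A:
posX posY posZ  x(%)  y(%)
  0   2    0    81    72
  0   2   180   63    38
Subject B:
posX posY posZ  x(%)   y(%)
  0   2     0    71     73
"""
records = parse_records(sample.splitlines())
print(len(records), records[0])
```

For a real file, the same function can be given an open file object instead of `sample.splitlines()`, since iterating over a file yields its lines.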

This puts each row into dictionary form, including the class and subject variables defined earlier, and appends it to outputList. You end up with this:

[{'posX': 0, 'x_perc': 81, 'posZ': 0, 'y_perc': 72, 'posY': 2, 'class': '1', 'subject': 'A'},
{'posX': 0, 'x_perc': 63, 'posZ': 180, 'y_perc': 38, 'posY': 2, 'class': '1', 'subject': 'A'}, ...]

and so on.

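You can then compute the mean and SD by subsetting the list of dictionaries (applying rules such as excluding posZ=180). Here is the calculation by class: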

import numpy as np

classes = ['1','2']
print("By Class:")
print("Class","Avg X","Avg Y","X SD","Y SD")
for class_var in classes:
    # Keep only this class's rows, skipping the posZ == 180 rows
    rows = [item for item in outputList if item['class'] == class_var and item['posZ'] != 180]
    x = [item['x_perc'] for item in rows]
    y = [item['y_perc'] for item in rows]

    # np.std defaults to the population SD (ddof=0); pass ddof=1 for the sample SD
    print(class_var,np.mean(x),np.mean(y),np.std(x),np.std(y))

You will have to play around with the printing to get exactly the output you want, but this should get you started.
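
The by-subject table the question asks for follows the same pattern with a different grouping key. As a sketch, the helper below groups on any key and skips the posZ == 180 rows; the name `summarize` and the small `records` list are made up for illustration. It uses the standard library's `statistics.mean` and `statistics.pstdev` in place of numpy so that it runs without extra installs. `pstdev` is the population standard deviation, which is what `np.std` computes by default, so `np.mean` and `np.std` could be swapped back in.

```python
from collections import defaultdict
from statistics import mean, pstdev

def summarize(records, key):
    """Group records by `key` ('class' or 'subject') and return per-group stats."""
    groups = defaultdict(list)
    for item in records:
        if item['posZ'] != 180:  # rule from the question: ignore posZ == 180 rows
            groups[item[key]].append(item)
    summary = {}
    for name, rows in groups.items():
        xs = [r['x_perc'] for r in rows]
        ys = [r['y_perc'] for r in rows]
        summary[name] = {'x_mean': mean(xs), 'y_mean': mean(ys),
                         'x_sd': pstdev(xs), 'y_sd': pstdev(ys)}
    return summary

# Illustrative records in the same shape as outputList
records = [
    {'posZ': 0,   'x_perc': 81, 'y_perc': 72, 'class': '1', 'subject': 'A'},
    {'posZ': 180, 'x_perc': 63, 'y_perc': 38, 'class': '1', 'subject': 'A'},
    {'posZ': 0,   'x_perc': 79, 'y_perc': 84, 'class': '1', 'subject': 'A'},
    {'posZ': 0,   'x_perc': 71, 'y_perc': 73, 'class': '1', 'subject': 'B'},
]
by_subject = summarize(records, 'subject')
by_class = summarize(records, 'class')
for name in sorted(by_subject):
    print(name, by_subject[name])
```

Grouping with a dictionary also removes the need to hard-code the list of classes or subjects, which helps when their number varies from file to file.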
