python - Averaging data based on specific columns - python

Question

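I have a data file with many rows and 8 columns. I want to average column 8 over the rows that have the same values in columns 1, 2 and 5. For example, my file might look like this: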

564645  7371810 0   21642   1530    1   2   30.8007
564645  7371810 0   21642   8250    1   2   0.0103
564645  7371810 0   21643   1530    1   2   19.3619

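I want to average the last column of the first and third rows, because their columns 1, 2 and 5 are identical.

I would like the output to look like this: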

564645  7371810 0   21642   1530    1   2   25.0813
564645  7371810 0   21642   8250    1   2   0.0103

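My file (a text file) is quite large (~10000 lines), and the redundant data (by the rule above) do not occur at regular intervals, so I would like the code to find the redundant rows and average them...

In response to larsks' comment, here are my 4 lines of code...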

import os
import numpy as np
datadirectory = input('path to the data directory, ')
os.chdir( datadirectory)

##READ DATA FILE AND CREATE AN ARRAY
dataset = open(input('dataset_to_be_used, ')).readlines()
data = np.loadtxt(dataset)
##Sort the data based on common X, Y and frequency
datasort = np.lexsort((data[:,0],data[:,1],data[:,4]))
datasorted = data[datasort]

score 0 · Accepted Answer

You can do this quickly with pandas:

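import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"

data = StringIO("""564645  7371810 0   21642   1530    1   2   30.8007
564645  7371810 0   21642   8250    1   2   0.0103
564645  7371810 0   21643   1530    1   2   19.3619
""")
# Name the columns X.1 ... X.8 explicitly; current pandas would otherwise label them 0 ... 7
names = ["X.%d" % i for i in range(1, 9)]
df = pd.read_csv(data, sep=r"\s+", header=None, names=names)
df.groupby(["X.1", "X.2", "X.5"])["X.8"].mean()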

The output is:

X.1     X.2      X.5 
564645  7371810  1530    25.0813
                 8250     0.0103
Name: X.8

If you do not want the grouping keys as an index, you can call:

df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()

This gives the result:

      X.1      X.2   X.5      X.8
0  564645  7371810  1530  25.0813
1  564645  7371810  8250   0.0103
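
The result above keeps only the three key columns and the mean, while the desired output in the question shows all eight columns. Here is a minimal sketch of one way to carry the others along, assuming it is acceptable to take them from the first row of each group (the question does not say which value to keep). The column names c1 to c8 and the variable names are illustrative only.

```python
from io import StringIO

import pandas as pd

sample = StringIO(
    "564645 7371810 0 21642 1530 1 2 30.8007\n"
    "564645 7371810 0 21642 8250 1 2 0.0103\n"
    "564645 7371810 0 21643 1530 1 2 19.3619\n"
)
names = ["c%d" % i for i in range(1, 9)]  # hypothetical column names
df = pd.read_csv(sample, sep=r"\s+", header=None, names=names)

# Group on columns 1, 2 and 5; average column 8 and, as an assumption,
# keep the first value seen for every other column.
averaged = df.groupby(["c1", "c2", "c5"], as_index=False).agg(
    {"c3": "first", "c4": "first", "c6": "first", "c7": "first", "c8": "mean"}
)
averaged = averaged[names]  # restore the original column order
print(averaged.to_string(index=False, header=False))
```

Swapping "first" for "last" or "min" changes which value survives in the non-key columns; only the c8 entry decides how the averaged column is computed.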

score 0 · Accepted Answer

OK, based on Hury's input, I updated the code -

import os #needed system utils
import numpy as np# for array data processing
import pandas as pd #import the pandas module
datadirectory = input('path to the data directory, ')
working = os.environ.get("WORKING_DIRECTORY", datadirectory) 
os.chdir( working)

 ##READ DATA FILE AND and convert it to string
dataset = open(input('dataset_to_be_used, ')).readlines()
data = ''.join(dataset) 

df = pd.read_csv(data, sep="\\s+", header=None)
sorted_data = df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
tuple_data = [tuple(x) for x in sorted_data.values]
datas = np.asarray(tuple_data)

This works with the test data posted by hury, but it does not seem to work when I use my own file. After the df = ... line I get output like the following:

Traceback (most recent call last):
  File "/media/DATA/arxeia/Programming/MyPys/data_refine_average.py", line 31, in <module>
    df = pd.read_csv(data, sep="\s+", header=None)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
    return _read(TextParser, filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 141, in _read
    f = com._get_handle(filepath_or_buffer, 'r', encoding=encoding)
  File "/usr/lib64/python2.7/site-packages/pandas/core/common.py", line 673, in _get_handle
    f = open(path, mode)
IOError: [Errno 36] File name too long: '564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\r\n564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\r\n564645\t7371810\t0\t21642\t20370\t1\t2\t0.0042\r\n564645\t7371810\t0\t21642\t33030\t1\t2\t0.0026\r\n564645\t7371810\t0\t21642\t47970\t1\t2\t0.0018\r\n564645\t7371810\t0\t21642\t63090\t1\t2\t0.0013\r\n564645\t7371810\t0\t21642\t93090\t1\t2\t0.0009\r\n564645\t7371810\t0\t216.........

Any ideas?
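
A likely cause, judging from the traceback: pd.read_csv was given the file's contents as a plain string and tried to open that string as a file name, hence "File name too long". Below is a sketch of two ways around it, assuming a whitespace-delimited file. The helper names load_from_path and load_from_text are illustrative and not part of pandas.

```python
from io import StringIO

import pandas as pd


def load_from_path(path):
    # Hand pandas the path itself and let it open the file.
    return pd.read_csv(path, sep=r"\s+", header=None)


def load_from_text(text):
    # If the contents are already in memory, wrap them in a file-like object.
    return pd.read_csv(StringIO(text), sep=r"\s+", header=None)


text = (
    "564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\n"
    "564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\n"
)
df = load_from_text(text)
print(df.shape)
```

With header=None and no names argument, current pandas labels the columns 0 to 7, so the group-by keys would be [0, 1, 4] and the averaged column 7 unless names are supplied.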

score 0 · Accepted Answer

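This is not the most elegant answer, and I do not know how fast or efficient it is, but I believe it does the job based on the information you provided: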

import numpy

data_file = "full_location_of_data_file"
data_dict = {}
with open(data_file) as handle:
    for line in handle:
        columns = line.split()
        if not columns:  # skip blank lines
            continue
        # Build the key from columns 1, 2 and 5 (indices 0, 1 and 4)
        entry = "-".join([columns[0], columns[1], columns[4]])
        try:  # works if this combination of columns 1, 2, 5 has been seen before
            data_dict[entry].append(float(columns[7]))
        except KeyError:  # raised the first time a combination of columns 1, 2, 5 appears
            data_dict[entry] = [float(columns[7])]

for entry in data_dict:
    value = numpy.mean(data_dict[entry])
    output = entry.split("-")
    output.append(str(value))
    print("\t".join(output))

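It was not clear to me whether you want or need columns 3, 6 or 7, so I left them out. In particular, you did not say how to handle different values that may occur in them. If you can spell out the behavior you want (for example, defaulting to some value or taking the first occurrence), I would suggest either filling in the default or storing the first instance in a dictionary of dictionaries rather than a dictionary of lists.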
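
One possible reading of the "first occurrence" option can be sketched with the standard library alone. It assumes the non-key columns are simply copied from the first row seen for each key and that four decimal places are wanted for the mean. Neither is stated in the question, and the function name average_by_key is made up for illustration.

```python
def average_by_key(lines, key_cols=(0, 1, 4), value_col=7):
    """Average value_col over rows sharing key_cols; keep the first row's other columns."""
    first_row = {}  # key -> columns of the first row seen with that key
    values = {}     # key -> list of values to average
    for line in lines:
        columns = line.split()
        if not columns:  # skip blank lines
            continue
        key = tuple(columns[i] for i in key_cols)
        first_row.setdefault(key, columns)
        values.setdefault(key, []).append(float(columns[value_col]))

    rows = []
    for key, row in first_row.items():
        merged = list(row)
        mean = sum(values[key]) / len(values[key])
        merged[value_col] = "%.4f" % mean  # assumed output precision
        rows.append(merged)
    return rows


sample = [
    "564645  7371810 0   21642   1530    1   2   30.8007",
    "564645  7371810 0   21642   8250    1   2   0.0103",
    "564645  7371810 0   21643   1530    1   2   19.3619",
]
for row in average_by_key(sample):
    print("\t".join(row))
```

Tuple keys are used here instead of a "-"-joined string so that values containing a minus sign cannot be split incorrectly later.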

score 0 · Accepted Answer

import os #needed system utils
import numpy as np# for array data processing


datadirectory = '/media/DATA/arxeia/Dimitris/Testing/12_11'
working = os.environ.get("WORKING_DIRECTORY", datadirectory)
os.chdir( working)

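##Here I tried to read the file and then use the name of the string in the next line - this caused the same error described below (error #42 (I think) - name too long)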

data_dict = {} #Create empty dictionary
for line in open('/media/DATA/arxeia/Dimitris/Testing/12_11/1a.dat'): ##above error resolved when used this
    line = line.rstrip()
    columns = line.split()
    entry = [columns[0], columns[1], columns[4]]
    entry = "-".join(entry)
    try: #valid if have already seen combination of 1,2,5
        x = data_dict[entry].append(float(columns[7])) 
    except (KeyError): #KeyError the first time you see a combination of columns 1,2,5
        data_dict[entry] = [float(columns[7])]

for entry in data_dict:
    value = np.mean(data_dict[entry])   
    output = entry.split("-")
    output.append(str(value))
    output = "\t".join(output)
    print(output)

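My other problem now is producing the output in string format (or any format). After that, I believe I know how to save it and adjust the final format.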

np.savetxt('sorted_data.dat', sorted, fmt='%s', delimiter='\t') #Save the data
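
For the output step, one option is to collect the formatted lines in a list instead of printing them, then write the list out at the end. This is a minimal sketch, assuming data_dict maps "x-y-frequency" keys to lists of values as in the loop above. The function names and the file name averaged.dat are placeholders.

```python
import os
import tempfile


def format_rows(data_dict):
    """Turn {'x-y-freq': [values]} into tab-separated lines with the mean appended."""
    lines = []
    for entry, values in data_dict.items():
        mean = sum(values) / len(values)
        lines.append("\t".join(entry.split("-") + [str(mean)]))
    return lines


def save_rows(lines, path):
    # Write one line per group as plain text.
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


example = {"564645-7371810-8250": [0.0103]}
out_path = os.path.join(tempfile.gettempdir(), "averaged.dat")  # placeholder location
save_rows(format_rows(example), out_path)
```

Because the rows are already strings, writing them directly sidesteps the question of which array to hand to np.savetxt.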

I still have to figure out how to add the other columns; I am working on that too.
