python - What is a good way to filter, sort, and group data stored in a large text file

Question

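In a Monte Carlo simulation, I store a summary of each run in a data file, where each column contains either a parameter or one of the result values. So I end up with a large data file holding up to 40 columns of data, in which many of the rows have nothing to do with one another.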

Say, for example, that the file looks like this:

#param1    param2    result1    result2
1.0        1.0       3.14       6.28
1.0        2.0       6.28       12.56
...
2.0        1.0       1.14       2.28
2.0        2.0       2.28       4.56

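Since I often want to study how one of the results depends on one of the parameters, I need to group by the second parameter and sort by the first. I may also want to filter out rows based on any of the parameters.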

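I started writing my own class for this, but it turned out to be harder than one might expect. So my question is: is there a library that already does this? Or, since I am familiar with SQL, would it be hard to write an SQL backend for SQLAlchemy that lets me run simple SQL queries on my data? As far as I can tell, that would provide everything I need.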

Based on cravoori's answer (or at least on the one behind the link he/she posted), here is a nice, short solution to my problem:

#!/usr/bin/python2

import numpy as np
import sqlite3 as sql

# number of columns to read in
COLUMNS = 31

# read the file. Columns are fixed-width (always 18 characters); the first line holds the column names
data = np.genfromtxt('compare.dat',dtype=None,delimiter=18, autostrip=True,
                     names=True, usecols=range(COLUMNS), comments=None)

# connect to the database in memory
con = sql.connect(":memory:")

# create the table 'data' according to the column names
con.execute("create table data({0})".format(",".join(data.dtype.names)))

# insert the data into the table
con.executemany("insert into data values (%s)" % ",".join(['?']*COLUMNS),
                data.tolist())

# make some query and create a numpy array from the result
res = np.array(con.execute("select DOS_Exponent,Temperature,Mobility from data ORDER \
    BY DOS_Exponent,Temperature ASC").fetchall())

print(res)

score 2 · Accepted Answer

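Seeing that the data is delimited, one option is to import the file into an in-memory SQLite database via the csv module; an example is linked below. SQLite supports most SQL clauses.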

Importing data into an SQLite database
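
A minimal sketch of that idea, assuming the whitespace-separated sample layout from the question. Because the sample columns are padded with a varying number of spaces, this sketch splits each line with str.split() instead of csv.reader; for a comma- or tab-delimited file, csv.reader could take its place. The function name load_table and the table name runs are hypothetical, chosen only for illustration.

```python
import sqlite3


def load_table(lines, table="runs"):
    """Load whitespace-separated lines (header first) into an in-memory SQLite table."""
    rows = [line.split() for line in lines if line.strip()]
    # The header line starts with '#', so strip it from the first column name.
    header = [name.lstrip("#") for name in rows[0]]
    con = sqlite3.connect(":memory:")
    # Declaring the columns as REAL makes comparisons and sorting numeric.
    columns = ", ".join('"{0}" REAL'.format(name) for name in header)
    con.execute('CREATE TABLE "{0}" ({1})'.format(table, columns))
    placeholders = ", ".join("?" * len(header))
    con.executemany(
        'INSERT INTO "{0}" VALUES ({1})'.format(table, placeholders),
        ([float(value) for value in row] for row in rows[1:]),
    )
    return con


# Sample rows copied from the question (the elided "..." line is left out).
sample = [
    "#param1    param2    result1    result2",
    "1.0        1.0       3.14       6.28",
    "1.0        2.0       6.28       12.56",
    "2.0        1.0       1.14       2.28",
    "2.0        2.0       2.28       4.56",
]

con = load_table(sample)
# Filter on param2, then sort by param1 to see how result1 depends on it.
query = "SELECT param1, result1 FROM runs WHERE param2 = 2.0 ORDER BY param1"
result = con.execute(query).fetchall()
print(result)
```

With a real file, passing the open file object instead of the sample list should work the same way, since the function only iterates over lines.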

score 0 · Accepted Answer

Assuming only simple calculations are needed, an in-code approach might look something like the following lines:

# read the file and split each data line into its columns (the header line is skipped)
with open('list_filter_data.txt', mode='r') as f:
    lines = f.read().splitlines()
row_sets = [[float(c) for c in line.split()] for line in lines[1:]]

# get only rows whose param1 = 1.0
subset = [row for row in row_sets if row[0] == 1.0]
print(subset)
# get only rows whose param1 = 2.0
subset = [row for row in row_sets if row[0] == 2.0]
print(subset)
# average result1 where param2 = 2.0
selected = [row[2] for row in row_sets if row[1] == 2.0]
avg = sum(selected) / len(selected)
print(avg)
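
The question also asks for grouping by the second parameter and sorting by the first. In the same plain-Python style, one possible sketch uses sorted and itertools.groupby. Here row_sets is hard-coded from the question's sample rather than read from a file, and the variable names are illustrative only.

```python
from itertools import groupby
from operator import itemgetter

# Rows as [param1, param2, result1, result2], taken from the question's sample.
row_sets = [
    [1.0, 1.0, 3.14, 6.28],
    [1.0, 2.0, 6.28, 12.56],
    [2.0, 1.0, 1.14, 2.28],
    [2.0, 2.0, 2.28, 4.56],
]

# Sort by param2 first (the grouping key), then by param1 within each group.
ordered = sorted(row_sets, key=itemgetter(1, 0))

# groupby only merges adjacent rows, which is why the sort has to come first.
groups = {
    param2: [(row[0], row[2]) for row in rows]
    for param2, rows in groupby(ordered, key=itemgetter(1))
}

# Each group now lists (param1, result1) pairs in ascending param1 order.
for param2, points in groups.items():
    print(param2, points)
```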

score 0 · Accepted Answer

If your file is on the order of a few MB, you can easily read it into memory and solve the problem with the other answers.

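If the file is several hundred MB or several GB, you are better off using the lazy loading approach described here: Lazy Method for Reading Big File in Python?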

If the calculations you intend to do can be done row by row, these small chunks should be enough for whatever you need.
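
As a rough illustration of row-by-row processing, the sketch below streams rows through a generator and keeps only running totals in memory. The function names iter_rows and mean_by_group are hypothetical, and the sample lines stand in for a real file; an open file object could be passed instead, since it also yields one line at a time.

```python
from collections import defaultdict


def iter_rows(lines):
    """Yield one row of floats at a time, skipping blank and '#' header lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        yield [float(value) for value in line.split()]


def mean_by_group(rows, key_col, value_col):
    """Mean of value_col for each distinct key_col value, computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key_col]] += row[value_col]
        counts[row[key_col]] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Sample lines copied from the question (the elided "..." line is left out).
sample_lines = [
    "#param1    param2    result1    result2",
    "1.0        1.0       3.14       6.28",
    "1.0        2.0       6.28       12.56",
    "2.0        1.0       1.14       2.28",
    "2.0        2.0       2.28       4.56",
]

# Mean of result1 (column 2) for each value of param2 (column 1).
means = mean_by_group(iter_rows(sample_lines), key_col=1, value_col=2)
print(means)
```

Only the per-group totals and counts are held in memory, so the memory use depends on the number of distinct groups rather than on the file size. Sorting the whole file, by contrast, does not fit this one-pass pattern.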

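Otherwise, just create an SQL table with columns C1, C2, ..., CN for your parameters and results, assuming the relationship between parameters and results is one-to-one. Then use the Python database access APIs to write SQL statements and analyze whatever you need.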
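
A small sketch of what such queries could look like with Python's built-in sqlite3 module. The table name results, the mapping of C1..C4 to the question's param1, param2, result1, result2, and the hard-coded rows are all assumptions made for illustration.

```python
import sqlite3

N = 4  # number of columns in this toy example; the question mentions up to 40
rows = [
    (1.0, 1.0, 3.14, 6.28),
    (1.0, 2.0, 6.28, 12.56),
    (2.0, 1.0, 1.14, 2.28),
    (2.0, 2.0, 2.28, 4.56),
]

con = sqlite3.connect(":memory:")
# Generic column names C1..CN, all stored as REAL.
names = ["C{0}".format(i) for i in range(1, N + 1)]
con.execute("CREATE TABLE results ({0})".format(", ".join(n + " REAL" for n in names)))
con.executemany("INSERT INTO results VALUES ({0})".format(", ".join("?" * N)), rows)

# Average of C3 (result1) for each value of C2 (param2), keeping only rows with C1 >= 1.0.
averages = con.execute(
    "SELECT C2, AVG(C3), COUNT(*) FROM results "
    "WHERE C1 >= 1.0 GROUP BY C2 ORDER BY C2"
).fetchall()
print(averages)
```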

On the other hand, an Excel spreadsheet might also solve your problem.
