python - Searching a 2D array in Python

Question

I want to use Python to retrieve specific rows from a large data set (9M rows, 1.4 GB), given two or more parameters.

For example, from this data set:

ID1 2   10  2   2   1   2   2   2   2   2   1

ID2 10  12  2   2   2   2   2   2   2   1   2

ID3 2   22  0   1   0   0   0   0   0   1   2

ID4 14  45  0   0   0   0   1   0   0   1   1

ID5 2   8   1   1   1   1   1   1   1   1   2

Given these example parameters:

the second column must equal 2, and
the third column must be in the range 4 to 15

I should get:

ID1 2   10  2   2   1   2   2   2   2   2   1

ID5 2   8   1   1   1   1   1   1   1   1   2

The problem is that I don't know how to perform these operations efficiently on a 2D array in Python.

Here is what I have tried:

# set conditions
i = 2
start_range = 4
end_range = 15

# Load the whole file into memory ('somefile' is a placeholder path)
with open('somefile') as file:
    line_list = file.readlines()

# Iterate over the loaded lines and split each one into columns
for line in line_list:
    data = line.strip().split()
    # Columns are read as strings, so convert them before comparing
    if int(data[1]) == i and start_range <= int(data[2]) <= end_range:
        print(data)

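I want to run this process many times, and the way I am doing it is really slow, even with the data file loaded into memory.

I am considering NumPy arrays, but I don't know how to retrieve rows that match given conditions.

Thanks for your help!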

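Update:

As suggested, I used a relational database system. I chose Sqlite3 because it is very easy to use and quick to deploy.

My file was loaded in about 4 minutes using the import feature in sqlite3.

I created an index on the second and third columns to speed up retrieval.

The queries are run from Python using the "sqlite3" module.

That's it, and it's much faster!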
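
The indexing and query steps in the update can be sketched roughly as follows. This is only an illustration under assumptions: the table name `data`, the column names `id`, `col2`, `col3`, the index name, and the helper `find_rows` are hypothetical, since the post does not give the actual schema, and only the first three columns of the sample rows are loaded into an in-memory database.

```python
import sqlite3

# Hypothetical schema: the original post does not show table or column names.
# Only the first three columns of the sample data are used here.
rows = [
    ("ID1", 2, 10),
    ("ID2", 10, 12),
    ("ID3", 2, 22),
    ("ID4", 14, 45),
    ("ID5", 2, 8),
]

conn = sqlite3.connect(":memory:")  # the real data set would live in a database file
conn.execute("CREATE TABLE data (id TEXT, col2 INTEGER, col3 INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# Composite index on the two columns used for filtering.
conn.execute("CREATE INDEX idx_col2_col3 ON data (col2, col3)")


def find_rows(conn, value, low, high):
    """Return rows whose col2 equals value and whose col3 lies in [low, high]."""
    query = (
        "SELECT id, col2, col3 FROM data "
        "WHERE col2 = ? AND col3 BETWEEN ? AND ? "
        "ORDER BY id"
    )
    return conn.execute(query, (value, low, high)).fetchall()


matches = find_rows(conn, 2, 4, 15)
print(matches)
```

Passing the values as `?` parameters lets the same query be reused for each search with different conditions.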

score 1 · Accepted Answer

I'd go with almost what you have (untested):

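with open('somefile') as fin:
    # Lazily split each line into columns
    rows = (line.split() for line in fin)
    # Keep rows where column 2 equals 2 and column 3 is between 4 and 15
    take = (row for row in rows if int(row[1]) == 2 and 4 <= int(row[2]) <= 15)
    # data = list(take)
    for row in take:
        pass # do something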
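
Since the question also asks about NumPy, here is a minimal sketch of selecting rows with a boolean mask. It is not part of the answer above; it assumes the numeric columns fit in memory as one integer array, and the names `ids`, `values`, and `mask` are illustrative.

```python
import numpy as np

# Sample rows from the question; the real file would be parsed the same way.
lines = [
    "ID1 2 10 2 2 1 2 2 2 2 2 1",
    "ID2 10 12 2 2 2 2 2 2 2 1 2",
    "ID3 2 22 0 1 0 0 0 0 0 1 2",
    "ID4 14 45 0 0 0 0 1 0 0 1 1",
    "ID5 2 8 1 1 1 1 1 1 1 1 2",
]

# Keep the string IDs and the numeric columns in separate arrays.
ids = np.array([line.split()[0] for line in lines])
values = np.array([[int(x) for x in line.split()[1:]] for line in lines])

# values[:, 0] is the file's second column, values[:, 1] its third column.
mask = (values[:, 0] == 2) & (values[:, 1] >= 4) & (values[:, 1] <= 15)

matching_ids = ids[mask]
matching_rows = values[mask]
print(matching_ids)
print(matching_rows)
```

Once the array is built, each new search only needs a new mask, so the file does not have to be re-parsed for every set of conditions.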
