
This operation needs to be applied as quickly as possible to real arrays containing millions of elements. Here is the simple version of the problem.

So, I have an array of random unique integers (typically millions of elements):

totalIDs = [5,4,3,1,2,9,7,6,8 ...]

I also have other arrays of unique integers (typically tens of thousands of elements) from which I can create masks:

subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...

I can do this with numpy:

mask = np.in1d(totalIDs, subsampleIDs, assume_unique=True)

I can then use the mask to extract the information I want from another array (say, column 0 contains the one I want):

variables = allvariables[mask][:, 0]
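
As a side note, a single fancy index avoids the intermediate 2-D copy that allvariables[mask] materializes:

variables = allvariables[mask, 0]    # equivalent result, one fewer temporary array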

Now, given that the IDs are unique in both arrays, is there any way to speed this up significantly? Building the mask that matches a few thousand points (subsampleIDs) against millions of IDs (totalIDs) takes a long time.

I thought I might go through it once and write out a binary file of the indices (to speed up future searches):

index = np.empty(len(totalIDs), dtype=int)   # one group label per ID
for i in range(0, 3):
    # subsampleIDs stands for subsampleIDs1, subsampleIDs2, ... on each pass
    mask = np.in1d(totalIDs, subsampleIDs, assume_unique=True)
    index[mask] = i

where X runs over the subsampleIDsX arrays. Then I could do:

for row in range(len(index)):
    if index[row] == X:                  # first row belonging to group X
        rowmatch = row
        break

variable = allvariables[rowmatch:rowmatch + len(subsampleIDs), 0]

Right? But this is also slow, because the loop needs a condition to find the first match. Is there a faster way to find where a number first appears in a sorted array, so that the condition doesn't slow the loop down?
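
One relevant tool here: np.searchsorted does this first-occurrence lookup with a binary search, so the per-element condition disappears entirely. A minimal sketch, assuming index has been sorted so that rows belonging to the same subsample are contiguous, with X the group label as above:

import numpy as np

# Binary search for the first row whose group label is X: O(log N),
# no Python-level loop or branch.
rowmatch = np.searchsorted(index, X, side='left')

# The computed index can also be written out in binary for future runs:
np.save('index.npy', index)    # reload later with index = np.load('index.npy')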


2 Answers


I suggest you use a DataFrame in Pandas. The index of the DataFrame is totalIDs, and you can select the rows for subsampleIDs with: df.ix[subsampleIDs].

Create some test data first:

import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))

subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)

Then convert your data into a DataFrame:

import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs) 
df.ix[subsampleIDs]

A DataFrame uses a hash table to map index labels to their locations, so this lookup is very fast.
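
A note for anyone on a recent pandas: .ix has since been removed, and the equivalent label-based lookup is .loc. A minimal sketch under the same setup:

df.loc[subsampleIDs]          # same hash-based label lookup as df.ix
df.loc[subsampleIDs, "v1"]    # or select a single column at the same time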

Answered 2013-03-08T02:36:16.073

Usually this kind of indexing is best performed with a database (with a proper index on the column).

Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d that avoids sorting everything. The numpy implementation of in1d (at least in the version I have installed) is fairly simple, and should be easy to copy and modify.
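
A minimal sketch of that idea (my own sketch, not the numpy source): argsort totalIDs once up front, then answer every membership query with np.searchsorted so nothing is re-sorted:

import numpy as np

# One-time preprocessing: O(N log N)
order = np.argsort(totalIDs)
sortedIDs = totalIDs[order]

def in1d_presorted(subsampleIDs):
    # Each query costs O(M log N) instead of sorting everything again.
    pos = np.searchsorted(sortedIDs, subsampleIDs)
    pos = np.clip(pos, 0, len(sortedIDs) - 1)
    hits = sortedIDs[pos] == subsampleIDs        # drop IDs not actually present
    mask = np.zeros(len(totalIDs), dtype=bool)
    mask[order[pos[hits]]] = True                # map back to the original order
    return mask

Since both sides are unique (assume_unique=True above), no de-duplication step is needed.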

EDIT:

Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs, and M the size of subsampleIDs (times a constant you can tune by changing the number of buckets). Here too, you can split totalIDs into buckets only once, which gives you a nifty O(N+M1+M2+...).

Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python
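
A plain Python dict gives the same O(N + M) behavior as the bucketing idea (hashing is essentially bucketing with a computed bucket index); a hedged sketch, not a drop-in numpy replacement:

import numpy as np

# Build an ID -> row-position table once: O(N)
position = {id_: row for row, id_ in enumerate(totalIDs)}

def mask_for(subsampleIDs):
    # Each mask then costs O(M) dict lookups.
    mask = np.zeros(len(totalIDs), dtype=bool)
    for id_ in subsampleIDs:
        mask[position[id_]] = True
    return mask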

Answered 2013-03-07T19:36:45.603