python - Selecting unique random values from the third column of an array in Python

Question

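I have a 41000x3 numpy array, which I call "sortedlist" in the function below. The third column has a bunch of values, some of which are duplicates and some of which are not. I'd like to take a sample of unique values (no duplicates) from the third column, which is sortedlist[:,2]. I think I can do this easily with numpy.random.sample(sortedlist[:,2], sample_size). The problem is that I'd like to return not only those values but all three columns, where the last column holds the randomly chosen values that I get from numpy.random.sample.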

Edit: By unique values I mean that I want to choose random values which appear only once. So if I had an array:

array = [[0, 6, 2]
         [5, 3, 9]
         [3, 7, 1]
         [5, 3, 2]
         [3, 1, 1]
         [5, 2, 8]]

and I wanted to choose 4 values of the third column, I want to get something like new_array_1:

new_array_1 = [[5, 3, 9]
               [3, 7, 1]
               [5, 3, 2]
               [5, 2, 8]]

But I don't want something like new_array_2, where two values in the third column are the same:

new_array_2 = [[5, 3, 9]
               [3, 7, 1]
               [5, 3, 2]
               [3, 1, 1]]

I have code that chooses random rows, but without the criterion that the values in the third column must not repeat.

sample_size = 100

rand_sortedlist = sortedlist[np.random.randint(len(sortedlist), size=sample_size), :]

I'm trying to enforce this criterion by doing something like this:

    array_index = where( array[:,2] == sample(SelectionWeight, sample_size) )

but I'm not sure if I'm on the right track. Any help would be greatly appreciated!

score 0 · Accepted Answer

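I believe this will do what you want. Note that the running time will almost certainly be dominated by whatever method you use to generate your random numbers. (An exception is if the dataset is very large but you only need a small number of rows, in which case very few random numbers need to be drawn.) So I'm not sure this will run much faster than a pure Python method would.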

import random
import numpy as np

# arrayify your list of lists
# please don't use `array` as a variable name!
a = np.asarray(arry)

# sort by the third column ... always the first step for efficiency
a2 = a[np.argsort(a[:, 2])]

# identify rows whose 3rd-column value equals that of the next row
# Note this has length one less than a2
duplicate_rows = np.diff(a2[:, 2]) == 0

# if duplicate_rows[N], then we want to remove row N and N+1
keep_mask = np.ones(len(a2), dtype=bool) # all True
keep_mask[:-1][duplicate_rows] = False # remove row N
keep_mask[1:][duplicate_rows] = False # remove row N + 1

# now actually slice the array
a3 = a2[keep_mask]

# select rows from a3 using your preferred random number generator
# I actually prefer `random` over numpy.random for sampling w/o replacement
result = a3[random.sample(range(len(a3)), DESIRED_NUMBER_OF_ROWS)]
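The same filtering step (keep only rows whose third-column value occurs exactly once) can also be sketched with `np.unique`, which avoids the explicit sort-and-diff bookkeeping. This is a minimal sketch, not part of the original answer: the helper names `rows_with_singleton_values` and `sample_rows` are made up for illustration, and the sample array is the one from the question.

```python
import numpy as np

def rows_with_singleton_values(a, col=2):
    """Return the rows of `a` whose value in column `col` occurs exactly once."""
    a = np.asarray(a)
    # counts[inverse[i]] is the number of times a[i, col] occurs in the column
    _, inverse, counts = np.unique(a[:, col], return_inverse=True, return_counts=True)
    return a[counts[inverse] == 1]

def sample_rows(a, n, rng=None):
    """Pick n distinct rows of `a` at random (sampling without replacement)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(a), size=n, replace=False)
    return a[idx]

# the example array from the question
arr = np.array([[0, 6, 2], [5, 3, 9], [3, 7, 1], [5, 3, 2], [3, 1, 1], [5, 2, 8]])
singles = rows_with_singleton_values(arr)  # only 9 and 8 occur once in column 3
picked = sample_rows(singles, 2)
```

Unlike the sorted version, this keeps the surviving rows in their original order. `rng.choice` raises a `ValueError` if more rows are requested than remain after filtering.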

score 0 · Accepted Answer

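I can't think of a clever numpythonic way to do this which doesn't involve multiple passes over the data. (Sometimes numpy is so much faster than pure Python that this is still the fastest way to go, but it never feels right.)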

In pure Python, I'd do something like

import random

def draw_unique(vec, n):
    # group indices by value
    d = {}
    for i, x in enumerate(vec):
        d.setdefault(x, []).append(i)

    # pick n distinct values, then one random row index for each
    drawn = [random.choice(d[k]) for k in random.sample(list(d), n)]
    return drawn

This gives

>>> a = np.random.randint(0, 10, (41000, 3))
>>> drawn = draw_unique(a[:,2], 3)
>>> drawn
[4219, 6745, 25670]
>>> a[drawn]
array([[5, 6, 0],
       [8, 8, 1],
       [5, 8, 3]])
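To check that a draw meets the requirement from the question (no repeated values in the third column), a small helper can compare the number of distinct values against the number of rows. The name `third_column_is_unique` is hypothetical; the two test arrays are `new_array_1` and `new_array_2` from the question.

```python
import numpy as np

def third_column_is_unique(rows):
    """True if no value in the third column of `rows` appears more than once."""
    col = np.asarray(rows)[:, 2]
    return len(np.unique(col)) == len(col)

new_array_1 = np.array([[5, 3, 9], [3, 7, 1], [5, 3, 2], [5, 2, 8]])
new_array_2 = np.array([[5, 3, 9], [3, 7, 1], [5, 3, 2], [3, 1, 1]])

ok_1 = third_column_is_unique(new_array_1)  # third column is 9, 1, 2, 8
ok_2 = third_column_is_unique(new_array_2)  # third column is 9, 1, 2, 1
```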

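I can think of some tricks with np.bincount and scipy.stats.rankdata, but they hurt my head, and there always winds up being one step at the end that I can't see how to vectorize. And if I'm not vectorizing the whole thing, I might as well use the above, which at least is simple.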
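One possible way to vectorize the whole thing, offered as a sketch rather than as the answer's own method, is to shuffle the row indices first and then let `np.unique(..., return_index=True)` pick the first occurrence of each value in the shuffled order. That first occurrence is a random member of each group, so it plays the role of `random.choice(d[k])` above. The function name `draw_unique_vectorized` is an assumption for illustration.

```python
import numpy as np

def draw_unique_vectorized(vec, n, rng=None):
    """Return n row indices into `vec` whose values are pairwise distinct."""
    rng = np.random.default_rng() if rng is None else rng
    vec = np.asarray(vec)
    # shuffle the indices so that "first occurrence" means "random occurrence"
    perm = rng.permutation(len(vec))
    # first[k] is the position, in shuffled order, of the k-th distinct value
    _, first = np.unique(vec[perm], return_index=True)
    candidates = perm[first]  # one original row index per distinct value
    # choose n of the distinct values without replacement
    return rng.choice(candidates, size=n, replace=False)

a = np.random.default_rng(0).integers(0, 10, size=(41000, 3))
drawn = draw_unique_vectorized(a[:, 2], 3)
rows = a[drawn]
```

This still makes several passes over the data (a permutation plus the sort inside `np.unique`), so whether it beats the dictionary version depends on the array size. `rng.choice` raises a `ValueError` if `n` exceeds the number of distinct values.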
