python - Subsampling a matrix in Python

Question

I have a text file that lists pairs, for example:

10,1
2,7
3,1
10,1

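I then turn it into a symmetric matrix, so the (1,10) entry is the number of times the pair (1,10) appears in the list. I now want to subsample this matrix. By subsample I mean: I want to produce the matrix that would result from using only a random 30% of the lines of the original text file. So in this example, if I threw out 70% of the text file, the pair (1,10) might appear only once rather than twice, and the (1,10) entry of the matrix would then be 1 rather than 2.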

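If I actually had the original text file, this would be easy: just use random.sample to pick 30% of the lines in the file. But if I only have the matrix, how can I randomly throw out 70% of the data?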
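For reference, the file-based baseline described in the question can be sketched as follows. This is only an illustration: the helper name `build_matrix` is hypothetical, and it assumes the labels in the file can be used directly as 0-based matrix indices.

```python
import random

def build_matrix(pairs, n):
    """Return an n x n symmetric count matrix for index pairs (a, b)."""
    matrix = [[0] * n for _ in range(n)]
    for a, b in pairs:
        matrix[a][b] += 1
        if a != b:
            # Mirror off-diagonal pairs so the matrix stays symmetric
            matrix[b][a] += 1
    return matrix

# The example pairs from the question; labels are used as indices,
# so 11 rows and columns are needed to hold label 10.
pairs = [(10, 1), (2, 7), (3, 1), (10, 1)]
full = build_matrix(pairs, 11)

# Keep a random 30% of the lines and rebuild the matrix from them.
kept = random.sample(pairs, int(len(pairs) * 0.3))
sub = build_matrix(kept, 11)
```

Any matrix-only method should ideally produce results distributed like `sub`.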

score 1 · Accepted Answer

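Unfortunately, examples two and three do not follow the correct distribution implied by how often each line occurs in the original file.

Instead of removing tuples from the original data, you can randomly remove counts from the matrix. To do so, generate random indices and decrement the corresponding count. Take care not to decrement a zero count; generate a new index instead. Repeat until the total number of counted tuples is down to 30%. Basically, this could look like this: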

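import random

# overall_amount: total number of counted pairs; n: matrix dimension
amount_to_decrease = int(0.7 * overall_amount)
decreased = 0

while decreased < amount_to_decrease:
    # randrange excludes n, so both indices stay in bounds
    x = random.randrange(n)
    y = random.randrange(n)
    if matrix[x][y] > 0:
        matrix[x][y] -= 1
        decreased += 1
        if x != y:
            matrix[y][x] -= 1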

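~~If your matrix is well filled, this should work fine. If not,~~ you may want to recreate a list of tuples from the matrix and then select a random subset of it. After that, recreate your matrix from the remaining tuples: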

import random

tuples = []
for y in range(n):
    for x in range(y+1):
        for _ in range(matrix[x][y]):
            tuples.append((x,y))
# keep a random 30% of the tuples
remaining = random.sample(tuples, int(len(tuples) * 0.3))
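The final step mentioned above, rebuilding the matrix from the remaining tuples, is not shown in the answer. A minimal sketch of the whole round trip might look like this; the function name `subsample_via_tuples` and the example matrix are made up for illustration, and the matrix is assumed to be a symmetric list of lists.

```python
import random

def subsample_via_tuples(matrix, keep_fraction=0.3):
    """Expand counts into tuples, sample them, and rebuild the matrix."""
    n = len(matrix)
    tuples = []
    # Visit each unordered pair once (x <= y) so mirrored counts are not doubled
    for y in range(n):
        for x in range(y + 1):
            tuples.extend([(x, y)] * matrix[x][y])
    remaining = random.sample(tuples, int(len(tuples) * keep_fraction))
    result = [[0] * n for _ in range(n)]
    for x, y in remaining:
        result[x][y] += 1
        if x != y:
            result[y][x] += 1
    return result

# Hypothetical symmetric counts: 10 pairs in total
example = [[0, 2, 1],
           [2, 0, 7],
           [1, 7, 0]]
sampled = subsample_via_tuples(example)
```

Because every original line becomes exactly one tuple, sampling the tuples is equivalent to sampling lines of the file, at the cost of memory proportional to the total count.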

~~Alternatively, you can do a combination: first find all indices that are nonzero, then sample from those indices to decrement counts:~~

valid_indices = []
for y in range(n):
    for x in range(y+1):
        # only nonzero cells can be decremented
        if matrix[x][y] > 0:
            valid_indices.append((x,y))

amount_to_decrease = 0.7 * overall_amount
decreased = 0
while decreased < amount_to_decrease and valid_indices:
    x,y = random.choice(valid_indices)
    matrix[x][y] -= 1
    decreased += 1
    if x != y:
        matrix[y][x] -= 1
    if matrix[x][y] == 0:
        valid_indices.remove((x,y))

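There is another approach that uses the correct probabilities but may not give you an exact reduction. The idea is to set a probability of keeping each line/count. If your goal is a reduction to 30%, that probability is 0.3. You can then go over the matrix and decide for each count whether it is kept.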

keep_chance = 0.3
for y in range(n):
    for x in range(y+1):
        for _ in range(matrix[x][y]):
            if random.random() > keep_chance:
                matrix[x][y] -= 1
                if x != y:
                    matrix[y][x]-=1
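Keeping each of a cell's counts independently with probability 0.3 is the same as drawing the new count from a binomial distribution, so the loop above can be vectorized. The following is a sketch under the assumption that the matrix is a symmetric NumPy integer array; the function name `thin_counts` is hypothetical.

```python
import numpy as np

def thin_counts(matrix, keep_chance=0.3, seed=None):
    """Keep each counted pair independently with probability keep_chance."""
    rng = np.random.default_rng(seed)
    mat = np.asarray(matrix, dtype=np.int64)
    # Work on the upper triangle so each unordered pair is drawn once
    upper = np.triu(mat)
    # Number of survivors per cell: Binomial(count, keep_chance)
    kept = rng.binomial(upper, keep_chance)
    # Mirror the strict upper triangle to restore symmetry
    return kept + np.triu(kept, k=1).T

# Hypothetical symmetric counts
example = np.array([[0, 2, 1],
                    [2, 0, 7],
                    [1, 7, 0]])
thinned = thin_counts(example, keep_chance=0.3, seed=0)
```

As with the loop version, the total number of kept pairs is random and only equals 30% on average.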

score 1 · Accepted Answer

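I guess the best approach depends on where your data is large:

Do you have a huge matrix with mostly small counts? Or
do you have a moderately sized matrix with huge counts?

Here is a solution suited to the second case, though it works fine in the first case as well.

Basically, the fact that the counts happen to sit in a 2D matrix is not that important: this is essentially a problem of sampling from a binned population. So what we can do is extract the bins directly and forget about the matrix for a moment: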

import numpy as np
import random

# Input counts matrix
mat = np.array([
    [5, 5, 2],
    [1, 1, 3],
    [6, 0, 4]
], dtype=np.int64)

# Build a list of (row,col) pairs, and a list of counts
keys, counts = zip(*[
    ((i,j), mat[i,j])
        for i in range(mat.shape[0])
        for j in range(mat.shape[1])
        if mat[i,j] > 0
])

Then sample from these bins using an array of cumulative counts:

# Make the cumulative counts array
counts = np.array(counts, dtype=np.int64)
sum_counts = np.cumsum(counts)

# Decide how many counts to include in the sample
frac_select = 0.30
count_select = int(sum_counts[-1] * frac_select)

# Choose unique counts
ind_select = sorted(random.sample(range(int(sum_counts[-1])), count_select))

# A vector to hold the new counts
out_counts = np.zeros(counts.shape, dtype=np.int64)

# Perform basically the merge step of merge-sort, finding where
# the counts land in the cumulative array
i = 0
j = 0
while i<len(sum_counts) and j<len(ind_select):
    if ind_select[j] < sum_counts[i]:
        j += 1
        out_counts[i] += 1
    else:
        i += 1

# Rebuild the matrix using the `keys` list from before
out_mat = np.zeros(mat.shape, dtype=np.int64)
for i in range(len(out_counts)):
    out_mat[keys[i]] = out_counts[i]

The sampled counts are now in out_mat.
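The hand-written merge loop can also be expressed with NumPy primitives. The sketch below is an alternative formulation, not the answer's own code: it uses `np.searchsorted` on the cumulative counts to find the bin of each selected index and `np.bincount` to tally them. The `counts` values are the nonzero entries of the example matrix above, and the fixed seed is only there for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonzero counts of the example matrix, in row-major order
counts = np.array([5, 5, 2, 1, 1, 3, 6, 4], dtype=np.int64)
sum_counts = np.cumsum(counts)
total = int(sum_counts[-1])
count_select = int(total * 0.30)

# Draw distinct item indices from 0 .. total-1 (sampling without replacement)
ind_select = rng.choice(total, size=count_select, replace=False)

# Item k belongs to the first bin whose cumulative count exceeds k
bins = np.searchsorted(sum_counts, ind_select, side="right")

# Tally how many selected items fell into each bin
out_counts = np.bincount(bins, minlength=len(counts))
```

The resulting `out_counts` can be written back into the matrix with the same `keys` list as before. Sorting the selected indices is no longer necessary in this formulation.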

score 0 · Accepted Answer

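Assuming the pairs 1,10 and 10,1 are distinct, mat[1][10] is not necessarily equal to mat[10][1] (if they are not distinct, see further below).

First compute the sum of all values in the matrix.

Let this sum be S. It equals the number of lines in the file.

Let x and y be the dimensions of the matrix.

Now loop n from 0 to [70% of S]:

Pick a random integer between 1 and x. Let it be j.
Pick a random integer between 1 and y. Let it be k.
If mat[j][k] > 0, decrement mat[j][k] and do n++.

Since you increment a single value in the matrix for each line of the file, randomly decrementing a positive value in the matrix is the same as dropping a line from the file.

If 10,1 is the same as 1,10, you do not need half of the matrix, so you can change the algorithm like this:

Loop n from 0 to [70% of S]:

Pick a random integer between 1 and x. Let it be j.
Pick a random integer between 1 and j. Let it be k.
If mat[j][k] > 0, decrement mat[j][k] and do n++.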
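The triangular variant of these steps could be sketched in Python as follows. It uses 0-based indices instead of the 1-based ones in the description, and it assumes the counts are stored only in the lower triangle, so that S is computed over that triangle and each pair is counted once. The function name `drop_from_triangle` and the example data are hypothetical.

```python
import random

def drop_from_triangle(mat, drop_fraction=0.7):
    """Decrement random positive cells of the lower triangle in place."""
    n = len(mat)
    # S: total count over the lower triangle (k <= j)
    total = sum(mat[j][k] for j in range(n) for k in range(j + 1))
    to_drop = int(total * drop_fraction)
    dropped = 0
    while dropped < to_drop:
        j = random.randrange(n)       # row index, 0 .. n-1
        k = random.randrange(j + 1)   # column index, 0 .. j
        if mat[j][k] > 0:
            mat[j][k] -= 1
            dropped += 1              # only successful decrements count
    return mat

# Hypothetical lower-triangular counts: 15 pairs in total
lower = [[1, 0, 0],
         [2, 3, 0],
         [4, 0, 5]]
drop_from_triangle(lower)
```

Note that this mirrors the steps as written: cells are picked without regard to how large their counts are, so it may not reproduce the distribution obtained by dropping lines from the file.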
