python - Best way to represent a training set that needs to be split

Question

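A training set consists of a set of samples and a set of labels, one label per sample. In my case the samples are vectors and the labels are scalars. I am using NumPy for this. Consider this example: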

samples = np.array([[1,0],[0.2,0.5], [0.3,0.8]])
labels = np.array([1,0,0])

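Now I have to split the training set into two partitions, shuffling the elements. This raises a problem: I lose the correspondence between samples and labels. How can I solve this?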

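Performance is critical in my project, so I would rather not build a permutation vector. Instead I am looking for a way to bind each label to its sample. My current solution is to use the last column of the samples array as the label, for example: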

samples_and_labels = np.array([[1,0,1],[0.2,0.5,0], [0.3,0.8,0]])

Is this the fastest solution in my case, or is there a better one? For example, creating pairs?

score 1 · Accepted Answer

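Mixing index and floating-point data types in one array makes me uneasy. When you say you split the training set, is the split completely random? If so, I would use a random permutation vector. I don't think your solution would be any faster (even setting aside my reservations about the data types), because you are still allocating memory when you create the samples_and_labels array.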

You could do something like this (assuming len(samples) is even, to keep the illustration simple):

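import numpy as np

n = len(samples) // 2  # assumes len(samples) is even
# boolean mask: n True entries followed by n False entries
ind = np.hstack((np.ones(n, dtype=bool), np.zeros(n, dtype=bool)))
# shuffles the mask in place, no new array is allocated
np.random.shuffle(ind)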

Then you can do

samples_left, samples_right = samples[ind], samples[~ind]
labels_left, labels_right = labels[ind], labels[~ind]

and call

np.random.shuffle(ind)

whenever you need a new split.
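
The pieces above can be put together as a small self-contained sketch. The function name split_by_mask, the four-row example data and the seeded generator are assumptions made for illustration, not part of the original answer. It also uses np.random.default_rng instead of the global np.random state, and it copes with an odd number of samples by giving the extra row to the second partition.

```python
import numpy as np

def split_by_mask(samples, labels, rng):
    """Split samples and labels into two random parts using one shared boolean mask.

    Hypothetical helper: the same mask indexes both arrays, so every sample
    keeps its label.
    """
    n = len(samples)
    mask = np.zeros(n, dtype=bool)
    mask[: n // 2] = True   # n // 2 rows go to the first part, the rest to the second
    rng.shuffle(mask)       # shuffle the mask in place
    return (samples[mask], labels[mask]), (samples[~mask], labels[~mask])

# Illustrative data (assumed), with an even number of rows
samples = np.array([[1, 0], [0.2, 0.5], [0.3, 0.8], [0.7, 0.1]])
labels = np.array([1, 0, 0, 1])

rng = np.random.default_rng(0)  # seeded only so the example is repeatable
(left_x, left_y), (right_x, right_y) = split_by_mask(samples, labels, rng)
```

Note that boolean indexing returns copies, so memory for the two partitions is still allocated on every split. Only the mask itself is reused.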

score 0 · Accepted Answer

Without NumPy, though it may not be as fast. You could try importing "_random" instead of "random" for better shuffle performance.

import random

samples = [[1,0],[0.2,0.5], [0.3,0.8]]
labels = [1,0,0]

print(samples, '\n', labels)

z = list(zip(samples, labels))
random.shuffle(z)

samples, labels = zip(*z)

print(samples, '\n', labels)
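
For comparison, the same paired shuffle can be sketched with NumPy arrays by indexing both arrays with one permutation of the row indices. This is only an illustration of the permutation-vector idea mentioned in the thread, not a benchmarked recommendation. The variable names and the fixed seed are assumptions.

```python
import numpy as np

samples = np.array([[1, 0], [0.2, 0.5], [0.3, 0.8]])
labels = np.array([1, 0, 0])

rng = np.random.default_rng(0)         # seeded only so the example is repeatable
perm = rng.permutation(len(samples))   # random ordering of the row indices

# Indexing both arrays with the same permutation keeps pairs aligned
shuffled_samples = samples[perm]
shuffled_labels = labels[perm]

# Cut the shuffled data into two partitions
half = len(samples) // 2
first_samples, second_samples = shuffled_samples[:half], shuffled_samples[half:]
first_labels, second_labels = shuffled_labels[:half], shuffled_labels[half:]
```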
