objective-c - 最优解：从数据集中随机抽取项目样本

Question

所以我最近有一个面试问题，我想知道最佳解决方案是什么。代码在 Objective-c 中。

假设我们有一个非常大的数据集，我们希望从中获取随机项目样本来测试新工具。与其担心访问的细节，让我们假设系统提供了这些东西：

// Return a random number from the set 0, 1, 2, ..., n-2, n-1.
int Rand(int n);

// Interface to implementations other people write.
@interface Dataset : NSObject

// YES when there is no more data.
- (BOOL)endOfData;

// Get the next element and move forward.
- (NSString*)getNext;

@end


// This function reads elements from |input| until the end, and
// returns an array of |k| randomly-selected elements.
- (NSArray*)getSamples:(unsigned)k from:(Dataset*)input
{
  // Describe how this works.
}

编辑：所以你应该从给定的数组中随机选择项目。因此，如果 k = 5，那么我想从数据集中随机选择 5 个元素并返回这些项目的数组。数据集中的每个元素都必须有同等的机会被选中。

score 1 · Accepted Answer

这似乎是使用Reservoir Sampling的好时机。以下是针对此用例的 Objective-C 改编版本：

NSMutableArray* result = [[NSMutableArray alloc] initWithCapacity:k];

int i,j;

for (i = 0; i < k; i++) {
    [result setObject:[input getNext] atIndexedSubscript:i];
}

for (i = k; ![input endOfData]; i++) {
    j = Rand(i);

    NSString* next = [input getNext];

    if (j < k) {
        [result setObject:next atIndexedSubscript:j];
    }
}

return result;

上面的代码不是最有效的水库采样算法，因为它为经过 index 处的条目的水库的每个条目生成一个随机数k。在一般类别“油藏采样”下存在稍微复杂的算法。这是关于名为“算法 Z”的算法的有趣读物。我会很好奇人们是否也能找到关于水库采样的更新文献，因为这篇文章发表于 1985 年。

score 0 · Accepted Answer

有多种方法可以做到这一点，第一种方法：

1. use input parameter k to dynamically allocate an array of numbers
    unsigned * numsArray = (unsigned *)malloc(sizeof(unsigned) * k);

2. run a loop that gets k random numbers and stores them into the numsArray (must be careful here to check each new random to see if we have gotten it before, and if we have, get another random, etc...)

3. sort numsArray

4. run a loop beginning at the beginning of DataSet with your own incrementing counter dataCount and another counter numsCount both beginning at 0.  whenever dataCount is equal to numsArray[numsCount], grab the current data object and add it to your newly created random list then increment numsCount.

5. The loop in step 4 can end when either numsCount > k or when dataCount reaches the end of the dataset.

6. The only other step that may need to be added here is before any of this to use the next command of the object type to count how large the dataset is to be able to bound your random numbers and check to make sure k is less than or equal to that.

第二种方法是遍历实际列表 MULTIPLE 次。

// one must assume that once we get to the end, we can start over within the set again
1. run a while loop that checks for endOfData
    a. count up a count variable that is initialized to 0

2. run a loop from 0 through k-1
    a. generate a random number that you constrain to the list size
    b. run a loop that moves through the dataset until it hits the rand element
    c. compare that element with all other elements in your new list to make sure it isnt already in your new list
    d. store the element into your new list

可能有一些方法可以通过存储当前列表位置来加速第二种方法，这样，如果您生成一个超过当前指针的随机数，您不必再次移动整个列表以返回元素 0，然后返回您希望检索的元素。

一种潜在的第三种方法可能是：

1. run a loop from 0 through k-1
    a. generate a random
    b. use the generated random as a skip count, move skip count objects through the list
    c. store the current item from the list into your new list

第三种方法的问题是不知道列表有多大，你不知道如何限制随机跳过计数。此外，即使你这样做了，它也不会真正看起来像一个随机抓取的子集，可以轻松到达列表中的最后一个元素，因为从统计上讲，你永远不可能到达最后一个元素（即不是每个元素都给出被选中的平等机会。）

可以说，最快的方法是方法 1，首先生成随机数字，然后只遍历列表一次（是的，实际上是两次，一次是为了获取数据集列表的大小，然后是再次获取随机元素）

score 0 · Accepted Answer

有趣的问题，但由于 DataSet 中没有计数或类似方法，并且您不能多次迭代，我只能想出这个解决方案来获得好的随机样本（没有 k > Datasize 处理）：

- (NSArray *)getSamples:(unsigned)k from:(Dataset*)input {
    NSMutableArray *source = [[NSMutableArray alloc] init];
    while(![input endOfData]) {
      [source addObject:[input getNext]];
    }

    NSMutableArray *ret = [[NSMutableArray alloc] initWithCapacity:k];
    int count = [source count];
    while ([ret count] < k) {
        int index = Rand(count);
        [ret addObject:[source objectAtIndex:index]];
        [source removeObjectAtIndex:index];
        count--;
    }
    return ret;
}

score 0 · Accepted Answer

这不是我在采访中所做的答案，但这是我希望我所做的：

存储指向数据集中第一个元素的指针
循环数据集以获取计数
将数据集重置为指向第一个元素
创建 NSMutableDictionary 用于存储随机索引
执行从 i=0 到 i=k 的 for 循环。每次迭代，生成一个随机值，检查字典中是否存在值。如果是这样，请继续生成一个随机值，直到获得一个新值。
循环数据集。如果当前索引在字典中，则将其添加到随机子集值数组中。
返回随机子集的数组。

score 0 · Accepted Answer

我们需要一点概率论。和其他人一样，我将忽略 n < k 的情况。第 n 项被选入大小为 k 的集合的概率仅为 C(n-1, k-1) / C(n, k)，其中 C 是二项式系数。一点数学表明这只是k/n。对于其余部分，请注意第 n 个项目的选择独立于所有其他选择。换句话说，“过去并不重要”。

所以一个算法是：

S = set of up to k elements
n = 0
while not end of input
   v = next value
   n = n + 1
   if |S| < k add v to S
   else if random(0,1) >= k/n replace a randomly chosen element of S with v

我会让编码人员编码这个！这很微不足道。您所需要的只是一个大小为 k 的数组和一个数据传递。

score 0 · Accepted Answer

如果您关心效率（正如您的标签所暗示的那样）并且知道总体中的项目数量，请不要使用水库抽样。这将需要您遍历整个数据集并为每个数据集生成一个随机数。

相反，选择从0到n-1 的五个值。在不太可能的情况下，五个索引之间存在重复，将重复替换为另一个随机值。然后使用五个索引对总体中的第 i 个元素进行随机访问查找。

这很简单。它使用随机数生成器的最小调用次数。它只为相关的选择访问内存。

如果您事先不知道数据元素的数量，您可以循环数据一次以获取总体大小并按上述方法进行操作。

如果您不允许对数据进行多次迭代，请使用分块形式的储层抽样：1) 选择前五个元素作为初始样本，每个元素的概率为 1/5。2) 读入大量数据并从新集合中选择五个新样本（仅使用对 Rand 的五个调用）。3）成对地，决定是保留新样本项还是旧样本元素（几率与两个样本组中的每一个的概率成正比）。4) 重复直到所有数据都被读取。

例如，假设有 1000 个数据元素（但我们事先并不知道这一点）。

选择前五个作为初始样本： current_sample = read(5); 人口=5。
读取一大块n个数据点（在本例中可能 n=200）：
- 子流行=读取（200）；
- m = len(subpop);
- new_sample = 选择(5, subpop);
- 成对循环两个样本：
  - for (a, b) in (current_sample and new_sample): 如果 random(0 to population + m) < population，则保留a，否则保留 *b)
- 人口 += 米
- 重复

objective-c - 最优解：从数据集中随机抽取项目样本

6 回答 6

Related

Reference