string - Evenly distribute repeated strings across multiple lists using R

Question

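So let's say I have a character vector of length 150000. The strings in the vector are not unique; in fact they are normally distributed, with the most common string appearing 28 times, another 24 times, and over 1000 appearing more than 5 times. I want to split the vector into 28 smaller vectors, distributing the strings among them so that no string appears more than twice in any smaller vector, and ideally only once (or not at all). I need to keep every string, so I can't just use !duplicated(). Ideally the vectors would be roughly the same size.

How can I do this?

I was thinking of something like this: start adding to the first vector until I hit the first non-unique string, skip it, and keep filling while skipping non-unique strings until the vector reaches 150000/28 = 5357 elements. Then do the same for the other vectors, removing each string from the parent vector once it has been assigned to a smaller one. Is there anything wrong with that? Is there an efficient way to do it without a nasty forest of for loops?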

score 1 · Accepted Answer

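This seemed like a really interesting problem, although it may only seem interesting because I misunderstood it. The solution I have here creates one sub-vector per occurrence of the most frequent item (max(freq_table) sub-vectors, so each holds roughly length of character vector / frequency of most frequent item strings), and then puts each string into f of those sub-vectors, where f is that string's frequency. This may be more complicated than what you actually asked for.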

# I created a file with 10000 random strings and a roughly similar frequency
# distribution using Python, and now I can't remember exactly what I did
strings <- read.csv("random_strings.txt", header=FALSE,
                    stringsAsFactors=FALSE)$V1
freq_table <- table(strings)

# One sub-vector per occurrence of the most frequent string
num_sub_vectors <- max(freq_table)
# Create a list of empty character vectors
split_list <- replicate(num_sub_vectors, character(0), simplify=FALSE)
for (s in names(freq_table)) {
  # Put each string into f of the sub-vectors, where f is the string's 
  # frequency
  freq <- freq_table[[s]]
  # Choose f random indexes to put this string into
  sub_vecs <- sample(1:num_sub_vectors, freq)
  for (sub in sub_vecs) {
    split_list[[sub]] <- c(split_list[[sub]], s)
  }
}

To test that it works, pick a string s or a frequency f and check that s appears in exactly f of the sub-vectors. Repeat until you are confident.

> head(freq_table[freq_table==15])
strings
ad ak bj cg cl cy 
15 15 15 15 15 15 
> sum(sapply(split_list, function(x) "ad" %in% x))
[1] 15
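
The spot check above can also be run over every string at once. The following is only a minimal sketch on a small made-up vector (the toy data and the names `placed`, `all_kept` and `no_repeats` are assumptions for illustration, not part of the answer): it repeats the same placement loop and then confirms that every string was kept with its original count and that no sub-vector holds the same string twice.

```r
set.seed(1)  # only to make the random placement reproducible

# Toy input (assumed for illustration): "a" occurs 3 times, "b" twice, "c" once
strings <- c("a", "a", "a", "b", "b", "c")
freq_table <- table(strings)
num_sub_vectors <- max(freq_table)

# Same placement idea as in the answer: each string goes into
# f distinct, randomly chosen sub-vectors, where f is its frequency
split_list <- replicate(num_sub_vectors, character(0), simplify = FALSE)
for (s in names(freq_table)) {
  sub_vecs <- sample(seq_len(num_sub_vectors), freq_table[[s]])
  for (sub in sub_vecs) {
    split_list[[sub]] <- c(split_list[[sub]], s)
  }
}

# Check 1: every string is present as often as in the input
placed <- table(unlist(split_list))
all_kept <- identical(as.vector(placed[names(freq_table)]),
                      as.vector(freq_table))

# Check 2: no sub-vector contains the same string more than once
no_repeats <- all(sapply(split_list, anyDuplicated) == 0)

print(c(all_kept = all_kept, no_repeats = no_repeats))
```

Because `sample()` draws without replacement by default, a string can never land in the same sub-vector twice, which is what the second check confirms.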

score 0 · Accepted Answer

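This does what you ask (each string at most once per sub-vector) fairly concisely, simply by counting how often each string occurs and then partitioning according to "strings that occur i or more times":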

inputs <- c("foo", "bar", "baz", "bar", "baz", "bar", "bar")
histo <- table(inputs)
lapply(1:max(histo), function(i) names(histo)[histo >= i])

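This will of course produce partitions of very different sizes, but you weren't very clear about what your requirements are in that area.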
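
If roughly equal sizes matter, one possible alternative (not taken from either answer, so treat it as a sketch under stated assumptions) is to sort the vector so identical strings sit next to each other and then deal the elements round-robin into k groups, where k is the highest frequency. A run of identical strings is never longer than k, so its members fall into distinct groups, and the group sizes differ by at most one. The names `k`, `sorted`, `group` and `parts` are introduced here for illustration only.

```r
inputs <- c("foo", "bar", "baz", "bar", "baz", "bar", "bar")

# Number of groups: the frequency of the most common string
k <- max(table(inputs))

# Sorting makes identical strings contiguous
sorted <- sort(inputs)

# Deal positions 1, 2, 3, ... into groups 1..k in turn
group <- (seq_along(sorted) - 1) %% k + 1
parts <- split(sorted, group)

print(parts)
print(lengths(parts))  # group sizes differ by at most one
```

With the question's data this would give 28 groups of about 5357 strings each. Note that each group comes out as a contiguous slice of the sorted order dealt in turn, so shuffle within each group afterwards if the order matters.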
