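I have about 100 million value/count pairs in a text file on my Linux machine. I'd like to figure out what kind of formula I would use to generate more pairs that follow the same distribution.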



3 Answers



answered 2009-06-17T16:26:29.580


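You need a file structure that lets you quickly search for the "highest entry with key <= X". Sleepycat's Berkeley DB, for example, has a btree structure for this; SQLite is even easier to use, though perhaps not as fast (with an index on the key it should be fine).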

Store your data as pairs in which the key is the cumulative count up to that point (sorted by increasing value). Call the highest key K.
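As a rough illustration of that layout, here is a minimal Python sketch. It assumes "cumulative count up to that point" means the number of observations that come before each value (an exclusive running total), which is the convention under which a "highest key <= X" lookup picks each value in proportion to its count. The function name `cumulative_keys` and the sample pairs are hypothetical.

```python
from itertools import accumulate

def cumulative_keys(pairs):
    """Turn (value, count) pairs into (key, value) pairs plus the grand total.

    Assumption: each key is the number of observations that precede the
    value once the pairs are sorted by increasing value.
    """
    ordered = sorted(pairs)
    counts = [count for _, count in ordered]
    # Running totals starting at 0: 0, c0, c0+c1, ..., grand total
    starts = list(accumulate(counts, initial=0))
    total = starts.pop()  # the last running total is the sum of all counts
    keyed = [(start, value) for start, (value, _) in zip(starts, ordered)]
    return keyed, total

# Hypothetical sample: value 10 seen 5 times, 20 seen 3 times, 30 seen 2 times
keyed, total = cumulative_keys([(30, 2), (10, 5), (20, 3)])
```

Pairs with a count of zero would produce duplicate keys under this convention, so they are best dropped before building the structure.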

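To generate a random pair that follows exactly the same distribution as the sample, generate a random integer X between 0 and K, look it up in that file structure with the "highest key <= X" search described above, and use the corresponding value.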
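The lookup could be sketched with Python's built-in sqlite3 module roughly as follows. This is only one possible reading of the approach, not a tuned implementation. The table name `dist`, the column names, and the helper functions are all hypothetical. Two assumptions are made so that each value is drawn in exact proportion to its count: each key is the cumulative count before its value, and X is drawn from 0 up to (but not including) the sum of all counts.

```python
import random
import sqlite3

def build_table(conn, pairs):
    """Store (start_key, value) rows and return the sum of all counts.

    Assumption: start_key is the cumulative count *before* each value.
    INTEGER PRIMARY KEY gives SQLite a btree-ordered key to search on.
    """
    conn.execute("CREATE TABLE dist (start_key INTEGER PRIMARY KEY, value TEXT)")
    total = 0
    for value, count in sorted(pairs):
        if count <= 0:
            continue  # zero-count values can never be drawn; skip them
        conn.execute("INSERT INTO dist VALUES (?, ?)", (total, value))
        total += count
    conn.commit()
    return total

def lookup(conn, x):
    """Return the value of the highest entry with start_key <= x."""
    row = conn.execute(
        "SELECT value FROM dist WHERE start_key <= ? "
        "ORDER BY start_key DESC LIMIT 1",
        (x,),
    ).fetchone()
    return row[0]

def sample(conn, total, rng=random):
    """Draw one value with probability count / total."""
    return lookup(conn, rng.randrange(total))

# Hypothetical data: "a" seen 5 times, "b" 3 times, "c" 2 times
conn = sqlite3.connect(":memory:")
total = build_table(conn, [("a", 5), ("b", 3), ("c", 2)])
draws = [sample(conn, total) for _ in range(5)]
```

For the real 100-million-row file, the connection would point at an on-disk database instead of `:memory:`, and the inserts would be batched, but the query stays the same.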

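I'm not sure how to do all of this in R. In your shoes I'd try a Python/R bridge, doing the logic and control in Python and only the statistics in R itself, but that's a personal choice!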

answered 2009-06-17T15:04:11.373

I'm assuming that you're interested in understanding the distribution over your categorical values.

The best way to generate "new" data is to sample from your existing data using R's sample() function. This will give you values which follow the probability distribution indicated by your existing counts.

To give a trivial example, let's assume you had a file of voter data for a small town, where the values are voters' political affiliations, and counts are number of voters:

affils <- as.factor(c('democrat','republican','independent'))
counts <- c(552,431,27)
## Simulate 20 new voters, sampling from affiliation distribution
new.voters <- sample(affils,20, replace=TRUE,prob=counts)
new.counts <- table(new.voters)

In practice, you will probably bring in your 100m rows of values and counts using R's read.csv() function. Assuming you've got a tab-separated file whose header line is "values\tcounts", that code might look something like this:

dat <- read.csv('values-counts.txt',sep="\t",colClasses=c('factor','numeric'))
new.dat <- sample(dat$values,100,replace=TRUE,prob=dat$counts)

One caveat: as you may know, R keeps all of its objects in memory, so be sure you've got enough freed up for 100m rows of data (storing character strings as factors will help reduce the footprint).

answered 2009-06-27T08:54:55.430