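I am trying to take a subset of a data frame based on how often each value occurs. This is best explained by the example given below. The question is closely related to "Selecting top finite number of rows for each unique value of a column in a data frame in R". However, I want to vary the number of items selected by the head() command for each value.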
#Sample data
input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008,"2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04","2011-01-01","2011-01-02","2011-01-01","2011-01-04"), ncol=3)
colnames(input) <- c( "Product" , "Something" ,"Date")
input <- as.data.frame(input)
input$Date <- as.Date(input[,"Date"], "%Y-%m-%d")
#Sort based on date, I want to leave out the entries with the oldest dates.
input <- input[ with( input, order(Date)), ]
#Create number of items I want to select
table_input <- as.data.frame(table(input$Product))
table_input$twentyfive <- ceiling( table_input$Freq*0.25 )
#This next part is a very time consuming method (I have 2 million rows, 90k different products)
first <- TRUE
for (i in table_input$Var1) {
  # All rows for the current product (already sorted by date)
  data_selected <- input[input$Product == i, ]
  # Number of rows to keep for this product
  number <- table_input[table_input$Var1 == i, ]$twentyfive
  # Named top_rows rather than head, to avoid masking the head() function
  top_rows <- head(data_selected, number)
  if (!first) {
    output <- rbind(output, top_rows)
  } else {
    output <- top_rows
  }
  first <- FALSE
}
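Hopefully someone knows a better, more efficient way. I tried using the split function from the answer to "Selecting top finite number of rows for each unique value of a column in a data frame in R" to split the data by product, and then iterating over the pieces and selecting head() from each. But the split function always runs out of memory (cannot allocate ..)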
input_split <- split(input, input$Product) #Works here, but not in my problem.
So in the end my question is how to select a different number of rows for each unique product. Here that means 2 items for 1000001 and 1 item each for 1000002 and 1000003.
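To make the desired result concrete, here is a minimal sketch of one possible vectorised approach in base R, using ave() to compute each row's position within its product and the size of that product's group. It is only a sketch: the names pos and n are illustrative, the sample data frame is rebuilt directly rather than from the matrix above, and it has not been tried on the full 2 million rows, so its memory behaviour there is an assumption.

```r
# Small self-contained sample with the same shape as the data above
input <- data.frame(
  Product   = c(rep("1000001", 6), rep("1000002", 3), rep("1000003", 3)),
  Something = c("100001", "100002", "100003", "100004", "100005", "100006",
                "100002", "100003", "100007", "100002", "100003", "100008"),
  Date      = as.Date(c("2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                        "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04",
                        "2011-01-01", "2011-01-02", "2011-01-01", "2011-01-04")),
  stringsAsFactors = FALSE
)

# Same sort as in the question
input <- input[order(input$Date), ]

# Position of each row within its product (1, 2, 3, ... in sorted order)
pos <- ave(seq_len(nrow(input)), input$Product, FUN = seq_along)
# Number of rows of that product, repeated on every row of the product
n <- ave(seq_len(nrow(input)), input$Product, FUN = length)

# Keep the first 25% (rounded up) of the rows of each product
output <- input[pos <= ceiling(n * 0.25), ]
```

On the sample data this keeps 2 rows for 1000001 and 1 row each for 1000002 and 1000003. The rows come out in date order rather than grouped by product as in the loop, and only integer vectors are split internally, not the whole data frame.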