r - Limiting the size of hierarchical data for a reproducible example

Question

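I am trying to come up with a reproducible example (RE) for this question: Errors related to data frame columns during merging. To qualify as an RE, that question lacks only reproducible data. However, when I tried the fairly standard approach of dput(head(myDataObj)), the output was a 14 MB file. The problem is that my data object is a list of data frames, so the head() limit does not seem to work recursively.

I have not found any options for dput() or head() that would let me control the data size recursively for complex objects. Unless I am wrong about the above, what other approach would you recommend for creating a minimal RE data set in this situation?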

score 2 · Accepted Answer

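Following @MrFlick's comment about using lapply, you can use any of the apply family of functions, as needed, to run head or sample and reduce the size for both RE and testing purposes (I find that working with a subset or subsample of a large data set is preferable for debugging and even for plotting).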
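
For the flat case described in the question (a list of data frames), a minimal sketch of the lapply approach might look like this. The object my_list is a hypothetical stand-in for myDataObj, built from R's bundled mtcars and iris data sets.

```r
# Hypothetical stand-in for the question's list of data frames
my_list <- list(cars = mtcars, flowers = iris)

# Apply head to every data frame; n = 3 is passed through lapply's "..."
small <- lapply(my_list, head, n = 3)

# The reduced object is now small enough to paste into a question
dput(small)
```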

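It should be noted that head and tail give the first or last part of a structure, but sometimes those parts do not contain enough variation for RE purposes, and they are certainly not random, which is where sample may become more useful.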
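
As a small illustration of the variation problem, here is a sketch using the bundled iris data set (chosen only for the example): its first rows all belong to one species, so head loses the grouping that a random sample of rows is likely to keep.

```r
set.seed(1)

# head returns the first six rows, which are all the same species
head_rows <- head(iris, 6)
unique(head_rows$Species)

# Sampling row indices picks six rows from anywhere in the data frame
rand_rows <- iris[sample(nrow(iris), 6), ]
unique(rand_rows$Species)
```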

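Suppose we have a hierarchical tree structure (a list of lists of ...) and we want to subset each "leaf" while preserving the structure and labels of the tree.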

x <- list( 
    a=1:10, 
    b=list( ba=1:10, bb=1:10 ), 
    c=list( ca=list( caa=1:10, cab=letters[1:10], cac="hello" ), cb=toupper( letters[1:10] ) ) )

Note: in what follows, I could not actually see a difference between using how="replace" and how="list".
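
As far as I understand the rapply documentation, the two modes only diverge once the classes argument excludes some leaves: how="replace" leaves the non-matching leaves untouched, while how="list" substitutes deflt (NULL by default) for them. A small sketch with an illustrative list:

```r
x2 <- list(a = 1:10, c = list(cab = letters[1:10]))

# Only integer leaves are passed to head; character leaves are kept as they are
r_replace <- rapply(x2, head, classes = "integer", how = "replace", n = 2)

# Only integer leaves are passed to head; character leaves become deflt (NULL)
r_list <- rapply(x2, head, classes = "integer", how = "list", n = 2)

str(r_replace)
str(r_list)
```

With the default classes="ANY" every leaf matches, which would explain why the two modes look identical in the examples here.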

Also note: this does not work well for data.frame leaf nodes.
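
The likely reason is that a data.frame is itself a list, so rapply descends into its columns instead of treating it as one leaf. One possible workaround is a small hand-written recursion that stops at data frames. The helper name shrink and the sample data below are hypothetical, not part of any package.

```r
# Hypothetical helper: recurse through plain lists, but treat
# data frames (and atomic vectors) as leaves and keep the first n rows/elements.
shrink <- function(obj, n = 2) {
  if (is.data.frame(obj)) {
    head(obj, n)                      # first n rows, still a data frame
  } else if (is.list(obj)) {
    lapply(obj, shrink, n = n)        # lapply keeps the list names
  } else {
    head(obj, n)                      # first n elements of a vector
  }
}

y <- list(
  df1    = data.frame(id = 1:10, v = letters[1:10]),
  nested = list(df2 = data.frame(k = 1:5), vec = 1:10)
)

small <- shrink(y, n = 2)
str(small)
```

To sample rows instead of taking the first ones, the data frame branch could be swapped for something like obj[sample(nrow(obj), min(nrow(obj), n)), , drop = FALSE].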

# Set seed so the example is reproducible with randomized methods:
set.seed(1)

You can use head with its defaults in a recursive apply like this:

rapply( x, head, how="replace" )

Or pass an anonymous function that modifies the behavior:

# Complete anonymous function
rapply( x, function(y){ head(y,2) }, how="replace" )
# Same behavior, but using the rapply "..." argument to pass the n=2 to head.
rapply( x, head, how="replace", n=2 )
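
To tie this back to dput, the reduced tree can be captured as text and checked before posting. The following is only a sketch with a small illustrative list; re_text and restored are names made up for the example.

```r
x3 <- list(a = 1:10, b = list(ba = 1:10, bb = letters[1:10]))
small <- rapply(x3, head, how = "replace", n = 2)

# Capture what dput would print, as a single string
re_text <- paste(capture.output(dput(small)), collapse = "\n")
nchar(re_text)   # size of the text that would go into the question

# Sanity check: evaluating the text should rebuild the same object
restored <- eval(parse(text = re_text))
all.equal(restored, small)
```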

The following uses sample to take a random selection from each leaf:

# This works because we use minimum in case leaves are shorter
# than the requested maximum length.
rapply( x, function(y){ sample(y, size=min(length(y),2) ) }, how="replace" )

# Less efficient, but maybe easier to read (head caps the result at 2 elements):
rapply( x, function(y){ head(sample(y), 2) }, how="replace" )

# XXX: The following does NOT work, because `sample` with a `size`
# greater than the length of the item being sampled fails when
# sampling without replacement (here the length-1 leaf cac="hello"
# triggers the error).
rapply( x, function(y){ sample(y, size=2) }, how="replace" )
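
One more caveat with sample on leaves: per its documentation, a length-1 numeric x with x >= 1 is treated as 1:x, so a leaf holding the single value 10 would be replaced by a random number between 1 and 10. A sketch of a workaround that samples positions instead of values follows. The helper name safe_sample and the list x4 are hypothetical.

```r
# Hypothetical helper: sample at most n elements by index, so that
# length-1 numeric leaves are returned unchanged.
safe_sample <- function(y, n = 2) {
  y[sample.int(length(y), size = min(length(y), n))]
}

x4 <- list(a = 1:10,
           b = list(single = 10, word = "hello", many = letters[1:10]))

set.seed(1)
small <- rapply(x4, safe_sample, how = "replace", n = 2)
str(small)
```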
