julia - Stratified sampling of a DataFrame

Question

Given a data frame with columns "a", "b", and "value", I want to sample N rows from each ("a", "b") pair. In Python pandas this is easy to do with the following syntax:

import pandas as pd
df.groupby(["a", "b"]).sample(n=10)

In Julia, I found a way to achieve something similar:

using DataFrames, StatsBase

combine(groupby(df, [:a, :b]),
names(df) .=> sample .=> names(df)
)

However, I don't know how to extend this to n>1. I tried

combine(groupby(df, [:a, :b]),
names(df) .=> x -> sample(x, n) .=> names(df)
)

but this returned an error (for n=3):

DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 3 and 7")
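A note on the likely cause: the body of an anonymous function extends as far to the right as it can, so in the attempt above the trailing `.=> names(df)` is parsed as part of the function body rather than as the output-name half of the pair. The function then tries to broadcast n sampled values against the column names, which would explain the lengths 3 and 7. The sketch below reproduces the effect with toy vectors only; `f`, `g`, and the strings are illustrative names, not part of the original code.

```julia
# Without parentheses, everything right of `->` belongs to the function body.
f = x -> x .=> ["p", "q"]          # body is `x .=> ["p", "q"]`
f([1, 2])                          # [1 => "p", 2 => "q"]

# A length-3 input cannot be broadcast against the 2 names.
err = try
    f([1, 2, 3])
catch e
    e
end
err isa DimensionMismatch          # true

# With parentheses the function is a single value paired with each name.
g = (x -> x) .=> ["p", "q"]        # 2-element vector of (function => name) pairs
```

Even with the parentheses added, `names(df) .=> (x -> sample(x, n)) .=> names(df)` would sample each column independently, so the values in one output row would no longer come from the same input row.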

One approach I found (with slightly different syntax) is:

combine(groupby(df, [:a, :b]), x -> x[sample(1:nrow(x), n), :])

but I would like to know whether there is a better option.
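One difference from the pandas call is worth noting: `DataFrameGroupBy.sample` draws without replacement by default, whereas `StatsBase.sample` draws with replacement unless `replace=false` is passed. A minimal sketch of a without-replacement variant follows, assuming a small toy data frame (the data and the `min(n, nrow(sdf))` guard for groups smaller than n are additions for illustration, not part of the question).

```julia
using DataFrames, StatsBase

# Toy data (hypothetical): 3 x 2 = 6 groups of 5 rows each.
df = DataFrame(a = repeat(1:3, inner = 10), b = repeat(1:2, 15), value = 1:30)
n = 3

# Draw up to n distinct rows per (a, b) group; whole rows stay together.
sampled = combine(groupby(df, [:a, :b])) do sdf
    k = min(n, nrow(sdf))                        # guard against small groups
    sdf[sample(1:nrow(sdf), k; replace = false), :]
end
```

Because row indices are sampled rather than individual columns, each output row is an intact row of `df`.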

score 3 · Accepted Answer

Perhaps more of a supplementary comment: if your data frame has an id column (containing the row number), then:

df[combine(groupby(df, [:a, :b]), :id => (x -> rand(x, n)) => :id).id, :]

will be a bit faster (but not by much).

Here is an example:

using DataFrames
n = 10
df = DataFrame(a=rand(1:1000, 10^8), b=rand(1:1000, 10^8), id=1:10^8)
combine(groupby(df, [:a, :b]), x -> x[rand(1:nrow(x), n), :]); # around 16.5 seconds on my laptop
df[combine(groupby(df, [:a, :b]), :id => (x -> rand(x, n)) => :id).id, :]; # around 14 seconds on my laptop
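The benchmark above allocates 10^8 rows, so a scaled-down check may be more convenient for trying the id-based approach. The sketch below uses a hypothetical 30-row data frame and the same expression as the answer. Note that `rand(x, n)` samples with replacement, so the same id can appear more than once within a group.

```julia
using DataFrames

# Toy data (hypothetical): 6 groups of 5 rows, with id holding the row number.
df = DataFrame(a = repeat(1:3, inner = 10), b = repeat(1:2, 15), id = 1:30)
n = 3

# Sample n ids per group, then index the original data frame by those ids.
ids = combine(groupby(df, [:a, :b]), :id => (x -> rand(x, n)) => :id).id
picked = df[ids, :]
```

This relies on `id` matching the row position in `df`; if rows have been filtered or reordered, the ids no longer work as row indices.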
