r - Running time of the count function in SparkR

Question

I have a DataFrame X in SparkR. X contains a column ID = 1 2 3 1 2 3 9 ... and a score for each entry: score = 1241 233 20100 ....

So, to find all the scores for a given ID:

s <- filter(X, X$ID == 1)

This gives us all the scores for ID 1, which we can then sum.

I want to know how many rows in X have ID=1, so I use the count function in SparkR:

count(s)

But this takes a long time to compute. Is there a better way?

Suppose X has already been arranged or sorted so that ID = 1 1 1 2 3 3 3 4 ..... Then perhaps there is a better option that avoids calling count(s).

score 0 · Accepted Answer

Aggregating on ID and counting the items in each group gives you the result for all IDs at once. That said, with only 100000 rows it shouldn't take long at all!

countedData <- agg(groupBy(X, "ID"), count = n(X[["score"]]))
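Since the question also wants the scores summed per ID, the same aggregation can return both the count and the sum in one pass. The following is only a sketch under assumptions: it uses sparkR.session() from Spark 2.x (older releases start the context differently), and the toy data frame and the names perId and res are hypothetical stand-ins, not part of the original post.

```r
library(SparkR)

# Assumption: Spark 2.x or later, running locally for illustration.
sparkR.session(master = "local[*]")

# Hypothetical toy data standing in for the real X.
X <- createDataFrame(data.frame(
  ID    = c(1, 2, 3, 1, 2, 3, 9),
  score = c(1241, 233, 20100, 5, 7, 11, 13)
))

# One aggregation yields both the row count and the score total per ID.
perId <- agg(groupBy(X, "ID"),
             count = n(X$score),
             total = sum(X$score))

# Keep the small aggregated result in memory so repeated lookups
# do not rescan X.
cache(perId)

# Pull the row for a single ID back to the driver as a local data.frame.
res <- collect(filter(perId, perId$ID == 1))
print(res)
```

Looking up another ID then only filters the small cached perId table instead of filtering and counting over X again.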
