r - In R, how do I select rows based on a statistic of a column attribute?

Question

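My table has several thousand rows, classified into 400 classes, and about a dozen columns.

The ideal result would be a table with 400 rows (one row per class), chosen by the maximum value of column "z" within each class, and containing all of the original columns.

Here is a sample of my data. In this sample I only need R to extract rows 2, 4, 7, and 8 (marked with *).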

     x           y         z    cluster 
1  712521.75  3637426.49  19.46   12 
2  712520.69  3637426.47  19.66   12  *
3  712518.88  3637426.63  17.37   225
4  712518.4   3637426.48  19.42   225 *
5  712517.11  3637426.51  18.81   225
6  712515.7   3637426.58  17.8    17 
7  712514.68  3637426.55  18.16   17  *
8  712513.58  3637426.55  18.23   50  *
9  712512.1   3637426.62  17.24   50
10 712513.93  3637426.88  18.08   50

I have tried many different combinations, including:

  tapply(data$z, data$cluster, max)       # returns only the max value and cluster columns
  which.max(data$z)         # returns only the index of the max value in the entire table

I have also read through the plyr package documentation, but did not find a solution.

score 2 · Accepted Answer

A very simple approach is to use aggregate and merge:

> merge(aggregate(z ~ cluster, mydf, max), mydf)
  cluster     z        x       y
1      12 19.66 712520.7 3637426
2      17 18.16 712514.7 3637427
3     225 19.42 712518.4 3637426
4      50 18.23 712513.6 3637427
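The answer does not show how `mydf` was created. The following is a minimal, self-contained sketch that assumes `mydf` holds the ten sample rows from the question, with `cluster` stored as a number. Because the output above sorts the clusters as 12, 17, 225, 50, the answerer's `cluster` column was probably not numeric, so the row order may differ from what is printed there.

```r
# Rebuild the question's sample data (assumption: cluster is numeric).
mydf <- data.frame(
  x = c(712521.75, 712520.69, 712518.88, 712518.4, 712517.11,
        712515.7, 712514.68, 712513.58, 712512.1, 712513.93),
  y = c(3637426.49, 3637426.47, 3637426.63, 3637426.48, 3637426.51,
        3637426.58, 3637426.55, 3637426.55, 3637426.62, 3637426.88),
  z = c(19.46, 19.66, 17.37, 19.42, 18.81, 17.8, 18.16, 18.23, 17.24, 18.08),
  cluster = c(12, 12, 225, 225, 225, 17, 17, 50, 50, 50)
)

# Step 1: one row per cluster holding that cluster's maximum z.
max_z <- aggregate(z ~ cluster, data = mydf, FUN = max)

# Step 2: merge() joins on the shared columns (cluster and z),
# which pulls the remaining columns (x, y) back in.
result <- merge(max_z, mydf)
print(result)
```

The second step works because `merge()` defaults to joining on every column name the two data frames have in common.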

You can even use the output of your tapply code to get what you want. Just turn it into a data.frame instead of a named vector.

> merge(mydf, data.frame(z = with(mydf, tapply(z, cluster, max))))
      z        x       y cluster
1 18.16 712514.7 3637427      17
2 18.23 712513.6 3637427      50
3 19.42 712518.4 3637426     225
4 19.66 712520.7 3637426      12
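One caveat, offered as a sketch rather than a correction: the data frame built from `tapply` above has only a `z` column, so `merge()` joins on `z` alone. If a cluster's maximum happens to equal a non-maximum `z` in another cluster, that extra row would also be returned. The example below uses a small hypothetical data frame `d` (not from the question) to show the effect, and one way to keep the cluster in the lookup table so the join uses both columns.

```r
# Hypothetical data: the max of cluster 1 (z = 5) also appears as a
# non-maximum value in cluster 2.
d <- data.frame(
  z = c(5, 3, 9, 5),
  cluster = c(1, 1, 2, 2),
  id = 1:4
)

tp <- with(d, tapply(z, cluster, max))  # named by cluster

# Joining on z only: also matches id 4, which is not a cluster maximum.
by_z <- merge(d, data.frame(z = as.vector(tp)))

# Keeping the cluster in the lookup: the join now uses cluster and z.
lookup <- data.frame(cluster = as.numeric(names(tp)), z = as.vector(tp))
by_both <- merge(d, lookup)

print(nrow(by_z))     # rows returned when joining on z alone
print(nrow(by_both))  # rows returned when joining on cluster and z
```

With real-valued measurements such collisions may be rare, but the `aggregate` form avoids the issue because its result already carries the `cluster` column.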

For more options, see the answers to this question.

score 0 · Accepted Answer

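Thanks, everyone, for the help! aggregate() and merge() worked very well for me.

One important point: aggregate() returns only one row per cluster, even when several points tie for the maximum, whereas merge() returns every tied point, because they all share the same maximum value within the cluster.

That is ideal in this case, because the points are 3D and are not duplicates once the x and y coordinates are taken into account.

Here is my solution: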
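The tie behaviour described above can be reproduced with a small sketch. The data frame `pts` is hypothetical (it is not the poster's data), and the last step, keeping the first row per cluster with `duplicated()`, is only one possible choice for readers who do want exactly one row per cluster.

```r
# Hypothetical points: cluster 1 has two points tied at the maximum z.
pts <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(1, 1, 1, 1),
  z = c(10, 10, 7, 8),
  cluster = c(1, 1, 2, 2)
)

# aggregate() collapses each cluster to a single maximum value.
per_cluster <- aggregate(z ~ cluster, data = pts, FUN = max)

# merge() matches every row whose (cluster, z) equals that maximum,
# so both tied points in cluster 1 come back.
with_ties <- merge(per_cluster, pts)

# Optional: keep only the first matching row in each cluster.
one_each <- with_ties[!duplicated(with_ties$cluster), ]

print(nrow(per_cluster))
print(nrow(with_ties))
print(nrow(one_each))
```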

df <- read.table("data.txt", header = TRUE, sep = ",")
names(df)
[1] "Row"         "x"           "y"           "z"           "cluster"

head(df)
  Row        x       y     z     cluster
1   1 712521.8 3637426 19.46         361
2   2 712520.7 3637426 19.66         361
3   3 712518.9 3637427 17.37         147
4   4 712518.4 3637426 19.42         147
5   5 712517.1 3637427 18.81         147
6   6 712515.7 3637427 17.80          42


new_table_a     <- aggregate(z ~ cluster, df, max)  # output 400 rows, no duplicates
new_table_b     <- merge(new_table_a, df)          # output 408 rows, includes duplicates of "z"

head(new_table_b)
      cluster     z  Row        x       y
1           1 20.44 6043 712416.2 3637478
2          10 26.09 1138 712458.4 3637511
3         100 19.39 6496 712423.4 3637485
4         101 25.74 2141 712521.2 3637488
5         102 17.33 2320 712508.2 3637484
6         103 21.01 6908 712462.2 3637493
