r - Deleting rows in a dataset

Question

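My panel dataset contains some corrupted data: for some groups (gid) and time periods (t) I have more than one observation. Every observation has a variable (Quantity), and I would like R to exclude the observations with the lowest quantity.

My current solution is the following, but I cannot control which of the duplicate observations R will exclude...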

IMS <- subset(IMS, !duplicated(data.frame(t,gid)))

Example:

Product    Strength    Pack_size    y        t    Quantity    gid
Ibumetin    600MG        100      5.9183     1      10226    2613
Ibumetin    600MG        100      25.3500    1        100    2613

In the example, the observation to exclude is the one with a quantity of 100, since 10226 > 100.

I would appreciate any help you can provide,

Henrik

score 3 · Accepted Answer

There is a very simple way to do this using the very useful 'plyr' package.

Setup: I need some test data to make this work. Here is what I used:

IMS <- read.table(text="
Product    Strength    Pack_size    y        t    Quantity    gid
Ibumetin    600MG        100      5.9183     1      10226    2613
Ibumetin    600MG        100      25.3500    1        100    2613
Simvastatin  30MG         90      14.1630    1       1036    2614
Simvastatin  30MG         90      12.3345    1       2102    2614
", header=TRUE)

Step 1: Find the maximum value of [Quantity] for each [gid]-[t] pair.

library(plyr)
temp_IMS <- ddply(IMS, .(gid,t), mutate, Quantity_max=max(Quantity))

#       Product Strength Pack_size       y t Quantity  gid Quantity_max
# 1    Ibumetin    600MG       100  5.9183 1    10226 2613        10226
# 2    Ibumetin    600MG       100 25.3500 1      100 2613        10226
# 3 Simvastatin     30MG        90 14.1630 1     1036 2614         2102
# 4 Simvastatin     30MG        90 12.3345 1     2102 2614         2102

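We use the 'dd' variant of 'ply' here because we expect a data frame as both input and output (dd-ply). We are not doing anything special; we simply add a new column named [Quantity_max], computed by taking the max() of the [Quantity] values across the rows that share the same [gid] and [t] pair, as specified by .(gid,t). The 'mutate' function keeps the rest of the data frame intact, which saves us from extra bookkeeping just to get this done.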
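If you would rather not load 'plyr', the same per-group maximum can be sketched in base R with ave(). This is only an alternative under the assumption that the toy data from the setup is used; the name temp_base is made up for illustration and is not part of the original answer.

```r
# Same toy data as in the setup above
IMS <- read.table(text="
Product    Strength    Pack_size    y        t    Quantity    gid
Ibumetin    600MG        100      5.9183     1      10226    2613
Ibumetin    600MG        100      25.3500    1        100    2613
Simvastatin  30MG         90      14.1630    1       1036    2614
Simvastatin  30MG         90      12.3345    1       2102    2614
", header=TRUE)

# ave() applies FUN within each gid/t group and returns a vector
# aligned with the original rows, so the row order is left unchanged
temp_base <- IMS                      # hypothetical name, for illustration only
temp_base$Quantity_max <- ave(IMS$Quantity, IMS$gid, IMS$t, FUN = max)

print(temp_base)
```

Because ave() keeps the rows in their original order, the new column lines up with IMS row by row.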

Step 2: Select the rows where [Quantity] equals [Quantity_max].

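IMS_filtered <- temp_IMS[temp_IMS$Quantity == temp_IMS$Quantity_max, names(IMS)]

#       Product Strength Pack_size       y t Quantity  gid
# 1    Ibumetin    600MG       100  5.9183 1    10226 2613
# 4 Simvastatin     30MG        90 12.3345 1     2102 2614

What we did was filter the temporary data frame created with 'plyr', keeping only the original columns. Filtering temp_IMS rather than IMS matters because ddply returns its rows grouped by [gid] and [t], which may not match the original row order of IMS.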

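Step 3 (optional): If several rows can share the same [Quantity] value, you need some way to choose which row to use. If the rows are completely identical, there is a simple solution: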

IMS_filtered <- unique(IMS_filtered)

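However, if the [y] values differ, you need to do something else, such as filtering on duplicates while considering only certain columns. For example, if I don't care which row is chosen as long as the [gid] and [t] pair is the same, I can look for duplicates like this: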

IMS_filtered <- IMS_filtered[!duplicated(IMS_filtered[,c("gid","t")]),]

This keeps the first occurrence of each [gid]-[t] pair by selecting the rows that are not flagged as duplicates.

Hope this helps.
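To make the difference between the two Step 3 options concrete, here is a small sketch. The data frame ties is hypothetical (it is not from the question): two rows tie on Quantity within the same gid/t pair but have different y values.

```r
# Hypothetical data: rows 1 and 2 tie on Quantity for gid 2613, t 1,
# but their y values differ
ties <- data.frame(
  y        = c(5.9183, 25.3500, 14.1630),
  t        = c(1, 1, 1),
  Quantity = c(10226, 10226, 1036),
  gid      = c(2613, 2613, 2614)
)

# unique() compares whole rows, so both tied rows survive
n_after_unique <- nrow(unique(ties))

# duplicated() on the key columns flags every occurrence after the first,
# so negating it keeps exactly one row per gid/t pair
deduped <- ties[!duplicated(ties[, c("gid", "t")]), ]

print(n_after_unique)
print(deduped)
```

Which of the tied rows survives depends only on row order, so sort the data first if you have a preference.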

score 2 · Accepted Answer

The simplest approach is to reorder the data so that the largest quantity is listed first, and then use the method you provided:

subset(IMS[order(-IMS$Quantity),],!duplicated(data.frame(t,gid)))
      Product Strength Pack_size       y t Quantity  gid
1    Ibumetin    600MG       100  5.9183 1    10226 2613
4 Simvastatin     30MG        90 12.3345 1     2102 2614
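The one-liner above relies on subset() evaluating t and gid inside the reordered data frame. A slightly more explicit sketch of the same idea, assuming the toy data from the first answer, spells out the sort keys and the key columns; the names ord and IMS_max are made up for illustration.

```r
# Toy data, as used in the first answer
IMS <- read.table(text="
Product    Strength    Pack_size    y        t    Quantity    gid
Ibumetin    600MG        100      5.9183     1      10226    2613
Ibumetin    600MG        100      25.3500    1        100    2613
Simvastatin  30MG         90      14.1630    1       1036    2614
Simvastatin  30MG         90      12.3345    1       2102    2614
", header=TRUE)

# Sort by group, then time, then descending Quantity, so that the
# largest Quantity comes first within each gid/t pair
ord <- IMS[order(IMS$gid, IMS$t, -IMS$Quantity), ]

# duplicated() flags every row after the first in each gid/t pair;
# negating it keeps the row with the largest Quantity
IMS_max <- ord[!duplicated(ord[, c("gid", "t")]), ]

print(IMS_max)
```

Sorting by gid and t as well is not required for the de-duplication itself; it just leaves the result in a predictable order.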

score 1 · Accepted Answer


You can use unique(df) to return the unique rows.

Answered 2013-05-23T11:34:51.650
