r - Extracting values from an R table within grouped values

Question

I have the following table, sorted by first, second, and Name.

    myData <- structure(list(first = c(120L, 120L, 126L, 126L, 126L, 132L, 132L), second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53), 
    Name = structure(c(5L, 5L, 3L, 3L, 4L, 1L, 2L), .Label = c("Benzene", 
    "Ethene._trichloro-", "Heptene", "Methylamine", "Pentanone"
    ), class = "factor"), Area = c(699468L, 153744L, 32913L, 
    4948619L, 83528L, 536339L, 105598L), Sample = structure(c(3L, 
    2L, 3L, 3L, 3L, 1L, 1L), .Label = c("PO1:1", "PO2:1", "PO4:1"
    ), class = "factor")), .Names = c("first", "second", "Name", 
    "Area", "Sample"), class = "data.frame", row.names = c(NA, -7L))

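Within each group, I want to extract the Area that corresponds to a specific sample. Several groups have no Area for a given sample, so if the sample was not detected the result should be "NA". Ideally, the final output would have one column per sample.

I tried using the ifelse function to create a column for each sample: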

PO1<-ifelse(myData$Sample=="PO1:1",myData$Area, "NA")

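However, this does not take the grouping into account. I want to do the same thing, but within each group (a group being the rows that share the same values in the first, second, and Name columns): if Sample is PO1:1, return Area; otherwise return NA.

For the first group: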
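
One detail in the attempt above is separate from the grouping problem: the quoted "NA" is a character string, not a missing value, so `ifelse` returns a character vector. A minimal sketch of the difference, using two made-up rows (the vectors `sample_id` and `area` are illustrative and not part of the original data frame):

```r
# Illustrative vectors (assumed for this sketch, not taken from myData as a whole)
sample_id <- c("PO1:1", "PO2:1")
area <- c(699468L, 153744L)

# A quoted "NA" is ordinary text, so the whole result is coerced to character
with_string <- ifelse(sample_id == "PO1:1", area, "NA")

# NA_integer_ is a true missing value and keeps the result an integer vector
with_missing <- ifelse(sample_id == "PO1:1", area, NA_integer_)

is.character(with_string)  # TRUE
is.na(with_string)         # FALSE FALSE
is.na(with_missing)        # FALSE TRUE
```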

structure(list(first = c(120L, 120L), second = c(1.33, 1.33), 
Name = structure(c(1L, 1L), .Label = "Pentanone", class = "factor"), 
Area = c(699468L, 153744L), Sample = structure(c(2L, 1L), .Label = c("PO2:1", 
"PO4:1"), class = "factor")), .Names = c("first", "second", "Name", 
"Area", "Sample"), class = "data.frame", row.names = c(NA, -2L))

The output should be:

structure(list(PO1.1 = NA, PO2.1 = 153744L, PO3.1 = NA, PO4.1 = 699468L), .Names =c("PO1.1", "PO2.1", "PO3.1", "PO4.1"), class = "data.frame", row.names = c(NA, -1L))

Any suggestions?

score 1 · Accepted Answer

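As in the example in the question, I am assuming that Sample is a factor. If that is not the case, consider converting it to one.

First, let's clean up the `Sample` column so that its levels are legal names; otherwise they may cause errors.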

levels(myData$Sample)  <-  make.names(levels(myData$Sample))
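
To see what this step does, here is a small sketch on a standalone vector (the name `sample_id` is hypothetical). It also covers the case where the column is a character vector rather than a factor:

```r
# Hypothetical character vector standing in for myData$Sample
sample_id <- c("PO4:1", "PO2:1", "PO4:1", "PO1:1")

# Convert to a factor first if it is not one already
sample_id <- factor(sample_id)

# make.names() replaces characters that are invalid in R names (here ":") with "."
levels(sample_id) <- make.names(levels(sample_id))

levels(sample_id)  # "PO1.1" "PO2.1" "PO4.1"
```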


## DEFINE THE CUTS##

# Adjust these as necessary
#--------------------------
  max.second <- 3  #  max & min range of myData$second 
  min.second <- 0  #
  sprd <- 0.15     # with spread for each group
#--------------------------

# we will cut the myData$second according to intervals,   cut(myData$second, intervals)
intervals <- seq(min.second, max.second, sprd*2)

# Next, let's create a group column to split our data frame by 
myData$group <- paste(myData$first, cut(myData$second, intervals), myData$Name, sep='-') 
groups <- split(myData, myData$group)

samples <- levels(myData$Sample)   ## I'm assuming not all samples are present in the example.  Manually adjusting with: samples <- sort(c(samples,  "PO3.1"))


# Apply over each group, then apply over each sample    
myOutput <- 
  t(sapply(groups, function(g) {

      #-------------------------------
      # NOTE: If it's possible that within a group there is more than one Area per Sample, then we have to somehow allow for this. Hence the "paste(...)"
      res <- sapply(samples, function(s) paste0(g$Area[g$Sample==s], collapse=" - "))  # allowing for multiple values
      unlist(ifelse(res=="", NA, res))

      ## If there is (or should be) only one Area per Sample, then remove the two lines above and uncomment the two below:
      # res <- sapply(samples, function(s) g$Area[g$Sample==s])  # <~~ This line will work when only one value per sample
      # unlist(ifelse(lengths(res)==0, NA, res))  # a sample with no rows yields a zero-length element
      #-------------------------------

  }))

# Cleanup names
rownames(myOutput) <- paste("Group", 1:nrow(myOutput), sep="-")  ## or whichever proper group name

# remove dummy column 
myData$group <- NULL

Results

myOutput

        PO1.1    PO2.1    PO3.1 PO4.1            
Group-1 NA       "153744" NA    "699468"         
Group-2 NA       NA       NA    "32913 - 4948619"
Group-3 NA       NA       NA    "83528"          
Group-4 "536339" NA       NA    NA               
Group-5 "105598" NA       NA    NA
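
The grouping in the code above depends on how `cut()` bins the `second` column. A small sketch with the same interval settings (values copied from `myData$second`; the variable names are only for illustration) shows that nearby values such as 0.36 and 0.37 fall into the same interval, which is why Group-2 lists two areas under one sample:

```r
# Same interval settings as in the answer: from 0 to 3 in steps of 2 * 0.15
intervals <- seq(0, 3, by = 0.15 * 2)

# A few values taken from myData$second
second <- c(1.33, 0.36, 0.37)

# cut() returns a factor whose levels are half-open intervals of the form (a,b]
bins <- cut(second, intervals)

bins[2] == bins[3]  # TRUE: 0.36 and 0.37 share a bin
bins[1] == bins[2]  # FALSE: 1.33 falls in a different bin
```

If groups should instead require exact equality of `second`, one option would be to paste `myData$second` itself into the group key in place of the `cut()` result.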

score 1 · Accepted Answer

You cannot really expect R to intuit that there is a fourth factor level between PO2 and PO4, now can you?

> reshape(inp, direction="wide", idvar=c('first','second','Name'), timevar="Sample")
  first second               Name Area.PO4:1 Area.PO2:1 Area.PO1:1
1   120    1.3          Pentanone     699468     153744         NA
3   126    0.4            Heptene      32913         NA         NA
4   126    0.4            Heptene    4948619         NA         NA
5   126    0.3        Methylamine      83528         NA         NA
6   132    0.5            Benzene         NA         NA     536339
7   132    0.5 Ethene._trichloro-         NA         NA     105598
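
If a column for an undetected sample such as PO3:1 is wanted, R has to be told about that level explicitly. One possible sketch, assuming the full set of sample names is known in advance (the `all_samples` vector, `sample_f`, and the pasted `key` are assumptions made for this example), declares the levels and then uses `tapply`, which keeps empty factor levels as NA cells:

```r
# The question's data, rebuilt with plain character columns
myData <- data.frame(
  first  = c(120L, 120L, 126L, 126L, 126L, 132L, 132L),
  second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
  Name   = c("Pentanone", "Pentanone", "Heptene", "Heptene",
             "Methylamine", "Benzene", "Ethene._trichloro-"),
  Area   = c(699468L, 153744L, 32913L, 4948619L, 83528L, 536339L, 105598L),
  Sample = c("PO4:1", "PO2:1", "PO4:1", "PO4:1", "PO4:1", "PO1:1", "PO1:1")
)

# Assumed complete list of samples, including the one never observed
all_samples <- c("PO1:1", "PO2:1", "PO3:1", "PO4:1")
sample_f <- factor(myData$Sample, levels = all_samples)

# One key per group: rows with equal first, second, and Name
key <- paste(myData$first, myData$second, myData$Name, sep = "|")

# Rows are groups, columns are samples; x[1] keeps the first Area
# if a group happens to contain the same sample more than once
wide <- tapply(myData$Area, list(key, sample_f), function(x) x[1])

wide["120|1.33|Pentanone", ]
```

Unlike the interval-based grouping, this treats 0.36 and 0.37 as different groups, matching the `reshape` output above.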
