Thanks in advance for any assistance!

I have two separate data frames in R, one with a start and end number, the second with a middle number. Included here is a mock data set illustrating my conundrum.

The data table with two numbers also has a GroupID as seen here.

TwoNum <- structure(list(GroupID = structure(1:10, .Label = c("Clstr001", 
"Clstr002", "Clstr007", "Clstr008", "Clstr010", "Clstr011", "Clstr015", 
"Clstr016", "Clstr017", "Clstr018"), class = "factor"), StartNum = c(2L, 
5L, 23L, 26L, 32L, 41L, 67L, 70L, 73L, 78L), EndNum = c(4L, 7L, 
25L, 27L, 40L, 43L, 68L, 72L, 75L, 80L)), .Names = c("GroupID", 
"StartNum", "EndNum"), class = "data.frame", row.names = c(NA, 
-10L))

head(TwoNum)

Here is the data table with a single number:

OneNum <- structure(list(GroupID = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), MiddleNum = c(3L, 5L, 
6L, 7L, 24L, 25L, 33L, 34L, 35L, 37L, 42L, 67L, 71L, 73L, 74L, 
75L, 78L, 79L, 80L)), .Names = c("GroupID", "MiddleNum"), class = "data.frame", 
row.names = c(NA, 
-19L))  

head(OneNum)

When the MiddleNum is between the StartNum and EndNum I am trying to replace the NA with the corresponding GroupID - i.e. replace the NA with the GroupID row that brackets the middle number.

My real data set is substantially longer and I am thus trying to build this into a for() loop that checks if the Middle number is between ANY (i.e. all rows) of the Start and End pairs and if yes, adds the corresponding GroupID to the OneNum data frame.

Any suggestions would be appreciated. I am not necessarily looking for someone to create the entire loop (but would not turn that down either...), but new ideas would help greatly.
Thanks.

2 Answers

Using the data.table package:

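library(data.table)

# Convert both data frames to data.tables
TwoNum <- data.table(TwoNum)
OneNum <- data.table(OneNum)
OneNum[, GroupID := NULL]  # drop the all-NA column; the join supplies GroupID

# Expose StartNum under the same name as the key column in OneNum
TwoNum[, MiddleNum := StartNum]

setkey(TwoNum, MiddleNum)
setkey(OneNum, MiddleNum)

# Rolling join: each MiddleNum takes the row with the last StartNum at or below it
TwoNum[OneNum, roll = Inf]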

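roll = Inf lets the merge fall back to the closest preceding key when there is no exact match, so each MiddleNum is paired with the nearest StartNum at or below it. Note that the join never looks at EndNum. Your problem may have more cases (multiple matches for the same MiddleNum, a MiddleNum that falls outside every range, and so on), so I suggest you experiment with it a bit to make sure it works.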

Output

> TwoNum[OneNum, roll = Inf]
    MiddleNum  GroupID StartNum EndNum
 1:         3 Clstr001        2      4
 2:         5 Clstr002        5      7
 3:         6 Clstr002        5      7
 4:         7 Clstr002        5      7
 5:        24 Clstr007       23     25
 6:        25 Clstr007       23     25
 7:        33 Clstr010       32     40
 8:        34 Clstr010       32     40
 9:        35 Clstr010       32     40
10:        37 Clstr010       32     40
11:        42 Clstr011       41     43
12:        67 Clstr015       67     68
13:        71 Clstr016       70     72
14:        73 Clstr017       73     75
15:        74 Clstr017       73     75
16:        75 Clstr017       73     75
17:        78 Clstr018       78     80
18:        79 Clstr018       78     80
19:        80 Clstr018       78     80
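
The "MiddleNum outside all ranges" case mentioned above can be checked without data.table. What follows is only a sketch, not part of the original answer: it swaps in base R's findInterval() to do the same "last StartNum at or below" lookup, then compares against EndNum. It assumes StartNum is sorted in increasing order and that the ranges do not overlap. The value 10 is a made-up MiddleNum placed in the gap between two ranges.

```r
# First three intervals from the question, for brevity
TwoNum <- data.frame(
  GroupID  = c("Clstr001", "Clstr002", "Clstr007"),
  StartNum = c(2L, 5L, 23L),
  EndNum   = c(4L, 7L, 25L),
  stringsAsFactors = FALSE
)
# 10 is a hypothetical value that lies between the ranges 5-7 and 23-25
OneNum <- data.frame(MiddleNum = c(3L, 5L, 7L, 10L, 24L))

# Row index of the last StartNum <= MiddleNum (0 when there is none);
# findInterval() requires StartNum to be sorted in increasing order
idx <- findInterval(OneNum$MiddleNum, TwoNum$StartNum)
idx[idx == 0] <- NA

# Keep the match only when MiddleNum does not exceed that row's EndNum
inside <- !is.na(idx) & OneNum$MiddleNum <= TwoNum$EndNum[idx]
OneNum$GroupID <- ifelse(inside, TwoNum$GroupID[idx], NA_character_)

print(OneNum)
```

Here 10 gets NA rather than Clstr002, which is the group a plain rolling lookup on StartNum would assign to it.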
answered 2013-10-10T07:07:07.937

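Here is some base R that solves the problem. It won't be very fast for very large data sets, but it won't run into memory problems if the range between StartNum and EndNum gets large. It also does literally what you asked, and it returns NA for values that fall between none of the pairs. If you don't care what happens on a failure, or a failure is impossible (every value gets classified), you can leave out the if statement. You can change the comparison to <= if necessary; in the sample data several MiddleNum values equal a StartNum or EndNum, so <= is what you would need there.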

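ids <- as.character(TwoNum$GroupID)
# Return the GroupID whose range strictly contains x, or NA if there is none
f <- function(x){
    a <- ids[ (TwoNum$StartNum < x) & (x < TwoNum$EndNum) ]
    if (length(a) == 0) NA else a
    }
# sapply() simplifies to a vector; lapply() would store a list column
OneNum$GroupID <- sapply(OneNum$MiddleNum, f)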
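
To see what the choice between < and <= changes, here is a small sketch on the first two ranges from the question. label_one() and its inclusive argument are hypothetical names introduced for this illustration only. Taking the first match when ranges overlap is also an assumption, since the question does not say how overlaps should be handled.

```r
TwoNum <- data.frame(
  GroupID  = c("Clstr001", "Clstr002"),
  StartNum = c(2L, 5L),
  EndNum   = c(4L, 7L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label one value, with open or closed interval ends
label_one <- function(x, inclusive = TRUE) {
  hit <- if (inclusive) {
    TwoNum$StartNum <= x & x <= TwoNum$EndNum
  } else {
    TwoNum$StartNum < x & x < TwoNum$EndNum
  }
  a <- TwoNum$GroupID[hit]
  # Assumption: take the first match if ranges overlap
  if (length(a) == 0) NA_character_ else a[1]
}

mid <- c(3L, 5L, 7L)
# vapply() guarantees a character vector, one label per value
strict <- vapply(mid, label_one, character(1), inclusive = FALSE)
closed <- vapply(mid, label_one, character(1), inclusive = TRUE)

print(data.frame(mid, strict, closed))
```

With strict comparison, 5 and 7 sit exactly on an endpoint and come back as NA. With <= they are both labelled Clstr002.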

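If your ranges actually cover every possible value, so that every MiddleNum will be labelled, then you only need one side of each range, and R already has a function that does this. In this case I have included numbers equal to the endpoints.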

cut(OneNum$MiddleNum, breaks = c(2, TwoNum$EndNum), labels = TwoNum$GroupID, include.lowest = TRUE, right = TRUE)
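
The condition that the ranges cover everything matters, because cut() only sees the EndNum breakpoints. A short sketch on the first three ranges from the question, with a made-up value of 10 sitting in the gap between 7 and 23:

```r
TwoNum <- data.frame(
  GroupID  = c("Clstr001", "Clstr002", "Clstr007"),
  StartNum = c(2L, 5L, 23L),
  EndNum   = c(4L, 7L, 25L),
  stringsAsFactors = FALSE
)
# 10 is a hypothetical value that belongs to none of the ranges
mid <- c(3L, 5L, 10L, 24L)

# Breaks are 2, 4, 7, 25, giving the bins [2,4], (4,7], (7,25]
labs <- cut(mid,
            breaks = c(min(TwoNum$StartNum), TwoNum$EndNum),
            labels = TwoNum$GroupID,
            include.lowest = TRUE, right = TRUE)

# cut() returns a factor; convert if a character column is wanted
print(as.character(labs))
```

The value 10 is labelled Clstr007 even though that range only starts at 23, so this shortcut is safe only when no MiddleNum can land in a gap.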
answered 2013-10-10T04:10:23.750