13

This is a seemingly simple R question, but I don't see an exact answer here. I have a data frame (alldata) that looks like this:

Case     zip     market
1        44485   NA
2        44488   NA
3        43210   NA

There are over 3.5 million records.

Then, I have a second data frame, 'zipcodes'.

market    zip
1         44485
1         44486
1         44488
...       ... (100 zips in market 1)
2         43210
2         43211
...       ... (100 zips in market 2, etc.)

I want to find the correct value for alldata$market for each case based on alldata$zip matching the appropriate value in the zipcode data frame. I'm just looking for the right syntax, and assistance is much appreciated, as usual.

4

5 Answers

14

Since you don't care about the market column in alldata, you can first strip it off and then merge alldata and zipcodes on the zip column using merge:

merge(alldata[, c("Case", "zip")], zipcodes, by="zip")

The by parameter specifies the key, so if you have a compound key you could do something like by=c("zip", "otherfield").
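One thing worth checking with this approach: by default merge() does an inner join and sorts the result by the key, so cases whose zip is not in zipcodes disappear and the row order changes. A minimal sketch of keeping every case, assuming toy data shaped like the question's (the 99999 zip is a made-up unmatched row added for illustration):

```r
# Toy versions of the two data frames from the question.
# Case 4 (zip 99999) is hypothetical and has no match in zipcodes.
alldata <- data.frame(Case = 1:4, zip = c(44485, 44488, 43210, 99999))
zipcodes <- data.frame(market = c(1, 1, 1, 2, 2),
                       zip = c(44485, 44486, 44488, 43210, 43211))

# Default merge() is an inner join: the unmatched Case 4 is dropped.
inner <- merge(alldata, zipcodes, by = "zip")

# all.x = TRUE keeps every row of alldata; unmatched zips get market = NA.
left <- merge(alldata, zipcodes, by = "zip", all.x = TRUE)

# merge() sorts by the key, so restore the original order if it matters.
left <- left[order(left$Case), ]
```

Whether dropping unmatched cases is acceptable depends on your data; with 3.5 million records it is probably worth comparing nrow() before and after.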

answered 2013-07-24T20:48:21.197
9

Another option that worked for me and is very simple:

alldata$market<-with(zipcodes, market[match(alldata$zip, zip)])
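In case the one-liner is hard to read, here is a sketch of the same idea in two steps, using the sample data from the question. match() returns, for each alldata$zip, the position of its first occurrence in zipcodes$zip (or NA if absent), and that position vector then indexes zipcodes$market. The intermediate name idx is just for illustration:

```r
# Sample data shaped like the question's
alldata <- data.frame(Case = 1:3, zip = c(44485, 44488, 43210), market = NA)
zipcodes <- data.frame(market = c(1, 1, 1, 2, 2),
                       zip = c(44485, 44486, 44488, 43210, 43211))

# Position of each alldata zip within zipcodes$zip (NA when not found)
idx <- match(alldata$zip, zipcodes$zip)

# Pull the market at those positions; row order of alldata is preserved
alldata$market <- zipcodes$market[idx]
```

Unlike merge(), this keeps alldata in its original order and leaves NA for zips with no match.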
answered 2017-07-28T13:56:27.747
4

With such a large data set you may want the speed of an environment lookup. You can use the lookup function from the qdapTools package as follows:

library(qdapTools)
alldata$market <- lookup(alldata$zip, zipcodes[, 2:1])

or

alldata$zip %l% zipcodes[, 2:1]
answered 2013-07-24T22:13:17.367
3

Here's the dplyr way of doing it:

library(tidyverse)
alldata %>%
  select(-market) %>%
  left_join(zipcodes, by="zip")

On my machine, this performs about the same as lookup.
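A caveat that applies to any join here (merge or left_join): if a zip appears more than once in zipcodes, each matching case is duplicated in the result. A small base-R sketch of a sanity check you might run first, assuming each zip is supposed to belong to exactly one market (the question implies this but does not state it):

```r
# Lookup table shaped like the question's zipcodes data frame
zipcodes <- data.frame(market = c(1, 1, 1, 2, 2),
                       zip = c(44485, 44486, 44488, 43210, 43211))

# anyDuplicated() returns 0 when every zip is unique,
# otherwise the index of the first repeated value
dup_index <- anyDuplicated(zipcodes$zip)

# TRUE means a join on zip cannot multiply rows of alldata
keys_are_unique <- dup_index == 0
```

If keys_are_unique is FALSE, deduplicating zipcodes (or deciding which market wins) before joining avoids ending up with more rows than alldata had.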

answered 2017-05-18T10:14:42.270
0

The syntax of match is a bit clumsy. You may find the lookup package easier to use.

library(lookup)
alldata <- data.frame(Case=1:3, zip=c(44485,44488,43210), market=c(NA,NA,NA))
zipcodes <- data.frame(market=c(1,1,1,2,2), zip=c(44485,44486,44488,43210,43211))
alldata$market <- lookup(alldata$zip, zipcodes$zip, zipcodes$market)
alldata
##   Case   zip market
## 1    1 44485      1
## 2    2 44488      1
## 3    3 43210      2
answered 2021-04-14T16:31:45.780