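This seems simple, but I could not find an answer online. I have panel data with city characteristics covering 1995-2015. For some variables, I only have data for 2000 and 2010. I would therefore like to create new variables in which the missing values for 1995-2004 are imputed with the 2000 value and the missing values for 2005-2015 are imputed with the 2010 value.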

My dataset looks like this example:

   cities  idhm year
1       B    NA 1995
2       C    NA 1996
3       D    NA 1997
4       E    NA 1998
5       F    NA 1999
6       G 24599 2000
7       H    NA 2001
8       I    NA 2002
9       J    NA 2003
10      K    NA 2004
11      L    NA 2005
12      M    NA 2006
13      N    NA 2007
14      O    NA 2008
15      P    NA 2009
16      Q  5598 2010
17      R    NA 2011
18      S    NA 2012
19      T    NA 2013
20      U    NA 2014
21      V    NA 2015

I would like a dataset like this:

   cities  idhm year newvar
1       B    NA 1995  24599
2       C    NA 1996  24599
3       D    NA 1997  24599
4       E    NA 1998  24599
5       F    NA 1999  24599
6       G 24599 2000  24599
7       H    NA 2001  24599
8       I    NA 2002  24599
9       J    NA 2003  24599
10      K    NA 2004  24599
11      L    NA 2005   5598
12      M    NA 2006   5598
13      N    NA 2007   5598
14      O    NA 2008   5598
15      P    NA 2009   5598
16      Q  5598 2010   5598
17      R    NA 2011   5598
18      S    NA 2012   5598
19      T    NA 2013   5598
20      U    NA 2014   5598
21      V    NA 2015   5598

Any help is welcome.

2 Answers

I suspect your data is larger than this example, so a more general approach is a rolling join. I find this easiest with data.table.

First, build a dictionary of the rows with non-missing data to join against.

library(data.table)
setDT(data1)
dictionary <- data1[!is.na(idhm),.(year,idhm)]
dictionary
#   year  idhm
#1: 2000 24599
#2: 2010  5598

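Then perform the join with on = "year" and roll = "nearest". Note that 2005 is equidistant from 2000 and 2010; in the output below it is matched to 2000, whereas the question assigns 2005 to the 2010 value.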

result <- dictionary[data1,on = "year",roll="nearest"]
result[,.(cities,year,idhm)]
#   cities year  idhm
# 1:      B 1995 24599
# 2:      C 1996 24599
# 3:      D 1997 24599
# 4:      E 1998 24599
# 5:      F 1999 24599
# 6:      G 2000 24599
# 7:      H 2001 24599
# 8:      I 2002 24599
# 9:      J 2003 24599
#10:      K 2004 24599
#11:      L 2005 24599
#12:      M 2006  5598
#13:      N 2007  5598
#14:      O 2008  5598
#15:      P 2009  5598
#16:      Q 2010  5598
#17:      R 2011  5598
#18:      S 2012  5598
#19:      T 2013  5598
#20:      U 2014  5598
#21:      V 2015  5598
#    cities year  idhm

Data

data1 <- structure(list(cities = structure(1:21, .Label = c("B", "C", 
"D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", 
"Q", "R", "S", "T", "U", "V"), class = "factor"), idhm = c(NA, 
NA, NA, NA, NA, 24599L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5598L, 
NA, NA, NA, NA, NA), year = 1995:2015), class = "data.frame", row.names = c(NA, 
-21L))
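
If the periods need to follow the question's exact boundaries (1995-2004 and 2005-2015), one option is to key the dictionary on the first year of each period and roll forward instead. The following is a minimal sketch, not part of the original answer; it assumes each observed year sits five years after the start of its period, and the names `dt`, `period_lookup`, and `newvar` are only illustrative.

```r
library(data.table)

# Rebuild the example data (same values as data1 above)
dt <- data.table(
  cities = LETTERS[2:22],   # "B" ... "V"
  idhm   = NA_integer_,
  year   = 1995:2015
)
dt[year == 2000, idhm := 24599L]
dt[year == 2010, idhm := 5598L]

# Assumption: each period starts 5 years before its observed year
# (2000 -> 1995, 2010 -> 2005)
period_lookup <- dt[!is.na(idhm), .(year = year - 5L, newvar = idhm)]

# roll = TRUE carries the most recent period start forward, so
# 1995-2004 receive the 2000 value and 2005-2015 receive the 2010 value
result <- period_lookup[dt, on = "year", roll = TRUE]
result[, .(cities, idhm, year, newvar)]
```

Because the period boundaries are stated explicitly in the lookup table, a year that lies exactly halfway between two observations is no longer left to the tie-breaking behaviour of a nearest match.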
answered 2020-04-03T03:01:49.613

We can do:

df$new_var <- NA
df$new_var[df$year >= 1995 & df$year <= 2004] <- df$idhm[df$year == 2000]
df$new_var[df$year >= 2005 & df$year <= 2015] <- df$idhm[df$year == 2010]

Or using dplyr:

library(dplyr)

df %>%
  mutate(new_var = case_when(between(year, 1995, 2004) ~ idhm[year == 2000],
                             between(year, 2005, 2015) ~ idhm[year == 2010]))


#   cities  idhm year new_var
#1       B    NA 1995   24599
#2       C    NA 1996   24599
#3       D    NA 1997   24599
#4       E    NA 1998   24599
#5       F    NA 1999   24599
#6       G 24599 2000   24599
#7       H    NA 2001   24599
#8       I    NA 2002   24599
#9       J    NA 2003   24599
#10      K    NA 2004   24599
#11      L    NA 2005    5598
#12      M    NA 2006    5598
#13      N    NA 2007    5598
#14      O    NA 2008    5598
#15      P    NA 2009    5598
#16      Q  5598 2010    5598
#17      R    NA 2011    5598
#18      S    NA 2012    5598
#19      T    NA 2013    5598
#20      U    NA 2014    5598
#21      V    NA 2015    5598
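
Both snippets above rely on there being a single 2000 value and a single 2010 value in the whole data frame. In a real panel with several cities, `idhm[year == 2000]` would return one value per city, so the fill has to happen within each city. A minimal base R sketch of that idea follows; the two-city panel, its values, and the names `panel`, `ref_year`, and `observed` are made up for illustration.

```r
# Hypothetical panel: two cities, 1995-2015, idhm observed only in 2000 and 2010
panel <- expand.grid(year = 1995:2015, cities = c("A", "B"),
                     stringsAsFactors = FALSE)
panel$idhm <- NA_real_
panel$idhm[panel$cities == "A" & panel$year == 2000] <- 0.50
panel$idhm[panel$cities == "A" & panel$year == 2010] <- 0.60
panel$idhm[panel$cities == "B" & panel$year == 2000] <- 0.70
panel$idhm[panel$cities == "B" & panel$year == 2010] <- 0.80

# Reference year whose value should fill each row
panel$ref_year <- ifelse(panel$year <= 2004, 2000L, 2010L)

# One row per city and observed year, renamed so it can be merged back
observed <- panel[!is.na(panel$idhm), c("cities", "year", "idhm")]
names(observed) <- c("cities", "ref_year", "new_var")

# Left join on city and reference year, then restore the original row order
filled <- merge(panel, observed, by = c("cities", "ref_year"), all.x = TRUE)
filled <- filled[order(filled$cities, filled$year), ]
```

Merging on both the city and the reference year keeps each city's values separate, which the single-vector lookups cannot do once more than one city has a 2000 or 2010 observation.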
answered 2020-04-03T02:53:24.117