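I am using a "share model" to estimate the values of missing observations. With the example data set `my.data`, I fill in the missing observations in each of three years (1980, 1990 and 2000) in proportion to how the observations are distributed in 1970 (although I could use 2010, or both 1970 and 2010, instead).
Below I present the example data, the desired result, and code that obtains the desired estimates in two ways. The code for the first approach is very specific to this example. I would like to create a function that is more general than the one in the second approach. It seems to me that a more general function requires calling a function on a list of lists, and I am hoping someone can advise me on how to apply a function to a list of lists.
Here are the example data set and the highly specific solution: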
my.data <- read.table(text = '
county y1970 y1980 y1990 y2000 y2010
aa 50 NA 70 NA 500
cc 10 20 NA NA 100
ee 800 NA NA 400 8000
gg 1000 1900 NA NA 10000
ii 200 400 300 100 2000
kk 20 40 30 NA 200
', header = TRUE, na.strings='NA', stringsAsFactors=FALSE)
my.total <- read.table(text = '
county y1970 y1980 y1990 y2000 y2010
total 2080 4000 3000 1000 20800
', header = TRUE, na.strings='NA', stringsAsFactors=FALSE)
desired.result <- read.table(text = '
county y1970 y1980 y1990 y2000 y2010
aa 50 96.47059 70 23.148148 500
cc 10 20 14.36464 4.629630 100
ee 800 1543.529 1149.17127 400 8000
gg 1000 1900 1436.46409 462.962963 10000
ii 200 400 300 100 2000
kk 20 40 30 9.259259 200
', header = TRUE, na.strings='NA', stringsAsFactors=FALSE)
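# 1970 values for the counties missing in 1980 (aa, ee)
x70 <- c(50, 800)
estimates.for.80 <- (x70 / sum(x70)) * (my.total$y1980 - sum(my.data$y1980, na.rm = TRUE))
# 1970 values for the counties missing in 1990 (cc, ee, gg)
x80 <- c(10, 800, 1000)
estimates.for.90 <- (x80 / sum(x80)) * (my.total$y1990 - sum(my.data$y1990, na.rm = TRUE))
# 1970 values for the counties missing in 2000 (aa, cc, gg, kk)
x90 <- c(50, 10, 1000, 20)
estimates.for.00 <- (x90 / sum(x90)) * (my.total$y2000 - sum(my.data$y2000, na.rm = TRUE))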
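The hard-coded vectors above could probably be derived from the data instead. The following is only a sketch, and it assumes that every NA in a given year should be estimated from that county's y1970 share, which would not hold if some NAs need to be left alone. The names `totals` and `filled` are introduced here purely for illustration.

```r
# Sketch only: assumes every NA in a year is to be filled from y1970 shares.
my.data <- read.table(text = '
county y1970 y1980 y1990 y2000 y2010
aa 50 NA 70 NA 500
cc 10 20 NA NA 100
ee 800 NA NA 400 8000
gg 1000 1900 NA NA 10000
ii 200 400 300 100 2000
kk 20 40 30 NA 200
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)

# known state totals for the years that contain missing values
totals <- c(y1980 = 4000, y1990 = 3000, y2000 = 1000)

filled <- my.data
for (yr in names(totals)) {
  missing   <- is.na(my.data[[yr]])                        # counties to estimate
  remainder <- totals[[yr]] - sum(my.data[[yr]], na.rm = TRUE)
  base      <- my.data$y1970[missing]                      # 1970 values of those counties
  filled[[yr]][missing] <- base / sum(base) * remainder    # proportional allocation
}
filled
```

Each filled column then sums to its state total, which is a quick way to check the allocation.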
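Here is the function. I think it could be generalized if I knew how to pass `d.counties` to the function as an input list. In other words, how can I include `d.counties` along with `my.input` and still have the function work? I think my confusion stems from the fact that the elements of `d.counties` differ in length from year to year.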
state <- 'my.state'
my.df <- read.table(text = '
county y1970 y1980 y1990 y2000 y2010
aa 50 NA 70 NA 500
cc 10 20 NA NA 100
ee 800 NA NA 400 8000
gg 1000 1900 NA NA 10000
ii 200 400 300 100 2000
kk 20 40 30 NA 200
total 2080 4000 3000 1000 20800
', header = TRUE, na.strings='NA', stringsAsFactors=FALSE)
pre.divide.up <- tail(my.df[,2:ncol(my.df)], 1) - colSums(head(my.df[,2:ncol(my.df)], -1), na.rm = TRUE)
# For each column containing NA's define the years to use as shares
# If use.years = 'pre' then use the year in pre.year
# If use.years = 'post' then use the year in post.year
# If use.years = 'both' then use both the year in pre.year and the year in post.year
# If use.years = 'last' then use only the year in pre.year (post.year is ignored)
#
# Here I define pre.year = y1970 and post.year = y2010 for every year
# However, 'pre.year' and 'post.year' are variables. They can differ among rows below.
shares <- read.table(text = '
cyear pre.year post.year use.years
y1980 y1970 y2010 pre
y1990 y1970 y2010 pre
y2000 y1970 y2010 pre
', header = TRUE, na.strings = "NA")
d.counties.80 <- c( 'aa' ,
'ee' )
d.counties.90 <- c( 'cc' ,
'ee' ,
'gg' )
d.counties.00 <- c( 'aa' ,
'cc' ,
'gg' ,
'kk' )
d.counties <- list(d.counties.80, d.counties.90, d.counties.00)
my.input <- data.frame(shares)
my.function <- function(y) {
  # extract years of interest from my.df and store in data.frame called year.data
  if(y[[4]] != 'last') year.data = my.df[names(my.df) %in% c("county", y[[2]], y[[1]], y[[3]])]
  if(y[[4]] == 'last') year.data = my.df[names(my.df) %in% c("county", y[[2]], y[[1]] )]
  # subset counties in year.data to only include counties with NA's in current year
  if(as.numeric(substr(y[1], 2, 5)) == 1980) year.data = year.data[year.data$county %in% d.counties.80,]
  if(as.numeric(substr(y[1], 2, 5)) == 1990) year.data = year.data[year.data$county %in% d.counties.90,]
  if(as.numeric(substr(y[1], 2, 5)) == 2000) year.data = year.data[year.data$county %in% d.counties.00,]
  # reorder columns in year.data
  if(y[[4]] != 'last') year.data = year.data[, c('county', y[[2]], y[[1]], y[[3]])]
  if(y[[4]] == 'last') year.data = year.data[, c('county', y[[2]], y[[1]] )]
  # values to be divided, or distributed, among counties with NA's in the current year
  divide.up <- pre.divide.up[, y[[1]]]
  # sum values from designated pre and/or post years and bind those totals to bottom of year.data
  if(y[[4]] != 'last') colsums.year = data.frame('total', as.data.frame(t(as.numeric(colSums(year.data[,c(2:4)], na.rm=TRUE)))))
  if(y[[4]] == 'last') colsums.year = data.frame('total', as.data.frame(t(as.numeric(colSums(year.data[,c(2:3)], na.rm=TRUE)))))
  names(colsums.year) <- names(year.data)
  year.data.b <- rbind(year.data, colsums.year)
  # obtain percentages in designated pre and/or post years for counties with NA's in current year
  year.data.c <- year.data.b
  year.data.c[, -1] <- lapply( year.data.c[ , -1], function(x){ x/x[nrow(year.data.b)] } )
  # estimate county values for current year by distributing total missing values in current year
  # according to how values were distributed in those same counties in other years
  if(y[[4]] == 'both') year.data.b[, y[[1]]] = rowMeans(data.frame(year.data.c[, y[[2]]], year.data.c[, y[[3]]])) * as.numeric(divide.up)
  if(y[[4]] == 'pre') year.data.b[, y[[1]]] = year.data.c[, y[[2]]] * as.numeric(divide.up)
  if(y[[4]] == 'post') year.data.b[, y[[1]]] = year.data.c[, y[[3]]] * as.numeric(divide.up)
  if(y[[4]] == 'last') year.data.b[, y[[1]]] = year.data.c[, y[[2]]] * as.numeric(divide.up)
  # extract estimates for current year along with the county column, then remove the last row
  year.data.last <- year.data.b[names(year.data.b) %in% c("county", y[[1]])]
  year.data.last <- year.data.last[-nrow(year.data.last),]
  colnames(year.data.last) <- c('county', 'acreage')
  # create a data set for export
  this.year <- rep(as.numeric(substr(y[[1]], 2, 5)), nrow(year.data.last))
  revised.data <- data.frame(state, this.year, year.data.last)
  return(revised.data)
}
my.list <- apply(shares, 1, function(y) my.function(y))
my.list2 <- do.call("rbind", my.list)
my.list2
state this.year county acreage
1 my.state 1980 aa 96.470588
3 my.state 1980 ee 1543.529412
2 my.state 1990 cc 14.364641
31 my.state 1990 ee 1149.171271
4 my.state 1990 gg 1436.464088
11 my.state 2000 aa 23.148148
21 my.state 2000 cc 4.629630
41 my.state 2000 gg 462.962963
6 my.state 2000 kk 9.259259
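One possible way to pass a list such as `d.counties` alongside the rows of `shares` is `Map`, which walks several inputs in parallel and does not care that the list elements differ in length. The following is a minimal sketch rather than a drop-in replacement: `estimate.year` is a hypothetical helper, it only covers the 'pre' case, and it assumes the 'total' row is stored in the same data frame.

```r
my.df <- read.table(text = '
county y1970 y1980 y1990 y2000 y2010
aa 50 NA 70 NA 500
cc 10 20 NA NA 100
ee 800 NA NA 400 8000
gg 1000 1900 NA NA 10000
ii 200 400 300 100 2000
kk 20 40 30 NA 200
total 2080 4000 3000 1000 20800
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)

shares <- data.frame(cyear    = c('y1980', 'y1990', 'y2000'),
                     pre.year = 'y1970',
                     stringsAsFactors = FALSE)

# one character vector per row of shares; lengths may differ
d.counties <- list(c('aa', 'ee'),
                   c('cc', 'ee', 'gg'),
                   c('aa', 'cc', 'gg', 'kk'))

# hypothetical helper: estimates one year for an explicit set of counties
estimate.year <- function(cyear, pre.year, counties, df = my.df) {
  body      <- df[df$county != 'total', ]
  total     <- df[df$county == 'total', cyear]
  remainder <- total - sum(body[[cyear]], na.rm = TRUE)     # amount left to distribute
  base      <- body[[pre.year]][match(counties, body$county)]
  data.frame(this.year = as.numeric(substr(cyear, 2, 5)),
             county    = counties,
             acreage   = base / sum(base) * remainder,
             stringsAsFactors = FALSE)
}

# the i-th year, pre.year and county vector are passed together
my.list <- Map(estimate.year, shares$cyear, shares$pre.year, d.counties)
result  <- do.call(rbind, my.list)
rownames(result) <- NULL
result
```

Because the counties are still supplied explicitly for each year, this keeps the ability to say exactly which missing values should be estimated.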
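Although `my.function` is not as general as the function in my answer below, it does allow me to specify explicitly which counties have relevant missing values. In the real data there are two types of missing values, and the function in my answer below cannot distinguish between them. `my.function` can, because I tell it exactly which counties to consider in each year.
Thank you again for any advice, and for the advice already provided.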