r - Using R to sum values for cross-sectional unit in panel data

Question

All,

The company where I work gave me this data to work with. In short, it's time-series cross-section (TSCS) data with the firm as the cross-sectional unit and fiscal years as the time units. Each firm has various accounts. I'm interested in creating a total of money spent across all accounts for a given firm in each fiscal year.

I can provide a simple illustration of the data below. Let firm be the cross-sectional unit of interest. Each firm has various accounts on which the company spends money. Some accounts are common to all firms, others are unique. Not every firm had money spent on an account in a given year. In fact, some were not eligible for accounts until later on in the data, and others drop out, so the panel can be considered unbalanced. The NAs in the data I was provided could therefore be treated as 0s, though that's a little bit problematic: some firms are eligible in a given year but don't receive money in an account, while other firms are ineligible because of drop-out or late entry.

The data look like this; they were given to me in wide format. This is a simplified version for illustration. In this illustration, firm=B wasn't eligible for an account in FY1990 and firm=C drops out in FY1992.

firm   account   FY1990 FY1991 FY1992
A     Account 1    500    900   1000
A     Account 2     30     40     40
A     Account 3     NA     60     20
A     Account 4     NA     35     NA
B     Account 1     NA    340     60
B     Account 2     NA    500    800
B     Account 3     NA    800     NA
B     Account 4     NA     60   1000
C     Account 1   1000    400     NA
C     Account 5    500     60     NA
C     Account 8     60   1000     NA
D     Account 1    400    400    400
D     Account 2     NA   1000   1000
D     Account 3    300     40    300
D     Account 6     NA    300    300
D     Account 7    900    900   1000
D     Account 8   1000   1200   1500

What I'd like to do (and was told to do) is amend this data so that it looks like this:

firm   account   FY1990 FY1991 FY1992
A     Account 1    500    900   1000
A     Account 2     30     40     40
A     Account 3     NA     60     20
A     Account 4     NA     35     NA
A      TOTAL       530   1035   1060
B     Account 1     NA    340     60
B     Account 2     NA    500    800
B     Account 3     NA    800     NA
B     Account 4     NA     60   1000
B      TOTAL        NA   1700   1860
C     Account 1   1000    400     NA
C     Account 5    500     60     NA
C     Account 8     60   1000     NA
C      TOTAL      1560   1460     NA
D     Account 1    400    400    400
D     Account 2     NA   1000   1000
D     Account 3    300     40    300
D     Account 6     NA    300    300
D     Account 7    900    900   1000
D     Account 8   1000   1200   1500
D      TOTAL      2600   3840   4500

I could just as easily do this in Excel or some other spreadsheet program, but that would be tedious and it invites more human error than if I were to use R to program this. I'm not against creating a new data frame with the totals rather than trying to add a row underneath all the accounts for a given firm. It might be easier to just put a 0 for the total for a given firm ineligible for an account in a given fiscal year. I can always recode some zeroes as NAs next and automate that process as well.

My assumption is this would require a loop, but I'm a novice in R programming. Any input would be greatly appreciated.

Reproducible code for this illustration is below.

firm <- c("A","A","A","A","B","B","B","B","C","C","C","D","D","D","D","D","D")
account <- c("Account 1","Account 2","Account 3","Account 4","Account 1","Account 2","Account 3","Account 4","Account 1","Account 5","Account 8","Account 1","Account 2","Account 3","Account 6","Account 7","Account 8")
FY1990 <- c(500,30,NA,NA,NA,NA,NA,NA,1000,500,60,400,NA,300,NA,900,1000)
FY1991 <- c(900,40,60,35,340,500,800,60,400,60,1000,400,1000,40,300,900,1200)
FY1992 <- c(1000,40,20,NA,60,800,NA,1000,NA,NA,NA,400,1000,300,300,1000,1500)

Data <- data.frame(firm = firm, account = account, FY1990 = FY1990, FY1991 = FY1991, FY1992 = FY1992)
summary(Data)
Data

score 5 · Accepted Answer

Here is a data.table approach:

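library(data.table)
dt <- data.table(Data)

# For each firm, stack its rows (.SD) on top of a "TOTAL" row
dt[, rbind(.SD,
           c("TOTAL",
             # sum every fiscal-year column; a total is NA only if all values are NA
             lapply(.SD[, grepl("^FY[0-9]+", names(.SD)), with = FALSE],
                    function(x) sum(x, na.rm = !all(is.na(x))))),
           use.names = FALSE),
   by = firm]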

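Here is how it works: we iterate over the firms (by = firm), and for each firm we stack (rbind) two things:

the subset of the data associated with that firm (.SD), and
a vector (strictly speaking, a list) that starts with "TOTAL" and whose remaining elements are created by that long lapply call.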

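lapply only ever works on the data associated with one firm at a time. That data is stored in .SD, the special temporary data.table mentioned above. Columns can also be referred to directly by name (although that is not done in this example).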

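The lapply call works as follows: we iterate over a list of vectors (the columns whose names pass the grepl regular-expression test), and to each vector we apply a special variant of the sum function.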
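
To make the column-selection step concrete, here is a minimal base R sketch. The data frame `toy` and its values are invented for illustration and are not part of the question's data; the sketch only shows which columns the pattern "^FY[0-9]+" picks out and what lapply returns for them.

```r
# Toy data, invented for illustration only
toy <- data.frame(account = c("Account 1", "Account 2"),
                  FY1990  = c(500, 30),
                  FY1991  = c(900, 40),
                  stringsAsFactors = FALSE)

# TRUE for every column whose name starts with "FY" followed by digits
is_fy <- grepl("^FY[0-9]+", names(toy))
names(toy)[is_fy]

# lapply treats the selected columns as a list of vectors
# and applies the function to each one in turn
fy_sums <- lapply(toy[is_fy], sum)
fy_sums
```

The result is a named list with one element per fiscal-year column, which is why it can be glued onto "TOTAL" with c() to form one new row.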

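This variant of sum looks at the full vector x (again, a vector picked from the list we are iterating over, holding only the rows for one firm at a time) and checks whether x contains any non-NA entries (i.e., whether !all(is.na(x)) is TRUE). If it does, those entries are summed and any NAs are effectively treated as zero (because na.rm = TRUE); if it does not, the function returns NA (because na.rm = FALSE and the vector contains NAs).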
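
A small sketch of that behaviour, with the anonymous function given the hypothetical name `sum_or_na` so it can be called on its own:

```r
# Same logic as the anonymous function in the answer;
# the name sum_or_na is made up for this illustration
sum_or_na <- function(x) sum(x, na.rm = !all(is.na(x)))

sum_or_na(c(500, 30, NA, NA))              # some values present: NAs dropped, gives 530
sum_or_na(c(NA_real_, NA_real_))           # nothing to sum: stays NA
sum(c(NA_real_, NA_real_), na.rm = TRUE)   # plain na.rm = TRUE would give 0 instead
```

The last line is the reason for the variant: with a fixed na.rm = TRUE, a firm that was ineligible in a given year would get a total of 0 rather than NA.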

For details on the na.rm argument, see ?sum. Similarly, details on the other functions used above (grepl, lapply, ...) can be found with ?term or ?"term".

The by = firm option then stacks the per-firm results and adds firm as the first column.

This is the result:

    firm   account FY1990 FY1991 FY1992
 1:    A Account 1    500    900   1000
 2:    A Account 2     30     40     40
 3:    A Account 3     NA     60     20
 4:    A Account 4     NA     35     NA
 5:    A     TOTAL    530   1035   1060
 6:    B Account 1     NA    340     60
 7:    B Account 2     NA    500    800
 8:    B Account 3     NA    800     NA
 9:    B Account 4     NA     60   1000
10:    B     TOTAL     NA   1700   1860
11:    C Account 1   1000    400     NA
12:    C Account 5    500     60     NA
13:    C Account 8     60   1000     NA
14:    C     TOTAL   1560   1460     NA
15:    D Account 1    400    400    400
16:    D Account 2     NA   1000   1000
17:    D Account 3    300     40    300
18:    D Account 6     NA    300    300
19:    D Account 7    900    900   1000
20:    D Account 8   1000   1200   1500
21:    D     TOTAL   2600   3840   4500
    firm   account FY1990 FY1991 FY1992

You must install and load the data.table package first.
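
If installing a package is not an option, the same split, total, and stack idea can be sketched in base R. This is only an analogue of what by = firm achieves, not a description of how data.table works internally; the data frame `toy` and the helper name `sum_or_na` are invented for illustration.

```r
# Toy data, invented for illustration only
toy <- data.frame(firm    = c("A", "A", "B", "B"),
                  account = c("Account 1", "Account 2", "Account 1", "Account 2"),
                  FY1990  = c(500, 30, NA, NA),
                  FY1991  = c(900, 40, 340, 500),
                  stringsAsFactors = FALSE)

sum_or_na <- function(x) sum(x, na.rm = !all(is.na(x)))
fy_cols <- grep("^FY[0-9]+", names(toy), value = TRUE)

# 1. split: one data frame per firm
pieces <- split(toy, toy$firm)

# 2. apply: append a TOTAL row to each piece
with_totals <- lapply(pieces, function(d) {
  total <- d[1, , drop = FALSE]            # copy one row to keep the column layout
  total$account <- "TOTAL"
  total[fy_cols] <- lapply(d[fy_cols], sum_or_na)
  rbind(d, total)
})

# 3. combine: stack the pieces back into one data frame
result <- do.call(rbind, with_totals)
rownames(result) <- NULL
result
```

Here firm B has no FY1990 values at all, so its FY1990 total stays NA, while firm A's total is the sum of its two accounts.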

score 2 · Accepted Answer

Just another option, in case you want to stay with a data.frame.

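library(plyr)

# Sum that returns NA only when every value is NA
sumNA <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)

# Compute one TOTAL row per firm and append those rows to the original data
res <- rbind(Data,
             ddply(within(Data, account <- "TOTAL"), .(firm, account),
                   numcolwise(sumNA)))

# Sort by firm so that each TOTAL row follows that firm's accounts
(res <- res[order(res$firm), ])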

##    firm   account FY1990 FY1991 FY1992
## 1     A Account 1    500    900   1000
## 2     A Account 2     30     40     40
## 3     A Account 3     NA     60     20
## 4     A Account 4     NA     35     NA
## 18    A     TOTAL    530   1035   1060
## 5     B Account 1     NA    340     60
## 6     B Account 2     NA    500    800
## 7     B Account 3     NA    800     NA
## 8     B Account 4     NA     60   1000
## 19    B     TOTAL     NA   1700   1860
## 9     C Account 1   1000    400     NA
## 10    C Account 5    500     60     NA
## 11    C Account 8     60   1000     NA
## 20    C     TOTAL   1560   1460     NA
## 12    D Account 1    400    400    400
## 13    D Account 2     NA   1000   1000
## 14    D Account 3    300     40    300
## 15    D Account 6     NA    300    300
## 16    D Account 7    900    900   1000
## 17    D Account 8   1000   1200   1500
## 21    D     TOTAL   2600   3840   4500
