I'm new to R and have a question about loops.

My real dataset has 7,000 observations from 80 countries, with 15 sectors and 6 organization types, but here is a simplified example.

country <- c("a","a","a","a","a","a","b","b","b","b","b","b",
             "c","c","c","c","c","c","d","d","d","d","d","d")
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <-c("a","b","c","c","b","a","a","b","b","c","b","b",
                 "c","a","a","b","b","c","c","b","a","a","b","c")
budget <-c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
table <- data.frame(country, sector, organization, budget)

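What I want is:

  1. The number of organizations of each type in a given sector of a given country.
  2. The percentage of the sector's total budget that goes to each type of organization.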

First I have to make a subset to select only the rows from country "a" and sector "a":

smalltable <-subset(table, (country == "a") & (sector == "a"))

Then, to answer my first question, I count how many organizations of each type there are in one sector of one country:

smalltable$count <- table(smalltable$organization)

Then I need to find each row's share of the budget:

smalltable$percentage <- smalltable$budget / sum(smalltable$budget)

Then I used tapply:

 N <- tapply(smalltable$count, smalltable$organization, FUN=sum)
 financialshare <- tapply(smalltable$percentage, smalltable$organization, FUN=sum)    

Finally, I combined these:

 total <- data.frame (smalltable$country,smalltable$sector,smalltable$organization, N,financialshare)
 total

This is the small table I need!

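But I need this for all 15 sectors and all 80 countries, so I need some kind of loop that runs over all sectors and repeats for every country. I need the tables to be as compact as possible, with all the information on one country (that is, its 15 sectors) collected in a single table. Zero values should also be dropped from the table to save space.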

How should I proceed?

3 Answers

I'll give a data.table answer:

library(data.table)
my_table=data.table(country, sector, organization, budget)
by_org=my_table[, list(count=.N, budget=sum(budget)),
                  keyby=list(country, sector, organization)]
total_budgets=my_table[, list(total_budget=sum(budget)),
                  keyby=list(country, sector)]
joined_table= total_budgets[by_org]
joined_table[,percentage:=budget/total_budget]

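Edit from Matthew: in v1.8.1, := works by group, so no join is needed. This is simpler and faster, and the total_budget column is added on the right, which is a more natural place for it than where the join puts it in v1.8.0: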

DT = data.table(country, sector, organization, budget) 
ans = DT[, list(count=.N, budget=sum(budget)),
           keyby=list(country, sector, organization)] 
ans[, total_budget:=sum(budget), by=list(country,sector)]
ans[, percentage:=budget/total_budget]

Result (using v1.8.1):

    country sector organization count budget total_budget percentage
 1:       a      a            a     1      2            9  0.2222222
 2:       a      a            b     1      4            9  0.4444444
 3:       a      a            c     1      3            9  0.3333333
 4:       a      b            c     1      5            5  1.0000000
 5:       a      c            a     1      7           16  0.4375000
 6:       a      c            b     1      9           16  0.5625000
 7:       b      a            a     1      5            5  1.0000000
 8:       b      b            b     2      7           13  0.5384615
 9:       b      b            c     1      6           13  0.4615385
10:       b      c            b     2      3            3  1.0000000
11:       c      b            a     2     11           16  0.6875000
12:       c      b            b     1      1           16  0.0625000
13:       c      b            c     1      4           16  0.2500000
14:       c      c            b     1      5            8  0.6250000
15:       c      c            c     1      3            8  0.3750000
16:       d      a            b     1      2            6  0.3333333
17:       d      a            c     1      4            6  0.6666667
18:       d      b            a     2      8            8  1.0000000
19:       d      c            b     1      4           10  0.4000000
20:       d      c            c     1      6           10  0.6000000

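Two things to note here. First, your question is somewhat vague and contradictory about counts versus sums, but hopefully my code snippet is self-explanatory as to which calculations I'm doing.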

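Second, looping over a large number of observations is not customary in R, because it tends to be slow. Most people who have programmed in R for a while tend to use vectorized operations, plyr, data.table, and other similar packages.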
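As a rough illustration of the vectorized style, here is a minimal base R sketch of the same summary with no explicit loop. The data frame name `dat` and the choice of `aggregate()` and `ave()` are assumptions made for this example; they are not taken from the question.

```r
# Sample data from the question; `dat` is an assumed name for the data frame
country <- c("a","a","a","a","a","a","b","b","b","b","b","b",
             "c","c","c","c","c","c","d","d","d","d","d","d")
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <- c("a","b","c","c","b","a","a","b","b","c","b","b",
                  "c","a","a","b","b","c","c","b","a","a","b","c")
budget <- c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
dat <- data.frame(country, sector, organization, budget,
                  stringsAsFactors = FALSE)

# One row per observed country/sector/organization combination:
# summing a column of ones gives the row count, summing budget gives the total
dat$count <- 1
by_org <- aggregate(cbind(count, budget) ~ country + sector + organization,
                    data = dat, FUN = sum)

# Sector total within each country, repeated on every row of its group
by_org$total_budget <- ave(by_org$budget, by_org$country, by_org$sector,
                           FUN = sum)
by_org$percentage <- by_org$budget / by_org$total_budget

# Sort so that each country's sectors sit together
by_org <- by_org[order(by_org$country, by_org$sector, by_org$organization), ]
rownames(by_org) <- NULL
by_org
```

Only combinations that actually occur in the data produce a row, so there are no zero-count rows to remove afterwards.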

But for completeness, a loop is constructed like this:

for (item in list)
{
    ...
}

It is also common to iterate over indices:

for (i in seq_along(object))  # like 1:length(object), but safe when object is empty
{
    ...
}
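To make the loop form concrete, here is a minimal sketch that applies the question's subset-and-tapply steps to every country and sector. The names `dat`, `ctry`, `sect`, `small` and `results` are assumptions for illustration, and this is only one of several ways to write it.

```r
# Sample data from the question; `dat` is an assumed name for the data frame
country <- c("a","a","a","a","a","a","b","b","b","b","b","b",
             "c","c","c","c","c","c","d","d","d","d","d","d")
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <- c("a","b","c","c","b","a","a","b","b","c","b","b",
                  "c","a","a","b","b","c","c","b","a","a","b","c")
budget <- c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
dat <- data.frame(country, sector, organization, budget,
                  stringsAsFactors = FALSE)

results <- list()  # collects one small table per country/sector pair
for (ctry in unique(dat$country)) {
  for (sect in unique(dat$sector)) {
    small <- dat[dat$country == ctry & dat$sector == sect, ]
    if (nrow(small) == 0) next  # this sector does not occur in this country

    # Rows per organization type, and each type's share of the sector budget
    n <- tapply(small$budget, small$organization, length)
    share <- tapply(small$budget, small$organization, sum) / sum(small$budget)

    results[[paste(ctry, sect)]] <- data.frame(
      country = ctry, sector = sect, organization = names(n),
      N = as.vector(n), financialshare = as.vector(share),
      stringsAsFactors = FALSE
    )
  }
}

# Stack the small tables into one
total <- do.call(rbind, results)
rownames(total) <- NULL
total
```

Because `organization` is kept as a character column, `tapply` only sees the types present in each subset, so no empty groups appear in the result.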
answered 2012-06-19T11:58:19.113

library(plyr)
# Note: table(budget) counts rows per distinct budget value, not per organization
ddply(table, .(country, sector), transform,
      count = as.vector(table(budget)),
      percentage = budget / sum(budget))

   country sector organization budget count percentage
1        a      a            a      2     1  0.2222222
2        a      a            b      4     1  0.4444444
3        a      a            c      3     1  0.3333333
4        a      b            c      5     1  1.0000000
5        a      c            b      9     1  0.5625000
6        a      c            a      7     1  0.4375000
7        b      a            a      5     1  1.0000000
8        b      b            b      4     1  0.3076923
9        b      b            b      3     1  0.2307692
10       b      b            c      6     1  0.4615385
11       b      c            b      1     1  0.3333333
12       b      c            b      2     1  0.6666667
13       c      b            c      4     1  0.2500000
14       c      b            a      5     1  0.3125000
15       c      b            a      6     1  0.3750000
16       c      b            b      1     1  0.0625000
17       c      c            b      5     1  0.6250000
18       c      c            c      3     1  0.3750000
19       d      a            c      4     1  0.6666667
20       d      a            b      2     1  0.3333333
21       d      b            a      3     1  0.3750000
22       d      b            a      5     1  0.6250000
23       d      c            b      4     1  0.4000000
24       d      c            c      6     1  0.6000000
answered 2012-06-19T11:30:32.057

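You have set this up perfectly for plyr. What I mean is that you have a procedure that (almost) handles one subset and returns what you want for that subset; now you just need to go over all possible subsets. I rewrote your code to make it tighter and to handle organizations that may be missing.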

library("plyr")

ddply(table, .(country, sector), function(smalltable) {
  smalltable <- ddply(smalltable, .(organization), summarise, 
                      count=length(budget), budget=sum(budget))
  smalltable$percentage <- smalltable$budget / sum(smalltable$budget)
  smalltable
})

This gives

   country sector organization count budget percentage
1        a      a            a     1      2  0.2222222
2        a      a            b     1      4  0.4444444
3        a      a            c     1      3  0.3333333
4        a      b            c     1      5  1.0000000
5        a      c            a     1      7  0.4375000
6        a      c            b     1      9  0.5625000
7        b      a            a     1      5  1.0000000
8        b      b            b     2      7  0.5384615
9        b      b            c     1      6  0.4615385
10       b      c            b     2      3  1.0000000
11       c      b            a     2     11  0.6875000
12       c      b            b     1      1  0.0625000
13       c      b            c     1      4  0.2500000
14       c      c            b     1      5  0.6250000
15       c      c            c     1      3  0.3750000
16       d      a            b     1      2  0.3333333
17       d      a            c     1      4  0.6666667
18       d      b            a     2      8  1.0000000
19       d      c            b     1      4  0.4000000
20       d      c            c     1      6  0.6000000
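The question also asks for one table per country. A minimal sketch of that last step uses `split()` on the summary. The summary is rebuilt here with base R so the example runs on its own; the names `dat`, `result` and `per_country` are assumptions for illustration.

```r
# Sample data from the question; `dat` is an assumed name for the data frame
country <- c("a","a","a","a","a","a","b","b","b","b","b","b",
             "c","c","c","c","c","c","d","d","d","d","d","d")
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <- c("a","b","c","c","b","a","a","b","b","c","b","b",
                  "c","a","a","b","b","c","c","b","a","a","b","c")
budget <- c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
dat <- data.frame(country, sector, organization, budget,
                  stringsAsFactors = FALSE)

# Rebuild the summary with base R so this sketch does not depend on plyr
dat$count <- 1
result <- aggregate(cbind(count, budget) ~ country + sector + organization,
                    data = dat, FUN = sum)
result$percentage <- result$budget /
  ave(result$budget, result$country, result$sector, FUN = sum)
result <- result[order(result$sector, result$organization), ]

# One table per country; the country column is dropped because
# the list element's name already carries it
per_country <- lapply(split(result, result$country), function(x) {
  x$country <- NULL
  rownames(x) <- NULL
  x
})
per_country$a  # all sectors of country "a" in one table
```

Each element of `per_country` holds every sector of one country, and combinations that never occur simply have no row.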

Note that table is not a good name for a variable, because it is also the name of a base function.
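As a small illustration of the naming point, a more descriptive name keeps the data and the base function apart. The name `budget_data` and the toy values are made up for this example.

```r
# Hypothetical name and toy data, chosen only for illustration
budget_data <- data.frame(country = c("a", "a", "b"), budget = c(2, 4, 5))

# With the data stored under its own name, `table` can only mean the base function
country_counts <- table(budget_data$country)
country_counts
```

R does still find the function when `table(...)` is called while a data frame named `table` exists, but the shared name makes the code harder to read.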

answered 2012-06-19T15:24:20.360