I'm new to R, and my problem is as follows:

I have panel data organized as a time series, like this (only part of it is shown):

Week_Starting    Team A            Team B      Team C   Team D              
2010-01-02         1                   2           3        4
2010-01-09         2                  40           1        5
2010-01-16        15                <NA>           4       11
2010-01-23        25                <NA>           7       18
2010-01-30        38                <NA>           9       29
2010-02-06      <NA>                <NA>          12       34
2010-02-13      <NA>                <NA>          16       40
2010-02-20      <NA>                <NA>          20     <NA>
2010-02-27      <NA>                <NA>          15       28
2010-03-06      <NA>                <NA>          20     <NA>
2010-03-13      <NA>                <NA>          24     <NA>
2010-03-20      <NA>                <NA>          24     <NA>
2010-03-27      <NA>                <NA>          21     <NA>
2010-04-03      <NA>                <NA>          27     <NA>
2010-04-10      <NA>                <NA>          24     <NA>
2010-04-17      <NA>                <NA>          25     <NA>
2010-04-24      <NA>                <NA>          35     <NA>
2010-05-01      <NA>                <NA>          40     <NA>
2010-05-08      <NA>                <NA>          32     <NA>
2010-05-15      <NA>                <NA>        <NA>     <NA>
2010-05-22      <NA>                <NA>          39     <NA>

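For example, it makes no sense to use Team B, because too many observations are missing. The ranking system does not provide data for ranks below 40. So I would like to clean the data by removing the columns (variables) that do not have at least 8 consecutive weeks of observations (Teams A, B and D in this example). Team D does not qualify because of the gap in the week starting 2010-02-20, which leaves it with only 7 consecutive weeks. Keep in mind that I have more than 1000 columns.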

I tried this before, but it did not give me what I wanted, and unfortunately I'm not skilled enough to adapt the code to my needs.

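Some possible solutions I can think of:

  1. For each variable, subset the part that has 8 or more consecutive observations

  2. Set an observation to NA if it does not belong to a run of at least 8 consecutive non-NA observations, then delete the columns that contain only NAs, since columns that do not meet the 8-week minimum will then have only NA values (I hope you see what I mean)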
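
To make option 2 concrete, here is a rough sketch in base R. The helper name `mask_short_runs`, the `min_run` argument and the small two-column `wide` data frame are made up for illustration; it is one possible reading of the idea, not a tested solution for the full data set.

```r
# Hypothetical helper: set to NA every observation that is not part of a
# run of at least `min_run` consecutive non-NA values
mask_short_runs <- function(x, min_run = 8) {
  rr <- rle(!is.na(x))
  # keep only runs that are observed AND long enough
  rr$values <- rr$values & rr$lengths >= min_run
  x[!inverse.rle(rr)] <- NA
  x
}

# Small illustrative subset (first 10 weeks of Team C and Team D)
wide <- data.frame(
  Team_C = c(3, 1, 4, 7, 9, 12, 16, 20, 15, 20),
  Team_D = c(4, 5, 11, 18, 29, 34, 40, NA, 28, NA)
)

masked <- as.data.frame(lapply(wide, mask_short_runs))

# Drop the columns that are now entirely NA
cleaned <- masked[, colSums(!is.na(masked)) > 0, drop = FALSE]
names(cleaned)
```

With this subset, Team_D has at most 7 consecutive observations, so it should end up entirely NA and be dropped, while Team_C is kept unchanged.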

Just out of interest, would it be harder to do the same thing if the data were organized in long format?

# Using MrFlick's data frame; melt() comes from the reshape2 package
library(reshape2)

melt(dd, id = "Week_Starting")

       Week_Starting variable value
    1     2010-01-02   Team_A     1
    2     2010-01-09   Team_A     2
    3     2010-01-16   Team_A    15
    4     2010-01-23   Team_A    25
    5     2010-01-30   Team_A    38
    6     2010-02-06   Team_A    NA
    7     2010-02-13   Team_A    NA
    8     2010-02-20   Team_A    NA
    9     2010-02-27   Team_A    NA
    10    2010-03-06   Team_A    NA
    11    2010-03-13   Team_A    NA
    12    2010-03-20   Team_A    NA
    13    2010-03-27   Team_A    NA
    14    2010-04-03   Team_A    NA
    15    2010-04-10   Team_A    NA
    16    2010-04-17   Team_A    NA
    17    2010-04-24   Team_A    NA
    18    2010-05-01   Team_A    NA
    19    2010-05-08   Team_A    NA
    20    2010-05-15   Team_A    NA
    21    2010-05-22   Team_A    NA
    22    2010-01-02   Team_B     2
    23    2010-01-09   Team_B    40
    24    2010-01-16   Team_B    NA
    25    2010-01-23   Team_B    NA
    26    2010-01-30   Team_B    NA
    27    2010-02-06   Team_B    NA
    28    2010-02-13   Team_B    NA
    29    2010-02-20   Team_B    NA
    30    2010-02-27   Team_B    NA
    31    2010-03-06   Team_B    NA
    32    2010-03-13   Team_B    NA
    33    2010-03-20   Team_B    NA
    34    2010-03-27   Team_B    NA
    35    2010-04-03   Team_B    NA
    36    2010-04-10   Team_B    NA
    37    2010-04-17   Team_B    NA
    38    2010-04-24   Team_B    NA
    39    2010-05-01   Team_B    NA
    40    2010-05-08   Team_B    NA
    41    2010-05-15   Team_B    NA
    42    2010-05-22   Team_B    NA
    43    2010-01-02   Team_C     3
    44    2010-01-09   Team_C     1
    45    2010-01-16   Team_C     4
    46    2010-01-23   Team_C     7
    47    2010-01-30   Team_C     9
    48    2010-02-06   Team_C    12
    49    2010-02-13   Team_C    16
    50    2010-02-20   Team_C    20
    51    2010-02-27   Team_C    15
    52    2010-03-06   Team_C    20
    53    2010-03-13   Team_C    24
    54    2010-03-20   Team_C    24
    55    2010-03-27   Team_C    21
    56    2010-04-03   Team_C    27
    57    2010-04-10   Team_C    24
    58    2010-04-17   Team_C    25
    59    2010-04-24   Team_C    35
    60    2010-05-01   Team_C    40
    61    2010-05-08   Team_C    32
    62    2010-05-15   Team_C    NA
    63    2010-05-22   Team_C    39
    64    2010-01-02   Team_D     4
    65    2010-01-09   Team_D     5
    66    2010-01-16   Team_D    11
    67    2010-01-23   Team_D    18
    68    2010-01-30   Team_D    29
    69    2010-02-06   Team_D    34
    70    2010-02-13   Team_D    40
    71    2010-02-20   Team_D    NA
    72    2010-02-27   Team_D    28
    73    2010-03-06   Team_D    NA
    74    2010-03-13   Team_D    NA
    75    2010-03-20   Team_D    NA
    76    2010-03-27   Team_D    NA
    77    2010-04-03   Team_D    NA
    78    2010-04-10   Team_D    NA
    79    2010-04-17   Team_D    NA
    80    2010-04-24   Team_D    NA
    81    2010-05-01   Team_D    NA
    82    2010-05-08   Team_D    NA
    83    2010-05-15   Team_D    NA
    84    2010-05-22   Team_D    NA

Any suggestions?

1 Answer

You can do this by using rle to compute the run lengths of non-NA values. First, here is a nicer data.frame built from your data that you can copy/paste.

dd<-structure(list(Week_Starting = structure(1:21, .Label = c("2010-01-02", 
"2010-01-09", "2010-01-16", "2010-01-23", "2010-01-30", "2010-02-06", 
"2010-02-13", "2010-02-20", "2010-02-27", "2010-03-06", "2010-03-13", 
"2010-03-20", "2010-03-27", "2010-04-03", "2010-04-10", "2010-04-17", 
"2010-04-24", "2010-05-01", "2010-05-08", "2010-05-15", "2010-05-22"
), class = "factor"), Team_A = c(1L, 2L, 15L, 25L, 38L, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Team_B = c(2L, 
40L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA), Team_C = c(3L, 1L, 4L, 7L, 9L, 12L, 16L, 
20L, 15L, 20L, 24L, 24L, 21L, 27L, 24L, 25L, 35L, 40L, 32L, NA, 
39L), Team_D = c(4L, 5L, 11L, 18L, 29L, 34L, 40L, NA, 28L, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Week_Starting", 
"Team_A", "Team_B", "Team_C", "Team_D"), class = "data.frame", row.names = c(NA, 
-21L))

Now we define a function that computes the longest run of non-NA values in a vector

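consecnonNA <- function(x) {
    # run-length encode the NA indicator: TRUE runs are gaps, FALSE runs are observed
    rr <- rle(is.na(x))
    # longest run of observed values; the 0L guard avoids -Inf for an all-NA column
    max(0L, rr$lengths[!rr$values])
}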
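
To see what rle is doing here, this short sketch runs the same steps by hand on the Team_D column from the example data. The variable names `team_d`, `rr` and `longest` are just for illustration.

```r
# Team_D from the example: 7 observed weeks, a gap, 1 more week, then NAs
team_d <- c(4L, 5L, 11L, 18L, 29L, 34L, 40L, NA, 28L, rep(NA, 12))

rr <- rle(is.na(team_d))
rr$lengths  # 7 1 1 12
rr$values   # FALSE TRUE FALSE TRUE

# Longest run of observed (non-NA) values
longest <- max(0L, rr$lengths[!rr$values])
longest     # 7, which is below the 8-week threshold
```

The single NA in the week starting 2010-02-20 splits the observed values into runs of 7 and 1, which is why Team_D does not qualify.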

We can compute this value for each column and return the names of the columns that have a run of at least 8 consecutive weeks

atleast <- function(i) {function(x) x>=i}
hasatleast8 <- names(Filter(atleast(8), sapply(dd[,-1], consecnonNA)))

Then we can subset the data frame with

dd[, c("Week_Starting", hasatleast8), drop=FALSE]
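
As for the long-format part of the question, the same idea should carry over without much extra work. Below is a minimal base R sketch, assuming the column names `Week_Starting`, `variable` and `value` that melt produces; the small `long` data frame and the helper `longest_run` are made up for illustration.

```r
# Hypothetical long-format data: one row per team and week
long <- data.frame(
  Week_Starting = rep(seq(as.Date("2010-01-02"), by = "week", length.out = 10),
                      times = 2),
  variable = rep(c("Team_C", "Team_D"), each = 10),
  value = c(3, 1, 4, 7, 9, 12, 16, 20, 15, 20,
            4, 5, 11, 18, 29, 34, 40, NA, 28, NA)
)

# Longest run of non-NA values in a vector (0 if everything is NA)
longest_run <- function(x) {
  rr <- rle(is.na(x))
  max(0L, rr$lengths[!rr$values])
}

# Rows must be in week order within each team for the runs to be meaningful
long <- long[order(long$variable, long$Week_Starting), ]

# One run length per team, then keep only the teams that reach 8 weeks
runs <- tapply(long$value, long$variable, longest_run)
keep <- names(runs)[runs >= 8]
cleaned <- long[long$variable %in% keep, ]
```

Here tapply plays the role that sapply plays for the wide data: it applies the run-length function once per team instead of once per column.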
answered 2014-07-06T22:09:19.457