
I am trying to apply a function to every row of a data frame. A simple example of such a data frame is:

> accts
ACCOUNT       DATE
   1     2008-03-01
   2     2009-06-17
   3     2008-07-02
   4     2009-03-15

What I need to do is look at each row of this data frame and then find that account in a larger data frame, like this one:

> trans
ACCOUNT_NUM  TRAN_DATE
        1    2008-02-02
        2    2008-04-02
        3    2008-03-16
        3    2009-08-22
        3    2008-05-05
        6    2010-11-03
        7    2008-09-18
        4    2009-10-14
        4    2009-01-15
       10    2011-07-06

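For each row in the 'accts' data frame, I need to get the record in the 'trans' data frame for that account whose 'TRAN_DATE' is closest to 'DATE' but occurs before it. I tried using the apply function: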

tranDateVector <- apply(accts, 2, getTranDate)

getTranDate <- function(x)
{
  tranDate <- subset(trans$TRAN_DATE, with(trans, ACCOUNT_NUM == x[1] & TRAN_DATE < x[2]))
  dataDiff <- x[2] - tranDate
  tranDate <- unique(date[which(dateDiff == min(dateDiff))])
  return(tranDate)
}

accts <- cbind(accts, tranDateVector)

When I run my mini example, I get the following error:

Error in charToDate(x) : 
  character string is not in a standard unambiguous format

However, when I run my full version, I get a different error, and I realized it comes from this line:

subset(trans$TRAN_DATE, with(trans, ACCOUNT_NUM == x[1] & TRAN_DATE < x[2]))

If I set x to the third row of my 'accts' data frame, so that:

     x
     ACCOUNT       DATE
3       3      2008-07-02

and run the 'subset' line of the code, I get the following error, which matches the one I get from my regular code:

> subset(trans$TRAN_DATE, with(trans, ACCOUNT_NUM == x[1] & TRAN_DATE < x[2]))
Error in eval(expr, envir, enclos) : 
  dims [product 1] do not match the length of object [10]
In addition: Warning message:
In eval(expr, envir, enclos) :
  Incompatible methods ("Ops.Date", "Ops.data.frame") for "<"

Thanks for your help.


(The following information was added after the answer was posted, because I realized there is a complication.)

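The function I need has some additional constraints that I have only just realized, and they make the problem more complicated. There are two different statuses in the 'accts' data frame: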

> accts <- data.frame(
+     ACCOUNT = 1:4,
+     DATE = as.Date(c("2008-03-01", "2009-06-17",
+                      "2008-07-02", "2009-03-15")),
+     STATUS = c("new", "old", "new", "old"))

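In the 'accts' frame, a record is classified as either old or new. If an account is 'new', it must meet the conditions specified earlier, but it can also only be matched against records in 'trans' marked 'revised'. Likewise, 'old' accounts can only be compared against the 'orig' records in trans: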

> trans <- data.frame(
+     ACCOUNT_NUM = c(1,2,3,3,3,6,7,4,4,10),
+     TRAN_DATE = as.Date(c("2008-02-02", "2008-04-02",
+                           "2008-03-16", "2009-08-22",
+                           "2008-05-05", "2010-11-03",
+                           "2008-09-18", "2009-10-14",
+                           "2009-01-15", "2011-07-06")),
+     BALANCE = c("orig", "orig", "orig", "orig", "revised", "orig", "revised", "revised", "revised", "orig"))

I tried to adapt your code to this situation as follows:

library(plyr)
adply(accts, 1, transform,
            TRAN_DATE = { 
                 if(STATUS == "old")
                 {
                    data <- subset(trans, ACCOUNT_NUM == ACCOUNT &
                                             TRAN_DATE < DATE & BALANCE == "orig")
                 }else{
                    data <- subset(trans, ACCOUNT_NUM == ACCOUNT &
                                             TRAN_DATE < DATE & BALANCE == "revised")
                 }
                 tail(data$TRAN_DATE, 1) })

I get the following error from this code:

Error in data.frame(list(ACCOUNT = 1L, DATE = 13939, STATUS = 1L), BALANCE = list( : 
  arguments imply differing number of rows: 1, 0

I apologize for not specifying this requirement in my original post; I did not realize it would cause a problem.

1 Answer

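Because your data mixes types (numeric, Date), I would stay away from apply, since it coerces your data to a single type. Instead, I suggest plyr's adply function, which does preserve all types because each row is processed as a data.frame. It also has the advantage that fields can still be accessed by column name, which usually makes for more readable code; I'll let you be the judge.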
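To make the coercion concrete, here is a small sketch using the sample 'accts' data. It only illustrates base R behaviour: apply() converts a data frame with as.matrix(), so a frame that mixes integer and Date columns reaches the function as character strings. The variable names m and row_classes are just for illustration.

```r
accts <- data.frame(
  ACCOUNT = 1:4,
  DATE = as.Date(c("2008-03-01", "2009-06-17",
                   "2008-07-02", "2009-03-15")))

# apply() converts a data frame with as.matrix(); with a Date column
# present, every cell of the resulting matrix is a character string
m <- as.matrix(accts)
typeof(m)

# So each row handed to the function is a plain character vector
row_classes <- apply(accts, 1, function(x) class(x))
unique(row_classes)

# Indexing a single row of the data frame keeps the column classes
sapply(accts[3, ], class)
```

Once the row has become a character vector, x[2] is no longer a Date, which is probably why the comparisons against TRAN_DATE in the question misbehave.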

Your data:

accts <- data.frame(
  ACCOUNT = 1:4,
  DATE = as.Date(c("2008-03-01", "2009-06-17",
                   "2008-07-02", "2009-03-15")))

trans <- data.frame(
  ACCOUNT_NUM = c(1,2,3,3,3,6,7,4,4,10),
  TRAN_DATE = as.Date(c("2008-02-02", "2008-04-02",
                        "2008-03-16", "2009-08-22",
                        "2008-05-05", "2010-11-03",
                        "2008-09-18", "2009-10-14",
                        "2009-01-15", "2011-07-06")))

A solution using adply:

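library(plyr)
# Process accts one row at a time; inside transform(), ACCOUNT and DATE
# refer to the current row, while ACCOUNT_NUM and TRAN_DATE come from trans.
adply(accts, 1, transform,
      TRAN_DATE = { data <- subset(trans, ACCOUNT_NUM == ACCOUNT &
                                          TRAN_DATE < DATE)
                    # tail() keeps the last qualifying row in trans order
                    tail(data$TRAN_DATE, 1) })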
#   ACCOUNT       DATE  TRAN_DATE
# 1       1 2008-03-01 2008-02-02
# 2       2 2009-06-17 2008-04-02
# 3       3 2008-07-02 2008-05-05
# 4       4 2009-03-15 2009-01-15
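
For the follow-up in the question, where STATUS decides which BALANCE records qualify, some accounts have no qualifying transaction at all, and a zero-length result appears to be what triggers the "differing number of rows: 1, 0" error. Below is a minimal base-R sketch that returns NA in that case. The helper latest_before and the lookup vector balance_for are hypothetical names introduced here, not part of plyr or the original code. It also uses max() rather than tail(), so it does not depend on the row order of trans.

```r
accts <- data.frame(
  ACCOUNT = 1:4,
  DATE = as.Date(c("2008-03-01", "2009-06-17",
                   "2008-07-02", "2009-03-15")),
  STATUS = c("new", "old", "new", "old"),
  stringsAsFactors = FALSE)

trans <- data.frame(
  ACCOUNT_NUM = c(1, 2, 3, 3, 3, 6, 7, 4, 4, 10),
  TRAN_DATE = as.Date(c("2008-02-02", "2008-04-02",
                        "2008-03-16", "2009-08-22",
                        "2008-05-05", "2010-11-03",
                        "2008-09-18", "2009-10-14",
                        "2009-01-15", "2011-07-06")),
  BALANCE = c("orig", "orig", "orig", "orig", "revised",
              "orig", "revised", "revised", "revised", "orig"),
  stringsAsFactors = FALSE)

# Hypothetical lookup: which BALANCE label each account STATUS may match
balance_for <- c(new = "revised", old = "orig")

# Hypothetical helper: latest TRAN_DATE strictly before `date` for one
# account and status, or an NA Date when no transaction qualifies
latest_before <- function(account, date, status) {
  ok <- trans$ACCOUNT_NUM == account &
        trans$TRAN_DATE < date &
        trans$BALANCE == balance_for[[as.character(status)]]
  if (any(ok)) max(trans$TRAN_DATE[ok]) else as.Date(NA)
}

# Loop over row indices so that every column keeps its own class
matches <- lapply(seq_len(nrow(accts)), function(i)
  latest_before(accts$ACCOUNT[i], accts$DATE[i], accts$STATUS[i]))
accts$TRAN_DATE <- do.call(c, matches)
accts
```

With the sample data, accounts 2 and 3 get 2008-04-02 and 2008-05-05, while accounts 1 and 4 get NA: account 1 is "new" but has only an "orig" transaction, and account 4 is "old" but has only "revised" ones.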
answered 2013-04-12T23:12:15.570