Consider the following data:

Country1 = c("Brazil", "India", "China","China","Brazil")
Date1<-as.Date(c("2001-01-21", "2002-04-13","2003-06-19","2006-06-19","2007-06-19"))
Name1<-c("B","C","A","A","A")
Data1<-data.frame(Country1,Date1,Name1)

Name2<-c("B","B","C","C","C","A","A","A")
Quality2<-c("good","good","medium","good","good","bad","good","good")
Country2<-c("China","Brazil","Taiwan","India","India","United States","China","Brazil")
Date2<-as.Date(c("2002-02-21", "1999-03-13","1998-08-19", "1996-09-13","2000-12-12","1998-07-21","2005-03-22","2003-06-19"))
Data2<-data.frame(Name2,Quality2,Country2,Date2)

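In Data1, I would like to add a column called "Result". For each row of Data1, "Result" should be the number of rows of Data2 that satisfy four conditions: (1) Data2$Name2 matches that row's Data1$Name1, (2) Data2$Country2 matches that row's Data1$Country1, (3) Data2$Quality2 is "good", and (4) Data2$Date2 is earlier than that row's Data1$Date1. Data1$Result should therefore be 1, 2, 0, 1 and 1.

For example, for the first row, Data1$Result should be 1, because only one row of Data2 satisfies these conditions:

sum(Data2$Name2==as.character(Data1$Name1)[1] & Data2$Country2==as.character(Data1$Country1)[1] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[1])

Or, in other words: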

sum(Data2$Name2=="B" & Data2$Country2=="Brazil" & Data2$Quality2=="good" & Data2$Date2 < "2001-01-21")

Similarly, for the second row, Data1$Result should be 2, because two rows of Data2 satisfy these conditions:

sum(Data2$Name2==as.character(Data1$Name1)[2] & Data2$Country2==as.character(Data1$Country1)[2] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[2])

Or:

sum(Data2$Name2=="C" & Data2$Country2=="India" & Data2$Quality2=="good" & Data2$Date2 < "2002-04-13")

For the third row, Data1$Result should be 0, because no row of Data2 satisfies these conditions:

sum(Data2$Name2==as.character(Data1$Name1)[3] & Data2$Country2==as.character(Data1$Country1)[3] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[3])

Or:

sum(Data2$Name2=="A" & Data2$Country2=="China" & Data2$Quality2=="good" & Data2$Date2 < "2003-06-19")

The same goes for rows 4 and 5:

sum(Data2$Name2==as.character(Data1$Name1)[4] & Data2$Country2==as.character(Data1$Country1)[4] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[4])

sum(Data2$Name2==as.character(Data1$Name1)[5] & Data2$Country2==as.character(Data1$Country1)[5] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[5])

As a beginner in R, I wrote the following code:

sum(Data2$Name2==as.character(Data1$Name1)[1:nrow(Data1)] & Data2$Country2==as.character(Data1$Country1)[1:nrow(Data1)] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[1:nrow(Data1)])

However, it does not return the desired result. I would like code that adapts to the number of rows in Data1. In my real data, each data frame has about 100,000 observations.

Ideally, I am looking for code that R evaluates once for each of the n rows of Data1.

For example, for the first row, R should evaluate

sum(Data2$Name2==as.character(Data1$Name1)[1] & Data2$Country2==as.character(Data1$Country1)[1] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[1])

For the second row,

sum(Data2$Name2==as.character(Data1$Name1)[2] & Data2$Country2==as.character(Data1$Country1)[2] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[2])

For (say) row 54,342,

sum(Data2$Name2==as.character(Data1$Name1)[54342] & Data2$Country2==as.character(Data1$Country1)[54342] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[54342])

For row n,

sum(Data2$Name2==as.character(Data1$Name1)[n] & Data2$Country2==as.character(Data1$Country1)[n] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[n])

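In addition, I would like to add another column to Data1, named "Min.Date.Result", that gives the minimum (oldest) value of Data2$Date2 among the rows satisfying the same four conditions. So Data1$Min.Date.Result should be "1999-03-13", "1996-09-13", NA, "2005-03-22", "2003-06-19".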

1 Answer

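We can filter to keep the rows where Quality2 is "good", join the result with Data1, group_by Country2, and then compute the number of rows and the minimum date where Date2 < Date1.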

library(dplyr)

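Data2 %>%
  # keep only the rows rated "good"
  filter(Quality2 == 'good') %>%
  # attach the matching Data1 rows by name and country, keeping every Data1 row
  right_join(Data1, by = c('Name2' = 'Name1', 'Country2' = 'Country1')) %>%
  group_by(Country2) %>%
  # count the earlier Data2 dates and take the earliest one per country
  summarise(Result = sum(Date2 < Date1),
            Date1 = min(Date2[Date2 < Date1]))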

# A tibble: 3 x 3
#  Country2 Result Date1     
#  <chr>     <int> <date>    
#1 Brazil        1 1999-03-13
#2 China         0 NA        
#3 India         2 1996-09-13
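
A side note on the NA in the China row: it comes from calling min() on an empty Date vector. Base R emits a warning in that case ("no non-missing arguments to min; returning Inf"), and the value it returns is not necessarily a true NA even if it prints as one. A minimal sketch of a guard follows; the helper name safe_min_date is an assumption for illustration and is not part of the original answer.

```r
# Hypothetical helper (not from the original answer): minimum of a Date vector
# that returns a real NA Date, without a warning, when nothing is left to compare.
safe_min_date <- function(d) {
  d <- d[!is.na(d)]                      # drop missing dates first
  if (length(d) == 0) as.Date(NA) else min(d)
}

empty_dates <- as.Date(character(0))
some_dates  <- as.Date(c("2000-12-12", "1996-09-13"))

safe_min_date(empty_dates)   # NA
safe_min_date(some_dates)    # "1996-09-13"
```

Inside summarise(), such a helper could stand in for min() wherever a group may have no qualifying dates.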

For the updated data, we can change the approach and do the following:

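Data1 %>%
  # keep every Data1 row and attach the Data2 rows with the same name and country
  left_join(Data2, by = c('Name1' = 'Name2', 'Country1' = 'Country2')) %>%
  group_by(Country1, Date1) %>%
  # count the "good" rows dated before Date1 and take the earliest such date
  summarise(Result = sum(Date2 < Date1 & Quality2 == "good"),
            Date = min(Date2[Date2 < Date1 & Quality2 == "good"]))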

#  Country1 Date1      Result Date      
#  <chr>    <date>      <int> <date>    
#1 Brazil   2001-01-21      1 1999-03-13
#2 China    2003-06-19      0 NA        
#3 China    2006-06-19      1 2005-03-22
#4 India    2002-04-13      2 1996-09-13
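
Grouping on Country1 and Date1 alone would merge any two Data1 rows that share both values but differ in Name1. The sketch below is one way to keep exactly one output row per Data1 row, in the original order, with the column names requested in the question. It rebuilds the five-row Data1 and the Data2 from the question so that it runs on its own. The names row_id, hit, min_or_na and Result1 are assumptions made for illustration. It is a variation on the join approach above, not the only way to do it.

```r
library(dplyr)

# Data as posted in the question.
Data1 <- data.frame(
  Country1 = c("Brazil", "India", "China", "China", "Brazil"),
  Date1 = as.Date(c("2001-01-21", "2002-04-13", "2003-06-19",
                    "2006-06-19", "2007-06-19")),
  Name1 = c("B", "C", "A", "A", "A"),
  stringsAsFactors = FALSE
)
Data2 <- data.frame(
  Name2 = c("B", "B", "C", "C", "C", "A", "A", "A"),
  Quality2 = c("good", "good", "medium", "good", "good", "bad", "good", "good"),
  Country2 = c("China", "Brazil", "Taiwan", "India", "India",
               "United States", "China", "Brazil"),
  Date2 = as.Date(c("2002-02-21", "1999-03-13", "1998-08-19", "1996-09-13",
                    "2000-12-12", "1998-07-21", "2005-03-22", "2003-06-19")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: earliest date, or a real NA when no row qualifies.
min_or_na <- function(d) if (length(d) == 0) as.Date(NA) else min(d)

Result1 <- Data1 %>%
  mutate(row_id = row_number()) %>%            # one id per Data1 row
  # dplyr 1.1 or later may warn about a many-to-many match; that is expected here
  left_join(Data2, by = c("Name1" = "Name2", "Country1" = "Country2")) %>%
  # TRUE only for joined rows that are "good" and dated before Date1
  mutate(hit = Quality2 %in% "good" & !is.na(Date2) & Date2 < Date1) %>%
  group_by(row_id, Country1, Date1, Name1) %>%
  summarise(Result = sum(hit),
            Min.Date.Result = min_or_na(Date2[hit]),
            .groups = "drop") %>%
  arrange(row_id) %>%
  select(-row_id)

Result1$Result            # 1 2 0 1 1
Result1$Min.Date.Result   # "1999-03-13" "1996-09-13" NA "2005-03-22" "2003-06-19"
```

Because the work is done by a join rather than by evaluating the four conditions once per row of Data1, this form should scale better to data frames of around 100,000 rows. The actual cost depends on how many Data2 rows share each name and country pair.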
Answered on 2020-03-07T02:59:28.607