Reproducible dataset

   data1 <- data.frame(Id = c(1,2), Description = c("Chiquita","Chiquita mazamorra"), Max = c(200,125))
   data2 <- data.frame(Id = c(1,2,3,4,5,6,7), Description = c("Chiquita mini", "Chiquita Oriville","Chiquita 24h","Manzano Chiquita 5j...","Chiquita mazamorra 1,2h..","Chiquita mazamorra Buro","Chiquita AM 2F"), Actual = c(24,110,80,90,134,123,210))

I have a dataset, data1, shown below:

  Id     Description            Max
  1      Chiquita               200
  2      Chiquita mazamorra     125

I also have a second dataset, data2, shown below:

  Id     Description                   Actual
  1      Chiquita mini                 24
  2      Chiquita Oriville             110
  3      Chiquita 24h                  80
  4      Manzano Chiquita 5j...        90
  5      Chiquita mazamorra 1,2h...    134
  6      Chiquita mazamorra Buro       123
  7      Chiquita AM 2F                210
  8      Chiquita.....                 124
  9      Chiquita(P)                   213
  10     Chiquita, mazamorra, S        188                   

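The if statement should check whether data2$Description contains the string Chiquita mazamorra and, if it does, whether data2$Actual > data1$Max. If so, Result is Good; otherwise it is Small. Note that other characters may follow Chiquita mazamorra, for example Chiquita mazamorra 1,2h.., and that is fine, but Chiquita mazamorra Buro should not match.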

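Similarly, a second ifelse should check whether data2$Description contains Chiquita and, if it does, whether data2$Actual > data1$Max. If so, Result is Good; otherwise it is Small. Other characters may follow Chiquita, for example Chiquita 24h or Chiquita AM 2F, and those are fine, but Chiquita mini and Chiquita Oriville should not match.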

This is the desired final output (data2):

  Id     Description                   Actual      Result
  1      Chiquita mini                 24          NA
  2      Chiquita Oriville             110         NA
  3      Chiquita 24h                  80          Small
  4      Manzano Chiquita 5j...        90          NA
  5      Chiquita mazamorra 1,2h...    134         Good
  6      Chiquita mazamorra Buro       123         NA
  7      Chiquita AM 2F                210         Good
  8      Chiquita.....                 124         Small
  9      Chiquita(P)                   213         NA
  10     Chiquita, mazamorra, S        188         Good

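I know this can be done with a combination of grepl and ifelse statements, but I am not confident about how to write it. There may be a better way to do this; I am not sure, and I am confused. Any help is appreciated.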

1 Answer

Here is an outline of a solution:

data1 <- read.csv(text=
"Id,Description,Max
1,Chiquita,200
2,Chiquita mazamorra,125")

data2 <- read.csv(text=
"Id,Description,Actual
1,Chiquita mini,24
2,Chiquita Oriville,110
3,Chiquita 24h,80
4,Manzano Chiquita 5j,90
5,Chiquita mazamorra 12h,134
6,Chiquita mazamorra Buro,123
7,Chiquita AM 2F,210")


# start by trimming the description to the first few words 
# that don't start with a number
data2$Description_trimmed <- gsub('\\s+\\d.*$','',data2$Description)

# initialize the output field
data2$Results <- NA

# loop while there are missing values in data2$Results
while(any(is.na(data2$Results))){

    # identify records that still need to be calculated
    indx <- is.na(data2$Results)

    # calculate the result based on the current trimmed description
    # 'Good' when Actual exceeds the matching Max, as the question specifies
    data2[indx,'Results']  <-  ifelse(
                data2[indx,'Actual']  > 
                    data1[match(data2[indx,'Description_trimmed'],
                                data1[    ,'Description']),
                          "Max"],
                'Good',
                'Small')

    # trim the last word from Description_trimmed
    data2$Description_trimmed <- gsub('(^| +)[^ ]*$','',data2$Description_trimmed)

    # stop if the remaining trimmed descriptions are empty
    if(all(grepl('^\\s*$',data2$Description_trimmed)))
        break
}

data2
#>   Id             Description Actual Description_trimmed Results
#> 1  1           Chiquita mini     24                       Small
#> 2  2       Chiquita Oriville    110                       Small
#> 3  3            Chiquita 24h     80                       Small
#> 4  4     Manzano Chiquita 5j     90                        <NA>
#> 5  5  Chiquita mazamorra 12h    134                        Good
#> 6  6 Chiquita mazamorra Buro    123                       Small
#> 7  7          Chiquita AM 2F    210                        Good
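
To make the two regular expressions in the loop easier to follow, here is a small sketch of what each one does to a few of the descriptions. The names `x`, `step1`, `step2` and `keys` are hypothetical and exist only for this illustration; they are not part of the solution above.

```r
# Illustrative only: `x` holds a few of the descriptions from data2.
x <- c("Chiquita 24h", "Chiquita mazamorra 12h", "Chiquita AM 2F", "Chiquita mini")

# Step 1: drop everything from the first word that starts with a digit.
step1 <- gsub("\\s+\\d.*$", "", x)
# step1 is c("Chiquita", "Chiquita mazamorra", "Chiquita AM", "Chiquita mini")

# Step 2: drop the last remaining word (what each pass of the loop does).
step2 <- gsub("(^| +)[^ ]*$", "", step1)
# step2 is c("", "Chiquita", "Chiquita", "Chiquita")

# match() gives the position of each trimmed description among the keys,
# or NA when there is no exact match.
keys <- c("Chiquita", "Chiquita mazamorra")
pos <- match(step1, keys)
# pos is c(1, 2, NA, NA)
```

Because the last word is dropped on every pass, a description such as Chiquita mini eventually falls back to the key Chiquita, so it receives a result rather than the NA shown in the desired output.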

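(BTW, this solution evaluates is.na(data2$Results) twice per loop iteration when it really only needs to be evaluated once. I wrote it that way for readability rather than efficiency.)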
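
For what it is worth, a variant that evaluates the NA index only once per pass could look like the sketch below. It is a reduced example under a few assumptions: it uses only three of the rows, labels Actual > Max as Good as the question describes, and introduces the hypothetical names `key`, `todo` and `max_allowed`. It also stops as soon as the unresolved rows have nothing left to trim, which is a slightly different stopping rule from the loop above.

```r
# Reduced copy of the example data (three rows only, for illustration).
data1 <- data.frame(Description = c("Chiquita", "Chiquita mazamorra"),
                    Max = c(200, 125))
data2 <- data.frame(
  Description = c("Chiquita 24h", "Chiquita mazamorra 12h", "Manzano Chiquita 5j"),
  Actual = c(80, 134, 90)
)

# Trimmed description: drop everything from the first word starting with a digit.
key <- gsub("\\s+\\d.*$", "", data2$Description)
data2$Results <- NA_character_

repeat {
  todo <- is.na(data2$Results)                      # evaluated once per pass
  if (!any(todo) || all(key[todo] == "")) break     # nothing left to resolve or trim

  # Look up the Max for rows whose trimmed description matches exactly.
  max_allowed <- data1$Max[match(key[todo], data1$Description)]
  data2$Results[todo] <- ifelse(data2$Actual[todo] > max_allowed, "Good", "Small")

  # Drop the last word from every trimmed description.
  key <- gsub("(^| +)[^ ]*$", "", key)
}

data2$Results
```

Here the first row ends up as Small (80 is not above 200), the second as Good (134 is above 125), and the third stays NA because no prefix of Manzano Chiquita matches a description in data1.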

Answered 2015-02-23T00:59:28.700