0

我想使用表达式列表来编码一个新字段。

在我的数据框中,Bisaccategory1 包含对图书类别的完整描述。表示该字段中部分值的特定字符串可用于定义称为“流派”的新字段。一种特殊的类型是“非小说”,它映射到 25 个独特的完整描述。我可以通过指定其中包含的某些模式来识别这些完整描述:

  nonfiction<-c("BIOGRAPHY & AUTOBIOGRAPHY","BODY, MIND & SPIRIT","BUSINESS & ECONOMICS","COMICS & GRAPHIC NOVELS",
                  "COMPUTERS","COOKING","FAMILY & RELATIONSHIPS","HEALTH & FITNESS","HISTORY","HOUSE & HOME","HUMOR",
                  "LITERARY CRITICISM","NATURE","PERFORMING 
ARTS","PETS","PHOTOGRAPHY","POETRY","POLITICAL SCIENCE","RELIGION",
                      "SCIENCE","SELF-HELP","SOCIAL SCIENCE","SPORTS & RECREATION","TRANSPORTATION","TRUE CRIME")

然后我可以匹配这些字符串以完成 Biscategory1 值,如下所示:

matches <- unique (grep(paste(nonfiction,collapse="|"), 
                                detail$Bisaccategory1, value=TRUE))

但我不清楚如何使用这些“匹配”将值“非小说”分配给我的新类型字段。

这是样本数据:

structure(list(Author = c("James Swallow", "Billy Crystal", "Mark Divine", 
"Charles Cumming", "Victoria Schwab", "Louise Penny", "Elizabeth Warren", 
"Linda Castillo", "Paul Fischer", "Sandy Hall", "Louise Penny", 
"Louise Penny", "Lisa Scottoline", "Linda Castillo", "Evan Osnos", 
"Porter Erisman"), Title = c("24: Deadline", "700 Sundays - Still Foolin' 'Em", 
"8 Weeks to Sealfit", "A Colder War", "A Dark Shade of Magic", 
"A Fatal Grace", "A Fighting Chance", "A Hidden Secret", "A Kim Jong-Il Production", 
"A Little Something Different", "A Rule Against Murder", "A Trick of the Light", 
"Accused", "After the Storm", "Age of Ambition", "Alibaba's World"
), Bisac = c("FICTION / Thrillers / General", "BIOGRAPHY & AUTOBIOGRAPHY / Entertainment & Performing Arts", 
"HEALTH & FITNESS / Exercise", "FICTION / Thrillers / Espionage", 
"FICTION / Fantasy / Historical", "FICTION / Mystery & Detective / Traditional", 
"BIOGRAPHY & AUTOBIOGRAPHY / Political", "FICTION / Mystery & Detective / Police Procedural", 
"HISTORY / Asia / Korea", "JUVENILE FICTION / Love & Romance", 
"FICTION / Mystery & Detective / Traditional", "FICTION / Mystery & Detective / Traditional", 
"FICTION / Thrillers / Legal", "FICTION / Mystery & Detective / Police Procedural", 
"HISTORY / Asia / China", "BUSINESS & ECONOMICS / E-Commerce / General"
)), .Names = c("Author", "Title", "Bisac"), class = "data.frame", row.names = c(NA, 
-16L))

我知道我可以做类似的事情:

df$Genre[Bisaccategory1=="BODY, MIND & SPIRIT / Inspiration & Personal Growth"]<-"nonfiction"

但我有数百个类别,这并不是真正可扩展的。我会很感激任何建议。

4

2 回答 2

2

而不是grep该函数grepl将返回进行匹配的逻辑索引。您可以使用它来对 Genre 列进行子集化。我将不是“非小说”的条目分配给小说,但你可以随意制作它们。

matches <- grepl(paste(nonfiction,collapse="|"), detail$Bisac)
detail$Genre <- "fiction"
detail$Genre[matches] <- "non-fiction"
# Bisac       Genre
# 1                                FICTION / Thrillers / General     fiction
# 2  BIOGRAPHY & AUTOBIOGRAPHY / Entertainment & Performing Arts non-fiction
# 3                                  HEALTH & FITNESS / Exercise non-fiction
# 4                              FICTION / Thrillers / Espionage     fiction
# 5                               FICTION / Fantasy / Historical     fiction
# 6                  FICTION / Mystery & Detective / Traditional     fiction
# 7                        BIOGRAPHY & AUTOBIOGRAPHY / Political non-fiction
于 2015-09-17T03:57:46.143 回答
0
library(dplyr)
library(tidyr)
library(stringi)

non_fiction_books = 
  detail %>%
  mutate(Bisac = Bisac %>% stri_split_fixed(" / ") ) %>%
  unnest(Bisac) %>%
  mutate(Bisac = Bisac %>% stri_trans_toupper) %>%
  right_join(data_frame(Bisac = non_fiction) ) %>%
  select(-Bisac) %>%
  distinct
于 2015-09-17T04:10:54.980 回答