r - Opposite of unnest_tokens

Question

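This is most likely a stupid question, but I've googled and googled and can't find a solution. I think it's because I don't know the right way to word what I'm searching for.

I have a data frame that I have converted to tidy text format in R to get rid of stop words. I would now like to "untidy" that data frame back to its original format.

What is the opposite / inverse command of unnest_tokens?

Edit: here is what the data I'm working with looks like. I'm trying to replicate analyses from Silge and Robinson's Tidy Text book, but using Italian opera librettos.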

character = c("FIGARO", "SUSANNA", "CONTE", "CHERUBINO") 
line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre", "Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.", "Susanna, mi sembri agitata e confusa.", "Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!") 
sample_df = data.frame(character, line)
sample_df

character line
FIGARO    Cinque... dieci.... venti... trenta... trentasei...quarantatre
SUSANNA   Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.
CONTE     Susanna, mi sembri agitata e confusa.
CHERUBINO Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!

I turn it into tidy text so I can get rid of stop words:

tribble <- sample_df %>%
           unnest_tokens(word, line)
# Get rid of stop words
# I had to make my own list of stop words for 18th century Italian opera
# mystopwords is my own character vector of stop words
itstopwords <- tibble(word = mystopwords)
tribble2 <- tribble %>%
            anti_join(itstopwords, by = "word")

Now I have something like this:

text    word
FIGARO  cinque
FIGARO  dieci
FIGARO  venti
FIGARO  trenta
...

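I'd like to get it back into the format of character name and associated line so I can look at other things. Basically, I'd like the text in the same format as before, but with the stop words removed.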

score 14 · Accepted Answer

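Not a stupid question! The answer depends a bit on exactly what you are trying to do, but here is my typical approach when I want to get text back to its original form after some processing in its tidied form, using the group_by() function from dplyr.

First, let's go from raw text to a tidied format.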

library(tidyverse)
library(tidytext)

tidy_austen <- janeaustenr::austen_books() %>%
    group_by(book) %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text)

tidy_austen
#> # A tibble: 725,055 x 3
#>    book                linenumber word       
#>    <fct>                    <int> <chr>      
#>  1 Sense & Sensibility          1 sense      
#>  2 Sense & Sensibility          1 and        
#>  3 Sense & Sensibility          1 sensibility
#>  4 Sense & Sensibility          3 by         
#>  5 Sense & Sensibility          3 jane       
#>  6 Sense & Sensibility          3 austen     
#>  7 Sense & Sensibility          5 1811       
#>  8 Sense & Sensibility         10 chapter    
#>  9 Sense & Sensibility         10 1          
#> 10 Sense & Sensibility         13 the        
#> # … with 725,045 more rows
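Note that the tokens above are lowercase and carry no punctuation. That is the default behaviour of unnest_tokens(), and it limits how closely the original text can be rebuilt later. As a small sketch, assuming the to_lower argument of unnest_tokens() and the strip_punct argument that it passes through to the tokenizers package, both can be switched off (the one-row tibble here is only an illustration):

```r
library(dplyr)
library(tidytext)

# One line from the question's sample data, used here only for illustration
df <- tibble(line = "Susanna, mi sembri agitata e confusa.")

# Default behaviour: lowercase tokens, punctuation removed
default_tokens <- df %>%
  unnest_tokens(word, line)

# Keep the original case and keep punctuation marks as separate tokens
kept_tokens <- df %>%
  unnest_tokens(word, line, to_lower = FALSE, strip_punct = FALSE)

default_tokens$word
kept_tokens$word
```

Even with these options, punctuation comes back as separate tokens, so pasting the words together with spaces still only approximates the original spacing.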

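The text is tidy now! But we can untidy it, back to something sort of like its original form. I typically approach this using group_by() and summarize() from dplyr, and str_c() from stringr. Notice what the text looks like at the end, in this particular case.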

tidy_austen %>% 
    group_by(book, linenumber) %>% 
    summarize(text = str_c(word, collapse = " ")) %>%
    ungroup()
#> # A tibble: 62,272 x 3
#>    book            linenumber text                                         
#>    <fct>                <int> <chr>                                        
#>  1 Sense & Sensib…          1 sense and sensibility                        
#>  2 Sense & Sensib…          3 by jane austen                               
#>  3 Sense & Sensib…          5 1811                                         
#>  4 Sense & Sensib…         10 chapter 1                                    
#>  5 Sense & Sensib…         13 the family of dashwood had long been settled…
#>  6 Sense & Sensib…         14 was large and their residence was at norland…
#>  7 Sense & Sensib…         15 their property where for many generations th…
#>  8 Sense & Sensib…         16 respectable a manner as to engage the genera…
#>  9 Sense & Sensib…         17 surrounding acquaintance the late owner of t…
#> 10 Sense & Sensib…         18 man who lived to a very advanced age and who…
#> # … with 62,262 more rows

Created on 2019-07-11 by the reprex package (v0.3.0)
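Applied to the data in the question, the same pattern might look like the sketch below. The line_id column and the two stop words are assumptions for illustration, since the question's own stop word list is not shown. Here line_id plays the role of linenumber above, so that each row of the original data frame is reassembled separately and in its original order.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Two rows from the question's sample data
sample_df <- tibble(
  character = c("FIGARO", "CONTE"),
  line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre",
           "Susanna, mi sembri agitata e confusa.")
)

# Hypothetical stop words, standing in for the asker's own list
itstopwords <- tibble(word = c("mi", "e"))

untidied <- sample_df %>%
  mutate(line_id = row_number()) %>%            # remember which row each word came from
  unnest_tokens(word, line) %>%                 # one row per word
  anti_join(itstopwords, by = "word") %>%       # drop the stop words
  group_by(line_id, character) %>%
  summarize(line = str_c(word, collapse = " ")) %>%  # paste each row's words back together
  ungroup()

untidied
```

A line made up entirely of stop words would drop out of the result, because no rows remain for it after anti_join().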

score 7 · Accepted Answer

library(tidyverse)
tidy_austen %>% 
     group_by(book,linenumber) %>% 
     summarise(text = str_c(word, collapse = " "))
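One thing to be aware of with this shorter version: without a final ungroup(), the result is still grouped by book, because summarise() only peels off the last grouping variable. A small sketch with made-up data (the tidy_words tibble is hypothetical) shows the difference:

```r
library(dplyr)
library(stringr)

# Made-up tidy data: one row per word
tidy_words <- tibble(
  book = c("A", "A", "A", "B"),
  linenumber = c(1L, 1L, 2L, 1L),
  word = c("hello", "world", "again", "bye")
)

grouped <- tidy_words %>%
  group_by(book, linenumber) %>%
  summarise(text = str_c(word, collapse = " "))

group_vars(grouped)   # the result is still grouped by book

flat <- ungroup(grouped)
group_vars(flat)      # no grouping variables remain
```

Whether the leftover grouping matters depends on what is done with the result next; later calls such as mutate() or slice() would operate per book on the grouped version.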
