r - Row-by-row text mining in a data frame

Question

I have this data frame:

> str(final)
'data.frame':   112 obs. of  3 variables:
 $ FAO_CountryName: chr  Algeria  Egypt  Libya  Morocco ...
 $ FAO_CountryURL : chr  "http://www.fao.org/giews/countrybrief/country.jsp?code=DZA" "http://www.fao.org/giews/countrybrief/country.jsp?code=EGY" "http://www.fao.org/giews/countrybrief/country.jsp?code=LBY" "http://www.fao.org/giews/countrybrief/country.jsp?code=MAR" ...
 $ Text           : chr  "\r\n   Reference Date: 24-November-2016\r\n   \r\n   \r\n               FOOD SECURITY SNAPSHOT\r\n               \r\n          "| __truncated__ "\r\n   Reference Date: 28-November-2016\r\n   \r\n   \r\n               FOOD SECURITY SNAPSHOT\r\n               \r\n          "| __truncated__ "\r\n   Reference Date: 15-November-2016\r\n   \r\n   \r\n               FOOD SECURITY SNAPSHOT\r\n               \r\n          "| __truncated__ "\r\n   Reference Date: 21-September-2016\r\n   \r\n   \r\n               FOOD SECURITY SNAPSHOT\r\n               \r\n         "| __truncated__ ...

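I would like to process the Text variable so that I can, for example, count row by row how many times each word occurs in it. In other words, I would like to end up with a data frame like this: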

> head(final, n=2)
  FAO_CountryName   FAO_CountryURL        Text                 WordCount
  Algeria           http://www.fao.org…   Algeria is nice…     Algeria  1
                                                               is       1
                                                               ...
  Egypt             http://www.fao.org…   Egypt is nice too…   Egypt    1
                                                               is       5
                                                               ...

So far, I have done this:

## Counting the words included in the textual dataset.
library(dplyr)
library(tidytext)

keywords <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  ungroup()

## Scoring the textual frequencies into the textual dataset (i.e. how many times the words are present)
total_words <- keywords %>%
  group_by(word) %>%
  summarize(total = sum(n))

However, this only gives me word counts over the whole column, not ROW BY ROW. Any suggestions?
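
One possible direction, offered as a minimal sketch rather than a definitive answer: keep the row identifier in the `count()` call so the tally is made per country instead of over the whole column. The toy `final` data frame below is a hypothetical stand-in for the real one (only the column names `FAO_CountryName` and `Text` are taken from the question), and `word_counts` is an illustrative name.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for the real `final` data frame
final <- data.frame(
  FAO_CountryName = c("Algeria", "Egypt"),
  Text = c("Algeria is nice", "Egypt is nice too, is it not"),
  stringsAsFactors = FALSE
)

# Split Text into lowercase word tokens (one row per word occurrence),
# then count each word within each country rather than across all rows
word_counts <- final %>%
  unnest_tokens(word, Text) %>%
  count(FAO_CountryName, word, sort = TRUE)

print(word_counts)
```

The result is in long format, with one row per country and word and the count in column `n`. If the nested layout shown in the question is needed, this table could presumably be joined back to `final` by `FAO_CountryName`, assuming country names are unique per row.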
