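I have a dataset for which I am trying to get the sentiment per article. I have about 1000 articles. Each article is a single string containing multiple sentences. Ideally, I would like to add another column that summarises the sentiment of each article. Is there an efficient way to do this using dplyr?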

Below is an example dataset with just 2 articles.

date<- as.Date(c('2020-06-24', '2020-06-24'))
text <- c('3 more cops recover as PNP COVID-19 infections soar to 519', 'QC suspends processing of PWD IDs after reports of abuse in issuance of cards')
link<- c('https://newsinfo.inquirer.net/1296981/3-more-cops-recover-as-pnps-covid-19-infections-soar-to-519,3,10,4,11,9,8', 'https://newsinfo.inquirer.net/1296974/qc-suspends-processing-of-pwd-ids-after-reports-of-abuse-in-issuance-of-cards')
V4 <-c('MANILA, Philippines — Three more police officers have recovered from the new coronavirus disease, increasing the total number of recoveries in the Philippine National Police to (PNP) 316., This developed as the total number of COVID-19 cases in the PNP rose to 519 with one new infection and nine deaths recorded., In a Facebook post on Wednesday, the PNP also recorded 676 probable and 876 suspects for the disease., PNP chief Gen. Archie Gamboa previously said the force would will intensify its health protocols among its personnel after recording a recent increase in deaths., The latest fatality of the ailment is a police officer in Cebu City, which is under enhanced community quarantine as COVID-19 cases continued to surge there., ATM, \r\n\r\nFor more news about the novel coronavirus click here.\r\nWhat you need to know about Coronavirus.\r\n\r\n\r\n\r\nFor more information on COVID-19, call the DOH Hotline: (02) 86517800 local 1149/1150.\r\n\r\n \r\n \r\n \r\n\r\n  \r\n , The Inquirer Foundation supports our healthcare frontliners and is still accepting cash donations to be deposited at Banco de Oro (BDO) current account #007960018860 or donate through PayMaya using this  link  .',
   'MANILA, Philippines — Quezon City will halt the processing of identification cards to persons with disability for two days starting Thursday, June 25, so it could tweak its guidelines after reports that unqualified persons had issued with the said IDs., In a statement on Wednesday, Quezon City Mayor Joy Belmonte said the suspension would the individual who issued PWD ID cards to six members of a family who were not qualified but who paid P2,000 each to get the IDs., Belmonte said the suspect, who is a local government employee, was already issued with a show-cause order to respond to the allegation., According to city government lawyer Nino Casimir, the suspect could face a grave misconduct case that could result in dismissal., The IDs are issued to only to persons qualified under the Act Expanding the Benefits and Privileges of Persons with Disability (Republic Act No. 10754)., The IDs entitle PWDs to a 20 percent discount and VAT exemption on goods and services., /atm')

df<-data.frame(date, text, link, V4)

head(df)


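I have been looking into how to do this with the sentimentr package and came up with the code below. However, it only outputs the sentiment of each sentence (I split the text with strsplit on "., "), whereas I want to aggregate everything at the whole-article level after applying this strsplit.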

library(dplyr)      # %>%, group_by(), mutate()
library(tidyr)      # unnest()
library(sentimentr)

full <- df %>%
  group_by(V4) %>%
  mutate(V2 = strsplit(as.character(V4), "[.],")) %>%
  unnest(V2) %>%
  get_sentences() %>%
  sentiment()

The desired output is simply an extra column in my data frame df containing the overall (summed) sentiment of each article.

Additional information based on the answer below:

df %>%
  group_by(V4) %>% # group by not really needed
  mutate(V4 = gsub("[.],", ".", V4), 
         sentiment_score = sentiment_by(V4)) 

# A tibble: 2 x 5
# Groups:   V4 [2]
  date       text                      link                                V4                                                  sentiment_score$e~ $word_count   $sd $ave_sentiment
  <date>     <chr>                     <chr>                               <chr>                                                            <int>       <int> <dbl>          <dbl>
1 2020-06-24 3 more cops recover as P~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Three more police officers ~                  1         172 0.204       -0.00849
2 2020-06-24 QC suspends processing o~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Quezon City will halt the p~                  1         161 0.329       -0.174  
Warning message:
Can't combine <sentiment_by> and <sentiment_by>; falling back to <data.frame>.
x Some attributes are incompatible.
i The author of the class should implement vctrs methods.
i See <https://vctrs.r-lib.org/reference/faq-error-incompatible-attributes.html>. 

1 Answer

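If you need the sentiment of the whole text, there is no need to split the text into sentences first; the sentiment functions take care of this. I replaced the "., " in your text with plain periods, because the sentiment functions need them. The sentiment functions recognise that "Mr." is not the end of a sentence. If you use get_sentences() first, you get the sentiment per sentence and not for the whole text.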
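
To make the per-sentence versus per-text difference concrete, here is a minimal sketch on two made-up articles (the `articles` vector is hypothetical, not the data from the question). When the output of `get_sentences()` is passed to `sentiment()`, the result has one row per sentence; passed to `sentiment_by()`, it has one row per element of the input:

```r
library(sentimentr)

# Hypothetical toy articles, two sentences each
articles <- c(
  "The officers recovered quickly. Everyone was relieved.",
  "The suspect abused the system. He could face dismissal."
)

sentences <- get_sentences(articles)

# One row per sentence: element_id, sentence_id, word_count, sentiment
per_sentence <- sentiment(sentences)

# One row per article: element_id, word_count, sd, ave_sentiment
per_article <- sentiment_by(sentences)
```

With four sentences spread over two articles, `per_sentence` has four rows and `per_article` has two.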

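The function sentiment_by handles the sentiment of the whole text and averages it nicely. Check the help page for the averaging.function argument if you need to change how the averaging is done. The by argument of the function can handle any grouping you want to apply.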
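
As a rough illustration of those two arguments (the `toy` data frame and its `source` column are invented for this sketch), `averaging.function` can be swapped for one of the alternatives that ship with sentimentr, such as `average_mean`, and a grouping variable with one entry per text can be supplied through `by`:

```r
library(sentimentr)

# Hypothetical data: three short texts from two sources
toy <- data.frame(
  source = c("A", "A", "B"),
  body = c("The recovery was quick and welcome.",
           "Deaths increased again this week.",
           "The new rules are fair. Nobody complained."),
  stringsAsFactors = FALSE
)

sentences <- get_sentences(toy$body)

# Default averaging (average_downweighted_zero), one row per text
default_avg <- sentiment_by(sentences)

# Plain arithmetic mean of the sentence scores instead
simple_mean <- sentiment_by(sentences, averaging.function = average_mean)

# Aggregate per source rather than per text
by_source <- sentiment_by(sentences, by = toy$source)
```

The default down-weights zero-scored sentences, so the two averages can differ for texts that contain neutral sentences; which one is appropriate depends on how neutral sentences should count.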

library(dplyr)
library(sentimentr)

df %>%
  group_by(V4) %>% # group_by is not really needed here
  mutate(V4 = gsub("[.],", ".", V4), # turn the ".," separators back into periods
         sentiment_score = sentiment_by(V4))

# A tibble: 2 x 5
# Groups:   V4 [2]
  date       text               link                      V4                            sentiment_score$~ $word_count   $sd $ave_sentiment
  <date>     <chr>              <chr>                     <chr>                                     <int>       <int> <dbl>          <dbl>
1 2020-06-24 3 more cops recov~ https://newsinfo.inquire~ "MANILA, Philippines — Three~                 1         172 0.204       -0.00849
2 2020-06-24 QC suspends proce~ https://newsinfo.inquire~ "MANILA, Philippines — Quezo~                 1         161 0.329       -0.174  
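
The tibble above stores the whole `sentiment_by` result as a nested data-frame column. If a single numeric column is preferred, one option is to keep only `ave_sentiment`. This is a minimal sketch, assuming a small stand-in data frame (`toy` and its contents are invented here) and no grouping, so that the scores come back in row order:

```r
library(dplyr)
library(sentimentr)

# Hypothetical stand-in for df, using the same "., " sentence separator
toy <- data.frame(
  id = 1:2,
  V4 = c("Three officers recovered., The mood was hopeful.",
         "The suspect abused the process., He may be dismissed."),
  stringsAsFactors = FALSE
)

scored <- toy %>%
  mutate(V4 = gsub("[.],", ".", V4),  # restore plain periods
         # keep only the per-article average as an ordinary numeric column
         ave_sentiment = sentiment_by(get_sentences(V4))$ave_sentiment)
```

Because only a numeric vector is assigned, the result is a flat data frame with one score per article.
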
answered 2020-06-25T13:54:28.620