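I have a dataset of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have built my own lexicon (formal_lexicon) of about 1 million words, all in a formal register. I now want to write a short piece of code that counts, for each tweet, how many words (if any) appear in that lexicon.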

Tweet data:

                   Content            
1                 "Blablabla"               
2                 "Hi my name is"               
3                 "Yes I need"                 
.  
.
. 
50176            "TEXT50176" 

Formal lexicon:

                       X            
1                 "admittedly"               
2                 "Consequently"               
3                 "Furthermore"                 
.  
.
. 
1000000            "meanwhile"   

So the output should look like this:

                  Content             Lexicon
1                 "TEXT1"                1
2                 "TEXT2"                3
3                 "TEXT3"                0 
.  
.
. 
50176            "TEXT50176"             2

It should be a simple for loop, something like:

for(sentence in tweets_data$Content){ 
  for(word in sentence){
    if(word %in% formal_lexicon){
         ...
}
}
}

I don't think `word` works here, and I'm not sure how to record the count in a specific column when a word is in the lexicon. Can anyone help?

structure(list(X = c("admittedly", "consequently", "conversely",  "considerably", "essentially", "furthermore")), row.names = c(NA,  6L), class = "data.frame")

c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ",  "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art",  "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381",  "Damn, it's hard to wrap presents when you're drunk. cc @santa",  "When my whole fam tryna have a peaceful holiday " )

2 Answers

You could try something like this:

library(tidytext)
library(dplyr)

# some fake phrases and lexicon
formal_lexicon <- structure(list(X = c("admittedly", "consequently", "conversely",  "considerably", "essentially", "furthermore")), row.names = c(NA,  6L), class = "data.frame")
tweets_data <- c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ",  "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art",  "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381",  "Damn, it's hard to wrap presents when you're drunk. cc @santa",  "When my whole fam tryna have a peaceful holiday " )

# put your tweets in a data.frame, with an id per tweet
tweets_data_df <- data.frame(Content = tweets_data, id = seq_along(tweets_data))

tweets_data_df %>%
  # split each tweet into one word per row (lower-cased by default)
  unnest_tokens(txt, Content) %>%
  # flag each word: 1 if it is in the lexicon, 0 otherwise
  mutate(pres = ifelse(txt %in% formal_lexicon$X, 1, 0)) %>%
  # group by tweet
  group_by(id) %>%
  # count the lexicon words per tweet
  summarise(cnt = sum(pres)) %>%
  # put back the original texts
  left_join(tweets_data_df) %>%
  # reorder the columns
  select(id, Content, cnt)

Result:

Joining, by = "id"
# A tibble: 6 x 3
     id Content                                                              cnt
  <int> <chr>                                                              <dbl>
1     1 "@barackobama Thank you for your incredible grace in leadership a~     0
2     2 "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles Co~     0
3     3 "2017 resolution: to embody authenticity!"                             0
4     4 "Happy Holidays! Sending love and light to every corner of the ea~     0
5     5 "Damn, it's hard to wrap presents when you're drunk. cc @santa"        0
6     6 "When my whole fam tryna have a peaceful holiday "                     0
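
The counts are all 0 here only because none of the sample tweets contains a lexicon word. One thing to watch with the real data: unnest_tokens() lower-cases tokens by default (to_lower = TRUE), so capitalised lexicon entries such as "Consequently" in the question's sample would never match. A minimal base R sketch of the fix, assuming the lexicon sits in formal_lexicon$X; the tokens vector is a hypothetical stand-in for what the tokeniser would return:

```r
# Hypothetical lexicon with mixed capitalisation, as in the question's sample
formal_lexicon <- data.frame(
  X = c("admittedly", "Consequently", "Furthermore"),
  stringsAsFactors = FALSE
)

# Hypothetical tokens, already lower-cased as unnest_tokens() does by default
tokens <- c("consequently", "i", "need", "furthermore")

# Without normalising, the capitalised lexicon entries are missed
matches_raw <- sum(tokens %in% formal_lexicon$X)

# Lower-case the lexicon once, then match against it
lexicon_lower <- tolower(formal_lexicon$X)
matches_lower <- sum(tokens %in% lexicon_lower)
```

Here matches_raw is 0 and matches_lower is 2. Lower-casing the lexicon once up front is cheaper than doing it inside the pipeline for every tweet.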
Answered 2021-07-28T14:27:32.403

Hope this works for you:

library(magrittr)
library(dplyr)
library(tidytext)

# Data frame with tweets, including an ID
tweets <- data.frame(
  id = 1:3,
  text = c(
    'Hello, this is the first tweet example to your answer',
    'I hope that my response help you to do your task',
    'If it is tha case, please upvote and mark as the correct answer'
  )
)

lexicon <- data.frame(
  word = c('hello', 'first', 'response', 'task', 'correct', 'upvote')
)


# Counting the words in each tweet that are present in your lexicon
in_lexicon <- tweets %>%
  # Split every tweet into one word per row
  tidytext::unnest_tokens(output = 'words', input = text) %>%
  # Flag whether each word is in your lexicon
  dplyr::mutate(
    in_lexicon = words %in% lexicon$word
  ) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(words_in_lexicon = sum(in_lexicon))

# Join the counts back onto the original data
dplyr::left_join(tweets, in_lexicon, by = 'id')
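
For comparison, the loop in the question fails because each `sentence` is a single string, so `for(word in sentence)` runs once with the whole tweet as `word`. A rough base R sketch of the same count without tidytext, assuming a crude tokeniser that splits on anything other than a-z and apostrophes (it will differ from unnest_tokens() on accented letters, hashtags and the like); the sample data below is made up for illustration:

```r
# Made-up sample data in the shape described in the question
tweets_data <- data.frame(
  Content = c("Admittedly, I need this. Furthermore, I want it.",
              "Hi my name is",
              "Consequently we left"),
  stringsAsFactors = FALSE
)
formal_lexicon <- data.frame(
  X = c("admittedly", "consequently", "furthermore"),
  stringsAsFactors = FALSE
)

# Lower-case each tweet and split it into words (one character vector per tweet)
tokens <- strsplit(tolower(tweets_data$Content), "[^a-z']+")

# For each tweet, count how many of its words appear in the lexicon
tweets_data$Lexicon <- vapply(
  tokens,
  function(words) sum(words %in% formal_lexicon$X),
  integer(1)
)
```

With this sample, tweets_data$Lexicon comes out as 2, 0, 1. Repeated words are counted each time they occur; wrap `words` in unique() if every lexicon word should count only once per tweet.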

Answered 2021-07-28T14:45:54.713