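I have a dataset of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also built my own lexicon (formal_lexicon) of roughly 1 million words, all in a formal register. I now want to write a short piece of code that counts, for each tweet, how many words (if any) from that lexicon it contains.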
Tweet data:
      Content
1     "Blablabla"
2     "Hi my name is"
3     "Yes I need"
.
.
.
50176 "TEXT50176"
Formal lexicon:
        X
1       "admittedly"
2       "Consequently"
3       "Furthermore"
.
.
.
1000000 "meanwhile"
So the output should look like this:
      Content     Lexicon
1     "TEXT1"     1
2     "TEXT2"     3
3     "TEXT3"     0
.
.
.
50176 "TEXT50176" 2
It should be a simple for loop, something like:
for (sentence in tweets_data$Content) {
  for (word in sentence) {
    if (word %in% formal_lexicon) {
      ...
    }
  }
}
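I don't think the "word" loop works, and I'm not sure how to record the count in a specific column when a word is in the lexicon. Can anyone help?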
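For reference, here is a minimal sketch of how the loop could be completed. One reason the inner loop does nothing useful is that each `sentence` is a single string, so `for (word in sentence)` runs exactly once with the whole tweet as `word`. The sketch assumes that words can be split on non-letter characters and that matching should be case-insensitive. The toy data and the names `lexicon_words` and `words` are made up for illustration.

```r
# Toy stand-ins for the real data (illustrative values only)
tweets_data <- data.frame(
  Content = c("Admittedly I was wrong",
              "Hi my name is",
              "furthermore, and consequently"),
  stringsAsFactors = FALSE
)
formal_lexicon <- data.frame(
  X = c("admittedly", "consequently", "furthermore"),
  stringsAsFactors = FALSE
)

# Compare against the character column, not the data frame itself
lexicon_words <- tolower(formal_lexicon$X)

tweets_data$Lexicon <- 0L
for (i in seq_along(tweets_data$Content)) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  words <- strsplit(tolower(tweets_data$Content[i]), "[^a-z']+")[[1]]
  # Count how many tokens of this tweet appear in the lexicon
  tweets_data$Lexicon[i] <- sum(words %in% lexicon_words)
}

print(tweets_data)
```

Indexing by `i` rather than iterating over the strings directly is what lets the count be written back into the matching row of the new `Lexicon` column.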
structure(list(X = c("admittedly", "consequently", "conversely",
                     "considerably", "essentially", "furthermore")),
          row.names = c(NA, 6L), class = "data.frame")
c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ",
  "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art",
  "2017 resolution: to embody authenticity!",
  "Happy Holidays! Sending love and light to every corner of the earth \U0001f381",
  "Damn, it's hard to wrap presents when you're drunk. cc @santa",
  "When my whole fam tryna have a peaceful holiday ")
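With 50,176 tweets, a vectorized version may be more convenient than an explicit loop. The following is only a sketch under the same assumptions (split on non-letters, case-insensitive matching, repeated words counted each time). `count_lexicon_hits` is a hypothetical helper name, and the third tweet is a made-up example added so that at least one count is non-zero.

```r
# Hypothetical helper: for each text, count the tokens found in the lexicon
count_lexicon_hits <- function(texts, lexicon) {
  lexicon <- unique(tolower(lexicon))
  # One character vector of tokens per text
  tokens <- strsplit(tolower(texts), "[^a-z']+")
  vapply(tokens, function(tok) sum(tok %in% lexicon), integer(1))
}

formal_lexicon <- data.frame(
  X = c("admittedly", "consequently", "conversely",
        "considerably", "essentially", "furthermore")
)
tweets <- c(
  "2017 resolution: to embody authenticity!",
  "Damn, it's hard to wrap presents when you're drunk. cc @santa",
  "Essentially fine; furthermore, essentially done"  # made-up example
)

result <- data.frame(
  Content = tweets,
  Lexicon = count_lexicon_hits(tweets, formal_lexicon$X)
)
print(result)
```

Lowercasing both sides matters here because the lexicon may contain capitalized entries such as "Consequently" while tweets usually do not follow the same casing.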