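I have to do topic modeling in R on text snippets that contain emojis. The replace_emoji() and replace_emoticon() functions let me analyze them, but there is a problem with the results.
A red heart emoji is translated to "red heart ufef". These words are then treated as separate tokens during the analysis, which distorts the results.
A term like "heart" can carry very different meanings, as in "red heart ufef" versus "broken heart". The function replace_emoji_identifier() does not help either, because the identifiers make the analysis harder.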
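To make the token collision concrete, here is a minimal base R sketch. The two strings are made-up stand-ins for what emoji replacement might produce, not actual textclean output.

```r
# Illustrative strings only (assumed, not real textclean output)
replaced <- c("red heart i love it", "broken heart oh no")
tokens <- strsplit(replaced, " ", fixed = TRUE)

# Tokens shared by both documents: "heart" shows up in each one
shared <- intersect(tokens[[1]], tokens[[2]])
print(shared)

# If each emoji name were a single token, the two documents would share nothing
collapsed <- c("redheart i love it", "brokenheart oh no")
tokens_collapsed <- strsplit(collapsed, " ", fixed = TRUE)
shared_collapsed <- intersect(tokens_collapsed[[1]], tokens_collapsed[[2]])
print(shared_collapsed)
```

A topic model sees only the tokens, so in the first case both documents contribute to the same "heart" term.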
Reproducible dummy data set, created with dput() (including the force-to-lowercase step):
Emoji_struct <- c(
list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)
Current code (data_orig is a list of several files):
library(textclean)
library(tm) # provides stripWhitespace(), used in the last step
# The rest should be standard R packages for pre-processing
# Pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
"facewithopenmouth look at that",
"facewithsteamfromnose this makes me angry facewithsteamfromnose",
"smilingfacewithhearteyes redheart \ufe0f, i love it!"),
content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
"smilingfacewithsmilingeyes thanks for helping",
"cryingface oh no, why? cryingface",
"careful, challenging crossmark crossmark crossmark"))
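One direction that might produce output of this shape is to collapse each emoji description into a single token before substituting it. The sketch below uses base R only and a small hand-made lookup table (`emoji_lookup` and `replace_emoji_tokens()` are hypothetical names for illustration, not part of textclean), so treat it as an outline of the idea rather than a tested solution for the full data.

```r
# Hypothetical hand-made lookup: emoji -> description (not the textclean lexicon)
emoji_lookup <- c(
  "\U0001F525"   = "fire",
  "\U0001F62E"   = "face with open mouth",
  "\u2764\ufe0f" = "red heart",
  "\u274C"       = "cross mark"
)

# Collapse multi-word descriptions into one token, e.g. "red heart" -> "redheart"
collapsed <- setNames(gsub("[[:space:]-]", "", emoji_lookup), names(emoji_lookup))

# Replace every emoji with its single-token name, padded with spaces so that
# adjacent emojis do not run together, then tidy up the whitespace
replace_emoji_tokens <- function(x, lookup) {
  for (i in seq_along(lookup)) {
    x <- gsub(names(lookup)[i], paste0(" ", lookup[[i]], " "), x, fixed = TRUE)
  }
  trimws(gsub("[[:space:]]+", " ", x))
}

sample_text <- c(
  "\U0001F525\U0001F525 wow",
  "\u2764\ufe0f, i love it!",
  "careful, challenging \u274C\u274C\u274C"
)
result <- replace_emoji_tokens(sample_text, collapsed)
print(result)
```

With textclean itself, the same idea could presumably be applied by passing a modified copy of lexicon::hash_emojis (with the whitespace removed from its description column) to the emoji_dt argument of replace_emoji(); I have not verified that against the data above.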
Any ideas? Lowercase would be fine too. Best regards. Stay safe. Stay healthy.