I have to do topic modeling in R on text snippets that contain emojis. The replace_emoji() and replace_emoticon() functions let me analyze them, but the results are problematic.

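A red heart emoji is translated to "red heart ufef". These words are then treated as separate terms during the analysis, which distorts the results.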

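A term like "heart" can carry very different meanings, as in "red heart ufef" versus "broken heart". The function replace_emoji_identifier() does not help either, because the identifiers make the analysis difficult.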
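
To make the problem concrete, here is a minimal base-R sketch. The example strings are made up for illustration and are not taken from the real data. It shows how multi-word emoji names collapse onto a shared token once the text is split on spaces, and how single-token names avoid that:

```r
# Hypothetical strings, as they might look after emoji replacement
replaced <- c("red heart i love it", "broken heart oh no")

# Naive whitespace tokenization, roughly what a bag-of-words model sees
tokens <- strsplit(replaced, " ", fixed = TRUE)

# "heart" occurs in both documents although the two emojis mean opposite things
heart_count <- sum(unlist(tokens) == "heart")
heart_count
# 2

# With single-token names the two emojis no longer share any term
joined <- c("redheart i love it", "brokenheart oh no")
joined_tokens <- strsplit(joined, " ", fixed = TRUE)
shared <- intersect(joined_tokens[[1]], joined_tokens[[2]])
length(shared)
# 0
```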

Reproducible dummy dataset, created with dput() (including the step that forces the text to lowercase):

Emoji_struct <- c(
      list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),  
      list(content = "😍😍", "😊 thanks for helping",  "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)

Current code (data_orig is a list of several files):

library(textclean)
library(tm) # provides stripWhitespace()
#The rest should be standard r packages for pre-processing

#pre-processing:
data <- gsub("'", "", data) 
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data)  #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data) 
data <- gsub("[[:digit:]]", "", data)  #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
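
As a rough illustration of what the generic clean-up steps do to emoji names, the sketch below bundles them into one base-R function. The name `clean_text` is hypothetical, and collapsing whitespace with gsub() stands in for stripWhitespace(); it is not the code actually used above.

```r
# Hypothetical helper bundling the generic clean-up steps in base R only
clean_text <- function(x) {
  x <- gsub("[[:punct:]]", " ", x)   # punctuation -> space
  x <- gsub("[[:cntrl:]]", " ", x)   # control characters -> space
  x <- gsub("[[:digit:]]", "", x)    # drop digits
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)                          # strip leading/trailing whitespace
}

cleaned <- clean_text("  smiling face with heart-eyes, i love it 100%!  ")
cleaned
# "smiling face with heart eyes i love it"
```

Because punctuation is replaced after the emojis have been converted, a hyphenated name such as "heart-eyes" is split into further separate words, which adds to the problem described above.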

Desired output:

[1] list(content = c("fire fire wow", 
                     "facewithopenmouth look at that", 
                     "facewithsteamfromnose this makes me angry facewithsteamfromnose", 
                     "smilingfacewithhearteyes redheart \ufe0f, i love it!"), 
         content = c("smilingfacewithhearteyes smilingfacewithhearteyes", 
                     "smilingfacewithsmilingeyes thanks for helping", 
                     "cryingface oh no, why? cryingface", 
                     "careful, challenging crossmark crossmark crossmark"))

Any ideas? Lowercase output would be fine too. Best regards. Stay safe. Stay healthy.

Answer

Replace the default conversion table of replace_emoji with a version in which spaces and punctuation have been removed:

hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)

replace_emoji(Emoji_struct[,1], emoji_dt = hash2)
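
To see what the gsub() call does to the names in the y column without loading any package, here is a small base-R sketch. The vector `emoji_names` is just a hand-typed sample of names that appear in this post, not the full table.

```r
# A few emoji names as they appear in this post (hand-typed sample)
emoji_names <- c("up-down arrow", "hourglass done", "red heart", "broken heart")

# Same pattern as above: drop every whitespace and punctuation character
single_tokens <- gsub("[[:space:]]|[[:punct:]]", "", emoji_names)
single_tokens
# updownarrow, hourglassdone, redheart, brokenheart
```

Each emoji name becomes one token, so "redheart" and "brokenheart" no longer share the term "heart".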

Examples

Single string:

replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"

Character vector:

replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "

List:

library(magrittr) # provides %>%
list("list_element_1: 🔥", "list_element_2: ❌") %>%
  lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "

Rationale

To convert emojis to text, replace_emoji uses lexicon::hash_emojis as its conversion table (hash table):

head(lexicon::hash_emojis)
#              x                        y
#1: <e2><86><95>            up-down arrow
#2: <e2><86><99>          down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a>                    watch
#6: <e2><8c><9b>           hourglass done

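This is an object of class data.table. We can simply modify the y column of this hash table so that all spaces and punctuation are removed. Note that this also allows you to add new ASCII byte representations together with the strings they should map to.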
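
As a hedged sketch of that last point: the byte-style keys in column x look like the output of base R's iconv() with sub = "byte", so a new entry could be prepared roughly as follows. The helper `emoji_key`, the object `custom`, and the commented-out rbind() line are assumptions for illustration; they have not been checked against the package internals.

```r
# Byte-style key for an emoji, in the "<e2><86><95>" format shown in column x
emoji_key <- function(emoji) {
  iconv(emoji, from = "UTF-8", to = "ASCII", sub = "byte")
}

# Hypothetical custom entry: U+274C (cross mark) mapped to a single token
custom <- data.frame(
  x = emoji_key("\u274c"),
  y = "crossmark",
  stringsAsFactors = FALSE
)
custom$x
# "<e2><9d><8c>"

# Assumption: appending it to the modified table might look roughly like this
# hash3 <- rbind(hash2, data.table::as.data.table(custom))
```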

Answered 2021-05-17T20:32:30.563