r - How to write a custom removePunctuation() function to better deal with Unicode characters?

Question

In the source code of the tm text-mining R package, in the file transform.R, there is the removePunctuation() function, currently defined as:

function(x, preserve_intra_word_dashes = FALSE)
{
    if (!preserve_intra_word_dashes)
        gsub("[[:punct:]]+", "", x)
    else {
        # Assume there are no ASCII 1 characters.
        x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
        x <- gsub("[[:punct:]]+", "", x)
        gsub("\1", "-", x, fixed = TRUE)
    }
}

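I need to parse and mine some abstracts from a scientific conference (fetched from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, especially at word boundaries. There are the usual ASCII punctuation characters, but also some Unicode dashes, Unicode quotes, math symbols...

There are also URLs in the text, and in those the intra-word punctuation needs to be preserved. tm's built-in removePunctuation() function is too aggressive.

So I need a custom removePunctuation() function that removes punctuation according to my requirements.

My custom Unicode function currently looks like this, but it does not work as expected. I rarely use R, so getting things done in R takes some time, even for the simplest tasks.

My function: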

corpus <- tm_map(corpus, rmPunc =  function(x){ 
# lookbehinds 
# need to be careful to specify fixed-width conditions 
# so that it can be used in lookbehind

x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{5})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{4})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{3})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{2})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>])([[:alnum:]])'," \\2", x, perl=TRUE) ; 
# lookaheads (can use variable-width conditions) 
x <- gsub('(.*?)(?=[[:alnum:]])([[:punct:]’“”:±]+)$',"\1 ", x, perl=TRUE) ;

# remove all strings that consist *only* of punct chars 
gsub('^[[:punct:]’“”:±</>]+$',"", x, perl=TRUE) ;

}

It does not work as expected. I think it does nothing at all. The punctuation is still in the term-document matrix, see:

 head(Terms(tdm), n=30)

  [1] "<></>"                      "---"                       
  [3] "--,"                        ":</>"                      
  [5] ":()"                        "/)."                       
  [7] "/++"                        "/++,"                      
  [9] "..,"                        "..."                       
 [11] "...,"                       "..)"                       
 [13] "“”,"                        "(|)"                       
 [15] "(/)"                        "(.."                       
 [17] "(..,"                       "()=(|=)."                  
 [19] "(),"                        "()."                       
 [21] "(&)"                        "++,"                       
 [23] "(0°"                        "0.001),"                   
 [25] "0.003"                      "=0.005)"                   
 [27] "0.006"                      "=0.007)"                   
 [29] "000km"                      "0.01)" 
...

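So my questions are:

Why does the call to my function(){} not have the desired effect? How can I improve my function?
Do R's Perl-compatible regular expressions support Unicode pattern classes such as \P{ASCII} or \P{PUNCT}? I think they do not (by default). From the PCRE documentation: "Only the support for various Unicode properties with \p is incomplete, though the most important ones are supported."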

score 2 · Accepted Answer

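As much as I like Susana's answer, it breaks the corpus in newer versions of tm (the documents are no longer PlainTextDocument objects, and the metadata is destroyed).

You will get a list and the following error: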

Error in UseMethod("meta", x) : 
no applicable method for 'meta' applied to an object of class "character"

Using

tm_map(your_corpus, PlainTextDocument)

will give you your corpus back, but $meta is broken (in particular, the document IDs will be missing).

Solution

Use content_transformer:

toSpace <- content_transformer(function(x,pattern)
    gsub(pattern," ", x))
your_corpus <- tm_map(your_corpus,toSpace,"„")

Source: Hands-On Data Science with R, Text Mining, Graham.Williams@togaware.com http://onepager.togaware.com/

Update
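
To cover Unicode punctuation in general rather than one character at a time, the pattern could use PCRE's Unicode properties \p{P} (punctuation) and \p{S} (symbols) together with perl = TRUE. The following is only a sketch in base R: the helper name stripUnicodePunct is hypothetical, and it assumes that R's PCRE was built with Unicode property support (pcre_config() reports this). In tm it could presumably be wrapped in content_transformer() in the same way as toSpace above.

```r
# Hypothetical helper (not part of tm): replace runs of Unicode punctuation
# (\p{P}) and symbols (\p{S}) with a space, then tidy up the whitespace.
# Assumes PCRE with Unicode property support; see pcre_config().
stripUnicodePunct <- function(x) {
  x <- enc2utf8(x)                                   # make sure input is UTF-8
  x <- gsub("[\\p{P}\\p{S}]+", " ", x, perl = TRUE)  # punctuation/symbols -> space
  x <- gsub("\\s+", " ", x, perl = TRUE)             # collapse whitespace runs
  trimws(x)                                          # drop leading/trailing space
}

# Curly quotes, an en dash, a plus-minus sign and a degree sign,
# written as \u escapes so the source file stays ASCII
example <- "\u201cHello\u201d \u2013 world \u00b1 5\u00b0"
stripUnicodePunct(example)
```

Note that this removes all punctuation, including the punctuation inside URLs, so it does not preserve intra-word characters.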

This function removes all characters that are neither alphanumeric nor whitespace (e.g. UTF-8 emoji):

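removeNonAlnum <- function(x){
  # Keep only alphanumerics and whitespace. A second "^" inside the
  # brackets would be a literal caret and would leave carets in the text.
  gsub("[^[:alnum:][:space:]]", "", x)
}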
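
For the requirement in the question (strip punctuation only at word boundaries, but keep it inside URLs and numbers), a variant that touches only the edges of whitespace-separated tokens might be closer. This is a sketch in base R, not something taken from tm: the name trimTokenPunct is hypothetical, and it again assumes PCRE with Unicode property support.

```r
# Hypothetical helper: strip punctuation/symbols only at token edges,
# keeping intra-word punctuation such as the dots and slashes in URLs.
trimTokenPunct <- function(x) {
  x <- enc2utf8(x)
  # First alternative: a run at the start of a token (not preceded by non-space).
  # Second alternative: a run at the end of a token (not followed by non-space).
  edge <- "(?<!\\S)[\\p{P}\\p{S}]+|[\\p{P}\\p{S}]+(?!\\S)"
  x <- gsub(edge, "", x, perl = TRUE)
  # Tokens that were pure punctuation leave gaps; collapse and trim them
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

trimTokenPunct("(see http://example.org/a-b), p=0.005) \u201cwell-known\u201d ...")
```

Tokens made up only of punctuation, such as "..." or "(/)", disappear entirely, while "0.001)," is reduced to "0.001".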

score 1 · Accepted Answer

I had the same problem: my custom function was not working, and it turned out that the first line below has to be added.

Regards,

Susana

replaceExpressions <- function(x) UseMethod("replaceExpressions", x)

replaceExpressions.PlainTextDocument <- replaceExpressions.character  <- function(x) {
    x <- gsub(".", " ", x, ignore.case =FALSE, fixed = TRUE)
    x <- gsub(",", " ", x, ignore.case =FALSE, fixed = TRUE)
    x <- gsub(":", " ", x, ignore.case =FALSE, fixed = TRUE)
    return(x)
}

notes_pre_clean <- tm_map(notes, replaceExpressions, useMeta = FALSE)
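
The generic on the first line is what lets one function body serve both plain character vectors and PlainTextDocument objects. Here is a minimal base-R sketch of that dispatch, using a mock object rather than a real tm document. The name replaceDots and the mock's class vector are assumptions made for illustration only.

```r
# S3 generic: dispatches on the class of x
replaceDots <- function(x) UseMethod("replaceDots", x)

# One function body registered as the method for two classes
replaceDots.PlainTextDocument <- replaceDots.character <- function(x) {
  gsub(".", " ", x, fixed = TRUE)
}

plain <- "a.b"
# Mock stand-in for a tm document; real tm objects are structured differently
mock <- structure("c.d", class = c("PlainTextDocument", "character"))

replaceDots(plain)               # uses replaceDots.character
as.character(replaceDots(mock))  # uses replaceDots.PlainTextDocument
```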
