r - How to identify/delete non-UTF-8 characters in R

Question

When I import a Stata dataset in R (using the foreign package), the import sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to transform the object to JSON (using the rjson package).

How can I identify invalid UTF-8 characters in a string and then delete them?

score 38 · Accepted Answer

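Another solution uses iconv and its sub argument. sub is a character string; if it is not NA (here I set it to ''), it is used to replace any non-convertible bytes in the input.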

x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8", sub = "")  ## replace any non-UTF-8 bytes with ''
## [1] "faile"

Note that if we choose the correct source encoding, the character is converted instead of being dropped:

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8", sub = "")
xx
## [1] "façile"
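The iconv approach above covers the "delete" half of the question. For the "identify" half, a minimal sketch using base R's validUTF8() (available since R 3.3.0) might look like the following. The vector x and the names bad, x_clean and x_marked are made up for illustration.

```r
# Toy input: the second element contains the byte 0xE7, which is not valid UTF-8
x <- c("facile", "fa\xE7ile", "plain ascii")

# validUTF8() returns FALSE for strings whose bytes are not valid UTF-8
bad <- !validUTF8(x)
which(bad)

# Delete the offending bytes, but only in the strings that need it
x_clean <- x
x_clean[bad] <- iconv(x[bad], from = "UTF-8", to = "UTF-8", sub = "")

# Alternatively, sub = "byte" leaves a visible hex marker instead of deleting
x_marked <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
```

Checking first and converting only the flagged elements leaves strings that are already valid untouched, and the marked version can help when guessing what the original encoding was.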

score 6 · Accepted Answer

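Yihui's xfun package has a function, read_utf8, that reads a file on the assumption that it is encoded as UTF-8. If the file contains non-UTF-8 lines, it triggers a warning telling you which lines contain non-UTF-8 characters. Under the hood it uses a non-exported function, xfun:::invalid_utf8(), which is simply: which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))).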

To detect the specific non-UTF-8 words in a string, you can modify the above slightly and do something like this:

# TRUE for elements that are not NA but cannot be read as UTF-8
invalid_utf8_ <- function(x) {
  !is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))
}

# Split `string` on `separator` and report the pieces that are not valid UTF-8
detect_invalid_utf8 <- function(string, separator) {
  stringSplit <- unlist(strsplit(string, separator))
  invalidIndex <- invalid_utf8_(stringSplit)

  data.frame(
    word = stringSplit[invalidIndex],
    stringIndex = which(invalidIndex)
  )
}

x <- "This is a string fa\xE7ile blah blah blah fa\xE7ade"

detect_invalid_utf8(x, " ")

#     word stringIndex
# 1 façile           5
# 2 façade           9
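Since the question is about an imported dataset rather than a single string, the same check can be applied column by column. The following is only a sketch: find_invalid_utf8 is a hypothetical helper name, the data frame is a toy stand-in for the imported data, and it uses base R's validUTF8() in place of the iconv test above.

```r
# Report (column, row) positions of values that are not valid UTF-8.
# Hypothetical helper; factor columns are checked via their labels.
find_invalid_utf8 <- function(df) {
  hits <- lapply(names(df), function(col) {
    v <- df[[col]]
    if (is.factor(v)) v <- as.character(v)
    if (!is.character(v)) return(NULL)      # skip numeric, logical, ...
    rows <- which(!is.na(v) & !validUTF8(v))
    if (length(rows) == 0) return(NULL)
    data.frame(column = col, row = rows, stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)                      # NULL when nothing is invalid
}

# Toy data standing in for the imported dataset
dat <- data.frame(
  id   = 1:3,
  name = c("ok", "fa\xE7ade", "fine"),
  stringsAsFactors = FALSE
)

res <- find_invalid_utf8(dat)
```

Here res should have one row pointing at column name, row 2. The function returns NULL when every text column is clean.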

score 4 · Accepted Answer

Another approach, which uses dplyr to remove the bad characters across an entire dataset:

library(dplyr)

MyData %>%
    mutate_at(vars(MyTextVar1, MyTextVar2), function(x) gsub('[^ -~]', '', x))

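Here MyData is the dataset with the bad apples, and MyTextVar1 and MyTextVar2 are the text variables to remove them from. The pattern [^ -~] matches everything outside the printable ASCII range (space through tilde), so valid accented characters are removed as well. This may be less robust than changing the encoding, but often it is fine and easy to just remove them.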
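If dropping every non-ASCII character is too aggressive, a gentler variant is to remove only the bytes that are not valid UTF-8, reusing the iconv call from the first answer. This is a base R sketch rather than the dplyr pipeline above; MyData and the column names are the placeholders from this answer, filled with made-up values, and strip_invalid is a hypothetical helper.

```r
# Toy stand-in for MyData; the second value of MyTextVar1 has an invalid byte,
# the first value of MyTextVar2 is valid UTF-8 with an accented character
MyData <- data.frame(
  MyTextVar1 = c("plain", "fa\xE7ile"),
  MyTextVar2 = c("fa\u00e7ade", "tea"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop only bytes that cannot be read as UTF-8
strip_invalid <- function(x) iconv(x, from = "UTF-8", to = "UTF-8", sub = "")

text_cols <- c("MyTextVar1", "MyTextVar2")
MyData[text_cols] <- lapply(MyData[text_cols], strip_invalid)
```

With this version the valid "ç" in MyTextVar2 should be kept, while the stray byte in MyTextVar1 is removed.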

score 1 · Accepted Answer

Instead of removing them, you can try converting them to UTF-8 strings with iconv.

library(foreign)
dat <- read.dta("data.dta")

# Re-encode the levels of every factor column from latin1 to UTF-8
for (j in seq_len(ncol(dat))) {
  if (is.factor(dat[[j]]))
    levels(dat[[j]]) <- iconv(levels(dat[[j]]), from = "latin1", to = "UTF-8")
}

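You can replace latin1 with whichever encoding is more appropriate in your case. Since we do not have access to your data, it is hard to know which one that would be.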
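The loop above only touches factor columns. If the import also produces plain character columns, something along these lines might cover both. It is a sketch under stated assumptions: to_utf8 is a hypothetical helper, latin1 is only a guess at the source encoding, and the data frame is a toy stand-in for the result of read.dta.

```r
# Hypothetical helper: re-encode factor levels and character columns to UTF-8.
# `from` is an assumption about the source encoding and may need changing.
to_utf8 <- function(df, from = "latin1") {
  for (j in seq_along(df)) {
    if (is.factor(df[[j]])) {
      levels(df[[j]]) <- iconv(levels(df[[j]]), from = from, to = "UTF-8")
    } else if (is.character(df[[j]])) {
      df[[j]] <- iconv(df[[j]], from = from, to = "UTF-8")
    }
  }
  df
}

# Toy data with latin1 bytes in a factor and in a character column
f <- factor(c("x", "y"))
levels(f) <- c("fa\xE7ade", "plain")
dat <- data.frame(a = f, b = c("caf\xE9", "tea"), stringsAsFactors = FALSE)

out <- to_utf8(dat)

# Confirm that nothing invalid is left before handing the data to a JSON encoder
still_bad <- c(!validUTF8(levels(out$a)), !validUTF8(out$b))
```

If any element of still_bad is TRUE, or iconv returned NA for some value, the assumed source encoding is probably wrong for that column.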
