regex - How to remove special characters and spaces from a string and trim a character variable in R

Question

I have a small problem with a character variable in R. The variable in my data frame has the following structure:

X1
ANGLO AUTOMOTRIZ S.A. MATRIZ
AUTOMOTORES Y ANEXOS / AYASA
ECUA - AUTO S.A. MATRIZ
METROCAR S.A. 10 DE AGOSTO
MOSUMI LA "Y"

What I want is a new variable without characters such as . / - and ", with each string joined together so that it contains no spaces:

X2
ANGLOAUTOMOTRIZSAMATRIZ
AUTOMOTORESYANEXOSAYASA
ECUAAUTOSAMATRIZ
METROCARSA10DEAGOSTO
MOSUMILAY

Is it possible to do this in R? Thanks.

score 14 · Accepted Answer

Try gsub...

gsub( "\\.|/|\\-|\"|\\s" , "" , df$X1 )
#[1] "ANGLOAUTOMOTRIZSAMATRIZ" "AUTOMOTORESYANEXOSAYASA" "ECUAAUTOSAMATRIZ"       
#[4] "METROCARSA10DEAGOSTO"    "MOSUMILAY"

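\\. - matches a literal .
| - the "or" separator (alternation)
/ - matches a / (no escaping needed)
\\- - matches a literal -
\" - matches a literal "
\\s - matches a whitespace character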

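gsub replaces every match in each string (sub would replace only the first one), and it is also vectorized, so you can pass the whole column at once. The second argument is the replacement value, in this case "", so every matched character is replaced with nothing.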
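
To store the result as a new column, as the question asks, the same call can be assigned back into the data frame. This is a minimal sketch: the data frame name df follows the answer above, the sample rows are rebuilt from the question, and the bracketed character class (x2_class is just an illustrative name) is an equivalent way of writing the alternation rather than the answer's original pattern.

```r
# Sample data rebuilt from the question; `df` is an assumed name
df <- data.frame(
  X1 = c("ANGLO AUTOMOTRIZ S.A. MATRIZ",
         "AUTOMOTORES Y ANEXOS / AYASA",
         "ECUA - AUTO S.A. MATRIZ",
         "METROCAR S.A. 10 DE AGOSTO",
         "MOSUMI LA \"Y\""),
  stringsAsFactors = FALSE
)

# Pattern from the answer: alternation of the unwanted characters
df$X2 <- gsub("\\.|/|\\-|\"|\\s", "", df$X1)

# Equivalent character class; "-" is placed last so it is taken literally
x2_class <- gsub("[./\"[:space:]-]", "", df$X1)

# Both patterns should give the same cleaned strings
identical(df$X2, x2_class)
```

The character class form tends to be easier to extend: adding another unwanted character means adding it inside the brackets instead of adding another | branch.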

score 6 · Accepted Answer

Since you are also dealing with accented characters, I can think of two options:

Get rid of the accented characters entirely.
Use iconv to attempt to "transliterate" the accented characters to ASCII characters.

Here are both. For both examples, I'm using the following sample text:

Z <- c("ANGLO AUTOMOTRIZ S.A. MATRIZ", "AUTOMOTORES Y ANEXOS / AYASA",
"ECUA - AUTO S.A. MATRIZ", "METROCAR S.A. 10 DE AGOSTO", "MOSUMI LA \"Y\"",
"distribuir contenidos", "proponer autoevaluaciones", "como buzón de actividades")

Option 1: Note that the accented "ó" is dropped in the last item.

gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", Z, perl=TRUE)
# [1] "ANGLOAUTOMOTRIZSAMATRIZ"  "AUTOMOTORESYANEXOSAYASA"  "ECUAAUTOSAMATRIZ"        
# [4] "METROCARSA10DEAGOSTO"     "MOSUMILAY"                "distribuircontenidos"    
# [7] "proponerautoevaluaciones" "comobuzndeactividades"

Option 2: Note that the "ó" has been converted to "o"

gsub("[[:punct:]]|[[:space:]]", "", iconv(Z, to = "ASCII//TRANSLIT"))
# [1] "ANGLOAUTOMOTRIZSAMATRIZ"  "AUTOMOTORESYANEXOSAYASA"  "ECUAAUTOSAMATRIZ"        
# [4] "METROCARSA10DEAGOSTO"     "MOSUMILAY"                "distribuircontenidos"    
# [7] "proponerautoevaluaciones" "comobuzondeactividades"

Notes:

For convenience, I've decided to just use the character classes [[:punct:]] and [[:space:]].
For the first option, you need perl = TRUE to recognize the [[:ascii:]] character class.
The ^ in option 1 means "not", and it applies only to the [:ascii:] class (so you can read the pattern as "find anything that is not an ASCII character, or that is a space, or that is a punctuation mark, and replace it with nothing").
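
If you need this in more than one place, the two options can be wrapped in a small helper. This is only a sketch: clean_text and its transliterate argument are hypothetical names, not part of the answer or of base R. Transliteration is off by default because the output of iconv with //TRANSLIT can vary between platforms.

```r
# Hypothetical helper (not from the answer) combining both options
clean_text <- function(x, transliterate = FALSE) {
  if (transliterate) {
    # Option 2: try to map accented characters to ASCII first.
    # What //TRANSLIT produces depends on the platform's iconv.
    x <- iconv(x, to = "ASCII//TRANSLIT")
  }
  # Option 1 pattern: drop non-ASCII characters, punctuation and whitespace
  gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", x, perl = TRUE)
}

Z <- c("ECUA - AUTO S.A. MATRIZ", "MOSUMI LA \"Y\"",
       "como buzón de actividades")

# Default behaviour: accented characters are dropped (option 1)
cleaned <- clean_text(Z)
cleaned
```

Calling clean_text(Z, transliterate = TRUE) would follow option 2 instead; check the result on your own system before relying on it.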
