Since you are also dealing with accented characters, I can think of two options:
- Get rid of the accented characters entirely.
- Use iconv to attempt to "transliterate" the accented characters to ASCII characters.
Here are both. For both examples, I'm using the following sample text:
Z <- c("ANGLO AUTOMOTRIZ S.A. MATRIZ", "AUTOMOTORES Y ANEXOS / AYASA",
"ECUA - AUTO S.A. MATRIZ", "METROCAR S.A. 10 DE AGOSTO", "MOSUMI LA \"Y\"",
"distribuir contenidos", "proponer autoevaluaciones", "como buzón de actividades")
Option 1: Note that the accented "ó" is dropped in the last item.
gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", Z, perl=TRUE)
# [1] "ANGLOAUTOMOTRIZSAMATRIZ" "AUTOMOTORESYANEXOSAYASA" "ECUAAUTOSAMATRIZ"
# [4] "METROCARSA10DEAGOSTO" "MOSUMILAY" "distribuircontenidos"
# [7] "proponerautoevaluaciones" "comobuzndeactividades"
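Option 2: Note that the "ó" has been converted to "o". (The exact result of //TRANSLIT depends on your platform's iconv implementation, so check the output on your own system.)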
gsub("[[:punct:]]|[[:space:]]", "", iconv(Z, to = "ASCII//TRANSLIT"))
# [1] "ANGLOAUTOMOTRIZSAMATRIZ" "AUTOMOTORESYANEXOSAYASA" "ECUAAUTOSAMATRIZ"
# [4] "METROCARSA10DEAGOSTO" "MOSUMILAY" "distribuircontenidos"
# [7] "proponerautoevaluaciones" "comobuzondeactividades"
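If you need to switch between the two approaches, they could be wrapped in a small helper. This is only a sketch: the function name clean_text and its translit argument are made up for illustration and are not part of base R. The accented character is written as "\u00f3" so the example does not depend on the file's encoding.

```r
# Hypothetical helper (name and argument are illustrative, not from base R).
clean_text <- function(x, translit = FALSE) {
  if (translit) {
    # Option 2: transliterate to ASCII first, then strip punctuation and whitespace.
    # The result of //TRANSLIT is platform-dependent.
    x <- iconv(x, to = "ASCII//TRANSLIT")
    gsub("[[:punct:]]|[[:space:]]", "", x)
  } else {
    # Option 1: drop non-ASCII characters, punctuation, and whitespace.
    # perl = TRUE is needed for the [[:ascii:]] class.
    gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", x, perl = TRUE)
  }
}

opt1_plain  <- clean_text("ECUA - AUTO S.A. MATRIZ")
opt1_accent <- clean_text("como buz\u00f3n de actividades")
opt1_plain
# [1] "ECUAAUTOSAMATRIZ"
opt1_accent
# [1] "comobuzndeactividades"
```

Calling clean_text(Z, translit = TRUE) should behave like option 2 above, subject to the same platform caveat.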
Notes:
- For convenience, I've decided to just use the character classes [[:punct:]] and [[:space:]].
- For the first option, you need perl = TRUE to recognize the [[:ascii:]] character class.
- The ^ in option 1 means "not", so you can read the pattern as "find anything that is not an ASCII character, or that is a space, or that is a punctuation mark, and replace it with nothing".
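To see what each alternative in the option 1 pattern contributes, you can run the three pieces separately. This is just an illustrative sketch: the sample string x is made up for the demonstration, and the "ó" is written as "\u00f3" to avoid encoding surprises.

```r
# Made-up sample string containing an accented letter, a comma, and spaces.
x <- "buz\u00f3n, de actividades"

# [^[:ascii:]] matches anything that is NOT an ASCII character (needs perl = TRUE).
no_nonascii <- gsub("[^[:ascii:]]", "", x, perl = TRUE)
no_nonascii
# [1] "buzn, de actividades"

# [[:punct:]] matches punctuation marks only.
no_punct <- gsub("[[:punct:]]", "", x)
no_punct
# [1] "buzón de actividades"

# [[:space:]] matches whitespace only.
no_space <- gsub("[[:space:]]", "", x)
no_space
# [1] "buzón,deactividades"
```

Joining the three pieces with | removes everything matched by any of them in a single pass, which is what option 1 does.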