I have code that scrapes data from a website written in Thai script. As part of the project I need to replace some of the Thai text. A small example follows:

thaidf <- data.frame(thdate = c("31 พฤษภาคม 2555","30 เมษายน 2555"), 
                     value = c(100,110))
english_months <- seq(1:12)
thai_months <- c('มกราคม','กุมภาพันธ์','มีนาคม','เมษายน','พฤษภาคม','มิถุนายน',
                 'กรกฎาคม','สิงหาคม','กันยายน','ตุลาคม','พฤศจิกายน','ธันวาคม')

print(thaidf)
for (ii in seq_along(thai_months)) { 
     ## convert months in Thai script to numerical
     thaidf$thdate <- (sapply(thaidf$thdate, 
                      function(x) {gsub(thai_months[ii], 
                                   english_months[ii], x, useBytes = TRUE)}))
}
print(thaidf)

When I run this code from inside Emacs/ESS it does not work. Note how in the screenshot below the console cannot print the Thai characters to screen when the code is executed, nor does it appear to recognise the values in thaidf$thdate, so the gsub() call does not succeed. Instead of producing '31 5 2555', where 31 is the day, 5 is the month and 2555 is the Buddhist year, it outputs '311 2555'.

[Screenshot: Emacs/ESS console output]

However, when I copy and paste this same code into the RGui front-end, it works fine. It both prints the characters, as shown below, and the gsub() correctly replaces the Thai script with Latin numbers, as one would expect. As you can see from the screengrab below, the Thai script for 'May' becomes '5' and the Thai script for 'April' becomes '4'.

[Screenshot: RGui console output]

My first thought was that it might be a font issue, but the Thai characters are displayed correctly in the Emacs buffer. They are just not recognised when the code is run with C-c M-b. Why does this happen? How can I prevent it?

sessionInfo()
R version 2.15.0 Patched (2012-06-03 r59501)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  [2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 [4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
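
To help narrow this down, here is a small diagnostic sketch that could be run in both the ESS process and RGui and then compared. It is not part of my original code: `describe_string` and `may_utf8` are hypothetical names introduced for illustration. It also assumes that the interesting difference is in the bytes each front-end hands to R, which I have not confirmed.

```r
# Diagnostic sketch: run in both the ESS process and RGui, then compare.
# `describe_string` is a hypothetical helper, not part of the original code.
describe_string <- function(x) {
  list(
    declared = Encoding(x),                               # declared encoding mark
    n_chars  = nchar(x, type = "chars", allowNA = TRUE),  # NA if invalid in that encoding
    n_bytes  = nchar(x, type = "bytes"),
    bytes    = charToRaw(x)                               # the raw bytes R actually holds
  )
}

# Written with \u escapes so the value does not depend on how the
# front-end encodes the source text. This spells the Thai word for May.
may_utf8 <- "\u0e1e\u0e24\u0e29\u0e20\u0e32\u0e04\u0e21"

str(l10n_info())                # what the session locale claims to support
str(describe_string(may_utf8))  # reference: a string known to be UTF-8

# The same word as a literal typed in the buffer. If its bytes differ from
# the reference above, the text was altered before R received it.
str(describe_string("พฤษภาคม"))
```

If the last two results differ under ESS but match under RGui, that would suggest the text is being changed on its way from the Emacs buffer to the R process rather than inside gsub() itself.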
