I have a .csv file that was saved with UTF-8 encoding. The data in the file is in Devanagari script. In Excel, I can see the words in the csv file correctly:

में
लिए
किया
गया
हैं
नहीं
सिंह
पुलिस
दिया
करने
कहा
रहे
बाद
करें
साथ
रहा

But when I open the file in R, the words are not decoded correctly. The output of print() looks like this:

                                                    word
                      सारे_खतरों_को
                जानते_हà¥\u0081à¤\u008f_भी
                                   विवेक_ने
                                             टीवी

How can I fix this? I tried Sys.setlocale() and read.delim(wordlist.csv, encoding = "UTF-8"), but neither worked.

1 Answer

Too long for a comment (sorry, I'm new to R):

print( sessionInfo())

library(stringi)
library(magrittr)

x <- read.delim("D:\\bat\\SO\\64497248_devangari.csv", encoding = "UTF-8")
print('=== print(x)')
print(x)
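# Iterating over a data frame visits its columns, so each `line` is a whole
# column vector; with the single column `word`, y ends up holding that column.
for (line in x){
  y <- line %>% 
    # rewrite every "<U+XXXX>" as the escape sequence "\uXXXX" ...
    stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
    # ... decode the escapes into real characters ...
    stri_unescape_unicode() %>% 
    # ... and make sure the result is UTF-8
    stri_enc_toutf8()
}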

print('=== print(y)')
print(y)

print('=== for (i in y) {print(i)}')
for (i in y) {print(i)}

print('=== print(z)')
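# x['word'] is a one-column data frame, not a character vector, so stringi
# has to coerce it (see the coercion warning at the end of the output).
z <- x['word'] %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
  stri_unescape_unicode() %>% 
  stri_enc_toutf8()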
print(z)

Output (in the Rgui.exe console):

> source ( 'D:\\bat\\SO\\64497248.r' )
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=Czech_Czechia.1250  LC_CTYPE=Czech_Czechia.1250    LC_MONETARY=Czech_Czechia.1250
[4] LC_NUMERIC=C                   LC_TIME=Czech_Czechia.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.0.1
[1] "=== print(x)"
                                       word
1                  <U+092E><U+0947><U+0902>
2                  <U+0932><U+093F><U+090F>
3          <U+0915><U+093F><U+092F><U+093E>
4                  <U+0917><U+092F><U+093E>
5                  <U+0939><U+0948><U+0902>
6          <U+0928><U+0939><U+0940><U+0902>
7          <U+0938><U+093F><U+0902><U+0939>
8  <U+092A><U+0941><U+0932><U+093F><U+0938>
9          <U+0926><U+093F><U+092F><U+093E>
10         <U+0915><U+0930><U+0928><U+0947>
11                 <U+0915><U+0939><U+093E>
12                 <U+0930><U+0939><U+0947>
13                 <U+092C><U+093E><U+0926>
14         <U+0915><U+0930><U+0947><U+0902>
15                 <U+0938><U+093E><U+0925>
16                 <U+0930><U+0939><U+093E>
[1] "=== print(y)"
 [1] "में"    "लिए"  "किया" "गया"  "हैं"    "नहीं"  "सिंह"  "पुलिस" "दिया" "करने"  "कहा"  "रहे"   "बाद"  "करें"   "साथ"  "रहा" 
[1] "=== for (i in y) {print(i)}"
[1] "में"
[1] "लिए"
[1] "किया"
[1] "गया"
[1] "हैं"
[1] "नहीं"
[1] "सिंह"
[1] "पुलिस"
[1] "दिया"
[1] "करने"
[1] "कहा"
[1] "रहे"
[1] "बाद"
[1] "करें"
[1] "साथ"
[1] "रहा"
[1] "=== print(z)"
[1] "c(\"में\", \"लिए\", \"किया\", \"गया\", \"हैं\", \"नहीं\", \"सिंह\", \"पुलिस\", \"दिया\", \"करने\", \"कहा\", \"रहे\", \"बाद\", \"करें\", \"साथ\", \"रहा\"\n)"
Warning messages:
1: package ‘magrittr’ was built under R version 4.0.2 
2: In stri_replace_all_regex(., "<U\\+([[:alnum:]]+)>", "\\\\u$1") :
  argument is not an atomic vector; coercing
> 
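
The z result above is a single deparsed string, and the warning says why: x['word'] is a data frame rather than an atomic vector. A minimal sketch of the same conversion applied to a plain character vector might look like the following. The helper name unescape_u and the sample input are hypothetical, and the pattern assumes four-hex-digit code points, which covers Devanagari.

```r
library(stringi)

# Hypothetical helper (not part of stringi): turn literal "<U+XXXX>"
# markers back into the characters they stand for.
# Assumes 4-hex-digit code points only; any other backslash escapes already
# present in the input would also be interpreted by stri_unescape_unicode().
unescape_u <- function(s) {
  # "<U+092E>" becomes the escape sequence "\u092E"
  escaped <- stri_replace_all_regex(s, "<U\\+([0-9A-Fa-f]{4})>", "\\\\u$1")
  # decode the "\uXXXX" escapes into real characters
  stri_unescape_unicode(escaped)
}

# Sample input written as literal markers, copied from the print(x) output
words <- c("<U+092E><U+0947><U+0902>", "<U+0932><U+093F><U+090F>")
fixed <- unescape_u(words)
print(fixed)
```

With the data frame from the script, passing x$word or x[["word"]] (a character vector) instead of x['word'] should avoid the coercion warning and return one element per row.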
Answered 2020-10-23T17:41:11.547