This should be an easy one.

Suppose I have this string in R:

a <- "%C3%B6sterlich"

which stands for:

österlich (German for "Easter-related")

But if I call URLdecode(a), I get:

[1] "Ã¶sterlich"

This makes sense to a degree, since %C3 is Ã and %B6 is ¶ in ASCII URL encoding. But as you can see here: http://www.backbone.se/urlencodingUTF8.htm, %C3%B6 stands for ö in UTF-8 encoding.
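
To make the byte-level picture concrete, here is a minimal base R sketch (the variable names are illustrative, not from the original post). The two bytes behind %C3%B6 read as two characters when declared as Latin-1, but as the single character ö when declared as UTF-8:

```r
# The two raw bytes that "%C3%B6" percent-decodes to.
bytes <- as.raw(c(0xC3, 0xB6))

# Same bytes, declared as Latin-1: two characters (Ã and ¶).
as_latin1 <- rawToChar(bytes)
Encoding(as_latin1) <- "latin1"

# Same bytes, declared as UTF-8: one character (ö, code point U+00F6).
as_utf8 <- rawToChar(bytes)
Encoding(as_utf8) <- "UTF-8"

nchar(as_latin1, type = "chars")  # 2
nchar(as_utf8, type = "chars")    # 1
utf8ToInt(as_utf8)                # 246, i.e. U+00F6
```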

So the question is: how do I tell URLdecode() to use the UTF-8 table?

2 Answers

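I finally found a solution to this. Here is my use case and what I tried.

These strings were scraped from Wikipedia with rvest, so the strings themselves should be fine. All of them contain %, but not all of them cause problems.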

#problem strings
problem_strs = c("Roscoe_%22Fatty%22_Arbuckle", "Michael_%22Atters%22_Attree", 
  "J%C3%BCrgen_Becker", "Vicco_von_B%C3%BClow", "B%C3%BClent_Ceylan", 
  "Se%C3%A1n_Cullen", "Chris_D%27Elia", "U%C4%9Fur_R%C4%B1fat_Karlova", 
  "Mike_Kr%C3%BCger", "Andr%C3%A9s_L%C3%B3pez_Forero", "Mo%27Nique", 
  "Jos%C3%A9_S%C3%A1nchez_Mota", "Dara_%C3%93_Briain", "Conan_O%27Brien", 
  "Mike_O%27Brien_(actor)", "Carroll_O%27Connor", "Donald_O%27Connor", 
  "Rosie_O%27Donnell", "Michael_O%27Donoghue", "Chris_O%27Dowd", 
  "Ardal_O%27Hanlon", "Catherine_O%27Hara", "Patrice_O%27Neal", 
  "Barunka_O%27Shaughnessy", "Raven-Symon%C3%A9", "Charles_%22Chic%22_Sale", 
  "No%C3%ABl_Wells", "%22Weird_Al%22_Yankovic", "Cem_Y%C4%B1lmaz"
)

First, try the base R solution. For some reason it is not vectorized, so we use purrr:

#utils::URLdecode
library(purrr)  # also re-exports the %>% pipe used below
problem_strs %>% purrr::map_chr(utils::URLdecode)

 [1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "JÃ¼rgen_Becker"            "Vicco_von_BÃ¼low"         
 [5] "BÃ¼lent_Ceylan"            "SeÃ¡n_Cullen"              "Chris_D'Elia"              "UÄŸur_RÄ±fat_Karlova"     
 [9] "Mike_KrÃ¼ger"              "AndrÃ©s_LÃ³pez_Forero"     "Mo'Nique"                  "JosÃ©_SÃ¡nchez_Mota"      
[13] "Dara_Ã“_Briain"            "Conan_O'Brien"             "Mike_O'Brien_(actor)"      "Carroll_O'Connor"         
[17] "Donald_O'Connor"           "Rosie_O'Donnell"           "Michael_O'Donoghue"        "Chris_O'Dowd"             
[21] "Ardal_O'Hanlon"            "Catherine_O'Hara"          "Patrice_O'Neal"            "Barunka_O'Shaughnessy"    
[25] "Raven-SymonÃ©"             "Charles_\"Chic\"_Sale"     "NoÃ«l_Wells"               "\"Weird_Al\"_Yankovic"    
[29] "Cem_YÄ±lmaz"
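
As an aside, the element-wise call does not strictly need purrr. A minimal base R sketch using vapply follows; the sample input and the name decoded are illustrative, and only ASCII escapes are used so the result does not depend on the locale:

```r
# Two percent-encoded strings from the list above (ASCII-only escapes).
x <- c("Mo%27Nique", "Roscoe_%22Fatty%22_Arbuckle")

# Apply utils::URLdecode to each element and collect a character vector.
decoded <- vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
decoded
```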

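Comparing these with the inputs, a pattern emerges: the strings with two consecutive % escapes (e.g. %C3%BC) are the ones that cause problems. So I read through every question I could find about URL decoding in R and found these suggested solutions: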

#urltools::url_decode
urltools::url_decode(problem_strs)

Same result as before.

What is the encoding? Try setting it to UTF-8:

> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> #try to set
> Encoding(problem_strs) = "UTF-8"
> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(problem_strs) = "utf8"
> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> urltools::url_decode(problem_strs)

Same output as before.

Someone suggested another way to check and set it:

> problem_strs = iconv(problem_strs, from = "ASCII", to = "UTF-8")
> Encoding(problem_strs)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
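
A likely reason every attempt above reports "unknown": the percent-encoded strings are pure ASCII, and R's documentation for Encoding notes that ASCII strings are never marked with a declared encoding. A small sketch of that behaviour (x and y are illustrative names):

```r
# Still percent-encoded: only ASCII characters, so the mark does not stick.
x <- "J%C3%BCrgen"
Encoding(x) <- "UTF-8"
Encoding(x)  # "unknown"

# Actual non-ASCII bytes (J, 0xC3, 0xBC): the mark is kept.
y <- rawToChar(as.raw(c(0x4A, 0xC3, 0xBC)))
Encoding(y) <- "UTF-8"
Encoding(y)  # "UTF-8"
```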

I found another package on the list:

> #Ruchardet to detect?
> Ruchardet::detectEncoding(problem_strs)
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

#Is it simpler than we thought?
urltools::url_decode(problem_strs) %>% urltools::url_decode()

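Same output.

So I googled one specific pattern that causes problems, e.g. %C3%BC. That turned up an answer for PHP that gets halfway there:

First you need to urldecode it, which will give you Ã¼, which is the UTF-8 encoded representation of ü, so you should be all good.

OK, let's try that in R: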

#url decode, then set utf
halfway = urltools::url_decode(problem_strs)
Encoding(halfway) = "UTF-8"
halfway
 [1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker"             "Vicco_von_Bülow"          
 [5] "Bülent_Ceylan"             "Seán_Cullen"               "Chris_D'Elia"              "Uğur_Rıfat_Karlova"       
 [9] "Mike_Krüger"               "Andrés_López_Forero"       "Mo'Nique"                  "José_Sánchez_Mota"        
[13] "Dara_Ó_Briain"             "Conan_O'Brien"             "Mike_O'Brien_(actor)"      "Carroll_O'Connor"         
[17] "Donald_O'Connor"           "Rosie_O'Donnell"           "Michael_O'Donoghue"        "Chris_O'Dowd"             
[21] "Ardal_O'Hanlon"            "Catherine_O'Hara"          "Patrice_O'Neal"            "Barunka_O'Shaughnessy"    
[25] "Raven-Symoné"              "Charles_\"Chic\"_Sale"     "Noël_Wells"                "\"Weird_Al\"_Yankovic"    
[29] "Cem_Yılmaz"               

Here it is as a reusable function:

url_decode_utf = function(x) {
  y = urltools::url_decode(x)
  Encoding(y) = "UTF-8"
  y
}
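
If urltools is not available, roughly the same idea can be sketched with base R only. The name url_decode_utf8_base is hypothetical, and the sketch assumes the percent-escapes really do encode UTF-8 bytes:

```r
# Hypothetical base R counterpart of url_decode_utf above.
url_decode_utf8_base <- function(x) {
  vapply(x, function(s) {
    y <- utils::URLdecode(s)  # percent-decode one string to its bytes
    Encoding(y) <- "UTF-8"    # declare those bytes as UTF-8
    y
  }, character(1), USE.NAMES = FALSE)
}

res <- url_decode_utf8_base(c("J%C3%BCrgen_Becker", "Mo%27Nique"))
res
```

Elements that decode to plain ASCII, such as the second one here, simply stay unmarked, which is harmless.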

answered 2016-12-02T23:45:27.777

Try this:

> Encoding(a) <- "UTF-8"

Or use the iconv function:
http://stat.ethz.ch/R-manual/R-devel/library/base/html/iconv.html
http://astrostatistics.psu.edu/datasets/2006tutorial/html/utils/html/iconv.html

Hope this helps ^_^
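
The answer does not spell out how iconv would be used. One possible reading, sketched here as an assumption, is to decode first and then convert the result from UTF-8 to UTF-8, which by default also marks a non-ASCII result as UTF-8 (decoded and fixed are illustrative names):

```r
a <- "%C3%B6sterlich"

# Percent-decode first; the result holds UTF-8 bytes but carries no mark.
decoded <- utils::URLdecode(a)

# Converting UTF-8 to UTF-8 leaves the bytes alone and declares the encoding.
fixed <- iconv(decoded, from = "UTF-8", to = "UTF-8")
Encoding(fixed)
```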

answered 2013-07-30T12:47:38.640