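I finally found a way to solve this. Here is my use case and what I tried.
These strings come from scraping Wikipedia with rvest, so they should be legitimate. All of them contain %,
but not all of them cause problems.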
#problem strings
problem_strs = c("Roscoe_%22Fatty%22_Arbuckle", "Michael_%22Atters%22_Attree",
"J%C3%BCrgen_Becker", "Vicco_von_B%C3%BClow", "B%C3%BClent_Ceylan",
"Se%C3%A1n_Cullen", "Chris_D%27Elia", "U%C4%9Fur_R%C4%B1fat_Karlova",
"Mike_Kr%C3%BCger", "Andr%C3%A9s_L%C3%B3pez_Forero", "Mo%27Nique",
"Jos%C3%A9_S%C3%A1nchez_Mota", "Dara_%C3%93_Briain", "Conan_O%27Brien",
"Mike_O%27Brien_(actor)", "Carroll_O%27Connor", "Donald_O%27Connor",
"Rosie_O%27Donnell", "Michael_O%27Donoghue", "Chris_O%27Dowd",
"Ardal_O%27Hanlon", "Catherine_O%27Hara", "Patrice_O%27Neal",
"Barunka_O%27Shaughnessy", "Raven-Symon%C3%A9", "Charles_%22Chic%22_Sale",
"No%C3%ABl_Wells", "%22Weird_Al%22_Yankovic", "Cem_Y%C4%B1lmaz"
)
First, try the base R solution. For some reason it is not vectorized, so we use purrr:
#utils::URLdecode
library(magrittr) #provides the %>% pipe used below
problem_strs %>% purrr::map_chr(utils::URLdecode)
[1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker" "Vicco_von_Bülow"
[5] "Bülent_Ceylan" "Seán_Cullen" "Chris_D'Elia" "Uğur_Rıfat_Karlova"
[9] "Mike_Krüger" "Andrés_López_Forero" "Mo'Nique" "José_Sánchez_Mota"
[13] "Dara_Ó_Briain" "Conan_O'Brien" "Mike_O'Brien_(actor)" "Carroll_O'Connor"
[17] "Donald_O'Connor" "Rosie_O'Donnell" "Michael_O'Donoghue" "Chris_O'Dowd"
[21] "Ardal_O'Hanlon" "Catherine_O'Hara" "Patrice_O'Neal" "Barunka_O'Shaughnessy"
[25] "Raven-Symoné" "Charles_\"Chic\"_Sale" "Noël_Wells" "\"Weird_Al\"_Yankovic"
[29] "Cem_Yılmaz"
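If we compare these with the originals, we can see the pattern: the ones with two % escapes (such as %C3%BC)
are the ones that cause problems. So I read all the questions related to URL decoding in R and found these suggested solutions: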
#urltools::url_decode
urltools::url_decode(problem_strs)
Same result as before.
What is the encoding? Try setting it to UTF-8:
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> #try to set
> Encoding(problem_strs) = "UTF-8"
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(problem_strs) = "utf8"
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> urltools::url_decode(problem_strs)
Same output as before.
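A likely explanation for why setting the encoding has no visible effect here: the percent-encoded strings contain only ASCII characters, and R never marks pure-ASCII strings with a declared encoding. A minimal sketch, using one string from the list above (the variable name is only illustrative):

```r
# One still percent-encoded string; every character in it is plain ASCII
encoded <- "J%C3%BCrgen_Becker"

# Check that all code points are below 128, i.e. the string is ASCII-only
all(utf8ToInt(encoded) < 128L)

# Declaring UTF-8 is a no-op for ASCII-only strings, so this stays "unknown"
Encoding(encoded) <- "UTF-8"
Encoding(encoded)
```

So the declaration can only take effect after decoding, once the string actually contains non-ASCII bytes.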
Someone suggested another way to check and set it:
> problem_strs = iconv(problem_strs, from = "ASCII", to = "UTF-8")
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
I found another package in a list:
> #Ruchardet to detect?
> Ruchardet::detectEncoding(problem_strs)
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
#Is it simpler than we thought?
urltools::url_decode(problem_strs) %>% urltools::url_decode()
Same output.
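So I googled a specific pattern that causes problems, e.g. %C3%BC. That led me to an answer for PHP that got me halfway:
"First you need to urldecode it, which will give you Ã¼, which is the UTF-8 encoded representation of ü, so you should be all set."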
OK, let's try that in R:
#url decode, then set utf
halfway = urltools::url_decode(problem_strs)
Encoding(halfway) = "UTF-8"
halfway
[1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker" "Vicco_von_Bülow"
[5] "Bülent_Ceylan" "Seán_Cullen" "Chris_D'Elia" "Uğur_Rıfat_Karlova"
[9] "Mike_Krüger" "Andrés_López_Forero" "Mo'Nique" "José_Sánchez_Mota"
[13] "Dara_Ó_Briain" "Conan_O'Brien" "Mike_O'Brien_(actor)" "Carroll_O'Connor"
[17] "Donald_O'Connor" "Rosie_O'Donnell" "Michael_O'Donoghue" "Chris_O'Dowd"
[21] "Ardal_O'Hanlon" "Catherine_O'Hara" "Patrice_O'Neal" "Barunka_O'Shaughnessy"
[25] "Raven-Symoné" "Charles_\"Chic\"_Sale" "Noël_Wells" "\"Weird_Al\"_Yankovic"
[29] "Cem_Yılmaz"
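Why this works, as far as I can tell: the decoder hands back the right bytes, but R has no declared encoding for them, so they may be interpreted in the native encoding. Marking them as UTF-8 changes only the declaration, not the bytes. A small base R sketch with hand-written bytes for ü (0xC3 0xBC); the variable names are illustrative:

```r
# The two bytes that %C3%BC decodes to
raw_bytes <- as.raw(c(0xc3, 0xbc))

# Build a string from the raw bytes; R does not know their encoding yet
s <- rawToChar(raw_bytes)
Encoding(s)

# Declare the bytes to be UTF-8; the bytes themselves are untouched
Encoding(s) <- "UTF-8"
Encoding(s)
identical(charToRaw(s), raw_bytes)

# The string now compares equal to the single character U+00FC
s == "\u00fc"
```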
Here is a reusable function:
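url_decode_utf = function(x) {
  #percent-decode first; the result holds the raw decoded bytes
  y = urltools::url_decode(x)
  #then declare those bytes to be UTF-8
  Encoding(y) = "UTF-8"
  y
}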
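If you would rather not depend on urltools, the same idea can probably be written with base R only. This is a sketch, not part of the solution above: `url_decode_utf_base` is a hypothetical name, and it assumes the percent-encoded bytes really are UTF-8, as they are for Wikipedia URLs.

```r
# Hypothetical base-R variant of url_decode_utf (an assumption, not the original)
url_decode_utf_base <- function(x) {
  # Apply utils::URLdecode element-wise, since it was found not to be vectorized
  y <- vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  # Declare the decoded bytes to be UTF-8
  Encoding(y) <- "UTF-8"
  y
}

# Usage with two strings from the list above
decoded <- url_decode_utf_base(c("J%C3%BCrgen_Becker", "Mo%27Nique"))
decoded

# Only the element with non-ASCII characters gets marked as UTF-8
Encoding(decoded)
```

The second element stays "unknown" because it is pure ASCII after decoding, which is harmless.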