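I went down the encoding rabbit hole too, and one important thing I learned is that an "unknown"
encoding does not necessarily mean the string is not UTF-8. Or that it is bad. Or that it is something that needs fixing.
Here are some examples: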
# Some string that might be UTF-8 or just some ASCII (but created in UTF-8 editor/environment)
ambiguous <- "wat"
Encoding(ambiguous)
#> [1] "unknown"
# Forced coercion to UTF-8 via stringi
ambiguous <- stringi::stri_enc_toutf8("wat", is_unknown_8bit = TRUE)
# Still ambiguous
Encoding(ambiguous)
#> [1] "unknown"
# Some pretty-sure-not-ASCII string
totallygermanic <- "wät"
# It's UTF-8 because that's what my RStudio and every other part of my env is set to
Encoding(totallygermanic)
#> [1] "UTF-8"
# Let's force it to be unknown
Encoding(totallygermanic) <- "unknown"
# Still prints ok
totallygermanic
#> [1] "wät"
# What's its encoding now?
Encoding(totallygermanic)
#> [1] "unknown"
# Converting it to UTF-8 still prints ok
stringi::stri_enc_toutf8(totallygermanic)
#> [1] "wät"
# So the converted string is UTF-8, right? No.
Encoding(stringi::stri_enc_toutf8(totallygermanic))
#> [1] "unknown"
# Maybe we should just guess?
stringi::stri_enc_detect("wat")
#> [[1]]
#>     Encoding Language Confidence
#> 1 ISO-8859-1       en       0.75
#> 2 ISO-8859-2       ro       0.75
#> 3      UTF-8                0.15
stringi::stri_enc_detect("wät")
#> [[1]]
#>   Encoding Language Confidence
#> 1    UTF-8                 0.8
#> 2 UTF-16BE                 0.1
#> 3 UTF-16LE                 0.1
#> 4  GB18030       zh        0.1
#> 5   EUC-JP       ja        0.1
#> 6   EUC-KR       ko        0.1
#> 7     Big5       zh        0.1
Created on 2019-02-11 by the reprex package (v0.2.1)
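The takeaway: if your string is not obviously non-ASCII, e.g. it contains only the letters a-z, it might be ASCII or it might be UTF-8, so you get "unknown". That does not necessarily mean your string is actually not UTF-8. You might be tempted to brute-force coerce the string, and in the process you may break something that was never broken in the first place. In my experience it can be entirely sufficient to run a conversion function such as stringi::stri_enc_toutf8 on the variable/vector, test whether it prints and works as expected, and perhaps apply a regex filter for potentially problematic characters (as Germans, we tend to look for äöüß).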
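To make that "test it, then filter for suspicious characters" check concrete, here is a minimal sketch in base R. The helper name describe_encoding is hypothetical (it is not part of stringi or base R), and using validUTF8() plus a raw byte check is my own assumption about what a reasonable check looks like, not the only way to do it.

```r
# Hypothetical helper: report what can actually be observed about each string,
# independent of the encoding mark that Encoding() shows.
describe_encoding <- function(x) {
  data.frame(
    declared   = Encoding(x),                 # the mark R carries: "unknown", "UTF-8", ...
    valid_utf8 = validUTF8(x),                # are the bytes well-formed UTF-8?
    ascii_only = vapply(
      x,
      function(s) all(as.integer(charToRaw(s)) < 128L),  # every byte below 0x80
      logical(1),
      USE.NAMES = FALSE
    ),
    has_umlaut = grepl("[\u00e4\u00f6\u00fc\u00df]", x), # regex filter for äöüß
    stringsAsFactors = FALSE
  )
}

# "\u00e4" is ä, written as an escape so the file encoding does not matter
x <- c("wat", "w\u00e4t")
describe_encoding(x)

# Contrast: the Latin-1 bytes for "wät" are not valid UTF-8
latin1_bytes <- rawToChar(as.raw(c(0x77, 0xe4, 0x74)))
validUTF8(latin1_bytes)
```

For "wat" the mark is "unknown" even though the bytes are perfectly valid UTF-8 (pure ASCII always is), which is the point made above: the mark alone tells you little. The last two lines show the case worth worrying about, where the bytes themselves fail the UTF-8 check.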
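Anyway, if you want to dig into the details, I can recommend looking at the stringi package and its encoding functions. That package is the engine behind stringr, which provides a higher-level interface.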