r - Problem replacing substrings in parsed Spanish tweets - R 2.15.3

Question

This is the complete code for extracting the tweets. Load the required package:

require(XML)

Let's get some tweets about the #18A hashtag. Define the Twitter search URL (following the Atom standard):

twitter_url = "http://search.twitter.com/search.atom?"

Encode the query:

query = URLencode("#18A")

Vector to store the results:

tweets = character(0)

Paginate 17 times to harvest tweets:

for (page in 1:17)
{
    # build the search URL for this page (geocode and page as separate parameters)
    twitter_search = paste(twitter_url, "q=", query,
                           "&rpp=100&lang=es&geocode=-34.686173,-58.648529,15mi&page=", page, sep="")
    tmp = xmlParseDoc(twitter_search, asText=FALSE)
    tweets = c(tweets, xpathSApply(tmp, "//s:entry/s:title",
                         xmlValue, namespaces=c('s'='http://www.w3.org/2005/Atom')))
}

print(tweets)
class(tweets)

Then, replacing the Spanish characters (á, é, í, ...) isn't working:

tweets = gsub("<U\\+00E1>", "a", tweets)
tweets = gsub("<U\\+00E9>", "e", tweets)

We can see in tweet 1699 that the result is incorrect:

print(tweets[1699])

I managed to get it working by changing the encoding of the tweets to:

Encoding(tweets) <- "ISO-8859"

# Replace spanish character with accent for "normal" character

tweets = gsub("\303\272", "u", tweets)
tweets = gsub("\303\241", "a", tweets)
tweets = gsub("\303\255", "i", tweets)
tweets = gsub("\303\263", "o", tweets)
tweets = gsub("\303\251", "e", tweets)
tweets = gsub("\303\271", "u", tweets)
tweets = gsub("\303\201", "A", tweets)
tweets = gsub("\303\211", "E", tweets)
tweets = gsub("\342\234\224", "", tweets)
tweets = gsub("\302\241", "", tweets)
tweets = gsub("\302\277", "", tweets)

I think there must be a better solution. I would like to know why changing the encoding makes gsub() work, and why it did not work on the tweets before.

R version 2.15.3 (2013-03-01) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

score 3 · Accepted Answer

In regular expressions, the + symbol has a special meaning. You can either use the fixed = TRUE argument of gsub or escape the special character:

tweet = gsub("<U\\+00E9>", "e", tweet)
tweet = gsub("<U\\+00E1>", "a", tweet)
tweet = gsub("<U\\+00BF>", "" , tweet)


## [1] "RT @LuchoBugallo: Quieren una primicia? @CFKArgentina el #18A se va a      #Venezuela. Cual sera el motivo que la moviliza hacer un viaje d ..."
## [2] "RT @LuchoBugallo: #18A - Ya estan apareciendo las cuentas truchas de militontos, que usan s<U+00F3>lo en epoca de cacerolazos!"
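
To make the difference concrete, here is a minimal sketch on a made-up string (the sample x is an assumption for illustration, not one of the harvested tweets). It assumes the string really contains the literal text <U+00E9>. Unescaped, U+ is read as "one or more U", so the pattern cannot match the literal plus sign:

```r
# Hypothetical sample containing the literal text "<U+00E9>"
x <- "caf<U+00E9> con leche"

# Unescaped: "U+" means "one or more U", so the literal "+" is never matched
unescaped <- gsub("<U+00E9>", "e", x)

# Escaped: "\\+" matches a literal plus sign
escaped <- gsub("<U\\+00E9>", "e", x)

# fixed = TRUE: the whole pattern is treated as a literal string
literal <- gsub("<U+00E9>", "e", x, fixed = TRUE)

print(c(unescaped, escaped, literal))
```

Only the first call leaves the string unchanged; the escaped and fixed versions behave the same here.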

score 0 · Accepted Answer

Use the option fixed = TRUE, because the + is probably interfering with the regular expression:

tweet = gsub("<U+00E9>", "e", tweet, fixed = TRUE)
tweet = gsub("<U+00E1>", "a", tweet, fixed = TRUE)
tweet = gsub("<U+00BF>", "" , tweet, fixed = TRUE)
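
If there are many of these to replace, one option is to keep them in a lookup table and loop over it. This is only a sketch: the names replacements and strip_escapes are made up for illustration, the table is not exhaustive, and it again assumes the strings contain the literal <U+xxxx> text.

```r
# Hypothetical lookup table: literal escape text -> plain replacement
replacements <- c("<U+00E1>" = "a", "<U+00E9>" = "e", "<U+00ED>" = "i",
                  "<U+00F3>" = "o", "<U+00FA>" = "u", "<U+00BF>" = "")

# Apply every replacement literally (no regex interpretation)
strip_escapes <- function(x, map = replacements) {
  for (from in names(map)) {
    x <- gsub(from, map[[from]], x, fixed = TRUE)
  }
  x
}

cleaned <- strip_escapes("s<U+00F3>lo en <U+00E9>poca de cacerolazos")
print(cleaned)
```

Adding another character is then a one-line change to the table rather than another gsub call.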

score -1 · Accepted Answer

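The answer I was looking for, which solves the problem completely, is as follows: just change R's language. I had it set to Spanish, and that was the root of the encoding problem with the tweets.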

The solution is simply to run this command in the Mac OS X terminal:

defaults write org.R-project.R force.LANG en_US.UTF-8
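
To check from inside R whether the locale is the culprit, something like the following may help. It is a sketch under one assumption: that the <U+00F3> seen in the printed tweets is only how R displays a character the current locale cannot represent, and the string itself holds the real character (stored in UTF-8 as the bytes c3 b3, the same bytes the question matches with "\303\263"). The sample x is made up.

```r
# Build "sólo" with a \u escape so the example does not depend on
# how this file itself is encoded
x <- "s\u00f3lo"

Sys.getlocale("LC_CTYPE")   # the character-type locale R is running in
l10n_info()                 # reports whether that locale is UTF-8
Encoding(x)                 # declared encoding mark of the string
bytes <- charToRaw(x)       # underlying bytes of the string
print(bytes)

# Match the real character rather than the printed "<U+00F3>" text
plain <- gsub("\u00f3", "o", x, fixed = TRUE)
print(plain)
```

If that assumption holds, matching on the \u escape works regardless of how the character happens to be printed in the console.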
