r - Removing citations from strings in data in R

Question

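I've pulled tax data from Wikipedia and am reshaping it, but I can't get the citation tags out of the data ( http://en.wikipedia.org/wiki/List_of_countries_by_tax_rates#Countries ). At first I tried to remove them by using strsplit on [, but this is what I got: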

URL <- "http://en.wikipedia.org/wiki/List_of_countries_by_tax_rates#Countries"

library(XML) 
taxes <- readHTMLTable(URL, which=2) 

matrix(unlist(strsplit(taxes$Country, "\\[")), ncol = 2, byrow = TRUE)
     [,1]        [,2]
[1,] "Albania"   "1]"
[2,] "Algeria"   "3]"
[3,] "Andorra"   "citation needed]"
[4,] "Angola"    "1]"
[5,] "Argentina" "Armenia"
[6,] "1]"        "Aruba"

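Ultimately, I want to remove the citations (the numbers or "citation needed", along with the brackets around them). I was hoping to get the numbers in the second column and the country names in the first, so that I could keep just the names, but when a country has no footnote the columns get mixed up. I also looked into using cSplit for this, but had no success with that either. Any suggestions?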

score 1 · Accepted Answer

I think this regular expression will work:

URL <- "http://en.wikipedia.org/wiki/List_of_countries_by_tax_rates#Countries"

library(XML) 
taxes <- readHTMLTable(URL, which=2) 

gsub("\\[(\\d+|citation needed)\\]", "", taxes$Country)
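
To check the pattern without fetching the page, here is a minimal sketch on a small made-up vector. The `country` and `cleaned` names and the sample values are assumptions for illustration, not the actual contents of `taxes$Country`. The last line also shows why the strsplit attempt misaligned: entries without a footnote produce one piece instead of two, so filling a two-column matrix from the flattened result shifts everything after them.

```r
# Hypothetical sample standing in for taxes$Country (not fetched from Wikipedia)
country <- c("Albania[1]", "Algeria[3]", "Andorra[citation needed]",
             "Argentina", "Armenia[1]")

# Remove a bracketed number or the literal text "citation needed",
# brackets included; entries without a footnote are left unchanged
cleaned <- gsub("\\[(\\d+|citation needed)\\]", "", country)
print(cleaned)

# Number of pieces strsplit yields per entry: "Argentina" gives only one,
# which is what breaks the ncol = 2 matrix in the question
print(lengths(strsplit(country, "\\[")))
```

If the column was read in as a factor, gsub coerces it to character, so the result would need to be assigned back, for example with `taxes$Country <- gsub(...)`.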
