
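This question may look like a duplicate, but I am having trouble extracting country names from a string. I have already gone through this link, [link] Extracting country names from author affiliations, but it did not solve my problem. I tried text matching and replacement using grepl inside a for loop, but my data column has more than 300k rows, so a for loop that calls grepl for pattern matching is very slow.

I have a column like this.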

org_loc

Zug
Zug  Canton of Zug
Zimbabwe
Zigong
Zhuhai
Zaragoza 
York  United Kingdom
Delhi
Yalleroi  Queensland
Waterloo  Ontario
Waterloo  ON 
Washington  D.C.
Washington D.C. Metro 
New York


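# Build the data frame directly; assigning to df$org_loc fails if df does not exist yet
df <- data.frame(org_loc = c("zug", "zug  canton of zug", "zimbabwe",
                             "zigong", "zhuhai", "zaragoza",
                             "York  United Kingdom", "Delhi",
                             "Yalleroi  Queensland", "Waterloo  Ontario",
                             "Waterloo  ON", "Washington  D.C.",
                             "Washington D.C. Metro", "New York"))

Each string may contain the name of a state, a city, or a country. I only want the country as output, like this: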

org_loc

Switzerland
Switzerland
Zimbabwe
China
China
Spain
United Kingdom
India
Australia
Canada
Canada
United States
United States
United States

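I am trying to use the countrycode library to convert a state (when a match is found) to its country, but I have not been able to. Any help is appreciated.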


3 Answers


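You can use your City_and_province_list.csv as a custom dictionary for countrycode. A custom dictionary cannot have duplicates in the origin vector (the City column of your City_and_province_list.csv), so you have to remove them or otherwise deal with them first (as in the example below). At the moment your lookup CSV does not contain every string that appears in your example, so not all of them are converted, but if you add all of the possible strings to the CSV it will work completely.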

library(countrycode)

org_loc <- c("Zug", "Zug  Canton of Zug", "Zimbabwe", "Zigong", "Zhuhai",
             "Zaragoza", "York  United Kingdom", "Delhi",
             "Yalleroi  Queensland", "Waterloo  Ontario", "Waterloo  ON",
             "Washington  D.C.", "Washington D.C. Metro", "New York")
df <- data.frame(org_loc)

city_country <- read.csv("https://raw.githubusercontent.com/girijesh18/dataset/master/City_and_province_list.csv")

# custom_dict for countrycode cannot have duplicate origin codes
city_country <- city_country[!duplicated(city_country$City), ]

df$country <- countrycode(df$org_loc, "City", "Country", 
                          custom_dict = city_country)

df
#                  org_loc                  country
# 1                    Zug              Switzerland
# 2     Zug  Canton of Zug                     <NA>
# 3               Zimbabwe                     <NA>
# 4                 Zigong                    China
# 5                 Zhuhai                    China
# 6               Zaragoza                    Spain
# 7   York  United Kingdom                     <NA>
# 8                  Delhi                    India
# 9   Yalleroi  Queensland                     <NA>
# 10     Waterloo  Ontario                     <NA>
# 11          Waterloo  ON                     <NA>
# 12      Washington  D.C.                     <NA>
# 13 Washington D.C. Metro                     <NA>
# 14              New York United States of America
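
Many of the NA rows above are multi-part strings such as "York  United Kingdom", where only one part is in the dictionary. As a rough sketch (not part of the original answer), the strings could be split on the double spaces and each part looked up with match(), which avoids calling grepl once per row. The lookup table, its column names, and the helper first_match below are hypothetical and only cover a few of the example strings.

```r
# Hypothetical lookup table; in practice it would come from a file such as
# City_and_province_list.csv, extended with state and province rows.
lookup <- data.frame(
  place   = c("zug", "canton of zug", "zimbabwe", "united kingdom",
              "queensland", "ontario", "on"),
  country = c("Switzerland", "Switzerland", "Zimbabwe", "United Kingdom",
              "Australia", "Canada", "Canada"),
  stringsAsFactors = FALSE
)

org_loc <- c("Zug", "Zug  Canton of Zug", "Zimbabwe", "York  United Kingdom",
             "Yalleroi  Queensland", "Waterloo  ON", "Nowhere")

# Split on runs of two or more spaces, then normalise case and whitespace.
parts <- strsplit(org_loc, " {2,}")
parts <- lapply(parts, function(p) tolower(trimws(p)))

# Return the country of the first part that appears in the lookup table,
# or NA when no part is found.
first_match <- function(p) {
  idx <- match(p, lookup$place)
  idx <- idx[!is.na(idx)]
  if (length(idx) == 0) NA_character_ else lookup$country[idx[1]]
}

country <- vapply(parts, first_match, character(1))
```

Lower-casing both sides also sidesteps the mixed capitalisation in the question's data. Strings that are still NA afterwards show which entries are missing from the lookup table.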

answered 2018-03-26T15:43:18.397

library(countrycode)
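# Note: df is a plain character vector here, not a data frame
df <- c("zug  switzerland", "zug  canton of zug  switzerland", "zimbabwe",
        "zigong  chengdu  pr china", "zhuhai  guangdong  china", "zaragoza",
        "York  United Kingdom", "Yamunanagar",
        "Yalleroi  Queensland  Australia", "Waterloo  Ontario",
        "Waterloo  ON", "Washington  D.C.", "Washington D.C. Metro", "USA")
# Match country names found anywhere in each string and return the standard name
df1 <- countrycode(df, 'country.name', 'country.name')

It fails to match many of them, but that is consistent with how countrycode works: it can only recognise strings that actually contain a country name.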

answered 2018-03-22T18:31:17.323

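Using the geocode function from the ggmap package you can accomplish the task, though not with complete accuracy. You also have to apply your own criteria to decide that, for example, "Zaragoza" is the city in Spain (which is what geocode returns) and not a place in Argentina; when several places share the same name, geocode tends to give you the biggest city. (Remove $country to see the full output.)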

library(ggmap)
org_loc <- c("zug", "zug  canton of zug", "zimbabwe",
             "zigong", "zhuhai", "zaragoza", "York  United Kingdom",
             "Delhi", "Yalleroi  Queensland", "Waterloo  Ontario",
             "Waterloo  ON", "Washington  D.C.", "Washington D.C. Metro",
             "New York")
geocode(org_loc, output = "more")$country

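Since the geocoding is provided by Google, it has a query limit of 2,500 per IP address per day. If it returns NAs, that may be because the limit check is inconsistent, so just try again.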
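
With a column of more than 300k rows, the daily limit makes it worth querying each distinct string only once. The following is a minimal sketch of that idea, not part of the original answer: lookup_country is a hypothetical stand-in for the real geocoding call, and the sample values are made up.

```r
org_loc <- c("Delhi", "New York", "delhi ", "Zug", "New York", "Delhi")

# Normalise the strings, then keep one copy of each distinct value.
key <- tolower(trimws(org_loc))
unique_keys <- unique(key)

# Hypothetical stand-in for the expensive lookup. With ggmap it would be
# something like geocode(unique_keys, output = "more")$country.
lookup_country <- function(x) {
  known <- c(delhi = "India", "new york" = "United States",
             zug = "Switzerland")
  unname(known[x])
}

# One lookup per distinct string instead of one per row.
unique_country <- lookup_country(unique_keys)

# Map the results back onto the full column.
country <- unique_country[match(key, unique_keys)]
```

Here six rows need only three lookups. On real data the saving depends on how often the same location string repeats.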

answered 2018-03-22T18:31:49.930