
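Can anyone help me import this data from a text or .dat file into R? The fields are separated by spaces, but a city name should not be split into two fields. NEW YORK, for example, is a single name.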

1 NEW YORK  7,262,700
2 LOS ANGELES  3,259,340
3 CHICAGO  3,009,530
4 HOUSTON  1,728,910
5 PHILADELPHIA  1,642,900
6 DETROIT  1,086,220
7 SAN DIEGO  1,015,190
8 DALLAS  1,003,520
9 SAN ANTONIO  914,350
10 PHOENIX  894,070

3 Answers


A variation on a theme... but first, some sample data:

cat("1 NEW YORK  7,262,700",
    "2 LOS ANGELES  3,259,340",
    "3 CHICAGO  3,009,530",
    "4 HOUSTON  1,728,910",
    "5 PHILADELPHIA  1,642,900",
    "6 DETROIT  1,086,220",
    "7 SAN DIEGO  1,015,190",
    "8 DALLAS  1,003,520",
    "9 SAN ANTONIO  914,350",
    "10 PHOENIX  894,070", sep = "\n", file = "test.txt")

Step 1: Read the data in with readLines.

x <- readLines("test.txt")

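Step 2: Figure out a regular expression that can be used to insert delimiters. Here, the pattern seems to be (working from the end of the line) a set of digits and commas, preceded by whitespace, preceded by some words in all capitals. We can capture those groups and insert "tab" delimiters (\t). The extra backslashes are there to escape them properly.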

gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x)
#  [1] "1\t NEW YORK  \t7,262,700"     "2\t LOS ANGELES  \t3,259,340" 
#  [3] "3\t CHICAGO  \t3,009,530"      "4\t HOUSTON  \t1,728,910"     
#  [5] "5\t PHILADELPHIA  \t1,642,900" "6\t DETROIT  \t1,086,220"     
#  [7] "7\t SAN DIEGO  \t1,015,190"    "8\t DALLAS  \t1,003,520"      
#  [9] "9\t SAN ANTONIO  \t914,350"    "10\t PHOENIX  \t894,070"  
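
As a hedged variation on the pattern above, the whole line can be anchored and split into three explicit groups, which keeps the surrounding spaces out of the captured city name. This is only a sketch: it assumes every line has the form number, name, number-with-commas, it uses perl = TRUE for the lazy quantifier, and the names x and tabbed are illustrative.

```r
# Sample lines in the same layout as the question's data (illustrative subset).
x <- c("1 NEW YORK  7,262,700", "3 CHICAGO  3,009,530", "10 PHOENIX  894,070")

# ^(\d+)          leading row number
# \s+(.*?)        city name, matched lazily so trailing spaces are excluded
# \s+([0-9,]+)$   population with thousands separators at the end of the line
tabbed <- sub("^(\\d+)\\s+(.*?)\\s+([0-9,]+)$", "\\1\t\\2\t\\3", x, perl = TRUE)
tabbed
```

Because the spaces are consumed outside the groups, the result could be passed to read.delim(text = tabbed, header = FALSE) without relying on strip.white.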

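Step 3: Since we know our gsub call works, and we know that read.delim has a "text" argument that can be used instead of the "file" argument, we can call read.delim directly on the result of gsub: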

out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x), 
                  header = FALSE, strip.white = TRUE)
out
#    V1           V2        V3
# 1   1     NEW YORK 7,262,700
# 2   2  LOS ANGELES 3,259,340
# 3   3      CHICAGO 3,009,530
# 4   4      HOUSTON 1,728,910
# 5   5 PHILADELPHIA 1,642,900
# 6   6      DETROIT 1,086,220
# 7   7    SAN DIEGO 1,015,190
# 8   8       DALLAS 1,003,520
# 9   9  SAN ANTONIO   914,350
# 10 10      PHOENIX   894,070

A possible last step is to convert the third column to numeric:

out$V3 <- as.numeric(gsub(",", "", out$V3))
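
If a single step is preferred, utils::strcapture (present in recent versions of R) can split each line by capture groups straight into a data frame. The following is a sketch, not part of the original answer: it assumes city names end in an uppercase letter and that every line follows the layout above, and the column names num, city, and population are made up for illustration.

```r
# Illustrative subset of the question's data.
x <- c("1 NEW YORK  7,262,700", "9 SAN ANTONIO  914,350", "10 PHOENIX  894,070")

# Prototype describing the name and type of each captured column.
proto <- data.frame(num = integer(), city = character(),
                    population = character(), stringsAsFactors = FALSE)

# Groups: row number, city name (ending in a capital letter), population.
cities <- strcapture("^([0-9]+) +(.*[A-Z]) +([0-9,]+)$", x, proto)

# Drop the thousands separators before converting to numeric.
cities$population <- as.numeric(gsub(",", "", cities$population))
cities
```

Lines that do not match the pattern would come back as NA rows, so it is worth checking the result with anyNA(cities) on real data.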
Answered 2013-09-22T14:50:34.747

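For your particular data, where the spaces that belong inside a field occur only between capital letters, consider using a regular expression: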

gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "1 NEW YORK  7,262,700")
# [1] "1 NEW-YORK  7,262,700"
gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", "3 CHICAGO  3,009,530")
# [1] "3 CHICAGO  3,009,530"

You can then treat whitespace as the field separator.
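
To make that last step concrete, here is a minimal sketch that feeds the hyphenated lines to read.table, whose default separator is any run of whitespace. The column names are hypothetical, and the approach assumes no city name contains a genuine hyphen, since every hyphen is turned back into a space at the end.

```r
# Illustrative subset of the question's data.
x <- c("1 NEW YORK  7,262,700", "3 CHICAGO  3,009,530")

# Join the words of a multi-word city name with a hyphen.
hyphenated <- gsub("([A-Z]) ([A-Z]+)", "\\1-\\2", x)

# Parse on whitespace; the population stays character because of the commas.
cities <- read.table(text = hyphenated, header = FALSE,
                     col.names = c("num", "city", "population"),
                     stringsAsFactors = FALSE)

# Restore the spaces in the names and convert the population to numeric.
cities$city <- gsub("-", " ", cities$city)
cities$population <- as.numeric(gsub(",", "", cities$population))
cities
```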

Answered 2013-09-22T07:42:31.680

Expanding on @Hugh's answer, I would try the following, although it is not particularly efficient.

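# Read the file one line per element.
lines <- scan("cities.txt", sep = "\n", what = "character")
# Join multi-word city names with a hyphen; gsub is vectorized, so no lapply is needed.
lines <- gsub(pattern = "([a-zA-Z]) ([a-zA-Z]+)", replacement = "\\1-\\2", lines)

# Preallocate the result with one row per line.
citiesDF <- data.frame(num  = rep(0, length(lines)), 
                       city = rep("", length(lines)), 
                       population = rep(0, length(lines)),
                       stringsAsFactors = FALSE)

for (i in seq_along(lines)) {
   # Split on runs of spaces: row number, hyphenated city, population.
   splitted <- strsplit(lines[i], " +")[[1]]
   citiesDF[i, "num"] <- as.numeric(splitted[1])
   # Turn the hyphens back into spaces.
   citiesDF[i, "city"] <- gsub("-", " ", splitted[2])
   # Remove the thousands separators before converting.
   citiesDF[i, "population"] <- as.numeric(gsub(",", "", splitted[3]))
}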
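
Since the loop is the inefficient part, one possible alternative is to split all lines at once and bind the pieces into a matrix. This is a sketch only: it assumes every line splits into exactly three fields and that no city name contains a genuine hyphen, and it uses two inline sample lines in place of the file.

```r
# Two sample lines standing in for the contents of the file (illustrative).
lines <- c("1 NEW YORK  7,262,700", "10 PHOENIX  894,070")
lines <- gsub("([a-zA-Z]) ([a-zA-Z]+)", "\\1-\\2", lines)

# One row per line, one column per field.
parts <- do.call(rbind, strsplit(lines, " +"))

citiesDF <- data.frame(num = as.numeric(parts[, 1]),
                       city = gsub("-", " ", parts[, 2]),
                       population = as.numeric(gsub(",", "", parts[, 3])),
                       stringsAsFactors = FALSE)
citiesDF
```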
Answered 2013-09-22T08:00:33.520