This uses two facts: the good lines end in a one-digit field, and every field other than the first is numeric:
URL <- "http://lib.stat.cmu.edu/datasets/sleep"
L <- readLines(URL)
# lines ending in a one digit field
good.lines <- grep(" \\d$", L, value = TRUE)
# insert commas before numeric fields
lines.csv <- gsub("( [-0-9.])", ",\\1", good.lines)
# re-read
DF <- read.table(text = lines.csv, sep = ",", as.is = TRUE, strip.white = TRUE,
                 na.strings = "-999.0")
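To try the same steps without downloading the file, here is a minimal sketch on a few made-up lines. The name `sample.lines` and its contents are my own assumptions that only imitate the layout of the file (a name, then numeric fields, with -999.0 marking missing values); they are not copied from it.

```r
# Hypothetical sample input: one non-data line and two data lines.
sample.lines <- c(
  "Some descriptive header text",
  "African elephant   6654.000 5712.0 -999.0 -999.0  3.3 38.6 645.0 3 5 3",
  "Baboon               10.550  179.5    9.1    0.7  9.8 27.0 180.0 4 4 4"
)

# keep only the lines ending in a one-digit field
good.lines <- grep(" \\d$", sample.lines, value = TRUE)

# insert a comma before each numeric field (a space followed by a digit, "-" or ".")
lines.csv <- gsub("( [-0-9.])", ",\\1", good.lines)

# re-read as comma-separated data; -999.0 becomes NA
DF <- read.table(text = lines.csv, sep = ",", as.is = TRUE, strip.white = TRUE,
                 na.strings = "-999.0")
str(DF)
```

The multi-word name survives as a single field because a comma is only inserted where a space is directly followed by a digit, a minus sign or a dot. A name containing such a sequence would be split, so this relies on the names being plain words.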
If you are also interested in the headings, here is some code for that. If you are not interested in the headings, ignore the rest.
# get headings: among the lines that start at the left edge, the headings are
# the ncol(DF) consecutive lines beginning with the one that contains "species"
headings0 <- grep("^[^ ]", L, value = TRUE)
i <- grep("species", headings0)
headings <- headings0[seq(i, length.out = ncol(DF))]
# The headings are a bit long so we shorten them to the first word
names(DF) <- sub(" .*$", "", headings)
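The heading extraction can be tried in isolation as well. This is only a sketch: the lines in `headings0` and the value of `n.cols` (standing in for `ncol(DF)`) are hypothetical, chosen to show the mechanics rather than to reproduce the file.

```r
# Hypothetical left-aligned lines; only the last three play the role of headings.
headings0 <- c(
  "Data from some source",
  "species of animal",
  "body weight in kg",
  "brain weight in g"
)
n.cols <- 3  # stands in for ncol(DF)

# locate the first heading line and take n.cols consecutive lines from there
i <- grep("species", headings0)
headings <- headings0[seq(i, length.out = n.cols)]

# keep only the first word of each heading
short <- sub(" .*$", "", headings)
print(short)
```

If two headings happened to share the same first word, the shortened names would collide, so it is worth checking for duplicates before assigning them to `names(DF)`.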
This gives:
> head(DF)
                    species     body  brain slow paradoxical total maximum
1          African elephant 6654.000 5712.0   NA          NA   3.3    38.6
2 African giant pouched rat    1.000    6.6  6.3         2.0   8.3     4.5
3                Arctic Fox    3.385   44.5   NA          NA  12.5    14.0
4    Arctic ground squirrel    0.920    5.7   NA          NA  16.5      NA
5            Asian elephant 2547.000 4603.0  2.1         1.8   3.9    69.0
6                    Baboon   10.550  179.5  9.1         0.7   9.8    27.0
  gestation predation sleep overall
1       645         3     5       3
2        42         3     1       3
3        60         1     1       1
4        25         5     2       3
5       624         3     5       4
6       180         4     4       4
Update: minor simplification of the whitespace trimming
Update 2: shortened the headings
Update 3: added na.strings = "-999.0"