@thelatemail has the right idea about how to proceed. Here is a small function I put together that may help you get started on a more robust solution:
read.dat.dct <- function(dat, dct) {
  # Read the dictionary file, one element per line
  temp <- readLines(dct)
  # Capture groups: start position, storage type, column name, column width
  pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+([a-z0-9_]+)\\s+%([0-9]+).*"
  classes <- c("numeric", "character", "character", "numeric")
  metadata <- setNames(lapply(1:4, function(x) {
    # Keep only the x-th capture group from each "_column" line
    out <- gsub(pattern, paste("\\", x, sep = ""), temp)
    # Trim whitespace and blank out the opening "... {" and closing "}" lines
    out <- gsub("^\\s+|\\s+$|.*\\{|\\}", "", out)
    out <- out[out != ""]
    class(out) <- classes[x] ; out }),
    c("StartPos", "Str", "ColName", "ColWidth"))
  read.fwf(dat, widths = metadata[["ColWidth"]],
           col.names = metadata[["ColName"]])
}
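You will still need to do a lot of work on error checking, generalizing the function, and so on. For example, this function does not work with overlapping columns, as in the example @thelatemail added to your question. A check of the form "StartPos[n] + ColWidth[n]" should equal "StartPos[n+1]"; if it does not, the function could stop reading the file with an error message. In addition, the classes of the resulting data can be derived from the "metadata" list the function builds and passed to read.fwf through its colClasses argument.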
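The checks described above could be sketched roughly as follows. This is only an illustration, not part of the original function: build_widths and stata_class are hypothetical helper names, and the rule that every storage type other than strN is read as numeric is an assumption. The sketch relies on read.fwf skipping fields whose width is negative, so gaps between columns are tolerated while overlaps stop with an error.

```r
# Hypothetical helper: turn start positions and widths into a 'widths' vector
# for read.fwf(), stopping if any two fields overlap.
build_widths <- function(start, width) {
  n <- length(start)
  end <- start + width          # first position after each field
  gap <- start[-1] - end[-n]    # space between consecutive fields
  if (any(gap < 0)) {
    stop("Overlapping columns after field(s): ",
         paste(which(gap < 0), collapse = ", "))
  }
  # Put a negative width in front of every field that is preceded by a gap;
  # read.fwf() skips fields with negative widths.
  skip <- c(-(start[1] - 1), -gap)
  out <- as.vector(rbind(skip, width))
  out[out != 0]
}

# Hypothetical helper: map dictionary storage types to R classes.
# Assumption: anything that is not "strN" is treated as numeric.
stata_class <- function(type) {
  ifelse(grepl("^str[0-9]*$", type), "character", "numeric")
}

# Values matching the test.dct example from this answer
start <- c(1, 2, 6, 7, 10)
width <- c(1, 4, 1, 3, 16)
type  <- c("str1", "int", "str1", "int", "str16")

# Write the two example records to a temporary file and read them back
tmp <- tempfile(fileext = ".dat")
writeLines(c("C1245A101George Costanza", "B1223B011Cosmo Kramer"), tmp)

df <- read.fwf(tmp, widths = build_widths(start, width),
               col.names = c("code", "call", "city", "neigh", "name"),
               colClasses = stata_class(type))
str(df)
```

Inside read.dat.dct, the same helpers would take metadata[["StartPos"]], metadata[["ColWidth"]] and metadata[["Str"]] as their inputs.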
Here are a dat file and a dct file to demonstrate:
Copy and paste the following two lines into a text editor and save them as "test.dat" in your working directory.
C1245A101George Costanza
B1223B011Cosmo Kramer
Copy and paste the following lines into a text editor and save them as "test.dct" in your working directory.
dictionary using test.dat {
_column(1) str1 code %1s
_column(2) int call %4f
_column(6) str1 city %1s
_column(7) int neigh %3f
_column(10) str16 name %16s
}
Now, run the function:
read.dat.dct(dat = "test.dat", dct = "test.dct")
# code call city neigh name
# 1 C 1245 A 101 George Costanza
# 2 B 1223 B 11 Cosmo Kramer
Update: An improved function (still with plenty of room for improvement)
read.dat.dct <- function(dat, dct, labels.included = "no") {
  temp <- readLines(dct)
  # Keep only the lines that define a column
  temp <- temp[grepl("_column", temp)]
  # Choose the pattern depending on whether variable labels follow the format
  switch(labels.included,
         yes = {
           pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+(.*)\\s+%([0-9]+)[a-z]\\s+(.*)"
           classes <- c("numeric", "character", "character", "numeric", "character")
           N <- 5
           NAMES <- c("StartPos", "Str", "ColName", "ColWidth", "ColLabel")
         },
         no = {
           pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+(.*)\\s+%([0-9]+).*"
           classes <- c("numeric", "character", "character", "numeric")
           N <- 4
           NAMES <- c("StartPos", "Str", "ColName", "ColWidth")
         })
  metadata <- setNames(lapply(1:N, function(x) {
    # Keep only the x-th capture group, then trim whitespace and drop quotes
    out <- gsub(pattern, paste("\\", x, sep = ""), temp)
    out <- gsub("^\\s+|\\s+$", "", out)
    out <- gsub('\"', "", out, fixed = TRUE)
    class(out) <- classes[x] ; out }), NAMES)
  # Turn the column names into syntactically valid R names
  metadata[["ColName"]] <- make.names(gsub("\\s", "", metadata[["ColName"]]))
  myDF <- read.fwf(dat, widths = metadata[["ColWidth"]],
                   col.names = metadata[["ColName"]])
  # Attach the variable labels as an attribute of the data frame
  if (labels.included == "yes") {
    attr(myDF, "col.label") <- metadata[["ColLabel"]]
  }
  myDF
}
How does it handle your data?
temp <- read.dat.dct(dat = "http://dl.getdropbox.com/u/18116710/21600-0009-Data.txt",
dct = "http://dl.getdropbox.com/u/18116710/21600-0009-Setup.dct",
labels.included = "yes")
dim(temp) # How big is the dataset?
# [1] 180 40
head(temp[, 1:6]) # What do the first few columns & rows look like?
# CASEID AID RRELNO RPREGNO H3PC1.H3PC1 H3PC2.H3PC2
# 1 1 57118381 5 1 1 1
# 2 2 57134970 1 2 1 1
# 3 3 57135078 1 1 1 1
# 4 4 57135078 5 1 1 1
# 5 5 57164981 1 1 7 3
# 6 6 57191909 1 3 1 1
head(attr(temp, "col.label")) # What are the variable labels?
# [1] "CASE IDENTIFICATION NUMBER" "RESPONDENT IDENTIFIER"
# [3] "ROMANTIC RELATIONSHIP NUMBER" "RELATIONSHIP PREGNANCY NUMBER"
# [5] "S23Q1 1 TOLD PARTNER PREGNANT-W3" "S23Q2 MONTHS PREG WHEN TOLD PARTNER-W3"
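Because the labels are stored as a plain character vector in the same order as the columns, it may be convenient to pair them with the column names so a label can be looked up by variable name. A minimal sketch, using a made-up two-column data frame in place of the real result:

```r
# Toy data frame standing in for the result of read.dat.dct();
# the values and labels here are for illustration only.
df <- data.frame(CASEID = 1:2, AID = c(57118381, 57134970))
attr(df, "col.label") <- c("CASE IDENTIFICATION NUMBER", "RESPONDENT IDENTIFIER")

# Pair each label with its column name so labels can be looked up by name.
labels <- setNames(attr(df, "col.label"), names(df))
labels[["AID"]]
```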
What about the original example?
read.dat.dct("test.dat", "test.dct", labels.included = "no")
# code call city neigh name
# 1 C 1245 A 101 George Costanza
# 2 B 1223 B 11 Cosmo Kramer