r - How do I read a text file into R when the variables are not stored on one row and there is no standard column delimiter?

Question

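I am trying to read a text file (https://www.bls.gov/bdm/us_age_naics_00_table5.txt) into R, but I am not sure how to parse it. As you can see, the column names (years) are not all on the same row, and the spacing between columns is inconsistent. I am familiar with read.csv() and read.delim(), but I am not sure how to read a complex file like this one.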

score 0 · Accepted Answer

Here is a manual parse:

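library(readr)

# Read the file as a character vector, one element per line
string = read_lines(file = "https://www.bls.gov/bdm/us_age_naics_00_table5.txt")
string = string[nchar(string) != 0]   # drop empty lines
string = string[-c(1, 2)]             # first two lines don't contain information
string = string[string != " "]        # drop lines holding a single space
string = string[-151]                 # footnote
# matrix() fills column by column, so each column holds one 30-line block
sMatrix = matrix(string, nrow = 30)
# Parse each block; a string containing newlines is read as literal data
dfList = lapply(1:ncol(sMatrix), function(x) read_table(paste(sMatrix[, x], collapse = "\n")))
df = do.call(cbind, dfList)
df = df[, !duplicated(colnames(df))]  # removes columns with duplicate names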
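The reshape-then-parse idea is easier to follow on a small input. Below is a minimal sketch on made-up data, using only base R. Here read.table() stands in for readr::read_table(), and the `lines` vector, the block height of 3, and every value are hypothetical rather than taken from the BLS file.

```r
# Toy input: two stacked blocks of 3 lines each (1 header + 2 data rows).
# Each block repeats the "name" column and carries different year columns.
lines <- c(
  "name   2000   2001",
  "a      1      2",
  "b      3      4",
  "name   2002   2003",
  "a      5      6",
  "b      7      8"
)

# matrix() fills column by column, so each column holds one block of lines.
blocks <- matrix(lines, nrow = 3)

# Parse each block as whitespace-separated text.
parsed <- lapply(seq_len(ncol(blocks)), function(j) {
  read.table(text = blocks[, j], header = TRUE, check.names = FALSE)
})

# Bind the blocks side by side and drop the repeated "name" column.
wide <- do.call(cbind, parsed)
wide <- wide[, !duplicated(colnames(wide))]
print(wide)
```

The approach only works when every block has the same number of lines, which is why the answer removes blank lines and the footnote before calling matrix().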

If you want to recode "_" as NA and format the numbers:

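df[df == "_"] = NA
# Strip thousands separators; keep the results as character, not factor
df = as.data.frame(sapply(df, function(x) gsub(",", "", x)), stringsAsFactors = FALSE)
# TRUE for columns that convert to numeric without producing NAs (column 1, for example, does not)
i <- apply(df, 2, function(x) !any(is.na(suppressWarnings(as.numeric(na.omit(x))))))
df[i] = lapply(df[i], as.numeric)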
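The cleaning step can be checked on a small data frame as well. This is a sketch under assumptions: the column names and values are invented, and vapply() is used in place of apply() so that the result is always a logical vector.

```r
# Toy data frame of strings; all values are hypothetical.
df <- data.frame(
  name  = c("a", "b", "c"),
  y2000 = c("1,234", "_", "56"),
  stringsAsFactors = FALSE
)

df[df == "_"] <- NA                               # placeholder -> NA
df[] <- lapply(df, function(x) gsub(",", "", x))  # strip thousands separators

# A column counts as numeric if every non-missing value parses as a number.
is_num <- vapply(df, function(x) {
  !anyNA(suppressWarnings(as.numeric(x[!is.na(x)])))
}, logical(1))

# Convert only those columns; the others stay as character.
df[is_num] <- lapply(df[is_num], as.numeric)
str(df)
```

Indexing with df[is_num] rather than df[, is_num] keeps the result a data frame even when only one column qualifies.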
