This is the input file (http ://www.yourfilelink.com/get.php?fid=841283). I ran:

options(stringsAsFactors=FALSE)
x=read.csv("test1.csv", header = FALSE, sep="'")

The result looks like this: http ://www.yourfilelink.com/get.php?fid=841284

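I get only 7 rows instead of 135! The number of columns is correct at 13. x[6,10] also contains the content of the rows that follow it, separated by \n within the string.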

Please help me. I am stuck on this problem! :/

2 Answers

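The symptom you describe, a very long item containing several "\n", suggests that you are dealing with unmatched quotes. If a name or address entry contains a quote character, the parser waits for the next quote before it considers the entry finished. Try: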

x=read.csv("test1.csv", header = FALSE, sep="'", quote="")
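
A small reproduction may make the mechanism clearer. The sketch below uses made-up toy data (not the asker's file), with `"` standing in for the stray quote character because that is what `read.csv` treats as a quote by default. The names `txt`, `bad`, and `good` are illustrative only.

```r
# Toy data (hypothetical): four lines, three comma-separated fields each.
# Lines 1 and 3 each contain a lone double quote (an inch mark).
txt <- c('1,5" disk,x',
         '2,plain,y',
         '3,3" disk,z',
         '4,plain,w')

# Default quoting: the quote on line 1 pairs up with the one on line 3,
# so the text in between (newlines included) lands in a single field.
bad <- read.csv(text = txt, header = FALSE, stringsAsFactors = FALSE)

# Quoting disabled: every physical line becomes one row.
good <- read.csv(text = txt, header = FALSE, quote = "",
                 stringsAsFactors = FALSE)

nrow(bad)    # fewer rows than input lines
nrow(good)   # one row per input line
```

Comparing `nrow()` against `length(readLines(file))` is a quick way to check whether a real file suffers from the same problem.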

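The read.csv call with quote="" did not actually work on the file I downloaded. (Note that the sep argument will be ignored by read.csv.) I first needed to use count.fields with that separator and then use read.table with fill=TRUE. The result is still a bit messy, with several columns filled with commas, but at least there is something to work with: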

table( count.fields("~/Downloads/test1.txt", sep="'", quote=""))

 10  13 
  5 130 
 x <- read.table("~/Downloads/test1.txt", header = FALSE, sep="'", quote="", stringsAsFactors=FALSE, skip=5)
#Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
#  line 6 did not have 13 elements
 x <- read.table("~/Downloads/test1.txt", header = FALSE, sep="'", 
                  quote="", stringsAsFactors=FALSE, fill=TRUE)
 str(x)
 #########################################################
'data.frame':   135 obs. of  13 variables:
 $ V1 : chr  "INSERT INTO message VALUES (52," "INSERT INTO message VALUES (53," "INSERT INTO message VALUES (54," "INSERT INTO message VALUES (55," ...
 $ V2 : chr  "press.release@enron.com" "office.chairman@enron.com" "office.chairman@enron.com" "press.release@enron.com" ...
 $ V3 : chr  "," "," "," "," ...
 $ V4 : chr  "2000-01-21 04:51:00" "2000-01-24 01:37:00" "2000-01-24 02:06:00" "2000-02-02 10:21:00" ...
 $ V5 : chr  "," "," "," "," ...
 $ V6 : chr  "<12435833.1075863606729.JavaMail.evans@thyme>" "<29664079.1075863606676.JavaMail.evans@thyme>" "<15300605.1075863606629.JavaMail.evans@thyme>" "<10522232.1075863606538.JavaMail.evans@thyme>" ...
 $ V7 : chr  "," "," "," "," ...
 $ V8 : chr  "ENRON HOSTS ANNUAL ANALYST CONFERENCE PROVIDES BUSINESS OVERVIEW AND GOALS FOR 2000" "Over $50 -- You made it happen!" "Over $50 -- You made it happen!" "ROAD-SHOW.COM Q4i.COM CHOOSE ENRON TO DELIVER FINANCIAL WEB CONTENT" ...
 $ V9 : chr  "," "," "," "," ...
 $ V10: chr  "HOUSTON - Enron Corp. hosted its annual equity analyst conference today in==20Houston.  Ken Lay, Enron chairman and chief execu"| __truncated__ "On Wall Street, people are talking about Enron.  At Enron, we re talking=20about people...our people.  You are the driving forc"| __truncated__ "On Wall Street, people are talking about Enron.  At Enron, we re talking=20about people...our people.  You are the driving forc"| __truncated__ "HOUSTON =01) Enron Broadband Services (EBS), a wholly owned subsidiary of E=nron=20Corp. and a leader in the delivery of high-b"| __truncated__ ...
 $ V11: chr  "" "," "," "," ...
 $ V12: chr  "" "Robert_Badeer_Aug2000Notes FoldersPress releases" "Robert_Badeer_Aug2000Notes FoldersPress releases" "Robert_Badeer_Aug2000Notes FoldersPress releases" ...
 $ V13: chr  "" ");" ");" ");" ...
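
To see which lines are responsible for the odd field counts, the per-line result of `count.fields` can be used as an index into the raw lines. This is a minimal sketch on hypothetical toy lines written to a temporary file; `toy`, `tmp`, `n`, and `short` are illustrative names, and the real file is not reproduced here.

```r
# Hypothetical toy lines: the second one is missing a field.
toy <- c("a'b'c", "a'b", "a'b'c")
tmp <- tempfile(fileext = ".txt")
writeLines(toy, tmp)

# Number of fields on each line, with quoting disabled.
n <- count.fields(tmp, sep = "'", quote = "")
table(n)

# Line numbers whose field count differs from the largest count,
# and the raw text of those lines for inspection.
short <- which(n != max(n))
readLines(tmp)[short]
```

Looking at the raw text of the short lines usually shows whether they are truncated records, header lines, or continuation lines of a multi-line field.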

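I got a better result by using the comma as the separator and only the single quote as the quote character, instead of the default of either single or double quotes used by the read.* functions: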

x2 <- read.table("~/Downloads/test1.txt", header = FALSE, sep=",",
                  quote="'", stringsAsFactors=FALSE, fill=TRUE)
 str(x2)
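
As a sketch of what this call does, and of one possible cleanup step afterwards, the example below runs it on two shortened, made-up lines in the same shape as the data. The toy addresses, the `id` column, and the two `sub()` patterns are assumptions for illustration, not part of the original answer.

```r
# Hypothetical, shortened lines in the same shape as the SQL dump.
txt <- c(
  "INSERT INTO message VALUES (52,'a@example.com','2000-01-21 04:51:00','Subject, with comma');",
  "INSERT INTO message VALUES (53,'b@example.com','2000-01-24 01:37:00','Another subject');"
)

# Comma as separator, single quote as the only quote character, so the
# comma inside the quoted subject does not split the field.
x2 <- read.table(text = txt, header = FALSE, sep = ",", quote = "'",
                 stringsAsFactors = FALSE, fill = TRUE)

# Strip the SQL wrapper: keep the number after "(" in the first column
# and drop the trailing ");" from the last column.
x2$id <- as.integer(sub(".*\\(", "", x2$V1))
x2$V4 <- sub("\\);$", "", x2$V4)
str(x2)
```

On the real file the last column would be V13 or similar rather than V4, so the cleanup lines would need adjusting.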
answered 2013-07-20T15:52:02.587

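Look at your text and think about what you would make of it if you were the computer. It starts without a separator ('), sees the first (') at press.release, and then starts doing something silly. Do not just count the first entries you read; check the output first.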

INSERT INTO message VALUES (52,'press.release@enron.com','2000-01-21 04:51:00','<12435833.1075863606729.JavaMail.evans@thyme>','ENRON HOSTS
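
One way to "be the computer" is to split a single line on the separator by hand and look at the pieces. The sketch below uses a shortened version of the line above (cut after the timestamp and closed with `);` as an assumption); `line` and `pieces` are illustrative names.

```r
# Shortened sample line; the closing ");" is assumed for illustration.
line <- "INSERT INTO message VALUES (52,'press.release@enron.com','2000-01-21 04:51:00');"

# Split on the single quote, treating it as a literal character.
pieces <- strsplit(line, "'", fixed = TRUE)[[1]]
print(pieces)

# The first piece is the SQL prefix, and every other piece in between
# is just the comma that sat between two quoted values.
length(pieces)
```

This is why splitting on `'` yields columns that contain only ",": the quote character is doing the job of a quote, not of a separator.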
answered 2013-07-20T15:16:08.973