
I have a text data file delimited by commas (","). A sample of the data is shown below (the first line gives the column names):

userID,appName,startTime,endTime,endResult
chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1
chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2
chhieut,gms.mos.test,2012-07-01 03:13:36,2012-07-01 03:14:03,2
chhieut,gms.mos.test,2012-07-01 03:18:26,2012-07-01 03:18:58,2
chhieut,gms.mos.test,2012-07-01 04:10:36,2012-07-01 04:10:54,2
chhieut,gms.mos.test,2012-07-01 04:38:26,2012-07-01 04:38:48,2
chhieut,gms.mos.test,2012-07-01 04:48:56,2012-07-01 04:49:04,3
chhieut,gms.mos.test,2012-07-01 05:49:46,2012-07-01 05:50:14,2
chhieut,gms.mos.test,2012-07-01 06:19:07,2012-07-01 06:19:25,2
chhieut,gms.mos.test,2012-07-01 07:09:17,2012-07-01 07:09:47,2

I am using the following syntax:

appsession <- read.table("C:/.../AppSession.txt", sep = ",", 
  col.names = c("userID","appName","startTime","endTime","endResult"), 
  fill = FALSE, strip.white = TRUE)

I get this error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 5 elements

3 Answers


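I would try skip = 2 if your file starts with a blank line and you intend to use 'col.names' rather than header=TRUE. As it stands, your code runs without error on the sample text (it works only after a fashion, because the header line is read in as a data row):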

> txt <- "userID,appName,startTime,endTime,endResult
+ chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1
+ chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2
+ chhieut,gms.mos.test,2012-07-01 03:13:36,2012-07-01 03:14:03,2
+ chhieut,gms.mos.test,2012-07-01 03:18:26,2012-07-01 03:18:58,2
+ chhieut,gms.mos.test,2012-07-01 04:10:36,2012-07-01 04:10:54,2
+ chhieut,gms.mos.test,2012-07-01 04:38:26,2012-07-01 04:38:48,2
+ chhieut,gms.mos.test,2012-07-01 04:48:56,2012-07-01 04:49:04,3
+ chhieut,gms.mos.test,2012-07-01 05:49:46,2012-07-01 05:50:14,2
+ chhieut,gms.mos.test,2012-07-01 06:19:07,2012-07-01 06:19:25,2
+ chhieut,gms.mos.test,2012-07-01 07:09:17,2012-07-01 07:09:47,2
+ "
> appsession <- read.table(text=txt, sep = ",", 
+   col.names = c("userID","appName","startTime","endTime","endResult"), 
+   fill = FALSE, strip.white = TRUE)
> 
> appsession
    userID      appName           startTime             endTime endResult
1   userID      appName           startTime             endTime endResult
2  chhieut gms.mos.test 2012-07-01 02:47:16 2012-07-01 02:47:46         1
3  chhieut gms.mos.test 2012-07-01 03:11:46 2012-07-01 03:12:25         2
4  chhieut gms.mos.test 2012-07-01 03:13:36 2012-07-01 03:14:03         2
5  chhieut gms.mos.test 2012-07-01 03:18:26 2012-07-01 03:18:58         2
6  chhieut gms.mos.test 2012-07-01 04:10:36 2012-07-01 04:10:54         2
7  chhieut gms.mos.test 2012-07-01 04:38:26 2012-07-01 04:38:48         2
8  chhieut gms.mos.test 2012-07-01 04:48:56 2012-07-01 04:49:04         3
9  chhieut gms.mos.test 2012-07-01 05:49:46 2012-07-01 05:50:14         2
10 chhieut gms.mos.test 2012-07-01 06:19:07 2012-07-01 06:19:25         2
11 chhieut gms.mos.test 2012-07-01 07:09:17 2012-07-01 07:09:47         2

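You should either use header=TRUE or skip the header line (and skip any blank lines as well). One way to see how many lines are blank is to look at count.fields(..., sep=","). Another way to see what the read.* and scan functions are "seeing" is to run this code (replacing the ellipsis appropriately):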

appLines <- readLines("C:/.../AppSession.txt")
appLines[1:5] # will display the first 5 lines from that file 
              # with no attempt to deal with any separators.
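
The two checks mentioned above can be sketched together on an in-memory sample. This is only an illustration: the vector lines is a hypothetical stand-in for the result of readLines() on the real file, and the blank lines in it are an assumption about what the file might contain.

```r
# Hypothetical stand-in for readLines("C:/.../AppSession.txt");
# the blank lines are assumed, not taken from the real file.
lines <- c("",
           "userID,appName,startTime,endTime,endResult",
           "chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1",
           "",
           "chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2")

# Positions of lines that are empty or contain only whitespace.
blank <- which(!nzchar(trimws(lines)))

# Number of comma-separated fields on each non-blank line
# (count.fields() ignores blank lines by default).
con <- textConnection(lines)
n_fields <- count.fields(con, sep = ",")
close(con)

# Keep only the non-blank lines and read them with the header in place.
keep <- lines[nzchar(trimws(lines))]
clean <- read.csv(text = paste(keep, collapse = "\n"),
                  header = TRUE, stringsAsFactors = FALSE)
```

If n_fields contains anything other than 5, the corresponding line is the one that scan() would complain about.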
answered 2012-09-27T07:40:34.210

You will need to provide a link to the actual data set, because the data you posted reads in fine:

d = read.csv(textConnection("userID,appName,startTime,endTime,endResult
chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1
chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2
chhieut,gms.mos.test,2012-07-01 03:13:36,2012-07-01 03:14:03,2
chhieut,gms.mos.test,2012-07-01 03:18:26,2012-07-01 03:18:58,2
chhieut,gms.mos.test,2012-07-01 04:10:36,2012-07-01 04:10:54,2
chhieut,gms.mos.test,2012-07-01 04:38:26,2012-07-01 04:38:48,2
chhieut,gms.mos.test,2012-07-01 04:48:56,2012-07-01 04:49:04,3
chhieut,gms.mos.test,2012-07-01 05:49:46,2012-07-01 05:50:14,2
chhieut,gms.mos.test,2012-07-01 06:19:07,2012-07-01 06:19:25,2
chhieut,gms.mos.test,2012-07-01 07:09:17,2012-07-01 07:09:47,2"), header=TRUE)

A quick check:

R> head(d, 1)
   userID      appName           startTime             endTime endResult
1 chhieut gms.mos.test 2012-07-01 02:47:16 2012-07-01 02:47:46         1
R> dim(d)
[1] 10  5

Make sure your actual file contains no empty lines; they can really mess things up.
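
To relate this to the read.table() call in the question, here is a minimal sketch, assuming the file has a header line and no blank lines, of two ways to keep the header out of the data. The shortened sample in txt and the stringsAsFactors = FALSE setting are choices made for this illustration.

```r
# Shortened copy of the sample data, for illustration only.
txt <- "userID,appName,startTime,endTime,endResult
chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1
chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2"

cols <- c("userID", "appName", "startTime", "endTime", "endResult")

# Option 1: take the column names from the first line.
a <- read.table(text = txt, sep = ",", header = TRUE,
                strip.white = TRUE, stringsAsFactors = FALSE)

# Option 2: skip the header line and supply the names via col.names.
b <- read.table(text = txt, sep = ",", skip = 1, col.names = cols,
                strip.white = TRUE, stringsAsFactors = FALSE)

# Compare the two results.
all.equal(a, b)
```

Either way, endResult is read as a number instead of being dragged to character by the header text.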

answered 2012-09-27T07:37:34.313

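Using a suitably edited version of the data (i.e. with all blank lines removed!), the file can be read with read.csv(). Note that I use a text connection containing the data to avoid writing the data out to a file. In read.csv(), just replace con with the path to your file.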

con <- textConnection("userID,appName,startTime,endTime,endResult
chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1
chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2
chhieut,gms.mos.test,2012-07-01 03:13:36,2012-07-01 03:14:03,2
chhieut,gms.mos.test,2012-07-01 03:18:26,2012-07-01 03:18:58,2
chhieut,gms.mos.test,2012-07-01 04:10:36,2012-07-01 04:10:54,2
chhieut,gms.mos.test,2012-07-01 04:38:26,2012-07-01 04:38:48,2
chhieut,gms.mos.test,2012-07-01 04:48:56,2012-07-01 04:49:04,3
chhieut,gms.mos.test,2012-07-01 05:49:46,2012-07-01 05:50:14,2
chhieut,gms.mos.test,2012-07-01 06:19:07,2012-07-01 06:19:25,2
chhieut,gms.mos.test,2012-07-01 07:09:17,2012-07-01 07:09:47,2
")

dat <- read.csv(con,
                colClasses = c(rep("character", 2), rep("POSIXct", 2),
                               "numeric"))
close(con) ## closing connection, not needed with a file

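Also note that by specifying the colClasses argument we tell R what the data are before reading them in, which saves some formatting work later, especially for the date-time data. We can do this here because you stored the date-time variables in the correct format.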

R> head(dat)
   userID      appName           startTime             endTime endResult
1 chhieut gms.mos.test 2012-07-01 02:47:16 2012-07-01 02:47:46         1
2 chhieut gms.mos.test 2012-07-01 03:11:46 2012-07-01 03:12:25         2
3 chhieut gms.mos.test 2012-07-01 03:13:36 2012-07-01 03:14:03         2
4 chhieut gms.mos.test 2012-07-01 03:18:26 2012-07-01 03:18:58         2
5 chhieut gms.mos.test 2012-07-01 04:10:36 2012-07-01 04:10:54         2
6 chhieut gms.mos.test 2012-07-01 04:38:26 2012-07-01 04:38:48         2
R> str(dat)
'data.frame':   10 obs. of  5 variables:
 $ userID   : chr  "chhieut" "chhieut" "chhieut" "chhieut" ...
 $ appName  : chr  "gms.mos.test" "gms.mos.test" "gms.mos.test" "gms.mos.test" ...
 $ startTime: POSIXct, format: "2012-07-01 02:47:16" "2012-07-01 03:11:46" ...
 $ endTime  : POSIXct, format: "2012-07-01 02:47:46" "2012-07-01 03:12:25" ...
 $ endResult: num  1 2 2 2 2 2 3 2 2 2
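
For comparison, here is a rough sketch of the conversion that colClasses saves, done by hand after reading the timestamps as character. The format string, tz = "UTC", and the duration column are assumptions added for this illustration; they are not part of the original data or answer.

```r
# Shortened copy of the sample data, for illustration only.
txt <- "userID,appName,startTime,endTime,endResult
chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1
chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2"

# Read the two timestamp columns as plain character first.
raw <- read.csv(text = txt,
                colClasses = c(rep("character", 4), "numeric"))

# Convert explicitly; the format and time zone are assumed here.
fmt <- "%Y-%m-%d %H:%M:%S"
raw$startTime <- as.POSIXct(raw$startTime, format = fmt, tz = "UTC")
raw$endTime   <- as.POSIXct(raw$endTime,   format = fmt, tz = "UTC")

# Hypothetical derived column: session length in seconds.
raw$duration <- as.numeric(difftime(raw$endTime, raw$startTime,
                                    units = "secs"))
```

Once the columns are POSIXct, arithmetic such as difftime() works directly, which is the kind of later formatting the colClasses approach avoids.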
answered 2012-09-27T07:47:33.280