
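New R user here. I'm trying to split a data set into deciles using cut, following the process in this question. I want to add the decile values as a new column in a data frame, but when I do, the lowest value is listed as NA for some reason. This happens whether include.lowest=TRUE or FALSE. Does anyone know why?

It also happens with this sample set, so it isn't specific to my data.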

> data <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)

> decile <- cut(data, quantile(data, (0:10)/10, labels=TRUE, include.lowest=FALSE))

> df <- cbind(data, decile)

> df

      data decile
 [1,]    1     NA
 [2,]    2      1
 [3,]    3      2
 [4,]    4      2
 [5,]    5      3
 [6,]    6      3
 [7,]    7      4
 [8,]    8      4
 [9,]    9      5
[10,]   10      5
[11,]   11      6
[12,]   12      6
[13,]   13      7
[14,]   14      7
[15,]   15      8
[16,]   16      8
[17,]   17      9
[18,]   18      9
[19,]   19     10
[20,]   20     10

1 Answer


There are two issues. First, your cut call has a couple of problems. I think you meant

cut(data, quantile(data, (0:10)/10), include.lowest=FALSE)
##                                ^missing parenthesis
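
To see why toggling include.lowest made no difference in the question, here is a small check. It assumes (as the question's output suggests) that quantile quietly ignores extra arguments it does not use, so with the parenthesis in the wrong place both labels and include.lowest never reached cut:

```r
data <- 1:20

# Arguments as written in the question: the extras go to quantile(), not cut()
q_question <- quantile(data, (0:10)/10, labels = TRUE, include.lowest = FALSE)
q_plain    <- quantile(data, (0:10)/10)

# If the extras are ignored, both sets of breaks are the same
identical(q_question, q_plain)

# cut() therefore ran with its default include.lowest = FALSE,
# and the minimum value ends up as NA
res <- cut(data, q_question)
is.na(res[1])
```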

Also, labels should be FALSE, NULL, or a vector containing the desired labels, one per interval (that is, of length(breaks) - 1).
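
A rough sketch of the three kinds of labels value, using the sample data from the question. The "D1" to "D10" names are made up for illustration only:

```r
data   <- 1:20
breaks <- quantile(data, (0:10)/10)

# labels = NULL (the default): a factor with interval labels
f_default <- cut(data, breaks, include.lowest = TRUE)

# labels = FALSE: plain integer codes instead of a factor
codes <- cut(data, breaks, labels = FALSE, include.lowest = TRUE)

# A character vector with one label per interval (hypothetical names)
f_named <- cut(data, breaks, labels = paste0("D", 1:10), include.lowest = TRUE)

nlevels(f_default)
codes
levels(f_named)
```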

Second, and this is the main problem: cut ran with include.lowest=FALSE, and data[1] is 1, which coincides with the first break defined by

> quantile(data, (0:10)/10)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 1.0  2.9  4.8  6.7  8.6 10.5 12.4 14.3 16.2 18.1 20.0

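The value 1 doesn't fall into any category. By default cut builds intervals that are open on the left and closed on the right, so the first one is (1, 2.9], and 1 sits exactly on the excluded lower bound of the categories defined by your breaks.

I'm not sure what you want, but you can try one of these two alternatives, depending on which class you want 1 to belong to: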
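
A quick way to confirm which values are left out, again on the sample data (the variable names here are only for illustration):

```r
data   <- 1:20
breaks <- quantile(data, (0:10)/10)

# Default call: include.lowest = FALSE, right-closed intervals
res <- cut(data, breaks)

levels(res)[1]      # label of the first interval
which(is.na(res))   # positions of values that fall in no interval
```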

> cut(data, quantile(data, (0:10)/10), include.lowest=TRUE)
 [1] [1,2.9]     [1,2.9]     (2.9,4.8]   (2.9,4.8]   (4.8,6.7]   (4.8,6.7]  
 [7] (6.7,8.6]   (6.7,8.6]   (8.6,10.5]  (8.6,10.5]  (10.5,12.4] (10.5,12.4]
[13] (12.4,14.3] (12.4,14.3] (14.3,16.2] (14.3,16.2] (16.2,18.1] (16.2,18.1]
[19] (18.1,20]   (18.1,20]  
10 Levels: [1,2.9] (2.9,4.8] (4.8,6.7] (6.7,8.6] (8.6,10.5] ... (18.1,20]
> cut(data, c(0, quantile(data, (0:10)/10)), include.lowest=FALSE)
 [1] (0,1]       (1,2.9]     (2.9,4.8]   (2.9,4.8]   (4.8,6.7]   (4.8,6.7]  
 [7] (6.7,8.6]   (6.7,8.6]   (8.6,10.5]  (8.6,10.5]  (10.5,12.4] (10.5,12.4]
[13] (12.4,14.3] (12.4,14.3] (14.3,16.2] (14.3,16.2] (16.2,18.1] (16.2,18.1]
[19] (18.1,20]   (18.1,20]  
11 Levels: (0,1] (1,2.9] (2.9,4.8] (4.8,6.7] (6.7,8.6] ... (18.1,20]
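
If the goal is a decile column in a data frame, as in the question, one possible sketch is below. It assumes you want the lowest value to land in decile 1 (the first alternative above) and uses data.frame rather than cbind, since cbind on two vectors returns a matrix:

```r
data   <- 1:20
breaks <- quantile(data, (0:10)/10)

# labels = FALSE gives integer decile numbers 1..10;
# include.lowest = TRUE keeps the minimum in the first decile
df <- data.frame(
  data   = data,
  decile = cut(data, breaks, labels = FALSE, include.lowest = TRUE)
)

head(df, 4)
```

With real data that has many tied values, quantile can return duplicate breaks, in which case cut will complain; that case is not handled here.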
answered 2013-07-29T19:56:29.483