I want to classify data in R in a specific way
Suppose I have the following data (a plain integer vector here):

> data = sample(1:500, 5000, replace = TRUE)

To classify this data, I create these classes:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
   (0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      102        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500] 
     1002      1492      1318       194 

If I want 0 to be included, I just need to add include.lowest = TRUE:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
   [0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      102        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500] 
     1002      1492      1318       194 

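In this example that makes no difference, because 0 does not occur in the data at all. If it did occur, say 4 times, the class [0,10] would contain 106 elements instead of 102: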

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
   [0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      106        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500] 
     1002      1492      1318       194 
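
To make the effect of include.lowest easier to check by eye, here is a minimal sketch on a tiny hand-made vector. The values in x and the breaks are hypothetical and chosen only for illustration; they are not the data above.

```r
# Hypothetical toy data: two zeros sit exactly on the lowest break
x <- c(0, 0, 5, 10, 15)
breaks <- c(0, 10, 20)

# Default (right = TRUE, include.lowest = FALSE): classes are (0,10] and (10,20],
# so the two zeros fall outside every class and become NA
without_lowest <- cut(x, breaks = breaks)
print(table(without_lowest, useNA = "ifany"))

# include.lowest = TRUE closes the first class on the left: [0,10] and (10,20]
with_lowest <- cut(x, breaks = breaks, include.lowest = TRUE)
print(table(with_lowest, useNA = "ifany"))
```

In the first table the zeros show up as NA; in the second they are counted in [0,10].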

There is another option that changes the class limits. The default for cut() is right = TRUE. If you change it to right = FALSE you get:

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE, right = FALSE)
> table(data.cl)
data.cl
   [0,10)   [10,20)   [20,30)   [30,40)   [40,50) 
       92        81        87       111       118 
  [50,60)   [60,70)   [70,80)   [80,90)  [90,100) 
      103        89        94       103       103 
[100,200) [200,350) [350,480) [480,500] 
     1003      1497      1320       199 

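include.lowest now effectively acts as an "include.highest", but at the cost of changing the class limits: because the limits have shifted slightly, some classes now contain a different number of members.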
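
A small sketch of how right and include.lowest interact, again on hypothetical values that sit exactly on the breaks so that the shift is visible:

```r
# Hypothetical toy data: every value lies exactly on a break
x <- c(0, 10, 20)
breaks <- c(0, 10, 20)

# right = TRUE (the default) with include.lowest: classes [0,10] and (10,20]
right_closed <- cut(x, breaks = breaks, include.lowest = TRUE)

# right = FALSE with include.lowest: classes [0,10) and [10,20]
# Here include.lowest closes the LAST class on the right instead
left_closed <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)

# Side-by-side view: the value 10 moves from the first class to the second
print(data.frame(x, right_closed, left_closed))
```

The value 10 belongs to [0,10] in the first case and to [10,20] in the second, which is why the counts per class differ between the two tables above.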
But what if I want the classification to look like this

> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
   (0,10]   (10,20]   (20,30]   (30,40]   (40,50] 
      102        80        87       113       117 
  (50,60]   (60,70]   (70,80]   (80,90]  (90,100] 
      101        89        95       106       104 
(100,200] (200,350] (350,480] (480,500) 
     1002      1492      1318       194

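that is, with 500 excluded? How would I do that?
Of course, one could say: "Just write data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 499)) instead of data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500)), because you are dealing with integers."
That is true, but what if that is not the case and I am working with floating-point numbers instead? How do I exclude 500 then?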
