I have a data frame as follows:

df <- data.frame(as.date=c("14/06/2016","15/06/2016","16/06/2016","17/06/2016","18/06/2016","19/06/2016","20/06/2016","21/06/2016","22/06/2016","23/06/2016",
                    "24/06/2016","04/07/2016","05/07/2016","06/07/2016","07/07/2016","08/07/2016","09/07/2016","10/07/2016","11/07/2016","12/07/2016",
                    "13/07/2016","14/07/2016","15/07/2016","17/07/2016","18/07/2016","19/07/2016","20/07/2016","21/07/2016","22/07/2016","01/08/2016",
                    "02/08/2016","03/08/2016","04/08/2016","05/08/2016","06/08/2016","07/08/2016","08/08/2016","09/08/2016","10/08/2016","11/08/2016",
                    "12/08/2016","13/08/2016","14/08/2016","15/08/2016","16/08/2016","17/08/2016","18/08/2016","19/08/2016","20/08/2016","21/08/2016",
                    "22/08/2016","23/08/2016","24/08/2016","25/08/2016","26/08/2016","27/08/2016","28/08/2016","29/08/2016","30/08/2016","31/08/2016",
                    "01/09/2016","02/09/2016","03/09/2016","04/09/2016","05/09/2016","06/09/2016","07/09/2016","08/09/2016","09/09/2016","10/09/2016",
                    "11/09/2016","12/09/2016","13/09/2016","14/09/2016","15/09/2016","16/09/2016","17/09/2016","18/09/2016","19/09/2016","20/09/2016"),
             wear=c("0","55","0","0","0","0","8","8","15","25","30","37","43","49","52","52","55","57","57","61","67","69","2","2","7",
                    "10","13","14","16","16","19","22","22","24","25","26","29","29","33","34","34","36","38","44","45","48","50","55",
                    "56","58","0","4","0","4","4","6","9","9","12","14","16","17","25","25","33","36","44","46","48","52","55","59",
                    "8","9","9","12","24","33","36","44"))

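The data is an example of the wear rate of a metal part on a machine. The wear increases over time, and a drop to 0 indicates an event or change.

The problem I have is that the wear value does not always drop to 0. As the data shows, there are two variables:

as.date = the date of each reading, wear = the metal wear on the part over time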

The drops at the changes are: 55-0, 69-2, 58-0, 59-8

When the value drops from a large number to 0 it is easy to code. I use the following code to flag the change, adding new variables named Status and id:

# Create two new columns, Status and id
df$Status <- 0                  # fill Status with zeros
df$Status[df$wear == 0] <- 1    # set Status to 1 where wear is 0
prop.table(table(df$Status))    # share of rows with each Status value
df$id <- 1                      # fill id with 1s

# Start a new id on the row after each Status == 1
for (i in 2:nrow(df)) {
  if (df$Status[i - 1] == 0) {
    df$id[i] <- df$id[i - 1]
  } else {
    df$id[i] <- df$id[i - 1] + 1
  }
}

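This works fine when the wear value drops to 0. It fails when it does not, as in the sample data, where the drops are 55-0, 69-2, 58-0 and 59-8. In the real data set the wear value sometimes drops to a negative number. I am not sure of the right way to handle this. I tried binning and grouping the data, without success.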

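This is a sample of the data. The real data set has more than 100 events, mostly with the wear value dropping to 0, but with 10-20 drops to a negative value or a value < 10.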

1 Answer

I think the for loop is inefficient. We can use the dplyr and lubridate packages to do something like this.

library(dplyr)
library(lubridate)

df2 <- df %>%
  # Convert the as.date column to date class
  # Convert the wear column to numeric 
  mutate(as.date = dmy(as.date), 
         wear = as.numeric(as.character(wear))) %>%
  # Create column show the wear of previous record
  mutate(wear2 = lag(wear)) %>%
  mutate(Diff = wear - wear2)

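The idea is to shift the wear column down by one row and then compute the difference between each wear value and the previous one. The result is stored in a new column, Diff. This is what the new data frame looks like.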

head(df2)
#      as.date wear wear2 Diff
# 1 2016-06-14    0    NA   NA
# 2 2016-06-15   55     0   55
# 3 2016-06-16    0    55  -55
# 4 2016-06-17    0     0    0
# 5 2016-06-18    0     0    0
# 6 2016-06-19    0     0    0
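
The same shift-and-subtract step can also be written without dplyr. Below is a minimal base R sketch, assuming a short made-up wear vector instead of the full data set (the values are illustrative only, and the variable names simply mirror the columns above).

```r
# Illustrative wear readings (made up for this sketch, not the real data)
wear <- c(0, 55, 0, 0, 8, 15, 69, 2, 7)

# Previous reading for each row; the first row has no predecessor
wear2 <- c(NA, head(wear, -1))

# Change from the previous reading; large negative values mark a reset
Diff <- wear - wear2

print(data.frame(wear, wear2, Diff))
```

Rows where Diff is strongly negative (here -55 and -67) are the candidates for the start of a new period, whether or not wear actually reaches 0.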

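After this, you can define a threshold on Diff to pick out the end of each period. For example, here I set the threshold to -50. You can see that the filter function successfully identifies the four periods.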

# Filter Diff <= -50
df2 %>% filter(Diff <= -50)
#      as.date wear wear2 Diff
# 1 2016-06-16    0    55  -55
# 2 2016-07-15    2    69  -67
# 3 2016-08-22    0    58  -58
# 4 2016-09-13    8    59  -51
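
To get back something like the id column from the question, one option is to flag the rows where Diff crosses the threshold and take a running count of those flags. This is only a sketch under assumptions: the wear vector is made up (including one drop to a negative value), and the -50 threshold is the same placeholder as above and would need tuning for the real data.

```r
# Illustrative wear readings, including a drop to a negative value
wear <- c(0, 20, 55, 0, 30, 69, 2, 40, 58, -3, 10)
threshold <- -50  # assumed cutoff; tune for the real data

# Change from the previous reading (NA for the first row)
Diff <- c(NA, diff(wear))

# TRUE on the row where wear has just dropped past the threshold
reset <- !is.na(Diff) & Diff <= threshold

# Running count of resets gives a period id starting at 1
id <- 1 + cumsum(reset)

print(data.frame(wear, Diff, reset, id))
```

In this sketch a new id starts on the row where the drop is observed, while the loop in the question increments on the following row. If that convention matters, shift reset down by one row before taking the cumulative sum.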

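One last note: in your original data frame the wear column is a factor, but you treat it as numeric in your calculations. This is dangerous. I used wear = as.numeric(as.character(wear)) to convert the column to numeric, but it would be better to create it as a numeric column in the first place.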
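
To show why the conversion matters, here is a small sketch with a made-up three-value factor (not the real column). Calling as.numeric directly on a factor returns the internal level codes, not the numbers the labels spell out.

```r
# A factor built from numeric-looking strings (illustrative values)
wear <- factor(c("0", "55", "8"))

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(wear)

# Converting through character recovers the actual numbers
values <- as.numeric(as.character(wear))

print(codes)   # 1 2 3
print(values)  # 0 55 8
```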

Answered 2017-12-05T13:50:48.830