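For my master's thesis I have to test different gap-filling methods on an existing dataset. To do that, I need to add artificial gaps of different lengths (1 h, 5 h, ...) so that I can then fill them with the different methods. Is there a simple function for this?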

Here is an example of the data frame:

df <- structure(list(DateTime = structure(c(1420074000, 1420077600, 
1420081200, 1420084800, 1420088400, 1420092000, 1420095600, 1420099200, 
1420102800, 1420106400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    `Dd 1-1` = c(0.0186269166666667, 0.0242605625, 0.00373020138888889, 
    0.000966965277777778, 0.0119253611111111, 0.0495888958333333, 
    0.02014125, 0.0306862638888889, 0.0324395694444444, 0.0191942152777778
    ), `Dd 1-3` = c(0.0242500833333333, 0.0349086388888889, 0, 
    0.00135595138888889, 0.0221090138888889, 0.0600941527777778, 
    0.0462282986111111, 0.0171887638888889, 0.0481975347222222, 
    0.0226582152777778), `Dd 1-5` = c(0.0212732152777778, 0.0284445347222222, 
    0.00276098611111111, 0.0142581875, 0.0276248958333333, 0.0328644027777778, 
    0.0495009166666667, 0.0173377777777778, 0.0384788194444444, 
    0.017663875), luecken = c(0.0186269166666667, 0.0242605625, 
    0.00373020138888889, 0.000966965277777778, 0.0119253611111111, 
    0.0495888958333333, 0.02014125, 0.0306862638888889, 0.0324395694444444, 
    0.0191942152777778)), row.names = c(NA, 10L), class = c("tbl_df", 
"tbl", "data.frame"))

1 Answer

If I understand your question correctly, one possible solution is:

set.seed(4) # make it reproducible

del <- sort(sample(1:nrow(df), 4, replace = FALSE)) # draw 4 random row indices and sort them

del2 <- del[c(diff(del) != 1, TRUE)] # within each run of adjacent ("connected") indices keep only the last one; the trailing TRUE makes the logical index as long as del

df[del2, 2:5] <- NA # set columns 2 to 5 to NA for the rows selected above

   DateTime             `Dd 1-1` `Dd 1-3` `Dd 1-5`   luecken
   <dttm>                  <dbl>    <dbl>    <dbl>     <dbl>
 1 2015-01-01 01:00:00  0.0186    0.0243    0.0213  0.0186  
 2 2015-01-01 02:00:00  0.0243    0.0349    0.0284  0.0243  
 3 2015-01-01 03:00:00 NA        NA        NA      NA       
 4 2015-01-01 04:00:00  0.000967  0.00136   0.0143  0.000967
 5 2015-01-01 05:00:00  0.0119    0.0221    0.0276  0.0119  
 6 2015-01-01 06:00:00  0.0496    0.0601    0.0329  0.0496  
 7 2015-01-01 07:00:00  0.0201    0.0462    0.0495  0.0201  
 8 2015-01-01 08:00:00  0.0307    0.0172    0.0173  0.0307  
 9 2015-01-01 09:00:00 NA        NA        NA      NA       
10 2015-01-01 10:00:00  0.0192    0.0227    0.0177  0.0192 

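Just to be clear: the step that cleans up adjacent ("connected") gaps is not entirely correct. If the random indices happened to be 1 to 4, for example, most or all of that run would be dropped, so you would end up with fewer gaps than you asked for. On a large dataset it should still be a good enough solution, as long as you don't plan to remove many values relative to the whole dataset.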
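To make that caveat concrete, here is a small sketch of how a `diff()`-based filter treats runs of consecutive indices. The trailing `TRUE` is my own choice for this illustration, so that the logical index has the same length as the index vector; the variable names are made up for the example.

```r
# A run of four consecutive indices (the "1 to 4" case)
run_only <- 1:4
# TRUE where the next index is not adjacent; the last index is always kept
kept_run <- run_only[c(diff(run_only) != 1, TRUE)]
kept_run    # 4: four sampled indices collapse into a single gap

# One isolated index followed by a run of three
mixed <- c(3, 7, 8, 9)
kept_mixed <- mixed[c(diff(mixed) != 1, TRUE)]
kept_mixed  # 3 9: only two of the four requested gaps survive
```

So the more indices you sample relative to the number of rows, the more often runs occur and the further the actual number of gaps falls below the requested one.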

Now for how to create larger gaps (I will use 3 h, since your example data has only 10 rows):

set.seed(4)

del <- sort(sample(1:nrow(df), 3, replace = FALSE))

del2 <- del[c(diff(del) > 3, TRUE)] # keep a start index only if the next one is more than 3 rows away (3 = the gap length wanted); the last index is always kept

del3 <- c(del2, del2 + 1, del2 + 2) # add the two rows following each start index to get 3-row gaps

del4 <- del3[del3 <= nrow(df)] # make sure no index is out of bounds (the maximum index is 10, even if a gap starts at row 10)

df[del4, 2:5] <- NA

    DateTime            `Dd 1-1` `Dd 1-3` `Dd 1-5` luecken
   <dttm>                 <dbl>    <dbl>    <dbl>   <dbl>
 1 2015-01-01 01:00:00   0.0186   0.0243   0.0213  0.0186
 2 2015-01-01 02:00:00   0.0243   0.0349   0.0284  0.0243
 3 2015-01-01 03:00:00  NA       NA       NA      NA     
 4 2015-01-01 04:00:00  NA       NA       NA      NA     
 5 2015-01-01 05:00:00  NA       NA       NA      NA     
 6 2015-01-01 06:00:00   0.0496   0.0601   0.0329  0.0496
 7 2015-01-01 07:00:00   0.0201   0.0462   0.0495  0.0201
 8 2015-01-01 08:00:00   0.0307   0.0172   0.0173  0.0307
 9 2015-01-01 09:00:00  NA       NA       NA      NA     
10 2015-01-01 10:00:00  NA       NA       NA      NA     
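If you need this for several gap lengths, the steps above could be wrapped into a small helper. The following is only a sketch under my own assumptions: `add_gaps`, its arguments and the `toy` data are hypothetical names, not from any package. Instead of dropping adjacent indices afterwards, it draws gap starts one at a time and rejects any that would touch or overlap an existing gap, so you get the requested number of gaps as long as they fit.

```r
# Hypothetical helper (not from any package): insert `n_gaps` artificial gaps,
# each `gap_len` consecutive rows long, into the columns `cols` of `df`.
add_gaps <- function(df, gap_len, n_gaps, cols, max_tries = 1000) {
  last_start <- nrow(df) - gap_len + 1  # latest row at which a full gap still fits
  stopifnot(gap_len >= 1, n_gaps >= 1, last_start >= 1)
  starts <- integer(0)
  tries <- 0
  while (length(starts) < n_gaps && tries < max_tries) {
    tries <- tries + 1
    cand <- sample.int(last_start, 1)
    # accept only if at least one intact row separates it from every gap so far
    if (all(abs(cand - starts) > gap_len)) starts <- c(starts, cand)
  }
  if (length(starts) < n_gaps) {
    warning("could only place ", length(starts), " of ", n_gaps, " gaps")
  }
  # expand every start index into the full block of rows it covers
  idx <- sort(unlist(lapply(starts, function(s) seq(s, length.out = gap_len))))
  df[idx, cols] <- NA
  df
}

# Usage on made-up hourly data: two 3-hour gaps in columns a and b
set.seed(1)
toy <- data.frame(t = 1:24, a = rnorm(24), b = rnorm(24))
gappy <- add_gaps(toy, gap_len = 3, n_gaps = 2, cols = c("a", "b"))
```

Because the function returns a modified copy, the original data frame stays untouched, which makes it easy to compare the filled values against the true ones afterwards.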
answered 2020-11-10T14:34:49.960