r - Delete lines of a file that follow 'NA's

Question

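My data is laid out in a particular way: it has no header, and the columns do not necessarily contain the same type of information. Part of it can be reproduced with: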

data <- textConnection("rs123,22,337647,C,T
1,7385,0.4156,-0.0019,0.0037
1,16550,0.959163800640972,-0.0241,0.0128
1,17218,0.0528,0.015,0.039
rs193,22,366349,C,T
1,7385,0.3708,0.0017,0.0035
1,16550,0.793259111116741,-0.0028,0.009
1,17218,0.9547,-0.016,0.033
rs194,22,366300,NA,NA
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
rs118,22,301327,C,T
1,7385,0.0431,-0.0085,0.0077
1,16550,0.789981059331214,0.0036,0.0092
1,17218,0.99,-0.057,0.062
rs120,22,497528,C,G
1,7385,0.0716,0.0012,0.0073
1,16550,0.233548238634496,-0.0033,0.0064
1,17218,0.4563,-0.002,0.015
rs109,22,309825,A,G
1,5520,0.8611,2e-04,0.0044
0,0,0,0,0
1,17218,0.9762,0.076,0.044
rs144,22,490068,C,T
0,0,0,0,0
0,0,0,0,0
1,17218,0.2052,-0.013,0.032")
mydata <- read.csv(data, header = F, sep = ",", stringsAsFactors=FALSE)

My question is this: I can write a one-liner with grep/awk to drop the lines containing "NA" (these are SNPs that have no data):

grep -v 'NA' file.in > file.out

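But how can I specify that the 3 lines following each of them should be deleted as well? I don't want to delete every all-zero line, only the all-zero lines that follow a line containing a SNP with "NA".

Thanks for your input!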

score 3 · Accepted Answer

With GNU sed (the line count after an address, as in +3, is a GNU extension):

sed -e '/NA/,+3 d' infile

Edited to add an awk solution:

awk '/NA/ { for ( i = 1; i <= 4; i++ ) { getline; } } { print }' infile
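The awk one-liner above relies on getline, so it prints whatever line is current once the loop ends: an NA header that immediately follows a deleted block is printed without being checked, and an NA block at the very end of the file leaves its last line in the output. A counter-based variant avoids both cases. This is only a sketch, assuming each NA header is followed by at most three lines that belong to it; the function name drop_na_blocks is made up for illustration.

```shell
# Hypothetical helper: drop every line containing NA plus the 3 lines after it.
# Reads the named files, or standard input when no file is given.
drop_na_blocks() {
  awk '
    /NA/     { skip = 3; next }   # NA header: drop it and arm the counter
    skip > 0 { skip--; next }     # drop up to 3 lines after the header
             { print }            # everything else passes through
  ' "$@"
}
```

It would be used the same way as the grep command in the question, for example drop_na_blocks file.in > file.out.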

score 1 · Accepted Answer

Update: my previous answer may have been wrong, so here is an alternative:

nas <- apply(mydata, 1, function(x) any(is.na(x)))
s <- apply(mydata == 0, 1, all)
out <- which(nas)
for (i in which(nas)) {
  j <- i + 1
  while (!is.na(s[j]) && s[j]) {
    out <- c(out, j)
    j <- j + 1
  }
}
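# Guard the empty case: mydata[-integer(0), ] would return zero rows
mydata2 <- if (length(out) > 0) mydata[-out, ] else mydata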

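At first I thought you only cared about the first 3 rows after an NA, but it seems you really want to delete every consecutive all-zero row that follows each NA.

(This was my previous answer:)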

nas <- apply(mydata, 1, function(x) any(is.na(x)))
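# the 3 rows after every NA row (a plain + 1:3 would recycle across several NA rows)
whereToLook <- sort(as.vector(outer(which(nas), 1:3, "+")))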
s <- apply(mydata == 0, 1, prod)
zeros <- which(s == 1)
whereToErase <- zeros[zeros %in% whereToLook]
whereToErase <- c(which(nas), whereToErase)

score 1 · Accepted Answer

After importing the data into R, you can do the following:

# identify the rows containing any NA's
narows <- which(apply(mydata,1,function(x) any(is.na(x))))
# identify the rows containing all 0's
zerorows <- which(apply(mydata==0,1,all))

# get the rows that either contain NAs, or are all 0 and are 
# within 3 of the NA rows
rowstodelete <- c(narows,
                  intersect(
                    (sapply(c(narows),function(x) seq(x,x+3))),
                    zerorows
                  )
                )

# subset mydata to only remove the NA rows + the following 3 "zero rows"
mydata[-rowstodelete,]

score 0 · Accepted Answer

This might work for you (GNU sed):

 sed '/\<NA\>/!b;:a;$!N;s/\n\(0,\)\+0$//;ta;D' file

This deletes any line containing NA, along with any all-zero (0,...,0) lines that directly follow it.
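For readers who find the sed program hard to follow, the same rule (drop each NA line and the run of all-zero lines directly after it) can be sketched in awk. This is an illustrative equivalent rather than the answerer's code: the function name drop_na_and_zero_runs is invented here, and it assumes NA appears as a whole comma-separated field and that all-zero lines look exactly like 0,0,...,0 with no spaces or decimal points.

```shell
# Hypothetical helper: drop NA lines and the all-zero lines right after them.
# All-zero lines elsewhere in the file are kept.
drop_na_and_zero_runs() {
  awk '
    /(^|,)NA(,|$)/       { in_run = 1; next }   # NA field found: drop line, start a run
    in_run && /^0(,0)+$/ { next }               # all-zero line inside the run: drop it
                         { in_run = 0; print }  # any other line ends the run and is kept
  ' "$@"
}
```

Unlike a fixed count of three lines, this keeps an all-zero row that belongs to a SNP with data, such as the middle row of rs109 in the sample.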
