r - Remove rows that are associated with certain column values

Question

I am new to R. My data are a matrix X of 0's and 1's with an associated y. I need to remove the observations associated with columns that have fewer than 10 ones. So I sum the columns of x and store the names of those columns in a vector. Then I drop the rows (and their y's) that have a 1 in those columns, and finally I need to remove the columns themselves, because they will then contain only zeros. I am getting this error and I don't know how to fix it or how to improve the code:

Error in -Col[i] : invalid argument to unary operator

Here is the code

a0=rep(1,40)
a=rep(0:1,20)
b=c(rep(1,20),rep(0,20))
c0=c(rep(0,12),rep(1,28))
c1=c(rep(1,5),rep(0,35))
c2=c(rep(1,8),rep(0,32))
c3=c(rep(1,23),rep(0,17))
x=matrix(cbind(a0,a,b,c0,c1,c2,c3),nrow=40,ncol=7)
nam <- paste("V",1:7,sep="")
colnames(x)<-nam
dat <- cbind(y=rnorm(40,50,7),x)
#===================================
toSum <- apply(dat,2,sum)
Col <- Val <- NULL
for(i in 1:length(toSum)){
if(toSum[i]<10){
Col <- c(Col,colnames(dat)[i])
Val <- c(Val,toSum[i])}
}

for(i in 1:length(Col)){
indx <- dat[,Col[i]]==0
datnw <- dat[indx,]
datnw2 <- datnw[,-Col[i]]
}

Can someone help, please? I am not sure if there is a way to get the positions of the columns named in the Col vector. I have around 1500 columns in my original data.

Thanks

score 0 · Accepted Answer

This should do the trick

   datnw2 <- dat[, -which(toSum<10)]

This allows you to avoid the loop

 head(datnw2)
            y V1 V2 V3 V4 V7
[1,] 60.88166  1  0  1  0  1
[2,] 54.35388  1  1  1  0  1
[3,] 39.78881  1  0  1  0  1
[4,] 44.20074  1  1  1  0  1
[5,] 42.27351  1  0  1  0  1
[6,] 43.52390  1  1  1  0  1

Edit: Some pointers

toSum<10 gives you a logical vector of the same length as toSum. which(toSum<10) gives you the positions of the elements that meet the condition.
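To make the difference concrete, here is a minimal sketch on a made-up named vector. The name `s` and its values are assumptions for illustration and only stand in for toSum.

```r
# Hypothetical column sums, standing in for toSum
s <- c(y = 2000, V1 = 40, V2 = 20, V5 = 5, V6 = 8)

is_small <- s < 10          # logical vector, one element per element of s
positions <- which(s < 10)  # integer positions of the TRUE elements

is_small    # FALSE FALSE FALSE TRUE TRUE (with the names of s)
positions   # 4 and 5, named V5 and V6
```

Either form can be used as an index, but only the integer positions from which() can be negated with a minus sign.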

Since you want to keep the columns of dat for which toSum<10 is FALSE, you have to leave the other columns out with dat[, -which(toSum<10)]. This means: choose all columns except 6 and 7, which are the ones that meet the condition toSum<10.
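Two related points, offered as a hedged sketch rather than as part of the answer above. First, the error in the question arises because Col holds column names (character strings), and unary minus is not defined for strings; match() converts names to positions. Second, if no column meets the condition, which() returns an empty vector and a negated empty index selects zero columns, so negating the logical vector is the safer form. The toy matrix `m` and the name `sums` are assumptions for illustration.

```r
# Toy matrix standing in for dat (hypothetical values)
m <- cbind(y = c(5.1, 4.9, 6.2), V1 = c(1, 1, 1), V2 = c(0, 1, 0))
sums <- colSums(m)

# No column has a sum below 0, so which() returns an empty vector
empty <- which(sums < 0)
n_bad  <- ncol(m[, -empty, drop = FALSE])       # 0: every column is dropped
n_good <- ncol(m[, !(sums < 0), drop = FALSE])  # 3: logical negation keeps all

# Column names cannot be negated, but match() turns them into positions
Col <- c("V2")
pos <- match(Col, colnames(m))   # 3
m2 <- m[, -pos, drop = FALSE]    # fine here because pos is not empty
colnames(m2)                     # "y" "V1"
```

The drop = FALSE argument keeps the result a matrix even when only one column survives.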

score 0 · Accepted Answer

Using your example data, if you want to find which rows (i.e. observations) have fewer than 10 1s

rs <- rowSums(dat[, -1]) < 10

If you want to know which columns (i.e. variables) have fewer than 10 "presences", then use:

cs <- colSums(dat[, -1]) < 10

R> cs
   V1    V2    V3    V4    V5    V6    V7 
FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

Both rs and cs are logical vectors that can be used as indices to remove rows/columns.

To get rid of the columns, note that cs was computed without the y column, so it has one element fewer than dat has columns. Prepend TRUE so that y is kept and the index lines up with the columns of dat:

dat2 <- dat[, c(TRUE, !cs)]
head(dat2)

R> head(dat2)
            y V1 V2 V3 V4 V7
[1,] 47.61253  1  0  1  0  1
[2,] 60.51697  1  1  1  0  1
[3,] 53.69815  1  0  1  0  1
[4,] 53.79534  1  1  1  0  1
[5,] 49.04329  1  0  1  0  1
[6,] 42.04286  1  1  1  0  1

Next, it seems that you are concerned that some rows will now be all zero? Is that what you are trying to do with the final step? That doesn't appear to be the case here, so perhaps the way of removing the columns I show has solved that problem too?

R> rowSums(dat2[,-1])
 [1] 3 4 3 4 3 4 3 4 3 4 3 4 4 5 4 5 4 5 4 5 3 4 3 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
[39] 2 3
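If the goal is what the question seems to describe (drop the observations that have a 1 in any sparse column, then drop those columns), the two steps can be combined. This is a minimal sketch under that reading of the question, not a definitive solution. The names `rare`, `hit` and `dat_clean` and the seed are assumptions introduced here.

```r
set.seed(42)  # arbitrary seed so that y is reproducible (not in the question)

# Rebuild the question's data
x <- cbind(rep(1, 40), rep(0:1, 20),
           c(rep(1, 20), rep(0, 20)), c(rep(0, 12), rep(1, 28)),
           c(rep(1, 5), rep(0, 35)), c(rep(1, 8), rep(0, 32)),
           c(rep(1, 23), rep(0, 17)))
colnames(x) <- paste0("V", 1:7)
dat <- cbind(y = rnorm(40, 50, 7), x)

rare <- colSums(x) < 10                       # columns with fewer than 10 ones
hit  <- rowSums(x[, rare, drop = FALSE]) > 0  # rows with a 1 in any rare column

# Drop those rows, then drop the rare columns; the leading TRUE keeps y
dat_clean <- dat[!hit, c(TRUE, !rare), drop = FALSE]

dim(dat_clean)       # 32 rows, 6 columns
colnames(dat_clean)  # "y" "V1" "V2" "V3" "V4" "V7"
```

Because everything is vectorised, the same lines should work unchanged with around 1500 columns.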
