r - Concatenating positions into genomic segments

Question

I would like to concatenate all consecutive rows that have a similarity score of more than 0.955. The Abo and Bel columns represent the similarity score with the row above and the row below, respectively. In the following input df I have 10 genomic probes (NAME column), which are concatenated into just 4 genomic segments (dfout).

df <- " NAME Abo  Bel Chr GD Position
 BovineHD0100009217 NA 1.0000000   1  0  31691781
 BovineHD0100009218 1.0000000 0.6185430   1  0  31695808
 BovineHD0100019600 0.6185430 0.9973510   1  0  69211537
 BovineHD0100019601 0.9973510 1.0000000   1  0  69213650
 BovineHD0100019602 1.0000000 1.0000000   1  0  69214650
 BovineHD0100019603 1.0000000 0.6600000   1  0  69217942
 BovineHD0100047112 0.6600000 1.0000000   1  0  93797691
 BovineHD0100026604 1.0000000 1.0000000   1  0  93815774
 BovineHD0100026605 1.0000000 0.4649007   1  0  93819471
 BovineHD0100029861 0.4649007 NA   1  0 105042452"
df <- read.table(text=df, header=T)

My expected output dfout:

dfout <- "Chr start end startp endp nprob 
           1  31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
           1  69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
           1  93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
           1  105042452 105042452 BovineHD0100029861 BovineHD0100029861 1"
dfout <- read.table(text=dfout, header=T)

Any ideas?

score 4 · Accepted Answer

I couldn't think of any pretty solution using basic dataframe manipulation, so here's a bad-looking one that works:

First, add stringsAsFactors = FALSE when creating df:

df <- read.table(text=df, header=T, stringsAsFactors = FALSE)

start <- df$Position[1]
end <- integer()
output <- NULL
count <- 1
for (i in 1:(nrow(df)-1)) {
  if(df$Bel[i] < 0.955)  {
    end <- df$Position[i]
    output <- rbind(output, c(start, end, df$NAME[which(df$Position == start)], df$NAME[which(df$Position == end)], count))
    start <- df$Position[i+1]
    count <- 0
  } 
  count <- count + 1
}
end <- df$Position[nrow(df)]
output <- as.data.frame(rbind(output, c(start, end, df$NAME[which(df$Position == start)], df$NAME[which(df$Position == end)], count)))
colnames(output) <- c("start", "end", "startp", "endp", "nprob")

The basic idea is to loop through the rows and check whether the next row should be added to the current segment (Bel >= 0.955) or whether a new segment should start (Bel < 0.955). When a new segment has to be started, the end row of the current one is set, the finished segment is appended to the output, and the start of the next segment is set to the following row. A counter tracks the number of rows in each segment (nprob).

Finally, the last segment is added outside the for loop, and the output is converted to a data frame and given its column names. I did not use Chr because (1) the values are all equal and (2) if they weren't, you didn't give any way to choose or summarize them.

Result:

> output
      start       end             startp               endp nprob
1  31691781  31695808 BovineHD0100009217 BovineHD0100009218     2
2  69211537  69217942 BovineHD0100019600 BovineHD0100019603     4
3  93797691  93819471 BovineHD0100047112 BovineHD0100026605     3
4 105042452 105042452 BovineHD0100029861 BovineHD0100029861     1

I'm pretty sure that you or someone else can work on this to make it shorter and more concise.
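As one possible shortening, the same idea can be sketched without a loop in base R. This is only a sketch under a few assumptions: the rows are already sorted by position, a row's Abo equals the previous row's Bel (as in the example), and the names threshold, new_segment, first_row, last_row and segments are introduced here for illustration and do not appear in the original code.

```r
# Rebuild the example input from the question.
df <- read.table(text = "NAME Abo Bel Chr GD Position
BovineHD0100009217 NA 1.0000000 1 0 31691781
BovineHD0100009218 1.0000000 0.6185430 1 0 31695808
BovineHD0100019600 0.6185430 0.9973510 1 0 69211537
BovineHD0100019601 0.9973510 1.0000000 1 0 69213650
BovineHD0100019602 1.0000000 1.0000000 1 0 69214650
BovineHD0100019603 1.0000000 0.6600000 1 0 69217942
BovineHD0100047112 0.6600000 1.0000000 1 0 93797691
BovineHD0100026604 1.0000000 1.0000000 1 0 93815774
BovineHD0100026605 1.0000000 0.4649007 1 0 93819471
BovineHD0100029861 0.4649007 NA 1 0 105042452",
                 header = TRUE, stringsAsFactors = FALSE)

threshold <- 0.955  # assumed cut-off, same value as in the loop above

# A row opens a new segment when its similarity with the row above is
# missing (first row) or below the threshold.
new_segment <- is.na(df$Abo) | df$Abo < threshold

# Row index where each segment starts, and where it ends
# (one row before the next start, or the last row of df).
first_row <- which(new_segment)
last_row <- c(first_row[-1] - 1, nrow(df))

segments <- data.frame(
  Chr    = df$Chr[first_row],
  start  = df$Position[first_row],
  end    = df$Position[last_row],
  startp = df$NAME[first_row],
  endp   = df$NAME[last_row],
  nprob  = last_row - first_row + 1,
  stringsAsFactors = FALSE
)
segments
```

Unlike the rbind approach, this keeps start, end and nprob numeric, because the columns are never combined into a single character matrix.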

score 2 · Accepted Answer

Here is a dplyr version. First we need to define the groups, which is what the mutate step does; then a simple summarise within each group gives the segments.

library(dplyr)

df %>% 
  mutate(
   Abo955=ifelse(Abo<0.955,NA,Abo),
   myGroup=cumsum(is.na(Abo955)*1)) %>%
  group_by(myGroup) %>% 
  summarise(
    Chr=min(Chr),
    start=min(Position),
    end=max(Position),
    startp=first(NAME),
    endp=last(NAME),
    nprob=n()) %>% 
  select(-myGroup)
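To make the grouping step easier to follow, here is a small base R sketch of what the mutate call computes, using only the Abo values from the question. The variable names abo, abo955 and my_group are chosen for this illustration.

```r
# Abo column from the example input (similarity with the row above).
abo <- c(NA, 1.0000000, 0.6185430, 0.9973510, 1.0000000,
         1.0000000, 0.6600000, 1.0000000, 1.0000000, 0.4649007)

# Scores below the threshold become NA; the NA in the first row stays NA.
abo955 <- ifelse(abo < 0.955, NA, abo)

# Every NA marks the first row of a new segment, so the running count of
# NAs gives a segment id for each row.
my_group <- cumsum(is.na(abo955))
my_group
# 1 1 2 2 2 2 3 3 3 4
```

Rows sharing an id belong to the same segment, which is why group_by followed by summarise yields one output row per segment.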

score 1 · Accepted Answer

This solution is purely based on logical vectors and works with the provided example.

As Molx said, let's add stringsAsFactors=F

df <- read.table(text=df, header=T, stringsAsFactors = F)

And just so that the logical evaluations work, let's change the NAs to 0s

df[is.na(df)] <- 0

Now, for the consecutive rows that will be concatenated, let's find the "start" and "end" rows using logical evaluations

starts <- df$Bel >= 0.955 &  df$Abo < 0.955
ends <- df$Bel < 0.955 &  df$Abo >= 0.955

With these we can already construct a data.frame of the segments that span more than one row

concatenated <- data.frame(Chr = df[starts, "Chr"], 
                            start = df[starts, "Position"], 
                            end = df[ends, "Position"],
                            startp = df[starts, "NAME"],
                            endp = df[ends, "NAME"],
                            nprob = which(ends) - which(starts) + 1  # rows from each start to its end, inclusive
                            )

And let's also construct a data.frame with the rows that are not concatenated, i.e. the ones that do not have the desired similarity score with either the row above or the row below

notConcatenate <- df$Abo < 0.955 & df$Bel < 0.955

non_concatenated <- data.frame(Chr = df[notConcatenate, "Chr"], 
                            start = df[notConcatenate, "Position"], 
                            end = df[notConcatenate, "Position"],
                            startp = df[notConcatenate, "NAME"],
                            endp = df[notConcatenate, "NAME"],
                            nprob = 1
                            )

And finally bind the two data.frames

dfout <- rbind(concatenated, non_concatenated)

Resulting in

> dfout
  Chr     start       end             startp               endp nprob
1   1  31691781  31695808 BovineHD0100009217 BovineHD0100009218     2
2   1  69211537  69217942 BovineHD0100019600 BovineHD0100019603     4
3   1  93797691  93819471 BovineHD0100047112 BovineHD0100026605     3
4   1 105042452 105042452 BovineHD0100029861 BovineHD0100029861     1

NOTE: This code assumes that correlated probes are within the same chromosome.
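If that assumption might not hold, one option is to also force a segment break wherever Chr changes. The sketch below uses a hypothetical toy input (the probe names, scores and positions are invented for illustration and are not from the question), and the names low_score, chr_change and seg_id are likewise introduced here.

```r
# Hypothetical toy input: two chromosomes whose neighbouring probes
# all score above the threshold.
toy <- data.frame(
  NAME     = c("p1", "p2", "p3", "p4"),
  Abo      = c(NA, 0.99, 0.98, 0.99),
  Chr      = c(1, 1, 2, 2),
  Position = c(100, 200, 50, 150),
  stringsAsFactors = FALSE
)

threshold <- 0.955

# Break on a missing or low similarity with the row above ...
low_score <- is.na(toy$Abo) | toy$Abo < threshold
# ... and also whenever the chromosome differs from the previous row.
chr_change <- c(TRUE, toy$Chr[-1] != toy$Chr[-nrow(toy)])

seg_id <- cumsum(low_score | chr_change)
seg_id
# 1 1 2 2
```

Without the chr_change term, all four toy rows would fall into a single segment spanning both chromosomes.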

Cheers!
