0

So far i have been able to merge two files and get the following dataframe (df1):

ID  someLength  someLongerSeq   someSeq someMOD someValue
A   16  XCVBNMHGFDSTHJGF    NMH T3(P)   7
A   16  XCVBNMHGFDSTHJGF    NmH M3(O); S4(P); S6(P) 1
B   24  HDFGKJSDHFGKJSDFHGKLSJDF    HFGKJSDFH   S9(P)   5
C   22  QIOWEURQOIWERERQWEFFFF  RQoIWERER   Q16(D); S19(P)  7
D   19  HSEKDFGSFDKELJGFZZX KELJ    S7(P); C9(C); S10(P)    1

i am looking for a way to do a regex match based on "someSeq" column to look for that substring in the "someLongersSeq" column and get the start location of the match and then add that to the whole numbers that are attached to the characters such as T3(P).

Example:

For the second row "ID:A","someSeq":"NmH" matches starts at location 4 of the someLongerSeq (after to upper conversion of NmH). So i want to add that number 4 to someMOD fields M3(O);S4(P);S6(P) so that i get M7(O);S8(P);S10(P) and then overwrite the new value in the someMOD column.

And do that for each row. Regex is per row bases. Any help is really appreciated. Thanks.

4

1 回答 1

2

首先,我应该提到很难读取您的数据。我稍微修改了一下(我从someMOD列中删除了空格)来阅读它们。这不是问题,因为您已经将数据放入 data.frame 中。所以我读了这样的数据:

dat <- read.table(text='ID  someLength  someLongerSeq   someSeq someMOD someValue
A   16  XCVBNMHGFDSTHJGF    NMH T3(P)   7
A   16  XCVBNMHGFDSTHJGF    NmH M3(O);S4(P);S6(P) 1
B   24  HDFGKJSDHFGKJSDFHGKLSJDF    HFGKJSDFH   S9(P)   5
C   22  QIOWEURQOIWERERQWEFFFF  RQoIWERER   Q16(D);S19(P)  7
D   19  HSEKDFGSFDKELJGFZZX KELJ    S7(P);C9(C);S10(P)    1',header=TRUE)

那么想法是:

  • 使用逐行处理apply
  • 用于gregexpr获取 someSeq 的索引someLongerSeq
  • 用于gsubfn将前一个索引添加到它的someMOD数字

这里是整个解决方案:

library(gsubfn)
res <- t(apply(dat,1,function(x){
        idx <- gregexpr(x['someSeq'],x['someLongerSeq'],
                        ignore.case = TRUE)[[1]][1]
        x[['someMOD']] <- gsubfn("[[:digit:]]+", 
                                  function(x) as.numeric(x)+idx, 
                                  x[['someMOD']])
        x

}))


 as.data.frame(res)
  ID someLength            someLongerSeq   someSeq              someMOD someValue
1  A         16         XCVBNMHGFDSTHJGF       NMH                T8(P)         7
2  A         16         XCVBNMHGFDSTHJGF       NmH   M8(O);S9(P);S11(P)         1
3  B         24 HDFGKJSDHFGKJSDFHGKLSJDF HFGKJSDFH               S18(P)         5
4  C         22   QIOWEURQOIWERERQWEFFFF RQoIWERER        Q23(D);S26(P)         7
5  D         19      HSEKDFGSFDKELJGFZZX      KELJ S18(P);C20(C);S21(P)         1
于 2013-06-02T01:05:07.523 回答