
I have a dataframe as follows:

import pandas as pd

data = {'CHROM': ['chr1', 'chr2', 'chr1', 'chr3', 'chr1'],
        'POS': [939570, 3411794, 1043223, 22511093, 24454031],
        'REF': ['T', 'T', 'CCT', 'CTT', 'CT'],
        'ALT': ['TCCCTGGAGGACC', 'C', 'C', 'CT', 'CTT'],
        'Len_REF': [1, 1, 3, 3, 2], 'Len_ALT': [13, 1, 1, 2, 3]
       }
df1 = pd.DataFrame(data)

df1 looks like this:

    CHROM   POS     REF  ALT            Len_REF   Len_ALT
0   chr1    939570   T   TCCCTGGAGGACC    1         13
1   chr2    3411794  T   C                1          1
2   chr1    1043223  CCT C                3          1
3   chr3    22511093 CTT CT               3          2
4   chr1    24454031 CT  CTT              2          3

I want to add new columns to the dataframe, derived from the existing column values, so that the result looks like this:

Positions             Allele         Combined
1:939570-939570       CCCTGGAGGACC   1:939570-939570:CCCTGGAGGACC
2:3411794-3411794     C              2:3411794-3411794:C
1:1043223-1043225     -              1:1043223-1043225:-
3:22511093-22511095   -              3:22511093-22511095:-
1:24454031-24454032   T              1:24454031-24454032:T

df1['Positions'] is generated from the values in CHROM and POS, taking into account the change between REF and ALT.

df1['Allele'] is derived from REF and ALT.

1 Answer

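  1. Positions column: use str.replace with the pattern \D+ to remove the non-digit characters from the CHROM column, then build the rest of the string as needed.
  2. Allele column: use apply to compare ALT and Len_REF row by row, slicing ALT at the index given by Len_REF. Make sure to pass axis=1.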
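The two building blocks can be tried in isolation first. This is a minimal sketch, assuming pandas is installed; the small frame demo is a hypothetical example and not data from the question:

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the two steps.
demo = pd.DataFrame({'CHROM': ['chr1', 'chr12'],
                     'ALT': ['TCC', 'C'],
                     'Len_REF': [1, 1]})

# Step 1: \D+ matches runs of non-digit characters. regex=True is needed
# in recent pandas versions, where str.replace defaults to literal matching.
chrom_num = demo['CHROM'].str.replace(r'\D+', '', regex=True)

# Step 2: axis=1 passes each row to the lambda, so ALT can be sliced
# starting at the index given by that row's Len_REF.
tail = demo.apply(lambda row: row['ALT'][row['Len_REF']:], axis=1)

print(chrom_num.tolist())  # ['1', '12']
print(tail.tolist())       # ['CC', '']
```

Note that a chromosome name with no digits, such as chrX, would become an empty string under this pattern.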

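df2 = df1.copy()  # assumption: df2 is a copy of the question's df1
# Keep only the digits of CHROM, then append POS and the end coordinate (POS + Len_REF - 1)
df2['Positions'] = (df2['CHROM'].str.replace(r'\D+', '', regex=True)
                    + ':' + df2['POS'].astype(str)
                    + '-' + (df2['POS'] + df2['Len_REF'] - 1).astype(str))
# Slice ALT from index Len_REF onward, row by row; an empty result becomes '-'
df2['Allele'] = df2.apply(lambda x: x['ALT'][x['Len_REF']:], axis=1).replace('', '-')
df2['Combined'] = df2['Positions'] + ':' + df2['Allele']
df2.iloc[:, -3:]  # show only the three new columns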

Out[1]: 
             Positions        Allele                      Combined
0      1:939570-939570  CCCTGGAGGACC  1:939570-939570:CCCTGGAGGACC
1    2:3411794-3411794             -           2:3411794-3411794:-
2    1:1043223-1043225             -           1:1043223-1043225:-
3  3:22511093-22511095             -         3:22511093-22511095:-
4  1:24454031-24454032             T         1:24454031-24454032:T
Answered 2021-04-09T17:15:00.113
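
The output above gives - for row 1 (REF T, ALT C), whereas the table in the question asks for C there. One possible way to reproduce the requested table is to strip REF from ALT only when ALT actually starts with REF. This is a minimal sketch, assuming that rule is what the question intends; the helper name allele_from is hypothetical:

```python
import pandas as pd

df1 = pd.DataFrame({
    'CHROM': ['chr1', 'chr2', 'chr1', 'chr3', 'chr1'],
    'POS': [939570, 3411794, 1043223, 22511093, 24454031],
    'REF': ['T', 'T', 'CCT', 'CTT', 'CT'],
    'ALT': ['TCCCTGGAGGACC', 'C', 'C', 'CT', 'CTT'],
})

def allele_from(ref: str, alt: str) -> str:
    """Hypothetical rule inferred from the question's expected table."""
    if alt.startswith(ref) and len(alt) > len(ref):
        return alt[len(ref):]   # insertion: keep only the inserted bases
    if len(ref) > len(alt):
        return '-'              # deletion: no inserted bases to report
    return alt                  # same length (e.g. a substitution): keep ALT

alleles = [allele_from(r, a) for r, a in zip(df1['REF'], df1['ALT'])]
print(alleles)  # ['CCCTGGAGGACC', 'C', '-', '-', 'T']
```

The Positions and Combined columns can then be built exactly as in the answer, since they do not depend on this rule.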