I am trying to import a tab-delimited SAM file as a pandas data frame.

NB501670:42:HJL7WAFXX:1:11209:17120:18358   83  chr1    13182   0   86M =   13178   -90 CAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAGGCAGGGCCATCGGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTG  EEEAEEE6EEEAE//EEAEEEAAA/EEEEEAEAEEEEEAEEEEEEE//EEAAEAEEEEEEEEEAEEEEEE/EEEEEEEEEEAEEEE  MC:Z:7S90M20S   MD:Z:50A35  RG:Z:Sample NM:i:1  AS:i:81 XS:i:81 RX:Z:TCCAAGAA
NB501670:42:HJL7WAFXX:3:11411:9444:15777    83  chr1    19434   0   20M =   19335   -119    GGTGGAGGGGCTGCAGACTC    AEAAE/EEEEE/AEEAEE/E    MC:Z:20S39M MD:Z:20 RG:Z:Sample NM:i:0  AS:i:20 XS:i:20 RX:Z:TACTCTTC
NB501670:42:HJL7WAFXX:1:11212:2247:4550 99  chr1    22984   0   115M8S  =   22984   115 TCTTCCCTAGGTGTCCCTCGGGCACATTTAGCACAAAGATAAGCACAAAAGGTGCATCCAGCACTTTGTTACTATTGGTGGCAGGTTTATGAATGGCAACCAAAGGCAGTGTACGTCCTCACT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEAEEEAEEEAEEAEEEEAEEEEEEEEEEEAEEEEAEEEAAAEEE XA:Z:chr9,+23097,115M8S,0;chr19,+64592,115M8S,0;chr15,-102508066,8S115M,1;chr2,-114347920,8S115M,3;chr12,-80579,8S115M,3;   MC:Z:18S115M8S  MD:Z:115    RG:Z:Sample NM:i:0  AS:i:115    XS:i:115    RX:Z:TCTCATCT
NB501670:42:HJL7WAFXX:3:11508:18628:11422   99  chr1    22984   0   115M8S  =   22984   115 TCTTCCCTAGGTGTCCCTCGGGCACATTTAGCACAAAGATAAGCACAAAAGGTGCATCCAGCACTTTGTTACTATTGGTGGCAGGTTTATGAATGGCAACCAAAGGCAGTGTACGTCCTCACT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEE/EEEEEEAAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEA/EAAAEAAA XA:Z:chr19,+64592,115M8S,0;chr9,+23097,115M8S,0;chr15,-102508066,8S115M,1;chr2,-114347920,8S115M,3;chr12,-80579,8S115M,3;   MC:Z:18S115M8S  MD:Z:115    RG:Z:Sample NM:i:0  AS:i:115    XS:i:115    RX:Z:TCTCATCT
NB501670:42:HJL7WAFXX:2:21203:5598:10862    83  chr1    25804   0   130M    =   25783   -151    AGTGGGGCCCTTGGTTGCAACACAAGTAGGTGGGGATGGATGAGTGTGGCATGAAGGGCCTAGGAGATTTCACTTGGGTTTAAAATGCTGTGACCTTGAGTAAGTTGCCGTCTCTGAATCTGATCCTTTC  EEEAAEEEEEEAEAEE<EEE/EEAEEEEEEEE/AAEEEEEEEEEEEEEEEE<EEEEEAEEAEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  XA:Z:chr19,-67412,130M,1;chr12,+77744,130M,2;chr15,+102505230,130M,2;chr9,-25917,130M,2;chr2,+114345085,130M,2; MC:Z:9S142M MD:Z:31A98  RG:Z:Sample NM:i:1  AS:i:125    XS:i:125    RX:Z:GTTCGATA
NB501670:42:HJL7WAFXX:1:21308:24556:17558   83  chr1    25843   0   5M1I111M    =   25848   -111    ATGAGATGTGGCATGAAGGCCCTAGGAGATTTCACTTGGGTTTAAAATGCTGTGACCTTGAGTAAGTTTCCGTCTCTGAATCTGATCCTTTCGATTTCCCATTCTCCAAACTGAGAA   AA<E<EEAA<A<<EE6A/AEEAEAAAEAEEAAEEEAAA6EEEEAEEEEEEEA<EEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE   XA:Z:chr9,-25956,5M1I111M,3;chr19,-67451,5M1I111M,3;chr12,+77719,111M1I5M,4;chr15,+102505205,111M1I5M,4;    MC:Z:8S111M20S  MD:Z:18G48G48   RG:Z:Sample NM:i:3  AS:i:101    XS:i:101    RX:Z:TGTGGTAT

Below is the code I use to read the file as a data frame.

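import pandas as pd

InSamFile = r'truq_chr1_10M.R1.sorted.txt'
max_n = 22  # upper bound on the number of tab-separated fields per line
# Read every field as a string; shorter rows are padded with NaN on the right.
# Note: comment='#' truncates any line whose QUAL string contains a '#'.
df = pd.read_csv(InSamFile, sep='\t', comment='#', dtype=str, names=range(max_n))
df.head()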

The code above imports the file as shown below. [Screenshot: output of df.head()]

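In the data frame, the first 11 columns are present in every row, so they are imported correctly. The optional tags that follow are not: when a tag such as MC:Z:xxxxx is present in only some rows, it ends up in the same column that holds MD:Z:xxxx in other rows. As a result, some columns are shifted during import.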

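Could you suggest how to make read_csv check the start of each value? For example, if a value starts with MD, it should go into the MD column, and likewise for RG, NM, and so on. If a particular tag is not found in a row, that cell should be NA or left empty. The first 11 columns can be skipped for this check, because they always appear in the correct order in every row. This way, for most rows, the column with the MC tag will be empty.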
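One way to get this behaviour is to skip read_csv for the optional part and build the rows by hand. The sketch below is only an illustration under a few assumptions: the function name read_sam_with_tags is made up for this example, the first 11 columns are named after the mandatory fields of the SAM specification, and each optional field has the form TAG:TYPE:VALUE with every tag appearing at most once per line.

```python
import pandas as pd

# Names of the 11 mandatory SAM fields (taken from the SAM specification).
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def read_sam_with_tags(handle):
    """Hypothetical helper: one DataFrame column per optional SAM tag."""
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        # The mandatory fields are always in the same order.
        record = dict(zip(SAM_COLUMNS, fields[:11]))
        for tag_field in fields[11:]:
            if tag_field.count(":") < 2:
                continue  # not a TAG:TYPE:VALUE field, ignore it
            # maxsplit=2 keeps any ':' inside the value intact.
            tag, _tag_type, value = tag_field.split(":", 2)
            record[tag] = value
        records.append(record)
    # Tags missing from a row become NaN in that row.
    return pd.DataFrame(records)


# Usage (file name taken from the question):
# with open(r'truq_chr1_10M.R1.sorted.txt') as handle:
#     df = read_sam_with_tags(handle)
```

All values stay strings here, as with dtype=str in the original call; numeric tags such as NM or AS can be converted afterwards with pd.to_numeric if needed.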

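Any suggestion would be much appreciated, whether it applies while reading the file or later when processing the data frame. I could do this with awk by reading the columns one by one, matching the start of each value (MD, MC, etc.), and assigning it accordingly, but I am new to Python and looking for help.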
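For the "later, when processing the data frame" route, the shifted columns can be realigned after the read_csv call from the question. This is a rough sketch, not a definitive solution: spread_tags is a hypothetical name, it assumes the data frame was read with dtype=str and integer column names as above, and it assumes that no tag occurs twice in the same row (pivot raises an error otherwise).

```python
import pandas as pd


def spread_tags(df, n_fixed=11):
    """Hypothetical helper: realign TAG:TYPE:VALUE cells, one column per tag."""
    fixed = df.iloc[:, :n_fixed]
    # Stack the optional columns into one long Series and drop empty cells.
    long = df.iloc[:, n_fixed:].stack().dropna()
    # Split "MD:Z:50A35" into tag, type and value; n=2 keeps ':' inside values.
    parts = long.str.split(":", n=2, expand=True)
    parts.columns = ["tag", "type", "value"]
    # Remember which original row each cell came from.
    parts["row"] = parts.index.get_level_values(0)
    # Back to wide form: one row per read, one column per tag.
    tags = parts.pivot(index="row", columns="tag", values="value")
    # Rows that lack a tag get NaN in that tag's column.
    return fixed.join(tags)


# Usage with the data frame from the question:
# aligned = spread_tags(df)
```

The first 11 columns keep their integer names 0 to 10, while the new columns are named after the tags (MC, MD, XA, and so on).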

Amit
