I am trying to import a tab-delimited SAM file as a pandas dataframe.
NB501670:42:HJL7WAFXX:1:11209:17120:18358 83 chr1 13182 0 86M = 13178 -90 CAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAGGCAGGGCCATCGGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTG EEEAEEE6EEEAE//EEAEEEAAA/EEEEEAEAEEEEEAEEEEEEE//EEAAEAEEEEEEEEEAEEEEEE/EEEEEEEEEEAEEEE MC:Z:7S90M20S MD:Z:50A35 RG:Z:Sample NM:i:1 AS:i:81 XS:i:81 RX:Z:TCCAAGAA
NB501670:42:HJL7WAFXX:3:11411:9444:15777 83 chr1 19434 0 20M = 19335 -119 GGTGGAGGGGCTGCAGACTC AEAAE/EEEEE/AEEAEE/E MC:Z:20S39M MD:Z:20 RG:Z:Sample NM:i:0 AS:i:20 XS:i:20 RX:Z:TACTCTTC
NB501670:42:HJL7WAFXX:1:11212:2247:4550 99 chr1 22984 0 115M8S = 22984 115 TCTTCCCTAGGTGTCCCTCGGGCACATTTAGCACAAAGATAAGCACAAAAGGTGCATCCAGCACTTTGTTACTATTGGTGGCAGGTTTATGAATGGCAACCAAAGGCAGTGTACGTCCTCACT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEAEEEAEEEAEEAEEEEAEEEEEEEEEEEAEEEEAEEEAAAEEE XA:Z:chr9,+23097,115M8S,0;chr19,+64592,115M8S,0;chr15,-102508066,8S115M,1;chr2,-114347920,8S115M,3;chr12,-80579,8S115M,3; MC:Z:18S115M8S MD:Z:115 RG:Z:Sample NM:i:0 AS:i:115 XS:i:115 RX:Z:TCTCATCT
NB501670:42:HJL7WAFXX:3:11508:18628:11422 99 chr1 22984 0 115M8S = 22984 115 TCTTCCCTAGGTGTCCCTCGGGCACATTTAGCACAAAGATAAGCACAAAAGGTGCATCCAGCACTTTGTTACTATTGGTGGCAGGTTTATGAATGGCAACCAAAGGCAGTGTACGTCCTCACT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEE/EEEEEEAAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEA/EAAAEAAA XA:Z:chr19,+64592,115M8S,0;chr9,+23097,115M8S,0;chr15,-102508066,8S115M,1;chr2,-114347920,8S115M,3;chr12,-80579,8S115M,3; MC:Z:18S115M8S MD:Z:115 RG:Z:Sample NM:i:0 AS:i:115 XS:i:115 RX:Z:TCTCATCT
NB501670:42:HJL7WAFXX:2:21203:5598:10862 83 chr1 25804 0 130M = 25783 -151 AGTGGGGCCCTTGGTTGCAACACAAGTAGGTGGGGATGGATGAGTGTGGCATGAAGGGCCTAGGAGATTTCACTTGGGTTTAAAATGCTGTGACCTTGAGTAAGTTGCCGTCTCTGAATCTGATCCTTTC EEEAAEEEEEEAEAEE<EEE/EEAEEEEEEEE/AAEEEEEEEEEEEEEEEE<EEEEEAEEAEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE XA:Z:chr19,-67412,130M,1;chr12,+77744,130M,2;chr15,+102505230,130M,2;chr9,-25917,130M,2;chr2,+114345085,130M,2; MC:Z:9S142M MD:Z:31A98 RG:Z:Sample NM:i:1 AS:i:125 XS:i:125 RX:Z:GTTCGATA
NB501670:42:HJL7WAFXX:1:21308:24556:17558 83 chr1 25843 0 5M1I111M = 25848 -111 ATGAGATGTGGCATGAAGGCCCTAGGAGATTTCACTTGGGTTTAAAATGCTGTGACCTTGAGTAAGTTTCCGTCTCTGAATCTGATCCTTTCGATTTCCCATTCTCCAAACTGAGAA AA<E<EEAA<A<<EE6A/AEEAEAAAEAEEAAEEEAAA6EEEEAEEEEEEEA<EEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE XA:Z:chr9,-25956,5M1I111M,3;chr19,-67451,5M1I111M,3;chr12,+77719,111M1I5M,4;chr15,+102505205,111M1I5M,4; MC:Z:8S111M20S MD:Z:18G48G48 RG:Z:Sample NM:i:3 AS:i:101 XS:i:101 RX:Z:TGTGGTAT
Below is the code I use to read the file into a dataframe.
import pandas as pd

InSamFile = r'truq_chr1_10M.R1.sorted.txt'
max_n = 22  # upper bound on the number of columns in any row
df = pd.read_csv(InSamFile, sep='\t', comment='#', dtype=str, names=range(max_n))
df.head()
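In the dataframe, the first 11 columns are present in every row, so they are imported correctly. In some rows, however, an MC:Z:xxxxx tag is present and lands in the same column that holds the MD:Z:xxxx tag in other rows. As a result, the columns after it are shifted during import.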
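Could you suggest how to use read_csv so that it checks the start of each field? For example, a field that starts with MD should go into an MD column, and likewise for RG, NM and so on, with NA or an empty value wherever a row lacks a given tag. The first 11 columns can be skipped in this check because they always appear in the same order in every row. With this layout, the MC column would be empty for most rows.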
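A minimal sketch of the layout described above, assuming it is acceptable to bypass read_csv and split each line by hand. The helper name read_sam_records and the two shortened demo records are hypothetical, and the column names for the 11 mandatory fields are taken from the SAM specification rather than from the original script.

```python
import io
import pandas as pd

# Names of the 11 mandatory SAM fields (from the SAM specification).
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def read_sam_records(handle):
    """Hypothetical helper: one column per optional tag, NaN where a tag is absent."""
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):  # skip blank lines and SAM header lines
            continue
        fields = line.split("\t")
        row = dict(zip(MANDATORY, fields[:11]))
        for tag_field in fields[11:]:
            # Optional fields look like TAG:TYPE:VALUE; VALUE may itself contain ':'.
            parts = tag_field.split(":", 2)
            if len(parts) != 3:  # ignore anything that is not a well-formed tag
                continue
            tag, _type, value = parts
            row[tag] = value
        rows.append(row)
    # A list of dicts becomes a frame whose columns are the union of all keys.
    return pd.DataFrame(rows)

# Two shortened, made-up records: only the second carries an XA tag.
demo = io.StringIO(
    "r1\t83\tchr1\t13182\t0\t86M\t=\t13178\t-90\tCAG\tEEE"
    "\tMC:Z:7S90M20S\tMD:Z:50A35\tNM:i:1\n"
    "r2\t99\tchr1\t22984\t0\t115M8S\t=\t22984\t115\tTCT\tEEE"
    "\tXA:Z:chr9,+23097,115M8S,0;\tMC:Z:18S115M8S\tMD:Z:115\tNM:i:0\n"
)
df = read_sam_records(demo)
print(df[["QNAME", "XA", "MC", "MD", "NM"]])
```

For the real file, the same function could be called on an open file handle, for example inside `with open(InSamFile) as fh:`. All values stay as strings, matching the dtype=str choice in the read_csv call.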
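Any suggestions, whether applied while reading the file or afterwards on the dataframe, would be much appreciated. I could do this in awk by reading the fields one by one, matching the start of each (MD, MC, etc.) and assigning it accordingly, but I am new to Python and looking for help.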
Amit