I have data in a format like this:

ATOM 124 N GLU B 12
ATOM 125 O GLU B 12
ATOM 126 OE1 GLU B 12
ATOM 127 C GLU B 12
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
ATOM 133 C GLU B 15
ATOM 134 CA GLU B 15
ATOM 135 OE2 GLU B 15
ATOM 136 O GLU B 15
             .....100+ lines

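From here, I want to filter the data based on col[5] (columns counted from 0) and col[2]. If, for a given col[5] value, OE1 or OE2 appears only once (that is, only one of the two is present), that group of rows should be discarded. For every col[5] value where both OE1 and OE2 are present, the group should be kept. Desired data after filtering: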

ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14

I have tried using a search_string like:

for item in stored_list:
    search_str_a = 'OE1'+item[3]+item[4]+item[5]
    search_str_b = 'OE2'+item[3]+item[4]+item[5]
    target_str = item[2]+item[3]+item[4]+item[5]

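This keeps the other columns matched while searching for OE1 or OE2, but it does not help with filtering out a group when one (or both) of them is missing.

Any ideas here would be much appreciated.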

2 Answers

The code below requires pandas, which you can get from http://pandas.pydata.org/pandas-docs/stable/install.html

import pandas as pd

file_read_path = "give here source file path"
columns = ["col0", "col1", "col2", "col3", "col4", "col5"]
df = pd.read_csv(file_read_path, sep=" ", names=columns)

# For each col5 value, join all of its col2 names into one string
group_series = df.groupby("col5")["col2"].apply(lambda x: ", ".join(x))

# Collect the col5 values whose group contains both OE1 and OE2
filtered_list = []
for index in group_series.index:
    str_col2_group = group_series[index]
    if "OE1" in str_col2_group and "OE2" in str_col2_group:
        filtered_list.append(index)

# Keep only the rows that belong to those groups and write them out
df = df[df.col5.isin(filtered_list)]
output_file_path = "give here output file path"
df.to_csv(output_file_path, sep=" ", index=False, header=False)

This may be helpful: http://pandas.pydata.org/pandas-docs/stable/tutorials.html

Output

ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
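
As a possible variation on the approach above, the group test can be written with `groupby(...).filter(...)` and a set comparison instead of a substring check on a joined string. This is only a sketch: the `SAMPLE` text, `COLUMNS`, `REQUIRED` and `kept` names are made up for illustration, and it assumes the fields are separated by exactly one space, as in the question.

```python
import io

import pandas as pd

# Sample rows copied from the question (illustrative stand-in for the real file)
SAMPLE = """\
ATOM 124 N GLU B 12
ATOM 125 O GLU B 12
ATOM 126 OE1 GLU B 12
ATOM 127 C GLU B 12
ATOM 128 O GLU B 14
ATOM 129 N GLU B 14
ATOM 130 OE1 GLU B 14
ATOM 131 OE2 GLU B 14
ATOM 132 CA GLU B 14
ATOM 133 C GLU B 15
ATOM 134 CA GLU B 15
ATOM 135 OE2 GLU B 15
ATOM 136 O GLU B 15
"""

COLUMNS = ["col0", "col1", "col2", "col3", "col4", "col5"]
REQUIRED = {"OE1", "OE2"}  # names that must all appear within a col5 group

df = pd.read_csv(io.StringIO(SAMPLE), sep=" ", names=COLUMNS)

# Keep only the col5 groups whose col2 values include every required name
kept = df.groupby("col5").filter(lambda g: REQUIRED.issubset(g["col2"]))

print(kept.to_csv(sep=" ", index=False, header=False))
```

Comparing whole names in a set avoids accidental matches that a substring test could produce if a longer atom name happened to contain "OE1" or "OE2".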
Answered 2015-08-09T08:30:23.293


Use csv, which ships with Python

import csv
import operator

file_read_path = "give here source file path"
with open(file_read_path) as f_pdb:
    rdr = csv.DictReader(f_pdb,delimiter=' ', fieldnames = ["col0","col1","col2","col3","col4","col5"])
    sorted_bio = sorted(rdr,key=operator.itemgetter('col5'),reverse=False)
    col5_tmp = None
    tmp_list = []
    perm_list = []
    tmp_str = ""
    col5_v = ""
    for row in sorted_bio:
        col5_v = row["col5"]
        if col5_v != col5_tmp:
            if "OE1" in tmp_str and "OE2" in tmp_str:
                perm_list.extend(tmp_list)
            tmp_list = []
            tmp_str = ""
            col5_tmp = col5_v
        tmp_list.append(row)
        tmp_str = tmp_str +","+ row["col2"]

    # Flush the last group: inside the loop a group is only checked when the next one starts
    if "OE1" in tmp_str and "OE2" in tmp_str:
        perm_list.extend(tmp_list)


csv_file = open("give here output file path","w")
dict_writer = csv.DictWriter(csv_file,delimiter=' ', fieldnames = ["col0","col1","col2","col3","col4","col5"])
for row in perm_list:
    dict_writer.writerow(row)
csv_file.close()
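
If sorting is not needed, the same idea could also be sketched with two passes and a dictionary of sets, which keeps the rows in their original order. The `filter_rows` function and the `REQUIRED` constant below are hypothetical names, and the sketch assumes whitespace-separated fields with at least six columns per data row; shorter lines are skipped.

```python
from collections import defaultdict

REQUIRED = {"OE1", "OE2"}  # names that must all appear within a col5 group


def filter_rows(lines, required=REQUIRED):
    """Return the split rows whose col5 group contains every required col2 name."""
    # Split each line on whitespace and ignore lines with fewer than six fields
    rows = [fields for fields in (line.split() for line in lines) if len(fields) >= 6]

    # First pass: collect the set of col2 names seen for each col5 value
    names_by_group = defaultdict(set)
    for row in rows:
        names_by_group[row[5]].add(row[2])

    # Second pass: keep rows from groups that contain all required names
    return [row for row in rows if required <= names_by_group[row[5]]]


sample_lines = [
    "ATOM 124 N GLU B 12",
    "ATOM 126 OE1 GLU B 12",
    "ATOM 128 O GLU B 14",
    "ATOM 130 OE1 GLU B 14",
    "ATOM 131 OE2 GLU B 14",
    "ATOM 135 OE2 GLU B 15",
    "ATOM 136 O GLU B 15",
]

result = filter_rows(sample_lines)
for row in result:
    print(" ".join(row))
```

Because the grouping is done through a dictionary rather than by comparing neighbouring rows, there is no final group that needs to be flushed after the loop.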
Answered 2015-08-09T09:48:27.380