
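I am trying to merge two text-format (PDB) files. The larger one contains the complete data set describing a protein; the second contains a very small data set that changes only a small part of it (a set of coordinates).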

Example:

Base file (excerpt):

ATOM    605  CD2 LEU A  92      11.727  14.051  55.011  1.00 75.51      4pxz C  
ATOM    606  N   ARG A  93      10.555  10.636  58.260  1.00 62.79      4pxz N  
ATOM    607  CA  ARG A  93      11.357   9.429  58.493  1.00 59.89      4pxz C  
ATOM    608  C   ARG A  93      10.429   8.207  58.562  1.00 62.83      4pxz C  
ATOM    609  O   ARG A  93      10.760   7.168  57.994  1.00 61.39      4pxz O  
ATOM    610  CB  ARG A  93      12.236   9.564  59.757  1.00 58.23      4pxz C  
ATOM    611  CG  ARG A  93      13.088   8.333  60.120  1.00 60.51      4pxz C  
ATOM    612  CD  ARG A  93      13.985   7.822  58.995  1.00 61.21      4pxz C  
ATOM    613  NE  ARG A  93      14.503   6.485  59.295  1.00 60.36      4pxz N  
ATOM    614  CZ  ARG A  93      15.012   5.642  58.400  1.00 66.21      4pxz C  
ATOM    615  NH1 ARG A  93      15.074   5.979  57.116  1.00 52.54      4pxz N  
ATOM    616  NH2 ARG A  93      15.455   4.453  58.780  1.00 48.93      4pxz N  
ATOM    617  N   THR A  94       9.247   8.357  59.192  1.00 60.68      4pxz N  
ATOM    618  CA  THR A  94       8.227   7.305  59.271  1.00 59.92      4pxz C

Secondary file (with the set of coordinates to substitute):

ATOM     39  CA  ARG A  93      11.357   9.429  58.493  1.00 59.89      hatp C  
ATOM     40  CB  ARG A  93      12.236   9.564  59.757  1.00 58.23      hatp C  
ATOM     41  CG  ARG A  93      11.569   9.166  61.087  1.00 60.51      hatp C  
ATOM     42  CD  ARG A  93      12.319   8.102  61.886  1.00 61.21      hatp C  
ATOM     43  NE  ARG A  93      11.978   6.754  61.425  1.00 60.36      hatp N  
ATOM     44  CZ  ARG A  93      11.731   5.714  62.217  1.00 66.21      hatp C  
ATOM     45  NH2 ARG A  93      11.430   4.535  61.694  1.00 48.93      hatp N  
ATOM     46  NH1 ARG A  93      11.793   5.843  63.538  1.00 52.54      hatp N  

Expected result (changed coordinates are marked -> ... <-):

ATOM    604  CD1 LEU A  92       9.685  13.033  54.000  1.00 73.10      4pxz C
ATOM    605  CD2 LEU A  92      11.727  14.051  55.011  1.00 75.51      4pxz C
ATOM    606  N   ARG A  93      10.555  10.636  58.260  1.00 62.79      4pxz N
ATOM    607  CA  ARG A  93   -> 11.357   9.429  58.493<- 1.00 59.89      4pxz C
ATOM    608  C   ARG A  93      10.429   8.207  58.562  1.00 62.83      4pxz C
ATOM    609  O   ARG A  93      10.760   7.168  57.994  1.00 61.39      4pxz O
ATOM    610  CB  ARG A  93   -> 12.236   9.564  59.757<- 1.00 58.23      4pxz C
ATOM    611  CG  ARG A  93   -> 11.569   9.166  61.087<- 1.00 60.51      4pxz C
ATOM    612  CD  ARG A  93   -> 12.319   8.102  61.886<- 1.00 61.21      4pxz C
ATOM    613  NE  ARG A  93   -> 11.978   6.754  61.425<- 1.00 60.36      4pxz N
ATOM    614  CZ  ARG A  93   -> 11.731   5.714  62.217<- 1.00 66.21      4pxz C
ATOM    615  NH1 ARG A  93   -> 11.793   5.843  63.538<- 1.00 52.54      4pxz N
ATOM    616  NH2 ARG A  93   -> 11.430   4.535  61.694<- 1.00 48.93      4pxz N
ATOM    617  N   THR A  94       9.247   8.357  59.192  1.00 60.68      4pxz N
ATOM    618  CA  THR A  94       8.227   7.305  59.271  1.00 59.92      4pxz C

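I tried to do this by:

  • Building a list for each of the two files, appending every line as a single entry

  • Extracting the atom type, residue name, chain, and residue number (e.g. CD1, LEU, A, 92, respectively) from both files and appending them to another list

  • Comparing the extracted lists

  • Writing a file containing the mixed list from point 1, based on the comparison in point 3.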
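The extraction step in the list above may not need a regular expression at all if both files follow the fixed-column PDB layout, which the excerpts appear to do. The following is a minimal sketch under that assumption; `atom_key` is a hypothetical helper name, not something from the code in this question.

```python
def atom_key(line):
    """Return (atom name, residue name, chain ID, residue number) for one ATOM line.

    Assumes the standard fixed-column PDB layout; a file with shifted
    columns would need different slice positions.
    """
    return (
        line[12:16].strip(),  # atom name, columns 13-16
        line[17:20].strip(),  # residue name, columns 18-20
        line[21],             # chain ID, column 22
        int(line[22:26]),     # residue sequence number, columns 23-26
    )


line = "ATOM    611  CG  ARG A  93      13.088   8.333  60.120  1.00 60.51      4pxz C"
patch_line = "ATOM     41  CG  ARG A  93      11.569   9.166  61.087  1.00 60.51      hatp C"

print(atom_key(line))                          # ('CG', 'ARG', 'A', 93)
print(atom_key(line) == atom_key(patch_line))  # True
```

Because the key is a plain tuple, it can be compared directly or used as a dictionary key, so the two files can be matched even though their serial numbers (611 vs. 41) differ.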

My code:

import re

aminoacid_pattern = re.compile(r"\w.{2,3}.\b(\w[A-Z]\w*)\b\s.\s\d+")
coords_pattern = re.compile(r"\w.{2,3}.\b(\w[A-Z]\w*)\b\s.\s\d+")

class fileSaver:
    protein = "4pxzclean.pdb"
    flexres = "ARGA93.pdb.tmp"
    def __init__(self):
        pass

    def aminoacid_to_substitute(self, flexres, data = []):

        with open(flexres, 'r') as flex:
                for line in flex:
                    if aminoacid_pattern != None:
                        data.append(line)
        return data

    def parse_rigid(self, rigidprot, test = []):

        with open(rigidprot, 'r') as rigid:
            for line in rigid:
                if aminoacid_pattern != None:
                    test.append(line)
        return test

class fileComparer:
    def __init__(self):
        pass

    def compare_data(self, data_flex, data_rigid, cleanflex = [], cleanrigid = []):

        for el in data_flex:
            if aminoacid_pattern != None:
                cleanflex.append(re.findall(r".\w\s+\w.{2,3}\s\w\s*\d{2,3}",str(el)))


        for el in data_rigid:
            if aminoacid_pattern != None:
                cleanrigid.append(re.findall(r".\w\s+\w.{2,3}\s\w\s*\d{2,3}",str(el)))

        with open("test.txt", 'a+') as test:
            for rig_el in data_rigid:
                for flex_el in data_flex:
                    for rg_el in cleanrigid:
                        if rg_el not in cleanflex:
                            test.write(rig_el)
                        if rg_el in cleanflex:
                            test.write(flex_el)



if __name__ == '__main__':
    initialize = fileSaver()
    flex = initialize.aminoacid_to_substitute("ARGA93.pdb.tmp")
    rigid = initialize.parse_rigid("4pxzclean.pdb")
    comparer = fileComparer()
    comparer.compare_data(flex,rigid)

Unfortunately, it produces an infinitely long file without any of the lines changed. Can you tell me where I went wrong?
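A few things in the code above likely explain the symptom. `aminoacid_pattern != None` compares the compiled pattern object itself with `None`, which is always true, so no line is ever filtered. The three nested loops in `compare_data` write one line for every combination of entries, and `'a+'` appends to `test.txt` on every run, so the output grows very quickly. One possible alternative is a single pass over the base file with a dictionary lookup. This is only a sketch, assuming fixed-column PDB lines; `merge_coordinates` is a hypothetical name.

```python
def merge_coordinates(base_lines, patch_lines):
    """Return base_lines with x, y, z taken from patch_lines for matching atoms."""
    # Columns 13-26 hold atom name, residue name, chain ID and residue number,
    # so that slice is used as the atom's identity (an assumption about layout).
    new_coords = {
        line[12:26]: line[30:54]  # columns 31-54 hold x, y and z
        for line in patch_lines
        if line.startswith(("ATOM", "HETATM"))
    }
    merged = []
    for line in base_lines:
        if line.startswith(("ATOM", "HETATM")) and line[12:26] in new_coords:
            # Splice in the new coordinates; the serial number and all
            # other fields of the base line are kept unchanged.
            line = line[:30] + new_coords[line[12:26]] + line[54:]
        merged.append(line)
    return merged


base = [
    "ATOM    608  C   ARG A  93      10.429   8.207  58.562  1.00 62.83      4pxz C\n",
    "ATOM    611  CG  ARG A  93      13.088   8.333  60.120  1.00 60.51      4pxz C\n",
]
patch = [
    "ATOM     41  CG  ARG A  93      11.569   9.166  61.087  1.00 60.51      hatp C\n",
]
merged = merge_coordinates(base, patch)
print("".join(merged), end="")
```

With real files, `base` and `patch` could come from `open("4pxzclean.pdb").readlines()` and `open("ARGA93.pdb.tmp").readlines()`, and the result could be written with `open("test.txt", "w").writelines(merged)`. Opening the output with `"w"` rather than `"a+"` keeps repeated runs from piling up in the same file.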
