I have a dataset of enzyme sequences and a target variable to predict.
My process is to convert the sequences to SMILES and then derive numeric inputs from them for a machine learning model.
The problem is that RDKit fails to convert some of the sequences, but not all of them. In this case the conversion stopped at index = 5, which corresponds to the following sequence: 'PQITLWQRPIVTIKIGGQLIEALLDTGADDTVLEXXNLPGRWKPKXIGGIGGFXKVRQYDQVPIEIXGHKTXSTVLVGPTPVNIIGRNLMTQIGCTLNFPISPIETVPVKLKPGMDGPKXKQWPLTEEKIKALMEICKELEEEGKISKIGPENPYNTPVFAIKKKNSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKRKKSVTVLDVGDAYFSIPLDKDFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYVDDLYVGSDLEIEQHRTKIKELRQYLWKWGFYTPDXKHQEEPPFHWXGYELHPDKWTVQPIVLPEKESWTVNDIQKLVGKLNWASQIYAGIKVKQLCKLLRG'

1 Answer

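It looks like the problem is that your sequence contains X. This is not an amino acid code but a placeholder for an unknown or non-standard amino acid. It seems RDKit cannot handle this case: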

from rdkit import Chem

# One-letter codes of the 20 standard amino acids
amino_acids = {'G', 'A', 'L', 'M', 'F', 'W', 'K', 'Q', 'E', 'S', 'P', 'V', 'I', 'C', 'Y', 'H', 'R', 'N', 'D', 'T'}
seq = 'PQITLWQRPIVTIKIGGQLIEALLDTGADDTVLEXXNLPGRWKPKXIGGIGGFXKVRQYDQVPIEIXGHKTXSTVLVGPTPVNIIGRNLMTQIGCTLNFPISPIETVPVKLKPGMDGPKXKQWPLTEEKIKALMEICKELEEEGKISKIGPENPYNTPVFAIKKKNSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKRKKSVTVLDVGDAYFSIPLDKDFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYVDDLYVGSDLEIEQHRTKIKELRQYLWKWGFYTPDXKHQEEPPFHWXGYELHPDKWTVQPIVLPEKESWTVNDIQKLVGKLNWASQIYAGIKVKQLCKLLRG'

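# Build a copy of the sequence that keeps only standard residues,
# reporting every character that gets dropped
edited_seq = ''
for aa in seq:
    if aa not in amino_acids:
        print('Non-standard/missing amino acid:', aa)
    else:
        edited_seq += aa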

m1 = Chem.MolFromSequence(seq)
m2 = Chem.MolFromSequence(edited_seq)

print('Read seq successfully:', m1 is not None)
print('Read edited_seq successfully:', m2 is not None)

[Out]:

Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Read seq successfully: False
Read edited_seq successfully: True

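When we remove the Xs, RDKit parses the sequence correctly. I'm not saying that simply deleting them is the right solution, only highlighting the problem. There may be better ways to handle these cases.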
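One possible way to act on this without silently deleting residues is to scan the whole dataset up front and set aside every sequence that contains a character outside the 20 standard codes. You can then decide per sequence whether to drop it, substitute the residue, or featurize it some other way. The following is only a minimal sketch in plain Python: `find_nonstandard`, `split_dataset`, and the toy sequences are hypothetical names and data made up for illustration, not part of RDKit.

```python
# Same 20 standard one-letter codes as in the snippet above.
STANDARD_AA = set("GALMFWKQESPVICYHRNDT")


def find_nonstandard(seq):
    """Return (position, residue) pairs for every residue outside STANDARD_AA."""
    return [(i, aa) for i, aa in enumerate(seq) if aa not in STANDARD_AA]


def split_dataset(sequences):
    """Split sequences into clean ones and ones that need a decision.

    Returns two lists:
      clean   -> (dataset index, sequence)
      flagged -> (dataset index, list of (position, residue))
    """
    clean, flagged = [], []
    for idx, seq in enumerate(sequences):
        problems = find_nonstandard(seq)
        if problems:
            flagged.append((idx, problems))
        else:
            clean.append((idx, seq))
    return clean, flagged


# Hypothetical toy data, not taken from the real dataset.
sequences = ["PQITLW", "LEXXNLP", "GAVK"]
clean, flagged = split_dataset(sequences)

print("Clean:", clean)
print("Flagged:", flagged)
```

Only the entries in `clean` would then be passed to `Chem.MolFromSequence`. Keeping the dataset index alongside each sequence makes it easier to keep the target variable aligned with the rows that survive.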

Answered 2021-02-23T09:07:34.400