I have a large text file from a drug database. Each drug entry begins with:
#BEGIN DRUGCARD _______
Within each drug entry there are several fields. The ones I'm interested in are:
# Drug_Target_1_ID
(some number, e.g. 3421)
# Drug_Target_1_Name NOTE: this is the corresponding name for the ID above
(some target, e.g. hormone receptor)
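So far I have written code that parses the file, finds these values of interest, and puts them into a table. The next step is a bit trickier, though:
Each drug has a different number of targets, i.e. some drugs have: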
`# Drug_Target_1_ID`
`# Drug_Target_1_Name`
`# Drug_Target_12_ID`
`# Drug_Target_12_Name`
...
`# Drug_Target_4_ID`
`# Drug_Target_4_Name`
(NOTE these are also not necessarily in numerical order!)
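The highest target number I have seen in the file is about 100, but most drugs have between 1 and 10.
There are duplicates! Each drug target ID is unique and corresponds to one target name, but some repeat across drugs, i.e. many drugs may share the same target, so the same ID/name shows up again.
What I want to do is build a master list/table of all unique IDs/names. So basically: scan every field of the form Drug_Target_#_ID / Drug_Target_#_Name and add each unique one to a table that I can manipulate later. In the original file the target IDs are in no particular order.
Do I have to write code that searches each drug card (#BEGIN DRUGCARD ____) and looks for each Drug_Target_# individually?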
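One way to avoid searching for each numbered label separately is a single regular expression that accepts any target number. The following is only a minimal sketch: the name `parse_target_label` is made up for illustration, and the optional trailing colon is an assumption, since the labels are written both with and without it in this post.

```python
import re

# Matches labels such as "# Drug_Target_12_ID" or "# Drug_Target_3_Name:".
# Assumption: the trailing colon is optional.
TARGET_LABEL = re.compile(r"^# Drug_Target_(\d+)_(ID|Name):?$")


def parse_target_label(line):
    """Return (target_number, kind) for a target label line, else None.

    parse_target_label is a hypothetical helper, not part of any library.
    """
    match = TARGET_LABEL.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_target_label("# Drug_Target_12_ID"))
print(parse_target_label("# Drug_Target_1_Name:"))
# Other per-target fields (HPRD_ID, Locus, ...) are deliberately not matched.
print(parse_target_label("# Drug_Target_1_HPRD_ID:"))
```

Because the number is captured rather than spelled out, the same pattern covers target 1 and target 100 alike, and the order of the labels in the file does not matter.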
My inefficient code:
TO_GET = ["# Drug_Target_1_ID", \
"# Drug_Target_2_ID", \
"# Drug_Target_3_ID", \
"# Drug_Target_4_ID", \
"# Drug_Target_5_ID", \
"# Drug_Target_6_ID", \
"# Drug_Target_7_ID", \
"# Drug_Target_8_ID", \
"# Drug_Target_9_ID", \
"# Drug_Target_10_ID", \
"# Drug_Target_11_ID", \
"# Drug_Target_12_ID", \
"# Drug_Target_13_ID", \
"# Drug_Target_14_ID", \
"# Drug_Target_15_ID", \
"# Drug_Target_16_ID", \
"# Drug_Target_17_ID", \
"# Drug_Target_18_ID", \
"# Drug_Target_19_ID", \
"# Drug_Target_20_ID", \
"# Drug_Target_21_ID", \
"# Drug_Target_22_ID", \
"# Drug_Target_23_ID", \
"# Drug_Target_24_ID", \
"# Drug_Target_25_ID", \
"# Drug_Target_26_ID", \
"# Drug_Target_27_ID", \
"# Drug_Target_28_ID", \
"# Drug_Target_29_ID", \
"# Drug_Target_30_ID", \
"# Drug_Target_1_Name", \
"# Drug_Target_2_Name", \
"# Drug_Target_3_Name", \
"# Drug_Target_4_Name", \
"# Drug_Target_5_Name", \
"# Drug_Target_6_Name", \
"# Drug_Target_7_Name", \
"# Drug_Target_8_Name", \
"# Drug_Target_9_Name", \
"# Drug_Target_10_Name", \
"# Drug_Target_11_Name", \
"# Drug_Target_12_Name", \
"# Drug_Target_13_Name", \
"# Drug_Target_14_Name", \
"# Drug_Target_15_Name", \
"# Drug_Target_16_Name", \
"# Drug_Target_17_Name", \
"# Drug_Target_18_Name", \
"# Drug_Target_19_Name", \
"# Drug_Target_20_Name", \
"# Drug_Target_21_Name", \
"# Drug_Target_22_Name", \
"# Drug_Target_23_Name", \
"# Drug_Target_24_Name", \
"# Drug_Target_25_Name", \
"# Drug_Target_26_Name", \
"# Drug_Target_27_Name", \
"# Drug_Target_28_Name", \
"# Drug_Target_29_Name", \
"# Drug_Target_30_Name", \
]
And then later:
try:
    target_name = drugbank_all[drug]["# Drug_Target_1_Name"]
    target_id = "INSERT INTO TargetID_TargetName (Drug_Target_1_ID, Drug_Target_1_Name) VALUES(\"%s\", \"%s\");" % (accession[0], target_name[0])
    cur.execute(target_id)
    print "target_id done"
except KeyError:
    print "Drug target 1 not found, skipping..."
    pass
...and so on, for every single one?
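Problems with this approach:
- Sloppy
- Doesn't ignore duplicates
- Doesn't know how many Drug_Targets there are; it just tries as many times as I list (which is not very clean)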
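The last two problems could be handled by discovering the target numbers from the keys that are actually present in each card. The sketch below assumes `drugbank_all` maps a drug ID to a dict of label -> list of values (which is what `target_name[0]` above suggests); `targets_in_card`, `collect_unique_targets` and the `DB_EXAMPLE` card are hypothetical and only here for illustration.

```python
import re

TARGET_LABEL = re.compile(r"^# Drug_Target_(\d+)_(ID|Name):?$")


def targets_in_card(card):
    """Group one card's target fields by target number.

    Assumption: `card` maps label -> list of values.
    Returns {target_number: {"ID": value, "Name": value}}.
    """
    targets = {}
    for label, values in card.items():
        match = TARGET_LABEL.match(label)
        if match and values:
            number, kind = int(match.group(1)), match.group(2)
            targets.setdefault(number, {})[kind] = values[0]
    return targets


def collect_unique_targets(drugbank_all):
    """Map each target ID to its name, keeping the first name seen."""
    unique = {}
    for card in drugbank_all.values():
        for fields in targets_in_card(card).values():
            target_id = fields.get("ID")
            if target_id is None or target_id in unique:
                continue  # incomplete entry, or target already recorded
            unique[target_id] = fields.get("Name")
    return unique


# Sample data: the first card uses values from the post, the second is made up.
drugbank_all = {
    "DB00097": {
        "# Drug_Target_1_ID": ["148"],
        "# Drug_Target_1_Name": ["Lutropin-choriogonadotropic hormone receptor"],
        "# Drug_Target_2_ID": ["430"],
        "# Drug_Target_2_Name": ["Follicle-stimulating hormone receptor"],
    },
    "DB_EXAMPLE": {  # hypothetical card that shares target 430
        "# Drug_Target_12_ID": ["430"],
        "# Drug_Target_12_Name": ["Follicle-stimulating hormone receptor"],
    },
}

for drug, card in drugbank_all.items():
    print(drug, "has", len(targets_in_card(card)), "target(s)")
print(collect_unique_targets(drugbank_all))
```

Nothing here hard-codes an upper bound such as 30 or 100: a card with one target and a card with a hundred go through the same loop.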
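Questions:
- Is there a way to write a for loop that first finds out how many Drug_Targets each DRUGCARD has, and then parses accordingly?
- Is there a way to quickly scan a list to see whether a number is already in it, and then ignore that entry?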
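For the second question, one option is to let the database reject repeats instead of scanning a list by hand. This is a sketch under assumptions: the post does not say which database is behind `cur`, so `sqlite3` is used as a stand-in, and the column names `target_id` and `target_name` as well as the helper `insert_target` are hypothetical. The `?` placeholder and `INSERT OR IGNORE` are SQLite syntax; other drivers and databases spell these differently.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real one (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE TargetID_TargetName ("
    "target_id TEXT PRIMARY KEY, target_name TEXT)"
)


def insert_target(cur, target_id, target_name):
    """Insert one target; a repeated target_id is silently skipped.

    Values are passed as parameters rather than formatted into the SQL
    string, so quotes inside a name cannot break the statement.
    """
    cur.execute(
        "INSERT OR IGNORE INTO TargetID_TargetName (target_id, target_name) "
        "VALUES (?, ?)",
        (target_id, target_name),
    )


insert_target(cur, "148", "Lutropin-choriogonadotropic hormone receptor")
insert_target(cur, "430", "Follicle-stimulating hormone receptor")
insert_target(cur, "430", "Follicle-stimulating hormone receptor")  # repeat
conn.commit()

rows = cur.execute(
    "SELECT target_id, target_name FROM TargetID_TargetName ORDER BY target_id"
).fetchall()
print(rows)
```

In pure Python the equivalent check is `if target_id in seen` with `seen` being a `set` or `dict`, which is a hash lookup rather than a scan of the whole list.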
Update: actual data
#BEGIN_DRUGCARD DB00097
# AHFS_Codes:
Not Available
# ATC_Codes:
G03GA08
# Absorption:
The mean absolute bioavailability following a single subcutaneous injection to healthy female volunteers is about 40%.
...
# Drug_Target_1_HPRD_ID:
01073
# Drug_Target_1_ID:
148
# Drug_Target_1_Locus:
2p21
# Drug_Target_1_Molecular_Weight:
78617
# Drug_Target_1_Name:
Lutropin-choriogonadotropic hormone receptor
...
# Drug_Target_2_HGNC_ID:
HGNC:3969
# Drug_Target_2_HPRD_ID:
00639
# Drug_Target_2_ID:
430
# Drug_Target_2_Locus:
2p21-p16
# Drug_Target_2_Molecular_Weight:
78296
# Drug_Target_2_Name:
Follicle-stimulating hormone receptor
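Given the layout shown above (a label line starting with `# ` and ending in `:`, with the value on the following line), the unique targets could be collected in a single pass over the file. This is a rough sketch, not a full parser: it assumes every value of interest fits on one line, that cards start with `#BEGIN_DRUGCARD`, and that a card's ID and Name lines may arrive in any order; `unique_targets` is a name made up for this example.

```python
import re

TARGET_LABEL = re.compile(r"^# Drug_Target_(\d+)_(ID|Name):$")


def unique_targets(lines):
    """Scan drug-card lines and return {target_id: target_name}."""
    unique = {}
    pending = {}    # target number -> {"ID": ..., "Name": ...} for this card
    current = None  # (number, kind) of the label whose value comes next

    def flush():
        # Move the finished card's targets into the master table.
        for fields in pending.values():
            if "ID" in fields and fields["ID"] not in unique:
                unique[fields["ID"]] = fields.get("Name")
        pending.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("#BEGIN_DRUGCARD"):
            flush()
            current = None
        elif line.startswith("# "):
            match = TARGET_LABEL.match(line)
            current = (int(match.group(1)), match.group(2)) if match else None
        elif current is not None and line:
            number, kind = current
            pending.setdefault(number, {})[kind] = line
            current = None
    flush()
    return unique


# Excerpt of the data above, shortened for the example.
sample = """\
#BEGIN_DRUGCARD DB00097
# Drug_Target_1_HPRD_ID:
01073
# Drug_Target_1_ID:
148
# Drug_Target_1_Name:
Lutropin-choriogonadotropic hormone receptor
# Drug_Target_2_ID:
430
# Drug_Target_2_Name:
Follicle-stimulating hormone receptor
"""

print(unique_targets(sample.splitlines()))
```

For the real file, an open file object can be passed in place of `sample.splitlines()`, so the whole file never has to be held in memory. Placeholder values such as `Not Available` would be recorded like any other value and may need filtering.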