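I am trying to convert the XML file I downloaded from DrugBank. Whenever I try to import it into Excel 2007, it says the file cannot be imported, perhaps because of its size. Can anyone suggest another way to open this file so that I can save it as tab-delimited? It is the first file (all drugs, including target, transporter, carrier, and enzyme information), in XML format, here: http://www.drugbank.ca/downloads
1 Answer
This is a complete rewrite of my original answer.
For my original answer, I performed only a limited analysis of drugbank.xml. I was a little hesitant, but I said the structure was too complex to convert into any standard tab-delimited file, by which I meant a tab-delimited file that any standard program could process. I stand by that statement, but it is possible to create a non-standard delimited file that might be useful.
The table below shows the structure of drugbank.xml.
The columns are index, level, name, parent, and repeats. For the elements drug and partner, Repeats is the actual number of occurrences. For every other element, it is the maximum number of repeats within any single occurrence of its parent.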
Inx Lvl Name------------------------------------ Pnt Repeats
1 1 drugs 0 1
2 2 drug 1 6711
3 3 drugbank-id 2 1
4 3 name 2 1
5 3 description 2 1
6 3 cas-number 2 1
7 3 general-references 2 1
8 3 synthesis-reference 2 1
9 3 indication 2 1
10 3 pharmacology 2 1
11 3 mechanism-of-action 2 1
12 3 toxicity 2 1
13 3 biotransformation 2 1
14 3 absorption 2 1
15 3 half-life 2 1
16 3 protein-binding 2 1
17 3 route-of-elimination 2 1
18 3 volume-of-distribution 2 1
19 3 clearance 2 1
20 3 secondary-accession-numbers 2 1
21 4 secondary-accession-number 20 5
22 3 groups 2 1
23 4 group 22 3
24 3 taxonomy 2 1
25 4 kingdom 24 1
26 4 substructures 24 1
27 5 substructure 26 35
28 3 synonyms 2 1
29 4 synonym 28 82
30 3 salts 2 1
31 4 salt 30 17
32 3 brands 2 1
33 4 brand 32 230
34 3 mixtures 2 1
35 4 mixture 34 340
36 5 name 35 1
37 5 ingredients 35 1
38 3 packagers 2 1
39 4 packager 38 173
40 5 name 39 1
41 5 url 39 1
42 3 manufacturers 2 1
43 4 manufacturer 42 91
44 3 prices 2 1
45 4 price 44 172
46 5 description 45 1
47 5 cost 45 1
48 5 unit 45 1
49 3 categories 2 1
50 4 category 49 11
51 3 affected-organisms 2 1
52 4 affected-organism 51 3
53 3 dosages 2 1
54 4 dosage 53 22
55 5 form 54 1
56 5 route 54 1
57 5 strength 54 1
58 3 atc-codes 2 1
59 4 atc-code 58 36
60 3 ahfs-codes 2 1
61 4 ahfs-code 60 11
62 3 patents 2 1
63 4 patent 62 5
64 5 number 63 1
65 5 country 63 1
66 5 approved 63 1
67 5 expires 63 1
68 3 food-interactions 2 1
69 4 food-interaction 68 6
70 3 drug-interactions 2 1
71 4 drug-interaction 70 246
72 5 drug 71 1
73 5 name 71 1
74 5 description 71 1
75 3 protein-sequences 2 1
76 4 protein-sequence 75 10
77 5 header 76 1
78 5 chain 76 1
79 3 calculated-properties 2 1
80 4 property 79 18
81 5 kind 80 1
82 5 value 80 1
83 5 source 80 1
84 3 experimental-properties 2 1
85 4 property 84 4
86 5 kind 85 1
87 5 value 85 1
88 5 source 85 1
89 3 external-identifiers 2 1
90 4 external-identifier 89 13
91 5 resource 90 1
92 5 identifier 90 1
93 3 external-links 2 1
94 4 external-link 93 4
95 5 resource 94 1
96 5 url 94 1
97 3 targets 2 1
98 4 target 97 144
99 5 actions 98 1
100 6 action 99 2
101 5 references 98 1
102 5 known-action 98 1
103 3 enzymes 2 1
104 4 enzyme 103 19
105 5 actions 104 1
106 6 action 105 3
107 5 references 104 1
108 3 transporters 2 1
109 4 transporter 108 24
110 5 actions 109 1
111 6 action 110 3
112 5 references 109 1
113 3 carriers 2 1
114 4 carrier 113 6
115 5 actions 114 1
116 6 action 115 1
117 5 references 114 1
118 2 partners 1 1
119 3 partner 118 4227
120 4 name 119 1
121 4 general-function 119 1
122 4 specific-function 119 1
123 4 gene-name 119 1
124 4 locus 119 1
125 4 reaction 119 1
126 4 signals 119 1
127 4 cellular-location 119 1
128 4 transmembrane-regions 119 1
129 4 theoretical-pi 119 1
130 4 molecular-weight 119 1
131 4 chromosome 119 1
132 4 species 119 1
133 5 category 132 1
134 5 name 132 1
135 5 uniprot-name 132 1
136 5 uniprot-taxon-id 132 1
137 4 essentiality 119 1
138 4 references 119 1
139 4 external-identifiers 119 1
140 5 external-identifier 139 9
141 6 resource 140 1
142 6 identifier 140 1
143 4 synonyms 119 1
144 5 synonym 143 38
145 4 protein-sequence 119 1
146 5 header 145 1
147 5 chain 145 1
148 4 gene-sequence 119 1
149 5 header 148 1
150 5 chain 148 1
151 4 pfams 119 1
152 5 pfam 151 15
153 6 identifier 152 1
154 6 name 152 1
155 4 go-classifiers 119 1
156 5 go-classifier 155 49
157 6 category 156 1
158 6 description 156 1
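For anyone who wants to produce a structure survey like the one above, here is a minimal sketch in Python. It is not the utility I used; the function name `survey` is made up for illustration. It streams the file with `xml.etree.ElementTree.iterparse`, so the whole document never has to fit in memory at once, and it records, for each element path, the maximum number of times that element repeats inside one occurrence of its parent. It assumes the file fits the shape shown in the table; if the file declares an XML namespace, the sketch strips it from each tag.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop a '{namespace}' prefix from a tag, if one is present."""
    return tag.rsplit("}", 1)[-1]


def survey(source):
    """Map each element path (a tuple of names) to the maximum number of
    repeats seen within a single occurrence of its parent."""
    max_repeats = {}  # dicts keep insertion order, so paths stay in document order
    stack = []        # one (path, Counter of child names) entry per open element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            name = local_name(elem.tag)
            if stack:
                parent_path, siblings = stack[-1]
                siblings[name] += 1
                count = siblings[name]
            else:
                parent_path, count = (), 1
            path = parent_path + (name,)
            max_repeats[path] = max(max_repeats.get(path, 0), count)
            stack.append((path, Counter()))
        else:
            stack.pop()
            elem.clear()  # discard text and children we no longer need
    return max_repeats
```

Calling `survey("drugbank.xml")` and printing `len(path)`, `path[-1]` and the count for each entry gives the Lvl, Name and Repeats columns. Because `drugs` and `partners` each occur once, the counts for `drug` and `partner` come out as their actual number of occurrences.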
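I have a utility that I developed for a client who could not process the huge XML documents they were being sent. It extracts selected information into a delimited file. Although those XML documents were huge, their structure was simple, with no repeats within the level 2 elements. I wondered whether I could enhance the utility to accept repeats and output the data to a delimited file, albeit a non-standard one. I now know that I can, although I am not sure how useful the delimited file is.
My output has 97 columns, one per leaf element. There are six header rows, one per level, which list each leaf element and its ancestors. When an element repeats, each further value is placed on the next available row. I hope the few columns shown below, taken from the rows for the first three drugs, make this clear. Note that column 61 is truncated in this display.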
|Column 1 |Column 2 |Column 18 |Column 25 |Column 56 |Column 60 |Column 61 |Column 62 |
|drugs |drugs |drugs |drugs |drugs |drugs |drugs |drugs |
|drug |drug |drug |drug |drug |drug |drug |drug |
|drugbank-id|name |secondary-accession-numbers|mixtures |external-identifiers |targets |targets |targets |
| | |secondary-accession-number |mixture |external-identifier |target |target |target |
| | | |name |resource |actions |references |known-action|
| | | | | |action | | |
|DB00001 |Lepirudin |BIOD00024 | |Drugs Product Database (DPD)|inhibitor |# Turpie AG: Anticoagulants in|yes |
| | |BTD00024 | |National Drug Code Directory| | | |
| | | | |PharmGKB | | | |
| | | | |UniProtKB | | | |
|DB00002 |Cetuximab |BIOD00071 | |National Drug Code Directory|antagonist|# Hosokawa N, Yamamoto S, Ueha|yes |
| | |BTD00071 | |GenBank | |# Snyder LC, Astsaturov I, Wei|unknown |
| | | | |PharmGKB | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Negri DR, Tosi E, Valota O, |unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
|DB00003 |Dornase Alfa|BIOD00001 |Cauterex |Drugs Product Database (DPD)| |# Cramer GW, Bosso JA: The rol|yes |
| | |BTD00001 |Clorfibrase|GenBank | | | |
| | | |Elase |PharmGKB | | | |
| | | |Fibrabene |UniProtKB | | | |
| | | |Fibrase SA | | | | |
| | | |Fibrolan | | | | |
| | | |Parkelase | | | | |
| | | |Ridasa | | | | |
| | | | | | | | |
The resulting file has 135,713 rows and is 52,171,387 bytes long. Would this, or some simple variation of it, be useful?
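To make the layout concrete, here is a rough Python sketch of the same idea: one column per leaf element, one header row per level, and repeated values stacked on successive rows. It is an illustration under assumptions, not my utility. The names `header_rows`, `record_rows` and `write_delimited` are hypothetical, the column paths are written relative to the record element, header rows start below the record level, and the sketch assumes element tags carry no namespace prefix.

```python
import csv
import xml.etree.ElementTree as ET
from itertools import zip_longest


def header_rows(columns):
    """One header row per level: row n holds the nth component of each path."""
    return list(zip_longest(*(path.split("/") for path in columns), fillvalue=""))


def record_rows(record, columns):
    """Stack each column's values downwards; short columns are padded with ''."""
    per_column = [
        [(leaf.text or "").strip() for leaf in record.findall(path)]
        for path in columns
    ]
    return list(zip_longest(*per_column, fillvalue=""))


def write_delimited(source, out, record_tag, columns):
    """Stream `source` and write one tab-delimited block of rows per record."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(header_rows(columns))
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        # Only level 2 elements are records. This skips the nested <drug>
        # that appears inside <drug-interaction>.
        if depth == 2 and elem.tag == record_tag:
            writer.writerows(record_rows(elem, columns))
            elem.clear()  # release the record before reading the next one
        depth -= 1
```

For example, `write_delimited("drugbank.xml", out, "drug", ["drugbank-id", "name", "secondary-accession-numbers/secondary-accession-number"])`, with `out` opened using `newline=""`, would write the first, second and eighteenth columns of the table above. Each column is filled independently, so values on the same row are not necessarily related to each other; that is what makes the file non-standard.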