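I have just written a MATLAB script that reads a text file made up of [Term] stanzas and builds vectors from it: one with the GO term IDs, one with the is_a relations, and one with the part_of relations (this is from the bioinformatics domain).
The code is as follows: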
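clear all;
% Open the text file and read it line by line into the cell array s
s = {};
fid = fopen('gos.txt');
tline = fgetl(fid);
while ischar(tline)          % fgetl returns -1 (not a char) at end of file
    s{end+1, 1} = tline;     % append the line as a new row of s
    tline = fgetl(fid);
end
fclose(fid);                 % release the file handle once reading is done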
% Generate the GO_Terms vector from the text file
tok = regexp(s, '^id: (GO:\w*)', 'tokens');
idx = ~cellfun('isempty', tok);
GO_Terms = cellfun(@(x)x{1}, {tok{idx}})'
% Generate the is_a relations vector from the text file
tok = regexp(s, '^is_a: (GO:\w*)', 'tokens');
idx = ~cellfun('isempty', tok);
is_a_relations = cellfun(@(x)x{1}, {tok{idx}})'
% Generate the part_of relations vector from the text file
tok = regexp(s, '^relationship: part_of (GO:\w*)', 'tokens');
idx = ~cellfun('isempty', tok);
part_of_relations = cellfun(@(x)x{1}, {tok{idx}})'
The results are as follows:
s=
'[Term]'
'id: GO:0008150'
'name: biological_process'
'namespace: biological_process'
[1x180 char]
[1x445 char]
'[Term]'
'id: GO:0016740'
'name: transferase activity'
'namespace: molecular_function'
'xref: Reactome:REACT_25050 "Molybdenum ion transfer onto molybdopterin, Homo sapiens"'
'//is_a: GO:0003674 ! molecular_function'
'is_a: GO:0008150 ! molecular_function (added by Zaid, To be Removed Later)'
'//relationship: part_of GO:0008150 ! biological_process'
'[Term]'
'id: GO:0016787'
'name: hydrolase activity'
'namespace: molecular_function'
'xref: Reactome:REACT_110436 "Hydrolysis of phosphatidylcholine, Bos taurus"'
[1x92 char]
'xref: Reactome:REACT_87959 "Hydrolysis of phosphatidylcholine, Gallus gallus"'
'//is_a: GO:0003674 ! molecular_function'
'is_a: GO:0016740 ! molecular_function (added by Zaid, to be removed later)'
'relationship: part_of GO:0008150 ! biological_process'
'[Term]'
'id: GO:0006810'
'name: transport'
'namespace: biological_process'
'alt_id: GO:0015457'
'alt_id: GO:0015460'
[1x255 char]
'subset: goslim_aspergillus'
'synonym: "transport accessory protein activity" RELATED [GOC:mah]'
'is_a: GO:0016787 ! biological_process'
'relationship: part_of GO:0008150 ! biological_process'
'[Term]'
'id: GO:0006412'
'name: translation'
'namespace: biological_process'
'alt_id: GO:0006416'
[1x522 char]
'subset: gosubset_prok'
'synonym: "protamine kinase activity" NARROW []'
'is_a: GO:0016740 ! transferase activity'
'//relationship: part_of GO:0006464 ! cellular protein modification process'
'[Term]'
'id: GO:0016779'
'name: nucleotidyltransferase activity'
'namespace: molecular_function'
'is_a: GO:0016740 ! transferase activity'
'[Term]'
'id: GO:0004386'
'helicases, Xenopus tropicalis"'
[1x100 char]
'is_a: GO:0016787 ! hydrolase activity'
'[Term]'
'id: GO:0003774'
'name: motor activity'
'namespace: molecular_function'
[1x178 char]
'is_a: GO:0016787 ! hydrolase activity'
[1x110 char]
'[Term]'
'id: GO:0016298'
'name: lipase activity'
'namespace: molecular_function'
'holesterol ester + H2O -> cholesterol + fatty acid, Caenorhabditis elegans"'
'is_a: GO:0016787 ! hydrolase activity'
'[Term]'
'id: GO:0016192'
'name: vesicle-mediated transport'
'namespace: biological_process'
'alt_id: GO:0006899'
[1x429 char]
'subset: goslim_aspergillus'
'synonym: "vesicular transport" EXACT [GOC:mah]'
'is_a: GO:0006810 ! transport'
'[Term]'
'id: GO:0005215'
'name: transporter activity'
'namespace: molecular_function'
[1x92 char]
'//is_a: GO:0003674 ! molecular_function'
'is_a: GO:0006412 ! molecular_function (to be removed later)'
'relationship: part_of GO:0006810 ! transport'
'[Term]'
'id: GO:0030533'
'name: triplet codon-amino acid adaptor activity'
'namespace: molecular_function'
'is_a: GO:0004672 ! RNA binding (added by Zaid, to be removed later)'
'relationship: part_of GO:0005215 ! translation'
GO_Terms =
'GO:0008150'
'GO:0016740'
'GO:0016787'
'GO:0006810'
'GO:0006412'
'GO:0004672'
'GO:0016779'
'GO:0004386'
'GO:0003774'
'GO:0016298'
'GO:0016192'
'GO:0005215'
'GO:0030533'
is_a_relations =
'GO:0008150'
'GO:0016740'
'GO:0016787'
'GO:0008150'
'GO:0016740'
'GO:0016740'
'GO:0016787'
'GO:0016787'
'GO:0016787'
'GO:0006810'
'GO:0006412'
'GO:0004672'
part_of_relations =
'GO:0008150'
'GO:0008150'
'GO:0006810'
'GO:0016192'
'GO:0006810'
'GO:0005215'
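I want to collect this data in a single multidimensional array whose first column is 'GO_Term', whose second column is 'is_a_relations', and whose third column is 'part_of_relations'.
The problem is that not every [Term] in the text file has entries for the second and third columns (the is_a and part_of relations), so the three vectors above have different lengths and no longer line up. How can I map each GO_Term to its own is_a and part_of relations (if it has any), for every [Term] stanza in the text file?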
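
To make the desired layout concrete, here is a minimal sketch of one possible approach, not a definitive solution: number each line by the [Term] stanza it belongs to, then apply the same three regular expressions stanza by stanza so that a missing relation simply leaves an empty cell. The small sample `s` and the names `GO_table`, `stanza`, `block`, and `pats` are hypothetical and introduced only for illustration; in the real script `s` would be the cell array read from gos.txt. As with the original expressions, lines beginning with `//` are skipped because of the `^` anchor.

```matlab
% Hypothetical sample standing in for the cell array s read from gos.txt.
s = {'[Term]'; 'id: GO:0008150'; 'name: biological_process'; ...
     '[Term]'; 'id: GO:0016787'; '//is_a: GO:0003674 ! molecular_function'; ...
     'is_a: GO:0016740 ! molecular_function'; ...
     'relationship: part_of GO:0008150 ! biological_process'};

% Stanza number of every line: running count of '[Term]' headers seen so far.
stanza = cumsum(strcmp(s, '[Term]'));
nTerms = stanza(end);

% Same patterns as in the question: id, is_a, part_of.
pats = {'^id: (GO:\w*)', '^is_a: (GO:\w*)', '^relationship: part_of (GO:\w*)'};

% One row per stanza: {id string, cell of is_a IDs, cell of part_of IDs}.
GO_table = cell(nTerms, 3);
for k = 1:nTerms
    block = s(stanza == k);              % lines of the k-th stanza
    found = cell(1, 3);
    for c = 1:3
        tok = regexp(block, pats{c}, 'tokens', 'once');
        tok = tok(~cellfun('isempty', tok));   % keep matching lines only
        found{c} = cellfun(@(t) t{1}, tok, 'UniformOutput', false)';
    end
    if ~isempty(found{1})
        GO_table{k, 1} = found{1}{1};    % a stanza has a single id
    end
    GO_table{k, 2} = found{2};           % empty cell if no is_a line
    GO_table{k, 3} = found{3};           % empty cell if no part_of line
end
```

Because a term may have several is_a or part_of parents, columns 2 and 3 hold cell arrays of IDs rather than single strings; `isempty(GO_table{k, 2})` then tells whether the k-th term has any is_a relation at all.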