matlab - Mapping data from a text file and creating a multidimensional array in MATLAB

Question

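I just wrote a MATLAB script that reads a text file made up of [Term] entries and builds vectors holding the GO term ids together with their is_a and part_of relations (bioinformatics domain).

The code is as follows: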

    clear all;
    % This code is for opening and getting information from a text file
    s={}; 
    fid = fopen('gos.txt'); 
    tline = fgetl(fid); 
    while ischar(tline) 
       s=[s;tline]; 
       tline = fgetl(fid); 
    end 
    fclose(fid);   % release the file handle once all lines have been read
    % To generate the GO_Terms vector from the text file
    tok = regexp(s, '^id: (GO:\w*)', 'tokens'); 
    idx = ~cellfun('isempty', tok); 
    GO_Terms   = cellfun(@(x)x{1}, {tok{idx}})'


    %To generate the is_a relations vector from the text file
    tok = regexp(s, '^is_a: (GO:\w*)', 'tokens'); 
    idx = ~cellfun('isempty', tok); 
    is_a_relations  = cellfun(@(x)x{1}, {tok{idx}})'


    %To generate the part_of relations vector from the text file
    tok = regexp(s, '^relationship: part_of (GO:\w*)', 'tokens'); 
    idx = ~cellfun('isempty', tok); 
    part_of_relations = cellfun(@(x)x{1}, {tok{idx}})'

The results are as follows:

s = 

    '[Term]'    
    'id: GO:0008150'
    'name: biological_process'
    'namespace: biological_process'
    [1x180 char]
    [1x445 char]

    '[Term]'    
    'id: GO:0016740'
    'name: transferase activity'
    'namespace: molecular_function'
    'xref: Reactome:REACT_25050 "Molybdenum ion transfer onto molybdopterin, Homo sapiens"'
    '//is_a: GO:0003674 ! molecular_function'
    'is_a: GO:0008150 ! molecular_function (added by Zaid, To be Removed Later)'
    '//relationship: part_of GO:0008150 ! biological_process'

    '[Term]'    
    'id: GO:0016787'
    'name: hydrolase activity'
    'namespace: molecular_function'
    'xref: Reactome:REACT_110436 "Hydrolysis of phosphatidylcholine, Bos taurus"'
    [1x92  char]
    'xref: Reactome:REACT_87959 "Hydrolysis of phosphatidylcholine, Gallus gallus"'
    '//is_a: GO:0003674 ! molecular_function'
    'is_a: GO:0016740 ! molecular_function (added by Zaid, to be removed later)'
    'relationship: part_of GO:0008150 ! biological_process'

    '[Term]'    
    'id: GO:0006810'
    'name: transport'
    'namespace: biological_process'
    'alt_id: GO:0015457'
    'alt_id: GO:0015460'
    [1x255 char]
    'subset: goslim_aspergillus'
    'synonym: "transport accessory protein activity" RELATED [GOC:mah]'
    'is_a: GO:0016787 ! biological_process'
    'relationship: part_of GO:0008150 ! biological_process'

    '[Term]'    
    'id: GO:0006412'
    'name: translation'
    'namespace: biological_process'
    'alt_id: GO:0006416'
    [1x522 char]
    'subset: gosubset_prok'
    'synonym: "protamine kinase activity" NARROW []'
    'is_a: GO:0016740 ! transferase activity'
    '//relationship: part_of GO:0006464 ! cellular protein modification process'

    '[Term]'    
    'id: GO:0016779'
    'name: nucleotidyltransferase activity'
    'namespace: molecular_function'
    'is_a: GO:0016740 ! transferase activity'

    '[Term]'    
    'id: GO:0004386'
    'helicases, Xenopus tropicalis"'
    [1x100 char]
    'is_a: GO:0016787 ! hydrolase activity'

    '[Term]'    
    'id: GO:0003774'
    'name: motor activity'
    'namespace: molecular_function'
    [1x178 char]
    'is_a: GO:0016787 ! hydrolase activity'
    [1x110 char]

    '[Term]'    
    'id: GO:0016298'
    'name: lipase activity'
    'namespace: molecular_function'
    'holesterol ester + H2O -> cholesterol + fatty acid, Caenorhabditis elegans"'
    'is_a: GO:0016787 ! hydrolase activity'

    '[Term]'    
    'id: GO:0016192'
    'name: vesicle-mediated transport'
    'namespace: biological_process'
    'alt_id: GO:0006899'
    [1x429 char]
    'subset: goslim_aspergillus'
    'synonym: "vesicular transport" EXACT [GOC:mah]'
    'is_a: GO:0006810 ! transport'

    '[Term]'    
    'id: GO:0005215'
    'name: transporter activity'
    'namespace: molecular_function'
    [1x92  char]
    '//is_a: GO:0003674 ! molecular_function'
    'is_a: GO:0006412 ! molecular_function (to be removed later)'
    'relationship: part_of GO:0006810 ! transport'

    '[Term]'    
    'id: GO:0030533'
    'name: triplet codon-amino acid adaptor activity'
    'namespace: molecular_function'
    'is_a: GO:0004672 ! RNA binding (added by Zaid, to be removed later)'
    'relationship: part_of GO:0005215 ! translation'





GO_Terms = 

    'GO:0008150'
    'GO:0016740'
    'GO:0016787'
    'GO:0006810'
    'GO:0006412'
    'GO:0004672'
    'GO:0016779'
    'GO:0004386'
    'GO:0003774'
    'GO:0016298'
    'GO:0016192'
    'GO:0005215'
    'GO:0030533'


is_a_relations = 

    'GO:0008150'
    'GO:0016740'
    'GO:0016787'
    'GO:0008150'
    'GO:0016740'
    'GO:0016740'
    'GO:0016787'
    'GO:0016787'
    'GO:0016787'
    'GO:0006810'
    'GO:0006412'
    'GO:0004672'


part_of_relations = 

    'GO:0008150'
    'GO:0008150'
    'GO:0006810'
    'GO:0016192'
    'GO:0006810'
    'GO:0005215'

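I want to collect this data in a single multidimensional array whose first column is 'GO_Term', second column is 'is_a_relations', and third column is 'part_of_relations'.

The problem is that not every [Term] in the text file has a second and third column (the is_a and part_of relations). So how can I map each GO_Term to its own is_a and part_of relations (if any), for each [Term] paragraph in the text file?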

score 1 · Accepted Answer

In this case you have to go term by term and build the map along the way:

    % find start and end positions of every [Term] marker in s 
    terms = [find(~cellfun('isempty', regexp(s, '\[Term\]'))); numel(s)+1];

    % for every [Term] section, run the previously implemented regexps
    % and save the results into a map - a cell array with 3 columns
    map = cell(0,3);
    for term=1:numel(terms)-1
        % extract the lines of a single [Term] section
        s_term = s(terms(term):terms(term+1)-1);

        % match regexps
        %To generate the GO_Terms vector from the text file
        tok = regexp(s_term, '^id: (GO:\w*)', 'tokens');
        idx = ~cellfun('isempty', tok); 
        GO_Terms   = cellfun(@(x)x{1}, {tok{idx}})';

        %To generate the is_a relations vector from the text file
        tok = regexp(s_term, '^is_a: (GO:\w*)', 'tokens'); 
        idx = ~cellfun('isempty', tok); 
        is_a_relations  = cellfun(@(x)x{1}, {tok{idx}})';

        %To generate the part_of relations vector from the text file
        tok = regexp(s_term, '^relationship: part_of (GO:\w*)', 'tokens'); 
        idx = ~cellfun('isempty', tok); 
        part_of_relations = cellfun(@(x)x{1}, {tok{idx}})';

        % map. note the end+1 - here we create a new map row. Only once!
        map{end+1,1} = GO_Terms;
        map{end,  2} = is_a_relations;
        map{end,  3} = part_of_relations;
    end

map is now a cell array with 3 columns, one row per [Term]. Some entries are empty, which means that particular [Term] has no corresponding value.
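To see how the empty entries behave, here is a minimal, self-contained sketch of the same per-term idea run on a tiny in-memory sample instead of gos.txt. The sample lines, the `starts` and `patterns` variables, and the `has_is_a` / `has_part_of` flags are illustrative assumptions, not part of the original code. This variant also stores plain cell arrays of strings rather than the nested cells produced above.

```matlab
% Hypothetical sample standing in for the lines read from gos.txt.
s = { '[Term]'; 'id: GO:0008150'; 'name: biological_process'; ...
      '[Term]'; 'id: GO:0016787'; 'is_a: GO:0016740 ! transferase activity'; ...
      'relationship: part_of GO:0008150 ! biological_process' };

% Row index of every [Term] marker, plus a sentinel one past the last line.
starts = [find(strcmp(strtrim(s), '[Term]')); numel(s) + 1];

% One pattern per output column: id, is_a, part_of.
patterns = {'^id: (GO:\w*)', '^is_a: (GO:\w*)', ...
            '^relationship: part_of (GO:\w*)'};

map = cell(numel(starts) - 1, 3);
for k = 1:numel(starts) - 1
    block = s(starts(k):starts(k+1) - 1);      % lines of one [Term] section
    for c = 1:3
        % 'once' yields {} for non-matching lines and {'GO:...'} for matches.
        tok = regexp(block, patterns{c}, 'tokens', 'once');
        tok = tok(~cellfun('isempty', tok));   % keep matching lines only
        map{k, c} = cellfun(@(t) t{1}, tok, 'UniformOutput', false);
    end
end

% Logical flags: which terms actually carry each kind of relation.
has_is_a    = ~cellfun('isempty', map(:, 2));
has_part_of = ~cellfun('isempty', map(:, 3));
```

Each cell of `map` holds a (possibly empty) list of ids, so a term with several is_a lines keeps all of them, and `has_is_a` / `has_part_of` can be used to filter rows before further processing.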
