1

我目前正在制作一个脚本来分析一些遗传数据,然后在彩色 Word 文档中生成输出。该脚本有效,但是,脚本中的一种方法写得不好,即创建 Word 文档的方法。

创建文档的方法会创建一个独立的 HTML 文件,然后使用“docx”扩展名保存该文件,这允许我为文档的不同部分赋予不同的样式。

以下是使其正常工作的最低要求。它包括一些示例输入数据,这些数据将在最后一步之前以不同的方法创建并存储在哈希中,以及必要的方法。

require 'bio'

def make_hash(input_file)
  input_read = Hash.new
  biofastafile = Bio::FlatFile.open(Bio::FastaFormat, input_file) 
  biofastafile.each_entry do |entry|
    input_read[entry.definition] = entry.aaseq
  end
  return input_read
end

def to_doc(hash, output, motif)
  output_file = File.new(output, "w")
  output_file.puts "<!DOCTYPE html><html><head><style> .id{font-weight: bold;} .signalp{color:#000099; font-weight: bold;} .motif{color:#FF3300; font-weight: bold;} h3 {word-wrap: break-word;} p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}</style></head><body>"
  hash.each do |id, seq|
    sequence = seq.to_s.gsub("\[\"", "").gsub("\"\]", "")
    id.scan(/(\w+)(.*)/) do |id_start, id_end|
      output_file.puts "<p><span class=\"id\"> >#{id_start}</span><span>#{id_end}</span><br>"
      output_file.puts "<span class=\"signalp\">"
      sequence.scan(/(\w+)-(\w+)/) do |signalp, seq_end|
        output_file.puts signalp + "</span>" + seq_end.gsub(/#{motif}/, '<span class="motif">\0</span>')
        output_file.puts "</p>"
      end
    end
  end
  output_file.puts "</body></html>"
  output_file.close   
end

hash = make_hash("./sample.txt")
to_doc = to_doc(hash, "output.docx", "WL|KK|RR|KR|R..R|R....R"

这是一些示例数据。实际上,在分析一个物种的遗传数据时,这可以由数十万个序列组成:

>isotig00001_f4_14 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00001_f4_15 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00003_f6_8 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00003_f6_9 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00004_f6_8 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00004_f6_9 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00009_f2_3 - Signal P Cleavage Site => 22:23
MLKCFSIIMGLILLLEIGGGCA-IYFYRAQIQAQFQKSLTDVTITDYRENADFQDLIDALQSGLSCCGVNSYEDWDNNIYFNCSGPANNPEALWCAFLLLYTGSSKRSSQHPVRLWSSFPRTTKYFPHKDLHHWLCGYVYNVD
>isotig00009_f3_9 - Signal P Cleavage Site => 16:17
MKTGIIIFISTVVVLP-ITLKPCGVPFSCCIPDQASGVANTQCGYGVRSPEQQNTFHTKIYTTGCADMFTMWINRYLYYIAGIAGVIVLVELFGFCFAHSLINDIKRQKARWAHR
>isotig00009_f6_13 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00009_f6_14 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL

每个读取由两部分组成:seq id(以 a 开头的行>)和序列。这是拆分的,并存储在make_hash方法中的散列中。这个例子:

>isotig00001_f4_14 - Signal P Cleavage Site => 11:12

MMHLLCIVLLL-KWWLLL 

由。。。制成由。。。做成:

>isotig00001_f4_14  (the first part of the id - class="id")

Signal P Cleavage Site => 11:12 (the second part of the id - normal writing)

(new line)

MMHLLCIVLLL (first part of the sequence - class="signalp")

KW WL LL  (the second part of the sequence - the motif KW will be class="motif")

在 HTML 中它会产生:

<p>
  <span class="id"> >isotig00001_f4_14</span><span>Signal P Cleavage Site => 11:12</span>
<br>
  <span class="signalp">MMHLLCIVLL</span><span>KW</span><span class="motif">KW</span><span>LL</span>

基本上,我想to_doc使用适当的 HTML 模板脚本(例如 SLIM/HAML/NOKOGIRI/ERB)重写该方法。我试图完成这件事。

由于某种原因,循环中的循环不起作用,并且创建一个全局变量来存储这些变量也不起作用。

上面的脚本有效,只需将示例数据保存为“sample.txt”,然后运行脚本。

我将非常感谢任何帮助。

4

1 回答 1

1

这是一个起点:

require 'haml'

haml_doc = <<EOT
%html
  %head
    :css
      .id {font-weight: bold;}
      .signalp {color:#000099; font-weight: bold;}
      .motif {color:#FF3300; font-weight: bold;}
      h3 {word-wrap: break-word;}
      p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}
  %body
EOT

engine = Haml::Engine.new(haml_doc)
puts engine.render

运行时输出:

<html>
  <head>
    <style>
      .id {font-weight: bold;}
      .signalp {color:#000099; font-weight: bold;}
      .motif {color:#FF3300; font-weight: bold;}
      h3 {word-wrap: break-word;}
      p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}
    </style>
  </head>
  <body></body>
</html>

从那里,您可以使用以下命令轻松写入文件:

File.write(output, engine.render)

而不是用于将puts其输出到控制台。

要使用它,您需要haml_doc用额外的 Haml 充实 以循环输入数据并将其按摩到可以干净地迭代的数组或哈希中,而无需嵌入各种scan条件逻辑。视图应该主要用于输出内容,而不是操作数据。

就在engine = Haml...您想要读取输入数据并对其进行处理的行上方,并将其存储在 Haml 可以迭代的实例变量中。您在原始代码中有基本想法,但不是尝试输出 HTML,而是创建一个可以传递给 Haml 的对象或子哈希。

通常,这将被分成模型、视图和控制器的单独文件,就像在 Rails 或大型 Sinatra 应用程序中一样,但这真的不是一个大应用程序,因此您可以将它们全部放在一个文件中。保持你的逻辑干净,它会没事的。

如果没有样本输入数据和预期输出,很难做更多的事情,但这会给你一个起点。


根据数据样本,这里有一些让你大开眼界的东西。我不会润色它,因为毕竟你必须做一些,但这是一个合理的开始。第一部分是模拟您在代码中引用的 Bio 类似的东西,但我从未见过。您不需要这部分,但可能需要查看它:

module Bio

  FastaFormat = 1

SAMPLE_DATA = <<-EOT
>isotig00001_f4_14 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00001_f4_15 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00003_f6_8 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00003_f6_9 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00004_f6_8 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00004_f6_9 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
>isotig00009_f2_3 - Signal P Cleavage Site => 22:23
MLKCFSIIMGLILLLEIGGGCA-IYFYRAQIQAQFQKSLTDVTITDYRENADFQDLIDALQSGLSCCGVNSYEDWDNNIYFNCSGPANNPEALWCAFLLLYTGSSKRSSQHPVRLWSSFPRTTKYFPHKDLHHWLCGYVYNVD
>isotig00009_f3_9 - Signal P Cleavage Site => 16:17
MKTGIIIFISTVVVLP-ITLKPCGVPFSCCIPDQASGVANTQCGYGVRSPEQQNTFHTKIYTTGCADMFTMWINRYLYYIAGIAGVIVLVELFGFCFAHSLINDIKRQKARWAHR
>isotig00009_f6_13 - Signal P Cleavage Site => 11:12
MMHLLCIVLLL-KWWLLL
>isotig00009_f6_14 - Signal P Cleavage Site => 10:11
MHLLCIVLLL-KWWLLL
EOT

  class FlatFile

    class Entry
      attr_reader :definition, :aaseq

      def initialize(definition, aaseq)
        @definition = definition
        @aaseq = aaseq
      end
    end

    def initialize
    end

    def self.open(filetype, filename)
      SAMPLE_DATA.split("\n").each_slice(2).map{ |seq_id, sequence| Entry.new(seq_id, sequence) }
    end

    def each_entry
      @sample_data.each do |_entry|
        yield _entry
      end
    end

  end
end

这就是乐趣开始的地方。我修改了你的get_hash例程来解析字符串我会怎么做。它返回一个散列数组,而不是散列。每个子哈希都已准备好使用,换句话说,数据已解析并准备好输出:

include Bio

def make_array_of_hashes(input_file)
  Bio::FlatFile.open(
    Bio::FastaFormat,
    input_file
  ).map { |entry|

    id_start, id_end = entry.definition.split('-').map(&:strip)
    signalp, seq_end = entry.aaseq.split('-')
    motif = seq_end.scan(/(?:WL|KK|RR|KR|R..R|R....R)/)

    {
      :id_start => id_start,
      :id_end => id_end,
      :signalp => signalp,
      :motif => motif
    }
  }
end

这是在脚本主体内定义 HAML 文档的简单方法。我只输出,模板中除了循环之外没有任何逻辑。其他一切都在处理视图之前处理:

haml_doc = <<EOT
!!!
%html
  %head
    :css
      .id {font-weight: bold;}
      .signalp {color:#000099; font-weight: bold;}
      .motif {color:#FF3300; font-weight: bold;}
      h3 {word-wrap: break-word;}
      p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}
  %body
  - data.each do |d|
    %p
      %span.id= d[:id_start]
      %span= d[:id_end]
      %br/
      %span.signalp= d[:signalp]
      - d[:motif].each do |m|
        %span= m
EOT

以下是如何使用它:

require 'haml'

data = make_array_of_hashes('sample.txt')

engine = Haml::Engine.new(haml_doc)
puts engine.render(Object.new, :data => data)

其中,运行时输出:

<!DOCTYPE html>
<html>
  <head>
    <style>
      .id {font-weight: bold;}
      .signalp {color:#000099; font-weight: bold;}
      .motif {color:#FF3300; font-weight: bold;}
      h3 {word-wrap: break-word;}
      p {word-wrap: break-word; font-family:Courier New, Courier, Mono;}
    </style>
  </head>
  <body></body>
  <p>
    <span class='id'>>isotig00001_f4_14</span>
    <span>Signal P Cleavage Site => 11:12</span>
    <br>
    <span class='signalp'>MMHLLCIVLLL</span>
    <span>WL</span>
  </p>
  <p>
    <span class='id'>>isotig00001_f4_15</span>
    <span>Signal P Cleavage Site => 10:11</span>
    <br>
    <span class='signalp'>MHLLCIVLLL</span>
    <span>WL</span>
  </p>
  <p>
    <span class='id'>>isotig00003_f6_8</span>
    <span>Signal P Cleavage Site => 11:12</span>
    <br>
    <span class='signalp'>MMHLLCIVLLL</span>
    <span>WL</span>
  </p>
  <p>
    <span class='id'>>isotig00003_f6_9</span>
    <span>Signal P Cleavage Site => 10:11</span>
    <br>
    <span class='signalp'>MHLLCIVLLL</span>
    <span>WL</span>
  </p>
  <p>
    <span class='id'>>isotig00004_f6_8</span>
    <span>Signal P Cleavage Site => 11:12</span>
    <br>
    <span class='signalp'>MMHLLCIVLLL</span>
    <span>WL</span>
  </p>
  <p>
    <span class='id'>>isotig00004_f6_9</span>
    <span>Signal P Cleavage Site => 10:11</span>
    <br>
    <span class='signalp'>MHLLCIVLLL</span>
    <span>WL</span>
  </p>
  <p>
    <span class='id'>>isotig00009_f2_3</span>
    <span>Signal P Cleavage Site => 22:23</span>
    <br>
    <span class='signalp'>MLKCFSIIMGLILLLEIGGGCA</span>
    <span>KR</span>
    <span>WL</span>
  </p>
  <p>
    <span class='id'>>isotig00009_f3_9</span>
    <span>Signal P Cleavage Site => 16:17</span>
    <br>
    <span class='signalp'>MKTGIIIFISTVVVLP</span>
    <span>KR</span>
  </p>
  <p>
    <span class='id'>>isotig00009_f6_13</span>
    <span>Signal P Cleavage Site => 11:12</span>
    <br>
    <span class='signalp'>MMHLLCIVLLL</span>
    <span>WL</span>
  </p>
  <p>
    <span class='id'>>isotig00009_f6_14</span>
    <span>Signal P Cleavage Site => 10:11</span>
    <br>
    <span class='signalp'>MHLLCIVLLL</span>
    <span>WL</span>
  </p>
</html>
于 2013-10-24T16:06:47.073 回答