string - "Average length of the sequences in a fasta file": Can you improve this Erlang code?

Question

I'm trying to get the average length of fasta sequences using Erlang. A fasta file looks like this:

>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)

I tried to answer this question with the following Erlang code:

-module(golf).
-export([test/0]).

line([],{Sequences,Total}) ->  {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

scanLines(S,Sequences,Total)->
        case io:get_line(S,'') of
            eof -> {Sequences,Total};
            {error,_} ->{Sequences,Total};
            Line -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
        end  .

test()->
    {Sequences,Total}=scanLines(standard_io,0,0),
    io:format("~p\n",[Total/(1.0*Sequences)]),
    halt().

Compile/execute:

erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16

This code seems to work for small fasta files, but parsing a larger one (>100 MB) takes hours. Why? I'm new to Erlang; can you improve this code?

score 5 · Accepted Answer

If you need really fast IO, you have to resort to a bit more trickery than usual.

-module(g).
-export([s/0]).
s()->
  P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
  r(P, 0, 0),
  halt().
r(P, C, L) ->
  receive
    {P, {data, {eol, <<$>:8, _/binary>>}}} ->
      r(P, C+1, L);
    {P, {data, {eol, Line}}} ->
      r(P, C, L + size(Line));
    {'EXIT', P, normal} ->
      io:format("~p~n",[L/C])
  end.

As far as I know this is the fastest IO, but note the -noshell -noinput flags. Compile with something like erlc +native +"{hipe, [o3]}" g.erl, but with -smp disable:

erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl

and run:

time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
352.6697028442464

real    0m3.241s
user    0m3.060s
sys     0m0.124s

With -smp enable but still native, it takes:

$ erlc +native +"{hipe, [o3]}" g.erl
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m5.103s
user    0m4.944s
sys     0m0.112s

Bytecode, but with -smp disable (almost on par with native, because most of the work is done in the port!):

$ erlc g.erl
$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m3.565s
user    0m3.436s
sys     0m0.104s

Just for completeness, bytecode with smp:

$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta 
352.6697028442464

real    0m5.433s
user    0m5.236s
sys     0m0.128s

For comparison, sarnold's version gives me the wrong answer and takes longer on the same hardware:

$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
./golf.erl:5: Warning: variable 'Rest' is unused
$ time erl -smp disable -noshell -mode minimal -s golf test
359.04679841439776

real    0m17.569s
user    0m16.749s
sys     0m0.664s

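EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I'm a little surprised. It is 3824397 lines and 232MB. That means the -smp disable version can handle 1.18 million lines of text per second (71MB/s in line-oriented IO).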

score 3 · Accepted Answer

I'm learning Erlang too; thanks for the fun question.

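I understand that working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead, you should see some performance gains. I don't know how you would use arbitrary-length strings with binaries, but if you can sort it out, it should help.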

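Also, if you don't mind working with a file directly rather than standard_io, perhaps you could speed things up by using file:open(..., [raw, read_ahead]). raw means the file must be on the local node's filesystem, and read_ahead specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)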

I expect read_ahead to make the most difference, but everything with Erlang comes with the phrase "benchmark before guessing".

EDIT

Using file:open("uniprot_sprot.fasta", [read, read_ahead]) on the full uniprot_sprot.fasta dataset takes 1m31s. (Average 359.04679841439776.)

Using file:open(.., [read, read_ahead]) and file:read_line(S), I get 0m34s.

Using file:open(.., [read, read_ahead, raw]) and file:read_line(S), I get 0m9s. Yes, nine seconds.

Here's where I stand now; if you can figure out how to use binaries instead of lists, it might see still more improvement:

-module(golf).
-export([test/0]).

line([],{Sequences,Total}) ->  {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

scanLines(S,Sequences,Total)->
        case file:read_line(S) of
            eof -> {Sequences,Total};
            {error,_} ->{Sequences,Total};
            {ok, Line} -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
        end  .

test()->
    F = file:open("/home/sarnold/tmp/uniprot_sprot.fasta", [read, read_ahead, raw]),
    case F of
    { ok, File } -> 
        {Sequences,Total}=scanLines(File,0,0),
        io:format("~p\n",[Total/(1.0*Sequences)]);
    { error, Reason } ->
            io:format("~p~n", [Reason])
    end,
    halt().

score 2 · Accepted Answer

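It looks like your big performance problems have been solved by opening the file in raw mode, but here are some more thoughts if you need to optimise that code further.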

Learn and use fprof.

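You're using string:strip/1 primarily to remove the trailing newline. Since Erlang values are immutable, you have to make a complete copy of the list (with all the associated memory allocation) just to remove the last character. If you know the file is well formed, just subtract one from your count; otherwise I'd try writing a length function that counts the relevant characters and ignores the irrelevant ones.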

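I'm wary of advice that says binaries are better than lists, but given how little processing you're doing, it's probably the case here. The first steps are to open the file in binary mode and use erlang:size/1 to find the length.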

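It won't affect performance (significantly), but the multiplication by 1.0 in Total/(1.0*Sequences) is only necessary in languages with broken division. Erlang division works correctly.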
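The suggestions above (open the file in binary mode, take the size instead of walking a list, and subtract the trailing newline rather than stripping it) could be combined roughly as follows. This is only a sketch: the module name fasta_bin, the function average/1, the helper line_len/1 and the 64 KB read_ahead size are all assumptions made for illustration, and it uses byte_size/1 where the answer mentions erlang:size/1. It assumes Unix line endings and at least one header line, and it crashes on a read error instead of handling it. Note also that string:strip/1 removes only blanks by default, so the list-based versions in this thread appear to count the newline; that may be why the averages reported differ.

```erlang
-module(fasta_bin).
-export([average/1]).

%% Average sequence length of the FASTA file at Path.
%% (Hypothetical helper; fails with badarith if the file has no header lines.)
average(Path) ->
    {ok, File} = file:open(Path, [read, raw, binary, {read_ahead, 65536}]),
    {Sequences, Total} = scan(File, 0, 0),
    ok = file:close(File),
    Total / Sequences.

%% Read line by line: a header line bumps the sequence counter,
%% any other line adds its length to the running total.
scan(File, Sequences, Total) ->
    case file:read_line(File) of
        {ok, <<$>, _/binary>>} -> scan(File, Sequences + 1, Total);
        {ok, Line}             -> scan(File, Sequences, Total + line_len(Line));
        eof                    -> {Sequences, Total}
    end.

%% Size of a line without its trailing newline, if it has one.
%% Nothing is copied; a "\r" from CRLF files would still be counted.
line_len(Line) ->
    Size = byte_size(Line),
    case Size > 0 andalso binary:last(Line) =:= $\n of
        true  -> Size - 1;
        false -> Size
    end.
```

From the shell, after compiling with erlc fasta_bin.erl, it would be called as fasta_bin:average("uniprot_sprot.fasta"). As the other answers say, benchmark it before assuming it is faster.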

score 1 · Accepted Answer

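The call string:len(string:strip(L)) traverses the list at least twice (I don't know how string:strip is implemented). Instead, you could write a simple function that counts the line length without the spaces: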

stripped_len(L) ->
  stripped_len(L, 0).

stripped_len([$ |L], Len) ->
  stripped_len(L, Len);

stripped_len([_C|L], Len) ->
  stripped_len(L, Len + 1);

stripped_len([], Len) ->
  Len.

The same approach can be applied to binaries as well.
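A binary counterpart of the function above might look like the sketch below. The module name stripped is an assumption, and skipping "\n" and "\r" in addition to spaces is an extension of this sketch, not part of the list version shown in the answer.

```erlang
-module(stripped).
-export([stripped_len/1]).

%% Count the bytes of a binary line in a single pass, without copying it.
%% Spaces are skipped as in the list version; newline and carriage return
%% are skipped too, which is an addition made by this sketch.
stripped_len(Bin) when is_binary(Bin) ->
    stripped_len(Bin, 0).

stripped_len(<<C, Rest/binary>>, Len) when C =:= $\s; C =:= $\n; C =:= $\r ->
    stripped_len(Rest, Len);
stripped_len(<<_, Rest/binary>>, Len) ->
    stripped_len(Rest, Len + 1);
stripped_len(<<>>, Len) ->
    Len.
```

For example, stripped:stripped_len(<<"AC GT\n">>) counts the four bases and ignores the space and the newline.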

score 0 · Accepted Answer

Have you tried Elixir (elixir-lang.org), which runs on top of Erlang and has a syntax similar to Ruby? Elixir solves string problems in the following way:

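Elixir strings are UTF8 binaries, with all the raw speed and memory savings that brings. Elixir has a String module with Unicode functionality built in, and it is a great example of writing code that writes code. String.Unicode reads various Unicode database dumps such as UnicodeData.txt to dynamically generate Unicode functions for the String module, built straight from that data! ( http://devintorr.es/blog/2013/01/22/the-excitement-of-elixir/ )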

I just wonder whether Elixir would be faster?
