I'm writing a Clojure implementation of this coding challenge, which asks for the average length of the sequence records in a FASTA-format file:
>1
GATCGA
GTC
>2
GCA
>3
AAAAA
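As a quick sanity check on what the answer should be, the sample above can be worked by hand: record 1 spans two lines (6 + 3 = 9 bases), record 2 has 3, and record 3 has 5. The snippet below is only an illustration; the names `sample-lengths` and `sample-average` are made up for this example and are not part of the challenge.

```clojure
;; Per-record lengths for the sample file above.
;; Record 1 is split across two lines, so its line lengths are summed.
(def sample-lengths
  [(+ (count "GATCGA") (count "GTC"))   ; record >1
   (count "GCA")                        ; record >2
   (count "AAAAA")])                    ; record >3

;; Integer division in Clojure yields an exact ratio here (17/3).
(def sample-average
  (/ (reduce + sample-lengths) (count sample-lengths)))
```

Wrapping the result in `(double ...)` gives a floating-point value if a ratio is not wanted.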
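For more background, see the related StackOverflow post on an Erlang solution.
My beginner Clojure attempt uses lazy-seq to read the file one record at a time, so that it scales to large files. However, it is fairly memory-hungry and slow, so I suspect it is not implemented optimally. Here is a solution that uses the BioJava library to abstract away the parsing of the records: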
(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
  [seq-iter]
  (lazy-seq
    (when (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths
  "Use BioJava to read a Fasta input file as a StreamReader of sequences"
  [in-file seq-type]
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))
And an equivalent approach without external libraries:
(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths
  "Retrieve lengths of sequences in the file using line lengths"
  [lines cur-length]
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (cond
        ; end of input: emit the length of the final record
        (nil? cur-line) [cur-length]
        ; header line: emit the finished record and start a new count
        (= \> (first cur-line)) (cons cur-length (seq-lengths remain-lines 0))
        ; sequence line: add its length to the current record
        :else (seq-lengths remain-lines (+ cur-length (.length cur-line)))))))

; seq-type is unused here; it is kept so both versions take the same arguments
(defn fasta-to-lengths-bland [in-file seq-type]
  ; pop off first item since it will be everything up to the first >
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))
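The current implementation takes 44 seconds on a large file, compared with 7 seconds for a Python implementation. Can you offer any suggestions for speeding up the code and making it more intuitive? Is my use of lazy-seq correctly parsing the file record by record, as intended?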
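One thing that may be worth checking is `average`: it calls both `reduce` and `count` on the same collection, so the whole sequence of lengths is likely to stay in memory while it is walked twice. For comparison, here is a minimal single-pass sketch that accumulates the total and the record count together. It assumes `clojure.java.io` is available (Clojure 1.2 or later) in place of `clojure.contrib.duck-streams`, and the names `fasta-stats` and `average-length` are hypothetical, chosen only for this illustration. I have not benchmarked it.

```clojure
(require '[clojure.java.io :as io])

(defn fasta-stats
  "Single pass over a seq of FASTA lines.
  Returns [total-sequence-length record-count]."
  [lines]
  (reduce (fn [[total n] ^String line]
            (if (.startsWith line ">")
              [total (inc n)]                 ; header line: one more record
              [(+ total (.length line)) n]))  ; sequence line: add its length
          [0 0]
          lines))

(defn average-length
  "Average record length for the FASTA file at in-file, or nil if it has no records."
  [in-file]
  ;; with-open closes the reader once the reduce has consumed every line
  (with-open [rdr (io/reader in-file)]
    (let [[total n] (fasta-stats (line-seq rdr))]
      (when (pos? n)
        (/ total n)))))
```

Calling `(fasta-stats [">1" "GATCGA" "GTC" ">2" "GCA" ">3" "AAAAA"])` on the sample lines gives `[17 3]`. The `^String` type hint is there to avoid reflection on `.startsWith` and `.length`, which might also matter for speed in the original code's `.length` call.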