clojure - Improving Clojure lazy-seq usage for iterative text parsing

Question

I'm writing a Clojure implementation of this coding challenge, which asks for the average length of sequence records in Fasta format:

>1
GATCGA
GTC
>2
GCA
>3
AAAAA

For more background, see the related StackOverflow post about an Erlang solution.

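My beginner Clojure attempt uses lazy-seq to read the file one record at a time so that it scales to large files. However, it is fairly memory hungry and slow, so I suspect it is not implemented optimally. Here is a solution that uses the BioJava library to abstract away the parsing of the records: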

(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
  [seq-iter]
  (lazy-seq
    (if (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths
  "Use BioJava to read a Fasta input file as a StreamReader of sequences"
  [in-file seq-type]
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))

And an equivalent approach without external libraries:

(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths
  "Retrieve lengths of sequences in the file using line lengths"
  [lines cur-length]
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (if (= nil cur-line) [cur-length]
        (if (= \> (first cur-line))
          (cons cur-length (seq-lengths remain-lines 0))
          (seq-lengths remain-lines (+ cur-length (.length cur-line))))))))

(defn fasta-to-lengths-bland [in-file seq-type]
  ; pop off first item since it will be everything up to the first >
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))

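The current implementation takes 44 seconds on a large file, compared with 7 seconds for a Python implementation. Can you offer any suggestions for speeding up the code and making it more intuitive? Is lazy-seq correctly parsing the file record by record, as intended?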

score 3 · Accepted Answer

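It's probably inconsequential, but average is holding onto the head of the seq of lengths.
The following is a wholly untested, but lazier, way to do what I think you want.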

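(use 'clojure.java.io) ;' since 1.2

;; Single-pass average: accumulate [sum count] in one reduce, so coll is
;; traversed only once and its head need not be retained.
(defn lazy-avg [coll]
  (let [f (fn [[v c] val] [(+ v val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

;; Average length of the non-header lines of file f.
(defn fasta-avg [f]
  (->> (reader f)
    line-seq
    (filter #(not (.startsWith % ">")))
    (map #(.length %))
    lazy-avg))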
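
Note that fasta-avg as written averages over individual lines, so a record that spans several lines (like >1 in the question's example) is counted as several items. Below is a minimal sketch of a per-record variant, assuming Clojure 1.2 or later. The names record-lengths, single-pass-avg and fasta-record-avg are hypothetical helpers, not part of the answer above. It also assumes every header is followed by at least one sequence line, because consecutive headers fall into one partition and an empty record would be skipped rather than counted as 0.

```clojure
(require '[clojure.java.io :as io])

(defn record-lengths
  "Lazy seq of per-record sequence lengths, given a seq of FASTA lines.
   Hypothetical helper for illustration."
  [lines]
  (->> lines
       ;; split into alternating runs of header lines and sequence lines
       (partition-by #(.startsWith ^String % ">"))
       ;; drop the header runs, keeping one chunk of lines per record
       (remove #(.startsWith ^String (first %) ">"))
       ;; a record's length is the sum of its line lengths
       (map (fn [chunk] (reduce + (map #(.length ^String %) chunk))))))

(defn single-pass-avg
  "Average computed in one reduce over [sum count]."
  [coll]
  (let [[sum cnt] (reduce (fn [[s c] x] [(+ s x) (inc c)]) [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(defn fasta-record-avg
  "Average record length for a FASTA source (file name, File, Reader, ...)."
  [source]
  ;; with-open closes the reader; the reduce inside single-pass-avg
  ;; consumes the lazy seq before that happens
  (with-open [rdr (io/reader source)]
    (single-pass-avg (record-lengths (line-seq rdr)))))
```

For the three-record example in the question, the record lengths are 9, 3 and 5. The String type hints avoid reflective calls on .startsWith and .length, which is one plausible source of slowness in the untyped version.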

score 1 · Accepted Answer

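Your average function is non-lazy: it needs to realise the entire coll argument while holding onto its head. Update: I just realised that my original answer included a nonsensical suggestion for solving the above problem... argh. Fortunately, ataggart has since posted a correct solution.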

Other than that, your code does seem lazy at first glance, although the use of read-lines is currently discouraged (use line-seq instead).
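
As a rough sketch of the line-seq approach, assuming clojure.java.io is available (Clojure 1.2 or later): the helper below counts records by counting header lines. The name count-records is hypothetical and only illustrates the pattern of consuming line-seq inside with-open.

```clojure
(require '[clojure.java.io :as io])

(defn count-records
  "Number of FASTA header lines in source. Hypothetical helper."
  [source]
  (with-open [rdr (io/reader source)]
    ;; line-seq is lazy, so it must be fully consumed (here by count)
    ;; before with-open closes the reader
    (count (filter #(.startsWith ^String % ">") (line-seq rdr)))))
```

Because the seq of lines is never bound to a name, already-counted lines can be garbage collected as the file is read.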

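If the file is really big and your functions will be called a large number of times, type-hinting seq-iter in the argument vector of seq-lengths (^NameOfBiojavaSeqIterClass seq-iter; use #^ in place of ^ if you're on Clojure 1.1) might make a significant difference. In fact, (set! *warn-on-reflection* true), then compile your code and add type hints to remove all the reflection warnings.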
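
A minimal sketch of the type-hinting advice, using java.lang.String in place of the BioJava iterator class, whose exact name the answer leaves as a placeholder. The function names len-reflective and len-hinted are hypothetical.

```clojure
;; Ask the compiler to report every interop call it cannot resolve statically.
;; This works at the REPL or while a file is being loaded.
(set! *warn-on-reflection* true)

;; No hint: the compiler cannot tell which .length method is meant, so it
;; emits a reflection warning and looks the method up at run time.
(defn len-reflective [s]
  (.length s))

;; Hinted: ^String lets the compiler emit a direct call to String.length.
(defn len-hinted [^String s]
  (.length s))
```

Both functions return the same result; only the hinted one compiles without a reflection warning.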
