parsing - How to lazily parse a big XHTML file in Clojure?

Question

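I have a valid XHTML file (100 megabytes of data) containing one big table. The first tr holds the column names (for a database), and every other tr is a data row. It is the only table in the whole document, and the structure is html->body->div->table.

How can I parse it lazily in Clojure?

I know about data.xml, but because I'm a Clojure beginner I'm having a hard time getting it to work, especially since the REPL is very slow when dealing with such a big file.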

score 15 · Accepted Answer

The data.xml docs say that parse creates a lazy tree of the document. I checked locally, and it seems to be true:

; Load libs
(require '[clojure.data.xml :as xml])
(require '[clojure.java.io :as io])

; standard.xml is a 100MB XML file from http://www.xml-benchmark.org/downloads.html
(def xml-tree (xml/parse (io/reader "standard.xml")))
(:tag xml-tree) ; => :site

; Only the first child is realized here, so this returns immediately
(def child (first (:content xml-tree)))
(:tag child) ; => :regions

; Forcing the whole content seq makes the REPL hang for ~30 seconds
; on my computer, because now the whole file has to be parsed
(dorun (:content xml-tree))
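
Applied to the html->body->div->table layout from the question, the same idea might look like the sketch below. It is only an illustration: child-elements, text-of, table-rows and sample are hypothetical names, not part of data.xml, and it assumes the tr elements are direct children of the table (no tbody). Tags are compared by local name so that a namespaced tag keyword still matches.

```clojure
(require '[clojure.data.xml :as xml]
         '[clojure.string :as str])

(defn child-elements
  "Lazy seq of the children of el whose local tag name equals tag-name.
  Text nodes have no :tag, so they are skipped."
  [el tag-name]
  (filter #(= tag-name (some-> % :tag name)) (:content el)))

(defn text-of
  "Concatenated, trimmed text found anywhere below el."
  [el]
  (str/trim (apply str (filter string? (tree-seq :content :content el)))))

(defn table-rows
  "Lazy seq of rows, each row a vector of cell strings.
  Assumes the path html->body->div->table with tr directly under table."
  [root]
  (let [table (reduce (fn [el tag] (first (child-elements el tag)))
                      root
                      ["body" "div" "table"])]
    (for [tr (child-elements table "tr")]
      (mapv text-of
            (filter #(#{"td" "th"} (some-> % :tag name)) (:content tr))))))

;; Tiny stand-in for the real 100MB file (hypothetical data).
(def sample
  "<html><body><div><table>
     <tr><th>id</th><th>name</th></tr>
     <tr><td>1</td><td>Ann</td></tr>
     <tr><td>2</td><td>Bob</td></tr>
   </table></div></body></html>")

;; Consume the rows inside with-open, while the reader is still open.
(with-open [rdr (java.io.StringReader. sample)]
  (let [[header & rows] (table-rows (xml/parse rdr))]
    (doseq [row rows]
      (println (zipmap header row)))))
```

For the real file, the StringReader would be replaced with (clojure.java.io/reader "your-file.xhtml"). The rows have to be consumed before the reader is closed. Unlike the def above, this sketch avoids binding the parsed root to a var, which should let rows that have already been processed be garbage collected instead of accumulating in memory.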
