clojure - Length of the first line in a UTF-8 file with BOM

Question

Good afternoon. Suppose I have a UTF-8 file containing a single letter, say "f" (no \n and no spaces), and I try to get a sequence of line lengths.

(with-open [rdr (reader "test.txt")] 
  (doall (map #(.length %) (line-seq rdr))))

I get

=> (2)

Why? Is there an elegant way to get the correct length of the first string?

score 7 · Accepted Answer

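The BOM problem in Java is covered in "Reading UTF-8 - BOM marker". Java's UTF-8 decoder does not strip the byte order mark, so it reaches your code as the character U+FEFF at the start of the first line, which is why the length is 2 rather than 1. It seems you can abstract this away with BOMInputStream from Apache Commons IO, or you have to remove it manually, i.e.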

(defn debomify
  [^String line]
  (let [bom "\uFEFF"]
    (if (.startsWith line bom)
      (.substring line 1)
      line)))

(doall (map #(.length %) (.split (debomify (slurp "test.txt")) "\n")))

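If you want to read the file lazily with line-seq, for example because it is large, you have to process the first line with debomify. The rest can be read normally. So: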

(defn debommed-line-seq
  [^java.io.BufferedReader rdr]
  (when-let [line (.readLine rdr)]
    (cons (debomify line) (lazy-seq (line-seq rdr)))))
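To check the behaviour end to end, here is a minimal sketch that repeats the two helpers above and runs them against a throwaway file. The temp file, its contents ("f" and "xyz" on two lines) and the names tmp, raw-lengths and fixed-lengths are assumptions made up for illustration; they are not part of the original question.

```clojure
(require '[clojure.java.io :as io])

(defn debomify
  "Drops a leading U+FEFF (the decoded BOM) from a string, if present."
  [^String line]
  (if (.startsWith line "\uFEFF")
    (.substring line 1)
    line))

(defn debommed-line-seq
  "Like line-seq, but strips a BOM from the first line only."
  [^java.io.BufferedReader rdr]
  (when-let [line (.readLine rdr)]
    (cons (debomify line) (lazy-seq (line-seq rdr)))))

;; Hypothetical fixture: the UTF-8 BOM bytes (EF BB BF) followed by "f\nxyz".
(def tmp (java.io.File/createTempFile "bom-test" ".txt"))

(with-open [out (io/output-stream tmp)]
  (.write out (byte-array (map unchecked-byte [0xEF 0xBB 0xBF])))
  (.write out (.getBytes "f\nxyz" "UTF-8")))

;; Plain line-seq: the first line still carries U+FEFF.
(def raw-lengths
  (with-open [rdr (io/reader tmp)]
    (doall (map count (line-seq rdr)))))

;; debommed-line-seq: the BOM is removed before counting.
;; doall forces the lazy seq while the reader is still open.
(def fixed-lengths
  (with-open [rdr (io/reader tmp)]
    (doall (map count (debommed-line-seq rdr)))))
```

As with the original snippet, the sequence has to be realized (here with doall) inside with-open, because the lazy tail reads from a reader that is closed when the with-open body exits.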
