1

Text的结构是这样的;

Tag001
 0.1, 0.2, 0.3, 0.4
 0.5, 0.6, 0.7, 0.8
 ...
Tag002
 1.1, 1.2, 1.3, 1.4
 1.5, 1.6, 1.7, 1.8
 ...

文件可以有任意数量的 TagXXX 事物,每个标签可以有任意数量的 CSV 值行。

==== 购买力平价。(对不起这些东西:-)

更多改进;现在我的 atom 笔记本电脑上 31842 行数据需要 1 秒左右,比原始代码快 7 倍。但是,C 版本比这快 20 倍。

(defn add-parsed-code [accu code]
  (if (empty? code)
    accu
    (conj accu code)))

(defn add-values [code comps]
  (let [values comps
        old-values (:values code)
        new-values (if old-values
                     (conj old-values values)
                     [values])]
    (assoc code :values new-values)))

(defn read-line-components [file]
  (map (fn [line] (clojure.string/split line #","))
       (with-open [rdr (clojure.java.io/reader file)]
         (doall (line-seq rdr)))))

(defn parse-file [file]
  (let [line-comps (read-line-components file)]
    (loop [line-comps line-comps
           accu []
           curr {}]
      (if line-comps
        (let [comps (first line-comps)]
          (if (= (count comps) 1) ;; code line?
            (recur (next line-comps)
                   (add-parsed-code accu curr)
                   {:code (first comps)})
            (recur (next line-comps)
                   accu
                   (add-values curr comps))))
        (add-parsed-code accu curr)))))

==== PPS。

虽然我不明白为什么第一个比第二个快 10 倍,但不是 slurp,map 和 with-open 确实使阅读速度更快;虽然整个阅读/处理时间并没有减少(从 7 秒到 6 秒)

(time
 (let [lines (map (fn [line] line)
                  (with-open [rdr (clojure.java.io/reader
                                   "DATA.txt")]
                    (doall (line-seq rdr))))]
   (println (last lines))))

(time (let [lines
            (clojure.string/split-lines
             (slurp "DATA.txt"))]
        (println (last lines))))

==== PS。Skuro 的解决方案确实有效。但是解析速度不是那么快,所以我必须使用基于 C 的解析器(它在 1~3 秒内读取 400 个文件,而 clojure 确实需要 1~4 秒的单个文件;是的文件大小相当大)用于读取和仅为统计分析部分构建 DB 和 clojure。

4

2 回答 2

5

下面解析上述文件,保持任何值行分隔。如果这不是您想要的,您可以更改该add-values功能。解析状态保存在curr变量中,同时accu保存先前解析的标签(即,在找到“TagXXX”之前出现的所有行)。它允许没有标签的值:

更新:副作用现在封装在一个专用load-file函数中

(defn tag? [line]
  (re-matches #"Tag[0-9]*" line))

; potentially unsafe, you might want to change this:
(defn parse-values [line]
  (read-string (str "[" line "]")))

(defn add-parsed-tag [accu tag]
  (if (empty? tag)
      accu
      (conj accu tag)))

(defn add-values [tag line]
  (let [values (parse-values line)
        old-values (:values tag)
        new-values (if old-values
                       (conj old-values values)
                       [values])]
    (assoc tag :values new-values)))

(defn load-file [path]
  (slurp path))

(defn parse-file [file]
  (let [lines (clojure.string/split-lines file)]
    (loop [lines lines ; remaining lines 
           accu []     ; already parsed tags
           curr {}]    ; current tag being parsed
          (if lines
              (let [line (first lines)]
                (if (tag? line)
                    ; we recur after starting a new tag
                    ; if curr is empty we don't add it to the accu (e.g. first iteration)
                    (recur (next lines)
                           (add-parsed-tag accu curr)
                           {:tag line})
                    ; we're parsing values for a currentl tag
                    (recur (next lines)
                           accu
                           (add-values curr line))))
              ; if we were parsing a tag, we need to add it to the final result
              (add-parsed-tag accu curr)))))

我对上面的代码不是很兴奋,但它确实完成了工作。给定一个文件,如:

Tag001
 0.1, 0.2, 0.3, 0.4
 0.5, 0.6, 0.7, 0.8
Tag002
 1.1, 1.2, 1.3, 1.4
 1.5, 1.6, 1.7, 1.8
Tag003
 1.1, 1.2, 1.3, 1.4
 1.1, 1.2, 1.3, 1.4
 1.5, 1.6, 1.7, 1.8
 1.5, 1.6, 1.7, 1.8

它产生以下结果:

user=> (clojure.pprint/print-table [:tag :values] (parse-file (load-file "tags.txt")))
================================================================
:tag   | :values
================================================================
Tag001 | [[0.1 0.2 0.3 0.4] [0.5 0.6 0.7 0.8]]
Tag002 | [[1.1 1.2 1.3 1.4] [1.5 1.6 1.7 1.8]]
Tag003 | [[1.1 1.2 1.3 1.4] [1.1 1.2 1.3 1.4] [1.5 1.6 1.7 1.8] [1.5 1.6 1.7 1.8]]
================================================================
于 2013-03-14T09:51:40.423 回答
1

这可以使用 partition-by 函数来完成。阅读起来可能有些神秘,但可以轻松提高可读性。这个函数在我的 mini-mac 上执行了大约 500 毫秒。

首先,我使用以下函数创建了测试数据。

(defn write-data[fname]
   (with-open [wrtr (clojure.java.io/writer fname) ]
     (dorun 
        (for [ x (take 7500 (range)) ]
          (do
             (.write wrtr (format "Tag%010d" x))
             (.write wrtr "
                            1.1, 1.2, 1.3, 1.4
                            1.1, 1.2, 1.3, 1.4
                            1.5, 1.6, 1.7, 1.8
                            1.5, 1.6, 1.7, 1.8
                           " ))))))

(write-data "my-data.txt")

; "a b c d " will be converted to [ a b c d ]
(defn to-vec[st]
   (load-string (str "[" st "]")))


(defn my-transform[fname]
   (let [tag (atom {:tag nil})]
      (with-open [rdr (clojure.java.io/reader fname)]
         (doall 
           (into {} 
               (map 
                  (fn[xs] {(first xs) (map to-vec (rest xs))}) 
                     ( partition-by 
                          (fn[y] 
                             (if(.startsWith 
                                  (str y) "Tag") 
                                  (swap! tag assoc :tag y) @tag)) 
                       (line-seq rdr))))))))


(time (count (my-transform "my-data.txt")))
;Elapsed time: 517.23 msecs
于 2013-10-04T00:17:59.400 回答