clojure - Clojure - Count unique values from vectors in a seq

Question

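Being a newbie to Clojure, I can't seem to figure out how to do something that looks like it should be simple. I just can't see it. I have a seq of vectors. Let's say each vector has two values, representing a customer number and an invoice number, and each vector represents the sale of one item. So it looks something like this: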

([ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ])

I want to count the number of unique customers and unique invoices. So this example should produce the vector

[ 2 3 ]

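In Java or another imperative language, I would loop over each vector in the seq, add the customer number and the invoice number to a set each, then count the number of values in each set and return them. I can't see the functional way to do this.

Thanks for the help.

EDIT: I should have specified in my original question that the seq of vectors is in the tens of millions and that each vector actually has more than two values. So I want to go through the seq only once, and calculate these unique counts (and some sums as well) during that single pass.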

score 12 · Accepted Answer

In Clojure you can do it almost the same way: first call distinct to get the unique values, then use count to count the results:

(def vectors (list [ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ]))
(defn count-unique [coll] 
   (count (distinct coll)))
(def result [(count-unique (map first vectors)) (count-unique (map second vectors))])

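Note that here you first get lists of the first and second elements of the vectors (map first/second vectors) and then operate on each list separately, thus iterating over the collection twice. If performance really matters, you can do the same thing with iteration (see the loop form or tail recursion) and sets, just as you would in Java. To improve performance further, you can also use transients. For a beginner like you, though, I would recommend the first way, with distinct.

UPD. Here is a version with loop: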

(defn count-unique-vec [coll]
  (loop [coll coll, e1 (transient #{}), e2 (transient #{})]
    (cond (empty? coll) [(count (persistent! e1)) (count (persistent! e2))]
          :else (recur (rest coll)
                       (conj! e1 (first (first coll)))
                       (conj! e2 (second (first coll)))))))
(count-unique-vec vectors)    ;; => [2 3]

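As you can see, there is no need for atoms or anything like that. First, you pass the state to each next iteration (the recur call). Second, you use transients as temporary mutable collections (read more on transients for details) and thus avoid creating a new object each time.

UPD2. Here is a version with reduce for the extended question (with price):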
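To make the transient round-trip concrete, here is a minimal sketch, separate from the answer's code. It turns an empty set into a transient, adds values with conj! (always using the value that conj! returns), and freezes it again with persistent! before counting.

```clojure
;; Illustrative only: the same transient pattern the loop uses, unrolled by hand.
(def unique-count
  (let [t (transient #{})   ; temporary mutable version of an empty set
        t (conj! t 100)     ; always keep the value returned by conj!
        t (conj! t 100)     ; the duplicate is ignored, as in any set
        t (conj! t 101)]
    (count (persistent! t)))) ; freeze back to an immutable set, then count

unique-count
;; => 2
```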

(defn count-with-price
  "Takes input of form ([customer invoice price] [customer invoice price] ...)  
   and produces vector of 3 elements, where 1st and 2nd are counts of unique    
   customers and invoices and 3rd is total sum of all prices"
  [coll]
  (let [[custs invs total]
        (reduce (fn [[custs invs total] [cust inv price]]
                  [(conj! custs cust) (conj! invs inv) (+ total price)])
            [(transient #{}) (transient #{}) 0]
            coll)]
    [(count (persistent! custs)) (count (persistent! invs)) total]))

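Here we hold the intermediate results in a vector [custs invs total], each time unpacking them, processing them, and packing them back into a vector. As you can see, implementing such nontrivial logic with reduce is harder (both to write and to read) and requires even more code (in the looped version it is enough to add one more parameter for price to the loop). So I agree with @ammaloy that reduce is better for simpler cases, but more complex things call for lower-level constructs, such as the loop/recur pair.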
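For comparison, the looped version with a price parameter might look like the sketch below. The function name count-unique-with-price and the sample prices are made up for illustration; the structure follows count-unique-vec, with one extra loop binding for the running total.

```clojure
;; Hypothetical name; a sketch of the loop/recur variant with a running total.
(defn count-unique-with-price [coll]
  (loop [coll  (seq coll)
         custs (transient #{})
         invs  (transient #{})
         total 0]
    (if-let [[cust inv price] (first coll)]
      ;; more rows: add this row's values and carry the state forward
      (recur (next coll)
             (conj! custs cust)
             (conj! invs inv)
             (+ total price))
      ;; no rows left: freeze the sets and report the counts plus the sum
      [(count (persistent! custs)) (count (persistent! invs)) total])))

;; Sample data with invented prices.
(count-unique-with-price [[100 2000 10] [100 2000 5] [101 2001 7] [100 2002 3]])
;; => [2 3 25]
```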

score 10 · Accepted Answer

As is often the case when working with sequences, reduce is nicer than loop here. You can just do:

(map count (reduce (partial map conj) 
                   [#{} #{}]
                   txn))

Or, if you are really into transients:

(map (comp count persistent!)
     (reduce (partial map conj!) 
             (repeatedly 2 #(transient #{}))
             txn))

Both of these solutions traverse the input only once, and they take much less code than the loop/recur solution.
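One possible variation, not part of the original answer: using mapv instead of map keeps the accumulator a realized vector at every step, rather than a lazy seq wrapped around the previous lazy seq. That may matter for inputs in the tens of millions of rows. The name unique-counts is hypothetical, and txn is assumed to be the seq of vectors from the question.

```clojure
;; Assumption: txn is the seq of [customer invoice] vectors from the question.
(def txn [[100 2000] [100 2000] [101 2001] [100 2002]])

;; Hypothetical helper: one set per column, updated eagerly on each row.
(defn unique-counts [rows]
  (mapv count
        (reduce (fn [sets row] (mapv conj sets row))
                [#{} #{}]
                rows)))

(unique-counts txn)
;; => [2 3]
```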

score 4 · Accepted Answer

Or you could use sets to handle the de-duplication for you, since a set can contain at most one of any given value.

(def vectors '([100 2000] [100 2000] [101 2001] [100 2002]))    
[(count (into #{} (map first vectors)))  (count (into #{} (map second vectors)))]

score 1 · Accepted Answer

Here is a nice way to do this using map and higher-order functions:

(apply 
  map 
  (comp count set list) 
  [[ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ]])

=> (2 3)
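To see why this works, it can help to look at the intermediate step. In the sketch below (rows is just a name chosen for the sample data), apply spreads the rows out as separate arguments to map, so the mapped function is called once per column position with all the values of that column.

```clojure
;; Sample data from the question, bound to an illustrative name.
(def rows [[100 2000] [100 2000] [101 2001] [100 2002]])

;; With list as the mapped function, the rows are transposed into columns.
(apply map list rows)
;; => ((100 100 101 100) (2000 2000 2001 2002))

;; Composing set and count on top of list de-duplicates and counts each column.
(apply map (comp count set list) rows)
;; => (2 3)
```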

score 0 · Accepted Answer

In addition to the good solutions above, here are a couple more:

(map (comp count distinct vector) [ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ])

Another one, written with the thread-last macro:

(->> '([100 2000] [100 2000] [101 2001] [100 2002]) (apply map vector) (map distinct) (map count))

Both return (2 3).
