java - 提高矩阵/表聚合和搜索的性能

Question

有一个产品特征矩阵。它有数千行（产品）和数百个功能。它具有显示产品是否具有此功能的二进制值。所以它可能是一个有 40 000 行和 900 列的表。

Product-feature matrix pr f1 f2 f3 fn ... 01 0 1 1 1 02 0 0 0 0 03 1 0 1 0 04 1 0 1 0 .....

首先，我必须找到具有给定特征集 QEg Q=(f1=1, f5=1, f27=1) 的产品。简单的说，找蓝车，两厢，3门。

Result 1 Given Q=(f1=1, f5=1, f27=1) Relevant products: 03, 04, 08...

其次，也是最重要的，我必须找出有多少产品具有一组特征 Q，也有一个特征 f_i（其中 i - 1..n）。换句话说，我们正在选择满足 Q 的行，并计算每列中有多少个 1（进行 SUM 聚合）。比如有多少蓝色车，两厢车，3门也有：柴油机，汽油机，氙气灯。

Result 2 Given Q=(f1=1, f5=1, f27=1) sum f2 = 943 sum f3 = 543 sum f4 = 7 sum f6 = 432 ....

当然，可以使用 RDBMS 解决此任务，但它不是那么有效 - 一般情况下，它需要全扫描才能在每列中查找产品和聚合。至少我不知道如何为这个任务建立一个有效的 b-tree 索引。Oracle 位图索引可能会有所帮助，但我不能使用 Oracle。

目前，我们正在使用 MySQL 来完成这项任务，但效果并不理想。实际上，我们使用整数表示（我们对特征进行分组并将整数存储在列中，而不是布尔值）来减少列的数量。

可以将此矩阵视为稀疏二进制矩阵。而且完全存储在内存中也不是什么大问题。我想知道是否可以应用一些算法来处理稀疏矩阵、向量空间（SVD、矩阵向量乘法等）。但它可能有助于找到满足向量 Q 的产品，而不是聚合。问题更多在于聚合的时间，而不是空间。

可能，可以将矩阵存储为多链表，这将有助于查找产品并为每一列进行聚合。

最后，问题是如何处理这个任务。找到具有给定特征的产品然后计算具有附加特征的产品（按每列汇总）的最有效算法是什么。

score 1 · Accepted Answer

如果我了解您当前的解决方案，那么您有一个表格，每个产品都有一行，每个功能都有一个列。这是解决问题的一种非常低效的方法。

三张桌子怎么样

"products" (product_ref, product_name) index product_ref (产品列表)

"features" (feature_ref, description) index feature_ref (可能的特征列表)

和

"productfeatures" (product_ref, feature_ref) index product_ref,feature_ref 和 feature_ref,product_ref (每一行代表一个产品的一个特征)

然后，您可以在表之间执行连接

select * from products t1 join productfeatures t2 join productfeatures t3 where t1.product_ref=t2.product_ref and t1.product_ref=t3.product_ref and t2.feature_ref=45 and t3.feature_ref=67 etc.

score 1 · Accepted Answer

请看一下我前段时间做的这个例子，它遵循了 jaydee 正确概述的内容，但更详细，并且针对 1.25 亿张海报类别（car_features），运行时间为 0.02 秒 - 你的，最坏的情况是 40K++ * 900 cols = 3600 万行，即，它是一个婴儿！

重写mysql select以减少时间并将tmp写入磁盘

select count(*) from category
count(*)
========
500,000


select count(*) from poster
count(*)
========
1,000,000

select count(*) from poster_category
count(*)
========
125,675,688

select count(*) from poster_category where cat_id = 623
count(*)
========
342,820

explain
select
 p.*,
 c.*
from
 poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
 pc.cat_id = 623
order by
 p.name
limit 32;

id  select_type table   type    possible_keys   key     key_len ref                         rows
==  =========== =====   ====    =============   ===     ======= ===                         ====
1   SIMPLE      c       const   PRIMARY         PRIMARY 3       const                       1   
1   SIMPLE      p       index   PRIMARY         name    257     null                        32  
1   SIMPLE      pc      eq_ref  PRIMARY         PRIMARY 7       const,foo_db.p.poster_id    1   

select
 p.*,
 c.*
from
 poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
 pc.cat_id = 623
order by
 p.name
limit 32;

0.021: Query OK

score 1 · Accepted Answer

您可以按列排列数据。即有一个BitSet 用于列出具有该功能的汽车/行的列。

这使您可以利用 BitSet 提供的快速和/或操作。

使用这些功能，您应该能够在不到 2 微秒的时间内完成每个功能的搜索和计数。

对于 40,000 * 900 数据集，以下打印

average search time 1396 ns.
average count time 1234 ns.

这应该比使用 RDBMS 数据库快几个数量级。即使是一百万行，每行花费的时间也不到 50 微秒。

public static void main(String... args) throws IOException {
    final int rows = 40 * 1000;
    final int cols = 900;

    List<BitSet> features = new ArrayList<BitSet>();
    features.add(new BitSet(rows));
    features.add(new BitSet(rows));
    for (int i = 2; i < cols; i++) {
        final BitSet bs = new BitSet(rows);
        for (int j = 0; j < rows; j++)
            bs.set(j, j % i == 0);
        features.add(bs);
    }

    // perform the search
    int[] factors = new int[]{2, 5, 7};
    BitSet matches = new BitSet();
    long runs =  1000*1000;
    {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            // perform lookup.
            lookup(matches, features, factors);
        }
        long avgTime = (System.nanoTime() - start) / runs;
        System.out.println("average search time " + avgTime  + " ns.");
    }
    {
        long start = System.nanoTime();
        int count9 = 0;
        BitSet col9matched = new BitSet(cols);
        for (int i = 0; i < runs; i++) {
            final int index = 9;
            final BitSet feature = features.get(index);
            count9 = countMatches(col9matched, matches, feature);
        }
        long avgTime = (System.nanoTime() - start) / runs;
        System.out.println("average count time " + avgTime + " ns.");
    }
}

private static int countMatches(BitSet scratch, BitSet matches, BitSet feature) {
    // recycle.
    scratch.clear();
    scratch.or(matches);
    scratch.and(feature);
    return scratch.cardinality();
}

private static void lookup(BitSet matches, List<BitSet> data, int[] factors) {
    matches.clear();
    matches.or(data.get(factors[0]));
    for (int i = 1, factorsLength = factors.length; i < factorsLength; i++) {
        matches.and(data.get(factors[i]));
    }
}

java - 提高矩阵/表聚合和搜索的性能

3 回答 3

Related

Reference