1

I read that Apriori algorithm is used to fetch association rules from the dataset like a set of tuples. It helps us to find the most frequent 1-itemsets, 2-itemsets and so-on. My problem is bit different. I have a dataset, which is a set of tuples, each of varying size - as follows :

(1, 234, 56, 32) (25, 4575, 575, 464, 234, 32) . . . different size tuples

The domain for entries is huge, which means that I cannot have a binary vector for each tuple, that tells me if item 'x' is present in tuple. Hence, I do not see Apriori algorithm fitting here.

My target is to answer questions like :

  1. Give me the ranked list of 5 numbers, that occur with 234 most of the time
  2. Give me the top 5 subsets of size 'k' that occur most frequently together

Requirements : Exact representation of numbers in output (not approximate), Domain of numbers can be thought of as 1 to 1 billion.

I have planned to use the simple counting methods, if no standard algorithm fits here. But, if you guys know about some algorithm that can help me, please let me know

4

2 回答 2

2

我曾在 Apriori 从事数据挖掘工作。问题是,你有所有这些物品吗?您实际上有多少个单独的项目 ID?我知道项目 ID 可能跨越一个大域,但也许它们并不全部存在。在这种情况下,稀疏的市场篮子表示可能仍然对您有好处,您将能够使用 Apriori。将您的最低支持和信心设置为高值也将消除许多低优先级链接。我使用Orange库来满足我的数据挖掘需求。

于 2012-10-09T16:05:18.120 回答
1

对于 Apriori,您不需要元组或向量。它可以用非常不同的数据类型来实现。常见的数据类型是排序的项目列表,看起来也可以1 13 712 1928 123945 191823476存储为 6 个整数。这本质上等同于稀疏二进制向量,并且通常非常节省内存。另外,APRIORI 实际上设计用于在对您的主内存来说太大的数据集上运行!

APRIORI 的可扩展性是事务数和项目数的混合。根据它们的不同,您可能更喜欢不同的数据结构和算法。

于 2012-10-09T19:50:25.233 回答