I read that the Apriori algorithm is used to mine association rules from a dataset such as a set of tuples. It helps find the most frequent 1-itemsets, 2-itemsets, and so on. My problem is a bit different. I have a dataset that is a set of tuples of varying size, as follows:
(1, 234, 56, 32)
(25, 4575, 575, 464, 234, 32)
. . . (more tuples of different sizes)
The domain of the entries is huge, which means I cannot keep a binary vector per tuple that tells me whether item 'x' is present in that tuple. Hence, I do not see how the Apriori algorithm fits here.
My goal is to answer questions like:
- Give me a ranked list of the 5 numbers that co-occur most often with 234
- Give me the top 5 subsets of size 'k' that occur together most frequently
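To make the first query concrete, here is a minimal Python sketch of plain exact counting. The name `top_cooccurring` and its parameters are made up for illustration, and it assumes the tuples can be read in a single pass. Only numbers that actually appear next to the target are stored, so no binary vector over the full domain is needed.

```python
from collections import Counter

def top_cooccurring(tuples, target, n=5):
    """Return the n numbers that share a tuple with `target` most often."""
    counts = Counter()
    for t in tuples:
        items = set(t)            # count each number once per tuple
        if target in items:
            items.discard(target)
            counts.update(items)  # +1 for every other number in this tuple
    return counts.most_common(n)  # list of (number, count), highest count first

data = [(1, 234, 56, 32), (25, 4575, 575, 464, 234, 32)]
print(top_cooccurring(data, 234))
```

On the two sample tuples, 32 is ranked first because it is the only number that appears with 234 in both.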
Requirements: the numbers in the output must be exact (not approximate), and the domain of the numbers can be thought of as 1 to 1 billion.
I plan to use simple counting methods if no standard algorithm fits here. But if you know of an algorithm that can help, please let me know.
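For the subset query, one way the simple counting could look is sketched below in Python. This is only an illustration under assumptions: `top_k_subsets` is a hypothetical name, and the whole table of counts is assumed to fit in memory. The results are exact, but each tuple of length m contributes C(m, k) subsets, so the cost grows quickly with k and with tuple length.

```python
from collections import Counter
from itertools import combinations

def top_k_subsets(tuples, k, n=5):
    """Return the n most frequent size-k subsets across all tuples."""
    counts = Counter()
    for t in tuples:
        items = sorted(set(t))  # canonical order so equal subsets get the same key
        counts.update(combinations(items, k))
    return counts.most_common(n)  # list of (subset, count), highest count first

data = [(1, 234, 56, 32), (25, 4575, 575, 464, 234, 32)]
print(top_k_subsets(data, 2, n=1))
```

On the two sample tuples with k = 2, the only pair that occurs in both is (32, 234).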