
Say you are building an analytics program for your website: on every visit you log the address of the page that was requested, so your log.txt might look like

x.com/a
x.com/b
x.com/a
x.com/c
x.com/a

There is no counter, just the log file, and no SQL is involved. The file has a few thousand entries, with around a thousand unique URL addresses (x.com/a, x.com/b, ...). What is the most efficient way to go through this list and spit out the top 10 URLs?

My best solution so far is to walk through the log file and, if a URL is not yet in a hash table, add it as a key, otherwise increment its value; then search the hash table for the 10 largest values.

I am not convinced this is the best solution, not only because of the space complexity (what happens when the number of unique URLs grows from a few thousand to a few million), but also because I need another pass over the hash table to find the largest values. A minimal sketch of what I mean (in Python; heapq.nlargest is just one way to do that final "largest 10" search):
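import heapq

counts = {}
with open("log.txt") as f:              # the log file from above
    for line in f:
        url = line.strip()
        if url:                         # skip blank lines
            counts[url] = counts.get(url, 0) + 1

# a second pass over the table, but only 10 candidates are kept in memory
top10 = heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])
print(top10)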


3 Answers


Even for a few thousand or a few million entries, your approach is just fine: it has a linear average run time (O(n)), so it is not that bad.

However, you can use a map-reduce approach if you want more scalability.

map(file):
   for each entry in file:
         emitIntermediate(getDomainName(entry), "1")
reduce(domain, list):
   emit(domain, size(list))

The above will efficiently give you the list of (domain, count) tuples, and all you have to do is select the top 10. For illustration, a single-machine Python sketch of those two phases (the shuffle and distribution are what a real map-reduce framework would provide; get_domain_name here is just a stand-in, since in this log each entry already is the URL to count):
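from collections import defaultdict

def get_domain_name(entry):
    # stand-in for getDomainName(): the log line itself is the key we count
    return entry.strip()

def map_phase(lines):
    # map(): emitIntermediate(getDomainName(entry), "1") for each entry
    for entry in lines:
        if entry.strip():
            yield get_domain_name(entry), 1

def reduce_phase(pairs):
    # reduce(): group the intermediate pairs by key and emit (domain, size(list))
    groups = defaultdict(list)
    for domain, one in pairs:
        groups[domain].append(one)
    return [(domain, len(ones)) for domain, ones in groups.items()]

with open("log.txt") as f:
    print(reduce_phase(map_phase(f)))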

Selecting the top 10 can be done with a map-reduce (distributed) sort for scalability, or with a min heap (iterate while maintaining the 10 largest elements encountered so far in the heap). The second approach is explained in more detail in this thread. A sketch of that min-heap selection in Python, assuming a counts mapping like the one produced above:
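import heapq

def top_k(counts, k=10):
    heap = []                              # min heap; heap[0] is the weakest of the current top k
    for url, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, url))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, url))
    return sorted(heap, reverse=True)      # biggest count first

print(top_k({"x.com/a": 3, "x.com/b": 1, "x.com/c": 1}))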


About space complexity: if you are using a 64-bit system, you can treat virtual memory as if it were RAM and let the OS do what it can (swapping elements to disk when needed); it is very unlikely that you will need more than the amount of virtual memory available on a 64-bit machine. An alternative is to use a hash table (or a B+ tree) optimized for file systems and run the same algorithm on it.


However, if that is indeed the case - the data does not fit in RAM and you cannot use map-reduce - I suspect that sorting and iterating, though it is O(nlogn), will be more efficient (using an external sort), because the number of DISK ACCESSES will be minimized, and disk access is much slower than RAM access. A sketch of that sort-then-iterate idea in Python, assuming log.sorted.txt is a hypothetical file produced beforehand by an external sort of the log:
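from itertools import groupby
import heapq

def counts_from_sorted(lines):
    # in a sorted file equal URLs are adjacent, so each run is counted
    # with O(1) extra memory and streamed out one (url, count) pair at a time
    for url, run in groupby(line.strip() for line in lines if line.strip()):
        yield url, sum(1 for _ in run)

with open("log.sorted.txt") as f:          # hypothetical output of an external sort of log.txt
    print(heapq.nlargest(10, counts_from_sorted(f), key=lambda kv: kv[1]))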

Answered 2012-11-15T11:27:32.363

Don't reinvent the wheel. Coreutils' sort and uniq can handle your log file:

sort log.txt | uniq -c | sort -n -r

Coreutils is available on *nix systems and has been ported to Windows.

If you do need to roll this processing into your own code, consult your language's available libraries for its version of a multiset. Python's, for example, is the Counter class, which will happily tell you the most_common([n]). For example, assuming the same log.txt as in the question:
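from collections import Counter

with open("log.txt") as f:
    counts = Counter(line.strip() for line in f if line.strip())

# most_common(10) returns the ten (url, count) pairs with the highest counts
for url, count in counts.most_common(10):
    print(count, url)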

Answered 2012-11-15T15:48:40.940

I would suggest that scanning the file on every request is the wrong approach. A better solution might be to parse the file and push the data into a database (RavenDB or another NoSQL store would probably be easiest). Once the data is there, querying it becomes trivial, even for very large amounts of data.

Answered 2012-11-15T11:29:33.323