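I'm wondering whether there is an algorithm that can compute the "most frequent items" without keeping a count for every item. For example, suppose I'm a search engine and want to track the 10 most popular searches. What I don't want to do is keep a counter for every query, because there may be too many distinct queries to count (and most of them are singletons). Is there a simple algorithm for this? Maybe something probabilistic? Thanks!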
4 Answers
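Well, if you have a very large number of queries (as a search engine presumably would), you could "sample" the queries. So you might receive 1,000 queries per second, but if you count only one of them per second, then over a long enough period you'll get an answer that is relatively close to the "true" answer.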
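A minimal sketch of the sampling idea, assuming the queries arrive as an iterable of strings. The function name `sampled_top_k` and the `sample_rate` and `seed` parameters are made up for illustration; it samples each query with a fixed probability rather than strictly one per second, which is a simplification.

```python
import random
from collections import Counter


def sampled_top_k(queries, sample_rate=0.001, k=10, seed=None):
    """Count only a random fraction of the queries and return the k most common.

    sample_rate is the probability that any single query is counted
    (hypothetical parameter; 0.001 mirrors "one in a thousand").
    """
    rng = random.Random(seed)
    counts = Counter()
    for query in queries:
        # Most singleton queries are never sampled, so they never use memory.
        if rng.random() < sample_rate:
            counts[query] += 1
    return counts.most_common(k)


if __name__ == "__main__":
    stream = ["popular"] * 9000 + [f"rare-{i}" for i in range(1000)]
    print(sampled_top_k(stream, sample_rate=0.1, k=3, seed=0))
```

The returned counts are sampled counts; dividing by `sample_rate` gives a rough estimate of the true counts. Memory still grows with the number of distinct sampled queries, just much more slowly.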
Storing each and every query would be expensive, yet necessary to ensure the top 10 are actually the top 10. You'll have to cheat.
One idea is to store a table of URLs, hit counters, and timestamps, indexed by count and then by timestamp. When the table reaches some arbitrary near-maximum size, start removing low-end entries that are older than a given number of days. Although old, infrequent queries won't be counted, the queries likely to make the top 10 should stay in the table because they arrive at a higher rate.
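The bounded table could be sketched roughly as follows. The class name `BoundedQueryTable` and its parameters (`max_size`, `max_age_seconds`, `evict_fraction`) are hypothetical, and an in-memory dict stands in for whatever indexed table would actually be used.

```python
import heapq
import time


class BoundedQueryTable:
    """Keeps query -> [count, last_seen]; prunes stale low-count entries when full."""

    def __init__(self, max_size=100_000, max_age_seconds=7 * 86400, evict_fraction=0.1):
        self.max_size = max_size
        self.max_age = max_age_seconds
        self.evict_fraction = evict_fraction
        self.entries = {}

    def record(self, query, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(query)
        if entry is None:
            self.entries[query] = [1, now]
        else:
            entry[0] += 1
            entry[1] = now
        if len(self.entries) >= self.max_size:
            self._evict(now)

    def _evict(self, now):
        cutoff = now - self.max_age
        # Only entries older than the cutoff are candidates; lowest counts go first.
        stale = sorted((c, t, q) for q, (c, t) in self.entries.items() if t < cutoff)
        limit = max(1, int(self.max_size * self.evict_fraction))
        for _, _, query in stale[:limit]:
            del self.entries[query]

    def top(self, k=10):
        best = heapq.nlargest(k, self.entries.items(), key=lambda kv: kv[1][0])
        return [(query, entry[0]) for query, entry in best]
```

Note that in this sketch the table can still grow past `max_size` if no entry is old enough to evict; a real system would need a fallback policy for that case.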
Another idea would be to write a 16-bit (or wider) hash function for search queries and keep a 65536-entry table holding counters and URLs. When a search is performed, increment the corresponding table entry and set its URL if necessary. However, this approach has a major drawback: distinct queries can hash to the same entry. A spam bot could make repeated queries like "cheap viagra", so that legitimate queries landing in the same entry increment the spam query's counter instead, which could place the spam message on your main page.
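A rough sketch of the hashed counter table, under some assumptions: it uses the low 16 bits of `zlib.crc32` as the hash (any 16-bit hash would do), and it interprets "set the URL if necessary" as "store the query the first time an entry is used". The class name `HashedCounterTable` is made up for illustration.

```python
import heapq
import zlib

TABLE_BITS = 16
TABLE_SIZE = 1 << TABLE_BITS  # 65536 entries


class HashedCounterTable:
    """Fixed-size array of counters; colliding queries share one counter."""

    def __init__(self):
        self.counts = [0] * TABLE_SIZE
        self.labels = [None] * TABLE_SIZE

    def _slot(self, query):
        # Assumption: low 16 bits of CRC-32 serve as the hash value.
        return zlib.crc32(query.encode("utf-8")) & (TABLE_SIZE - 1)

    def record(self, query):
        i = self._slot(query)
        self.counts[i] += 1
        if self.labels[i] is None:
            # The first query to reach an entry names it; later colliding
            # queries only add to its count.
            self.labels[i] = query

    def top(self, k=10):
        best = heapq.nlargest(k, range(TABLE_SIZE), key=self.counts.__getitem__)
        return [(self.labels[i], self.counts[i]) for i in best if self.counts[i] > 0]
```

Memory use is fixed regardless of how many distinct queries arrive, but the reported counts are upper bounds: any query that collides with a labeled entry inflates that entry's counter, which is exactly the weakness described above.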