algorithm - 维护内存中的前 100 名列表

Question

我正在制作一个实时统计系统，它将在内存中维护一个最常访问的 URL 路径列表（仅路径，参数被剥离）。

我考虑过“最大堆”，但鉴于 URI 模式各不相同（您无法预测新模式），我无法使用该数据结构。

我想到的是，您需要记录每个不同 URI 的计数，例如

www.google.com/pathA   5 times
www.google.com/pathB   3 times
...

因此，每当发现新的 URI 模式时，您都需要为其初始化一个条目，否则您可能会忽略一个关键 URI。

你不能真的“保留前 100 名”。

如果不消耗大量内存空间，它看起来是不可能实现的。

有什么建议吗？

score 2 · Accepted Answer

尽管它并不能完全满足您的要求，但我认为splay 树是您所需要的。它是一种出色的数据结构，具有将最近访问的元素和最常访问的元素保持在更靠近根的属性的特性。

如果这对您不起作用，请使用堆并在需要时更新元素的优先级。你不能用内置的堆来做到这一点，但实现起来并不难。

score 1 · Accepted Answer

如果您想确定，您列出的是前 100 名，那么您是对的。

您可以为此编写一些启发式方法。例如，您可以记录前 100 个和最后 100 个。新的 100 将是第二个将是第二个列表，其中的 url 可能成为前 100 个之一。最后 100 个您可以算作前 100 个。如果您访问的 url 不是在前 100 名和最后 100 名中，您将从最后 100 名中删除 sth，即最后访问的 url。

如果 sb 一个一个地访问 101 个 url，它不会起作用，但这是一个好的开始。您可以考虑不同的策略，应该删除哪个策略等等。

示例实现：

top100 : list<(URL, count)>
last100: list<(URL, count, score)>

process(URL){
    if(URL in top100) incrementCount top100[URL];
    elif(URL in last100){
        incrementScore last100[URL];
        newCount := incrementCount last100[URL];
        if (newCount > top100.lowestCount)
            swap this URL between last100 and top100 
        }
    else{
        //perform check if should change sth in last100, i.e.:
        if(exists score=0 in last100)
            remove score0 from last100.
            put (URL, 1, 0) to last100;
        }
        else{
            decrement all score in last100
        }
     }
 }

top/last 3 而不是 100 的简单运行。让我们从中间开始，当：top3 = [ (A, 10), (B, 4), (C, 3) ] last3 = [ (E, 2, 0) , (F, 1, 0), (G, 1, 0) ] （A..G 是 URL）

G: last3 = [ (E, 2, 0), (G, 2, 1), (F, 1, 0) ] //inc G score, count

G: last3 = [ (E, 2, 0), (G, 3, 2), (F, 1, 0) ] //inc G score, count

H: last3 = [ (E, 2, 0), (G, 3, 2), (H, 1, 0) ] //用 H 代替 F

F: last3 = [ (E, 2, 0), (G, 3, 2), (F, 1, 0) ] //用 F 代替 G

G：top3 = [ (A, 10), (B, 4), (G, 4) ], [ (E, 2, 0), (C, 3, 2), (F, 1, 0) ] / /交换GC

G: top3 = [ (A, 10), (B, 4), (G, 5) ] // 增加 G 计数

F: last3 = [ (E, 2, 0), (G, 3, 2), (F, 2, 1) ] //inc F score, count

E: last3 = [ (E, 3, 1), (G, 3, 2), (F, 2, 1) ] //inc E score, count

H: last3 = [ (E, 3, 0), (G, 3, 1), (F, 2, 0) ] //no el with score=0, dec all scores

H: last3 = [ (E, 3, 0), (G, 3, 1), (H, 1, 0) ] //用 H 代替 F

所以F和G经常出现，但不幸的是他们阻止了对方保持在last3，并进入top3。在具有 last/top100（或更多）的实际情况中，很难发生这种情况。

更复杂的策略应该操纵分数和计数，以更好地决定是否应该放置新的 url，如果是，应该删除哪个 url。您应该准备一些样本数据并创建高质量的策略。

score -2 · Accepted Answer

更新：对不起，我的解决方案是最近使用的而不是最流行的。在回答之前我没有正确阅读问题。

我认为您要查找的是 LRU 缓存或最近最少使用的缓存我们将使用排序模式“true”扩展 LinkedHashMap 以保持排序。并在大小超过最大条目时覆盖“removeEldestEntry”以返回 true。在您的情况下，maxEntries = 100。

请查看 ( http://docs.oracle.com/javase/6/docs/api/java/util/LinkedHashMap.html ) 了解有关 LinkedHashMap 的更多详细信息

private class LruCache<A, B> extends LinkedHashMap<A, B> {
    private final int maxEntries;

    public LruCache(final int maxEntries) {
       /* Using constructor LinkedHashMap(int initialCapacity, float loadFactor, boolean accessOrder) 
         Which Constructs an empty LinkedHashMap instance with the specified initial   
         capacity, load factor and ordering mode. */
        super(maxEntries + 1, 1.0f, true);
        this.maxEntries = maxEntries;
    }

   /* Returns true if this <code>LruCache</code> has more entries than the 
      maximum specified when it was created.*/
    @Override
    protected boolean removeEldestEntry(final Map.Entry<A, B> eldest) {
        return super.size() > maxEntries;
    }
}

Map<String, String> example = Collections.synchronizedMap(new LruCache<String, String>(CACHE_SIZE));

algorithm - 维护内存中的前 100 名列表

3 回答 3

Related

Reference