algorithm - 从日志文件中获取前 100 个 URL

Question

我的一位朋友在一次采访中被问到以下问题。谁能告诉我如何解决它？

我们有一个相当大的日志文件，大约 5GB。日志文件的每一行都包含一个用户在我们网站上访问过的 url。我们想弄清楚我们的用户访问的最受欢迎的 100 个网址是什么。怎么做？

score 7 · Accepted Answer

如果我们有超过 10GB 的 RAM，只需使用 hashmap 直接进行即可。

否则，使用散列函数将其分成几个文件。然后处理每个文件并获得前 5 名。每个文件都有“前 5 名”，很容易获得整体前 5 名。

另一种解决方案可以使用任何外部排序方法对其进行排序。然后扫描文件一次以计算每次出现的次数。在此过程中，您不必跟踪计数。你可以安全地扔掉任何没有进入top5的东西。

score 2 · Accepted Answer

只需根据 URL 对日志文件进行排序（如果您选择堆排序或快速排序等算法，则需要恒定空间），然后计算每个 URL 出现的次数（很简单，具有相同 URL 的行彼此相邻）。

总体复杂度为 O(n*Log(n))。

为什么拆分多个文件并为每个文件只保留前 3 个（或前 5 个或前 N 个）是错误的：

     File1 File2 File3
url1   5     0     5
url2   0     5     5
url3   5     5     0
url4   5     0     0
url5   0     5     0
url6   0     0     5
url7   4     4     4

url7 从未进入单个文件的前 3 名，但总体上是最好的。

score 1 · Accepted Answer

因为日志文件相当大，您应该使用流式阅读器读取日志文件。不要在记忆中全部阅读。我希望在我们处理日志文件时在内存中拥有可能的不同链接的数量是可行的。

// Pseudo
Hashmap map<url,count>
while(log file has nextline){
    url = nextline in logfile
    add url to map and update count
}

List list 
foreach(m in map){
    add m to list         
}

sort the list by count value
take top n from the list

运行时间为 O(n) + O(m*log(m))，其中 n 是日志文件的行数，m 是找到的不同链接的数量。

这是伪代码的 C# 实现。没有提供实际的文件阅读器和日志文件。而是提供了使用内存中的列表读取日志文件的简单模拟。

该算法使用哈希图来存储找到的链接。排序算法随后找到前 100 个链接。排序算法使用简单的数据容器数据结构。

内存复杂度取决于预期的不同链接。hashmap 必须能够包含找到的不同链接，否则此算法将不起作用。

// Implementation
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main(string[] args)
    {
        RunLinkCount();
        Console.WriteLine("press a key to exit"); 
        Console.ReadKey();
    }

    class LinkData : IComparable
    {
        public string Url { get; set; }
        public int Count { get; set; }
        public int CompareTo(object obj)
        {
            var other = obj as LinkData;
            int i = other == null ? 0 : other.Count;
            return i.CompareTo(this.Count);
        }
    }

    static void RunLinkCount()
    {
        // Data setup
        var urls = new List<string>();
        var rand = new Random();
        const int loglength = 500000;
        // Emulate the log-file
        for (int i = 0; i < loglength; i++)
        {
            urls.Add(string.Format("http://{0}.com", rand.Next(1000)
                 .ToString("x")));
        }

        // Hashmap memory must be allocated 
        // to contain distinct number of urls
        var lookup = new Dictionary<string, int>();

        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        // Algo-time
        // O(n) where n is log line count
        foreach (var url in urls) // Emulate stream reader, readline
        {
            if (lookup.ContainsKey(url))
            {
                int i = lookup[url];
                lookup[url] = i + 1;
            }
            else
            {
                lookup.Add(url, 1);
            }
        }

        // O(m) where m is number of distinct urls
        var list = lookup.Select(i => new LinkData 
             { Url = i.Key, Count = i.Value }).ToList();
        // O(mlogm)
        list.Sort();
        // O(m)
        var top = list.Take(100).ToList(); // top urls

        stopwatch.Stop();
        // End Algo-time

        // Show result
        // O(1)
        foreach (var i in top)
        {
            Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
        }

        Console.WriteLine(string.Format("Time elapsed msec: {0}",
           stopwatch.ElapsedMilliseconds));
    }
}

编辑：此答案已根据评论更新

补充：运行时间和内存复杂度分析
补充：伪代码
补充：解释我们如何管理一个相当大的日志文件

algorithm - 从日志文件中获取前 100 个 URL

3 回答 3

Related

Reference