string - 从文件中读取 URL 字符串列表并查找前 10 个读取最多的 URL

Question

好的，

我在很多采访中都遇到过这个问题，我想我需要一些帮助来解决它。

您有大量的 URL 作为字符串数组或从文件中读取。您现在需要获取前十个阅读次数最多的内容，如文件中前十个最常用的 URL。

我的方法是：

         Read them into a String Array, 
         Iterate through each String/URL,
             At every Iteration, put them into Hashtable, incrementing the count.
         Iterate again and find feed the scores into an array
         Sort and find the top 10 scores OR use max-heap to get the top 10.
         Iterate again and remove the URL's with the top 10 scores.

这是一个非常糟糕的答案吗？有人可以进一步帮助我吗？

score 3 · Accepted Answer

您可以使用最少的内存和基本上无限大小的文件来执行此操作：

Use the OS-supplied sort utility to sort the URLs on disk
Create a priority queue (binary heap, for example)
For each URL in the file
    if the URL is the same as the previous URL
        increase count
    else
        AddToQueue(previous_url, count)
        previous_url = current_url
        count = 1
EndFor
AddToQueue(previous_url, count)

此时，访问量最大的 10 个 URL 将在优先级队列中。

该AddToQueue函数如下所示：

AddToQueue(url, count)
    if (queue.Count < 10)
        add new url with count to queue
    else if (count > top_of_queue.count)
        remove top URL from queue
        add new url with count to queue

如果您有足够的内存来加载所有 URL，则可以将它们加载到一个数组中并对其进行排序。但是，如果您对所有 URL 都有足够的内存，那么基于字典的方法可能会更快。

score 0 · Accepted Answer

首先，请注意您不必要地使用了额外的内存。为什么要将所有内容读入一个数组，然后遍历该数组以将所有内容插入哈希表中？除非您有非常充分的理由这样做，否则您应该在阅读时将其放入哈希表中。

这减少了对数组的需求，并将内存使用量减少了 O(n)。其他步骤听起来很合理。为前 10 个分数维护一个包含 10 个条目的数组的想法是一个不错的方法，但它提出了如何有效地做到这一点的问题。

此外，使用哈希表可能会引发实现问题，您过于依赖通常在内置库中的东西。为了采访，也许更好的方法是将所有内容读取到二叉搜索树中，其中每个节点都包含一个带有字符串的结构，以及该字符串的出现次数（以及指向左右节点的指针）。二叉搜索树使您能够在 O(log(n)) 时间内查看字符串是否存在。阅读完所有内容并将其放入树中后，您可以使用 shell 排序对树进行排序。Shell 排序是这个练习的一个很好的选择，因为它倾向于快速消除大量的无序。该解决方案在 O(nlog(n)) 时间内运行。

如果你的面试官可以使用哈希表，那么实现树可能不值得麻烦，但是说“我将使用哈希表”你可能会在脚下射击自己，除非你当然实现了哈希桌子。这真的取决于上下文。

score 0 · Accepted Answer

运行时还不错，总体上类似于 O(nlogn)。

Read them into a String Array - O(n)
Iterate through each String/URL - O(n)
    At every Iteration, put them into Hashtable, incrementing the count - O(1)
Iterate again and find feed the scores into an array - O(n)
Sort and find the top 10 scores OR use max-heap to get the top 10 - O(nlogn)
Iterate again and remove the URL's with the top 10 scores. - O(1)

但是，您可能会跳过最后两个步骤。相反，遍历哈希表（或 URL）的条目，并在遍历条目时维护一个数组，其中包含前 10 个分数的 10 个条目；这样你就跳过了排序步骤，整个运行时间将是 O(n)。

string - 从文件中读取 URL 字符串列表并查找前 10 个读取最多的 URL

3 回答 3

Related

Reference