3

这是来自亚马逊的面试问题。

给定在几天内访问过网页u的客户列表,设计一个数据结构来获取在几天内访问过该网站并且应该至少访问过不同页面的客户。wdkm

假设u和是w大数d

我们如何存储这些信息?也许我们可以使用哈希映射、树等。

4

3 回答 3

4

Assumptions:

We have queries which give some web page j and ask for a list of users that have visited the web pages on exactly k days and have visited at least m distinct pages in total.

For the below, assume "map" = "hashmap", which would, ideally, give constant time queries. One can also use "treemap" at a logarithmic cost i.t.o the number of elements. Similarly for "set", "hashset", "treeset".

I'm assuming the input isn't static, but I won't really change my approach if it is.

Basic idea:

  • For each user, have a map of web pages to counts and last visited day. We can have a map of user identifiers to the map of web pages.

  • Whenever a user visits a page:

    • If it's a new page, just create a map entry with the page to a count of 1 and the last visited day

    • otherwise, if the day is the same as the last visited day, do nothing,

    • otherwise, increment the count and update the last visited day.

    • Total time per update = O(1)

  • For queries:

    • Loop through all users - O(u)

    • Look up the page for that user and check that the count matches - O(1)

    • Check that the count of the map is >= m - O(1)

    • Total time per query = O(n)

Improvements:

We can have a separate map of web page to unique ID (then obviously use the unique ID instead of the URL for the above map). It won't really change the complexity, but should improve both the space and time complexity by a constant factor.

At the cost of quite a bit of space, we can do the following (in addition to having the map from Basic idea):

  • For each web page, have an array of sets of user identifiers where index i in the array is a set of all users that visited that page on exactly i days.

  • Whenever a user visits a page:

    • Do the same as in Basic idea

    • Move the user from the current number of visits for that page to the new number of visits (or insert into the 1 visit set if new).

    • Total time per update = O(1)

  • For queries:

    • We can find all users that visited any given page on exactly k days in O(1).

    • From here we run through this smaller list of users (say there are p of them) and check the counts of the maps of each user (in O(1)).

    • Total time per query = O(p)

Feel free to point out if I missed anything.

于 2013-07-08T14:25:48.160 回答
1

使用两个哈希映射实现:

  1. hash map<customerId,HashSet<webpages>>
  2. hash map<customerId,HashSet<days>>

如果新客户 ID 则在其中创建新的哈希集插入,否则将当天或网页放入现有客户的哈希集中。

对于每个客户 ID,请参见length>=m第一个哈希映射中的集合和第二个哈希映射中的集合length=k输出该客户。

于 2013-12-01T15:01:01.223 回答
1

最直接的结构是模拟 RDBMS 会做什么:

1.拥有一个包含所有基本信息的对象:(客户、天数、不同页数)

2.然后,添加一个用作索引的树,使用 (Days-Count, Distinct-Pages-Count) 作为其键(B-tree,或类似的东西)

3.您还可以有另一个索引类型结构,其键为 Customer。这样,您可以快速更新信息

4.最后,您需要更详细的信息来确定被访问的页面是否是唯一的。这将涉及在客户下存储不同的页面信息(从问题中不清楚这是否意味着在特定日期或所有日期都是不同的)

于 2013-07-08T14:40:46.263 回答