Given a list of purchase events (customer_id, item):

1-hammer
1-screwdriver
1-nails
2-hammer
2-nails
3-screws
3-screwdriver
4-nails
4-screws

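I'm trying to build a data structure that tells you how many times an item was bought together with another item. Not bought at the same time, but bought at any point since I started storing data. The result would look like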

{
       hammer : {screwdriver : 1, nails : 2}, 
  screwdriver : {hammer : 1, screws : 1, nails : 1}, 
       screws : {screwdriver : 1, nails : 1}, 
        nails : {hammer : 2, screws : 1, screwdriver : 1}
}

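meaning that a hammer was bought with nails twice (persons 1, 2) and with a screwdriver once (person 1), screws were bought with a screwdriver once (person 3), and so on...

My current approach is

users = dict where the userid is the key and the list of items bought is the value

usersForItem = dict where the itemid is the key and the list of users who bought the item is the value

userlist = temporary list of users who have rated the current item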

pseudo:
for each event(customer,item)(sorted by item):
  add user to users dict if not exists, and add the items
  add item to items dict if not exists, and add the user
----------

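# rows: iterable of (item, user) pairs, sorted by item
users = {}          # user -> list of items bought
usersForItem = {}   # item -> list of users who bought it
items = []          # distinct items, in order of first appearance
userlist = []       # users collected for the item currently being processed
last_item = None

for item,user in rows:

  # add the user to the users dict if they don't already exist.
  users[user]=users.get(user,[])

  # append the current item_id to the list of items rated by the current user
  users[user].append(item)

  if item != last_item:
    # we just started a new item which means we just finished processing an item
    # write the userlist for the last item to the usersForItem dictionary.
    if last_item is not None:
      usersForItem[last_item]=userlist

    userlist=[user]

    last_item = item
    items.append(item)
  else:
    userlist.append(user)

# flush the users of the final item (skipped when rows is empty)
if last_item is not None:
  usersForItem[last_item]=userlist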

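So, at this point, I have 2 dicts: who bought what, and what was bought by whom. Here's where it gets tricky. Now that usersForItem is populated, I loop through it, loop through each user who bought the item, and look at that user's other purchases. I acknowledge that this isn't the most Pythonic way of doing things. I'm trying to make sure I get the right result (which I do) before getting fancy with the Python.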

relatedItems = {}
for key,listOfUsers in usersForItem.items():
  relatedItems[key]={}
  related=[]

  for ux in listOfUsers:
    for itemRead in users[ux]:
      if itemRead != key:
        if itemRead not in related:
          related.append(itemRead)
        relatedItems[key][itemRead]= relatedItems[key].get(itemRead,0) + 1

  # TODO: calc jaccard/tanimoto similarity between relatedItems[key] and its values
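
The similarity step left as pseudocode above could look roughly like the sketch below. It assumes the Jaccard (Tanimoto) similarity of two items is taken over the sets of users who bought each one, i.e. |A ∩ B| / |A ∪ B|. The `jaccard` helper and the `users_for_item` sample are hypothetical names used only for illustration.

```python
def jaccard(users_a, users_b):
    """Jaccard similarity of two collections of user ids (0.0 if both are empty)."""
    a, b = set(users_a), set(users_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / float(len(union))

# Hypothetical sample: item -> users who bought it (same data as the question)
users_for_item = {
    "hammer": [1, 2],
    "nails": [1, 2, 4],
    "screwdriver": [1, 3],
    "screws": [3, 4],
}

hammer_nails = jaccard(users_for_item["hammer"], users_for_item["nails"])
hammer_screws = jaccard(users_for_item["hammer"], users_for_item["screws"])
```

Here hammer and nails share two of three distinct buyers, while hammer and screws share none.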

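Is there a more efficient way to do this? Also, if there is a proper academic name for this type of operation, I'd love to hear it.

Edit: clarified to include the fact that I'm not restricting purchases to items bought at the same time. Items can be bought at any time.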

4 Answers

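Do you really need to precompute all possible pairs? What if you did it lazily, i.e. on demand?

This can be represented as a 2D matrix. The rows correspond to customers and the columns correspond to products.

Each entry is either 0 or 1, indicating whether the product for that column was bought by the customer for that row.

If you look at each column as a vector of (around 5000) 0s and 1s, then the number of times two products were bought together is just the dot product of the corresponding vectors!

So you can compute those vectors first, and then compute the dot products lazily, on demand.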
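
As a rough illustration of the dot-product idea, here is a minimal sketch that uses plain Python lists as the 0/1 columns. The event data is the toy list from the question, and the names `columns` and `bought_together` are made up for this example.

```python
# Toy data: the nine purchase events from the question, as (customer, item) pairs.
events = [(1, "hammer"), (1, "screwdriver"), (1, "nails"),
          (2, "hammer"), (2, "nails"),
          (3, "screws"), (3, "screwdriver"),
          (4, "nails"), (4, "screws")]

# Give each customer a row index.
customers = sorted({cust for cust, _ in events})
row_of = {cust: i for i, cust in enumerate(customers)}

# One 0/1 column per product; entry i is 1 if customer i bought the product.
columns = {}
for cust, item in events:
    columns.setdefault(item, [0] * len(customers))[row_of[cust]] = 1

def bought_together(a, b):
    """Dot product of two product columns = number of customers who bought both."""
    return sum(x * y for x, y in zip(columns[a], columns[b]))
```

Only the columns are built up front; each pair count is computed when it is asked for.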

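To compute the dot product:

A good representation of a vector containing only 0s and 1s is an array of integers, which is basically a bitmap.

For 5000 entries, you will need an array of 79 64-bit integers.

So, given two such arrays, you need to count the number of 1s they have in common.

To count the bits common to two integers, first take the bitwise AND, then count the number of 1s set in the result.

For that you can use lookup tables or one of the bit-counting tricks (not sure whether Python supports them), such as those described at http://graphics.stanford.edu/~seander/bithacks.html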
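
In Python, one simple way to do this is to AND the two integers and count the "1" characters in the binary string. This is only a sketch, and the helper name `common_bits` is hypothetical. On Python 3.10 and later, `int.bit_count()` does the counting directly.

```python
def common_bits(a, b):
    """Count the bit positions that are set in both non-negative integers."""
    return bin(a & b).count("1")

# 0b1011 & 0b0110 == 0b0010, which has a single set bit
shared = common_bits(0b1011, 0b0110)
```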

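So your algorithm would be something like this:

  • Initialize an array of 79 64-bit integers for each product.

  • For each customer, look at the products bought and set that customer's bit in each of those products' arrays.

  • Given a query for two products, the number of customers who bought both is just the dot product described above.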
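
The steps above could be sketched as follows. Note one substitution: Python integers are arbitrary precision, so this sketch keeps a single integer per product as the bitmap instead of an array of 79 64-bit integers. The class name `BitmapIndex` and its methods are hypothetical.

```python
from collections import defaultdict

class BitmapIndex:
    """Per-product customer bitmaps; pair counts are computed on demand."""

    def __init__(self):
        self.bit_of = {}                # customer id -> bit position
        self.bitmap = defaultdict(int)  # product -> integer used as a bitmap

    def add(self, customer, product):
        # Assign the next free bit position to customers seen for the first time.
        pos = self.bit_of.setdefault(customer, len(self.bit_of))
        self.bitmap[product] |= 1 << pos

    def together(self, a, b):
        # Bitwise AND keeps only customers who bought both; count the set bits.
        # .get avoids creating entries for unknown products.
        return bin(self.bitmap.get(a, 0) & self.bitmap.get(b, 0)).count("1")

index = BitmapIndex()
for cust, item in [(1, "hammer"), (1, "screwdriver"), (1, "nails"),
                   (2, "hammer"), (2, "nails"),
                   (3, "screws"), (3, "screwdriver"),
                   (4, "nails"), (4, "screws")]:
    index.add(cust, item)
```

Adding the same purchase twice is harmless, because setting a bit that is already set changes nothing.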

This should be pretty fast.

As a further optimization, you could consider grouping the customers.

answered 2010-06-24T14:03:49.277

from collections import defaultdict
from itertools import groupby

events = """\
1-hammer 
1-screwdriver 
1-nails 
2-hammer 
2-nails 
3-screws 
3-screwdriver 
4-nails 
4-screws""".splitlines()
# parse each "id-item" line into an [id, item] pair, then sort by id
events = sorted([s.strip() for s in e.split('-')] for e in events)

# tally each occurrence of each pair of items
summary = defaultdict(int)
for val,items in groupby(events, key=lambda x:x[0]):
    items = sorted(it[1] for it in items)
    for i,item1 in enumerate(items):
        for item2 in items[i+1:]:
            summary[(item1,item2)] += 1
            summary[(item2,item1)] += 1

# now convert raw pair counts into friendlier lookup table
pairmap = defaultdict(dict)
for k,v in summary.items():
    item1, item2 = k
    pairmap[item1][item2] = v

# print the results
for k,v in sorted(pairmap.items()):
    print("%s : %s" % (k, v))

Gives:

hammer : {'nails': 2, 'screwdriver': 1}
nails : {'screws': 1, 'hammer': 2, 'screwdriver': 1}
screwdriver : {'screws': 1, 'nails': 1, 'hammer': 1}
screws : {'nails': 1, 'screwdriver': 1}

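(This addresses your initial request of grouping items by purchase event. To group by user, just change the first key of the events list from the event number to the user id.)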
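
For comparison, the same pair tally can be sketched with sets, `itertools.combinations`, and `collections.Counter`. This variant is an assumption-laden alternative, not the code above: it deduplicates repeat purchases per customer and stores each unordered pair once. The names `baskets`, `pair_counts`, and `together` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

events = [("1", "hammer"), ("1", "screwdriver"), ("1", "nails"),
          ("2", "hammer"), ("2", "nails"),
          ("3", "screws"), ("3", "screwdriver"),
          ("4", "nails"), ("4", "screws")]

# Collect the distinct items bought under each id.
baskets = defaultdict(set)
for cust, item in events:
    baskets[cust].add(item)

# Count each unordered pair once per id, keyed in sorted order.
pair_counts = Counter()
for items in baskets.values():
    for a, b in combinations(sorted(items), 2):
        pair_counts[a, b] += 1

def together(a, b):
    """Look up a pair in either order; pairs never seen count as 0."""
    return pair_counts[tuple(sorted((a, b)))]
```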

answered 2010-06-24T16:27:57.827

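Paul's answer may be the best, but here's what I came up with over my lunch break (untested, admittedly, but still a fun thought exercise). I'm not sure about the speed or optimization of my algorithm. Personally, I'd suggest looking at something like MongoDB, a NoSQL database, since it seems to lend itself nicely to this kind of problem (what with map/reduce and all).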

# assuming events is an iterable of (cust_id, item) pairs...
user = {}
for (cust_id, item) in events:
    if cust_id not in user:
        user[cust_id] = set()
    user[cust_id].add(item)
# now we have a dictionary of cust_ids keyed to a set of every item
# they've ever bought (given that repeats don't matter)
# now we construct a dict of items keyed to a dictionary of other items
# which are in turn keyed to num times present
items = {}
def insertOrIter(d, k, v):
    if k in d:
        d[k] += v
    else:
        d[k] = v
for key in user:
    # keep track of items bought with each other
    itemsbyuser = []
    for item in user[key]:
        # make sure the item with dict is set up
        if not item in items:
            items[item] = {}
        # as we see each item, add to it others and others to it
        for other in itemsbyuser:
            insertOrIter(items[other], item, 1)
            insertOrIter(items[item], other, 1)
        itemsbyuser.append(item)
# now, unless i've screwed up my logic, we have a dictionary of items keyed
# to a dictionary of other items keyed to how many times they've been
# bought with the first item. *whew* 
# If you want something more (potentially) useful, we just turn that around to be a
# dictionary of items keyed to a list of tuples of (times seen, other item) and
# you're good to go.
useful = {}
for i in items:
    temp = []
    for other in items[i]:
        temp.append((items[i][other], other))
    useful[i] = sorted(temp, reverse=True)
# Now you should have a dictionary of items keyed to tuples of
# (number times bought with item, other item) sorted in descending order of
# number of times bought together
answered 2010-06-24T17:15:01.200

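It's odd that all of the solutions above churn through the entire database to get the counts every time you want the stats.

I'd suggest keeping the data in flat indexes and fetching results only for a specific item, one at a time. If your item count is large, this will be more efficient.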

from collections import defaultdict
from itertools import groupby

class myDB:
    '''Example of "indexed" "database" of orders <-> items on order'''
    def __init__(self):
        self.id_based_index = defaultdict(set) 
        self.item_based_index = defaultdict(set)

    def add(self, order_data):
        for id, item in order_data:
            self.id_based_index[id].add(item)
            self.item_based_index[item].add(id)

    def get_compliments(self, item):
        all_items = []
        for id in self.item_based_index[item]:
            all_items.extend(self.id_based_index[id])
        gi = groupby(sorted(all_items), lambda x: x)
        return dict([(k, len(list(g))) for k, g in gi])

Example of using it:

events = """1-hammer 
    1-screwdriver 
    1-nails 
    2-hammer 
    2-nails 
    3-screws 
    3-screwdriver 
    4-nails 
    4-screws"""

db = myDB()
db.add(
    [ map(str.strip,e.split('-')) for e in events.splitlines() ]
    )
# index is incrementally increased 
db.add([['5','plunger'],['5','beer']])

# this scans and counts only needed items
assert db.get_compliments('NotToBeFound') == {}
assert db.get_compliments('hammer') == {'nails': 2, 'hammer': 2, 'screwdriver': 1}
# you get back the count for the requested product as well. Discard if not needed.

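This is all fun, but seriously, just go with real database storage. Because indexing is already built into any DB engine, all of the above code in SQL would just be: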

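-- assumes order_product_map has (order_id, product_id) columns
select
    p_others.product_name,
    count(1) cnt
from products p
join order_product_map opm
    on p.product_id = opm.product_id
join order_product_map opm_others
    on opm.order_id = opm_others.order_id
join products p_others
    on opm_others.product_id = p_others.product_id
where p.product_name in ('hammer')
group by p_others.product_name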
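
To try this kind of query without setting up a server, here is a minimal sketch using Python's standard `sqlite3` module with an in-memory database. The schema is an assumption made for illustration: a `products` table and an `order_product_map(order_id, product_id)` table, joined to itself on `order_id` to find the other products on the same orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table products (product_id integer primary key, product_name text unique);
    create table order_product_map (order_id integer, product_id integer);
    create index idx_opm_order on order_product_map (order_id);
    create index idx_opm_product on order_product_map (product_id);
""")

events = [(1, "hammer"), (1, "screwdriver"), (1, "nails"),
          (2, "hammer"), (2, "nails"),
          (3, "screws"), (3, "screwdriver"),
          (4, "nails"), (4, "screws")]
for order_id, name in events:
    # Register the product once, then link it to the order.
    conn.execute("insert or ignore into products (product_name) values (?)", (name,))
    conn.execute(
        "insert into order_product_map "
        "select ?, product_id from products where product_name = ?",
        (order_id, name))

query = """
    select p_others.product_name, count(1) as cnt
    from products p
    join order_product_map opm on p.product_id = opm.product_id
    join order_product_map opm_others on opm.order_id = opm_others.order_id
    join products p_others on opm_others.product_id = p_others.product_id
    where p.product_name = ?
    group by p_others.product_name
"""
result = dict(conn.execute(query, ("hammer",)).fetchall())
```

As with `get_compliments`, the requested product appears in its own result; filter it out if it is not needed.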
answered 2010-06-24T20:26:00.757