c++ - 所有三元组组合，一次 6 个值

Question

~~我正在寻找一种算法，通过一次选择 6 个值来有效地生成数据集的所有三个值组合。~~

我正在寻找一种算法来有效地生成一小组 6 元组，这些 6 元组累积地表达数据集的所有可能的 3 元组组合。

例如，计算 6 张牌的扑克牌手牌，表示所有可能的 3 张牌组合。

例如，给定一个数据集：

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

6 个值的第一个“选择”可能是：

['a','b','c','d','e','f']

这涵盖了三值组合：

('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'b', 'e'), ('a', 'b', 'f'), ('a', 'c', 'd'),
('a', 'c', 'e'), ('a', 'c', 'f'), ('a', 'd', 'e'), ('a', 'd', 'f'), ('a', 'e', 'f'),
('b', 'c', 'd'), ('b', 'c', 'e'), ('b', 'c', 'f'), ('b', 'd', 'e'), ('b', 'd', 'f'),
('b', 'e', 'f'), ('c', 'd', 'e'), ('c', 'd', 'f'), ('c', 'e', 'f'), ('d', 'e', 'f')

显然可以通过：

计算所有三元组组合
选择 6 个值
计算这 6 个值的所有三元组组合
从第一次计算中删除这些组合
重复直到所有的都被考虑在内

在这个例子中，有 2600 种可能的三元组组合(26*25*24)/(3*2*1) == 2600，使用上面的“蛮力”方法，所有三元组组合可以表示为大约 301 个 6 值组。

但是，感觉应该有一种更有效的方法来实现这一点。

我的首选语言是python，但我计划在C++.

更新

这是我的python代码“蛮力”它：

from itertools import combinations
data_set = list('abcdefghijklmnopqrstuvwxyz')

def calculate(data_set):
  all_triplets = list(frozenset(x) for x in itertools.combinations(data_set,3))
  data = set(all_triplets)
  sextuples = []
  while data:
    sxt = set()
    for item in data:
      nxt = sxt | item
      if len(nxt) > 6:
        continue
      sxt = nxt
      if len(nxt) == 6:
        break
    sextuples.append(list(sxt))
    covers = set(frozenset(x) for x in combinations(list(sxt),3))
    data = data - covers
    print "%r\t%s" % (list(sxt),len(data))
  print "Completed %s triplets in %s sextuples" % (len(all_triplets),len(sextuples),)

calculate(data_set)

在 301 个六胞胎中完成 2600 个三胞胎

我正在寻找比这更具计算效率的东西。

更新

Senderle 提供了一个有趣的解决方案：将数据集分成对，然后生成所有可能的对的三元组。这绝对比我想出的任何东西都要好。

这是一个快速功能，用于检查是否覆盖了所有三元组并评估三元组覆盖的冗余：

from itertools import combinations
def check_coverage(data_set,sextuplets):
  all_triplets = dict.fromkeys(combinations(data_set,3),0)
  sxt_count = 0
  for sxt in sextuplets:
    sxt_count += 1
    for triplet in combinations(sxt,3):
      all_triplets[triplet] += 1
  total = len(all_triplets)
  biggest_overlap = overlap = nohits = onehits = morehits = 0
  for k,v in all_triplets.iteritems():
    if v == 0:
      nohits += 1
    elif v == 1:
      onehits += 1
    else:
      morehits += 1
      overlap += v - 1
    if v > biggest_overlap:
      biggest_overlap = v
  print "All Triplets in dataset: %6d" % (total,)
  print "Total triplets from sxt: %6d" % (total + overlap,)
  print "Number of sextuples:     %6d\n" % (sxt_count,)
  print "Missed  %6d of %6d: %6.1f%%" % (nohits,total,100.0*nohits/total)
  print "HitOnce %6d of %6d: %6.1f%%" % (onehits,total,100.0*onehits/total)
  print "HitMore %6d of %6d: %6.1f%%" % (morehits,total,100.0*morehits/total)
  print "Overlap %6d of %6d: %6.1f%%" % (overlap,total,100.0*overlap/total)
  print "Biggest Overlap: %3d" % (biggest_overlap,)

使用 Senderle 的sextuplets生成器，我很着迷地观察到重复的三元组是本地化的，并且随着数据集大小的增加，重复按比例变得更加本地化并且峰值重复更大。

>>> check_coverage（范围（26），sextuplets（范围（26）））
数据集中的所有三元组：2600
来自 sxt 的三胞胎总数：5720
六倍数：286

错过 2600 次中的 0 次：0.0%
HitOnce 2288 的 2600：88.0%
HitMore 312 of 2600: 12.0%
2600 的重叠 3120：120.0%
最大重叠：11

>>> check_coverage（范围（40），sextuplets（范围（40）））
数据集中的所有三元组：9880
来自 sxt 的三胞胎总数：22800
六倍数：1140

错过 9880 次中的 0 次：0.0%
9880 中的 HitOnce 9120：92.3%
HitMore 760 of 9880: 7.7%
重叠 12920 与 9880：130.8%
最大重叠：18

>>> check_coverage（范围（80），sextuplets（范围（80）））
数据集中的所有三元组：82160
来自 sxt 的三胞胎总数：197600
六倍数：9880

错过 82160 次中的 0 次：0.0%
82160 中的 HitOnce 79040：96.2%
点击更多 82160 中的 3120：3.8%
82160 的重叠 115440：140.5%
最大重叠：38

score 1 · Accepted Answer

我相信以下会产生正确的结果。它依赖于直觉，即要生成所有必要的六元组，所需要的只是生成任意项目对的所有可能组合。这将值很好地“混合”在一起，以表示所有可能的三元组。

有轻微的皱纹。对于奇数个项目，一对根本就不是一对，因此您无法从中生成六元组，但仍需要表示该值。这做了一些体操来回避这个问题。可能有更好的方法，但我不确定它是什么。

from itertools import izip_longest, islice, combinations

def sextuplets(seq, _fillvalue=object()):
    if len(seq) < 6:
        yield [tuple(seq)]
        return
    it = iter(seq)
    pairs = izip_longest(it, it, fillvalue=_fillvalue)
    sextuplets = (a + b + c for a, b, c in combinations(pairs, 3))
    for st in sextuplets:
        if st[-1] == _fillvalue:
            # replace fill value with valid item not in sextuplet
            # while maintaining original order
            for i, (x, y) in enumerate(zip(st, seq)):
                if x != y:
                    st = st[0:i] + (y,) + st[i:-1]
                    break
        yield st

我在长度为 10 到 80 的项目序列上对其进行了测试，它在所有情况下都生成了正确的结果。我没有证据表明这将为所有序列提供正确的结果。我也没有证据证明这是一组最小的六胞胎。但我很想听听任何一个证据，如果有人能想出一个。

>>> def gen_triplets_from_sextuplets(st):
...     triplets = [combinations(s, 3) for s in st]
...     return set(t for trip in triplets for t in trip)
... 
>>> test_items = [xrange(n) for n in range(10, 80)]
>>> triplets = [set(combinations(i, 3)) for i in test_items]
>>> st_triplets = [gen_triplets_from_sextuplets(sextuplets(i)) 
                   for i in test_items]
>>> all(t == s for t, s in zip(triplets, st_triplets))
True

虽然我已经说过了，但我还是要再次指出，这是一种实际生成三元组的低效方式，因为它会产生重复项。

>>> def gen_triplet_list_from_sextuplets(st):
...     triplets = [combinations(s, 3) for s in st]
...     return list(t for trip in triplets for t in trip)
... 
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(10)))
>>> len(tlist)
200
>>> len(set(tlist))
120
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(80)))
>>> len(tlist)
197600
>>> len(set(tlist))
82160

事实上，虽然理论上你应该得到加速......

>>> len(list(sextuplets(range(80))))
9880

...itertools.combinations仍然优于sextuplets小序列：

>>> %timeit list(sextuplets(range(20)))
10000 loops, best of 3: 68.4 us per loop
>>> %timeit list(combinations(range(20), 3))
10000 loops, best of 3: 55.1 us per loop

sextuplets对于中等规模的序列，它仍然具有竞争力：

>>> %timeit list(sextuplets(range(200)))
10 loops, best of 3: 96.6 ms per loop
>>> %timeit list(combinations(range(200), 3))
10 loops, best of 3: 167 ms per loop

除非您正在处理非常大的序列，否则我不确定这是否值得。（不过，这是一个有趣的问题。）

c++ - 所有三元组组合，一次 6 个值

1 回答 1

Related

Reference