apache-beam - How do I implement a train/test split with no overlap in Apache Beam?

Question

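I want to train/test split a list of texts with associated entities so that no entity appears in both splits.

Ensuring there is no overlap is the hard part; I currently do it with two groupby operations. I would like to know how to relieve the memory bottleneck these groupby operations cause, or whether there is a cleaner way to do the whole thing.

Input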

ENTITIES TEXT
e1       TextA
e1, e2   TextB
e3       TextC

Desired output:

Train split

ENTITIES TEXT
e1       TextA
e1, e2   TextB

Test split

ENTITIES TEXT
e3       TextC

My approach

First, group by entity:

e1 [{"text":"TextA", "entities":["e1"]}, {"text":"TextB", "entities":["e1","e2"]}]
e2 [{"text":"TextB", "entities":["e1","e2"]}]
e3 [{"text":"TextC", "entities":["e3"]}]

Next, I create a co-occurring entity key:

e1-e2 {"text":"TextA", "entities":["e1"]}
e1-e2 {"text":"TextB", "entities":["e1","e2"]}
e1-e2 {"text":"TextB", "entities":["e1","e2"]}
e3 {"text":"TextC", "entities":["e3"]}

Then I group by this co-occurrence key:

e1-e2 [{"text":"TextA", "entities":["e1"]}, {"text":"TextB", "entities":["e1","e2"]}]
e3 [{"text":"TextC", "entities":["e3"]}]
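The three steps above can be reproduced on the sample rows in plain Python, which makes it easier to see what each grouping produces. This is only a sketch outside Beam: the variable names are hypothetical, and duplicates are dropped during the last grouping for brevity rather than with a final distinct.

```python
from collections import defaultdict

rows = [
    {"text": "TextA", "entities": ["e1"]},
    {"text": "TextB", "entities": ["e1", "e2"]},
    {"text": "TextC", "entities": ["e3"]},
]

# Step 1: group rows by each entity they mention.
by_entity = defaultdict(list)
for row in rows:
    for entity in row["entities"]:
        by_entity[entity].append(row)

# Step 2: key every row by the union of entities seen in its group.
keyed = []
for entity, group in by_entity.items():
    union = sorted({e for row in group for e in row["entities"]})
    key = "-".join(union)
    keyed.extend((key, row) for row in group)

# Step 3: group by the co-occurrence key, skipping rows already seen.
by_key = defaultdict(list)
for key, row in keyed:
    if row not in by_key[key]:
        by_key[key].append(row)

groups = {key: [row["text"] for row in group] for key, group in by_key.items()}
print(groups)  # {'e1-e2': ['TextA', 'TextB'], 'e3': ['TextC']}
```

Note that the key built in step 2 only sees entities one hop away, so longer chains of shared entities would end up under different keys.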

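Then I do the train/test split with partition, and finally apply distinct to remove duplicates.

On a large dataset with 7 million entries, my job fails at the groupby operation; see the error below.

Error

Sadly, my approach fails here: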

  logger:  "root:shuffle.py:try_split"   
  message:  "Refusing to split <dataflow_worker.shuffle.GroupedShuffleRangeTracker object at 0x7fab8a9d2a58> at b'\x9f|\xe7c\x00\x01': proposed split position is out of range [b'\x95n*A\x00\x01', b'\x9f|\xe7c\x00\x01'). Position of last group processed was b'\x9f|\xe7b\x00\x01'."

  logger:  "root:shuffle.py:request_dynamic_split"   
  message:  "Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7fab8a9d2588> at n3znYwAB"

score 1 · Accepted Answer

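These error messages are related to Dataflow's dynamic resharding, not to your particular splitting approach. They should not be fatal to your job. (Are they?)

That said, I don't think this can be done with a single grouping. For example, imagine a dataset with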

ENTITIES TEXT
e1       TextA
e1, e2   TextB
e2, e3   TextC
...
eN, eN+1 TextX

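It would take O(N) groupings to discover the relationship between TextA and TextX. (Essentially, what you are trying to do here is find disjoint connected components.)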
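To make the connected-components view concrete, here is a minimal union-find sketch in plain Python. It is not a Beam transform and not something the answer proposes: it assumes the entity graph fits in memory on one machine and that every row has at least one entity, and the function name `connected_components` is hypothetical.

```python
def connected_components(rows):
    """Group texts whose entity sets are linked, directly or through a chain."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Entities that appear in the same row belong to the same component.
    for row in rows:
        first = row["entities"][0]
        find(first)
        for other in row["entities"][1:]:
            union(first, other)

    # Collect texts under the root entity of their component.
    components = {}
    for row in rows:
        root = find(row["entities"][0])
        components.setdefault(root, []).append(row["text"])
    return sorted(components.values())


chain = [
    {"text": "TextA", "entities": ["e1"]},
    {"text": "TextB", "entities": ["e1", "e2"]},
    {"text": "TextC", "entities": ["e2", "e3"]},
    {"text": "TextD", "entities": ["e4"]},
]
print(connected_components(chain))
# [['TextA', 'TextB', 'TextC'], ['TextD']]
```

Once each text carries a component id, assigning whole components to train or test cannot produce entity overlap, because no entity spans two components.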

score 0 · Accepted Answer

To solve this without using GroupBy:

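import hashlib

import apache_beam as beam

def entity_partition(entity, num_partitions):
  # hash() of a str differs between Python processes, so use a stable digest.
  bucket = int(hashlib.md5(entity.encode("utf-8")).hexdigest(), 16) % 100
  return 0 if bucket < 80 else 1  # 0 = train (about 80%), 1 = test

def split_fn(example, train, test):
  """
  3 cases:
    example["entities"] only contains elements that are also in train --> label as train
    example["entities"] only contains elements that are also in test --> label as test
    example["entities"] contains both elements in train and test --> for this never to happen you need an extra constraint (as @robertwb mentioned) on your data.
  """
  entities = set(example["entities"])
  if entities.issubset(train):
    return example, "train"
  if entities.issubset(test):
    return example, "test"
  return example, "mixed"  # case 3: needs the extra constraint on the data

# p is assumed to be a PCollection of dicts with "text" and "entities" keys.
# beam.Distinct() replaces the hand-written set combiner; it only shuffles
# entity names, not the examples themselves.
unique_entities = (p
                   | 'Extract' >> beam.FlatMap(lambda x: x["entities"])
                   | 'Unique' >> beam.Distinct())

ttrain, ttest = unique_entities | 'SplitEntities' >> beam.Partition(entity_partition, 2)
res = (p
       | 'LabelExamples' >> beam.Map(split_fn,
                                     train=beam.pvalue.AsList(ttrain),
                                     test=beam.pvalue.AsList(ttest)))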
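The third case in the docstring is easy to hit when entities are assigned to splits independently of each other. A quick plain-Python check on the question's sample rows shows it; the train/test entity sets below are picked by hand as an assumption for illustration, not produced by any hash.

```python
# Hypothetical entity assignment: e1 and e3 landed in train, e2 in test.
train_entities = {"e1", "e3"}
test_entities = {"e2"}

rows = [
    {"text": "TextA", "entities": ["e1"]},
    {"text": "TextB", "entities": ["e1", "e2"]},
    {"text": "TextC", "entities": ["e3"]},
]

# A row straddles the split when it mentions entities from both sides.
straddling = [
    row["text"]
    for row in rows
    if set(row["entities"]) & train_entities and set(row["entities"]) & test_entities
]
print(straddling)  # ['TextB']
```

Rows like TextB have to be dropped, or the data has to satisfy the extra constraint mentioned above, for the two splits to stay free of shared entities.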
