python - PyTorch dataset Field for a sequence of vectors (no vocab)

Question

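I have a "simple" machine translation task in which a sequence of vectors has to be mapped to one or two words. (The vectors are 258-dimensional.)

For example:

[[1, ..., 2], [3, ..., 4]] => "hello"
[[1, ..., 2], [3, ..., 4], [5, ..., 6]] => "hello world"

For the target field I am using Field(eos_token="<eos>", is_target=True), which, when batched, correctly gives me a padded tensor. In this case: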

tensor([
  [1, 1], # 1 is "hello"
  [2, 3], # 2 is "world", 3 is <eos>
  [3, 0], # 0 is <pad>
])

However, the src field is not padded in the same way. It is sequential but has no vocabulary (Field(use_vocab=False)).

When I read src from a BucketIterator with a batch size > 1, I get:

Traceback (most recent call last):
  File "train.py", line 50, in train
    for b, batch in enumerate(train_iter):
  File "/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "/torchtext/data/field.py", line 359, in numericalize
    var = torch.tensor(arr, dtype=self.dtype, device=device)
ValueError: expected sequence of length 258 at dim 2 (got 5)
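
For reference, the same class of error can be triggered with torch alone. The sketch below is not taken from my training code: it uses hand-written 3-dimensional stand-in vectors instead of 258-dimensional ones, and only shows that torch.tensor raises a ValueError when the innermost lists of a nested list do not all have the same length.

```python
import torch

# Stand-in data: two "sequences" of 3-dimensional vectors.
# The last element of the second sequence has the wrong length,
# so the nested list is not rectangular.
ragged = [
    [[1, 2, 3], [4, 5, 6]],
    [[1, 2, 3], [0]],
]

try:
    torch.tensor(ragged)
    error = None
except ValueError as exc:
    # torch.tensor refuses to build a tensor from ragged nested lists.
    error = exc

print(type(error).__name__, error)
```

This suggests that whatever is used to pad the shorter sequences does not have the same length as a real 258-dimensional vector.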

What I would like to get is a tensor:

tensor([
  [[1, ..., 2], [1, ..., 2]],
  [[3, ..., 4], [3, ..., 4]],
  [[5, ..., 6], [0, ..., 0]],
  [[0, ..., 0], [0, ..., 0]],
])
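
To make the desired layout concrete, here is a minimal sketch that builds such a tensor outside of torchtext, using torch.nn.utils.rnn.pad_sequence. This is a swapped-in technique, not something the Field does for me. The variable names and the constant fill values are made up for illustration, and zero vectors as padding are an assumption.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

DIM = 258  # vector size from the question

# Two stand-in source sequences of different lengths (2 and 3 vectors).
seqs = [
    torch.ones(2, DIM),        # e.g. the input for "hello"
    torch.ones(3, DIM) * 2.0,  # e.g. the input for "hello world"
]

# With the default batch_first=False the result has shape
# (max_len, batch_size, DIM); shorter sequences are filled with
# zero vectors (padding_value defaults to 0).
batch = pad_sequence(seqs)

# Keep the true lengths so the padding can be masked or packed later.
lengths = torch.tensor([s.size(0) for s in seqs])

print(batch.shape, lengths)
```

Here the third time step of the first sequence is an all-zero vector of length 258, which is the kind of padding shown in the tensor above.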

What I think I may have, but don't know how to confirm, is:

tensor([
  [[1, ..., 2], [1, ..., 2]],
  [[3, ..., 4], [3, ..., 4]],
  [[5, ..., 6], 0],
  [0, 0],
])
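
One possible way to confirm this, sketched under assumptions: walk the nested list before it is turned into a tensor (for example the list called padded in the traceback) and record which lengths or scalar types occur at each nesting depth. describe_nesting is a hypothetical helper, not part of torchtext, and the data below is a hand-written stand-in with 2-dimensional vectors and a scalar 0 as padding.

```python
def describe_nesting(obj, depth=0, report=None):
    """Collect, per nesting depth, the list lengths and scalar type names seen."""
    if report is None:
        report = {}
    if isinstance(obj, (list, tuple)):
        report.setdefault(depth, set()).add(len(obj))
        for item in obj:
            describe_nesting(item, depth + 1, report)
    else:
        report.setdefault(depth, set()).add(type(obj).__name__)
    return report


# Stand-in for a padded batch: the second sequence was padded
# with a scalar 0 instead of a zero vector.
padded = [
    [[1, 2], [3, 4], [5, 6]],
    [[1, 2], [3, 4], 0],
]

report = describe_nesting(padded)
for depth in sorted(report):
    print(depth, report[depth])
```

If a depth reports both a list length and a scalar type (or a length that differs from the vector size), the padding element is not a full vector, which would explain why torch.tensor cannot build a rectangular tensor.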
