
I am trying to load the Multi30k dataset from torchtext in Google Colab. Loading it with .de works fine, but as soon as I change .de to .fr I get this error:

FileNotFoundError: [Errno 2] No such file or directory: '.data/multi30k/train.fr'

This is how I load it with .de, and it works:

train_data, valid_data, test_data = datasets.Multi30k.splits(
    root=".data",
    exts=('.de', '.en'),
    fields = (SRC, TRG),
    
)

As soon as I change .de to .fr in this code, it raises the error:

train_data, valid_data, test_data = datasets.Multi30k.splits(
    root=".data",
    exts=('.fr', '.en'),
    fields = (SRC, TRG),
    
)

Imports

import torch
from torch import nn
from torch.nn  import functional as F
import spacy, math, random
import numpy as np
from torchtext.legacy import datasets, data
import time
from prettytable import PrettyTable
from matplotlib import pyplot as plt

Seed

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
random.seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

spaCy tokenizers

import spacy
spacy.cli.download('fr_core_news_sm')

spacy_fr = spacy.load('fr_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_fr(sent):
  return [tok.text for tok in spacy_fr.tokenizer(sent)]

def tokenize_en(sent):
  return [tok.text for tok in spacy_en.tokenizer(sent)]

Fields

SRC = data.Field(
    tokenize= tokenize_fr,
    lower= True,
    init_token = "<sos>",
    eos_token = "<eos>",
    include_lengths =True
)

TRG = data.Field(
    tokenize = tokenize_en,
    lower= True,
    init_token = "<sos>",
    eos_token = "<eos>"
)

The cell that raises the error

train_data, valid_data, test_data = datasets.Multi30k.splits(
    root=".data",
    exts=('.fr', '.en'),
    fields = (SRC, TRG),
)

1 Answer


This is because the dataset itself does not contain a train.fr file.

If you list what PyTorch (torchtext) downloaded, you will see a test2016.fr file but no train.fr or val.fr:

$ !ls -al .data/multi30k
total 5.4M
drwxr-xr-x 2 root root 4.0K Jul 15 14:26 .
drwxr-xr-x 3 root root 4.0K Jul 15 14:26 ..
-rw-r--r-- 1 root root  65K Jul 15 14:26 mmt_task1_test2016.tar.gz
-rw-rw-r-- 1 1000 1000  69K Oct 17  2016 test2016.de
-rw-rw-r-- 1 1000 1000  61K Oct 17  2016 test2016.en
-rw-rw-r-- 1 1000 1000  71K Feb 11  2017 test2016.fr
-rw-rw-r-- 1 1000 1000 2.1M Feb  2  2016 train.de
-rw-rw-r-- 1 1000 1000 1.8M Feb  2  2016 train.en
-rw-r--r-- 1 root root 1.2M Jul 15 14:26 training.tar.gz
-rw-rw-r-- 1 1000 1000  75K Feb  2  2016 val.de
-rw-rw-r-- 1 1000 1000  62K Feb  2  2016 val.en
-rw-r--r-- 1 root root  46K Jul 15 14:26 validation.tar.gz
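
The same check can be done from Python before calling Multi30k.splits. This is a minimal sketch, assuming the loader looks for files named <root>/multi30k/<split><ext>, which is what the path in the error message suggests. The helper missing_split_files and its default split names (train, val, test2016, taken from the listing above) are hypothetical and not part of torchtext.

```python
from pathlib import Path


def missing_split_files(root, exts, prefixes=("train", "val", "test2016"),
                        name="multi30k"):
    """Return the expected split files that do not exist on disk.

    Hypothetical helper: it assumes the loader reads
    <root>/<name>/<prefix><ext> for every split prefix and extension.
    """
    base = Path(root) / name
    expected = [base / f"{prefix}{ext}" for prefix in prefixes for ext in exts]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # With the directory listing above, the .fr training and validation
    # files would be reported as missing.
    for path in missing_split_files(".data", (".fr", ".en")):
        print("missing:", path)
```

If the list is not empty for the extensions you want, the splits call will fail with the same FileNotFoundError, so the French training and validation files have to be obtained separately and placed in that directory first.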
answered 2021-07-15T14:35:12.907