
I am working with the dailydialog dataset, which I have converted into a JSON file that looks like this:

[{"response": "You know that is tempting but is really not good for our fitness.", "message": "Say, Jim, how about going for a few beers after dinner?"}, {"response": "Do you really think so? I don't. It will just make us fat and act silly. Remember last time?", "message": "What do you mean? It will help us to relax."}, {"response": "I suggest a walk over to the gym where we can play singsong and meet some of our friends.", "message": "I guess you are right. But what shall we do? I don't feel like sitting at home."}, {"response": "Sounds great to me! If they are willing, we could ask them to go dancing with us.That is excellent exercise and fun, too.", "message": "That's a good idea. I hear Mary and Sally often go there to play pingpong.Perhaps we can make a foursome with them."}, {"response": "All right.", "message": "Please lie down over there."}]

So each item has two keys: response and message.

This is my first time using PyTorch, so I have been following resources available online. These are the relevant snippets of my code:

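# Imports assumed for the legacy torchtext API (torchtext <= 0.8).
import spacy
from torchtext.data import Field, TabularDataset

# Assumption: the spaCy model name is not given in the post.
spacy_en = spacy.load('en_core_web_sm')

def tokenize_en(text):
    # Split raw text into a list of token strings with spaCy.
    return [tok.text for tok in spacy_en.tokenizer(text)]

src = Field(tokenize = tokenize_en,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

# Only the 'response' key is loaded; it becomes attribute `r` on each example.
fields = {'response': ('r', src)}

train_data, test_data, validation_data = TabularDataset.splits(
                                        path = 'FilePath',
                                        train = 'trainset.json',
                                        test = 'testset.json',
                                        validation = 'validationset.json',
                                        format = 'json',
                                        fields = fields
)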

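No error is raised, and my JSON files contain many items, yet strangely the train, test, and validation datasets each contain only 1 example, as shown in the image below:

(Image: lengths of the train, test, and validation datasets)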

I would be grateful if someone could point out my mistake.

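Edit: I found out that, because the file has no indentation, the whole file is being treated as a single line of text. However, if I indent the JSON file, TabularDataset throws a JSONDecodeError, meaning it can no longer decode the file. How can I get around this?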


1 Answer


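I think the code is fine; the problem is in your JSON file. Could you try removing the square brackets ("[]") at the beginning and end of the file? That may be why your Python script reads the file as a single object.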
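If removing the brackets by hand is not enough, a small conversion script may help. The sketch below assumes that TabularDataset with format = 'json' reads the file line by line and expects one complete JSON object per line (the JSON Lines layout). That would be consistent with both symptoms in the question: a single-line array counting as one example, and an indented file failing to decode. The helper names json_array_to_jsonl and count_line_records, and the output file name in the note below, are made up for illustration.

```python
import json
from pathlib import Path


def json_array_to_jsonl(src_path, dst_path):
    """Rewrite a file holding one JSON array as one JSON object per line."""
    records = json.loads(Path(src_path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    with open(dst_path, "w", encoding="utf-8") as out:
        for record in records:
            # json.dumps without `indent` never emits a raw newline,
            # so every record stays on exactly one line.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


def count_line_records(path):
    """Count records the way a line-by-line JSON reader would see them."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises JSONDecodeError on a partial object
                count += 1
    return count
```

For example, you could call json_array_to_jsonl('trainset.json', 'trainset.jsonl') for each split and pass the new file names to TabularDataset.splits. count_line_records on a converted file should then match the number of items in the original array.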

Answered 2020-07-14T09:56:12.517