I have a list of dictionaries like the following:
[{'text': ['The', 'Fulton', 'County', 'Grand', ...], 'tags': ['AT', 'NP-TL', 'NN-TL', 'JJ-TL', ...]},
{'text': ['The', 'jury', 'further', 'said', ...], 'tags': ['AT', 'NN', 'RBR', 'VBD', ...]},
...]
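Each value in each dict is a list holding the words or the tags of one sentence. The data comes straight from the Brown corpus in the NLTK datasets, loaded and saved as follows: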
from nltk.corpus import brown
import pandas as pd
from sklearn.model_selection import train_test_split

# Each sentence is a list of (word, tag) pairs; split them into parallel lists.
sents = brown.tagged_sents()
data = {'text': [[word for word, tag in sent] for sent in sents], 'tags': [[tag for word, tag in sent] for sent in sents]}
df = pd.DataFrame(data, columns=["text", "tags"])
# Hold out 20% of the sentences for validation.
train, val = train_test_split(df, test_size=0.2)
train.to_json("train.json", orient='records')
val.to_json("val.json", orient='records')
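The two `to_json` calls above use `orient='records'`, which writes each file as a single JSON array. A minimal standard-library sketch of how that layout differs from one-object-per-line output (JSON Lines) when a reader parses the file one line at a time. The sample rows are made up for illustration, and the assumption that the loader reads line by line is mine, not something shown in the traceback.

```python
import json

# Hypothetical sample rows in the same shape as the DataFrame records.
records = [
    {"text": ["The", "jury", "said"], "tags": ["AT", "NN", "VBD"]},
    {"text": ["It", "rained"], "tags": ["PPS", "VBD"]},
]

# orient='records' style: the whole dataset is one JSON array on a single line.
as_array = json.dumps(records)

# JSON Lines style: one JSON object per line.
as_lines = "\n".join(json.dumps(rec) for rec in records)

# Parsing the first line gives a list for the array layout
# and a dict for the JSON Lines layout.
first_from_array = json.loads(as_array.splitlines()[0])
first_from_lines = json.loads(as_lines.splitlines()[0])

print(type(first_from_array).__name__)
print(type(first_from_lines).__name__)
```

With pandas, passing `lines=True` together with `orient='records'` to `to_json` produces the second layout.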
I want to load this JSON into a torchtext.data.TabularDataset using the following:
from torchtext import data
TEXT = data.Field(lower=True)
TAGS = data.Field(unk_token=None)
data_fields = [('text', TEXT), ('tags', TAGS)]
train, val = data.TabularDataset.splits(path='./', train='train.json', validation='val.json', format='json', fields=data_fields)
But it gives me this error:
/usr/local/lib/python3.6/dist-packages/torchtext/data/example.py in fromdict(cls, data, fields)
17 def fromdict(cls, data, fields):
18 ex = cls()
---> 19 for key, vals in fields.items():
20 if key not in data:
21 raise ValueError("Specified key {} was not found in "
AttributeError: 'list' object has no attribute 'items'
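Note that I don't want TabularDataset to tokenize the sentences for me, because they have already been tokenized by nltk. How should I handle this? (I can't switch to a corpus that can be loaded directly from torchtext.datasets; I have to use the Brown corpus.)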
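One reading of the traceback: `fromdict` calls `fields.items()`, so with `format='json'` the `fields` argument apparently has to be a dict rather than a list of tuples. A minimal sketch of that reading, under stated assumptions: `from_dict_sketch` is a hypothetical, simplified stand-in for the loop in the traceback (not torchtext's real code), and `TEXT` and `TAGS` are placeholder objects instead of real `Field` instances.

```python
def from_dict_sketch(record, fields):
    """Simplified stand-in for the loop shown in the traceback."""
    example = {}
    # .items() only exists on a dict, so a list of tuples fails here.
    for key, (name, field) in fields.items():
        if key not in record:
            raise ValueError("Specified key {} was not found".format(key))
        example[name] = record[key]
    return example


# Placeholders for the TEXT and TAGS Field objects from the question.
TEXT, TAGS = object(), object()

record = {"text": ["The", "jury", "said"], "tags": ["AT", "NN", "VBD"]}

# List form, as in the question: raises the same kind of AttributeError.
list_fields = [("text", TEXT), ("tags", TAGS)]
try:
    from_dict_sketch(record, list_fields)
    list_error = None
except AttributeError as err:
    list_error = str(err)

# Dict form: JSON key -> (attribute name, field).
dict_fields = {"text": ("text", TEXT), "tags": ("tags", TAGS)}
example = from_dict_sketch(record, dict_fields)

print(list_error)
print(example["tags"])
```

If that reading is right, the corresponding change in the code above would presumably be `data_fields = {'text': ('text', TEXT), 'tags': ('tags', TAGS)}`. The token lists pass through the stand-in unchanged here; whether a real `Field` leaves pre-tokenized lists alone is not something this sketch demonstrates.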