I'm looking at the code shown at 1:25 in this video, which reads:
def tokenize_and_chunk(texts):
    return tokenizer(
        texts["text"], truncation=True, max_length=context_length,
        return_overflowing_tokens=True
    )

tokenized_datasets = raw_datasets.map(
    tokenize_and_chunk, batched=True, remove_columns=["text"]
)
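As I understand it, return_overflowing_tokens=True makes the tokenizer split each long text into several chunks of at most max_length tokens instead of just truncating it, so a single input row can produce several output rows. A minimal sketch of that behavior (the "t5-base" checkpoint and the tiny max_length are just my own illustrative choices):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

# One long string comes back as several chunks of at most 8 tokens each.
encoded = tokenizer(
    "Here are many words! " * 10,
    truncation=True,
    max_length=8,
    return_overflowing_tokens=True,
)
print(len(encoded["input_ids"]))              # > 1: one text in, several chunks out
print(encoded["overflow_to_sample_mapping"])  # every chunk maps back to input 0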
This is the error I get when I try to run that code:
import pandas as pd
from datasets import Dataset
from transformers import AutoModel, AutoTokenizer

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

context_length = 1000

def tokenize_and_chunk(texts):
    return tokenizer(
        texts["text"], truncation=True, max_length=context_length,
        return_overflowing_tokens=True,
    )

dataset = Dataset.from_pandas(pd.DataFrame([{"id": "123", "text": "Here are many words! " * 5000}]))
This shows a perfectly fine dataset:
Dataset({
    features: ['id', 'text'],
    num_rows: 1
})
OK, let's run the tokenizer:
tokenized_datasets = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-69-d1216744e2ab> in <module>
----> 1 tokenized_datasets = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])
ArrowInvalid: Column 1 named id expected length 5 but got length 1
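My reading of the error: with return_overflowing_tokens=True, the single input row is split into 5 chunks, so the tokenizer output has 5 rows, while the id column (which remove_columns=["text"] leaves untouched) still has only 1 value, and Arrow refuses to build a table from columns of different lengths. A sketch of the workaround I assume is intended, dropping every original column so the row counts no longer have to match:

tokenized_datasets = dataset.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=dataset.column_names,  # drop "id" as well as "text"
)

If the id values need to be kept, I assume the overflow_to_sample_mapping returned by the fast tokenizer could be used inside tokenize_and_chunk to repeat each row's id once per chunk, but I have not verified that.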