python - Vertex AI fails to import data into a dataset. It says the limit is 1M lines, but my dataset only has 600k

Question

I am importing a text dataset into Google Vertex AI and got the following error:

Hello Vertex AI Customer,

Due to an error, Vertex AI was unable to import data into 
dataset [dataset_name].
Additional Details:
Operation State: Failed with errors
Resource Name: [resource_link]
Error Messages: There are too many rows in the jsonl/csv file. Currently we 
only support 1000000 lines. Please cut your files to smaller size and run 
multiple import data pipelines to import.

I checked both the dataset I generated from pandas and the actual CSV file, and it only has 600k rows.

Has anyone run into a similar error?

score 1 · Accepted Answer

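It turned out my CSV was formatted incorrectly.

I had forgotten to trim the newlines and extra whitespace in my text dataset. Presumably each embedded newline was being counted as an extra line, which pushed the 600k rows past the limit. Trimming them resolved the 1M-line error. But after doing so, I got another error telling me there were too many labels, even though there were only 2: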

Error Messages: There are too many AnnotationSpecs in the dataset. Up to 
5000 AnnotationSpecs are allowed in one Dataset.
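
As an aside on the first fix (trimming newlines and extra whitespace), here is a minimal sketch of how that cleanup might look in pandas. The column names "text" and "label", the sample rows, and the helper clean_text are assumptions made for illustration; they are not from my actual dataset.

```python
import pandas as pd

# Hypothetical two-column frame; "text" and "label" are assumed column names.
df = pd.DataFrame({
    "text": ["  first line\nsecond line  ", "plain   sentence\r\n"],
    "label": [0, 1],
})


def clean_text(series: pd.Series) -> pd.Series:
    """Collapse each run of whitespace (newlines included) into one space, then trim both ends."""
    return series.str.replace(r"\s+", " ", regex=True).str.strip()


df["text"] = clean_text(df["text"])
print(df["text"].tolist())
```

After this step no text value contains a line break, so each record should occupy exactly one physical line in the exported file.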

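The AnnotationSpecs error occurred because I had created the text dataset with the to_csv() method of a Pandas DataFrame. A CSV file created this way only gets quotes around a text field when that field contains a "," (comma character). So the CSV file looks like this: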

"this is a sentence, with a comma", 0
this is a sentence without a comma, 1

Meanwhile, Vertex AutoML Text wants the CSV to look like this:

"this is a sentence, with a comma", 0
"this is a sentence without a comma", 1

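That is, you have to put quotes around the text on every row.

You can do that by writing your own CSV formatter, or, if you insist on using Pandas to_csv(), you can pass csv.QUOTE_ALL to the quoting parameter. It looks like this: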

import csv
# QUOTE_ALL wraps every field in double quotes; header and index are left out of the file
df.to_csv("file.csv", index=False, quoting=csv.QUOTE_ALL, header=False)
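
To see the difference between the two quoting modes without writing a file, a small sketch like the following may help. The DataFrame contents and the column names "text" and "label" are assumptions for illustration. Calling to_csv() without a path returns the CSV as a string.

```python
import csv

import pandas as pd

# Hypothetical sample data mirroring the two rows shown above.
df = pd.DataFrame({
    "text": [
        "this is a sentence, with a comma",
        "this is a sentence without a comma",
    ],
    "label": [0, 1],
})

# Default behaviour (csv.QUOTE_MINIMAL): only fields that need quoting get quotes.
minimal_lines = df.to_csv(index=False, header=False).splitlines()

# csv.QUOTE_ALL: every field is quoted, including the numeric label.
quoted_lines = df.to_csv(
    index=False, header=False, quoting=csv.QUOTE_ALL
).splitlines()

for line in minimal_lines + quoted_lines:
    print(line)
```

Note that csv.QUOTE_ALL also puts quotes around the label column, so the output is not character-for-character identical to the target format shown earlier. It was accepted by the import in my case.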
