python - How to load a dataset in streaming mode on Google Colab?

Question

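I am trying to save disk space so that I can use the CommonVoice French dataset (19 GB) on Google Colab, because my notebook keeps crashing when it runs out of disk space. I saw in the HuggingFace documentation that a dataset can be loaded in streaming mode, so that we can "iterate over it directly without having to download the entire dataset". I tried to use that mode in Google Colab but could not get it to work, and I have not found any information about this problem.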

!pip install datasets
!pip install 'datasets[streaming]'
!pip install aiohttp

from datasets import load_dataset

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

I then get the following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-24-489f8a0ca4e4> in <module>()
----> 1 common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, task, streaming, **config_kwargs)
    811         if not config.AIOHTTP_AVAILABLE:
    812             raise ImportError(
--> 813                 f"To be able to use dataset streaming, you need to install dependencies like aiohttp "
    814                 f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance'
    815             )

ImportError: To be able to use dataset streaming, you need to install dependencies like aiohttp using "pip install 'datasets[streaming]'" or "pip install aiohttp" for instance

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

Is there a reason why Google Colab would not allow datasets to be loaded in streaming mode?

If not, what am I missing?

score 0 · Accepted Answer

Writing an answer for future reference. According to @kkgarg's comment, the streaming feature does not seem to be implemented yet.

!pip install aiohttp
!pip install datasets
from datasets import load_dataset, load_metric

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

This triggers the following error:

/usr/local/lib/python3.7/dist-packages/datasets/utils/streaming_download_manager.py in _get_extraction_protocol(self, urlpath)
    137         elif path.endswith(".zip"):
    138             return "zip"
--> 139         raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")
    140 
    141     def download_and_extract(self, url_or_urls):

NotImplementedError: Extraction protocol for file at https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tr.tar.gz is not implemented yet

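This means that streaming is not yet implemented or supported for this dataset. Perhaps it is because using common_voice requires extracting an archive, which streaming does not support (?). The streaming feature itself must already exist, since it is in the documentation...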
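To make the failure easier to reason about, here is a minimal sketch of the kind of suffix check visible in the traceback above. It is a simplified, hypothetical stand-in, not the actual source of the `datasets` library: the names `SUPPORTED_SUFFIXES`, `get_extraction_protocol` and `can_stream`, as well as the example URLs, are assumptions made for illustration, and only the `.zip` branch shown in the traceback is modelled.

```python
# Hypothetical, simplified stand-in for the check seen in the traceback.
# This is NOT the real implementation inside the `datasets` library.

# Only the branch visible in the traceback is modelled here (assumption).
SUPPORTED_SUFFIXES = {".zip": "zip"}


def get_extraction_protocol(urlpath: str) -> str:
    """Return the extraction protocol for a URL, or raise if none matches."""
    for suffix, protocol in SUPPORTED_SUFFIXES.items():
        if urlpath.endswith(suffix):
            return protocol
    # Same message format as the error reported in the traceback.
    raise NotImplementedError(
        f"Extraction protocol for file at {urlpath} is not implemented yet"
    )


def can_stream(urlpath: str) -> bool:
    """Report whether the sketch above would accept this archive URL."""
    try:
        get_extraction_protocol(urlpath)
    except NotImplementedError:
        return False
    return True


# Example URLs are placeholders, not real dataset locations.
print(can_stream("https://example.com/data.zip"))             # True
print(can_stream("https://example.com/cv-corpus/fr.tar.gz"))  # False
```

Under these assumptions, a `.tar.gz` archive such as the one in the error message falls through every supported suffix and reaches the `raise`, which would explain why the call fails for this dataset while streaming still works for others.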
