azure - Azure Machine Learning - Azure Blob Storage

Question

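How can I read data from multiple files stored in Azure Blob Storage "at once" using Azure Machine Learning Studio?

I tried the Reader module, and it works fine for a single file. Does it work for multiple files, or do I have to look for another solution?

Thanks for your help!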

score 1 · Accepted Answer

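If there aren't that many blobs, you can add multiple Reader modules, each mapped to one of your input blobs. Then use the modules under "Data Transformation" -> "Manipulation" to perform operations such as "Add Rows" or "Join".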

score 0 · Accepted Answer

Use several Reader modules, each reading a different blob, and then connect them to the Metadata Editor.

Answered 2015-10-20T08:57:58.653

score 0 · Accepted Answer

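Although using multiple Reader modules will work, it becomes very difficult when there are many inputs or when the number of inputs varies.

Instead, you can use the Execute Python Script module to access blob storage directly. However, if you have never done this before, it can be quite painful. Here are the problems:

By default, the azure.storage.blob Python package is not loaded in Azure ML. However, it can be created manually or downloaded from the link below (version correct as of 11 February 2016).
The default azure.storage.blob.BlobService uses HTTPS, which Azure ML's blob storage access does not currently support. To work around this, pass protocol='http' when creating the BlobService to force HTTP: client = BlobService(STORAGE_ACCOUNT, STORAGE_KEY, protocol="http")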

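Here are the steps to make it work:

Download azure.zip, which provides the required azure.storage.* libraries: https://azuremlpackagesupport.blob.core.windows.net/python/azure.zip
Upload it to Azure ML Studio as a dataset.
Connect it to the Zip input on the Execute Python Script module, which is the third input.
Write your script as usual, making sure the BlobService is created with protocol='http'.
Run the experiment. You should now be able to read from and write to blob storage.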
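
If the import still fails after following the steps above, a quick check inside the script can show whether the zipped package was picked up at all. This is only a sketch: module_available is a hypothetical helper name, not part of Azure ML or the azure package, and it uses nothing beyond the standard library.

```python
import importlib


def module_available(name):
    """Return True if the module `name` can be imported in this environment.

    Hypothetical helper for illustration; it only wraps a normal import
    in a try/except so the result can be printed or logged.
    """
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False
```

Calling module_available('azure.storage.blob') at the top of azureml_main and printing the result should tell you whether the problem lies with the Zip input or with the rest of the script.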

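Some sample code can be found here: https://gist.github.com/drdarshan/92fff2a12ad9946892df

Here is code that makes it work for a single file. It can be extended to handle many files by accessing the container and filtering, but that will depend on your business logic.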

from azure.storage.blob import BlobService

def azureml_main(dataframe1 = None, dataframe2 = None):
    account_name = 'mystorageaccount'
    account_key='p8kSy3FACx...redacted...ebz3plQ=='
    container_name = "upload"

    blob_service = BlobService(account_name, account_key, protocol='http')

    file = blob_service.get_blob_to_text(container_name,'myfile.txt')
    # You can also get_blob_to_(bytes|file|path), if you need to do so.

    # Do stuff with your file here
    #   Logic, logic, logic

    # Execute Python Script requires that a dataframe is returned. It can be null.
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1,
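
To read several files in one go, the single-file code above can be extended along the following lines. This is a sketch rather than a tested recipe: read_matching_blobs is a hypothetical helper, the prefix and suffix filters stand in for whatever business logic you need, and it assumes that the legacy BlobService exposes list_blobs(container_name, prefix=...) yielding objects with a name attribute, alongside the get_blob_to_text call used above. Containers holding a very large number of blobs may need paging, which this sketch does not handle.

```python
def read_matching_blobs(blob_service, container_name, prefix=None, suffix=''):
    """Return {blob_name: text} for each blob whose name matches the filters.

    Hypothetical helper: `blob_service` is assumed to behave like the legacy
    azure.storage.blob.BlobService (list_blobs and get_blob_to_text).
    """
    contents = {}
    # The prefix is passed to the service; the suffix is checked locally.
    for blob in blob_service.list_blobs(container_name, prefix=prefix):
        if blob.name.endswith(suffix):
            contents[blob.name] = blob_service.get_blob_to_text(
                container_name, blob.name)
    return contents


def azureml_main(dataframe1=None, dataframe2=None):
    # Imported inside the function so the helper above can be tried out
    # without the azure bundle attached.
    from azure.storage.blob import BlobService

    # Placeholder credentials; replace with your own account name and key.
    blob_service = BlobService('mystorageaccount', '<account key>',
                               protocol='http')

    texts = read_matching_blobs(blob_service, 'upload', suffix='.txt')
    # Do stuff with the {name: text} mapping here.

    # Execute Python Script expects a sequence of pandas.DataFrame back.
    return dataframe1,
```

From there, each text could be parsed (for example with pandas.read_csv on an in-memory buffer) and the resulting frames combined with pandas.concat before being returned.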

For more information on the limitations, why HTTP is used, and other notes, see "Accessing Azure Blob Storage from within an Azure ML experiment".
