
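Goal: Generate a down-sampled FileDataset by randomly sampling from a larger FileDataset, for use in a data labelling project.


Details: I have a large FileDataset containing millions of images. Each filename contains details about the "section" it was taken from, and a section may contain thousands of images. I want to randomly select a specific number of sections, together with all the images associated with those sections, and then register the sample as a new dataset.

Please note that the code below is not a direct copy and paste: elements such as file paths and variables have been renamed for confidentiality reasons.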

import random

import azureml.core
from azureml.core import Dataset, Datastore, Workspace

# Load in work space from saved config file
ws = Workspace.from_config()

# Define full dataset of interest and retrieve it
dataset_name = 'complete_2017'
data = Dataset.get_by_name(ws, dataset_name)

# Extract file references from dataset as relative paths
rel_filepaths = data.to_path()

# Stitch back in base directory path to get a list of absolute paths
src_folder = '/raw-data/2017'
abs_filepaths = [src_folder + path for path in rel_filepaths]

# Define regular expression pattern for extracting source section
import re
pattern = re.compile(r'/(S.+)_image\d+\.jpg')

# Create new list of all unique source sections
sections = sorted(set([m.group(1) for m in map(pattern.match, rel_filepaths) if m]))

# Randomly select sections
num_sections = 5
set_seed = 221020
random.seed(set_seed)   # for repeatability
# random.sample draws without replacement, so the selected sections are distinct
sample_sections = random.sample(sections, k=num_sections)

# Extract images related to the selected sections
matching_images = [filename for filename in abs_filepaths if any(section in filename for section in sample_sections)]

# Define datastore of interest
datastore = Datastore.get(ws, 'ml-datastore')

# Convert string paths to Azure Datapath objects and relate back to datastore
from azureml.data.datapath import DataPath
datastore_path = [DataPath(datastore, filepath) for filepath in matching_images]

# Generate new dataset using from_files() and filtered list of paths
sample = Dataset.File.from_files(datastore_path)

sample_name = 'random-section-sample'
sample_dataset = sample.register(workspace = ws, name = sample_name, description = 'Sampled sections from full dataset using set seed.')
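The section-selection steps above can be sketched without any Azure dependency, which makes them easy to check locally. This is a minimal sketch, not the original code: the file-name layout (`/<section>_image<number>.jpg`, with section names starting with `S`), the helper names `group_by_section` and `sample_section_paths`, and the sample paths are all assumptions made for illustration. It groups paths by their extracted section name rather than testing `section in filename`, since a substring test would also match `S1` inside `S10`.

```python
import random
import re
from collections import defaultdict

# Assumed (hypothetical) layout: '/<section>_image<number>.jpg', section starts with 'S'.
SECTION_PATTERN = re.compile(r'/(S[^/]+)_image\d+\.jpg$')


def group_by_section(paths):
    """Map each section name to the list of paths that belong to it."""
    groups = defaultdict(list)
    for path in paths:
        match = SECTION_PATTERN.search(path)
        if match:
            groups[match.group(1)].append(path)
    return groups


def sample_section_paths(paths, num_sections, seed):
    """Pick num_sections distinct sections and return them with all their paths."""
    groups = group_by_section(paths)
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    # sample() draws without replacement; it raises ValueError if num_sections
    # exceeds the number of available sections.
    chosen = rng.sample(sorted(groups), k=num_sections)
    return chosen, [path for section in chosen for path in groups[section]]


# Hypothetical paths, for illustration only.
example_paths = [
    '/S1_image001.jpg',
    '/S1_image002.jpg',
    '/S10_image001.jpg',
    '/S2_image001.jpg',
]
chosen_sections, matching = sample_section_paths(example_paths, num_sections=2, seed=221020)
```

The returned list of paths could then be prefixed with the source folder and wrapped in `DataPath` objects as in the code above.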

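Problem: The code I've written in the Python SDK runs and the new FileDataset registers, but when I try to view the dataset details or use it in a data labelling project, I get the following error, even as Owner.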

Access denied: Failed to authenticate data access with Workspace system assigned identity. Make sure to add the identity as Reader of the data service.

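Additionally, under the Details tab, "Files in dataset" shows "Unknown" and "Total size of files in dataset" shows "Unavailable".

I haven't come across this issue anywhere else. I'm able to generate datasets in other ways, so I suspect the problem lies in the code, given that I'm handling the data in an unconventional way.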


Additional notes

  • Azure ML version is 1.15.0

2 Answers

Is the data behind a virtual network, by any chance?

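answered 2020-10-27T22:39:04.113

A colleague of mine discovered that the managed identities were blocking the preview functionality. Once this aspect of the identities was modified, we could examine the data and use it in a data labelling project.

answered 2020-10-28T20:31:26.963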