azure-data-factory - Check whether all files are available in storage - Azure ADF

Question

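In Azure Data Factory, how do I check whether an array of strings (file names) contains a value?

I get the file names from a Get Metadata activity, and before continuing I need to check that all 4 file names I expect are available in the storage account.

I expect 4 files in the storage account and need to check that all 4 are available. I need to check the file names explicitly rather than the file count; this is a requirement.

When I try to validate them using childItems from Get Metadata, I get the error "array elements can only be selected using an integer index." The problem is that a file may appear at any index on the next load.

Is there a better way to validate the file names?

Thanks in advance for your help.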

score 0 · Accepted Answer

My Get Metadata output looks like this:

 "childItems": [
    {
        "name": "1.py",
        "type": "File"
    },
    {
        "name": "SalesData.numbers",
        "type": "File"
    },
    {
        "name": "file1.txt",
        "type": "File"
    }

]

I use the following expression in a Set Variable activity to check the file names:

@if(
contains(activity('Get Metadata1').output.childitems,
json(concat('{"name":"file1.txt"',',','"type":"File"}'))), 

if(
contains(activity('Get Metadata1').output.childitems,
json(concat('{"name":"file2.txt"',',','"type":"File"}'))),

if(
contains(activity('Get Metadata1').output.childitems,
json(concat('{"name":"2.py"',',','"type":"File"}'))),'yes','no')
,'no')
,'no')

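This checks whether my blob container has file1.txt, file2.txt and 2.py.

If it does, 'yes' is assigned to the variable; otherwise 'no'.

You can also use an If Condition activity.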
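For readers who find the nested expression hard to follow, the same logic can be sketched in Python. This is only an illustration of what the expression evaluates and does not run inside ADF; the helper name `all_present` is hypothetical, and the sample data is copied from the output shown earlier in this answer.

```python
# Illustrative only: mirrors the nested contains() checks in plain Python.
# `all_present` is a hypothetical helper name, not part of ADF.

def all_present(child_items, required_names):
    """Return 'yes' if every required name appears as a File entry, else 'no'."""
    for name in required_names:
        # The expression passes a whole {"name": ..., "type": "File"} object
        # to contains(), so the same object is built and looked up here.
        if {"name": name, "type": "File"} not in child_items:
            return "no"
    return "yes"


# Sample childItems taken from the Get Metadata output in this answer.
child_items = [
    {"name": "1.py", "type": "File"},
    {"name": "SalesData.numbers", "type": "File"},
    {"name": "file1.txt", "type": "File"},
]

result = all_present(child_items, ["file1.txt", "file2.txt", "2.py"])
print(result)
```

With this sample folder the check fails as soon as file2.txt is looked up, which corresponds to the expression falling through to one of its 'no' branches.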

score 0 · Accepted Answer

Can you try this (Python)?

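import fnmatch
import os

# Folder to search and the file name pattern to match
rootPath = '/'
pattern = '*.mp3'

# Walk the tree and print the full path of every matching file
for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        print(os.path.join(root, filename))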
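If the files are reachable as a local or mounted folder (an assumption, since the question is about a storage account), a variant that checks explicit names rather than a glob pattern might look like the sketch below. The helper name `missing_files` and the temporary demo folder are hypothetical.

```python
import os
import tempfile


def missing_files(folder, required_names):
    """Return the required names that are not files directly inside `folder`."""
    present = {
        entry for entry in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, entry))
    }
    return [name for name in required_names if name not in present]


# Demo against a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    for name in ("file1.txt", "file2.txt"):
        open(os.path.join(folder, name), "w").close()
    missing = missing_files(folder, ["file1.txt", "file2.txt", "2.py"])

print(missing)
```

An empty result would mean every required file is present; returning the missing names rather than a bare boolean makes a failure easier to diagnose.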

score 0 · Accepted Answer

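You can use arrays to check whether multiple files exist, but it is a bit fiddly. I often hand this off to another activity in the pipeline, such as a Stored Procedure or Notebook activity, depending on which compute you have available in your pipeline (e.g. a SQL database or a Spark cluster). However, if you really do need to do it within the pipeline, this may work for you.

First, I have an array parameter with the following values:

Parameter name	Parameter type	Parameter value
pFilesToCheck	Array	["json1.json","json2.json","json3.json","json4.json"]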

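These are the files that must exist. Next, I have a Get Metadata activity pointing at a data lake folder, with the Child Items argument set in the Field list.

This returns output in the following format, listing all files in the given directory along with some additional information about the execution: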

{
    "childItems": [
        {
            "name": "json1.json",
            "type": "File"
        },
        {
            "name": "json2.json",
            "type": "File"
        },
        {
            "name": "json3.json",
            "type": "File"
        },
        {
            "name": "json4.json",
            "type": "File"
        }
    ],
    "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Some Region)",
    "executionDuration": 0,
    "durationInQueue": {
        "integrationRuntimeQueue": 1
    },
    "billingReference": {
        "activityType": "PipelineActivity",
        "billableDuration": [
            {
                "meterType": "AzureIR",
                "duration": 0.016666666666666666,
                "unit": "Hours"
            }
        ]
    }
}

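To compare the input array pFilesToCheck (the files that must exist) with the result of the Get Metadata activity (the files that actually exist), we have to put them in a comparable format. I use an Array variable to do this:

Variable name	Variable type
arrFilenames	Array

Next is a For Each activity running in Sequential mode, which uses the range function to loop from 0 to 3, i.e. the array index of each item in the array. The expression counts the items in the childItems output of Get Metadata, with indexes starting at 0. The Items property is set to the following expression: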

@range(0,length(activity('Get Metadata File List').output.childItems))

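Inside the For Each activity is an Append Variable activity, which appends the current item of the for each loop to the array variable arrFilenames. It uses this expression in the Value property: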

@activity('Get Metadata File List').output.childItems[item()].name

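In this case, '@item()' will be a number between 0 and 3 generated by the range function above. Once the loop completes, the array arrFilenames looks like this (i.e. the same format as the input array):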

["json1.json","json2.json","json3.json","json4.json"]
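The For Each and Append Variable steps amount to a simple index loop. A minimal Python sketch of the same idea, using the sample output from this answer (the variable names are illustrative and nothing here runs inside ADF):

```python
# Illustrative only: the For Each / Append Variable steps in plain Python.
child_items = [
    {"name": "json1.json", "type": "File"},
    {"name": "json2.json", "type": "File"},
    {"name": "json3.json", "type": "File"},
    {"name": "json4.json", "type": "File"},
]

arr_filenames = []
# Counterpart of @range(0, length(childItems)): indexes 0..3 for four items.
for i in range(0, len(child_items)):
    # Counterpart of childItems[item()].name inside the loop.
    arr_filenames.append(child_items[i]["name"])

print(arr_filenames)
```

Indexing with an integer is what avoids the "array elements can only be selected using an integer index" error mentioned in the question.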

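The input array and the actual file list can now be compared using the intersection function. I use a Set Variable activity with a Boolean variable to record the result: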

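@equals(
length(pipeline().parameters.pFilesToCheck),
length(intersection(variables('arrFilenames'),pipeline().parameters.pFilesToCheck)))

This expression compares the length of the input array of files that should exist with the length of its intersection with the array of files that actually exist. If the numbers match, all the required files exist. If they do not match, at least one required file is missing. The comparison is made against the required list rather than arrFilenames, because comparing against arrFilenames would still pass when a required file is absent and would fail when the folder merely holds extra files.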
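The intersection check can be sketched in Python with sets. Note that this sketch compares the size of the intersection with the size of the required list, so extra files in the folder are ignored and a missing required file is detected; the function name `all_required_exist` is hypothetical.

```python
# Illustrative only: an intersection-based check in plain Python.

def all_required_exist(actual_names, required_names):
    """True when every required name is among the names actually found."""
    required = set(required_names)
    common = set(actual_names) & required  # counterpart of intersection()
    return len(common) == len(required)


required = ["json1.json", "json2.json", "json3.json", "json4.json"]

complete = all_required_exist(
    ["json1.json", "json2.json", "json3.json", "json4.json"], required
)
one_missing = all_required_exist(["json1.json", "json2.json"], required)
with_extras = all_required_exist(required + ["notes.txt"], required)

print(complete, one_missing, with_extras)
```

Using sets also means duplicate names in either list do not skew the counts.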
