palantir-foundry - Can I get the file names of synced text files in a Foundry pipeline?

Question

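I have a bunch of text files synced from my source system, and I want a simple way to use the file names (in addition to the file contents) downstream in a Foundry transform.

I know this is possible with raw file access, but that seems complicated. I just want the file name alongside the data.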

score 1 · Accepted Answer

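If you are going straight into a Code Repository or Code Workbook, you can use the input_file_name() function (see proggeo's answer below). That is probably easier and simpler than the approach below, but it will not work if you are going to do anything else with the data.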

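Schema approach

If you open the dataset and go to Details -> Schema, you can edit the schema to add a file path column. For each row, this column holds the path of the file that the row came from.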

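The key parts are the _filePath member of fieldSchemaList and "addFilePath": true under customMetadata. The first is the special column that the file path is populated into, and the second tells the reader to populate that column. The other column in the example below (content) simply contains everything in each file.

For more details, see the TextDataFrameReader section under Metadata in the Foundry core backend part of the platform documentation. This is also possible for CSV and more structured data, using different Reader classes.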

Full schema example

{
  "fieldSchemaList": [
    {
      "type": "STRING",
      "name": "content",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    },
    {
      "type": "STRING",
      "name": "_filePath",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    }
  ],
  "dataFrameReaderClass": "com.palantir.foundry.spark.input.TextDataFrameReader",
  "customMetadata": {
    "textParserParams": {
      "parser": "SINGLE_COLUMN_PARSER",
      "nullValues": null,
      "nullValuesPerColumn": null,
      "charsetName": "UTF-8",
      "addFilePath": true,
      "addByteOffset": false,
      "addImportedAt": false
    }
  }
}
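
To make the two required changes explicit, here is a minimal Python sketch that applies them to a schema held as a plain dictionary. The helper name with_file_path_column is hypothetical and the code does not call any Foundry API. It only edits the JSON structure shown above, which you would still paste into the schema editor yourself. The field entry it appends mirrors the keys used in the example schema.

```python
import copy
import json

# Keys copied from the field entries in the example schema above.
_FIELD_TEMPLATE = {
    "type": "STRING",
    "name": "_filePath",
    "nullable": None,
    "userDefinedTypeClass": None,
    "customMetadata": {},
    "arraySubtype": None,
    "precision": None,
    "scale": None,
    "mapKeyType": None,
    "mapValueType": None,
    "subSchemas": None,
}


def with_file_path_column(schema: dict) -> dict:
    """Return a copy of a text-reader schema with the file path column enabled.

    Hypothetical helper for illustration only; it does not talk to Foundry.
    """
    updated = copy.deepcopy(schema)

    # Change 1: add the special _filePath column if it is not already present.
    fields = updated.setdefault("fieldSchemaList", [])
    if not any(field.get("name") == "_filePath" for field in fields):
        fields.append(copy.deepcopy(_FIELD_TEMPLATE))

    # Change 2: tell the text reader to populate that column.
    params = updated.setdefault("customMetadata", {}).setdefault("textParserParams", {})
    params["addFilePath"] = True
    return updated


# A cut-down starting schema (assumed for the example) with only the content column.
original = {
    "fieldSchemaList": [{"type": "STRING", "name": "content", "nullable": None}],
    "dataFrameReaderClass": "com.palantir.foundry.spark.input.TextDataFrameReader",
    "customMetadata": {"textParserParams": {"parser": "SINGLE_COLUMN_PARSER"}},
}
updated = with_file_path_column(original)
print(json.dumps(updated, indent=2))
```

The helper returns a copy rather than mutating its input, so the original schema stays available if you need to revert.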

score 1 · Accepted Answer

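ollie299792458's answer only works when the dataFrameReaderClass is com.palantir.foundry.spark.input.TextDataFrameReader.

Alternatively, you can get the file name with the Spark input_file_name function when reading the dataset in a Code Repository or Code Workbook: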

Creates a string column for the file name of the current Spark task.
