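Scenario: I read two XML files with Spark, supplying an explicit schema at load time.

In the schema, one of the tags is mandatory (nullable = false). One of the XML files is missing that mandatory tag.

I want to filter out the XML that is missing the mandatory tag, so I do the following: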

dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());

When I then count the rows of the dataset in my code, I get 2 (the 2 input XMLs), but when I try to print the dataset with show(), I get an NPE (NullPointerException).

When I debug the line above and evaluate the following expression, I get 0 as the count:

dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull()).count();

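Questions:

Can anyone answer the questions below or confirm my understanding?

  1. Why does the Spark Dataset not filter out the row that lacks the mandatory column?
  2. Why is there no exception in count() but there is one in show()?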

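For 2, I believe count() just counts the rows without looking at their contents, whereas show() iterates over the struct fields to print their values and fails when it does not find the mandatory column.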
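To make that hypothesis concrete, here is a toy model in plain Java. It is only a sketch of the idea (counting never touches field values, while printing dereferences them). The names `CountVsShowSketch` and `ToyRow` are made up for illustration, and none of this is taken from Spark's internals.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: these classes are hypothetical and are NOT Spark internals.
class CountVsShowSketch {

    // A "row" whose single field may be missing (null) even though
    // the declared schema promises that it is always present.
    static final class ToyRow {
        final String mandatory; // declared non-nullable, but may still be null at runtime
        ToyRow(String mandatory) { this.mandatory = mandatory; }
    }

    // Counting only walks the list; it never dereferences a field.
    static long count(List<ToyRow> rows) {
        long n = 0;
        for (ToyRow ignored : rows) {
            n++;
        }
        return n;
    }

    // Rendering trusts the schema and dereferences the field without a null check.
    static String show(List<ToyRow> rows) {
        StringBuilder sb = new StringBuilder();
        for (ToyRow r : rows) {
            sb.append(r.mandatory.trim()).append('\n'); // NPE if the field is missing
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<ToyRow> rows = Arrays.asList(new ToyRow("present"), new ToyRow(null));
        System.out.println(count(rows)); // prints 2: no field access, so no exception
        try {
            show(rows);
        } catch (NullPointerException e) {
            System.out.println("show failed with NullPointerException");
        }
    }
}
```

If Spark behaves along these lines, a count of 2 followed by an NPE in show() would be consistent with what I observe, but I have not verified this against the Spark source.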

PS: If I make the mandatory column optional (nullable = true), everything works fine.
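One way the nullable flag could change the filter result is sketched below. This is an assumption, not confirmed Spark behavior: a planner that trusts the declared schema may treat "IS NOT NULL" on a non-nullable column as always true and never inspect the data. `NullableFilterSketch` and its methods are hypothetical names in plain Java.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy model only: hypothetical names, not Spark's optimizer.
class NullableFilterSketch {

    // Builds the predicate for "column IS NOT NULL". If the schema declares the
    // column non-nullable, the check is assumed redundant and becomes "always true".
    static Predicate<String> isNotNull(boolean declaredNullable) {
        if (!declaredNullable) {
            return value -> true;      // schema is trusted; the data is never inspected
        }
        return value -> value != null; // real check only for nullable columns
    }

    static List<String> filter(List<String> column, boolean declaredNullable) {
        return column.stream()
                .filter(isNotNull(declaredNullable))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("present", null);
        System.out.println(filter(column, false).size()); // prints 2: the null row survives
        System.out.println(filter(column, true).size());  // prints 1: the null row is dropped
    }
}
```

In this model the filter only does real work once the column is declared nullable, which would match the PS above. Whether Spark actually applies such a simplification in my case is exactly what I would like confirmed.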

Edit:

Read options, as requested.

To load the data, I am doing the following:

Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
                .option("header", "true")
                .option("inferSchema", "false")
                .option("rowTag", rowTag)//rowTag is "body" tag in the XML
                .option("failFast", "true")
                .option("mode", "FAILFAST")
                .schema(schema)
                .load(XMLfilePath);

Sample, as requested.

Schema:

root
 |-- old: struct (nullable = true)
 |    |-- _beyond: string (nullable = true)
 |    |-- lot: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _chose: string (nullable = true)
 |    |-- real: struct (nullable = true)
 |    |    |-- _eat: string (nullable = true)
 |    |    |-- kill: struct (nullable = true)
 |    |    |    |-- _VALUE: double (nullable = true)
 |    |    |    |-- _top: string (nullable = true)
 |    |    |-- tool: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _affect: string (nullable = true)
 |-- porch: struct (nullable = true)
 |    |-- _account: string (nullable = true)
 |    |-- cast: string (nullable = true)
 |    |-- vegetable: struct (nullable = true)
 |    |    |-- leg: struct (nullable = true)
 |    |    |    |-- _VALUE: double (nullable = true)
 |    |    |    |-- _nose: string (nullable = true)
 |    |    |-- now: struct (nullable = true)
 |    |    |    |-- _gravity: string (nullable = true)
 |    |    |    |-- chief: struct (nullable = true)
 |    |    |    |    |-- _VALUE: long (nullable = true)
 |    |    |    |    |-- _further: string (nullable = true)
 |    |    |    |-- field: string (nullable = true)

Sample XML:

<?xml version="1.0" encoding="UTF-8" ?>
<root>
    <body>
    <porch account="something">
        <vegetable>
            <now gravity="wide">
                <field>box</field>
                <chief further="satisfied">-1889614487</chief>
            </now>
            <leg nose="angle">912658017.229279</leg>
        </vegetable>
        <cast>clear</cast>
    </porch>
    <old beyond="continent">
        <real eat="term">
            <kill top="plates">-1623084908.8669372</kill>
            <tool affect="pond">today</tool>
        </real>
        <lot chose="swung">promised</lot>
    </old>
    </body>
</root>

Schema in JSON format:

{"type":"struct","fields":[{"name":"old","type":{"type":"struct","fields":[{"name":"_beyond","type":"string","nullable":true,"metadata":{}},{"name":"lot","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_chose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"real","type":{"type":"struct","fields":[{"name":"_eat","type":"string","nullable":true,"metadata":{}},{"name":"kill","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_top","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"tool","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_affect","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"porch","type":{"type":"struct","fields":[{"name":"_account","type":"string","nullable":true,"metadata":{}},{"name":"cast","type":"string","nullable":true,"metadata":{}},{"name":"vegetable","type":{"type":"struct","fields":[{"name":"leg","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_nose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"now","type":{"type":"struct","fields":[{"name":"_gravity","type":"string","nullable":true,"metadata":{}},{"name":"chief","type":{"type":"struct","fields":[{"name":"_VALUE","type":"long","nullable":true,"metadata":{}},{"name":"_further","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"field","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}

My scenario can be reproduced by setting the element "old" to nullable = false in the schema and removing the <old> tag from the XML.
