c# - Handling files with different columns in USQL

Question

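I have a USQL script and a CSV extractor to load my files. However, in some months the files may contain 4 columns and in other months they may contain 5.

If I set up my extractor with a column list of either 4 or 5 fields, I get an error about the expected width of the file, telling me to go check delimiters and so on. No surprise.

Given that USQL is still in its infancy and is missing some basic error handling, what is the workaround for this problem?

I have tried using the silent clause in the extractor to ignore the wider rows, which is handy for the 4-column case. I then wanted to use an IF condition on the row count of that rowset and fall back to an extractor with 5 columns. However, this fails because a rowset variable cannot be used as a scalar variable in the IF expression.

I also tried a C#-style count and sizeof(@AttemptExtractWith4Cols). Neither works.

A code snippet to give you an idea of the approach I am taking: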

DECLARE @SomeFilePath string = @"/MonthlyFile.csv";

@AttemptExtractWith4Cols =
    EXTRACT Col1 string,
            Col2 string,
            Col3 string,
            Col4 string
    FROM @SomeFilePath
    USING Extractors.Csv(silent : true); //can't be good.

//can't assign rowset to scalar variable!
DECLARE @RowSetCount int = (SELECT COUNT(*) FROM @AttemptExtractWith4Cols);

//tells me @AttemptExtractWith4Cols doesn't exist in the current context!
DECLARE @RowSetCount int = @AttemptExtractWith4Cols.Count();

IF (@RowSetCount == 0) THEN
    @AttemptExtractWith5Cols =
        EXTRACT Col1 string,
                Col2 string,
                Col3 string,
                Col4 string,
                Col5 string
        FROM @SomeFilePath
        USING Extractors.Csv(); //not silent
END;


//etc

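Of course, this would be a lot easier if USQL had something like a TRY CATCH block.

Is this even a reasonable approach?

Any input would be greatly appreciated.

Thank you for your time.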

score 4 · Accepted Answer

U-SQL now supports OUTER UNION, so you can handle it like this:

// Scenario 1; file has 4 columns
DECLARE @file1 string = @"/input/file1.csv";

// Scenario 2; file has 5 columns
//DECLARE @file1 string = @"/input/file2.csv";


@file =
    EXTRACT col1 string,
            col2 string,
            col3 string,
            col4 string
    FROM @file1
    USING Extractors.Csv(silent : true)

    OUTER UNION ALL BY NAME ON (col1, col2, col3, col4)

    EXTRACT col1 string,
            col2 string,
            col3 string,
            col4 string,
            col5 string
    FROM @file1
    USING Extractors.Csv(silent : true);


@output =
    SELECT *
    FROM @file;


OUTPUT @output
    TO "/output/output.csv"
USING Outputters.Csv();

In my example, file1 has 4 columns and file2 has 5 columns. The script runs successfully in either case.

Hope that makes sense.

score 3 · Accepted Answer

The OUTER UNION is a great solution. Alternatively, you can also write your own generic extractor if you expect your rows in a file to be different. See this blog post for an example.

score 1 · Accepted Answer

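Here is another solution I have found useful. You can read the file as a single text column (using "\t" as the delimiter, since the data contains no tab characters) and then split it dynamically with C# string functions. I have tested this on a similar problem. The advantage of this approach is that the same method works for any number of columns.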

SELECT
      (String)(ColList[0])   AS ColA
    , (String)(ColList[1])   AS ColB
    , (String)(ColList[2])   AS ColC
    , (String)(ColList[3])   AS ColD
    , (NumColumns >= 5 ? (String)(ColList[4]) : (String)null)   // null when the row has no 5th field
                             AS ColE
FROM (
    SELECT ColList
         , ColList.Count AS NumColumns
    FROM (
        SELECT SqlArray.Create(RowText.Split(','))   AS ColList
        FROM (
            EXTRACT RowText string
            FROM @SomeFilePath
            USING Extractors.Text(delimiter: '\t', quoting: false)
        ) AS [T1]
    ) AS [T2]
) AS [T3]

Caveat: this solution is not aware of text quoting. Any comma inside a field value will break this logic.
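One possible way around the quoting caveat is to replace the plain Split(',') with a small C# helper that respects double quotes and pads short rows. The sketch below is only an illustration, not part of the original answer: the class name RowSplitter and the method SplitAndPad are hypothetical, and it assumes the common CSV convention where fields may be wrapped in double quotes and a doubled quote ("") inside a quoted field stands for one literal quote. It also assumes a row never spans more than one line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not from the original answer).
public static class RowSplitter
{
    // Splits one CSV line into exactly `width` fields.
    // Missing trailing fields come back as null; extra fields are dropped.
    // Commas inside double-quoted fields are preserved.
    public static string[] SplitAndPad(string line, int width)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // "" inside quotes is one literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());  // last field has no trailing comma

        var result = new string[width];
        for (int i = 0; i < width; i++)
        {
            result[i] = i < fields.Count ? fields[i] : null;
        }
        return result;
    }
}
```

For example, RowSplitter.SplitAndPad("a,b,c,d", 5) yields the four values followed by null, and RowSplitter.SplitAndPad("a,\"b,c\",d", 4) keeps "b,c" together as one field. If the helper were placed in the script's code-behind, it could presumably be wrapped as SqlArray.Create(RowSplitter.SplitAndPad(RowText, 5)) in place of the Split call; that wiring is an assumption and has not been tested against U-SQL.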
