I have two CSV files, each with n columns. I have to merge these two CSV files into a single CSV file, joining on one unique column that appears in both input files.

I have gone through all the blogs and sites I could find. All of them point to using a custom .NET activity, so I went through this site.

But I still can't figure out the C# coding part. Can anyone share code showing how to merge these two CSV files using a custom .NET activity in Azure Data Factory?

1 Answer

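Here is an example of how to join the two tab-delimited files on the ZIP_Code column using U-SQL. This example assumes both files are held in Azure Data Lake Storage (ADLS). For comma-separated input, Extractors.Csv() can be used in place of Extractors.Tsv(). The script could easily be incorporated into a Data Factory pipeline: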

// Get raw input from file A
@inputA =
    EXTRACT 
        Date_received   string,
        Product string,
        Sub_product string,
        Issue   string,
        Sub_issue   string,
        Consumer_complaint_narrative    string,
        Company_public_response string,
        Company string,
        State   string,
        ZIP_Code    string,
        Tags    string,
        Consumer_consent_provided   string,
        Submitted_via   string,
        Date_sent_to_company    string,
        Company_response_to_consumer    string,
        Timely_response string,
        Consumer_disputed   string,
        Complaint_ID    string

    FROM "/input/input48A.txt"
    USING Extractors.Tsv();


// Get raw input from file B
@inputB =
    EXTRACT Provider_ID string,
            Hospital_Name string,
            Address string,
            City string,
            State string,
            ZIP_Code string,
            County_Name string,
            Phone_Number string,
            Hospital_Type string,
            Hospital_Ownership string,
            Emergency_Services string,
            Meets_criteria_for_meaningful_use_of_EHRs string,
            Hospital_overall_rating string,
            Hospital_overall_rating_footnote string,
            Mortality_national_comparison string,
            Mortality_national_comparison_footnote string,
            Safety_of_care_national_comparison string,
            Safety_of_care_national_comparison_footnote string,
            Readmission_national_comparison string,
            Readmission_national_comparison_footnote string,
            Patient_experience_national_comparison string,
            Patient_experience_national_comparison_footnote string,
            Effectiveness_of_care_national_comparison string,
            Effectiveness_of_care_national_comparison_footnote string,
            Timeliness_of_care_national_comparison string,
            Timeliness_of_care_national_comparison_footnote string,
            Efficient_use_of_medical_imaging_national_comparison string,
            Efficient_use_of_medical_imaging_national_comparison_footnote string,
            Location string

    FROM "/input/input48B.txt"
    USING Extractors.Tsv();


// Join the two files on the ZIP_Code column (inner join, so only rows with a match in both files are kept)
@output =
    SELECT b.Provider_ID,
           b.Hospital_Name,
           b.Address,
           b.City,
           b.State,
           b.ZIP_Code,
           a.Complaint_ID

    FROM @inputA AS a
         INNER JOIN
             @inputB AS b
         ON a.ZIP_Code == b.ZIP_Code
    WHERE a.ZIP_Code == "36033";


// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);

This could also be converted to a U-SQL stored procedure with parameters for the file names and zip code.
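
As a rough sketch of what that parameterized version might look like, the script above can be wrapped in a U-SQL procedure. The procedure name dbo.JoinFilesByZip and the parameter names @fileA, @fileB, @outputFile and @zipCode are made up for illustration, and the column lists are simply copied from the script above; treat this as a starting point under those assumptions rather than a tested implementation:

// Hypothetical procedure name and parameter names, for illustration only
DROP PROCEDURE IF EXISTS dbo.JoinFilesByZip;

CREATE PROCEDURE dbo.JoinFilesByZip(@fileA string, @fileB string, @outputFile string, @zipCode string)
AS
BEGIN
    // Get raw input from file A (same schema as in the script above)
    @inputA =
        EXTRACT Date_received string, Product string, Sub_product string, Issue string,
                Sub_issue string, Consumer_complaint_narrative string,
                Company_public_response string, Company string, State string,
                ZIP_Code string, Tags string, Consumer_consent_provided string,
                Submitted_via string, Date_sent_to_company string,
                Company_response_to_consumer string, Timely_response string,
                Consumer_disputed string, Complaint_ID string
        FROM @fileA
        USING Extractors.Tsv();

    // Get raw input from file B (same schema as in the script above)
    @inputB =
        EXTRACT Provider_ID string, Hospital_Name string, Address string, City string,
                State string, ZIP_Code string, County_Name string, Phone_Number string,
                Hospital_Type string, Hospital_Ownership string, Emergency_Services string,
                Meets_criteria_for_meaningful_use_of_EHRs string,
                Hospital_overall_rating string, Hospital_overall_rating_footnote string,
                Mortality_national_comparison string, Mortality_national_comparison_footnote string,
                Safety_of_care_national_comparison string, Safety_of_care_national_comparison_footnote string,
                Readmission_national_comparison string, Readmission_national_comparison_footnote string,
                Patient_experience_national_comparison string, Patient_experience_national_comparison_footnote string,
                Effectiveness_of_care_national_comparison string, Effectiveness_of_care_national_comparison_footnote string,
                Timeliness_of_care_national_comparison string, Timeliness_of_care_national_comparison_footnote string,
                Efficient_use_of_medical_imaging_national_comparison string,
                Efficient_use_of_medical_imaging_national_comparison_footnote string,
                Location string
        FROM @fileB
        USING Extractors.Tsv();

    // Join on ZIP_Code, filtering by the zip code passed in as a parameter
    @output =
        SELECT b.Provider_ID,
               b.Hospital_Name,
               b.Address,
               b.City,
               b.State,
               b.ZIP_Code,
               a.Complaint_ID
        FROM @inputA AS a
             INNER JOIN
                 @inputB AS b
             ON a.ZIP_Code == b.ZIP_Code
        WHERE a.ZIP_Code == @zipCode;

    // Write the joined rows to the path passed in as a parameter
    OUTPUT @output
    TO @outputFile
    USING Outputters.Tsv(quoting : false);
END;

// Example call, reusing the paths and zip code from the script above
dbo.JoinFilesByZip("/input/input48A.txt", "/input/input48B.txt", "/output/output.txt", "36033");

With the procedure registered in the U-SQL catalog, a pipeline would only need to submit the one-line call with different argument values instead of the full script.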

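There are of course several ways to achieve this, each with its own pros and cons. For example, a .NET custom activity might feel more comfortable to someone with a .NET background, but you will need some compute to run it on. Importing the files into an Azure SQL Database would be a good option for someone with a SQL / database background and an Azure SQL DB in their subscription.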

answered 2017-01-29T20:18:45.420