sql - Efficiently joining/merging based on matching part of a string

Question

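I am trying to join two tables based on whether a string from the first table is contained in part of a long string in the second table. I am using PROC SQL in SAS, but a data step would also work instead of an SQL query.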

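This code works fine on smaller datasets, but it quickly gets bogged down because it has to make a huge number of comparisons. A simple equality check would be fine, but having to use the index() function makes it slow.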

proc sql noprint;
  create table matched as
  select A.*, B.* 
  from  search_notes as B,
        names as A
  where index(B.notes,A.first) or 
        index(B.notes,A.last)
  order by A.name, B.id;
quit;

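B.notes is a 2000-character block of text (sometimes completely filled), and I am looking for any row that contains either the first name or the last name from A.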

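I don't think running it in two steps would give any speed advantage, since it already has to compare every row of A with every row of B (so checking both the first and the last name is not the bottleneck).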

When I run it, I get NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. in my log. With A = 4,000 observations and B = 100,000 observations (400 million row pairs), it takes 30 minutes to produce about 1,000 matches.

Is there any way to optimize this?

score 0 · Accepted Answer

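This is a partial answer that makes it run 4-5 times faster, but it is not ideal (it helps in my case, but it won't necessarily work in the general case of optimizing a Cartesian product join).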

I originally had 4 separate index() calls, as in my example (my simplified example has 2, for A.first and A.last).

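I was able to refactor all 4 index() calls (plus a 5th I was going to add) into a regular expression that solves the same problem. It doesn't return the same result set, but I think it actually returns better results than the 5 separate index() calls, because you can specify word boundaries.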

In the data step where I clean the names for matching, I create the following pattern:

pattern = cats('/\b(',substr(upcase(first_name),1,1),'|',upcase(first_name),').?\s?',upcase(last_name),'\b/');

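This should create a regular expression of the form /\b(F|FIRST).?\s?LAST\b/, which will match F.Last, First Last, flast@email.com and so on (there are some combinations it won't pick up, but I only care about the combinations I observed in my data). Using '\b' also prevents matches where FLAST merely coincides with the start or end of a longer word (for example "Edward Lo" matching "Eloquent"), which I found hard to avoid with index().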
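As a quick check of the word-boundary behaviour described above, here is a minimal sketch that builds the same pattern for one made-up name and tries it on a few strings. The name, the test strings and the variable names rx, text and hit are illustrative only. Note that prxmatch() is case-sensitive and the pattern is upper-case, so the sample text is upper-cased here; whether the real notes were upper-cased too (or an /i flag was used) is an assumption, since the answer doesn't say.

```sas
data _null_;
   length first_name last_name $20 pattern $100 text $60;
   first_name = 'Edward';   /* hypothetical name, for illustration only */
   last_name  = 'Lo';

   /* Same expression as in the answer: /\b(E|EDWARD).?\s?LO\b/ */
   pattern = cats('/\b(', substr(upcase(first_name),1,1), '|',
                  upcase(first_name), ').?\s?', upcase(last_name), '\b/');
   put pattern=;

   /* Compile once, then reuse the compiled id for every test string */
   rx = prxparse(strip(pattern));

   /* The first three strings should match; ELOQUENT should not,
      because there is no word boundary after "ELO" */
   do text = 'E.LO', 'EDWARD LO', 'ELO@EMAIL.COM', 'ELOQUENT';
      hit = (prxmatch(rx, text) > 0);
      put text= hit=;
   end;

   call prxfree(rx);   /* release the compiled pattern */
run;
```

A plain index(text, 'LO') would report a hit for ELOQUENT as well, which is the false positive the '\b' anchors are meant to avoid.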

I then perform my SQL join like this:

proc sql noprint;
create table matched as
  select  B.*, 
          prxparse(B.pattern) as prxm, 
          A.* 
  from  search_text as A,
        search_names as B
  where prxmatch(calculated prxm,A.notes)
  order by A.id;
quit;
run;

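Being able to compile the regular expression once for each name in B and then run it against each piece of text in A seems to be much faster than several index() calls (I'm not sure how a regex compares with a single index() call).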

Running it with A = 250,000 obs and B = 4,000 obs, the index() method took about 90 minutes of CPU time, while doing the same thing with prxmatch() took only 20 minutes of CPU time.
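If you want to be explicit about compiling each pattern only once, one possible variation (not what I ran, and not timed) is a data step that parses every pattern into a temporary array on the first iteration and then reuses the compiled ids for each row of text. This is a sketch under assumptions: it reuses the table and column names from the query above (search_text.notes, search_names.pattern), the array bound of 4000 is a hard-coded guess at the number of names, and the names rx, i and n_names are made up for the example.

```sas
data matched;
   /* One compiled regex id per name; 4000 is an assumed upper bound */
   array rx[4000] _temporary_;

   /* First iteration only: compile every pattern exactly once */
   if _n_ = 1 then do i = 1 to n_names;
      set search_names(keep=pattern) point=i nobs=n_names;
      rx[i] = prxparse(strip(pattern));
   end;

   /* Read the next block of text (the step ends when this runs out) */
   set search_text;

   /* Test the text against every precompiled pattern */
   do i = 1 to n_names;
      if prxmatch(rx[i], notes) then do;
         /* On a hit, pull in the matching name's columns and write a row */
         set search_names point=i;
         output;
      end;
   end;
run;
```

This is still a full cross comparison, so it does not remove the Cartesian cost; it only pins down where the compilation happens. If the two tables share column names, the values from search_names overwrite those from search_text on output.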

score 0 · Accepted Answer

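A Cartesian product may well be the best fit for your data, but here is something to try. What I am doing is using CALL EXECUTE() in a data step to write the matching logic into a generated data step. This means you only pass through each table once. However, the generated data step will contain 4,000 IF/THEN blocks. Doing this cut the run time on my sample data from 55 seconds to 40 seconds. If that ratio holds, it would bring your 30 minutes down to roughly 22 minutes (30 × 40/55).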

I would leave this question open. Maybe someone can come up with a better way.

%let n=50;
data B;
format notes $&n..;
choose = "ABCDEFGHIJLKMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
do j=1 to 9000000;
    notes = "";
    do i=1 to floor(5 + ranuni(123)*(&n-5));
        r = floor(ranuni(123)*62+1);
        notes = catt(notes,substr(choose,r,1));

    end;
    output;
    drop r choose i;
end;
run;

data a;
choose = "ABCDEFGHIJLKMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
format first last $2.;
do i=1 to 62 by 2;
    first = strip(substr(choose,i,1));
    first = catt(first,first);
    last =  strip(substr(choose,i+1,1));
    last = catt(last,last);
    output;
end;
drop choose ;
run;

proc sql noprint;
  create table matched as
  select A.*, B.* 
  from  B as B,
        A as A
  where index(B.notes,A.first) or 
        index(B.notes,A.last)
  order by B.notes, a.i;
quit;

options nosource;
data _null_;
set a end=l;
if _n_ = 1 then do;
    call execute("data matched2; set B;");
    call execute("format First Last $2. i best.;");
end;

format outStr $200.;
outStr = "if index(notes,'" || first || "') or index(notes,'" || last || "') then do;";
call execute(outStr);

outStr = "first = '" || first || "';";
call execute(outStr);
outStr = "last = '" || last || "';";
call execute(outStr);
outStr = "i = " || i || ";";
call execute(outStr);
call execute("output; end;");

if l then do;
    call execute("run;");
end;
run;

proc sort data=matched2;
by notes i;
run;
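To make the generator concrete, this is roughly the data step that the CALL EXECUTE() calls above should submit for the sample table a, shown for its first two rows only (i=1 and i=3). It is reconstructed by hand from the strings in the code, not copied from a log, and the exact whitespace will differ (for example, the numeric i is padded when it is concatenated into the string).

```sas
data matched2; set B;
format First Last $2. i best.;
/* block generated from row i=1 of a */
if index(notes,'AA') or index(notes,'BB') then do;
first = 'AA';
last = 'BB';
i = 1;
output; end;
/* block generated from row i=3 of a */
if index(notes,'CC') or index(notes,'DD') then do;
first = 'CC';
last = 'DD';
i = 3;
output; end;
/* ... one such block for each remaining row of a ... */
run;
```

Because the blocks are independent IF statements rather than an ELSE IF chain, a single note is written once for every name it matches, which mirrors what the join returns.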

score 0 · Accepted Answer

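This doesn't sound like a good fit for PROC SQL. If I understand correctly, you want to compare every row of search_notes with every row of names (hence the Cartesian product). A more traditional data step program may be easier to understand and possibly more efficient: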

data matched;
   set search_notes;
   do _i_=1 to nobs;
      set names point=_i_ nobs=nobs;
      if index(notes,first) 
      or index(notes,last) then output;
      end;
   drop _i_;
run;
proc sort data=matched;
   by name id;
run;
